
saving and loading embedding from Chroma #7175

Closed
Lufffya opened this issue Jul 5, 2023 · 10 comments
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder 🤖:question A specific question about the codebase, product, project, or how to use a feature

Comments

Lufffya commented Jul 5, 2023

Issue with current documentation:

```python
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
```

Idea or request for content:

In the code above, I find this part difficult to understand:

```python
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
```

Although db2 and db3 do demonstrate saving and loading Chroma, the two `docs = db.similarity_search(query)` lines have nothing to do with saving or loading; they still search the original db. Is this an error?

@dosubot added the 🤖:docs and 🤖:question labels Jul 5, 2023
rjarun8 commented Jul 5, 2023

I feel the question makes a lot of sense. Would you expect something like this?


```python
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
# Note: The following code is demonstrating how to save the Chroma database to disk.
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()

# load from disk
# Note: The following code is demonstrating how to load the Chroma database from disk.
db3 = Chroma(persist_directory="./chroma_db")

# perform a similarity search on the loaded database
# Note: This is to demonstrate that the loaded database is functioning correctly.
docs = db3.similarity_search(query)
print(docs[0].page_content)
```

Lufffya commented Jul 5, 2023

I tested it; you need to pass an `embedding_function` parameter to Chroma, like this: `Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)`. Then it runs.

chenzhiang669 commented

Yes, I have a similar question: when I load vectors from the db, why do I still need to pass an embedding parameter?

`docSearch = Chroma(persist_directory="D:/vector_store", embedding_function=embeddings)`

I would think the `embedding_function` parameter is unnecessary, but when I run the code, it fails without it. Can anyone explain why?

jenswilms commented

I had the same issue here. Thanks @Lufffya!

But it is very strange that you have to load the embedding model into the Chroma database, rather than pass it with the search query...

ajasingh commented

Yes, I have a similar question: I just want to search the existing indexed docs, so why do I need to pass the `embedding_function`?

Lufffya commented Aug 15, 2023

> I just want to search the existing indexed docs, so why do I need to pass the `embedding_function`?

Because the search input needs to be passed through `embedding_function` to get the query embeddings, I guess.
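That guess matches how vector search works in general: the stored vectors alone are enough to *hold* the index, but every incoming query string must be embedded with the same function before it can be compared against them. A pure-Python sketch of the mechanics, using a toy bag-of-words embedding of my own invention (not Chroma's actual implementation):

```python
import math

# Tiny bag-of-words "embedding" over a fixed vocabulary: a stand-in for a
# real model, just to illustrate the mechanics.
VOCAB = ["apples", "pears", "stock", "market", "news"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The "persisted" store holds texts plus their precomputed vectors; no
# embedding function is needed merely to hold them.
store = [(t, embed(t)) for t in ["apples and pears", "stock market news"]]

def similarity_search(query):
    q = embed(query)  # the query must be embedded too: hence embedding_function
    return max(store, key=lambda item: cosine(q, item[1]))[0]

print(similarity_search("apples"))  # matches "apples and pears"
```

Loading the store from disk restores `store`, but `similarity_search` is unusable without `embed`, which is exactly the role `embedding_function` plays for Chroma.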

baidurja commented Aug 23, 2023

The following line of code:

`db2.persist()`

is missing from the current langchain documentation (https://python.langchain.com/docs/integrations/vectorstores/chroma).

(screenshot of the documentation page)

dosubot bot commented Nov 22, 2023

Hi, @Lufffya! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised was about confusion in the documented code snippet for saving and loading embeddings from Chroma: after saving and loading, the example still searches the original db, and you asked whether this was an error. The confusion was resolved by passing an embedding_function parameter to Chroma, which you tested and confirmed; other users with similar questions confirmed the parameter is necessary. It was also pointed out that the line db2.persist() is missing from the documentation.

Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.

@dosubot added the stale label Nov 22, 2023
@dosubot closed this as not planned (Won't fix, can't repro, duplicate, stale) Nov 29, 2023
@dosubot removed the stale label Nov 29, 2023
Bardo-Konrad commented

When using `vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb)` with `emb = embeddings.ollama.OllamaEmbeddings(model='nomic-embed-text')`, `retriever = vectorstore.as_retriever()`, and

```python
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | local_llm
    | StrOutputParser()
)
```

the model responds that the context is empty.

If, on the other hand, I create the vectorstore using

```python
vectorstore = Chroma.from_documents(
    documents=documents,
    collection_name=collection_name,
    embedding=emb,
    persist_directory=sys.argv[1]+"-db",
)
```

the model gets a context.

How come?

nidhin-krishnakumar commented

> (quoting @Bardo-Konrad's comment above)

I am also facing the same issue! Any idea why it is so?

7 participants