
saving and loading embedding from Chroma #7175

Closed
Lufffya opened this issue Jul 5, 2023 · 10 comments
Labels
🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder 🤖:question A specific question about the codebase, product, project, or how to use a feature

Comments

Lufffya commented Jul 5, 2023

Issue with current documentation:

```python
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
```

Idea or request for content:

In the code above, I find this part difficult to understand:

```python
# save to disk
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()
docs = db.similarity_search(query)

# load from disk
db3 = Chroma(persist_directory="./chroma_db")
docs = db.similarity_search(query)
print(docs[0].page_content)
```

Although db2 and db3 do demonstrate saving and loading Chroma, the two `docs = db.similarity_search(query)` lines have nothing to do with saving or loading; they still search the original db. Is this an error?

@dosubot added the 🤖:docs and 🤖:question labels Jul 5, 2023
rjarun8 commented Jul 5, 2023

I feel the question makes a lot of sense. Would you expect something like this?


```python
# import
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader

# load the document
loader = TextLoader("../../../state_of_the_union.txt")
documents = loader.load()

# split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# load it into Chroma
db = Chroma.from_documents(docs, embedding_function)

# query it
query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

# print results
print(docs[0].page_content)

# save to disk
# Note: The following code is demonstrating how to save the Chroma database to disk.
db2 = Chroma.from_documents(docs, embedding_function, persist_directory="./chroma_db")
db2.persist()

# load from disk
# Note: The following code is demonstrating how to load the Chroma database from disk.
db3 = Chroma(persist_directory="./chroma_db")

# perform a similarity search on the loaded database
# Note: This is to demonstrate that the loaded database is functioning correctly.
docs = db3.similarity_search(query)
print(docs[0].page_content)
```

Lufffya commented Jul 5, 2023

I tested it; you need to pass an `embedding_function` parameter to Chroma, like this: `Chroma(persist_directory="./chroma_db", embedding_function=embedding_function)`. Then it runs.

chenzhiang669 commented

Yes, I have a similar question: when I load vectors from the db, why do I still need to pass an embedding parameter?

`docSearch = Chroma(persist_directory="D:/vector_store", embedding_function=embeddings)`

I would think the `embedding_function` parameter is unnecessary, but when I run the code, it fails without it. Can anyone explain why?

jenswilms commented

I had the same issue here. Thanks @Lufffya!

But it is very strange that you have to load the embedding model into the Chroma database, rather than pass it with the search query...

ajasingh commented

Yes, I have a similar question: I just want to search the existing indexed docs, so why do I need to pass the `embedding_function`?

Lufffya commented Aug 15, 2023

> I just want to search the existing indexed docs, so why do I need to pass the `embedding_function`?

Because the search input needs to be passed through `embedding_function` to get the query embeddings, I guess.
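That guess matches how vector search works in general: the stored vectors alone are enough to *hold* the index, but every incoming query string must be embedded with the same function before it can be compared against them. A pure-Python sketch of the mechanics, using a toy bag-of-words embedding of my own invention (not Chroma's actual implementation):

```python
import math

# Tiny bag-of-words "embedding" over a fixed vocabulary: a stand-in for a
# real model, just to illustrate the mechanics.
VOCAB = ["apples", "pears", "stock", "market", "news"]

def embed(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The "persisted" store holds texts plus their precomputed vectors; no
# embedding function is needed merely to hold them.
store = [(t, embed(t)) for t in ["apples and pears", "stock market news"]]

def similarity_search(query):
    q = embed(query)  # the query must be embedded too: hence embedding_function
    return max(store, key=lambda item: cosine(q, item[1]))[0]

print(similarity_search("apples"))  # matches "apples and pears"
```

Loading the store from disk restores `store`, but `similarity_search` is unusable without `embed`, which is exactly the role `embedding_function` plays for Chroma.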

baidurja commented Aug 23, 2023

The following line of code:

`db2.persist()`

is missing from the current langchain documentation (https://python.langchain.com/docs/integrations/vectorstores/chroma).

(screenshot of the documentation page)

dosubot bot commented Nov 22, 2023

Hi, @Lufffya! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue you raised was about confusion in the documented code snippet for saving and loading embeddings from Chroma: after saving and loading, the example still searches the original db, and you asked whether this was an error. The confusion was resolved by passing an embedding_function parameter to Chroma, which you tested and confirmed; other users with similar questions confirmed the parameter is necessary. It was also pointed out that the line db2.persist() is missing from the documentation.

Now, we would like to know if this issue is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or the issue will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.

@dosubot added the stale label Nov 22, 2023
@dosubot closed this as not planned (Won't fix, can't repro, duplicate, stale) Nov 29, 2023
@dosubot removed the stale label Nov 29, 2023
Bardo-Konrad commented

When using `vectorstore = Chroma(persist_directory=sys.argv[1]+"-db", embedding_function=emb)` with `emb = embeddings.ollama.OllamaEmbeddings(model='nomic-embed-text')`, `retriever = vectorstore.as_retriever()`, and

```python
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | local_llm
    | StrOutputParser()
)
```

the model responds that the context is empty.

If, on the other hand, I create the vectorstore using

```python
vectorstore = Chroma.from_documents(
    documents=documents,
    collection_name=collection_name,
    embedding=emb,
    persist_directory=sys.argv[1]+"-db",
)
```

the model gets a context.

How come?

nidhin-krishnakumar commented

> (quoting @Bardo-Konrad's comment above)

I am also facing the same issue! Any idea why it is so?

7 participants