Issue: UnicodeDecodeError when running .txt files using TextLoader #9005
Comments
Answer generated by a 🤖

Answer

Hello,

Thank you for reaching out with your issue. The error you're encountering is a UnicodeDecodeError. In your case, it seems like the file is being decoded with cp1252, the codec named in your traceback, rather than UTF-8. There are a couple of ways to resolve this issue:

1. Specify the encoding when initializing the TextLoader:

    loader = TextLoader("./Training/test2.txt", encoding='utf-8')

2. Let the loader detect the encoding by setting autodetect_encoding to True:

    loader = TextLoader("./Training/test2.txt", autodetect_encoding=True)

Here's the relevant part of the TextLoader constructor:

    def __init__(
        self,
        file_path: str,
        encoding: Optional[str] = None,
        autodetect_encoding: bool = False,
    ):
        """Initialize with file path."""
        self.file_path = file_path
        self.encoding = encoding
        self.autodetect_encoding = autodetect_encoding

Also, there was a similar issue raised in the LangChain repository: Unicode error on Windows. The suggested solution there was to enable Unicode system-wide.

I hope this helps! Let me know if you have any other questions or if the issue persists.

Best,
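For readers who want to see the failure without LangChain installed, here is a minimal standard-library sketch. It assumes the loader ends up calling open() with whatever encoding it was given (the traceback in the issue shows a plain f.read() inside text.py, but the exact internals are an assumption). The sample sentence and the temporary file are hypothetical stand-ins for test2.txt.

```python
import os
import tempfile

# Sample text with curly quotes. The closing quote U+201D is the byte
# sequence E2 80 9D in UTF-8, and 0x9D has no mapping in cp1252.
sample = "The judge said \u201cpay the fine\u201d."

# Write the sample as UTF-8 bytes to a temporary file
# (a hypothetical stand-in for test2.txt).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("utf-8"))
    path = tmp.name

try:
    # Simulate the Windows default by requesting cp1252 explicitly,
    # so the failure reproduces on any operating system.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        failure = None
    except UnicodeDecodeError as exc:
        failure = exc

    # Naming the file's real encoding reads it back correctly.
    with open(path, encoding="utf-8") as f:
        recovered = f.read()
finally:
    os.remove(path)

print(failure)              # the 'charmap' decode error
print(recovered == sample)  # whether the UTF-8 read round-trips
```

The same reasoning is why passing encoding='utf-8' to the loader is the more predictable of the two options when the file's encoding is already known.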
This worked, thank you!
Hi, @mepurgamentum! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were encountering a UnicodeDecodeError when trying to load a UTF-8 encoded .txt file using the TextLoader. You received advice from me to either specify the correct encoding when initializing the TextLoader or set autodetect_encoding to True. It seems that the suggested solution worked for you.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!
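The idea behind encoding autodetection can be approximated with a simple fallback loop. The sketch below is not LangChain's implementation; decode_with_fallback and its candidate list are hypothetical names chosen for illustration, and it uses only the standard library.

```python
from typing import Iterable, Tuple


def decode_with_fallback(
    data: bytes,
    candidates: Iterable[str] = ("utf-8", "cp1252", "latin-1"),
) -> Tuple[str, str]:
    """Return (text, encoding) for the first candidate that decodes cleanly.

    Hypothetical helper: it tries each encoding in order and moves on
    whenever a UnicodeDecodeError is raised.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# UTF-8 input is accepted by the first candidate.
print(decode_with_fallback("\u201cfine\u201d".encode("utf-8")))

# Bytes that are invalid as UTF-8 fall through to cp1252.
print(decode_with_fallback(b"\x93fine\x94"))
```

Order matters here: cp1252 and latin-1 accept most byte sequences, so placing them before utf-8 would silently turn UTF-8 files into mojibake instead of raising an error.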
Hello, I used encoding='cp1251' in TextLoader. Everything looks fine when I inspect the loaded result. But when I execute my code, I get the error: 'charmap' codec can't decode byte 0x81 in position 33: character maps to <undefined>. Even when I removed the first 100 bytes of the text, the error is the same.

UnicodeDecodeError Traceback (most recent call last)
File ~\AppData\Roaming\Python\Python310\site-packages\langchain\chains\__init__.py:51, in <module>
File ~\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm_summarization_checker\base.py:20, in <module>
File ~\AppData\Roaming\Python\Python310\site-packages\langchain_core\prompts\prompt.py:203, in PromptTemplate.from_file(cls, template_file, input_variables, **kwargs)
File c:\Program Files\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
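Both byte values reported in this thread, 0x9d and 0x81, are among the few that Python's cp1252 codec leaves unmapped. The frames above (PromptTemplate.from_file, then cp1252.py) suggest that the failing read happens outside TextLoader, in a file opened with the platform default encoding, so the encoding passed to TextLoader would not apply there; that reading of the traceback is an assumption. A small sketch for checking both points:

```python
import locale

# Collect every single-byte value that cp1252 cannot decode.
undefined_in_cp1252 = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined_in_cp1252.append(value)

print([hex(v) for v in undefined_in_cp1252])

# The encoding that open() falls back to when no encoding argument is given.
# On many Windows setups this reports cp1252.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

If the default turns out to be cp1252, running Python in UTF-8 mode (the -X utf8 option or the PYTHONUTF8=1 environment variable) changes that default for the whole process, which may help when the failing open() call lives in library code.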
Issue you'd like to raise.
Hello - I receive a UnicodeDecodeError when running the below code:
from dotenv import load_dotenv
load_dotenv()
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
llm = OpenAI(temperature=0.1)
from langchain.document_loaders import TextLoader
loader = TextLoader("./Training/test2.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(search_kwargs={"k": 1}))
query = "How much did the judge fine Twitter?"
qa.run(query)
The test2.txt is a 1,600-word UTF-8 encoded file. Here is the error text that I receive:
Traceback (most recent call last):
File "C:\Users\Admin\PycharmProjects\pythonProject\Misc\Testing.py", line 17, in <module>
documents = loader.load()
^^^^^^^^^^^^^
File "C:\Users\Admin\PycharmProjects\pythonProject\Misc\venv\Lib\site-packages\langchain\document_loaders\text.py", line 18, in load
text = f.read()
^^^^^^^^
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1079: character maps to <undefined>
Any advice would be greatly appreciated.
Suggestion:
No response