Issue: UnicodeDecodeError when running .txt files using TextLoader #9005

mepurgamentum · 2023-08-09T20:48:00Z

Issue you'd like to raise.

Hello - I receive a UnicodeDecodeError when running the below code:

from dotenv import load_dotenv
load_dotenv()
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import RetrievalQA
llm = OpenAI(temperature=0.1)
from langchain.document_loaders import TextLoader
loader = TextLoader("./Training/test2.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever(search_kwargs={"k": 1}))
query = "How much did the judge fine Twitter?"
qa.run(query)

The test2.txt is a 1,600-word UTF-8 encoded file. Here is the error text that I receive:

Traceback (most recent call last):
File "C:\Users\Admin\PycharmProjects\pythonProject\Misc\Testing.py", line 17, in <module>
documents = loader.load()
^^^^^^^^^^^^^
File "C:\Users\Admin\PycharmProjects\pythonProject\Misc\venv\Lib\site-packages\langchain\document_loaders\text.py", line 18, in load
text = f.read()
^^^^^^^^
File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1079: character maps to <undefined>

Any advice would be greatly appreciated.

Suggestion:

No response

dosubot · 2023-08-10T08:13:49Z

Answer generated by a 🤖

Answer

Hello,

Thank you for reaching out with your issue. The error you're encountering is a UnicodeDecodeError, which typically occurs when the encoding of the file you're trying to load doesn't match the encoding specified in the TextLoader or the default system encoding if no encoding is specified.

In your case, the traceback shows the file being decoded with cp1252, the default system encoding on your Windows machine, because no encoding was passed to the TextLoader. Your file is UTF-8 encoded, so some of its bytes have no mapping in cp1252, and that raises the UnicodeDecodeError.
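
The byte reported in the traceback makes this concrete. As a minimal sketch (the sample string is made up, not taken from test2.txt): the right curly quote U+201D is stored in UTF-8 as three bytes ending in 0x9d, and Python's cp1252 codec leaves 0x9d unmapped, so decoding such text as cp1252 fails with the same kind of message.

```python
# Illustrative only: the sample text is an assumption, not the contents of test2.txt.
sample = "\u201cfined\u201d"  # the word "fined" wrapped in curly quotes
raw = sample.encode("utf-8")

# The closing quote U+201D is stored as the bytes E2 80 9D in UTF-8.
closing_quote_bytes = "\u201d".encode("utf-8")

try:
    # Decoding UTF-8 bytes with cp1252, as a text-mode read does when
    # cp1252 is the system default and no encoding is given.
    raw.decode("cp1252")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Decoding with the codec the bytes were written in round-trips cleanly.
round_trip_ok = raw.decode("utf-8") == sample
print(round_trip_ok)
```

Under these assumptions the failure depends only on the bytes and the codec, which is why naming the right encoding is enough to fix it.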

There are a couple of ways to resolve this issue:

1. Specify the correct encoding when initializing the TextLoader. Since your file is UTF-8 encoded, you can specify 'utf-8' as the encoding. Here's how you can do it:

loader = TextLoader("./Training/test2.txt", encoding='utf-8')

2. Set autodetect_encoding to True when initializing the TextLoader. This will let the TextLoader automatically detect the file encoding. Here's how you can do it:

loader = TextLoader("./Training/test2.txt", autodetect_encoding=True)

Here's the relevant part of the TextLoader code for reference:

def __init__(
    self,
    file_path: str,
    encoding: Optional[str] = None,
    autodetect_encoding: bool = False,
):
    """Initialize with file path."""
    self.file_path = file_path
    self.encoding = encoding
    self.autodetect_encoding = autodetect_encoding
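
The two constructor parameters correspond to two strategies: decode with one named codec, or try to work out which codec fits. Below is a minimal stdlib-only sketch of that idea. The helper `read_text` and its `fallback_encodings` parameter are hypothetical names introduced for illustration; this is not LangChain's implementation, and the real TextLoader may detect encodings differently.

```python
import os
import tempfile
from typing import Optional, Sequence


def read_text(
    file_path: str,
    encoding: Optional[str] = None,
    fallback_encodings: Sequence[str] = (),
) -> str:
    """Read a text file, trying `encoding` first and then each fallback.

    Hypothetical helper for illustration; not part of LangChain.
    """
    candidates = [encoding, *fallback_encodings]
    for index, candidate in enumerate(candidates):
        try:
            # encoding=None makes open() use the platform default codec.
            with open(file_path, encoding=candidate) as f:
                return f.read()
        except UnicodeDecodeError:
            # Re-raise only when there is nothing left to try.
            if index == len(candidates) - 1:
                raise
    raise AssertionError("unreachable: the last candidate returns or raises")


# Usage: a UTF-8 file with curly quotes, read with the wrong codec first.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("He said \u201cno\u201d.")
    text = read_text(path, encoding="cp1252", fallback_encodings=("utf-8",))

print(text == "He said \u201cno\u201d.")
```

The order of the candidates matters in a sketch like this: a codec such as latin-1 accepts every byte value, so it never raises and would hide a wrong guess if it were tried first.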

Also, a similar issue was raised in the LangChain repository: "Unicode error on Windows". The solution suggested there was to enable Python's UTF-8 mode system-wide by setting the environment variable PYTHONUTF8=1, which could serve as a temporary fix for your issue.
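
To check what that environment variable changes, one option is to start a child interpreter with and without it and print the relevant settings. This is a minimal sketch assuming Python 3.7 or later; the helper name `probe_utf8_mode` is made up for illustration.

```python
import os
import subprocess
import sys

# Code run in the child interpreter: report the UTF-8 mode flag and the
# encoding that text-mode open() uses when none is given.
PROBE = (
    "import locale, sys; "
    "print(sys.flags.utf8_mode, locale.getpreferredencoding(False).lower())"
)


def probe_utf8_mode(value: str) -> str:
    """Run a child Python with PYTHONUTF8 set to `value` and return its report."""
    env = {**os.environ, "PYTHONUTF8": value}
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


enabled = probe_utf8_mode("1")
disabled = probe_utf8_mode("0")
print("PYTHONUTF8=1 ->", enabled)
print("PYTHONUTF8=0 ->", disabled)
```

With the variable set to 1, the child reports the flag as 1 and UTF-8 as its default text encoding, independent of the system locale.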

I hope this helps! Let me know if you have any other questions or if the issue persists.

Best,
Dosu

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

mepurgamentum · 2023-08-10T15:02:54Z

This worked, thank you!

dosubot · 2023-11-09T16:02:06Z

Hi, @mepurgamentum! I'm Dosu, and I'm helping the LangChain team manage our backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were encountering a UnicodeDecodeError when trying to run a UTF-8 encoded .txt file using the TextLoader. You received advice from me to either specify the correct encoding when initializing the TextLoader or set autodetect_encoding to True. It seems that the suggested solution worked for you.

Before we close this issue, we wanted to check if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on this issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository!

Warlib1975 · 2024-02-04T06:11:27Z

Hello,

I used encoding='cp1251' in TextLoader, and the loaded result looks fine. But when I execute
qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())

I get the error: 'charmap' codec can't decode byte 0x81 in position 33: character maps to <undefined>
I tried converting the source file to UTF-8, but the error is still the same. :-(
None of your recommendations work.

Even when I removed the first 100 bytes of the text, the error is the same.

UnicodeDecodeError Traceback (most recent call last)
Input In [72], in
----> 1 from langchain.chains import RetrievalQA
3 qa_chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever())

File ~\AppData\Roaming\Python\Python310\site-packages\langchain\chains\__init__.py:51, in <module>
49 from langchain.chains.llm_math.base import LLMMathChain
50 from langchain.chains.llm_requests import LLMRequestsChain
---> 51 from langchain.chains.llm_summarization_checker.base import LLMSummarizationCheckerChain
52 from langchain.chains.loading import load_chain
53 from langchain.chains.mapreduce import MapReduceChain

File ~\AppData\Roaming\Python\Python310\site-packages\langchain\chains\llm_summarization_checker\base.py:20, in <module>
16 from langchain.chains.sequential import SequentialChain
18 PROMPTS_DIR = Path(__file__).parent / "prompts"
---> 20 CREATE_ASSERTIONS_PROMPT = PromptTemplate.from_file(PROMPTS_DIR / "create_facts.txt")
21 CHECK_ASSERTIONS_PROMPT = PromptTemplate.from_file(PROMPTS_DIR / "check_facts.txt")
22 REVISED_SUMMARY_PROMPT = PromptTemplate.from_file(PROMPTS_DIR / "revise_summary.txt")

File ~\AppData\Roaming\Python\Python310\site-packages\langchain_core\prompts\prompt.py:203, in PromptTemplate.from_file(cls, template_file, input_variables, **kwargs)
190 """Load a prompt from a file.
191
192 Args:
(...)
200 The prompt loaded from the file.
201 """
202 with open(str(template_file), "r") as f:
--> 203 template = f.read()
204 if input_variables:
205 warnings.warn(
206 "`input_variables' is deprecated and ignored.", DeprecationWarning
207 )

File c:\Program Files\Python310\lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
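
The traceback above ends in an open() call with no encoding argument, so that read uses the locale codec (cp1252 here) regardless of what was passed to TextLoader. One way to narrow down an error like this is to list the bytes that the codec rejects. A minimal sketch follows, assuming a hypothetical helper `undecodable_bytes` and a made-up sample string rather than the actual file contents.

```python
from typing import List, Tuple


def undecodable_bytes(data: bytes, encoding: str) -> List[Tuple[int, int]]:
    """Return (offset, byte value) for each byte that `encoding` cannot decode.

    Hypothetical diagnostic helper. Resuming one byte after each failure is
    only meaningful for single-byte codecs such as cp1252.
    """
    found = []
    start = 0
    while start < len(data):
        try:
            data[start:].decode(encoding)
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            offset = start + exc.start
            found.append((offset, data[offset]))
            start = offset + 1
    return found


# Made-up sample: U+00C1 (capital A with acute) is stored as C3 81 in UTF-8.
sample = "\u00c1 propos".encode("utf-8")
hits = undecodable_bytes(sample, "cp1252")
print([(offset, hex(value)) for offset, value in hits])
```

The same check could be pointed at the bytes of the file named in the traceback, read with open(path, "rb"), to see which offsets cp1252 cannot handle.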

dosubot bot added the "Ɑ: doc loader" (related to document loader module, not documentation) and "🤖:bug" (related to a bug, vulnerability, or unexpected error with an existing feature) labels on Aug 9, 2023

dosubot bot added the "stale" label (issue has not had recent activity or appears to be solved; stale issues will be automatically closed) on Nov 9, 2023

dosubot bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Nov 16, 2023

dosubot bot removed the "stale" label on Nov 16, 2023

yuhenggh mentioned this issue on Dec 6, 2023:

load_prompt Unable to set encoding for JSON files #6900 (closed, 14 tasks)
