UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>. #121

Closed
marshalmiller opened this issue Jul 29, 2022 · 5 comments · Fixed by #159
Labels: bug, hacktoberfest, help wanted, python

Comments

@marshalmiller
Collaborator

marshalmiller commented Jul 29, 2022

I get this error when running linkrot on the attached PDF. Traceback below; file attached.

> Traceback (most recent call last):
>   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "c:\python38\lib\runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
>   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
>     pdf = linkrot.linkrot(args.pdf)
>   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
>     self.reader = PDFMinerBackend(self.stream)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 204, in __init__
>     self.metadata[k] = make_compat_str(v)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 67, in make_compat_str
>     out_str = in_str.decode(enc["encoding"])
>   File "c:\python38\lib\encodings\cp1254.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

ah-1.pdf

@marshalmiller
Collaborator Author

At first glance, this appears to be an encoding issue on Windows. Does anyone want to tackle this one?

@marshalmiller added the "help wanted" label on Jul 29, 2022
@marshalmiller
Collaborator Author

After further investigation, this looks like an issue in chardet itself: it reports an encoding (cp1254 in the traceback above) that cannot actually decode the bytes. Does anyone know a workaround?
The chardet issue is tracked here: chardet/chardet#148
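
For reference, the failure can be reproduced without linkrot or the PDF. This is a minimal sketch, assuming the detector reported cp1254 (as in the traceback) for metadata bytes that start with 0x8d. The `decode_error` helper is a hypothetical name used only for this illustration.

```python
def decode_error(data: bytes, encoding: str):
    """Return the UnicodeDecodeError raised by decoding, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err
    return None


# Byte 0x8d has no mapping in Python's cp1254 table, so this decode fails.
err = decode_error(b"\x8d", "cp1254")
print(err)
```

Since the cp1254 codec ships with Python itself, the same decode should fail on any operating system once that encoding has been guessed.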

@marshalmiller
Collaborator Author

Here is a project that might be able to replace chardet: https://github.com/Ousret/charset_normalizer

@marshalmiller added the "hacktoberfest" label and removed the "good first issue" label on Sep 19, 2022
@wiseaidev
Contributor

Hey @marshalmiller, I think I can work on this issue, as it seems pretty easy at first glance. Essentially, we can wrap this line:

https://github.com/marshalmiller/linkrot/blob/162c8956d8e7dee9f915d414c61bf9d17bad5e5d/linkrot/backends.py#L54

in a try/except block:

    try:
        out_str = in_str.decode(enc["encoding"])
    except UnicodeDecodeError:
        # The detected encoding cannot decode these bytes; use an empty string.
        out_str = ""

What do you think?
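
A slightly more forgiving variant of the same idea, as a sketch only: instead of discarding the value, fall back to a lossy decode so the readable part of the metadata survives. `safe_decode` and its parameters are hypothetical names, not part of linkrot, and the detected encoding is passed in directly rather than taken from chardet.

```python
def safe_decode(in_bytes: bytes, detected_encoding, fallback: str = "utf-8") -> str:
    """Decode with the detected encoding, falling back to a lossy decode."""
    try:
        # detected_encoding may be None when detection fails;
        # bytes.decode raises TypeError in that case.
        return in_bytes.decode(detected_encoding)
    except (UnicodeDecodeError, LookupError, TypeError):
        # Undecodable bytes become U+FFFD instead of raising.
        return in_bytes.decode(fallback, errors="replace")


print(safe_decode(b"\x8dabc", "cp1254"))
```

Whether an empty string or a partially replaced string is better depends on how the metadata is used downstream; the empty-string version is simpler, while this one keeps whatever text can still be read.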

@marshalmiller
Collaborator Author

Yeah. That would seem logical. Give it a try.
