extract arabic text from pdf #591

Closed
dihia-lanasri opened this issue Dec 2, 2020 · 7 comments
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

@dihia-lanasri

Hello,

I have a PDF containing Arabic text and I need to extract that text, but the output I get looks like this:
f˘£˘∏˘≤â GCh∫ GCeù¢, b˘Éa˘∏˘á J†°˘Ée˘æ«á eø h’já Gdû°∏∞ fëƒ
fl«˘ª˘Éä Gd˘ÓL˘ÄÚ Gdü°˘ëôGhjÚ eƒL¡á d∏û°©

How can I decode it? Treating it as UTF-8 doesn't work.
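
A quick way to confirm the symptom is to check whether the extracted string contains any Arabic letters at all. The following is a minimal sketch using only the Python standard library; the function name `arabic_letter_ratio` is hypothetical and not part of any PDF library. It only detects that extraction went wrong, it does not repair the text.

```python
import unicodedata


def arabic_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic letters.

    Hypothetical helper for illustration: a character counts as Arabic
    when its Unicode name starts with "ARABIC".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")
    )
    return arabic / len(letters)


# Start of the sample output quoted above: Latin letters and symbols only.
garbled = "f˘£˘∏˘≤â GCh∫ GCeù¢, b˘Éa˘∏˘á"
# An arbitrary Arabic phrase, used only as a reference point.
reference = "السلام عليكم"

print(arabic_letter_ratio(garbled))
print(arabic_letter_ratio(reference))
```

A ratio near zero for a document known to be Arabic suggests that the extracted characters do not correspond to the glyphs on the page, so re-decoding the string with another text codec is unlikely to help by itself.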

@Anas-jaf

This is not UTF-8.
Did you figure out a workaround?

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Apr 7, 2022
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Apr 16, 2022
@MartinThoma
Member

#464 might help to solve this

@pubpub-zz
Collaborator

@dihia-lanasri,

Can you provide a PDF so that we can evaluate the extraction? Thanks.

@MartinThoma
Member

MartinThoma commented Jun 6, 2022

@pubpub-zz I've added one in #954 :-) The issue looks different now. I think it might be related to the different writing directions in one document.
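
To see where writing direction changes within a piece of extracted text, one option is to group characters by their Unicode bidirectional class. This is a rough sketch using only the standard library; `strong_direction` and `direction_runs` are hypothetical helper names. It is not an implementation of the Unicode Bidirectional Algorithm: it only splits text into runs of strongly left-to-right or right-to-left characters, with neutral characters such as spaces and digits attached to the preceding run.

```python
import unicodedata

# Bidi classes for strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def strong_direction(ch):
    """Return "RTL", "LTR", or None for neutral/weak characters."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in RTL_CLASSES:
        return "RTL"
    if bidi == "L":
        return "LTR"
    return None


def direction_runs(text):
    """Split text into (direction, substring) runs of strong characters."""
    runs = []  # each entry is [direction, characters collected so far]
    for ch in text:
        direction = strong_direction(ch)
        if direction is None:
            if runs:
                runs[-1][1] += ch  # neutral characters join the current run
            continue  # neutral characters before the first run are dropped
        if runs and runs[-1][0] == direction:
            runs[-1][1] += ch
        else:
            runs.append([direction, ch])
    return [(direction, chars.strip()) for direction, chars in runs]


for direction, chunk in direction_runs("abc مرحبا def"):
    print(direction, repr(chunk))
```

Printing the runs for each extracted line makes it easier to spot lines where Latin and Arabic segments are interleaved, which is where ordering problems would be expected to show up.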

@pubpub-zz
Collaborator

Under analysis

@pubpub-zz
Collaborator

@dihia-lanasri, some work has been done on text extraction, but because Arabic is written right to left, it is difficult for me to evaluate the result. Can you give us your feedback?

@MartinThoma
Member

I assume this works now. Please let us know if anybody encounters any issues!

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 29, 2022
includes also reactivation of test_extract_text_hello_world as py-pdf#591 is closed