extract arabic text from pdf #591
Comments
This is not UTF-8.
#464 might help to solve this.
Can you provide a PDF to evaluate extraction? Thanks.
@pubpub-zz I've added one in #954 :-) The issue looks different now. I think it might be related to the different writing directions in one document.
Under analysis
@dihia-lanasri, some work has been done on text extraction, but because Arabic is written right to left, it is difficult for me to evaluate. Can you give your feedback?
I assume this works now. Please let us know if anybody encounters any issues!
includes also reactivation of test_extract_text_hello_world as py-pdf#591 is closed
Hello,
I have a PDF with Arabic text and I need to extract that text, but I obtain something like:
f˘£˘∏˘≤â GCh∫ GCeù¢, b˘Éa˘∏˘á J†°˘Ée˘æ«á eø h’já Gdû°∏∞ fëƒ
fl«˘ª˘Éä Gd˘ÓL˘ÄÚ Gdü°˘ëôGhjÚ eƒL¡á d∏û°©
How can I decode it? Decoding as UTF-8 doesn't work.
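A quick way to see why re-decoding does not help: the extracted value is already a Unicode string, and none of its letters are in the Arabic blocks, so the wrong characters were most likely chosen during extraction rather than during decoding. The sketch below uses only the standard library; the helper name `arabic_letter_ratio` is hypothetical (not part of any PDF library), and the sample strings are a fragment of the output above plus an illustrative Arabic word.

```python
import unicodedata


def arabic_letter_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Arabic.

    Hypothetical diagnostic helper: a character counts as Arabic when its
    Unicode name starts with "ARABIC".
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = [
        ch for ch in letters
        if unicodedata.name(ch, "").startswith("ARABIC")
    ]
    return len(arabic) / len(letters)


# Fragment of the garbled output quoted in this issue.
garbled = "Gd˘ÓL˘ÄÚ"
# Illustrative Arabic word, not taken from the PDF in question.
arabic_sample = "مرحبا"

print(arabic_letter_ratio(garbled))
print(arabic_letter_ratio(arabic_sample))
```

A ratio near zero for a page known to contain Arabic suggests that the character mapping in the PDF was not applied during extraction, so calling `.encode()` or `.decode()` on the result is unlikely to recover the text.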