extract arabic text from pdf #591

dihia-lanasri · 2020-12-02T15:43:22Z

Hello,

I have a PDF containing Arabic text and I need to extract that text, but I get something like:
f˘£˘∏˘≤â GCh∫ GCeù¢, b˘Éa˘∏˘á J†°˘Ée˘æ«á eø h’já Gdû°∏∞ fëƒ
fl«˘ª˘Éä Gd˘ÓL˘ÄÚ Gdü°˘ëôGhjÚ eƒL¡á d∏û°©

How can I decode it? Decoding it as UTF-8 doesn't work.

Anas-jaf · 2021-06-29T15:38:57Z

This is not UTF-8.
Did you figure out a workaround?
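
One way to confirm that output like the sample above is not Arabic in any encoding is to measure how many of its letters are Arabic code points. This is only a minimal sketch using the standard library; the helper name `arabic_ratio` is made up for illustration and is not part of any PDF library.

```python
import unicodedata


def arabic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Arabic letters.

    Hypothetical diagnostic helper: a value near 0.0 for text that should
    be Arabic suggests the extractor returned wrongly mapped glyph codes.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # unicodedata.name() returns e.g. "ARABIC LETTER MEEM" for Arabic letters;
    # the "" default avoids a ValueError for unnamed code points.
    arabic = [ch for ch in letters if unicodedata.name(ch, "").startswith("ARABIC")]
    return len(arabic) / len(letters)
```

A result close to 0.0 on a page known to be Arabic points at a font-to-Unicode mapping problem in the extraction step rather than at a decoding step you could fix afterwards.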

MartinThoma · 2022-04-16T11:07:07Z

#464 might help to solve this

pubpub-zz · 2022-06-04T17:45:33Z

@dihia-lanasri,

Can you provide a PDF so that the extraction can be evaluated? Thanks.


MartinThoma · 2022-06-06T12:52:33Z

@pubpub-zz I've added one in #954 :-) The issue looks different now. I think it might be related to the different writing directions in one document.
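
To check whether a piece of extracted text really mixes writing directions, something like the following could be used. It is a rough sketch based on the standard library's `unicodedata.bidirectional`; the function name `direction_counts` is an assumption for illustration, and the sketch only counts strongly directional characters rather than implementing the full Unicode bidirectional algorithm.

```python
import unicodedata
from collections import Counter


def direction_counts(text: str) -> Counter:
    """Count strongly right-to-left and left-to-right characters in text."""
    counts = Counter()
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # Hebrew-style RTL and Arabic letters
            counts["rtl"] += 1
        elif bidi == "L":  # strong left-to-right, e.g. Latin letters
            counts["ltr"] += 1
        # Digits, whitespace and punctuation are neutral or weak; skip them.
    return counts
```

If both counts are non-zero for a single page, the page contains runs in both directions, which would fit the suspicion above.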

pubpub-zz · 2022-06-07T04:43:17Z

Under analysis

pubpub-zz · 2022-06-19T11:29:25Z

@dihia-lanasri, some work has been done on text extraction, but because Arabic is written right to left, it is difficult for me to evaluate the result. Can you give your feedback?

MartinThoma · 2022-06-27T21:19:27Z

I assume this works now. Please let us know if anybody encounters any issues!
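
For anyone who wants to re-test, a minimal sketch could look like this. It assumes the current `pypdf` package name (at the time of this thread the library was published as `PyPDF2`), and the helper names `extract_all_text` and `extract_from_file` are made up for illustration.

```python
def extract_all_text(reader) -> str:
    """Join the text of every page of a reader-like object.

    `reader` only needs a `.pages` sequence whose items offer
    `.extract_text()`, which is what pypdf's PdfReader provides.
    """
    # Treat a missing result as an empty string so join() never fails.
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_from_file(path: str) -> str:
    """Open a PDF with pypdf and return the text of all its pages."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    return extract_all_text(PdfReader(path))
```

Calling `extract_from_file("your-file.pdf")` on the affected document (the file name is a placeholder) and comparing the result with the visible text should show whether the problem still occurs.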


MartinThoma added the is-bug label (from a user's perspective, this is a bug: a violation of the expected behavior with a compliant PDF) on Apr 7, 2022

MartinThoma added the workflow-text-extraction label (from a user's perspective, text extraction is the affected feature/workflow) on Apr 16, 2022

MartinThoma added a commit that referenced this issue Jun 6, 2022

TST: Text extraction for non-latin alphabets

14344c9

See #591

MartinThoma mentioned this issue Jun 6, 2022

TST: Text extraction for non-latin alphabets #954

Merged

MartinThoma added a commit that referenced this issue Jun 6, 2022

TST: Text extraction for non-latin alphabets (#954)

babe32e

See #591

MartinThoma mentioned this issue Jun 7, 2022

v2.1 extract_text() misses newline characters #957

Closed

pubpub-zz mentioned this issue Jun 11, 2022

improved ExtractText(3) #969

Merged

MartinThoma closed this as completed Jun 27, 2022

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Oct 29, 2022

create PdfWriterInterface to prevent recursive import

d6efb16

includes also reactivation of test_extract_text_hello_world as py-pdf#591 is closed
