community: support advanced text extraction options for pdf documents #20265

bricefotzo · 2024-04-10T10:17:15Z

Description:

Updated constructors in PyPDFParser and PyPDFLoader to handle extraction_mode and additional kwargs, aligning with the capabilities of PageObject.extract_text() from pypdf.
Added test_pypdf_loader_with_layout along with a corresponding example text file to validate layout extraction from PDFs.
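
For readers skimming the thread, here is a minimal sketch of the forwarding pattern described above: the constructor stores `extraction_mode` and extra kwargs, then passes them to each page's `extract_text()`. `SketchPDFParser` and `FakePage` are hypothetical names for illustration only; this is not the merged implementation, and the fake page stands in for a pypdf `PageObject` so the snippet runs without pypdf installed.

```python
from typing import Any, Dict, List, Optional


class SketchPDFParser:
    """Hypothetical stand-in for PyPDFParser (illustration, not the real class)."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        # Store the options so they can be forwarded on every page.
        self.extraction_mode = extraction_mode
        self.extraction_kwargs = extraction_kwargs or {}

    def parse_pages(self, pages: List[Any]) -> List[str]:
        # Forward the stored options to each page's extract_text().
        return [
            page.extract_text(
                extraction_mode=self.extraction_mode, **self.extraction_kwargs
            )
            for page in pages
        ]


class FakePage:
    """Assumed test double: records the kwargs it receives."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def extract_text(self, **kwargs: Any) -> str:
        self.calls.append(kwargs)
        return f"mode={kwargs.get('extraction_mode')}"


page = FakePage()
parser = SketchPDFParser(extraction_mode="layout")
texts = parser.parse_pages([page])
```

With a real pypdf page object, the same call shape would request layout-preserving extraction instead of the default plain mode.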

Issue: fixes #19735

Dependencies: This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test test_pypdf_loader_with_layout and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities.

bricefotzo · 2024-06-26T12:03:04Z

Hello Guys!
@ccurme @baskaryan @efriis @maximeperrindev
Hope you're all well!

Just wanted to ask whether there is anything I can do to help, as I saw some activity on this last week.

efriis

is there a way we can make this backwards compatible with version 3 of pypdf?

bricefotzo · 2024-07-04T16:02:14Z

> is there a way we can make this backwards compatible with version 3 of pypdf?

@efriis do you think something like this can be ok?

if pypdf.__version__.startswith('4'):
    page_content = page.extract_text(
        extraction_mode=self.extraction_mode, **self.extraction_kwargs
    )
else:
    page_content = page.extract_text()

efriis · 2024-07-12T22:45:34Z

Could we reverse it to: if pypdf.__version__.startswith('3.'), do the old thing, otherwise the new one? The base assumption being that future releases are additive.

Will mark this a draft, feel free to mark ready to review when you want me to take a look!
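
A rough sketch of the reversed check suggested in this comment, under stated assumptions: `extract_page_text` and `RecordingPage` are hypothetical names, and the version string is passed in as a parameter (in real code it would come from `pypdf.__version__`) so the snippet runs without pypdf installed. It is not the code that was merged.

```python
from typing import Any, Dict, Optional


def extract_page_text(
    page: Any,
    pypdf_version: str,
    extraction_mode: str = "plain",
    extraction_kwargs: Optional[Dict[str, Any]] = None,
) -> str:
    """Use the old call only on pypdf 3.x; assume later releases are additive."""
    if pypdf_version.startswith("3."):
        # Old behaviour: no extra options are passed.
        return page.extract_text()
    # Any other major version gets the new options.
    return page.extract_text(
        extraction_mode=extraction_mode, **(extraction_kwargs or {})
    )


class RecordingPage:
    """Assumed test double standing in for a pypdf page object."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    def extract_text(self, **kwargs: Any) -> str:
        self.last_kwargs = kwargs
        return "text"


old_page, new_page = RecordingPage(), RecordingPage()
extract_page_text(old_page, "3.17.4", extraction_mode="layout")
extract_page_text(new_page, "4.2.0", extraction_mode="layout")
```

Matching on the prefix `'3.'` with the trailing dot means a hypothetical future `30.x` release would still take the new code path.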

bricefotzo · 2024-07-17T10:48:34Z

Hi @efriis,
I just added the changes. You can check them; I mentioned you there.
Feel free to tell me if there is something to change or improve.
Thanks!

efriis added the template label Apr 10, 2024

efriis self-assigned this Apr 10, 2024

dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 10, 2024

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 10, 2024

bricefotzo force-pushed the master branch from a59526a to 64b522d Compare April 10, 2024 14:01

bricefotzo force-pushed the master branch from 64b522d to 445d866 Compare April 10, 2024 14:23

Merge branch 'master' into master

e9c0bbd

Merge branch 'master' into master

4c2a65e

maximeperrindev approved these changes Apr 11, 2024

Merge branch 'master' into master

4f7f01d

Merge branch 'master' into master

559c3c9

baskaryan added 3 commits April 26, 2024 19:29

Merge branch 'master' into bricefotzo/master

4b098bd

fmt

c15da72

fmt

9df9c6a

poetry

8078535

ccurme added the community Related to langchain-community label Jun 18, 2024

efriis reviewed Jul 2, 2024

efriis marked this pull request as draft July 12, 2024 22:45

chore: add a method to extract from image given the pypdf version

9f1e5b5

bricefotzo force-pushed the master branch from bd5dc01 to 9f1e5b5 Compare July 17, 2024 10:41

bricefotzo marked this pull request as ready for review July 17, 2024 10:45

dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jul 17, 2024

efriis added 2 commits July 17, 2024 13:30

x

dcc3bf7

x

4c88ca3

efriis approved these changes Jul 17, 2024

dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 17, 2024

efriis enabled auto-merge (squash) July 17, 2024 20:41

efriis merged commit 034a8c7 into langchain-ai:master Jul 17, 2024
44 checks passed
