community: support advanced text extraction options for pdf documents #20265

Merged: 12 commits, Jul 17, 2024


bricefotzo
Contributor

Description:

  • Updated constructors in PyPDFParser and PyPDFLoader to handle extraction_mode and additional kwargs, aligning with the capabilities of PageObject.extract_text() from pypdf.

  • Added test_pypdf_loader_with_layout along with a corresponding example text file to validate layout extraction from PDFs.

Issue: fixes #19735

Dependencies: This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.

Additional changes include the addition of a new test test_pypdf_loader_with_layout and an example text file to ensure the functionality of layout extraction from PDFs aligns with the new capabilities.
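
As a rough illustration of the constructor change described above, the following minimal sketch shows options being stored at construction time and forwarded to each page's `extract_text()` call. `FakePage` and `SketchParser` are hypothetical stand-ins written for this example, not the actual `PyPDFParser` code, and `layout_mode_space_vertically` is assumed here to be one of the layout-mode keyword arguments accepted by pypdf 4.x.

```python
from typing import Any, Dict, List, Optional, Tuple


class FakePage:
    """Hypothetical stand-in for a pypdf page; it records how extract_text is called."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Dict[str, Any]]] = []

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        # Record the arguments so the forwarding behaviour can be inspected.
        self.calls.append((extraction_mode, kwargs))
        return f"text extracted in {extraction_mode} mode"


class SketchParser:
    """Hypothetical parser that keeps extraction options and forwards them per page."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.extraction_mode = extraction_mode
        # Default to an empty dict so ** unpacking is always safe.
        self.extraction_kwargs = extraction_kwargs or {}

    def parse_page(self, page: FakePage) -> str:
        return page.extract_text(
            extraction_mode=self.extraction_mode, **self.extraction_kwargs
        )


page = FakePage()
parser = SketchParser(
    extraction_mode="layout",
    extraction_kwargs={"layout_mode_space_vertically": False},  # assumed kwarg name
)
result = parser.parse_page(page)
```

Keeping the extra options in a single dict means the parser does not need a new constructor parameter each time the underlying extraction call gains an option.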

@efriis efriis self-assigned this Apr 10, 2024
@dosubot dosubot bot added the size:M This PR changes 30-99 lines, ignoring generated files. label Apr 10, 2024

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 10, 2024
@ccurme ccurme added the community Related to langchain-community label Jun 18, 2024
@bricefotzo
Contributor Author

Hello Guys!
@ccurme @baskaryan @efriis @maximeperrindev
Hope you're all well!

Just checking whether there is anything I can do to help, as I saw some activity on this last week.

Member

@efriis efriis left a comment

is there a way we can make this backwards compatible with version 3 of pypdf?

@bricefotzo
Contributor Author

> is there a way we can make this backwards compatible with version 3 of pypdf?

@efriis do you think something like this can be ok?

if pypdf.__version__.startswith('4'):
    page_content = page.extract_text(
        extraction_mode=self.extraction_mode, **self.extraction_kwargs
    )
else:
    page_content = page.extract_text()

@efriis
Member

efriis commented Jul 12, 2024

Could we reverse it to `if pypdf.__version__.startswith('3.')`: do the old thing, otherwise new? Base assumption being that future releases are additive.

Will mark this a draft, feel free to mark ready to review when you want me to take a look!
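
One way the reversed check suggested above might look is sketched below. It assumes only that pypdf 3.x should get the old argument-free call and that every later version accepts `extraction_mode`. The helper name `extract_text_kwargs` is hypothetical and is not taken from the merged code.

```python
from typing import Any, Dict, Optional


def extract_text_kwargs(
    pypdf_version: str,
    extraction_mode: str = "plain",
    extraction_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the keyword arguments to pass to page.extract_text().

    Only 3.x is treated as the legacy case; any other version, including
    future majors, is assumed to support the newer options.
    """
    if pypdf_version.startswith("3."):
        # Old behaviour: call extract_text() with no extra arguments.
        return {}
    return {"extraction_mode": extraction_mode, **(extraction_kwargs or {})}


# Intended use (not executed here, since it needs pypdf and a real page):
#   page_content = page.extract_text(
#       **extract_text_kwargs(pypdf.__version__, mode, extra)
#   )
legacy = extract_text_kwargs("3.17.4", "layout", {"layout_mode_space_vertically": False})
current = extract_text_kwargs("4.2.0", "layout", {"layout_mode_space_vertically": False})
future = extract_text_kwargs("5.0.0")
```

Matching on `"3."` rather than `"4"` avoids silently falling back to the legacy path when a new major version is released, and the trailing dot keeps a hypothetical `30.x` from being mistaken for `3.x`.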

@bricefotzo bricefotzo marked this pull request as ready for review July 17, 2024 10:45
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels Jul 17, 2024
@bricefotzo
Contributor Author

Hi @efriis,
I just pushed the changes, so you can take a look. I mentioned you on them.
Feel free to tell me if there is something to change or do better.
Thanks!

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jul 17, 2024
@efriis efriis enabled auto-merge (squash) July 17, 2024 20:41
@efriis efriis merged commit 034a8c7 into langchain-ai:master Jul 17, 2024
44 checks passed
5 participants