community: support advanced text extraction options for pdf documents #20265
- Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf.
- Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs.
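As a rough illustration of the constructor change described above, the sketch below stores an extraction mode and extra keyword arguments and forwards them to each page's `extract_text()` call. `FakePage` and `SketchParser` are hypothetical stand-ins so the example runs without pypdf installed; they are not the classes from this PR. The keyword `layout_mode_space_vertically` is used only as an example of an option that pypdf's layout mode accepts.

```python
from typing import Any, Dict, Optional


class FakePage:
    """Hypothetical stand-in for a pypdf page; it just echoes its arguments."""

    def extract_text(self, extraction_mode: str = "plain", **kwargs: Any) -> str:
        return f"mode={extraction_mode} kwargs={sorted(kwargs)}"


class SketchParser:
    """Hypothetical parser that keeps the extraction options on the instance."""

    def __init__(
        self,
        extraction_mode: str = "plain",
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.extraction_mode = extraction_mode
        # Default to an empty dict so ** unpacking below is always safe.
        self.extraction_kwargs = extraction_kwargs or {}

    def parse_page(self, page: FakePage) -> str:
        # Forward the stored options unchanged to the page-level call.
        return page.extract_text(
            extraction_mode=self.extraction_mode, **self.extraction_kwargs
        )


parser = SketchParser(
    extraction_mode="layout",
    extraction_kwargs={"layout_mode_space_vertically": False},
)
print(parser.parse_page(FakePage()))
```

Keeping the options on the instance means callers configure extraction once, and the defaults preserve the previous plain-text behaviour.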
Hello everyone! I saw some activity on this PR last week. Is there anything I can do to help?
Is there a way we can make this backwards compatible with version 3 of pypdf?
@efriis do you think something like this would be OK?
    if pypdf.__version__.startswith('4'):
        page_content = page.extract_text(
            extraction_mode=self.extraction_mode, **self.extraction_kwargs
        )
    else:
        page_content = page.extract_text()
Could we reverse it to […]? Will mark this as a draft; feel free to mark it ready for review when you want me to take a look!
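One possible way to express the version gate discussed in this exchange is to compare the major version numerically rather than matching a string prefix, so that releases after 4.x are not routed to the fallback branch. The sketch below is only an illustration, not what was merged; `supports_extraction_mode` and `extract_page_text` are hypothetical helper names, and the threshold of 4 is taken from the dependency note in this PR.

```python
from typing import Any, Dict, Optional


def supports_extraction_mode(version: str) -> bool:
    """Return True when the major component of `version` is 4 or higher."""
    major = version.split(".", 1)[0]
    return major.isdigit() and int(major) >= 4


def extract_page_text(
    page: Any,
    version: str,
    extraction_mode: str = "plain",
    extraction_kwargs: Optional[Dict[str, Any]] = None,
) -> str:
    """Pass the extra options only when the installed version accepts them."""
    if supports_extraction_mode(version):
        return page.extract_text(
            extraction_mode=extraction_mode, **(extraction_kwargs or {})
        )
    # Older releases: fall back to the argument-free call.
    return page.extract_text()


for candidate in ("3.4.0", "4.0.0", "5.1.0"):
    print(candidate, supports_extraction_mode(candidate))
```

In real code the version string would come from `pypdf.__version__`; it is passed in explicitly here so the sketch runs without the library.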
Hi @efriis,
Description:
- Updated constructors in PyPDFParser and PyPDFLoader to handle `extraction_mode` and additional kwargs, aligning with the capabilities of `PageObject.extract_text()` from pypdf.
- Added `test_pypdf_loader_with_layout` along with a corresponding example text file to validate layout extraction from PDFs.
Issue: fixes #19735
Dependencies: This change requires updating the pypdf dependency from version 3.4.0 to at least 4.0.0.
Additional changes: a new test, `test_pypdf_loader_with_layout`, and an example text file check that layout extraction from PDFs behaves as expected.
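The general shape of a test that validates extraction against an example text file might look like the sketch below. It is an assumption about the approach, not the test from this PR: `fake_layout_extract` is a hypothetical stand-in for the loader output, and the reference file is created in a temporary directory so the example is self-contained.

```python
import tempfile
from pathlib import Path


def fake_layout_extract() -> str:
    """Hypothetical stand-in for text produced by a layout-mode loader."""
    return "Name      Qty\nWidget      3\n"


def matches_expected(actual: str, expected_path: Path) -> bool:
    """Compare exactly; in layout mode the whitespace is part of the result."""
    return actual == expected_path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    expected_file = Path(tmp) / "expected_layout.txt"
    expected_file.write_text("Name      Qty\nWidget      3\n", encoding="utf-8")
    result = matches_expected(fake_layout_extract(), expected_file)

print(result)
```

An exact comparison is used on purpose: collapsing whitespace before comparing would hide the column alignment that layout extraction is meant to preserve.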