Nowadays, mid-size and large companies handle large numbers of PDF documents every day, including invoices, receipts, reports, and more.
In this tutorial, you will learn how you can extract text from PDF documents in Python using the PyMuPDF library.
This tutorial covers PDFs whose text is not scanned, i.e., not stored as an image within the PDF. If you want to extract text from images in PDF documents, a separate tutorial covers that case.
To get started, we need to install PyMuPDF, which is available from PyPI via pip install PyMuPDF.
Open up a new Python file, and let's import the libraries:
PyMuPDF is imported under the name fitz, so keep that in mind.
Since we're going to make a Python script that extracts text from PDF documents, we have to use the argparse module to parse the passed parameters in the command line. The following function parses the arguments and does some processing:
First, we make our parser using ArgumentParser and add the following parameters:
file: The input PDF document to extract text from.
-p or --pages: The page indices to extract, starting from 0. If not specified, all pages are extracted.
-o or --output-file: The output text file to write the extracted text to. If not specified, the text is printed to standard output (i.e., the console).
-b or --by-page: A boolean flag indicating whether to output text page by page. If not specified, all text is joined in a single file (when -o is specified).
Second, we build the output_files dictionary, which maps each page to the file its text is written to: a separate file per page if -b is specified, otherwise a single file shared by all pages.
Finally, we return the necessary variables: PDF document, output files, and the list of page numbers.
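As a rough illustration of the steps above, here is a minimal sketch of the parser and the page-to-file mapping. The helper names build_parser and open_output_files, and the per-page naming scheme (text.txt becoming text-0.txt, text-1.txt, and so on), are assumptions made for this example, not necessarily what the original script uses.

```python
import argparse
import os
import sys


def build_parser():
    """Build a command-line parser with the four parameters described above."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF document.")
    parser.add_argument("file", help="Input PDF document")
    parser.add_argument("-p", "--pages", nargs="*", type=int,
                        help="Zero-based page indices (default: all pages)")
    parser.add_argument("-o", "--output-file",
                        help="Text file to write to (default: standard output)")
    parser.add_argument("-b", "--by-page", action="store_true",
                        help="Write each page to its own file when -o is given")
    return parser


def open_output_files(pages, output_file=None, by_page=False):
    """Map each page index to the file object its text should be written to."""
    if output_file is None:
        # No -o given: every page goes to the console.
        return {pn: sys.stdout for pn in pages}
    if by_page:
        # -o with -b: one file per page (naming scheme is an assumption).
        stem, ext = os.path.splitext(output_file)
        return {pn: open(f"{stem}-{pn}{ext}", "w", encoding="utf-8")
                for pn in pages}
    # -o without -b: all pages share a single file object.
    shared = open(output_file, "w", encoding="utf-8")
    return {pn: shared for pn in pages}
```

In a full script, the PDF itself would be opened with fitz.open(args.file) before building this mapping, so that the page list can default to every page of the document when -p is omitted.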
Next, let's make a function that accepts the above parameters and extracts text from the PDF document accordingly:
We iterate over the pages; if the page we're on is in the pages list, we extract the text of that page and write it to the specified file or standard output. Finally, we close the files.
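The loop just described might look roughly like the sketch below. It assumes pdf is a document opened with fitz.open() (or any iterable of page objects exposing get_text()), and that output_files maps page indices to writable file objects; the function name extract_text is chosen for illustration.

```python
import sys


def extract_text(pdf, output_files, pages):
    """Write the text of each selected page to its mapped output file.

    pdf          -- iterable of page objects exposing get_text(), such as
                    the document returned by fitz.open()
    output_files -- dict mapping page index -> writable file object
    pages        -- zero-based page indices to extract
    """
    wanted = set(pages)
    for page_number, page in enumerate(pdf):
        if page_number not in wanted:
            continue  # skip pages that were not requested
        # Older PyMuPDF releases spell this method getText().
        text = page.get_text()
        output_files[page_number].write(text)
    # Close every real file that is still open; leave standard output alone.
    for handle in output_files.values():
        if handle is not sys.stdout and not handle.closed:
            handle.close()
```

Because several pages may share one file object, the closing loop checks handle.closed so that a shared file is only closed once.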
Let's bring everything together and run the functions:
Awesome, let's try to extract the text from all pages of this file and write each page to a text file:
Output:
It worked perfectly. Here are the output files:
Now let's specify pages 0, 1, 2, 14, and 15:
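The exact command depends on what you named your script. As a small sketch of how such a page list is parsed, the snippet below feeds the same arguments to a stand-in parser; document.pdf is a placeholder file name, and only the -p option is reproduced here.

```python
import argparse

# Hypothetical stand-in for the script's parser; only -p/--pages is shown.
parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("-p", "--pages", nargs="*", type=int)

# Equivalent to passing "document.pdf -p 0 1 2 14 15" on the command line.
args = parser.parse_args(["document.pdf", "-p", "0", "1", "2", "14", "15"])
print(args.pages)  # [0, 1, 2, 14, 15]
```

Using nargs="*" with type=int is what lets a single -p flag collect any number of integer page indices.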
We can also print to the console instead of saving to a file by omitting the -o option:
Or save all the text of the PDF document into a single text file:
The output file will appear in the current directory:
Alright, that's it for this tutorial. As mentioned earlier, if your documents are scanned (i.e., stored as images whose text cannot be selected in your PDF reader), see the tutorial on extracting text from scanned PDF documents.
Also, you can redact and highlight the text in your PDF.
Check the complete code here.
Happy coding ♥