A Python client for Apache Tika Server.
You can install the package from PyPI:
pip install pytika
This package is a Python client for Apache Tika Server, so you'll need a Tika Server instance running and reachable, for example locally.
I recommend using the Docker image, as that's the simplest way to run it.
docker run --name tika-server -it -d -p 9998:9998 apache/tika:2.7.0-full
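Before moving on, it can help to confirm that the container is actually answering on port 9998. The snippet below is a minimal sketch using only the standard library, not part of PyTika: the helper name `tika_server_is_up` is hypothetical, and it assumes the server answers a GET on its `/version` endpoint at the port mapped above.

```python
import urllib.error
import urllib.request


def tika_server_is_up(base_url: str = "http://localhost:9998", timeout: float = 5.0) -> bool:
    """Return True if a GET to <base_url>/version succeeds (hypothetical helper)."""
    try:
        with urllib.request.urlopen(f"{base_url}/version", timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, connection refused, and timeouts all derive from OSError.
        return False


if __name__ == "__main__":
    print("Tika Server reachable:", tika_server_is_up())
```

If this prints `False`, `docker ps` and `docker logs tika-server` are reasonable places to look first.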
After you have that running, you can use PyTika to interface with it.
from pytika.api import TikaApi
tika = TikaApi()
with open("yourfile.whatever", "rb") as file:
    metadata = tika.get_meta(file)
"""
>>> metadata
{
'Content-Type': 'application/pdf',
...
}
"""
For text extraction, Tika Server usually decides on the response type (typically XML/HTML by default). To force it to return plain text (by sending the Accept: text/plain header), you can pass the following option:
from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt
tika = TikaApi()
with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.AsPlainText()).decode()
Notice the unusual configuration style: an option is passed as a function call. This comes from the functional options pattern popular in Go, which makes calling complex APIs a little friendlier. Since there are many options, instead of making each one an argument, we define an "option class" with chainable functions. This lets the API validate each option separately, avoids a massive argument list for get_text, and keeps the API code tidy. (For more info: Uptrace, Dave Cheney's post)
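To make the idea concrete, here is a minimal sketch of a chainable option class. It is not PyTika's actual implementation: `TextOptions`, `with_timeout`, and the stand-in `get_text` function are hypothetical names chosen for illustration. Only the Accept: text/plain header comes from the description above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TextOptions:
    """Hypothetical option builder; not pytika's actual implementation."""

    headers: dict = field(default_factory=dict)
    timeout_seconds: float = 60.0

    def as_plain_text(self) -> "TextOptions":
        # Ask the server for plain text via the Accept header.
        self.headers["Accept"] = "text/plain"
        return self  # returning self is what makes the calls chainable

    def with_timeout(self, seconds: float) -> "TextOptions":
        # Each option validates its own input, next to where it is defined.
        if seconds <= 0:
            raise ValueError("timeout must be positive")
        self.timeout_seconds = seconds
        return self


def get_text(data: bytes, options: Optional[TextOptions] = None) -> dict:
    """Stand-in for an API call: returns the request settings it would use."""
    options = options or TextOptions()
    return {
        "headers": dict(options.headers),
        "timeout": options.timeout_seconds,
        "size": len(data),
    }


settings = get_text(b"example", TextOptions().as_plain_text().with_timeout(10))
print(settings)
```

The function signature stays at two parameters no matter how many options exist, and an invalid value fails inside the option that introduced it rather than deep inside the request code.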
For text extraction in hOCR format with bounding boxes:
from pytika.api import TikaApi
from pytika.config import GetTextOptionsBuilder as opt
tika = TikaApi()
with open("yourfile.whatever", "rb") as file:
    text = tika.get_text(file, opt.WithBoundingBoxes()).decode()
There are many more configuration options in the GetTextOptionsBuilder class, with more to come.
If you'd like to add features from Tika Server or Tika that are missing here, you can contribute to this repo yourself!
1- Clone the repository
git clone "url from repo, either ssh or https"
2- Create a branch
cd pytika
git switch -c your-new-branch-name
3- Make the necessary changes, commit them, and push to GitHub
git add README.md
git commit -m "Updated README.md with new API changes"
git push -u origin your-new-branch-name
4- Go to your repository on GitHub, where you'll see a Compare and pull request button. Click on it.
5- Wait for us to review your PR, likely leave comments, and hopefully merge it in!