A minimal Python wrapper around the PAGE-XML format for OCR output.
```bash
pip install pygexml
```
Requires Python 3.12+.
```python
from pygexml import Page

page = Page.from_xml_string(xml_string)
for line in page.all_text():
    print(line)
```

| Class | Import from |
|---|---|
| Page | pygexml |
| Page, TextRegion, TextLine, Coords | pygexml.page |
| Point, Box, Polygon | pygexml.geometry |

Page, TextRegion and TextLine each expose all_text() and all_words() iterators.
Lookups by ID are available via lookup_region() and lookup_textline().
Refer to the online API docs for details.
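
For readers new to PAGE-XML, the sketch below shows the nesting that such iterators and ID lookups traverse. It deliberately uses only the standard library's `xml.etree.ElementTree` on a small hand-written sample; it is not pygexml's implementation and does not call its API. The helper names (`line_texts`, `parse_points`, `find_by_id`), the sample document and the choice of the 2019-07-15 schema namespace are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema (assumption: other schema
# versions use a different date in the URI).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample: one region containing two text lines.
SAMPLE = """\
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="0,25 100,25 100,45 0,45"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def line_texts(root):
    """Yield (line id, text) for every TextLine, in document order."""
    for line in root.iterfind(".//pc:TextLine", NS):
        # Only the line's own TextEquiv is read; word-level text is ignored.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
        text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
        yield line.get("id"), text


def parse_points(points):
    """Turn a Coords 'points' attribute such as '0,0 10,0' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def find_by_id(root, tag, element_id):
    """Return the first element with the given tag and id, or None."""
    for el in root.iterfind(f".//pc:{tag}", NS):
        if el.get("id") == element_id:
            return el
    return None


root = ET.fromstring(SAMPLE)
for line_id, text in line_texts(root):
    print(line_id, text)

region = find_by_id(root, "TextRegion", "r1")
print(parse_points(region.find("pc:Coords", NS).get("points")))
```

A wrapper library removes the need for this kind of namespace and XPath handling in application code, which is the main reason to prefer typed objects over raw element trees.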
The pygexml.strategies module provides Hypothesis strategies for all pygexml types, ready to use in property-based tests, including those of downstream projects:
```python
from hypothesis import given
from pygexml.strategies import st_pages

@given(st_pages())
def test_my_page_processing(page):
    assert process(page) is not None
```

Refer to the pygexml.strategies API docs for details.
```bash
pip install ".[dev,test,docs]"
black pygexml test             # format
mypy pygexml test              # type check
pyright pygexml test           # type check
pytest -v                      # tests
pdoc -o .api_docs pygexml/*    # API docs
```

CI runs on Python 3.12, 3.13 and 3.14. API documentation is published to GitHub Pages on every push to main.
Bug reports, feature requests and pull requests are welcome. Feel free to open draft pull requests early to invite discussion and collaboration.
Please note that this project has a Code of Conduct.
Copyright (c) 2026 Mirko Westermeier (SCDH, University of Münster)
Released under the MIT License.