pygexml

A minimal Python wrapper around the PAGE-XML format for OCR output.

Installation

pip install pygexml

Requires Python 3.12+.

Usage

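from pathlib import Path

from pygexml import Page

# "page.xml" is a placeholder path; the XML string can come from any source.
xml_string = Path("page.xml").read_text(encoding="utf-8")
page = Page.from_xml_string(xml_string)

# Print each item yielded by all_text().
for line in page.all_text():
    print(line)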

Data model

| Class | Import from |
| --- | --- |
| `Page` | `pygexml` |
| `Page`, `TextRegion`, `TextLine`, `Coords` | `pygexml.page` |
| `Point`, `Box`, `Polygon` | `pygexml.geometry` |

Page, TextRegion and TextLine each expose all_text() and all_words() iterators. Lookups by ID are available via lookup_region() and lookup_textline().
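To make the Page / TextRegion / TextLine hierarchy concrete, the following is a minimal sketch of the PAGE-XML structure those classes model, written with the standard library only. It does not use pygexml and is not how pygexml is implemented. The sample document, the chosen namespace version and the helper `iter_line_texts` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 PAGE content namespace; other schema versions
# use a different date in the URI.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"
NS = {"pc": PAGE_NS}

# Made-up sample document, for illustration only.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="scan.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="0,0 100,0 100,20 0,20"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="l2">
        <Coords points="0,25 100,25 100,45 0,45"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def iter_line_texts(xml_string):
    """Yield (region_id, line_id, text) for every TextLine, in document order."""
    root = ET.fromstring(xml_string)
    for region in root.iterfind(".//pc:TextRegion", NS):
        # Only direct TextLine children of this region.
        for line in region.iterfind("pc:TextLine", NS):
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            text = unicode_el.text if unicode_el is not None and unicode_el.text else ""
            yield region.get("id"), line.get("id"), text


for region_id, line_id, text in iter_line_texts(SAMPLE):
    print(region_id, line_id, text)
```

The `id` attributes on TextRegion and TextLine are the identifiers that ID-based lookups such as lookup_region() and lookup_textline() would refer to; their exact signatures are documented in the API docs.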

Refer to the online API docs for details.

Hypothesis strategies

The pygexml.strategies module provides Hypothesis strategies for all pygexml types, ready to use in property-based tests, including those of downstream projects:

from hypothesis import given
from pygexml.strategies import st_pages

# process() stands in for the function under test in your own project.
@given(st_pages())
def test_my_page_processing(page):
    assert process(page) is not None

Refer to the pygexml.strategies API docs for details.

Development

pip install ".[dev,test,docs]"

black pygexml test          # format
mypy pygexml test           # type check
pyright pygexml test        # type check
pytest -v                   # tests
pdoc -o .api_docs pygexml/* # API docs

CI runs on Python 3.12, 3.13 and 3.14. API documentation is published to GitHub Pages on every push to main.

Contributing

Bug reports, feature requests and pull requests are welcome. Feel free to open draft pull requests early to invite discussion and collaboration.

Please note that this project has a Code of Conduct.

Copyright and License

Released under the MIT License.
