feat: enable transcription export formats (TXT, Markdown, ALTO, PAGE XML) #122

@nikazzio

Summary

Complete the transcription export pipeline. Several formats are already declared in EXPORT_FORMATS with available: False. Set these to True and implement the corresponding renderers.
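As a rough illustration of the registry pattern described here, the sketch below gates each format on its `available` flag and dispatches to a renderer. The shape of `EXPORT_FORMATS`, the `renderer` key, the page model, and the function names are all assumptions for illustration; the real registry may look different.

```python
from typing import Callable, Optional

# Hypothetical page model: (page_number, transcription_text) pairs.
Pages = list[tuple[int, str]]
Renderer = Optional[Callable[[Pages], str]]


def render_txt(pages: Pages) -> str:
    """Placeholder renderer: join page texts with blank lines."""
    return "\n\n".join(text for _, text in pages)


# Assumed registry shape; only the "available" flag comes from the issue text.
EXPORT_FORMATS: dict[str, dict] = {
    "pdf": {"label": "PDF", "available": True, "renderer": None},
    "txt": {"label": "Plain Text", "available": True, "renderer": render_txt},
    "alto": {"label": "ALTO XML", "available": False, "renderer": None},
}


def available_formats() -> list[str]:
    """Return the keys that should appear in the export dropdown."""
    return [key for key, spec in EXPORT_FORMATS.items() if spec["available"]]


def export(fmt: str, pages: Pages) -> str:
    """Dispatch to the renderer for fmt, refusing disabled or unknown formats."""
    spec = EXPORT_FORMATS.get(fmt)
    if spec is None or not spec["available"]:
        raise ValueError(f"Export format not available: {fmt}")
    renderer: Renderer = spec["renderer"]
    if renderer is None:
        raise NotImplementedError(f"No text renderer registered for: {fmt}")
    return renderer(pages)
```

Keeping the flag and the renderer in the same entry means a format cannot show up in the dropdown without the dispatch code also knowing how to handle it.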

Motivation

Researchers need to export their transcriptions in standard formats for publication, interoperability with other tools (Transkribus, eScriptorium, Kraken), and archival.

Formats to implement

Plain Text (.txt)

  • One page per section, separated by --- or page numbers.
  • Straightforward concatenation of transcription content.
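A minimal sketch of the plain-text renderer, assuming pages arrive as `(page_number, text)` pairs. The function name, the `[Page N]` label style, and the parameters are hypothetical; the sketch combines both options mentioned above (a `---` separator and page numbers) and lets the caller turn the labels off.

```python
def render_plain_text(pages, separator="---", page_labels=True):
    """Concatenate page transcriptions into one text document.

    pages: iterable of (page_number, text) pairs (assumed model).
    separator: line placed between pages.
    page_labels: when True, prefix each section with "[Page N]".
    """
    sections = []
    for number, text in pages:
        body = text.strip()
        if page_labels:
            body = f"[Page {number}]\n{body}"
        sections.append(body)
    # Blank lines around the separator keep the output readable.
    return f"\n\n{separator}\n\n".join(sections) + "\n"
```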

Markdown (.md)

  • Page headings (## Page N), transcription content, optional image references.
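The Markdown layout could be sketched as follows. The function name and the `image_urls` mapping (page number to image URL) are assumptions; pages without an entry in the mapping simply get no image reference.

```python
def render_markdown(pages, image_urls=None):
    """Render pages as Markdown with one "## Page N" heading per page.

    pages: iterable of (page_number, text) pairs (assumed model).
    image_urls: optional dict mapping page_number -> image URL (hypothetical).
    """
    image_urls = image_urls or {}
    blocks = []
    for number, text in pages:
        block = [f"## Page {number}", ""]
        url = image_urls.get(number)
        if url:
            # Optional image reference, placed between heading and text.
            block += [f"![Page {number}]({url})", ""]
        block.append(text.strip())
        blocks.append("\n".join(block))
    return "\n\n".join(blocks) + "\n"
```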

ALTO XML (.alto)

  • Standard XML format for OCR output.
  • Map transcription text to <TextLine> / <String> elements.
  • Include page dimensions from IIIF canvas metadata.
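One possible shape for the ALTO renderer, using only the standard library. It assumes the transcription is available as a list of line strings and that the IIIF canvas is a dict with `width` and `height` keys; the function name and ID scheme are hypothetical. The sketch emits no line coordinates, so whether the result validates against the ALTO 4.x schema still has to be checked separately.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def render_alto(page_number, lines, canvas):
    """Build a single-page ALTO document from transcription lines.

    lines: list of strings, one per transcribed line (assumed model).
    canvas: IIIF canvas dict; only "width" and "height" are read.
    Returns UTF-8 encoded bytes including the XML declaration.
    """
    def q(tag):
        return f"{{{ALTO_NS}}}{tag}"

    root = ET.Element(q("alto"))
    description = ET.SubElement(root, q("Description"))
    ET.SubElement(description, q("MeasurementUnit")).text = "pixel"

    layout = ET.SubElement(root, q("Layout"))
    page = ET.SubElement(layout, q("Page"), {
        "ID": f"page_{page_number}",
        "PHYSICAL_IMG_NR": str(page_number),
        "WIDTH": str(canvas["width"]),
        "HEIGHT": str(canvas["height"]),
    })
    print_space = ET.SubElement(page, q("PrintSpace"))
    block = ET.SubElement(print_space, q("TextBlock"), {"ID": f"block_{page_number}"})

    for index, text in enumerate(lines, start=1):
        text_line = ET.SubElement(block, q("TextLine"), {"ID": f"line_{page_number}_{index}"})
        # One String per line; word-level segmentation is not attempted here.
        ET.SubElement(text_line, q("String"), {"CONTENT": text})

    # Serialize with ALTO as the default namespace (no prefix on elements).
    ET.register_namespace("", ALTO_NS)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

Putting each transcription line into a single `String` element is a simplification; a fuller implementation might split lines into words and `SP` elements.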

PAGE XML (.page)

  • Alternative OCR interchange format used by Transkribus/eScriptorium.
  • Similar structure to ALTO but different schema.
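For comparison, a PAGE XML sketch under the same assumptions (a list of line strings, page dimensions taken from the canvas). The 2019-07-15 namespace, the function name, the `Creator` value, and the ID scheme are choices made for this example. PAGE expects `Coords` on regions and lines; since text-only transcriptions carry no geometry, the sketch uses the full page rectangle as a placeholder, which is an assumption and not real layout data.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def render_page_xml(image_filename, lines, width, height, created):
    """Build a single-page PAGE XML document from transcription lines.

    created: ISO 8601 timestamp string, passed in to keep output deterministic.
    Returns UTF-8 encoded bytes including the XML declaration.
    """
    def q(tag):
        return f"{{{PAGE_NS}}}{tag}"

    root = ET.Element(q("PcGts"))
    metadata = ET.SubElement(root, q("Metadata"))
    ET.SubElement(metadata, q("Creator")).text = "transcription-export"  # hypothetical
    ET.SubElement(metadata, q("Created")).text = created
    ET.SubElement(metadata, q("LastChange")).text = created

    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })

    # Placeholder geometry: the whole page, reused for the region and each line.
    full_page = f"0,0 {width},0 {width},{height} 0,{height}"
    region = ET.SubElement(page, q("TextRegion"), {"id": "r1"})
    ET.SubElement(region, q("Coords"), {"points": full_page})

    for index, text in enumerate(lines, start=1):
        line = ET.SubElement(region, q("TextLine"), {"id": f"r1_l{index}"})
        ET.SubElement(line, q("Coords"), {"points": full_page})
        equiv = ET.SubElement(line, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = text

    # Serialize with PAGE as the default namespace (no prefix on elements).
    ET.register_namespace("", PAGE_NS)
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)
```

The main structural difference from ALTO visible here is that PAGE stores text in `TextEquiv/Unicode` child elements, while ALTO stores it in the `CONTENT` attribute of `String`.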

Acceptance criteria

  • TXT export works end-to-end via export UI
  • Markdown export works end-to-end
  • ALTO XML export produces valid XML against the ALTO 4.x schema
  • PAGE XML export produces valid XML
  • All formats available in the export dropdown
  • Existing PDF/ZIP export unaffected
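A lightweight check that could back the XML criteria in tests is sketched below. It only confirms that the output is well-formed and has the expected namespaced root element; it does not validate against the ALTO or PAGE XSD, which would need a schema-aware library and the schema files. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def has_expected_root(xml_bytes, namespace, local_name):
    """Return True if xml_bytes parses and its root is {namespace}local_name.

    This is a smoke test for well-formedness, not schema validation.
    """
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return root.tag == f"{{{namespace}}}{local_name}"
```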

Technical notes

Labels

  • area:export (Export pipeline and jobs)
  • minor (Increments the minor version when adding new functionality in a backward-compatible manner.)
  • priority:P1 (High priority)
  • status:ready (Ready to be implemented)
  • type:feature (New user-facing feature)
