RAmuSelo/docs-convert

docs-convert

Batch-convert office documents in a folder, from the command line. One sub-command per conversion, no interactive prompts, no config files: point it at a folder (or a single file) and it does the rest.

Sub-command   From    To     Backend (lazy import)
docx2pdf      .docx   .pdf   docx2pdf
xlsx2csv      .xlsx   .csv   openpyxl
rtf2csv       .rtf    .csv   striprtf

The heavy backends are imported lazily, only when the matching conversion actually runs. You therefore install only the dependency you need; the rest of the tool (argument parsing, file discovery, CSV writing) runs on the standard library alone, and the test suite needs nothing extra.
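The lazy-import idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper name `load_backend` and the wording of the error message are assumptions made for the example.

```python
import importlib


def load_backend(module_name: str, extra: str):
    """Import a backend module at call time (hypothetical helper).

    Nothing is imported when the CLI starts; the backend is only
    looked up when a conversion that needs it actually runs.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Turn a bare ImportError into an actionable install hint.
        raise RuntimeError(
            f"This conversion needs '{module_name}'. "
            f'Install it with: pip install "docs-convert[{extra}]"'
        ) from exc
```

A converter would then call something like `openpyxl = load_backend("openpyxl", "xlsx")` inside its own function body, so users who only run rtf2csv never need openpyxl installed.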

Why I built this

I regularly needed to turn a folder of .docx, .xlsx or .rtf files into something scriptable (PDF or CSV) without opening each one by hand. Existing libraries each solve one format; this tool wraps three common conversions behind one CLI and installs only the backend you actually use.

Install

The core package has zero hard dependencies. Pull in the backend you want via an extra:

pip install "docs-convert[xlsx]"   # xlsx -> csv
pip install "docs-convert[rtf]"    # rtf  -> csv
pip install "docs-convert[docx]"   # docx -> pdf
pip install "docs-convert[all]"    # everything

From a local checkout:

pip install -e ".[all]"

Per-conversion dependency notes

  • xlsx2csv needs the xlsx extra (openpyxl). Reads the active worksheet and writes one CSV per workbook.
  • rtf2csv needs the rtf extra (striprtf). The RTF markup is stripped to plain text, then each line is split on the delimiter (default ,).
  • docx2pdf needs the docx extra (docx2pdf). On Windows and macOS this drives a local Microsoft Word installation, so it does not work on machines without Word, such as headless servers. If you need a Word-free path, render the PDFs with a different toolchain (e.g. LibreOffice with --headless --convert-to pdf) and use those files in your workflow.
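As a rough sketch of the rtf2csv split step described above: once the RTF markup has been stripped to plain text, each line is split on the delimiter and written as one CSV row. The function name `text_to_csv`, the skipping of blank lines, and the trimming of whitespace around cells are assumptions for illustration; the real tool may differ on all three.

```python
import csv
import io


def text_to_csv(text: str, delimiter: str = ",") -> str:
    """Split each line of plain text on `delimiter` and return CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines produce no row
        # assumption: surrounding whitespace in each cell is trimmed
        writer.writerow(cell.strip() for cell in line.split(delimiter))
    return buffer.getvalue()
```

Using `csv.writer` rather than joining cells by hand means a cell that itself contains a comma (possible when the input delimiter is a tab) gets quoted correctly in the output.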

Usage

# Convert every .xlsx in a folder, output next to each source file
docs-convert xlsx2csv ./invoices

# Convert into a separate output directory
docs-convert xlsx2csv ./invoices --out ./csv_out

# A single file works too
docs-convert rtf2csv ./notes/list.rtf --out ./csv_out

# Tab-separated RTF -> CSV
docs-convert rtf2csv ./notes --delimiter $'\t'

# docx -> pdf for a whole folder
docs-convert docx2pdf ./contracts --out ./pdf_out

Common options for every sub-command:

  • input — a folder (batch all matching files, non-recursive) or a single file.
  • --out DIR — write outputs to DIR. Defaults to alongside each input file.

rtf2csv additionally accepts --delimiter to control how each line is split.

Office lock/temp files (e.g. ~$report.docx) are skipped automatically during discovery.
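The discovery and output-path rules above could look roughly like this. It is only a sketch: the function names `discover` and `output_path` are hypothetical, and matching the suffix case-insensitively is an assumption not stated in this README.

```python
from pathlib import Path
from typing import List, Optional


def discover(input_path: Path, suffix: str) -> List[Path]:
    """Return the files to convert: a single file, or a folder's direct children."""
    if input_path.is_file():
        return [input_path]
    return sorted(
        p
        for p in input_path.iterdir()  # non-recursive: direct children only
        if p.is_file()
        and p.suffix.lower() == suffix  # assumption: case-insensitive match
        and not p.name.startswith("~$")  # skip Office lock/temp files
    )


def output_path(src: Path, new_suffix: str, out_dir: Optional[Path] = None) -> Path:
    """Put the output in out_dir if given, otherwise next to the source file."""
    target_dir = out_dir if out_dir is not None else src.parent
    return target_dir / (src.stem + new_suffix)
```

For example, `output_path(Path("invoices/jan.xlsx"), ".csv")` resolves to `invoices/jan.csv`, while passing `out_dir=Path("csv_out")` resolves to `csv_out/jan.csv`.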

Exit codes

  • 0 — success (including "nothing to convert").
  • 1 — at least one file failed to convert (the rest still ran).
  • 2 — the input path does not exist.
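One way to picture how these exit codes arise is the loop below. It is an illustrative sketch under assumptions: `run_batch` and its `convert` callback are hypothetical names, and the stderr message format is invented for the example.

```python
import sys
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    input_path: Path,
    files: Iterable[Path],
    convert: Callable[[Path], None],
) -> int:
    """Convert every file and return a process exit code."""
    if not input_path.exists():
        return 2  # the input path does not exist
    failed = False
    for src in files:
        try:
            convert(src)
        except Exception as exc:
            # Report the failure but keep going with the remaining files.
            print(f"failed: {src}: {exc}", file=sys.stderr)
            failed = True
    return 1 if failed else 0  # an empty batch is still a success
```

In a shell script this makes it possible to distinguish "some files failed" (1) from "wrong path" (2) by checking `$?` after the command.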

Development

Run the test suite (standard library unittest, no network, no backends):

python3 -m unittest discover -s tests

Roadmap

Honest next steps:

  • Recursive discovery and glob patterns (currently non-recursive).
  • An output-encoding option for the CSV writers (non-UTF-8 locales).
  • A built-in Word-free docx → pdf backend (e.g. LibreOffice headless), today only suggested as a manual workaround.
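Until such a backend exists, the manual LibreOffice workaround can be scripted along these lines. This is a sketch, not part of docs-convert: the helper name `libreoffice_command` is hypothetical, and the binary name `soffice` is an assumption (some installations expose it as `libreoffice` or under a full path).

```python
from pathlib import Path
from typing import List


def libreoffice_command(src: Path, out_dir: Path, binary: str = "soffice") -> List[str]:
    """Build the argument list for a headless LibreOffice docx -> pdf run."""
    return [
        binary,  # assumption: LibreOffice launcher is on PATH under this name
        "--headless",
        "--convert-to",
        "pdf",
        "--outdir",
        str(out_dir),
        str(src),
    ]


# Usage (requires a local LibreOffice installation):
#   import subprocess
#   subprocess.run(libreoffice_command(Path("a.docx"), Path("pdf_out")), check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.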

License

MIT — see LICENSE.
