A modular data pipeline system built as a CLI tool, not just a parser. It offers:
- Separation of concerns
- Extensible architecture
- CLI usability
In industry terms: a simplified Apache Spark / Airflow pipeline, or a backend data processing service.
Every file connects to this pipeline:
INPUT FILE → PARSER → CLEANER → TRANSFORMER → AGGREGATOR → WRITER → OUTPUT
Think of it like a factory assembly line — each stage does one job and passes the result forward.
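The assembly-line idea can be sketched as plain function composition, where each stage takes a list of record dicts and returns a new one. The stage names and signatures here are illustrative, not the project's actual API:

```python
def clean(records):
    # Drop records whose values are all None (illustrative cleaning rule).
    return [r for r in records if any(v is not None for v in r.values())]

def transform(records):
    # Keep records matching a hypothetical filter condition.
    return [r for r in records if r.get("age", 0) > 25]

def run_pipeline(records, stages):
    # Apply each stage in order, passing the result forward.
    for stage in stages:
        records = stage(records)
    return records

rows = [{"age": 30}, {"age": 20}, {"age": None}]
result = run_pipeline(rows, [clean, transform])  # → [{"age": 30}]
```

Because every stage shares the same shape, adding a new stage means writing one function and appending it to the list.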
`main.py` is the orchestrator, a central concept here. It:
- Accepts CLI arguments
- Calls each stage in order
- Handles errors gracefully
- Controls the entire pipeline flow
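A hedged sketch of the orchestrator's argument handling with stdlib `argparse`, using the flag names documented in this README (defaults and help strings are illustrative):

```python
import argparse

def build_arg_parser():
    # Mirrors the documented CLI flags; not the project's exact parser.
    p = argparse.ArgumentParser(description="CLI data pipeline")
    p.add_argument("--input", required=True, help="file, directory, or glob")
    p.add_argument("--format", choices=["csv", "json", "xml"])
    p.add_argument("--output", required=True, help="output JSON path")
    p.add_argument("--filter", help='e.g. "age>25 AND status=active"')
    p.add_argument("--aggregate", action="store_true")
    p.add_argument("--schema", help="path to JSON schema")
    p.add_argument("--no-clean", action="store_true")
    p.add_argument("--no-transform", action="store_true")
    p.add_argument("--chunk-size", type=int, default=10000)
    return p

args = build_arg_parser().parse_args(
    ["--input", "sample.csv", "--format", "csv", "--output", "out.json"]
)
```

With the parsed `args` in hand, the orchestrator can branch on each flag (e.g. skip the cleaner when `args.no_clean` is set) before calling the stages in order.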
- Clean modular design
- Good separation of concerns (parser / processor / output)
- CLI-based → practical and reusable
- Supports multiple formats → CSV, JSON, XML
- Zero external dependencies (pure Python stdlib)
cli-data-parser/
├── parser/
│ ├── __init__.py
│ ├── csv_parser.py ← chunked reading for large files (100MB+)
│ ├── json_parser.py ← handles nested/flat JSON structures
│ └── xml_parser.py ← namespace + attribute support
├── processor/
│ ├── __init__.py
│ ├── cleaner.py ← clean, deduplicate, normalize nulls
│ ├── transformer.py ← AND/OR multi-condition filter expressions
│ ├── aggregator.py ← min, max, avg, sum, std_dev, top values
│ └── validator.py ← schema validation (type, required, range, allowed)
├── output/
│ ├── __init__.py
│ └── writer.py ← structured JSON with metadata + invalid records
├── utils/
│ ├── __init__.py
│ ├── logger.py ← colored terminal output (Green/Red/Blue/Yellow)
│ └── batch.py ← multi-file, directory, and glob pattern support
├── sample_data/
│ ├── sample.csv
│ ├── sample.json
│ ├── sample.xml
│ └── schema.json
├── main.py ← CLI entry point (orchestrator)
├── requirements.txt
└── README.md
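As a rough illustration of what a cleaner stage like `processor/cleaner.py` could do, deduplication and null normalization can be sketched as follows. The set of null-like tokens is an assumption; the actual implementation may differ:

```python
# Values treated as "null" for normalization (assumed token set).
NULLISH = {"", "null", "NULL", "n/a", "N/A", None}

def clean_records(records):
    # Normalize null-like values to None and drop exact duplicates.
    seen = set()
    cleaned = []
    for rec in records:
        norm = {k: (None if v in NULLISH else v) for k, v in rec.items()}
        key = tuple(sorted(norm.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            cleaned.append(norm)
    return cleaned
```

Normalizing before deduplicating matters: two rows that differ only in how they spell "null" collapse into one.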
| Feature | Description |
|---|---|
| Multi-format | CSV, JSON, XML |
| Batch processing | Directory or glob pattern input |
| Smart cleaning | Deduplication, null normalization, type cast |
| AND/OR filtering | Multi-condition filter expressions |
| Schema validation | Type, required, min/max, allowed values |
| Rich aggregation | min, max, avg, sum, std_dev, top values |
| Large file support | Chunked CSV reading (100MB+) |
| Colored output | Success/Error/Warning terminal colors |
| Zero dependencies | Python standard library only |
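The AND/OR filter expressions could be evaluated along these lines. This is a simplified sketch: one level of either AND or OR (not mixed), and only the `>`, `<`, `=` operators; the real transformer likely supports more:

```python
import re

def matches(record, expr):
    # Split on AND/OR; this sketch does not mix both in one expression.
    parts = re.split(r"\s+(AND|OR)\s+", expr)
    conds, ops = parts[::2], parts[1::2]
    results = []
    for cond in conds:
        field, op, value = re.match(r"(\w+)\s*([><=])\s*(.+)", cond).groups()
        actual = record.get(field)
        if op == "=":
            results.append(str(actual) == value)
        else:
            # Numeric comparison for > and <.
            ok = float(actual) > float(value) if op == ">" else float(actual) < float(value)
            results.append(ok)
    # AND requires all conditions; OR requires any.
    return all(results) if "AND" in ops or not ops else any(results)
```

A single condition like `"age>25"` falls through to the `all()` branch, so it behaves the same with or without connectors.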
git clone https://github.com/ganesh44we/cli-data-parser.git
cd cli-data-parser
python main.py --help
python main.py --input sample_data/sample.csv --format csv --output result.json
python main.py --input sample_data/sample.json --format json --output result.json
python main.py --input sample_data/sample.xml --format xml --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "age>25" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "age>25 AND status=active" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "status=active OR status=pending" --output result.json
python main.py --input sample_data/ --output result.json
python main.py --input "sample_data/*.csv" --output result.json
python main.py --input sample_data/sample.csv --format csv --schema sample_data/schema.json --output result.json
python main.py \
--input sample_data/sample.csv \
--format csv \
--filter "age>25 AND status=active" \
--schema sample_data/schema.json \
--aggregate \
--output result.json
python main.py --input big_data.csv --format csv --chunk-size 50000 --output result.json

| Option | Description |
|---|---|
| `--input` | File, directory, or glob pattern (required) |
| `--format` | csv / json / xml (auto-detected for batch) |
| `--output` | Output JSON file path (required) |
| `--filter` | Filter expression, e.g. "age>25" or "age>25 AND status=active" |
| `--aggregate` | Include min/max/avg/std_dev summary |
| `--schema` | Path to JSON schema for validation |
| `--no-clean` | Skip cleaning step |
| `--no-transform` | Skip transformation step |
| `--chunk-size` | Chunk size for large CSV files (default: 10000) |
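Chunked reading of large CSVs, as `--chunk-size` suggests, is possible with the standard library alone. A minimal sketch (illustrative; the project's `csv_parser.py` may differ):

```python
import csv

def read_csv_chunks(path, chunk_size=10000):
    # Yield lists of row dicts, never holding the whole file in memory.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk  # final partial chunk
```

Downstream stages can then process each chunk independently, which is what keeps 100MB+ files tractable.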
{
"metadata": {
"total_records": 6,
"output_file": "result.json",
"invalid_records": 1
},
"data": [...],
"summary": {
"total_records": 6,
"fields": {
"age": {
"type": "numeric",
"min": 25, "max": 40,
"avg": 31.6, "sum": 190,
"std_dev": 5.2,
"null_count": 1
},
"status": {
"type": "string",
"unique_count": 2,
"top_values": [
{"value": "active", "count": 5},
{"value": "inactive", "count": 1}
]
}
}
},
"invalid_records": [...]
}
- Python 3.10+
- Standard library only: csv, json, xml, argparse, re, glob, collections
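The per-field numeric summary in the sample output above maps directly onto the standard library. A minimal sketch, assuming population standard deviation (the project's `aggregator.py` may compute it differently):

```python
from statistics import pstdev

def summarize_numeric(values):
    # Summarize one numeric column, counting nulls separately.
    nums = [float(v) for v in values if v is not None]
    return {
        "type": "numeric",
        "min": min(nums),
        "max": max(nums),
        "avg": round(sum(nums) / len(nums), 1),
        "sum": sum(nums),
        "std_dev": round(pstdev(nums), 1),
        "null_count": values.count(None),
    }
```

For string fields, `collections.Counter(values).most_common(n)` yields (value, count) pairs in the same shape as the `top_values` list shown above.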
Ganesh Rayapati