CLI Data Parser v2.0

A modular data pipeline system built as a CLI tool, not just a parser. Think of it as a mini Apache Spark / Airflow pipeline, or a backend data-processing service.


What This Project Really Is

This is not just a parser.

It's a modular data pipeline system with:

  • Separation of concerns
  • Extendable architecture
  • CLI usability

In industry terms:

  • Like a mini Apache Spark / Airflow pipeline (simplified)
  • Or a backend data processing service

Overall Flow (Burn This Into Your Brain)

Every file connects to this pipeline:

INPUT FILE → PARSER → CLEANER → TRANSFORMER → AGGREGATOR → WRITER → OUTPUT

Think of it like a factory assembly line — each stage does one job and passes the result forward.
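As a rough sketch, the assembly line above could be wired together like this. All function bodies here are illustrative placeholders, not the project's actual code:

```python
# Hypothetical sketch of the PARSER -> CLEANER -> TRANSFORMER -> AGGREGATOR
# flow; each stage does one job and passes the result forward.

def parse(path):
    # In the real project this dispatches to the csv/json/xml parsers.
    return [{"age": 30, "status": "active"}, {"age": 20, "status": "inactive"}]

def clean(records):
    # Deduplicate while preserving order.
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def transform(records):
    # Example filter stage: keep records with age > 25.
    return [r for r in records if r.get("age", 0) > 25]

def aggregate(records):
    ages = [r["age"] for r in records]
    return {"total_records": len(records), "age_min": min(ages), "age_max": max(ages)}

def run_pipeline(path):
    records = transform(clean(parse(path)))
    return {"data": records, "summary": aggregate(records)}
```

Each stage takes the previous stage's output and returns a new value, so stages can be swapped or skipped independently.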


main.py — THE CONTROLLER

This is the orchestrator (very important concept).

  • Accepts CLI arguments
  • Calls each stage in order
  • Handles errors gracefully
  • Controls the entire pipeline flow
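A minimal orchestrator in this spirit might look like the following. The flag names mirror the CLI Options section, but the implementation is a simplified sketch, not the repository's actual main.py:

```python
import argparse
import sys

def build_parser():
    # Simplified sketch of the CLI surface; not the project's actual code.
    p = argparse.ArgumentParser(description="CLI Data Parser")
    p.add_argument("--input", required=True, help="File, directory, or glob pattern")
    p.add_argument("--format", choices=["csv", "json", "xml"])
    p.add_argument("--output", required=True, help="Output JSON file path")
    p.add_argument("--filter", help='e.g. "age>25 AND status=active"')
    p.add_argument("--aggregate", action="store_true")
    p.add_argument("--chunk-size", type=int, default=10000)
    return p

def main(argv=None):
    args = build_parser().parse_args(argv)
    try:
        # parse -> clean -> transform -> aggregate -> write, in order
        print(f"Processing {args.input} -> {args.output}")
    except Exception as exc:
        # Handle errors gracefully instead of dumping a traceback.
        print(f"ERROR: {exc}", file=sys.stderr)
        sys.exit(1)
```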

Strengths

  • Clean modular design
  • Good separation of concerns (parser / processor / output)
  • CLI-based → practical and reusable
  • Supports multiple formats → CSV, JSON, XML
  • Zero external dependencies (pure Python stdlib)

Project Structure

```text
cli-data-parser/
├── parser/
│   ├── __init__.py
│   ├── csv_parser.py        ← chunked reading for large files (100MB+)
│   ├── json_parser.py       ← handles nested/flat JSON structures
│   └── xml_parser.py        ← namespace + attribute support
├── processor/
│   ├── __init__.py
│   ├── cleaner.py           ← clean, deduplicate, normalize nulls
│   ├── transformer.py       ← AND/OR multi-condition filter expressions
│   ├── aggregator.py        ← min, max, avg, sum, std_dev, top values
│   └── validator.py         ← schema validation (type, required, range, allowed)
├── output/
│   ├── __init__.py
│   └── writer.py            ← structured JSON with metadata + invalid records
├── utils/
│   ├── __init__.py
│   ├── logger.py            ← colored terminal output (Green/Red/Blue/Yellow)
│   └── batch.py             ← multi-file, directory, and glob pattern support
├── sample_data/
│   ├── sample.csv
│   ├── sample.json
│   ├── sample.xml
│   └── schema.json
├── main.py                  ← CLI entry point (orchestrator)
├── requirements.txt
└── README.md
```

Key Features

| Feature | Description |
|---|---|
| Multi-format | CSV, JSON, XML |
| Batch processing | Directory or glob pattern input |
| Smart cleaning | Deduplication, null normalization, type casting |
| AND/OR filtering | Multi-condition filter expressions |
| Schema validation | Type, required, min/max, allowed values |
| Rich aggregation | min, max, avg, sum, std_dev, top values |
| Large file support | Chunked CSV reading (100MB+) |
| Colored output | Success/Error/Warning terminal colors |
| Zero dependencies | Python standard library only |

⚡ Installation

```bash
git clone https://github.com/ganesh44we/cli-data-parser.git
cd cli-data-parser
python main.py --help
```

🚀 Usage Examples

Basic Parsing

```bash
python main.py --input sample_data/sample.csv --format csv --output result.json
python main.py --input sample_data/sample.json --format json --output result.json
python main.py --input sample_data/sample.xml --format xml --output result.json
```

Filter with AND / OR

```bash
python main.py --input sample_data/sample.csv --format csv --filter "age>25" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "age>25 AND status=active" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "status=active OR status=pending" --output result.json
```
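One way such AND/OR expressions can be evaluated is sketched below; the real transformer.py may tokenize differently, so treat the helper names as hypothetical:

```python
import operator
import re

# Supported comparison operators for single conditions like "age>25".
OPS = {">": operator.gt, "<": operator.lt}

def _check(record, cond):
    # Split one condition into field, operator, and value.
    m = re.match(r"(\w+)\s*([><=])\s*(.+)", cond.strip())
    if not m:
        return False
    field, op, value = m.groups()
    actual = record.get(field)
    if op in OPS:
        return actual is not None and OPS[op](float(actual), float(value))
    return str(actual) == value  # "=" compares as strings

def matches(record, expr):
    # OR binds looser than AND: split on OR first, then require every
    # AND-joined condition inside each clause.
    return any(
        all(_check(record, c) for c in clause.split(" AND "))
        for clause in expr.split(" OR ")
    )
```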

Batch Processing

```bash
python main.py --input sample_data/ --output result.json
python main.py --input "sample_data/*.csv" --output result.json
```

Schema Validation

```bash
python main.py --input sample_data/sample.csv --format csv --schema sample_data/schema.json --output result.json
```
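The exact shape of the bundled schema.json is not reproduced here, but a schema exercising the documented checks (type, required, min/max, allowed values) might look like:

```json
{
  "age":    {"type": "int",    "required": true, "min": 0, "max": 120},
  "status": {"type": "string", "required": true, "allowed": ["active", "pending", "inactive"]}
}
```

Records that fail these checks end up in the `invalid_records` section of the output.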

Full Pipeline

```bash
python main.py \
  --input sample_data/sample.csv \
  --format csv \
  --filter "age>25 AND status=active" \
  --schema sample_data/schema.json \
  --aggregate \
  --output result.json
```

Large File Support

```bash
python main.py --input big_data.csv --format csv --chunk-size 50000 --output result.json
```

⚙️ CLI Options

| Option | Description |
|---|---|
| `--input` | File, directory, or glob pattern (required) |
| `--format` | `csv` / `json` / `xml` (auto-detected for batch) |
| `--output` | Output JSON file path (required) |
| `--filter` | Filter: `"age>25"` or `"age>25 AND status=active"` |
| `--aggregate` | Include min/max/avg/std_dev summary |
| `--schema` | Path to JSON schema for validation |
| `--no-clean` | Skip cleaning step |
| `--no-transform` | Skip transformation step |
| `--chunk-size` | Chunk size for large CSV files (default: 10000) |

📊 Output Format

```json
{
    "metadata": {
        "total_records": 6,
        "output_file": "result.json",
        "invalid_records": 1
    },
    "data": [...],
    "summary": {
        "total_records": 6,
        "fields": {
            "age": {
                "type": "numeric",
                "min": 25, "max": 40,
                "avg": 31.6, "sum": 190,
                "std_dev": 5.2,
                "null_count": 1
            },
            "status": {
                "type": "string",
                "unique_count": 2,
                "top_values": [
                    {"value": "active", "count": 5},
                    {"value": "inactive", "count": 1}
                ]
            }
        }
    },
    "invalid_records": [...]
}
```
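The per-field summary above can be approximated with the stdlib statistics and collections modules; `summarize_field` is an illustrative helper, not the project's aggregator.py:

```python
import statistics
from collections import Counter

def summarize_field(values):
    # Numeric vs string summary in the spirit of the output format above.
    non_null = [v for v in values if v is not None]
    try:
        nums = [float(v) for v in non_null]
        return {
            "type": "numeric",
            "min": min(nums), "max": max(nums),
            "avg": round(statistics.mean(nums), 1),
            "sum": sum(nums),
            "std_dev": round(statistics.pstdev(nums), 1),
            "null_count": len(values) - len(non_null),
        }
    except ValueError:
        # Non-numeric values fall through to a string-style summary.
        counts = Counter(non_null)
        return {
            "type": "string",
            "unique_count": len(counts),
            "top_values": [{"value": v, "count": c} for v, c in counts.most_common(3)],
        }
```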

🛠️ Tech Stack

  • Python 3.10+
  • Standard Library: csv, json, xml, argparse, re, glob, collections

Author

Ganesh Rayapati

About

Turn raw CSV, JSON, or XML into structured JSON — straight from your terminal.
