A modular data pipeline system built as a CLI tool, not just a parser. It offers:
- Separation of concerns
- Extensible architecture
- CLI usability
In industry terms: a simplified Apache Spark / Airflow pipeline, or a backend data processing service.
Every file connects to this pipeline:
INPUT FILE → PARSER → CLEANER → TRANSFORMER → AGGREGATOR → WRITER → OUTPUT
Think of it like a factory assembly line — each stage does one job and passes the result forward.
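The assembly-line idea can be sketched as plain function composition, where each stage takes a list of record dicts and returns a new one. The stage names and signatures here are illustrative, not the project's actual API:

```python
def clean(records):
    # Drop records whose values are all None (illustrative cleaning rule).
    return [r for r in records if any(v is not None for v in r.values())]

def transform(records):
    # Keep records matching a hypothetical filter condition.
    return [r for r in records if r.get("age", 0) > 25]

def run_pipeline(records, stages):
    # Apply each stage in order, passing the result forward.
    for stage in stages:
        records = stage(records)
    return records

rows = [{"age": 30}, {"age": 20}, {"age": None}]
result = run_pipeline(rows, [clean, transform])  # → [{"age": 30}]
```

Because every stage shares the same shape, adding a new stage means writing one function and appending it to the list.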
`main.py` is the orchestrator, a central concept here. It:
- Accepts CLI arguments
- Calls each stage in order
- Handles errors gracefully
- Controls the entire pipeline flow
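A hedged sketch of the orchestrator's argument handling with stdlib `argparse`, using the flag names documented in this README (defaults and help strings are illustrative):

```python
import argparse

def build_arg_parser():
    # Mirrors the documented CLI flags; not the project's exact parser.
    p = argparse.ArgumentParser(description="CLI data pipeline")
    p.add_argument("--input", required=True, help="file, directory, or glob")
    p.add_argument("--format", choices=["csv", "json", "xml"])
    p.add_argument("--output", required=True, help="output JSON path")
    p.add_argument("--filter", help='e.g. "age>25 AND status=active"')
    p.add_argument("--aggregate", action="store_true")
    p.add_argument("--schema", help="path to JSON schema")
    p.add_argument("--no-clean", action="store_true")
    p.add_argument("--no-transform", action="store_true")
    p.add_argument("--chunk-size", type=int, default=10000)
    return p

args = build_arg_parser().parse_args(
    ["--input", "sample.csv", "--format", "csv", "--output", "out.json"]
)
```

With the parsed `args` in hand, the orchestrator can branch on each flag (e.g. skip the cleaner when `args.no_clean` is set) before calling the stages in order.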
- Clean modular design
- Good separation of concerns (parser / processor / output)
- CLI-based → practical and reusable
- Supports multiple formats → CSV, JSON, XML
- Zero external dependencies (pure Python stdlib)
cli-data-parser/
├── parser/
│ ├── __init__.py
│ ├── csv_parser.py ← chunked reading for large files (100MB+)
│ ├── json_parser.py ← handles nested/flat JSON structures
│ └── xml_parser.py ← namespace + attribute support
├── processor/
│ ├── __init__.py
│ ├── cleaner.py ← clean, deduplicate, normalize nulls
│ ├── transformer.py ← AND/OR multi-condition filter expressions
│ ├── aggregator.py ← min, max, avg, sum, std_dev, top values
│ └── validator.py ← schema validation (type, required, range, allowed)
├── output/
│ ├── __init__.py
│ └── writer.py ← structured JSON with metadata + invalid records
├── utils/
│ ├── __init__.py
│ ├── logger.py ← colored terminal output (Green/Red/Blue/Yellow)
│ └── batch.py ← multi-file, directory, and glob pattern support
├── sample_data/
│ ├── sample.csv
│ ├── sample.json
│ ├── sample.xml
│ └── schema.json
├── main.py ← CLI entry point (orchestrator)
├── requirements.txt
└── README.md
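As a rough illustration of what a cleaner stage like `processor/cleaner.py` could do, deduplication and null normalization can be sketched as follows. The set of null-like tokens is an assumption; the actual implementation may differ:

```python
# Values treated as "null" for normalization (assumed token set).
NULLISH = {"", "null", "NULL", "n/a", "N/A", None}

def clean_records(records):
    # Normalize null-like values to None and drop exact duplicates.
    seen = set()
    cleaned = []
    for rec in records:
        norm = {k: (None if v in NULLISH else v) for k, v in rec.items()}
        key = tuple(sorted(norm.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            cleaned.append(norm)
    return cleaned
```

Normalizing before deduplicating matters: two rows that differ only in how they spell "null" collapse into one.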
| Feature | Description |
|---|---|
| Multi-format | CSV, JSON, XML |
| Batch processing | Directory or glob pattern input |
| Smart cleaning | Deduplication, null normalization, type cast |
| AND/OR filtering | Multi-condition filter expressions |
| Schema validation | Type, required, min/max, allowed values |
| Rich aggregation | min, max, avg, sum, std_dev, top values |
| Large file support | Chunked CSV reading (100MB+) |
| Colored output | Success/Error/Warning terminal colors |
| Zero dependencies | Python standard library only |
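The AND/OR filter expressions could be evaluated along these lines. This is a simplified sketch: one level of either AND or OR (not mixed), and only the `>`, `<`, `=` operators; the real transformer likely supports more:

```python
import re

def matches(record, expr):
    # Split on AND/OR; this sketch does not mix both in one expression.
    parts = re.split(r"\s+(AND|OR)\s+", expr)
    conds, ops = parts[::2], parts[1::2]
    results = []
    for cond in conds:
        field, op, value = re.match(r"(\w+)\s*([><=])\s*(.+)", cond).groups()
        actual = record.get(field)
        if op == "=":
            results.append(str(actual) == value)
        else:
            # Numeric comparison for > and <.
            ok = float(actual) > float(value) if op == ">" else float(actual) < float(value)
            results.append(ok)
    # AND requires all conditions; OR requires any.
    return all(results) if "AND" in ops or not ops else any(results)
```

A single condition like `"age>25"` falls through to the `all()` branch, so it behaves the same with or without connectors.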
git clone https://github.com/ganesh44we/cli-data-parser.git
cd cli-data-parser
python main.py --help
python main.py --input sample_data/sample.csv --format csv --output result.json
python main.py --input sample_data/sample.json --format json --output result.json
python main.py --input sample_data/sample.xml --format xml --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "age>25" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "age>25 AND status=active" --output result.json
python main.py --input sample_data/sample.csv --format csv --filter "status=active OR status=pending" --output result.json
python main.py --input sample_data/ --output result.json
python main.py --input "sample_data/*.csv" --output result.json
python main.py --input sample_data/sample.csv --format csv --schema sample_data/schema.json --output result.json
python main.py \
--input sample_data/sample.csv \
--format csv \
--filter "age>25 AND status=active" \
--schema sample_data/schema.json \
--aggregate \
--output result.json
python main.py --input big_data.csv --format csv --chunk-size 50000 --output result.json

| Option | Description |
|---|---|
| `--input` | File, directory, or glob pattern (required) |
| `--format` | csv / json / xml (auto-detected for batch) |
| `--output` | Output JSON file path (required) |
| `--filter` | Filter expression, e.g. "age>25" or "age>25 AND status=active" |
| `--aggregate` | Include min/max/avg/std_dev summary |
| `--schema` | Path to JSON schema for validation |
| `--no-clean` | Skip cleaning step |
| `--no-transform` | Skip transformation step |
| `--chunk-size` | Chunk size for large CSV files (default: 10000) |
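Chunked reading of large CSVs, as `--chunk-size` suggests, is possible with the standard library alone. A minimal sketch (illustrative; the project's `csv_parser.py` may differ):

```python
import csv

def read_csv_chunks(path, chunk_size=10000):
    # Yield lists of row dicts, never holding the whole file in memory.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk  # final partial chunk
```

Downstream stages can then process each chunk independently, which is what keeps 100MB+ files tractable.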
{
"metadata": {
"total_records": 6,
"output_file": "result.json",
"invalid_records": 1
},
"data": [...],
"summary": {
"total_records": 6,
"fields": {
"age": {
"type": "numeric",
"min": 25, "max": 40,
"avg": 31.6, "sum": 190,
"std_dev": 5.2,
"null_count": 1
},
"status": {
"type": "string",
"unique_count": 2,
"top_values": [
{"value": "active", "count": 5},
{"value": "inactive", "count": 1}
]
}
}
},
"invalid_records": [...]
}
- Python 3.10+
- Standard library only: csv, json, xml, argparse, re, glob, collections
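The per-field numeric summary in the sample output above maps directly onto the standard library. A minimal sketch, assuming population standard deviation (the project's `aggregator.py` may compute it differently):

```python
from statistics import pstdev

def summarize_numeric(values):
    # Summarize one numeric column, counting nulls separately.
    nums = [float(v) for v in values if v is not None]
    return {
        "type": "numeric",
        "min": min(nums),
        "max": max(nums),
        "avg": round(sum(nums) / len(nums), 1),
        "sum": sum(nums),
        "std_dev": round(pstdev(nums), 1),
        "null_count": values.count(None),
    }
```

For string fields, `collections.Counter(values).most_common(n)` yields (value, count) pairs in the same shape as the `top_values` list shown above.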
Ganesh Rayapati