GitHub - cjc0013/arnio: Lightning-fast CSV processing & data cleaning engine for Python. Built to supercharge pandas workflows.

Arnio is an open-source, C++-accelerated data preprocessing library
for Python. It is built for speed and memory efficiency, and is actively being optimized during GSSoC 2026.

The Problem • The Solution • Benchmarks • Quickstart

Pandas is incredible for analysis, but it is notoriously slow and memory-hungry for ingesting and cleaning raw CSVs.
Arnio exists to do exactly one thing: intercept your messy CSVs, clean them natively in C++, and hand you a pristine pandas DataFrame. The goal is to do that faster than a hand-written pandas script; the Benchmarks section shows where the current release actually stands.

🧨 The Problem

Every data project starts the same way. You load a CSV. It crashes your RAM. You load it again in chunks. You find random nulls, weird capitalization, and trailing whitespace. You write a 15-line script chaining .apply(), .dropna(), and .str.strip(). You copy-paste this script into your next 5 Jupyter notebooks.

It's slow. It's unreadable. It's error-prone.

✨ The Solution: Arnio

Arnio replaces your messy ingestion script with a high-performance, declarative pipeline powered by pybind11 and C++.

❌ The Old Way (Pandas)	⚡ The Arnio Way
Memory Spikes: Python loads the entire raw string file before casting.	C++ Native: Parses and infers types directly into columnar memory.
Spaghetti Code: `.apply()` lambda functions scattered across cells.	Declarative: A strict, readable list of cleaning steps.
Slow Execution: Python loops over strings to strip whitespace.	Native Execution: Cleaning primitives run as compiled C++ instead of Python loops.

🚀 Getting Started

If you have Python 3.9+, you are 5 seconds away from faster data pipelines.

pip install arnio

The 3-Step Workflow

Drop Arnio into the very top of your Jupyter Notebook or Python script.

import arnio as ar

# 1. Load the raw file using the C++ core (no Python overhead)
frame = ar.read_csv("messy_sales_data.csv")

# 2. Define a strict, readable cleaning pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 3. Export to a clean pandas DataFrame and start your analysis!
df = ar.to_pandas(clean_frame)

# -> Now, use `df` exactly like you always have.
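
To make the step format concrete, here is a minimal pure-Python sketch of how a list of `(name,)` or `(name, options)` tuples could be dispatched in order. It is not Arnio's implementation: the `STEPS` registry, the `run_steps` function, and the list-of-dicts row representation are all assumptions made for illustration, and only two of the steps are modelled.

```python
# Illustrative only: a toy dispatcher for a declarative step list.
# `STEPS`, `run_steps`, and the row format are hypothetical, not Arnio's API.

def strip_whitespace(rows):
    """Trim leading/trailing whitespace from every string value."""
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]

def fill_nulls(rows, value, subset=None):
    """Replace None with `value`, optionally only in the `subset` columns."""
    return [
        {
            k: value if v is None and (subset is None or k in subset) else v
            for k, v in row.items()
        }
        for row in rows
    ]

# Registry mapping step names to plain functions.
STEPS = {"strip_whitespace": strip_whitespace, "fill_nulls": fill_nulls}

def run_steps(rows, steps):
    """Apply each step in order; a step is (name,) or (name, options_dict)."""
    for step in steps:
        name = step[0]
        options = step[1] if len(step) > 1 else {}
        if name not in STEPS:
            raise ValueError(f"unknown step: {name!r}")
        rows = STEPS[name](rows, **options)
    return rows

rows = [
    {"region": "  EMEA ", "revenue": None},
    {"region": "apac", "revenue": 12.5},
]
cleaned = run_steps(rows, [
    ("strip_whitespace",),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
])
print(cleaned)
# [{'region': 'EMEA', 'revenue': 0.0}, {'region': 'apac', 'revenue': 12.5}]
```

Because each step is plain data, the same list can be stored, reviewed, and reused across notebooks, which is the point of the declarative style.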

🏎️ Benchmarks

Tested on Ubuntu, Python 3.12, 1M row CSV.
Run make benchmark to reproduce on your machine.

Metric	pandas	arnio v1.0.0
Execution Time	4.73s	5.75s
Peak RAM	211MB	212MB

Current state: arnio's C++ CSV reader matches pandas on memory, but
the full run is about 1 second (roughly 22%) slower than pandas.
Speed parity is the active engineering goal for v0.2.0. Specifically,
drop_duplicates and strip_whitespace are unoptimized C++ and are
the primary contributors to the gap.
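
If you want a rough feel for the two metrics in the table before running the project's own benchmark, the sketch below times a function and records its peak allocation using only the standard library. This is not what `make benchmark` runs; `measure` and `workload` are hypothetical names. Note one important assumption: `tracemalloc` only sees memory allocated through Python's allocator, so memory allocated inside a C++ extension will not show up here, and a process-level tool is needed for that.

```python
import time
import tracemalloc

def measure(fn, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_bytes) for one call to fn."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak

# Stand-in workload; swap in your own load-and-clean function.
def workload(n):
    return [str(i).strip().lower() for i in range(n)]

result, elapsed, peak = measure(workload, 100_000)
print(f"time: {elapsed:.3f}s  peak (Python allocations only): {peak / 1e6:.1f} MB")
```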

Help close the gap →

🔍 Want to peek at a massive file without loading it?

Arnio lets you instantly scan a massive CSV to infer its schema without loading the data into memory.

import arnio as ar

schema = ar.scan_csv("100GB_file.csv")
print(schema) 
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
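
One common way to infer a schema cheaply is to read only the header and a bounded sample of rows, then pick the narrowest type that fits each column. The sketch below shows that idea with the standard library. Whether Arnio samples or streams the whole file is not stated here, so treat this as an assumption; `sample_schema` and `infer_type` are hypothetical names, and the `FLOAT64` label is assumed by analogy with the labels in the output above.

```python
import csv
import io
import re
from itertools import islice

_INT = re.compile(r"[+-]?\d+")

def _is_float(text):
    """True if float() accepts the text (this includes 'nan' and 'inf')."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def infer_type(values):
    """Pick the narrowest type label that fits every non-empty value."""
    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "STRING"
    if all(v.lower() in ("true", "false") for v in values):
        return "BOOL"
    if all(_INT.fullmatch(v) for v in values):
        return "INT64"
    if all(_is_float(v) for v in values):
        return "FLOAT64"
    return "STRING"

def sample_schema(fileobj, sample_rows=1000):
    """Infer column types from the header plus at most `sample_rows` rows."""
    reader = csv.reader(fileobj)
    header = next(reader, None)
    if header is None:
        return {}
    columns = {name: [] for name in header}
    for row in islice(reader, sample_rows):
        for name, value in zip(header, row):
            columns[name].append(value)
    return {name: infer_type(vals) for name, vals in columns.items()}

data = io.StringIO("id,name,is_active\n1,Ada,true\n2,Grace,false\n")
schema = sample_schema(data)
print(schema)
# {'id': 'INT64', 'name': 'STRING', 'is_active': 'BOOL'}
```

The trade-off of sampling is that a value outside the sample (for example a stray string in row two million of an otherwise integer column) can invalidate the inferred type.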

🛠️ What's Inside?

Arnio ships with a growing library of C++ cleaning primitives:

drop_nulls: Rip out bad rows instantly.
fill_nulls: Patch holes with scalar values.
drop_duplicates: Deduplicate rows based on exact matches.
strip_whitespace: Trim invisible spaces from string columns.
normalize_case: Force upper or lower case instantly.
rename_columns & cast_types: Shape your data exactly how you need it.
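
As a reference for what the two row-dropping primitives mean, here is a pure-Python sketch of their semantics on a list of dicts. It assumes that a null is `None`, that any null in a row drops the whole row, and that deduplication keeps the first occurrence and preserves order. Those choices are assumptions for illustration; check the project's documentation for the behavior Arnio actually guarantees.

```python
# Reference semantics only; these are not Arnio's C++ implementations.

def drop_nulls(rows):
    """Keep only rows in which no value is None."""
    return [row for row in rows if all(v is not None for v in row.values())]

def drop_duplicates(rows):
    """Keep the first occurrence of each exactly-equal row, in order."""
    seen = set()
    unique = []
    for row in rows:
        # Column names are unique, so sorting by item never compares values.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [
    {"id": 1, "name": "ada"},
    {"id": 1, "name": "ada"},   # exact duplicate of the first row
    {"id": 2, "name": None},    # contains a null
    {"id": 1, "name": "Ada"},   # differs by case, so not a duplicate
]
result = drop_duplicates(drop_nulls(rows))
print(result)
# [{'id': 1, 'name': 'ada'}, {'id': 1, 'name': 'Ada'}]
```

The last row shows why step order matters: running normalize_case before drop_duplicates would collapse "ada" and "Ada" into one row.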

🤝 Contributing

Arnio is a GSSoC 2026 project. We welcome contributors of all levels.

No C++ required: Add pipeline steps in pure Python
C++ contributors: Help optimize drop_duplicates and strip_whitespace, the current performance bottlenecks
Docs & examples: Always needed

Read the Contribution Guide → | Browse open issues →

🗺️ Roadmap

Version	Focus	Status
v1.0.0	Stable release, cross-platform wheels, Google Colab support, CI/CD pipeline	✅ Released
v0.2.0	C++ pipeline optimization, speed parity with pandas	🔨 Active
v0.3.0	Chunked processing, Parquet/JSON support	📋 Planned

Stop fighting your data. Let Arnio clean it.

