
C++ CSV cleaning with built-in data quality intelligence

Parse messy CSVs through a native columnar engine, profile quality problems, validate data contracts, and hand pandas a cleaner DataFrame.

$ pip install arnio

The Problem Arnio Solves

Every data project starts with messy CSVs. You load, clean, inspect, validate, and hope nothing changed upstream. Arnio makes that workflow explicit and repeatable before pandas does the analysis.

❌ The Old Way (Pandas)

  • Memory spikes — Python loads raw data before you know if it is valid
  • Hidden assumptions — schemas and business rules live in notebook cells
  • Manual inspection — nulls, whitespace, duplicates, and bad emails are found late

⚡ The Arnio Way

  • C++ native — Parses and infers types directly into columnar memory
  • Declarative — Cleaning pipelines, quality reports, and schemas are code
  • Production-minded — Row-level validation issues make bad data visible

The Arnio Workflow

Drop Arnio into the top of your notebook, ETL script, or validation job. One flow for reading, understanding, cleaning, and validating CSV data.

Python
import arnio as ar

# 1. Load the raw file using the C++ core
frame = ar.read_csv("messy_sales_data.csv")

# 2. Understand the incoming data before analysis
report = ar.profile(frame)
print(report.summary())

# 3. Clean with a declarative, reusable pipeline
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 4. Validate production data contracts
schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
})
result = ar.validate(clean_frame, schema)

# 5. Export to a clean pandas DataFrame
df = ar.to_pandas(clean_frame)

# Now use df exactly like you always have

What's Inside

A practical toolkit for messy real-world CSV workflows: native parsing, reusable cleaning steps, quality profiling, and data contracts.

🗑️

drop_nulls

Remove rows containing null or empty values. Optionally target specific columns with the subset parameter.
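Arnio runs this step natively in C++; as a minimal pure-Python sketch of the semantics (rows as a list of dicts, treating `None` and `""` as null — not Arnio's implementation):

```python
# Sketch of drop_nulls semantics; the real step operates on columnar memory.
def drop_nulls(rows, subset=None):
    """Drop any row whose targeted cells contain None or an empty string."""
    def is_null(v):
        return v is None or v == ""
    def keep(row):
        cols = subset if subset is not None else row.keys()
        return not any(is_null(row[c]) for c in cols)
    return [row for row in rows if keep(row)]

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},            # dropped: empty email
    {"id": None, "email": "b@x.com"},  # dropped unless subset=["email"]
]
```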

🔧

fill_nulls

Replace null/empty values with a specified fill value. Supports column-level targeting via subset.
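A pure-Python sketch of what this step does (illustrative only; Arnio performs it in the C++ core):

```python
# Sketch of fill_nulls semantics: replace None/empty cells with a fill value.
def fill_nulls(rows, value, subset=None):
    out = []
    for row in rows:
        cols = subset if subset is not None else row.keys()
        new = dict(row)
        for c in cols:
            if new[c] is None or new[c] == "":
                new[c] = value
        out.append(new)
    return out

sales = [{"region": "eu", "revenue": None}, {"region": "us", "revenue": 12.5}]
filled = fill_nulls(sales, 0.0, subset=["revenue"])
```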

🧹

strip_whitespace

Trim leading and trailing whitespace from all string columns in a single C++ pass.
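The semantics, sketched in pure Python (Arnio's version is a single native scan):

```python
# Sketch of strip_whitespace: trim string cells, leave other types untouched.
def strip_whitespace(rows):
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]
```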

🔤

normalize_case

Force lower, upper, or title case on string columns instantly.
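Sketched in pure Python with the three documented `case_type` options:

```python
# Sketch of normalize_case semantics for string cells.
def normalize_case(rows, case_type="lower"):
    transform = {"lower": str.lower, "upper": str.upper, "title": str.title}[case_type]
    return [
        {k: transform(v) if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]
```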

🧬

drop_duplicates

Deduplicate rows based on exact matches. Supports subset and keep options.
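A pure-Python sketch of the exact-match semantics, including the `subset` and `keep` options (illustrative, not Arnio's implementation):

```python
# Sketch of drop_duplicates: keep the first (or last) row per exact key.
def drop_duplicates(rows, subset=None, keep="first"):
    ordered = list(reversed(rows)) if keep == "last" else list(rows)
    seen, out = set(), []
    for row in ordered:
        cols = subset if subset is not None else sorted(row.keys())
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return list(reversed(out)) if keep == "last" else out
```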

🏷️

rename & cast

rename_columns and cast_types — shape your data exactly how you need it.
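As a pure-Python sketch of the two steps (Arnio applies them natively; the dict-of-callables cast convention here is illustrative):

```python
# Sketch of rename_columns / cast_types semantics.
def rename_columns(rows, mapping):
    return [{mapping.get(k, k): v for k, v in row.items()} for row in rows]

def cast_types(rows, casts):
    """casts: {column: callable}, e.g. {"id": int}; nulls pass through untouched."""
    return [
        {k: casts[k](v) if k in casts and v not in (None, "") else v
         for k, v in row.items()}
        for row in rows
    ]
```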

📋

profile

Generate a quality report with null ratios, duplicate counts, uniqueness, whitespace signals, and semantic hints.
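To make the report fields concrete, here is a pure-Python sketch of the kind of statistics a profile computes (field names beyond `rows` and `columns_with_nulls` in the example output above are assumptions):

```python
# Sketch of quality profiling: null counts per column and exact-duplicate rows.
def profile(rows):
    n = len(rows)
    cols = list(rows[0].keys()) if rows else []
    nulls = {c: sum(1 for r in rows if r[c] in (None, "")) for c in cols}
    exact = {tuple(sorted(r.items())) for r in rows}
    return {
        "rows": n,
        "columns_with_nulls": [c for c in cols if nulls[c] > 0],
        "null_ratio": {c: nulls[c] / n for c in cols},
        "duplicate_rows": n - len(exact),
    }
```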

validate

Define schemas with required columns, dtypes, ranges, uniqueness, patterns, emails, and URLs; validation reports failures at the row level.
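A pure-Python sketch of row-level contract checking (the rules-as-dicts shape and email regex here are illustrative stand-ins for Arnio's `Schema` types):

```python
import re

# Simplified email pattern for the sketch; not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Sketch of validate: collect (row_index, column, problem) tuples.
def validate(rows, schema):
    issues = []
    for i, row in enumerate(rows):
        for col, rules in schema.items():
            v = row.get(col)
            if v in (None, ""):
                if not rules.get("nullable", True):
                    issues.append((i, col, "null"))
            elif "pattern" in rules and not rules["pattern"].match(str(v)):
                issues.append((i, col, "pattern"))
            elif "min" in rules and v < rules["min"]:
                issues.append((i, col, "min"))
    return issues
```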

auto_clean

Apply low-risk cleanup automatically, or use strict mode for deterministic casts and exact duplicate removal.
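Sketching the low-risk defaults in pure Python (the exact set of steps auto_clean applies is assumed here: whitespace trimming followed by exact-duplicate removal):

```python
# Sketch of auto_clean's low-risk defaults: trim, then drop exact duplicates.
def auto_clean(rows):
    trimmed = [
        {k: v.strip() if isinstance(v, str) else v for k, v in r.items()}
        for r in rows
    ]
    seen, out = set(), []
    for r in trimmed:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out
```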

Quality Reports & Data Contracts

Arnio answers the questions pandas leaves to you: what is wrong with this dataset, what can be cleaned safely, and does it satisfy the contract your pipeline expects?

Python
report = ar.profile(frame)
print(report.summary())
# {'rows': 1000, 'columns_with_nulls': ['revenue'], ...}

suggestions = ar.suggest_cleaning(report)
clean = ar.pipeline(frame, suggestions)

schema = ar.Schema({
    "email": ar.Email(nullable=False, unique=True),
    "signup_url": ar.URL(),
    "age": ar.Int64(nullable=False, min=13),
})
result = ar.validate(clean, schema)
print(result.summary())

Fast Schema Scanning

Inspect a massive CSV to infer column names and types before committing to a full read.

Python
import arnio as ar

schema = ar.scan_csv("100GB_file.csv", encoding="latin-1")
print(schema)
# {'id': 'int64', 'name': 'string', 'is_active': 'bool'}

Built by the Community

Arnio is shaped by contributors from around the world. Every PR, every issue, every idea makes this project better.

Whether it's a new pipeline step, a C++ optimization, a typo fix, or an idea — every contribution counts.
Help us make Arnio the most practical data quality layer before pandas.

Start Contributing Join Discord Good First Issues