C++ CSV cleaning with built-in data quality intelligence
Parse messy CSVs through a native columnar engine, profile quality problems, validate data contracts, and hand pandas a cleaner DataFrame.
Every data project starts with messy CSVs. You load, clean, inspect, validate, and hope nothing changed upstream. Arnio makes that workflow explicit and repeatable before pandas does the analysis.
Drop Arnio into the top of your notebook, ETL script, or validation job. One flow for reading, understanding, cleaning, and validating CSV data.
import arnio as ar
# 1. Load the raw file using the C++ core
frame = ar.read_csv("messy_sales_data.csv")
# 2. Understand the incoming data before analysis
report = ar.profile(frame)
print(report.summary())
# 3. Apply safe, pipeline-compatible cleaning suggestions
clean_frame = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
# 4. Validate production data contracts
schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
})
result = ar.validate(clean_frame, schema)
# 5. Export to a clean pandas DataFrame
df = ar.to_pandas(clean_frame)
# Now use df exactly like you always have
A practical toolkit for messy real-world CSV workflows: native parsing, reusable cleaning steps, quality profiling, and data contracts.
Remove rows containing null or empty values. Optionally target specific columns with the subset parameter.
Replace null/empty values with a specified fill value. Supports column-level targeting via subset.
Trim leading and trailing whitespace from all string columns in a single C++ pass.
Force lower, upper, or title case on string columns instantly.
Deduplicate rows based on exact matches. Supports subset and keep options.
Reshape your data exactly how you need it with rename_columns and cast_types.
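Arnio runs these steps in its C++ columnar engine; as a rough mental model only, the first five steps behave like this pure-Python sketch (clean_rows and its row-dict representation are illustrative, not part of Arnio's API):

```python
def clean_rows(rows, fill_revenue=0.0):
    """Illustrative model of strip_whitespace, normalize_case,
    fill_nulls (on "revenue"), drop_nulls, and drop_duplicates.
    Arnio performs the real work in a single native pass."""
    seen, out = set(), []
    for row in rows:
        # strip_whitespace: trim every string cell
        row = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # normalize_case: lower-case every string cell
        row = {k: v.lower() if isinstance(v, str) else v for k, v in row.items()}
        # fill_nulls with subset=["revenue"]
        if row.get("revenue") in (None, ""):
            row["revenue"] = fill_revenue
        # drop_nulls: discard rows that still contain null/empty cells
        if any(v is None or v == "" for v in row.values()):
            continue
        # drop_duplicates: keep only the first exact match
        key = tuple(sorted(row.items()))
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    return out
```

Note how ordering matters: filling nulls before dropping them keeps rows whose only gap was a missing revenue figure, which is why the pipeline lists fill_nulls ahead of drop_nulls.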
Generate a quality report with null ratios, duplicate counts, uniqueness, whitespace signals, and semantic hints.
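Conceptually, such a report aggregates simple per-column statistics. A hedged sketch of the kind of numbers involved (profile_columns is illustrative and not Arnio's implementation):

```python
def profile_columns(rows):
    """Illustrative per-column quality statistics: null ratio,
    uniqueness, and leading/trailing-whitespace signals."""
    stats = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r.get(col) for r in rows]
        non_null = [v for v in values if v not in (None, "")]
        stats[col] = {
            "null_ratio": (len(values) - len(non_null)) / len(values),
            "unique": len(set(non_null)) == len(non_null),
            # count of string cells carrying stray whitespace
            "whitespace": sum(1 for v in non_null
                              if isinstance(v, str) and v != v.strip()),
        }
    return stats
```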
Define schemas with required columns, dtypes, ranges, uniqueness, patterns, emails, and URLs, with row-level failure reporting.
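A data contract of this kind reduces to per-column checks that collect row-level failures. A minimal, hypothetical sketch of the idea (check_contract and its rule-dict format are illustrative, not Arnio's schema API):

```python
import re

# deliberately simple email pattern, good enough for a sketch
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_contract(rows, rules):
    """Return (row_index, column, message) for every violation.
    rules maps column -> checks, e.g. {"min": 0, "email": True}."""
    failures = []
    for i, row in enumerate(rows):
        for col, rule in rules.items():
            value = row.get(col)
            if value is None:
                if not rule.get("nullable", True):
                    failures.append((i, col, "null not allowed"))
                continue
            if rule.get("email") and not EMAIL_RE.match(str(value)):
                failures.append((i, col, "invalid email"))
            if "min" in rule and value < rule["min"]:
                failures.append((i, col, "below minimum"))
    return failures
```

Collecting every failure rather than raising on the first one is what makes contract results useful in a pipeline: you can log, quarantine, or fail the batch based on the full picture.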
Apply low-risk cleanup automatically, or use strict mode for deterministic casts and exact duplicate removal.
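The difference between the two modes comes down to how a failed operation is handled. A hypothetical sketch of that policy for casts (cast_column is illustrative, not Arnio's API):

```python
def cast_column(values, caster=float, strict=False):
    """Cast every value. In strict mode any failure raises
    (deterministic: the batch either casts fully or not at all);
    in safe mode the offending value becomes None instead."""
    out = []
    for v in values:
        try:
            out.append(caster(v))
        except (TypeError, ValueError):
            if strict:
                raise
            out.append(None)
    return out
```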
Arnio now helps answer the questions pandas leaves to you: what is wrong with this dataset, what can be cleaned safely, and does it satisfy the contract my pipeline expects?
report = ar.profile(frame)
print(report.summary())
# {'rows': 1000, 'columns_with_nulls': ['revenue'], ...}
suggestions = ar.suggest_cleaning(report)
clean = ar.pipeline(frame, suggestions)
schema = ar.Schema({
    "email": ar.Email(nullable=False, unique=True),
    "signup_url": ar.URL(),
    "age": ar.Int64(nullable=False, min=13),
})
result = ar.validate(clean, schema)
print(result.summary())
Inspect a massive CSV to infer column names and types before committing to a full read.
import arnio as ar
schema = ar.scan_csv("100GB_file.csv", encoding="latin-1")
print(schema)
# {'id': 'int64', 'name': 'string', 'is_active': 'bool'}
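Type inference of this kind can be modelled as scanning a small sample of raw values and picking the narrowest dtype they all fit. A rough sketch under that assumption (infer_type is illustrative, not scan_csv's real implementation):

```python
def infer_type(sample):
    """Guess a column dtype from a sample of raw CSV string values,
    trying the narrowest type first: bool, int64, float64, string."""
    values = [v for v in sample if v not in ("", None)]
    if not values:
        return "string"
    if all(v.lower() in ("true", "false") for v in values):
        return "bool"
    if all(v.lstrip("+-").isdigit() for v in values):
        return "int64"
    try:
        for v in values:
            float(v)
        return "float64"
    except ValueError:
        return "string"
```

Scanning only a sample is what keeps this cheap on a 100 GB file; the trade-off is that a rare value deep in the file can still break the inferred dtype on the full read.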
Arnio is shaped by contributors from around the world. Every PR, every issue, every idea makes this project better.
Whether it's a new pipeline step, a C++ optimization, a typo fix, or an idea — every contribution counts.
Help us make Arnio the most practical data quality layer before pandas.