Documentation
Everything you need to integrate Arnio into data cleaning, profiling, and validation workflows.
Installation
Arnio supports Python 3.9 through 3.13 and ships pre-compiled wheels for Windows, Linux (manylinux), and macOS (Intel & Apple Silicon), so no C++ compiler is needed for a normal installation.
pip install arnio
Arnio depends on pandas ≥ 1.5 and numpy ≥ 1.23. Both are installed automatically.
Google Colab
Arnio works out of the box on Google Colab. Just run !pip install arnio in a cell.
Quickstart
The typical Arnio workflow is five steps: load, profile, clean, validate, export.
import arnio as ar
# 1. Load — C++ reads and parses the CSV
frame = ar.read_csv("messy_sales_data.csv")
# 2. Profile — understand nulls, duplicates, whitespace, and semantic hints
report = ar.profile(frame)
print(report.summary())
# 3. Clean — declarative pipeline
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])
# 4. Validate — production data contract
schema = ar.Schema({
    "name": ar.String(nullable=False, min_length=2),
    "email": ar.Email(nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
})
result = ar.validate(clean, schema)
# 5. Export — pandas DataFrame
df = ar.to_pandas(clean)
After export, df is a standard pandas.DataFrame. Use it exactly as you always would.
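For example, ordinary pandas aggregation works immediately after export (a minimal sketch, assuming the quickstart's name and revenue columns):

# Plain pandas from here on (sketch; column names from the quickstart schema)
print(df.groupby("name")["revenue"].sum())
print(df.describe())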
ArFrame
ArFrame is Arnio's core data container — a lightweight columnar structure backed by C++. It wraps the native _Frame object and provides Python-friendly access.
frame = ar.read_csv("data.csv")
print(frame.shape) # (1000000, 8)
print(frame.columns) # ['id', 'name', 'city', ...]
print(frame.dtypes) # {'id': 'int64', 'name': 'string', ...}
print(frame.memory_usage()) # bytes consumed
print(len(frame)) # 1000000
ArFrame is not a DataFrame replacement. It is an intermediate representation designed for high-speed cleaning before you export to pandas.
Pipeline System
The pipeline() function chains cleaning steps sequentially. Each step is a tuple: (step_name,) or (step_name, kwargs_dict).
result = ar.pipeline(frame, [
    ("drop_nulls", {"subset": ["age", "name"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates", {"keep": "first"}),
])
Steps are executed in order. Each step receives the output of the previous step. Built-in steps run via the C++ backend; custom Python steps go through a pandas round-trip.
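Conceptually, the pipeline above is just sequential function application. A hedged sketch, assuming the top-level cleaning functions mirror the step names as the Cleaning Functions table below suggests:

# Equivalent hand-chained version of the pipeline above (illustrative sketch)
step = ar.drop_nulls(frame, subset=["age", "name"])
step = ar.strip_whitespace(step)
step = ar.normalize_case(step, case_type="lower")
result = ar.drop_duplicates(step, keep="first")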
Custom Steps
You can register pure-Python functions as pipeline steps — no C++ required. The function receives and returns a pandas.DataFrame.
import arnio as ar
def remove_special_chars(df, columns=None):
    cols = columns or df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df
ar.register_step("remove_special_chars", remove_special_chars)
# Now use it in any pipeline
result = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_special_chars",),
])
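Step tuples of the form (step_name, kwargs_dict) work for custom steps too. A short sketch, assuming the kwargs dict is forwarded to the registered function as keyword arguments:

# The kwargs dict is forwarded as remove_special_chars(df, columns=[...])
result = ar.pipeline(frame, [
    ("remove_special_chars", {"columns": ["name", "city"]}),
])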
Contributor-friendly
This is how 90% of GSSoC contributors add new features — no C++ compiler, no pybind11, just pure Python.
Cleaning Functions
All built-in cleaning functions accept an ArFrame and return a new ArFrame. They can be called directly or via the pipeline.
| Function | Description | Key Parameters |
|---|---|---|
| drop_nulls(frame) | Remove rows with null/empty values | subset |
| fill_nulls(frame, value) | Replace nulls with a fill value | value, subset |
| drop_duplicates(frame) | Remove duplicate rows | subset, keep |
| strip_whitespace(frame) | Trim whitespace from strings | subset |
| normalize_case(frame) | Normalize string case | subset, case_type |
| rename_columns(frame, mapping) | Rename columns via dict | mapping |
| cast_types(frame, mapping) | Cast column types via dict | mapping |
| clean(frame) | Convenience: strip + drop nulls + dedup | boolean flags |
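Because every function in the table takes an ArFrame and returns a new one, direct calls chain naturally. A minimal sketch, assuming the signatures shown above (the column names are illustrative):

# Direct calls, no pipeline (sketch)
frame = ar.strip_whitespace(frame)
frame = ar.rename_columns(frame, {"Name": "name", "Rev": "revenue"})
frame = ar.cast_types(frame, {"revenue": "float64"})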
See the API Reference for complete signatures and examples.
DataFrame Conversion
Arnio provides zero-copy (where possible) conversion between ArFrame and pandas:
# ArFrame → pandas
df = ar.to_pandas(frame)
# pandas → ArFrame
frame = ar.from_pandas(df)
to_pandas() uses NumPy buffer interfaces for numeric columns (int64, float64, bool), avoiding row-by-row conversion where possible. String columns use one boundary crossing per column.
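This makes it cheap to drop into pandas for a transformation Arnio does not provide, then hand the result back. A sketch (the conversion rate and column name are illustrative):

# Round-trip sketch: pandas in the middle of an Arnio workflow
df = ar.to_pandas(frame)
df["revenue_eur"] = df["revenue"] * 0.92  # hypothetical FX rate
frame = ar.from_pandas(df)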
Schema Scanning
Use scan_csv() to infer column names and types from a CSV without loading the full dataset:
schema = ar.scan_csv("huge_file.csv", encoding="utf-8")
print(schema)
# {'id': 'int64', 'name': 'string', 'is_active': 'bool'}
This is useful for previewing large files before committing to a full load.
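The scanned schema can also drive decisions before the full load. A sketch, assuming scan_csv() returns a plain dict as the printed output above suggests:

# Sketch: pick out the numeric columns before committing to a full read
scanned = ar.scan_csv("huge_file.csv")
numeric_cols = [col for col, dtype in scanned.items() if dtype in ("int64", "float64")]
print(numeric_cols)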
Data Profiling
profile() returns a DataQualityReport with high-signal dataset diagnostics: row and column counts, memory usage, duplicate rows, null counts, uniqueness, whitespace issues, semantic hints, sample values, warnings, and safe cleaning suggestions.
report = ar.profile(frame)
print(report.summary())
print(report.columns["email"].semantic_type)
print(report.to_pandas())
Reports can be serialized with to_dict() or inspected as a pandas DataFrame with to_pandas().
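For example, a report can be archived for later review. A sketch, assuming to_dict() yields mostly JSON-serializable values (default=str covers anything that is not):

import json

# Sketch: persist a profile report as JSON
report = ar.profile(frame)
with open("profile_report.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2, default=str)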
Auto Clean
suggest_cleaning() converts profile signals into pipeline-compatible cleaning steps. auto_clean() applies those steps for you.
suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)
# Safe mode trims whitespace only
safe = ar.auto_clean(frame)
# Strict mode also applies deterministic casts and exact deduplication
strict, report = ar.auto_clean(frame, mode="strict", return_report=True)
Safe by default
auto_clean(mode="safe") only applies low-risk whitespace cleanup. Use mode="strict" when deterministic casts and exact duplicate removal are acceptable for your workflow.
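Because suggest_cleaning() returns ordinary pipeline step tuples, you can review or filter them before applying anything. A minimal sketch:

# Sketch: inspect suggestions, then apply only the steps you accept
suggestions = ar.suggest_cleaning(frame)
for step in suggestions:
    print(step)  # e.g. ("strip_whitespace",) or ("drop_duplicates", {...})
accepted = [s for s in suggestions if s[0] != "drop_duplicates"]
clean = ar.pipeline(frame, accepted)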
Schema Validation
Schema and Field let you express production data contracts directly in Python. Validation returns all issues instead of stopping at the first failure.
schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "status": ar.String(allowed={"active", "blocked"}),
    "revenue": ar.Float64(nullable=True, min=0),
}, strict=True)
result = schema.validate(frame)
print(result.passed)
print(result.summary())
print(result.to_pandas())
Built-in field helpers include Int64, Float64, String, Bool, Email, and URL. Rules support nullable checks, min/max values, uniqueness, allowed sets, regular expressions, and string length bounds.
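A sketch of the remaining rule types; min_length appears in the quickstart, while the pattern and max_length parameter names here are assumptions:

# Sketch: regex, length, and URL rules (pattern/max_length names are assumed)
schema = ar.Schema({
    "sku": ar.String(nullable=False, pattern=r"^[A-Z]{3}-\d{4}$"),
    "name": ar.String(min_length=2, max_length=80),
    "homepage": ar.URL(nullable=True),
})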