Documentation

Everything you need to integrate Arnio into data cleaning, profiling, and validation workflows.

Installation

Arnio supports Python 3.9 through 3.13. It ships pre-compiled wheels for Windows, Linux (manylinux), and macOS (Intel & Apple Silicon), so no C++ compiler is needed for a normal installation.

Terminal
pip install arnio

Arnio depends on pandas ≥ 1.5 and numpy ≥ 1.23. Both are installed automatically.

Google Colab

Arnio works out of the box on Google Colab. Just run !pip install arnio in a cell.

Quickstart

The typical Arnio workflow is five steps: load, profile, clean, validate, export.

Python
import arnio as ar

# 1. Load — C++ reads and parses the CSV
frame = ar.read_csv("messy_sales_data.csv")

# 2. Profile — understand nulls, duplicates, whitespace, and semantic hints
report = ar.profile(frame)
print(report.summary())

# 3. Clean — declarative pipeline
clean = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("fill_nulls", {"value": 0.0, "subset": ["revenue"]}),
    ("drop_nulls",),
    ("drop_duplicates",),
])

# 4. Validate — production data contract
schema = ar.Schema({
    "name": ar.String(nullable=False, min_length=2),
    "email": ar.Email(nullable=False),
    "revenue": ar.Float64(nullable=True, min=0),
})
result = ar.validate(clean, schema)

# 5. Export — pandas DataFrame
df = ar.to_pandas(clean)

After export, df is a standard pandas.DataFrame. Use it exactly as you always would.
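
For instance, the exported frame drops straight into ordinary pandas analysis (a minimal sketch; the column names follow the Quickstart data and are illustrative):

Python
print(df.describe())
print(df.sort_values("revenue", ascending=False).head())
df.to_csv("clean_sales.csv", index=False)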

ArFrame

ArFrame is Arnio's core data container — a lightweight columnar structure backed by C++. It wraps the native _Frame object and provides Python-friendly access.

Python
frame = ar.read_csv("data.csv")

print(frame.shape)     # (1000000, 8)
print(frame.columns)   # ['id', 'name', 'city', ...]
print(frame.dtypes)    # {'id': 'int64', 'name': 'string', ...}
print(frame.memory_usage())  # bytes consumed
print(len(frame))      # 1000000

ArFrame is not a DataFrame replacement. It is an intermediate representation designed for high-speed cleaning before you export to pandas.

Pipeline System

The pipeline() function chains cleaning steps sequentially. Each step is a tuple: (step_name,) or (step_name, kwargs_dict).

Python
result = ar.pipeline(frame, [
    ("drop_nulls", {"subset": ["age", "name"]}),
    ("strip_whitespace",),
    ("normalize_case", {"case_type": "lower"}),
    ("drop_duplicates", {"keep": "first"}),
])

Steps are executed in order. Each step receives the output of the previous step. Built-in steps run via the C++ backend; custom Python steps go through a pandas round-trip.
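
Conceptually, the pipeline above behaves like chaining the standalone cleaning functions by hand. This is a sketch, assuming the direct-call functions accept the same keyword arguments listed as Key Parameters in the Cleaning Functions table:

Python
# Each step consumes the previous step's output, exactly as in the pipeline.
step = ar.drop_nulls(frame, subset=["age", "name"])
step = ar.strip_whitespace(step)
step = ar.normalize_case(step, case_type="lower")
result = ar.drop_duplicates(step, keep="first")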

Custom Steps

You can register pure-Python functions as pipeline steps — no C++ required. The function receives and returns a pandas.DataFrame.

Python
import arnio as ar

def remove_special_chars(df, columns=None):
    # Work on a copy so a directly-passed DataFrame is never mutated.
    df = df.copy()
    cols = columns if columns is not None else df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

ar.register_step("remove_special_chars", remove_special_chars)

# Now use it in any pipeline
result = ar.pipeline(frame, [
    ("strip_whitespace",),
    ("remove_special_chars",),
])

Contributor-friendly

This is how 90% of GSSoC contributors add new features — no C++ compiler, no pybind11, just pure Python.

Cleaning Functions

All built-in cleaning functions accept an ArFrame and return a new ArFrame. They can be called directly or via the pipeline; a direct-call sketch follows the table.

Function                        Description                              Key Parameters
drop_nulls(frame)               Remove rows with null/empty values       subset
fill_nulls(frame, value)        Replace nulls with a fill value          value, subset
drop_duplicates(frame)          Remove duplicate rows                    subset, keep
strip_whitespace(frame)         Trim whitespace from strings             subset
normalize_case(frame)           Normalize string case                    subset, case_type
rename_columns(frame, mapping)  Rename columns via dict                  mapping
cast_types(frame, mapping)      Cast column types via dict               mapping
clean(frame)                    Convenience: strip + drop nulls + dedup  boolean flags

See the API Reference for complete signatures and examples.
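
As a quick sketch of direct calls (argument shapes follow the table above; treat exact signatures as approximate until you check the API Reference):

Python
frame = ar.read_csv("data.csv")
frame = ar.rename_columns(frame, {"Name": "name"})       # rename via dict
frame = ar.cast_types(frame, {"revenue": "float64"})     # cast via dict
frame = ar.clean(frame)  # convenience: strip + drop nulls + dedup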

DataFrame Conversion

Arnio provides zero-copy (where possible) conversion between ArFrame and pandas:

Python
# ArFrame → pandas
df = ar.to_pandas(frame)

# pandas → ArFrame
frame = ar.from_pandas(df)

to_pandas() uses NumPy buffer interfaces for numeric columns (int64, float64, bool), avoiding row-by-row conversion where possible. String columns use one boundary crossing per column.
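
A minimal round-trip check of that behavior (a sketch; assumes the dtypes shown here are all supported as described above):

Python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.arange(3, dtype="int64"),   # numeric: buffer-backed, no per-row copy
    "name": ["ada", "bo", "cy"],         # string: one boundary crossing per column
})
round_trip = ar.to_pandas(ar.from_pandas(df))
assert round_trip["id"].dtype == np.int64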

Schema Scanning

Use scan_csv() to infer column names and types from a CSV without loading the full dataset:

Python
schema = ar.scan_csv("huge_file.csv", encoding="utf-8")
print(schema)
# {'id': 'int64', 'name': 'string', 'is_active': 'bool'}

This is useful for previewing large files before committing to a full load.
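
For example, you can gate a full load on the inferred schema. This sketch uses only scan_csv() and read_csv(); the required column names are illustrative:

Python
required = {"id", "name", "is_active"}
schema = ar.scan_csv("huge_file.csv")

# scan_csv() returns a dict, so its keys are the column names.
missing = required - set(schema)
if missing:
    raise ValueError(f"huge_file.csv is missing columns: {sorted(missing)}")

# Only pay for the full parse once the shape checks out.
frame = ar.read_csv("huge_file.csv")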

Data Profiling

profile() returns a DataQualityReport with high-signal dataset diagnostics: row and column counts, memory usage, duplicate rows, null counts, uniqueness, whitespace issues, semantic hints, sample values, warnings, and safe cleaning suggestions.

Python
report = ar.profile(frame)

print(report.summary())
print(report.columns["email"].semantic_type)
print(report.to_pandas())

Reports can be serialized with to_dict() or inspected as a pandas DataFrame with to_pandas().
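
For example, to persist a report as a build artifact (a minimal sketch; default=str guards against any non-JSON-native values the to_dict() payload might contain):

Python
import json

report = ar.profile(frame)
with open("quality_report.json", "w") as f:
    json.dump(report.to_dict(), f, indent=2, default=str)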

Auto Clean

suggest_cleaning() converts profile signals into pipeline-compatible cleaning steps. auto_clean() applies those steps for you.

Python
suggestions = ar.suggest_cleaning(frame)
clean = ar.pipeline(frame, suggestions)

# Safe mode trims whitespace only
safe = ar.auto_clean(frame)

# Strict mode also applies deterministic casts and exact deduplication
strict, report = ar.auto_clean(frame, mode="strict", return_report=True)

Safe by default

auto_clean(mode="safe") only applies low-risk whitespace cleanup. Use mode="strict" when deterministic casts and exact duplicate removal are acceptable for your workflow.

Schema Validation

Schema and Field let you express production data contracts directly in Python. Validation returns all issues instead of stopping at the first failure.

Python
schema = ar.Schema({
    "id": ar.Int64(nullable=False, unique=True),
    "email": ar.Email(nullable=False),
    "status": ar.String(allowed={"active", "blocked"}),
    "revenue": ar.Float64(nullable=True, min=0),
}, strict=True)

result = schema.validate(frame)
print(result.passed)
print(result.summary())
print(result.to_pandas())

Built-in field helpers include Int64, Float64, String, Bool, Email, and URL. Rules support nullable checks, min/max values, uniqueness, allowed sets, regular expressions, and string length bounds.
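
A sketch exercising a few of those rules together. nullable, unique, allowed, min, and min_length appear in the examples above; the pattern, max_length, and max keywords are assumptions inferred from the rule list, so verify them against the API Reference:

Python
schema = ar.Schema({
    "sku": ar.String(nullable=False, pattern=r"^[A-Z]{3}-\d{4}$"),  # regex rule (keyword assumed)
    "name": ar.String(min_length=2, max_length=80),                 # length bounds (max_length assumed)
    "score": ar.Float64(min=0, max=100),                            # value bounds (max assumed)
    "homepage": ar.URL(nullable=True),
})
result = schema.validate(frame)
print(result.summary())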