Core Philosophy

Arnio is designed to provide high-performance, memory-efficient data ingestion and cleaning by leveraging C++ while maintaining a seamless, declarative Python API.

Data preprocessing often involves operations that are slow, repetitive, and easy to leave implicit: string cleanup, duplicate removal, quality checks, and schema validation. Arnio solves this by:

  1. Loading data directly into C++ memory structures.
  2. Performing built-in cleaning operations natively in C++ without Python GIL contention.
  3. Keeping orchestration, profiling, auto-clean suggestions, and validation readable in Python.
  4. Translating the final dataset to pandas through an efficient NumPy-backed boundary (see the sketch after this list).
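
Concretely, a typical session might look like the sketch below. read_csv, pipeline(), to_pandas(), and the "strip_whitespace" step are names used in this document; the top-level module name, the exact call signatures, and the drop_duplicates step are assumptions for illustration.

    import arnio  # top-level module name assumed

    # 1. Parse the CSV straight into native C++ column storage.
    frame = arnio.read_csv("orders.csv")

    # 2. Declare cleaning steps; built-ins run natively, GIL-free.
    frame = arnio.pipeline(frame, steps=[
        "strip_whitespace",    # built-in step named in the registry
        "drop_duplicates",     # hypothetical built-in step
    ])

    # 3. Cross the boundary once at the end, via NumPy-backed paths.
    df = frame.to_pandas()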

Python ↔ C++ Boundary

The boundary is managed using pybind11. The C++ core is compiled into a Python extension module (_arnio_cpp). The Python API serves as a lightweight wrapper around this extension.

The C++ runtime includes the CsvReader, the Frame/Column structures, type inference, and the native cleaning primitives.
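
In code, the wrapper layer is deliberately thin. A minimal sketch, assuming _arnio_cpp lives inside the arnio package and exposes a read_csv entry point (the ArFrame name appears later in this document):

    from arnio import _arnio_cpp  # compiled pybind11 extension

    class ArFrame:
        """Thin Python handle around the native C++ Frame."""
        def __init__(self, native_frame):
            self._native = native_frame  # opaque C++ object

    def read_csv(path: str) -> ArFrame:
        # Parsing and storage happen entirely in C++; Python only
        # forwards arguments and wraps the returned handle.
        return ArFrame(_arnio_cpp.read_csv(path))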

Data Model

Arnio's data model is columnar, resembling Apache Arrow or modern pandas internals.

Column

A Column represents a 1D array of homogeneous data. Data is stored in strongly typed std::vector containers (e.g., std::vector<double>). Nulls are tracked via a separate boolean mask, keeping the data vectors dense and cache-friendly.
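
The layout is easy to picture in NumPy terms. The sketch below is purely illustrative of the dense-vector-plus-mask design, not part of Arnio's API:

    import numpy as np

    # Dense, homogeneous values: nulls do not punch holes in the data.
    values = np.array([3.5, 0.0, 7.2, 0.0], dtype=np.float64)

    # Separate validity mask: True marks a null slot.
    null_mask = np.array([False, True, False, True])

    # Operations skip nulls by masking, while the value vector
    # stays contiguous and cache-friendly.
    mean = values[~null_mask].mean()  # 5.35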

Frame

A Frame is an ordered collection of Column objects. It maintains an index mapping column names to their respective Column objects for O(1) access.
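
The name index behaves like a hash map. A Python dict models both the ordering and the O(1) lookup (an illustrative stand-in only):

    # Stand-in columns; in Arnio these are typed C++ vectors.
    user_id = [1, 2, 3]
    email = ["a@x.io", "b@x.io", "c@x.io"]

    # Dicts preserve insertion order (Python 3.7+) and hash by name,
    # mirroring a Frame's ordered collection plus O(1) name index.
    frame = {"user_id": user_id, "email": email}
    column = frame["email"]  # O(1) access by name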

Pipeline Execution

The pipeline() function accepts a list of declarative steps. Arnio maintains a registry mapping names (e.g., "strip_whitespace") to function pointers. For native operations, the Python wrapper calls C++ directly. If a step is pure Python, the Frame is temporarily converted to pandas, the step runs against the resulting DataFrame, and the result is converted back.
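
A simplified sketch of that dispatch logic, with hypothetical names (the strip_whitespace symbol on the extension and the from_pandas inverse are assumptions):

    from arnio import _arnio_cpp  # compiled extension

    # Registry of built-in steps resolved to native entry points.
    NATIVE_STEPS = {
        "strip_whitespace": _arnio_cpp.strip_whitespace,
    }

    def run_step(frame, step):
        if isinstance(step, str):
            return NATIVE_STEPS[step](frame)  # stays in C++
        # Pure-Python step: temporary pandas round trip.
        df = step(frame.to_pandas())
        return from_pandas(df)  # hypothetical inverse conversion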

Data Quality Layer

The newer quality layer sits above the native frame. profile() converts an ArFrame into pandas for rich diagnostics, then returns immutable report objects that expose summaries, JSON-friendly dictionaries, and pandas views.

This layer intentionally favors clear Python data contracts over hidden magic. suggest_cleaning() turns report signals into standard pipeline steps, while auto_clean() applies only explicit built-in operations.
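
In use, the three entry points chain naturally. profile(), suggest_cleaning(), and auto_clean() are the documented names; the method and argument shapes below are assumptions:

    report = arnio.profile(frame)        # pandas-powered diagnostics
    print(report.summary())              # human-readable overview
    payload = report.to_dict()           # JSON-friendly dictionary

    # Turn report signals into ordinary pipeline steps...
    steps = arnio.suggest_cleaning(report)
    frame = arnio.pipeline(frame, steps=steps)

    # ...or apply only explicit built-in operations directly.
    frame = arnio.auto_clean(frame)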

Schema Validation

Schema, Field, and the field builders (Int64, Float64, String, Bool, Email, URL) form the production data contract layer. Validation collects all issues with column name, rule, row index, and value so callers can decide whether to fail a job, quarantine rows, or report problems upstream.
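
A sketch of the contract layer in use. Schema and the field builders are the names above; the constructor shape, the validate() call, and the issue attributes mirror this description but are otherwise assumptions:

    from arnio import Schema, Int64, String, Email

    schema = Schema({
        "user_id": Int64(nullable=False),  # keyword is an assumption
        "name":    String(),
        "contact": Email(),
    })

    # Validation collects every issue rather than stopping early.
    issues = schema.validate(frame)
    for issue in issues:
        # Column name, rule, row index, and offending value let the
        # caller decide: fail the job, quarantine rows, or report.
        print(issue.column, issue.rule, issue.row, issue.value)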

Converting to Pandas

The to_pandas() function uses NumPy-backed paths for numeric and boolean types, avoiding row-by-row conversion where possible. String columns currently require instantiation of Python str objects.
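
The numeric fast path boils down to reinterpreting the dense C++ buffer as a NumPy array instead of looping row by row. A miniature illustration, with the native buffer simulated as bytes:

    import numpy as np
    import pandas as pd

    # Stand-in for a dense C++ column buffer (std::vector<double>).
    raw = np.arange(4, dtype=np.float64).tobytes()

    # Zero-copy reinterpretation: no per-row Python objects created.
    values = np.frombuffer(raw, dtype=np.float64)

    # pandas builds the column from the array in one step.
    df = pd.DataFrame({"x": values})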

Runtime Flow

CSV file
  ↓
read_csv / scan_csv
  ↓
C++ CsvReader → Frame / Column storage
  ↓
Native cleaning pipeline
  ↓
profile / suggest_cleaning / validate
  ↓
to_pandas for analysis, ML, or downstream ETL