Internal Architecture
A technical deep dive into how Arnio combines a native CSV engine, Python orchestration, data quality profiling, validation, and pandas conversion.
Core Philosophy
Arnio is designed to provide high-performance, memory-efficient data ingestion and cleaning by leveraging C++ while maintaining a seamless, declarative Python API.
Data preprocessing often involves operations that are slow, repetitive, and easy to leave implicit: string cleanup, duplicate removal, quality checks, and schema validation. Arnio solves this by:
- Loading data directly into C++ memory structures.
- Performing built-in cleaning operations natively in C++ without Python GIL contention.
- Keeping orchestration, profiling, auto-clean suggestions, and validation readable in Python.
- Translating the final dataset to pandas through an efficient NumPy-backed boundary.
Python ↔ C++ Boundary
The boundary is managed with pybind11. The C++ core is compiled into a Python extension module (_arnio_cpp), and the Python API is a lightweight wrapper around it. The C++ runtime includes the CsvReader, the Frame/Column structures, type inference, and the native cleaning primitives.
Data Model
Arnio's data model is columnar, resembling Apache Arrow or modern pandas internals.
Column
A Column represents a 1D array of homogeneous data. Values are stored in strongly typed std::vector containers (for example, std::vector<double> for a float column). Nulls are tracked via a separate boolean mask, keeping the data vectors dense and cache-friendly.
Frame
A Frame is an ordered collection of Column objects. It maintains an index mapping column names to their respective Column objects for O(1) access.
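The Column-plus-mask layout and the Frame's name index can be sketched in Python. This is purely illustrative of the structure described above; the real implementation lives in C++, and the method names here are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Column:
    """1D homogeneous column: a dense value vector plus a separate null mask."""
    values: list  # dense data, no null sentinels mixed in
    mask: list    # mask[i] is True when row i is null

    def __len__(self):
        return len(self.values)

class Frame:
    """Ordered collection of Columns with O(1) lookup by name."""
    def __init__(self):
        self._order = []    # preserves column insertion order
        self._columns = {}  # name -> Column, the O(1) index

    def add(self, name, column):
        self._order.append(name)
        self._columns[name] = column

    def __getitem__(self, name):
        return self._columns[name]

# A dense float column where row 1 is null:
frame = Frame()
frame.add("price", Column(values=[1.5, 0.0, 3.25], mask=[False, True, False]))
```

Keeping the null mask out of the value vector is what lets the C++ side iterate tight, homogeneous arrays without branching on sentinel values.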
Pipeline Execution
The pipeline() function accepts a list of declarative steps. Arnio maintains a registry that maps step names (e.g., "strip_whitespace") to their implementations. For native operations, the Python wrapper calls into C++ directly; if a step is pure Python, the Frame is temporarily converted to pandas, the step runs there, and the result is converted back.
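The registry-and-dispatch pattern can be sketched as follows. In Arnio the registry entries point at C++ functions; here they are plain Python lambdas, and the step names are illustrative:

```python
# Registry mapping declarative step names to implementations
# (stand-ins for the native C++ entry points).
NATIVE_STEPS = {
    "strip_whitespace": lambda col: [v.strip() if isinstance(v, str) else v for v in col],
    "lowercase": lambda col: [v.lower() if isinstance(v, str) else v for v in col],
}

def run_pipeline(column, steps):
    """Apply declarative steps in order, dispatching through the registry."""
    for step in steps:
        if isinstance(step, str):
            column = NATIVE_STEPS[step](column)  # built-in step, looked up by name
        else:
            column = step(column)                # pure-Python callable fallback
    return column

cleaned = run_pipeline(["  Foo ", "BAR"], ["strip_whitespace", "lowercase"])
# cleaned == ["foo", "bar"]
```

Dispatching by name is what makes the pipeline declarative: the step list can be serialized, logged, or generated programmatically (as suggest_cleaning() does) without holding references to the underlying functions.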
Data Quality Layer
The newer quality layer sits above the native frame. profile() converts an ArFrame into pandas for rich diagnostics, then returns immutable report objects that expose summaries, JSON-friendly dictionaries, and pandas views.
This layer intentionally favors clear Python data contracts over hidden magic. suggest_cleaning() turns report signals into standard pipeline steps, while auto_clean() applies only explicit built-in operations.
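The report-to-steps translation might look roughly like this. The report keys below are hypothetical stand-ins for whatever signals profile() actually records; the point is that the output is an ordinary step list, not hidden behavior:

```python
def suggest_cleaning(report):
    """Turn profiling signals into declarative pipeline step names.

    `report` is a plain dict here; the keys are illustrative guesses
    at the kind of signals a profile report might expose.
    """
    steps = []
    if report.get("whitespace_affected_rows", 0) > 0:
        steps.append("strip_whitespace")
    if report.get("duplicate_rows", 0) > 0:
        steps.append("drop_duplicates")
    return steps

suggestions = suggest_cleaning({"whitespace_affected_rows": 4, "duplicate_rows": 0})
# suggestions == ["strip_whitespace"]
```

Because the suggestions are just standard step names, a caller can inspect, edit, or veto them before anything runs, which is the "no hidden magic" contract described above.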
Schema Validation
Schema, Field, and the field builders (Int64, Float64, String, Bool, Email, URL) form the production data contract layer. Validation collects all issues with column name, rule, row index, and value so callers can decide whether to fail a job, quarantine rows, or report problems upstream.
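The collect-all-issues behavior can be illustrated with a single rule. This is not Arnio's validator, just a sketch of the issue shape (column name, rule, row index, value) described above, using a deliberately simple email pattern:

```python
import re

# Simplified email pattern for illustration only.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def validate_email_column(name, values):
    """Collect every violation instead of failing fast, so the caller
    can decide whether to abort, quarantine rows, or report upstream."""
    issues = []
    for row, value in enumerate(values):
        if value is None or not EMAIL_RE.fullmatch(value):
            issues.append({"column": name, "rule": "email", "row": row, "value": value})
    return issues

issues = validate_email_column("contact", ["a@b.com", "not-an-email"])
# issues == [{"column": "contact", "rule": "email", "row": 1, "value": "not-an-email"}]
```

Returning structured issues rather than raising on the first failure is what makes the layer usable as a production data contract: one validation pass yields a complete picture of the batch.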
Converting to Pandas
The to_pandas() function uses NumPy-backed paths for numeric and boolean types, avoiding row-by-row conversion where possible. String columns currently require instantiation of Python str objects.
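The NumPy-backed path for a numeric column amounts to one bulk copy plus a vectorized null fill, rather than a Python-level loop. A minimal sketch, assuming the dense-values-plus-mask layout from the data model section (the function name is illustrative, not Arnio's API):

```python
import numpy as np
import pandas as pd

def numeric_column_to_pandas(values, mask):
    """Bulk-convert a dense numeric column and its null mask to a
    pandas Series, writing NaN where the mask marks nulls."""
    arr = np.asarray(values, dtype=np.float64)   # one bulk conversion
    arr[np.asarray(mask, dtype=bool)] = np.nan   # vectorized null fill
    return pd.Series(arr)

s = numeric_column_to_pandas([1.0, 2.0, 3.0], [False, True, False])
```

Strings do not get this shortcut because each value must become a distinct Python str object, which is why string-heavy frames dominate conversion cost.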
Runtime Flow
CSV file
↓
read_csv / scan_csv
↓
C++ CsvReader → Frame / Column storage
↓
Native cleaning pipeline
↓
profile / suggest_cleaning / validate
↓
to_pandas for analysis, ML, or downstream ETL