Current Reference Results

Tested on Ubuntu with Python 3.12, using a 1M-row CSV with mixed types. Each metric is averaged over 3 runs. The production-hardening and data-quality updates did not change the core benchmark workload.

pandas execution time: 4.73s
arnio reference execution time: 5.75s
Memory parity: ~1:1 (~211MB vs ~212MB peak RAM)

Honest Assessment

Arnio's C++ CSV reader is close to pandas on memory consumption in the reference run (~211MB vs ~212MB peak RAM). Execution speed is still behind: 5.75s against 4.73s, roughly 22% more wall-clock time, due to known bottlenecks in drop_duplicates and strip_whitespace. The next performance milestone is speed parity on the standard benchmark.

What the Pipeline Tests

Both pipelines perform identical operations on the same dataset:

| Step | pandas | arnio |
|---|---|---|
| Read CSV | pd.read_csv() | ar.read_csv() (C++) |
| Strip whitespace | .str.strip() per column | strip_whitespace (C++) |
| Normalize case | .str.lower() per column | normalize_case (C++) |
| Drop nulls | .dropna() | drop_nulls (C++) |
| Drop duplicates | .drop_duplicates() | drop_duplicates (C++) |
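
For orientation, the pandas side of these five steps can be sketched roughly as follows. This is a minimal illustration, not the benchmark script itself: the sample CSV, the `TEXT_COLUMNS` list, and the `clean_with_pandas` helper are made up for the example, and the arnio side is omitted because its exact call signatures are not shown here.

```python
import io

import pandas as pd

# Tiny made-up sample standing in for the 1M-row benchmark CSV.
SAMPLE_CSV = """name,city,age
  Alice ,  NYC ,30
BOB,london,25
  Alice ,  NYC ,30
carol,,41
"""

# Hypothetical list of text columns; the real dataset's schema may differ.
TEXT_COLUMNS = ["name", "city"]


def clean_with_pandas(source) -> pd.DataFrame:
    """Apply the five pipeline steps using plain pandas calls."""
    df = pd.read_csv(source)                # 1. Read CSV
    for col in TEXT_COLUMNS:
        df[col] = df[col].str.strip()       # 2. Strip whitespace, per column
        df[col] = df[col].str.lower()       # 3. Normalize case, per column
    df = df.dropna()                        # 4. Drop rows containing nulls
    df = df.drop_duplicates()               # 5. Drop exact duplicate rows
    return df.reset_index(drop=True)


cleaned = clean_with_pandas(io.StringIO(SAMPLE_CSV))
print(cleaned)
```

On this sample, the row with the missing city is dropped as a null, and the second Alice row is dropped as a duplicate once whitespace and case have been normalized.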

Bottleneck Analysis

Profiling reveals two primary contributors to the speed gap: drop_duplicates and strip_whitespace.

These two operations account for the majority of the gap. The CSV reader itself and the to_pandas conversion (using NumPy buffer interfaces) are already competitive.
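
One way to see which steps dominate is to time each step separately and compare shares of the total. The sketch below assumes nothing about arnio's internals: `time_steps` is a hypothetical helper, and the two lambda steps are pure-Python stand-ins operating on a list of strings, not the library's C++ functions.

```python
import time


def time_steps(data, steps):
    """Run each (name, func) step in order, feeding each output to the next.

    Returns the final data and a dict mapping step name to elapsed seconds.
    """
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Stand-in workload: a plain list of strings (hypothetical, not arnio's API).
rows = ["  a ", "b", "  a ", " c"] * 50_000
steps = [
    ("strip_whitespace", lambda xs: [x.strip() for x in xs]),
    # dict.fromkeys keeps first occurrences in order while removing repeats.
    ("drop_duplicates", lambda xs: list(dict.fromkeys(xs))),
]

result, timings = time_steps(rows, steps)
total = sum(timings.values())
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f}s ({seconds / total:.0%} of total)")
```

The same helper could wrap real pipeline calls; the point is only that per-step timings make it visible where the total goes.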

Reproduce on Your Machine

Terminal
# Generate the 1M row test dataset
python benchmarks/generate_data.py

# Run the benchmark
python benchmarks/benchmark_vs_pandas.py

# Or use the Makefile shortcut
make benchmark

The benchmark script runs 3 iterations and reports average execution time and peak RAM for both pandas and Arnio.
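
The general shape of such a measurement loop might look like the sketch below. It is an assumption-laden illustration rather than the contents of benchmark_vs_pandas.py: `benchmark` and `workload` are hypothetical names, and `tracemalloc` only sees allocations made through Python's allocator, so it would not capture memory allocated natively by C++ code. A real harness measuring peak RAM would more likely use a process-level metric.

```python
import statistics
import time
import tracemalloc


def benchmark(func, iterations=3):
    """Return (mean seconds, highest peak bytes) over `iterations` calls."""
    times, peaks = [], []
    for _ in range(iterations):
        tracemalloc.start()
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peaks.append(peak)
    # Averaging time and taking the max peak is a choice made for this sketch.
    return statistics.mean(times), max(peaks)


def workload():
    # Placeholder workload; swap in a pandas or arnio pipeline call.
    return [str(i).strip() for i in range(200_000)]


mean_seconds, peak_bytes = benchmark(workload)
print(f"mean time: {mean_seconds:.3f}s, peak: {peak_bytes / 1e6:.1f}MB")
```

Tracing allocations adds overhead, so timings taken this way run slower than untraced ones; measuring time and memory in separate passes avoids that distortion.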

Performance Roadmap

| Version | Target | Status |
|---|---|---|
| Current | Production-safe parsing, encoding support, and quality APIs without regressing the reference benchmark | Stable |
| Next | Speed parity: optimized drop_duplicates and strip_whitespace | Active |
| Later | Chunked processing, streaming execution, and larger-than-memory CSV workflows | Planned |

Help Close the Gap

C++ contributors can make a significant impact by optimizing the two bottleneck functions. See the open issues for performance-tagged work.