Benchmarks
Transparent, reproducible performance comparison between Arnio and pandas.
Current Reference Results
Tested on Ubuntu with Python 3.12, using a 1M-row CSV with mixed types. Each metric is averaged over 3 runs. The production-hardening and data-quality updates did not change the core benchmark workload.
Honest Assessment
Arnio's C++ CSV reader is close to pandas on memory consumption in the reference run. Execution speed still trails pandas because of known bottlenecks in drop_duplicates and strip_whitespace. The next performance milestone is speed parity on the standard benchmark.
What the Pipeline Tests
Both pipelines perform identical operations on the same dataset:
| Step | pandas | arnio |
|---|---|---|
| Read CSV | pd.read_csv() | ar.read_csv() (C++) |
| Strip whitespace | .str.strip() per column | strip_whitespace (C++) |
| Normalize case | .str.lower() per column | normalize_case (C++) |
| Drop nulls | .dropna() | drop_nulls (C++) |
| Drop duplicates | .drop_duplicates() | drop_duplicates (C++) |
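For orientation, the pandas column of the table can be sketched as a single function. This is a minimal illustration, not the code from benchmarks/benchmark_vs_pandas.py: the function name `pandas_pipeline`, the `text_columns` parameter, and the four-row sample are assumptions made for the example.

```python
import io

import pandas as pd


def pandas_pipeline(source, text_columns):
    """Apply the five pandas steps from the table to a CSV source.

    `text_columns` is a hypothetical parameter naming the string columns;
    the real benchmark script may select them differently.
    """
    df = pd.read_csv(source)                       # Read CSV
    for col in text_columns:
        df[col] = df[col].str.strip()              # Strip whitespace
        df[col] = df[col].str.lower()              # Normalize case
    df = df.dropna()                               # Drop nulls
    df = df.drop_duplicates()                      # Drop duplicates
    return df.reset_index(drop=True)


# Tiny made-up sample: one duplicate after cleaning, one row with a null.
sample = io.StringIO(
    "name,city,age\n"
    " Alice ,NYC,30\n"
    "alice,nyc,30\n"
    "Bob,,25\n"
    "Carol, Paris ,41\n"
)
result = pandas_pipeline(sample, text_columns=["name", "city"])
print(result)
```

Note that the two "Alice" rows only become duplicates after whitespace and case are normalized, which is why the order of the steps matters.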
Bottleneck Analysis
Profiling reveals two primary contributors to the speed gap:
- drop_duplicates: The current C++ implementation uses a naive O(n²) comparison. The next target is hash-based deduplication.
- strip_whitespace: String operations create unnecessary copies. In-place mutation is planned.
These two operations account for the majority of the gap. The CSV reader itself and the to_pandas conversion (using NumPy buffer interfaces) are already competitive.
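The difference between pairwise and hash-based deduplication can be illustrated in a few lines of Python. This is a conceptual sketch only: Arnio's implementation is C++, and the function names `dedupe_pairwise` and `dedupe_hashed` are hypothetical, chosen for this example.

```python
from typing import Hashable, Sequence

Row = tuple[Hashable, ...]


def dedupe_pairwise(rows: Sequence[Row]) -> list[Row]:
    """Quadratic approach: compare each row against every row kept so far."""
    kept: list[Row] = []
    for row in rows:
        if not any(row == other for other in kept):
            kept.append(row)
    return kept


def dedupe_hashed(rows: Sequence[Row]) -> list[Row]:
    """Hash-based approach: one set lookup per row, O(n) on average."""
    seen: set[Row] = set()
    kept: list[Row] = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            kept.append(row)
    return kept


rows = [("alice", 30), ("bob", 25), ("alice", 30), ("carol", 41), ("bob", 25)]
unique = dedupe_hashed(rows)
print(unique)
```

Both functions keep the first occurrence of each row and preserve input order. The hashed version trades extra memory for the set of seen rows against a linear number of lookups, which is the trade-off a hash-based C++ version would also face.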
Reproduce on Your Machine
# Generate the 1M row test dataset
python benchmarks/generate_data.py
# Run the benchmark
python benchmarks/benchmark_vs_pandas.py
# Or use the Makefile shortcut
make benchmark
The benchmark script runs 3 iterations and reports average execution time and peak RAM for both pandas and Arnio.
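A measurement loop of this shape could look roughly like the sketch below. It is an assumption about the general approach, not the contents of benchmark_vs_pandas.py: the `measure` helper is hypothetical, and `tracemalloc` only counts allocations made through Python's allocator, so memory held in native C++ buffers would not show up here. The real script may measure peak RAM differently.

```python
import time
import tracemalloc
from statistics import mean
from typing import Callable


def measure(pipeline: Callable[[], object], runs: int = 3) -> dict[str, float]:
    """Run `pipeline` several times; return average wall time and peak memory."""
    timings: list[float] = []
    peaks_mb: list[float] = []
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        pipeline()
        timings.append(time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
        tracemalloc.stop()
        peaks_mb.append(peak / 1e6)
    return {"avg_seconds": mean(timings), "avg_peak_mb": mean(peaks_mb)}


# Stand-in workload; a real run would pass the pandas or Arnio pipeline.
stats = measure(lambda: sorted(range(200_000), reverse=True))
print(stats)
```

Tracing allocations slows the workload down, so a stricter harness would time and trace in separate passes.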
Performance Roadmap
| Version | Target | Status |
|---|---|---|
| Current | Production-safe parsing, encoding support, and quality APIs without regressing the reference benchmark | Stable |
| Next | Speed parity: optimized drop_duplicates and strip_whitespace | Active |
| Later | Chunked processing, streaming execution, and larger-than-memory CSV workflows | Planned |
Help Close the Gap
C++ contributors can make a significant impact by optimizing the two bottleneck functions. See the open issues for performance-tagged work.