Current (main): Merged

Production Readiness & Data Quality Engine

The current codebase goes beyond CSV cleanup and adds practical production data checks.

  • profile() for nulls, duplicate rows, uniqueness, whitespace, semantic hints, memory usage, and cleaning suggestions
  • suggest_cleaning() and auto_clean() for safe or strict cleanup workflows
  • Schema, Field, and validation helpers that surface row-level data-contract failures
  • CSV hardening for quoted multiline records, duplicate headers, non-UTF-8 encodings, and clearer Python exceptions
  • Python 3.13 classifier and CI coverage alongside existing supported versions
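The checks listed above can be sketched in pure Python. Note that the function and key names below are illustrative only and do not claim to reproduce Arnio's actual API:

```python
def profile_rows(rows):
    """Sketch of profile()-style checks: nulls, duplicate rows, stray
    whitespace. `rows` is a list of dicts; all names are hypothetical."""
    null_counts = {}
    whitespace_cols = set()
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))  # fingerprint for duplicate detection
        if key in seen:
            duplicates += 1
        seen.add(key)
        for col, value in row.items():
            if value is None or value == "":
                null_counts[col] = null_counts.get(col, 0) + 1
            elif isinstance(value, str) and value != value.strip():
                whitespace_cols.add(col)
    return {"nulls": null_counts, "duplicate_rows": duplicates,
            "whitespace_columns": sorted(whitespace_cols)}

rows = [
    {"name": " Ada ", "age": 36},
    {"name": "Grace", "age": None},
    {"name": "Grace", "age": None},  # exact duplicate of the previous row
]
report = profile_rows(rows)
# one duplicate row, two nulls in "age", whitespace flagged in "name"
```

A real implementation would also report uniqueness ratios, semantic hints, and memory usage per column; the single-pass structure stays the same.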
v1.0.x: Released

Stable Release & Cross-Platform Packaging

The foundation release establishing Arnio as a production-ready library.

  • Cross-platform pre-compiled wheels via cibuildwheel — Windows, Linux (manylinux), macOS (Intel & Apple Silicon)
  • Google Colab compatibility out of the box
  • Production-grade packaging — resolved ModuleNotFoundError issues
  • Fully automated PyPI publishing pipeline via Trusted Publishing
  • CI/CD for Python 3.9–3.13 across all platforms
  • Stable public API marked "Production/Stable"
  • Zero-copy to_pandas() via NumPy buffer interfaces
  • Custom exception hierarchy: ArnioError, UnknownStepError, CsvReadError, TypeCastError
  • Pure-Python step registration via register_step()
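The zero-copy to_pandas() bullet rests on NumPy's buffer protocol. Independent of Arnio's internals, which this does not claim to reproduce, the underlying mechanism looks like:

```python
import numpy as np

# A mutable C-level buffer, standing in here for a native column buffer
raw = bytearray(np.arange(5, dtype="<i8").tobytes())

# np.frombuffer wraps the existing memory instead of copying it
col = np.frombuffer(raw, dtype="<i8")

# Proof of zero-copy: mutating the source bytes is visible through the array
raw[0:8] = (99).to_bytes(8, "little")
# col[0] now reads 99, with no copy ever having been made
```

Because no bytes are duplicated, converting even very large columns this way costs O(1) memory; the trade-off is that the array's lifetime is tied to the buffer that backs it.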
Next: Active Development

C++ Pipeline Optimization — Speed Parity

The primary engineering goal: match or exceed pandas execution speed on the standard benchmark.

  • Hash-based drop_duplicates — replace O(n²) naive comparison with O(n) hash deduplication
  • In-place strip_whitespace — eliminate unnecessary string copies
  • Optimize columnar iteration patterns in C++
  • Benchmark-driven development with CI-integrated regression detection
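The hash-based deduplication itself is planned for the C++ core; the algorithm, shown in Python for brevity, replaces pairwise comparison with a single pass over a hash set:

```python
def drop_duplicates_hashed(rows):
    """O(n) order-preserving deduplication via a hash set.

    Each row is hashed once; the naive O(n^2) version instead compares
    every row against every earlier row.
    """
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)  # hashable fingerprint of the row's values
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [[1, "a"], [2, "b"], [1, "a"], [3, "c"]]
deduped = drop_duplicates_hashed(rows)
# → [[1, "a"], [2, "b"], [3, "c"]]
```

In C++ the same shape falls out of `std::unordered_set` over row hashes; the cost shifts from comparisons to hashing, which is why it scales linearly.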

This is where contributors make the biggest impact.

If you're comfortable with C++, optimizing drop_duplicates and strip_whitespace is the single highest-value contribution you can make. See the open issues.

Scaling: Planned

Chunked Processing & Format Expansion

Scaling Arnio to handle files that don't fit in memory and expanding beyond CSV.

  • Chunked CSV reading — process files larger than available RAM
  • Parquet support — read and write Apache Parquet files
  • JSON support — ingest newline-delimited JSON (NDJSON)
  • Streaming pipeline execution for memory-constrained environments
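Chunked reading is not implemented yet; a minimal sketch of the pattern, using the stdlib csv module rather than Arnio's C++ reader, keeps only one chunk resident at a time:

```python
import csv
import io

def read_csv_chunks(fileobj, chunk_size):
    """Yield (header, rows) pairs with at most chunk_size rows each,
    so memory use is bounded by the chunk, not the file."""
    reader = csv.reader(fileobj)
    header = next(reader)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield header, chunk
            chunk = []
    if chunk:  # flush the final, possibly partial, chunk
        yield header, chunk

data = io.StringIO("a,b\n1,2\n3,4\n5,6\n")
chunks = list(read_csv_chunks(data, chunk_size=2))
# two chunks: rows [["1","2"], ["3","4"]], then [["5","6"]]
```

The same generator shape extends naturally to NDJSON (one json.loads per line) and to streaming pipeline steps that consume one chunk at a time.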

Want to influence the roadmap?

Open an Issue