Contributing
Arnio is a GSSoC 2026 project. We welcome contributors of all levels across Python, C++, tests, docs, data quality, validation rules, and examples.
Quick Start
macOS / Linux
git clone https://github.com/im-anishraj/arnio.git
cd arnio
make install
make test
make lint
Windows
Install Visual Studio Build Tools with the "Desktop development with C++" workload, then:
git clone https://github.com/im-anishraj/arnio.git
cd arnio
pip install -e ".[dev]"
pre-commit install
Tip
Windows users can install make via Chocolatey (choco install make), or use WSL for a faster setup.
Adding a Python Pipeline Step
Many new features don't require touching C++. You can write a pure Python step and register it with Arnio, then add focused tests and documentation.
Step 1: Write the function
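import arnio as ar

def remove_special_chars(df, columns=None):
    # Default to every string (object-dtype) column when none are given.
    cols = columns if columns is not None else df.select_dtypes("object").columns
    for col in cols:
        # Keep only letters, digits, and whitespace.
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df

# Make the step available to ar.pipeline() under this name.
ar.register_step("remove_special_chars", remove_special_chars)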
Step 2: Write tests
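import arnio as ar

# remove_special_chars is the function from Step 1; import it from wherever you defined it.
# sample_csv is expected to be a pytest fixture that yields a path to a CSV file.
def test_remove_special_chars(sample_csv):
    ar.register_step("remove_special_chars", remove_special_chars)
    frame = ar.read_csv(sample_csv)
    result = ar.pipeline(frame, [
        ("remove_special_chars",),
    ])
    df = ar.to_pandas(result)
    assert "name" in df.columns
    # Add your specific assertions here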
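For the "specific assertions" placeholder, it can help to check the step function directly against a small hand-built DataFrame before going through the pipeline. The following is a minimal sketch that uses only pandas, with made-up sample values; it repeats the Step 1 function so the snippet runs on its own, and the test name is illustrative.

```python
import pandas as pd


def remove_special_chars(df, columns=None):
    # Copy of the Step 1 function so this snippet is self-contained.
    cols = columns if columns is not None else df.select_dtypes("object").columns
    for col in cols:
        df[col] = df[col].str.replace(r"[^a-zA-Z0-9\s]", "", regex=True)
    return df


def test_remove_special_chars_on_plain_dataframe():
    # Hypothetical sample data; the columns are passed explicitly so the
    # test does not depend on dtype inference.
    df = pd.DataFrame({"name": ["Al!ce", "B@b #2"], "age": [30, 25]})
    result = remove_special_chars(df, columns=["name"])

    # Punctuation is removed, whitespace and digits are kept.
    assert result["name"].tolist() == ["Alce", "Bb 2"]
    # Columns that were not selected are left untouched.
    assert result["age"].tolist() == [30, 25]
```

A direct test like this pins down the step's behavior on known input, while the pipeline test above confirms that registration and execution through Arnio work.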
Step 3: Open a PR
That's it. No build system changes, no C++ compiler, no pybind11.
Data Quality Contributions
The data quality layer is also friendly to Python contributors. High-impact areas include semantic detectors, validation rules, examples, and tests around messy real-world CSVs.
- Profiling: Add safe, explainable signals to profile(), such as date-like columns, constant columns, or suspicious cardinality.
- Validation: Extend Field rules only when the behavior is deterministic and easy to test.
- Auto-cleaning: Keep suggestions conservative; anything destructive should require explicit user choice.
- Documentation: Add before/after examples that show real user workflows instead of isolated toy snippets.
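As an illustration of what a "safe, explainable signal" might look like, here is a rough sketch in plain pandas that flags constant columns and suspiciously high cardinality. It is not Arnio's profile() implementation: the function name profile_signals, the flag names, and the 0.95 threshold are all assumptions made for this example.

```python
import pandas as pd


def profile_signals(df, high_cardinality_ratio=0.95):
    """Return {column: [flags]} for a DataFrame (hypothetical helper)."""
    signals = {}
    n_rows = len(df)
    for col in df.columns:
        distinct = df[col].nunique(dropna=True)
        flags = []
        if distinct <= 1:
            # At most one distinct non-null value: the column carries no information.
            flags.append("constant")
        elif n_rows > 0 and distinct / n_rows >= high_cardinality_ratio:
            # Nearly every row is unique: possibly an ID or free-text column.
            flags.append("high_cardinality")
        signals[col] = flags
    return signals


if __name__ == "__main__":
    sample = pd.DataFrame(
        {
            "country": ["IN", "IN", "IN", "IN"],
            "id": [1, 2, 3, 4],
            "city": ["A", "A", "B", "B"],
        }
    )
    print(profile_signals(sample))
```

Each flag maps to a single rule that can be stated in one sentence, which keeps the signal easy to explain to users and easy to test deterministically.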
C++ Contributions
For developers comfortable with C++, the highest-impact work right now is performance optimization:
- drop_duplicates: Replace naive O(n²) with hash-based deduplication
- strip_whitespace: Implement in-place mutation to avoid copies
These two functions are the primary performance bottlenecks. See Benchmarks for details.
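To make the drop_duplicates idea concrete, here is a small Python sketch contrasting a quadratic scan with hash-based deduplication. It only illustrates the algorithmic change; it is not Arnio's C++ code, and it assumes rows are lists of hashable values and that the first occurrence of each row is the one to keep.

```python
def drop_duplicates_naive(rows):
    """Quadratic approach: each row is compared against every kept row."""
    kept = []
    for row in rows:
        if row not in kept:  # linear scan of kept rows, so O(n^2) overall
            kept.append(row)
    return kept


def drop_duplicates_hashed(rows):
    """Hash-based approach: expected O(n) using a set of row keys."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row)  # tuples are hashable, lists are not
        if key not in seen:  # average O(1) membership test
            seen.add(key)
            kept.append(row)
    return kept


if __name__ == "__main__":
    rows = [["a", 1], ["b", 2], ["a", 1], ["c", 3], ["b", 2]]
    print(drop_duplicates_hashed(rows))
```

Both functions return the same rows in the same order; the hashed version trades extra memory for the set of keys in exchange for avoiding the repeated scans.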
Pull Request Process
- Fork the repo and create your branch from main.
- Conventional Commits: Ensure your PR title and commits follow the Conventional Commits specification (e.g., feat: add support for..., fix: resolve issue with...). This is required for our automated release system.
- If you've added code that should be tested, add tests.
- If you've changed APIs, update the documentation.
- Ensure the test suite passes: make test or pytest tests/ -v
- Ensure your code passes linting and formatting: make lint and pre-commit run --all-files
- Issue the pull request.
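If you want to sanity-check a PR title locally before pushing, a quick script along these lines may help. This is a rough sketch, not the check Arnio's release automation actually runs: the list of accepted types is an assumption borrowed from common Conventional Commits tooling (the specification itself only defines feat and fix and permits others), and the function name is hypothetical.

```python
import re

# Assumed type list; adjust to whatever the project's release tooling accepts.
_TYPES = "feat|fix|docs|style|refactor|perf|test|build|ci|chore|revert"

# <type>[(scope)][!]: <description>
_TITLE_RE = re.compile(rf"^({_TYPES})(\([^()\s]+\))?!?: \S.*$")


def is_conventional_title(title):
    """Return True if the title looks like a Conventional Commits header."""
    return _TITLE_RE.match(title) is not None


if __name__ == "__main__":
    for title in [
        "feat: add support for nullable columns",
        "fix(csv): resolve issue with quoted fields",
        "Added some stuff",
    ]:
        print(title, "->", is_conventional_title(title))
```

The first two sample titles pass and the third fails, because it has no type prefix followed by a colon and a space.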
Code Style
Arnio uses:
- Python: black (line-length 88) + ruff
- C++: clang-format
pre-commit runs these automatically before each commit if installed.