API Reference
Complete function signatures for the arnio public API after the production-readiness and data-quality updates.
I/O — arnio.io
Read a CSV file into an ArFrame via the C++ backend.
| Parameter | Type | Description |
|---|---|---|
| path | str \| os.PathLike | Path to CSV, TSV, or TXT file. |
| delimiter | str | Column delimiter. Default: "," |
| has_header | bool | Whether the first row is a header. Default: True |
| usecols | list[str] \| None | Subset of columns to load. |
| nrows | int \| None | Maximum number of rows to read. |
| encoding | str | File encoding. Default: "utf-8" |
Returns: ArFrame
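The parameter semantics above can be illustrated with a stdlib-only sketch. It does not call arnio and is not its C++ reader; `read_rows` is a hypothetical helper, and the positional names it generates when `has_header` is false are an assumption made for illustration.

```python
import csv
import io


def read_rows(text, delimiter=",", has_header=True, usecols=None, nrows=None):
    """Hypothetical stand-in: parse delimited text into (columns, rows)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return [], []
    if has_header:
        columns, data = rows[0], rows[1:]
    else:
        # Assumption: synthesize positional names when there is no header row.
        columns, data = [f"col_{i}" for i in range(len(rows[0]))], rows
    if nrows is not None:
        data = data[:nrows]  # cap the number of data rows
    if usecols is not None:
        keep = [columns.index(name) for name in usecols]
        columns = [columns[i] for i in keep]
        data = [[row[i] for i in keep] for row in data]
    return columns, data


sample = "id\tname\tcity\n1\tAda\tLondon\n2\tLin\tOslo\n3\tSam\tLima\n"
columns, data = read_rows(sample, delimiter="\t", usecols=["id", "name"], nrows=2)
print(columns, data)
```

Here `delimiter` selects tab-separated parsing, `usecols` narrows the result to two columns, and `nrows` stops after two data rows.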
Return the schema (column names and inferred types) without loading data into memory.
| Parameter | Type | Description |
|---|---|---|
| path | str \| os.PathLike | Path to CSV, TSV, or TXT file. |
| delimiter | str | Column delimiter. Default: "," |
| encoding | str | File encoding. Non-UTF-8 input is transcoded before native scanning. |
Returns: dict[str, str] — e.g. {"id": "int64", "name": "string"}
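A rough idea of how a schema can be inferred in a single streaming pass, without keeping rows around, is sketched below. This is not arnio's native scanner: `scan_schema_sketch` is a hypothetical name, the widening order int64 → float64 → string is an assumption, and so is the "float64" label (only "int64" and "string" appear in the example above).

```python
import csv
import io


def scan_schema_sketch(text, delimiter=","):
    """Hypothetical stand-in: infer one type label per column in one pass."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    # Candidate labels from narrowest to widest; str() never fails,
    # so every column ends up with some label.
    order = ["int64", "float64", "string"]
    parsers = {"int64": int, "float64": float, "string": str}
    level = {name: 0 for name in header}
    for row in reader:
        for name, value in zip(header, row):
            if value == "":
                continue  # empty cells do not influence the inferred type
            while True:
                try:
                    parsers[order[level[name]]](value)
                    break
                except ValueError:
                    level[name] += 1  # widen to the next candidate
    # In this sketch a column with no values keeps the narrowest label.
    return {name: order[level[name]] for name in header}


schema = scan_schema_sketch("id,name,score\n1,Ada,9.5\n2,Lin,7\n")
print(schema)
```

Only one small integer per column is held in memory, which is the property the schema scan is described as having.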
Cleaning — arnio.cleaning
Remove rows containing null or empty values.
| Parameter | Type | Description |
|---|---|---|
| frame | ArFrame | Input frame. |
| subset | list[str] \| None | Columns to check. If None, checks all. |
Returns: ArFrame
Replace null/empty values with a given fill value.
| Parameter | Type | Description |
|---|---|---|
| frame | ArFrame | Input frame. |
| value | Any | Fill value (scalar). |
| subset | list[str] \| None | Columns to fill. |
Returns: ArFrame
Remove duplicate rows.
| Parameter | Type | Description |
|---|---|---|
| frame | ArFrame | Input frame. |
| subset | list[str] \| None | Columns to check for duplicates. |
| keep | str \| bool | Which duplicate to keep: "first", "last", "none", or False. |
Returns: ArFrame
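The `keep` options can be illustrated on plain Python dictionaries. The sketch below is not arnio code; `drop_duplicates_sketch` is a hypothetical helper, and it assumes that "none" and False both mean "discard every row that has a duplicate".

```python
from collections import Counter


def drop_duplicates_sketch(rows, subset=None, keep="first"):
    """Hypothetical stand-in operating on a list of dict rows."""
    def key(row):
        names = subset if subset is not None else sorted(row)
        return tuple(row[name] for name in names)

    if keep in ("none", False):
        # Assumption: drop every row whose key occurs more than once.
        counts = Counter(key(row) for row in rows)
        return [row for row in rows if counts[key(row)] == 1]
    if keep not in ("first", "last"):
        raise ValueError(f"unsupported keep value: {keep!r}")
    # For "last", walk the rows backwards and restore the order afterwards.
    ordered = rows if keep == "first" else list(reversed(rows))
    seen, kept = set(), []
    for row in ordered:
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept if keep == "first" else list(reversed(kept))


rows = [
    {"id": 1, "name": "Ada"},
    {"id": 1, "name": "Ada B."},
    {"id": 2, "name": "Lin"},
]
first = drop_duplicates_sketch(rows, subset=["id"], keep="first")
last = drop_duplicates_sketch(rows, subset=["id"], keep="last")
none = drop_duplicates_sketch(rows, subset=["id"], keep="none")
```

With `subset=["id"]`, the first two rows count as duplicates even though their names differ.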
Trim leading and trailing whitespace from string columns.
Returns: ArFrame
Normalize string columns to "lower", "upper", or "title" case.
Returns: ArFrame
Rename columns using a {old_name: new_name} dictionary.
Returns: ArFrame
Cast columns to specified types via a {column: type_str} dictionary.
Returns: ArFrame
Convenience function to apply common cleaning operations in order: strip whitespace → drop nulls → drop duplicates.
Returns: ArFrame
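The order of the three operations matters, as a small pure-Python sketch can show. It is not arnio's implementation; `clean_sketch` is a hypothetical helper, and treating a whitespace-only string as empty after stripping is an assumption.

```python
def clean_sketch(rows):
    """Hypothetical stand-in: strip whitespace, drop nulls, drop duplicates."""
    # 1. Strip leading and trailing whitespace from string values.
    stripped = [
        {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        for row in rows
    ]
    # 2. Drop rows containing None or an empty string.
    non_null = [
        row for row in stripped
        if all(v is not None and v != "" for v in row.values())
    ]
    # 3. Drop exact duplicates, keeping the first occurrence.
    seen, result = set(), []
    for row in non_null:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"name": " Ada ", "city": "London"},
    {"name": "Ada", "city": "London"},   # duplicate only after stripping
    {"name": "   ", "city": "Oslo"},     # empty only after stripping
    {"name": "Lin", "city": None},
]
cleaned = clean_sketch(rows)
```

Stripping first exposes both the whitespace-only value and the near-duplicate row, so the later steps can remove them.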
Pipeline — arnio.pipeline
Apply a list of cleaning steps sequentially. Each step is a tuple: (step_name,) or (step_name, kwargs_dict).
| Parameter | Type | Description |
|---|---|---|
| frame | ArFrame | Input frame. |
| steps | list[tuple] | Ordered list of (name,) or (name, kwargs) tuples. |
Returns: ArFrame
Raises: UnknownStepError if a step name is not found.
Register a custom Python pipeline step. The function fn should accept and return a pandas.DataFrame.
| Parameter | Type | Description |
|---|---|---|
| name | str | Step name for use in pipeline(). |
| fn | Callable | Function: DataFrame → DataFrame. |
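
The step-tuple convention and the step registry can be sketched with a plain dictionary. Everything here is a stand-in: `register_step_sketch`, `pipeline_sketch`, `UnknownStepSketchError`, and the "truncate" step are hypothetical, and the steps operate on lists of strings rather than an ArFrame or a pandas.DataFrame.

```python
class UnknownStepSketchError(LookupError):
    """Stand-in for the UnknownStepError described above."""


STEPS = {}  # step name -> callable


def register_step_sketch(name, fn):
    STEPS[name] = fn


def pipeline_sketch(data, steps):
    """Apply (name,) or (name, kwargs) steps in order."""
    for step in steps:
        name = step[0]
        kwargs = step[1] if len(step) > 1 else {}
        if name not in STEPS:
            available = ", ".join(sorted(STEPS))
            raise UnknownStepSketchError(
                f"unknown step {name!r}; available: {available}"
            )
        data = STEPS[name](data, **kwargs)
    return data


register_step_sketch("strip_whitespace", lambda rows: [r.strip() for r in rows])
register_step_sketch("truncate", lambda rows, width: [r[:width] for r in rows])

result = pipeline_sketch(
    ["  alpha ", " beta"],
    [("strip_whitespace",), ("truncate", {"width": 3})],
)
```

A one-element tuple runs the step with no keyword arguments; a two-element tuple passes the dictionary as keyword arguments. An unregistered name fails before any later step runs, and the error message lists the registered names.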
Conversion — arnio.convert
Convert an ArFrame to a pandas.DataFrame. Uses zero-copy NumPy buffer interfaces for numeric columns.
Returns: pandas.DataFrame
Convert a pandas.DataFrame to an ArFrame. Handles pd.NA and np.nan conversion to null.
Returns: ArFrame
Raises: TypeError if columns contain nested/complex types.
Data Quality — arnio.quality
Profile an ArFrame for quality signals before analysis.
| Parameter | Type | Description |
|---|---|---|
| frame | ArFrame | Input frame to inspect. |
| sample_size | int | Number of non-null sample values to keep per column. |
Returns: DataQualityReport
Return pipeline-compatible cleaning steps based on detected quality signals.
Returns: list[tuple[str, dict[str, Any]]] — for example [("strip_whitespace", {"subset": ["name"]})]
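One way such suggestions could be derived is sketched below for the whitespace case only. This is a guess at the general idea, not arnio's detection logic; `suggest_steps_sketch` is a hypothetical helper working on a plain dict of column values.

```python
def suggest_steps_sketch(columns):
    """Hypothetical stand-in: columns maps column name -> list of values."""
    # A column qualifies if any string value changes when stripped.
    padded = [
        name
        for name, values in columns.items()
        if any(isinstance(v, str) and v != v.strip() for v in values)
    ]
    steps = []
    if padded:
        steps.append(("strip_whitespace", {"subset": padded}))
    return steps


suggestions = suggest_steps_sketch({"name": [" Ada", "Lin"], "age": [36, 41]})
print(suggestions)
```

The result uses the same (name, kwargs) tuple shape that the pipeline accepts, so it can be passed on unchanged.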
Apply built-in automatic cleaning based on the quality report.
| Parameter | Type | Description |
|---|---|---|
| mode | "safe" \| "strict" | "safe" trims whitespace only. "strict" also applies deterministic casts and exact duplicate removal. |
| return_report | bool | Return the pre-cleaning DataQualityReport with the cleaned frame. |
Returns: ArFrame or tuple[ArFrame, DataQualityReport]
Whole-frame quality report returned by profile().
| Attribute/Method | Description |
|---|---|
| .row_count, .column_count | Frame dimensions. |
| .memory_usage | Frame memory usage in bytes. |
| .duplicate_rows, .duplicate_ratio | Exact duplicate-row diagnostics. |
| .columns | Mapping of column name to ColumnProfile. |
| .suggestions | Pipeline-compatible suggested cleaning steps. |
| .summary(), .to_dict(), .to_pandas() | Compact, JSON-friendly, or tabular output. |
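
The duplicate diagnostics can be pictured with a short sketch. The exact definitions are not stated above, so this assumes `duplicate_rows` counts rows that repeat an earlier row and `duplicate_ratio` divides that count by the total row count; `duplicate_stats` is a hypothetical helper.

```python
def duplicate_stats(rows):
    """Hypothetical stand-in: return (duplicate_rows, duplicate_ratio)."""
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            duplicates += 1  # an exact repeat of an earlier row
        else:
            seen.add(key)
    # Assumption: an empty input has a ratio of 0.0 rather than raising.
    ratio = duplicates / len(rows) if rows else 0.0
    return duplicates, ratio


rows = [(1, "Ada"), (2, "Lin"), (1, "Ada"), (1, "Ada")]
stats = duplicate_stats(rows)
```

Under these assumptions, the four rows above contain two repeats, giving a ratio of one half.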
Schema Validation — arnio.schema
Named validation contract for an ArFrame. Use strict=True to reject unexpected columns.
Method: schema.validate(frame) returns ValidationResult.
| Builder | Rules |
|---|---|
| Int64(nullable=True, min=None, max=None, unique=False) | Integer dtype, nullability, range, uniqueness. |
| Float64(nullable=True, min=None, max=None, unique=False) | Float dtype, nullability, range, uniqueness. |
| String(nullable=True, pattern=None, allowed=None, unique=False, min_length=None, max_length=None) | String dtype, regex, allowed values, uniqueness, length bounds. |
| Bool(nullable=True) | Boolean dtype and nullability. |
| Email(nullable=True, unique=False) | Email semantic validation. |
| URL(nullable=True, unique=False) | URL semantic validation. |
Validate an ArFrame against a Schema or dict[str, Field].
Returns: ValidationResult
| Attribute/Method | Description |
|---|---|
| .passed | True when there are zero issues. |
| .issue_count | Total number of validation issues. |
| .issues | List of ValidationIssue objects with column, rule, message, row index, and value. |
| .bad_rows | Sorted row indexes with validation failures. |
| .summary(), .to_dict(), .to_pandas() | Compact, JSON-friendly, or tabular output. |
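
How issues, `issue_count`, and `bad_rows` relate to one another can be sketched with stdlib dataclasses. The classes and the dict-based rules below are hypothetical stand-ins, not arnio's Schema, Field, or ValidationResult types, and matching the whole string against the pattern is an assumption.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Issue:
    column: str
    rule: str
    message: str
    row: int
    value: object


@dataclass
class Result:
    issues: list = field(default_factory=list)

    @property
    def passed(self):
        return not self.issues

    @property
    def issue_count(self):
        return len(self.issues)

    @property
    def bad_rows(self):
        # A row with several issues is reported once.
        return sorted({issue.row for issue in self.issues})


def validate_sketch(rows, rules):
    """rules maps column -> dict with optional nullable, min, pattern keys."""
    result = Result()
    for index, row in enumerate(rows):
        for column, rule in rules.items():
            value = row.get(column)
            if value is None:
                if not rule.get("nullable", True):
                    result.issues.append(
                        Issue(column, "nullable", "null not allowed", index, value))
                continue
            if "min" in rule and value < rule["min"]:
                result.issues.append(
                    Issue(column, "min", f"below {rule['min']}", index, value))
            if "pattern" in rule and not re.fullmatch(rule["pattern"], value):
                result.issues.append(
                    Issue(column, "pattern", "pattern mismatch", index, value))
    return result


rows = [
    {"id": 1, "email": "ada@example.com"},
    {"id": -5, "email": "not-an-email"},
    {"id": None, "email": "lin@example.com"},
]
rules = {
    "id": {"nullable": False, "min": 0},
    "email": {"pattern": r"[^@\s]+@[^@\s]+\.[^@\s]+"},
}
result = validate_sketch(rows, rules)
```

In this sketch the second row produces two issues but appears only once in `bad_rows`, which is why `issue_count` and `len(bad_rows)` can differ.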
Core — arnio.frame
Lightweight columnar data container backed by C++.
| Property/Method | Returns | Description |
|---|---|---|
| .shape | tuple[int, int] | Row and column count. |
| .columns | list[str] | Column names. |
| .dtypes | dict[str, str] | Column name → inferred type. |
| .memory_usage() | int | Total bytes consumed. |
| len(frame) | int | Number of rows. |
Exceptions — arnio.exceptions
| Exception | Description |
|---|---|
| ArnioError | Base exception for all Arnio errors. |
| UnknownStepError | Raised when a pipeline step name is not registered. Lists available steps. |
| CsvReadError | Raised when a CSV file cannot be read. |
| TypeCastError | Raised when cast_types encounters an incompatible type. |