Prepare Dataset Config Reference¶
idxr prepare_datasets consumes a JSON document that maps each Pydantic model name to a preprocessing recipe. A minimal config looks like:
{
"Contract": {
"path": "datasets/contracts.csv",
"columns": {
"id": "CONTRACT_ID",
"title": "CONTRACT_TITLE",
"summary": "DESCRIPTION"
},
"character_encoding": "utf-8",
"delimiter": ",",
"malformed_column": null,
"header_row": "all",
"drop_na_columns": []
}
}
Each field plays a specific role during preprocessing:
| Field | Required | Description |
|---|---|---|
path |
✅ | Absolute or relative path to the source CSV/JSONL file. Leave blank ("") to skip a model temporarily. |
columns |
✅ | Mapping of model field name → source column header. Add, remove, or rename entries to match the schema expected by the Pydantic model. |
character_encoding |
optional (defaults to "utf-8") |
Target encoding for the dataset. idxr decodes using this value and re-encodes output partitions as UTF-8. |
delimiter |
optional (defaults to ",") |
Column delimiter for CSV inputs. Change to "\t" for TSV or ";" if your exports use semicolons. |
malformed_column |
optional | Zero-based index of a column that frequently contains embedded newlines. idxr stitches rows by looking ahead until it can parse this column. |
header_row |
optional (defaults to "all") |
Controls which rows are considered headers: "all" keeps every header row, "first" retains only the first row, or specify a literal string to match. |
drop_na_columns |
optional (defaults to []) |
List of column names that must be non-empty. Rows with empty/NaN values in these columns are dropped before partitioning. |
Tips¶
- Store configs under version control. They serve as the contract between data engineering and knowledge engineering teams.
- When CSV exports move, update just the
path—manifest diffing prevents reprocessing unchanged rows. - Use separate configs for different sourcing strategies (e.g., nightly full export vs. targeted hotfix).