Prepare Datasets¶
`idxr prepare_datasets` turns raw CSV/JSONL exports into manifest-tracked partitions. It is the "migration authoring" phase of the lifecycle: you build reproducible input slices that downstream vectorization can replay in order.
Responsibilities¶
- Schema awareness – uses your Pydantic model registry to validate column mappings and detect schema drift.
- Row hygiene – trims whitespace, fixes malformed encodings, stitches newline-leaking rows, and drops duplicates via deterministic digests.
- Partition orchestration – writes partitions into timestamped directories while maintaining a single manifest that records model, partition, and schema version metadata.
- Change tracking – records row-level digests alongside the manifest so reruns skip previously processed records.
- Drop planning – generates and applies remediation scripts that mark partitions as stale or deleted when you want to unwind a bad migration.
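The deterministic digests behind row hygiene and change tracking can be illustrated with a small sketch. This is not the tool's implementation; it is a minimal example, assuming rows are hashed after whitespace normalization and that a previously seen digest marks a duplicate (the `row_digest` and `dedupe` names are hypothetical):

```python
import csv
import hashlib
import io

def row_digest(row: dict) -> str:
    """Deterministic digest of a normalized row: trim whitespace,
    join fields in sorted-key order, hash with SHA-256."""
    normalized = "\x1f".join(f"{k}={row[k].strip()}" for k in sorted(row))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(rows):
    """Yield only rows whose digest has not been seen before."""
    seen = set()
    for row in rows:
        digest = row_digest(row)
        if digest not in seen:
            seen.add(digest)
            yield row

# "1, Ada " and "1,Ada" normalize to the same digest, so one is dropped.
raw = "id,name\n1, Ada \n1,Ada\n2,Grace\n"
rows = list(dedupe(csv.DictReader(io.StringIO(raw))))
```

Because the digest depends only on normalized field values, reruns over unchanged source data produce identical digests, which is what lets the pipeline skip previously processed records.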
Workflow Summary¶
- Scaffold a config with `idxr prepare_datasets new-config`. The generated config lists every model and includes a column-mapping stub.
- Edit the config to point at the CSV exports you actually want to ingest.
- Run `idxr prepare_datasets` with the config and an output directory. Repeated runs append new partitions if the source data changed.
- Review the manifest (`manifest.json`) to audit what was produced, including digests and schema signatures.
- Plan drops with `idxr prepare_datasets plan-drop` and execute them with `idxr prepare_datasets apply-drop` whenever you need to roll back a migration.
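When auditing the manifest, a short script can summarize what each run produced. The exact `manifest.json` schema is defined by the tool, so the field names below (`partitions`, `row_digests`, `status`, and the sample values) are assumptions for illustration, mirroring the model, partition, schema-version, and digest metadata described above:

```python
import json

# Hypothetical manifest contents; the real layout written by
# idxr prepare_datasets may differ in field names and nesting.
manifest_text = """
{
  "partitions": [
    {"model": "Customer",
     "partition": "20240501T120000/customers-000",
     "schema_version": "3",
     "status": "active",
     "row_digests": ["9f8a", "c1d2"]},
    {"model": "Order",
     "partition": "20240501T120000/orders-000",
     "schema_version": "1",
     "status": "stale",
     "row_digests": ["77ab"]}
  ]
}
"""

def audit(manifest: dict) -> dict:
    """Count partitions and tracked row digests per model."""
    summary = {}
    for part in manifest["partitions"]:
        entry = summary.setdefault(part["model"], {"partitions": 0, "rows": 0})
        entry["partitions"] += 1
        entry["rows"] += len(part["row_digests"])
    return summary

report = audit(json.loads(manifest_text))
```

A `status` field like the one sketched here is also where drop planning would surface: partitions marked stale or deleted by `apply-drop` can be filtered out of the summary before downstream vectorization replays the slices.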
The rest of this section documents the configuration schema and command-line surface area so you can tailor the pipeline to your own datasets.