Vectorize¶
idxr vectorize ingests partitions and upserts documents into Chroma. It is designed for long-running, resume-friendly indexing jobs that can target local persistent stores or Chroma Cloud tenants.
Responsibilities¶
- Model validation – the same registry used by
prepare_datasetsdrives schema-aware indexing and metadata enrichment. - Token budgeting – dynamic truncation strategies ensure each document respects the embedding model’s token limits.
- Resume state – batch offsets, row digests, and manifest progress are checkpointed so reruns skip already indexed slices.
- Observability – structured logs, optional log rotation, and error YAML payloads make failures debuggable.
- Multi-tenant support – built-in clients for local persistent stores and Chroma Cloud, with pluggable collection strategies.
Workflow Summary¶
- Generate configs – either edit a vectorize JSON config manually or let
idxr prepare_datasetsgenerate partition manifests. - Run
idxr vectorize index– point at the manifest or config, choose an output location for per-partition persistence, and supply the required connectivity flags. - Resume as needed – re-run with
--resumeto pick up where you left off after a failure or partial run. - Inspect progress – use
idxr vectorize statusto compare manifest entries against indexed partitions and ensure the run completed.
The following pages document configuration schemas and command-line flags in detail.