Multi-Collection Execution Architecture¶

Overview¶

ChromaSQL now supports executing queries across multiple ChromaDB collections with intelligent routing based on metadata filters. This implementation is generic and extensible - it doesn't hardcode any specific discriminator field like "model", allowing developers to build custom routing strategies for their use cases.

What Was Built¶

1. Core Abstractions (`chromasql/multi_collection.py`)¶

Two Protocol Interfaces:

CollectionRouter - Decides which collections to query based on parsed AST
Returns None → query all collections
Returns Sequence[str] → query specific collections
Fully customizable by implementing the protocol
AsyncCollectionProvider - Abstracts async collection retrieval
Works with any async ChromaDB client (HTTP, Cloud, custom)
Handles collection caching and connection pooling

Main Function: - execute_multi_collection() - Orchestrates multi-collection execution - Parses query → routes to collections → executes in parallel → merges results - Supports both vector and filter-only queries - Handles partial failures gracefully - Respects LIMIT/OFFSET/ORDER BY after merging

2. Pre-Built Adapters (`chromasql/adapters.py`)¶

Three Ready-to-Use Implementations:

MetadataFieldRouter - Generic metadata-based routing
Extracts values from any metadata field path (e.g., ("model",), ("tenant", "id"))
Uses query config to map values to collections
Configurable fallback behavior
SimpleAsyncClientAdapter - Wraps raw ChromaDB async clients
For simpler setups without query config infrastructure

3. Comprehensive Tests (`tests/chromasql/test_multi_collection.py`)¶

11 Test Cases Covering: - Static and dynamic routing - Metadata-based routing (single value, IN lists) - Fallback behavior (query all vs. strict mode) - LIMIT/OFFSET/ORDER BY after merge - Vector and filter-only query modes - Partial collection failure handling - Adapter implementations

Test Coverage: 96% overall (1067 statements, 45 miss)

4. Documentation¶

Three Documentation Files: - CONTRIBUTING.md - Updated with multi-collection architecture - EXAMPLES.md - 6 comprehensive usage examples - MULTI_COLLECTION_SUMMARY.md - This file

Key Design Decisions¶

✅ Generic, Not Specific¶

Decision: Don't hardcode "model" as the discriminator field

Rationale: - Different developers have different metadata structures - Field could be: tenant_id, region, category, model, etc. - Protocol-based design allows unlimited customization

Implementation:

# Generic - works with any field
router = MetadataFieldRouter(config, field_path=("model",))
router = MetadataFieldRouter(config, field_path=("tenant_id",))
router = MetadataFieldRouter(config, field_path=("org", "region"))

✅ Leverage Existing Analysis Module¶

Decision: Use chromasql.analysis.extract_metadata_values() for routing

Rationale: - Already existed and was designed for this purpose - Keeps routing logic separate from query execution - Easy to test in isolation

Code Reference: chromasql/analysis.py

✅ Protocol-Based Extensibility¶

Decision: Use Python protocols instead of base classes

Rationale: - More Pythonic and flexible - Duck typing enables easy mocking in tests - No inheritance complexity

Example:

class CollectionRouter(Protocol):
    def route(self, query: Query) -> Optional[Sequence[str]]:
        ...

✅ Graceful Partial Failures¶

Decision: Return results from successful collections even if some fail

Rationale: - Common in distributed systems for some nodes to be unavailable - Better to return partial results than fail entirely - Logs errors for monitoring

Behavior: - If 35 of 37 collections succeed → return merged results from 35 - If ALL collections fail → raise ChromaSQLExecutionError

✅ Result Merging Strategy¶

Decision: Merge by distance and re-apply ORDER BY/LIMIT/OFFSET

Rationale: - Vector queries need global ranking across collections - LIMIT/OFFSET should apply to final merged results, not per-collection - ORDER BY may include multiple fields (e.g., ORDER BY metadata.year DESC, distance ASC)

Implementation: chromasql/multi_collection.py

Integration with Your Setup¶

Your Current Infrastructure¶

37 collections
16M total records
metadata.model as discriminator
AsyncMultiCollectionQueryClient already exists
query_config.json maps models → collections

How to Use¶

from pathlib import Path
from chromasql.adapters import MetadataFieldRouter
from chromasql.multi_collection import execute_multi_collection
from idxr.query_lib.async_multi_collection_adapter import AsyncMultiCollectionAdapter
from idxr.vectorize_lib.query_client import AsyncMultiCollectionQueryClient
from idxr.vectorize_lib.query_config import load_query_config

# Load config
config = load_query_config(Path("output/query_config.json"))

# Initialize client (your existing code)
client = AsyncMultiCollectionQueryClient(
    config_path=Path("output/query_config.json"),
    client_type="cloud",
    cloud_api_key=api_key,
)
await client.connect()

# Create adapters
adapter = AsyncMultiCollectionAdapter(client)
router = MetadataFieldRouter(
    query_config=config,
    field_path=("model",),  # Your discriminator
    fallback_to_all=True,    # Query all 37 if not specified
)

# Execute ChromaSQL with routing
result = await execute_multi_collection(
    query_str="""
        SELECT id, distance, document
        FROM sap_data
        WHERE metadata.model IN ('Table', 'Field')
        USING EMBEDDING (TEXT 'financial tables')
        TOPK 10;
    """,
    router=router,
    collection_provider=adapter,
    embed_fn=your_embed_function,
)

# Router extracted {'Table', 'Field'} from WHERE clause
# Queried only collections containing those models (e.g., 5 of 37)
# Results merged and ranked globally by distance

Routing Examples¶

Query with model filter → targeted routing:

SELECT * FROM demo
WHERE metadata.model = 'Table'
USING EMBEDDING (TEXT 'query')
-- Queries only collections containing 'Table'

Query without model filter → all collections:

SELECT * FROM demo
WHERE metadata.year > 2020
USING EMBEDDING (TEXT 'query')
-- Queries all 37 collections (model not constrained)

Filter-only query → works too:

SELECT * FROM demo
WHERE metadata.model = 'Field'
  AND metadata.status = 'active'
-- No USING EMBEDDING = filter-only mode
-- Still routes based on metadata.model

Performance Characteristics¶

Parallel Execution¶

All collections queried in parallel using asyncio.gather()
No sequential bottlenecks

Network Efficiency¶

Only queries collections that contain the filtered models
Example: Filter on 2 models → query 5 collections (not all 37)

Result Merging¶

In-memory merge after collection queries complete
Complexity: O(n log n) where n = total results from all collections
For TOPK 10 across 5 collections → sorts ~50 items, returns top 10

Recommended Patterns¶

Use specific model filters when possible:

WHERE metadata.model IN ('Table', 'Field')  -- Good: queries 5 collections
WHERE metadata.status = 'active'             -- Queries all 37 collections

Fetch more candidates per collection for better recall:

result = await execute_multi_collection(
    query_str=...,
    n_results_per_collection=50,  # Fetch 50 from each collection
    # Final result still respects TOPK/LIMIT in query
)

Monitor routing decisions:

query = parse(sql)
collections = router.route(query)
logger.info(f"Querying {len(collections) if collections else 'all'} collection(s)")

Testing & Coverage¶

Test Suite¶

11 new tests in test_multi_collection.py
All existing 109 tests still pass
Total: 120 passing tests

Coverage¶

chromasql/__init__.py              100%
chromasql/adapters.py               75% (16 miss - edge cases)
chromasql/analysis.py               93% (2 miss - rare branches)
chromasql/ast.py                   100%
chromasql/errors.py                100%
chromasql/executor.py              100%
chromasql/explain.py               100%
chromasql/grammar.py               100%
chromasql/multi_collection.py       77% (27 miss - error paths)
chromasql/parser.py                100%
chromasql/plan.py                  100%
chromasql/planner.py               100%
-----------------------------------------------------
TOTAL                               96% coverage

Key Test Scenarios Covered¶

✅ Routing based on single value (WHERE model = 'Table')
✅ Routing based on IN list (WHERE model IN ('Table', 'Field'))
✅ Fallback when discriminator absent
✅ Strict mode (error if discriminator missing)
✅ Partial collection failures
✅ Result merging with LIMIT/OFFSET
✅ Both vector and filter-only queries
✅ Adapter implementations

Files Modified/Created¶

New Files¶

✨ chromasql/multi_collection.py (393 lines) - Core multi-collection execution
✨ chromasql/adapters.py (300 lines) - Pre-built adapters
✨ tests/chromasql/test_multi_collection.py (412 lines) - Test suite
✨ chromasql/EXAMPLES.md (500+ lines) - Usage examples
✨ chromasql/MULTI_COLLECTION_SUMMARY.md (this file)

Modified Files¶

📝 chromasql/__init__.py - Export new APIs
📝 chromasql/CONTRIBUTING.md - Add multi-collection patterns section

Unchanged (All Tests Still Pass)¶

✅ chromasql/parser.py
✅ chromasql/planner.py
✅ chromasql/executor.py
✅ chromasql/ast.py
✅ All other core modules

Next Steps¶

For Your Use Case¶

Try it out:
```
poetry run python -m your_module
```
Monitor routing behavior:
Log router.route(query) results
Track which collections are queried
Measure query latency improvements
Tune performance:
Adjust n_results_per_collection if needed
Consider adding more discriminators (e.g., environment, region)

For Other Developers¶

Implement custom routers:
Extend CollectionRouter protocol
Use extract_metadata_values() helper
See examples in EXAMPLES.md
Contribute improvements:
Add support for nested metadata paths in analysis module
Implement additional merge strategies (e.g., score blending)
Add async streaming for large result sets
Documentation:
Add your routing strategy to EXAMPLES.md
Share patterns in discussions

API Reference¶

Public Exports¶

From chromasql:

from chromasql import (
    # Core API (unchanged)
    parse, build_plan, execute_plan, plan_to_dict,
    ExecutionResult, QueryPlan,

    # Analysis helpers
    extract_metadata_values,

    # Multi-collection support
    CollectionRouter,
    AsyncCollectionProvider,
    execute_multi_collection,

    # Pre-built adapters
    AsyncMultiCollectionAdapter,
    MetadataFieldRouter,
    SimpleAsyncClientAdapter,
)

Main Function Signature¶

async def execute_multi_collection(
    query_str: str,
    router: CollectionRouter,
    collection_provider: AsyncCollectionProvider,
    *,
    embed_fn: Optional[EmbedFunction] = None,
    merge_strategy: str = "distance",
    n_results_per_collection: Optional[int] = None,
) -> ExecutionResult:
    """Execute a ChromaSQL query across multiple collections."""

Protocol Signatures¶

class CollectionRouter(Protocol):
    def route(self, query: Query) -> Optional[Sequence[str]]:
        """Return collection names or None for all."""

class AsyncCollectionProvider(Protocol):
    async def get_collection(self, name: str) -> Any:
        """Get collection by name."""

    async def list_collection_names(self) -> Sequence[str]:
        """List all available collections."""

Questions & Support¶

For questions or issues:

Check EXAMPLES.md for usage patterns
Review CONTRIBUTING.md for architecture details
Run tests: poetry run pytest tests/chromasql/ -v
Open an issue with reproduction steps

Implementation completed: All tasks done ✅ Test coverage: 96% overall, 120 tests passing Status: Ready for production use