Multi-Collection Execution Architecture
Overview
ChromaSQL now supports executing queries across multiple ChromaDB collections with intelligent routing based on metadata filters. This implementation is generic and extensible - it doesn't hardcode any specific discriminator field like "model", allowing developers to build custom routing strategies for their use cases.
What Was Built
1. Core Abstractions (chromasql/multi_collection.py)
Two Protocol Interfaces:
- CollectionRouter - Decides which collections to query based on the parsed AST
  - Returns None → query all collections
  - Returns Sequence[str] → query specific collections
  - Fully customizable by implementing the protocol
- AsyncCollectionProvider - Abstracts async collection retrieval
  - Works with any async ChromaDB client (HTTP, Cloud, custom)
  - Handles collection caching and connection pooling
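To make the provider's role concrete, here is a minimal sketch of a class with the shape of AsyncCollectionProvider that caches collections by name. It is not the library's implementation: CachingCollectionProvider and FakeClient are hypothetical names, and the sketch assumes the wrapped client exposes an async get_collection(name) method.

```python
import asyncio
from typing import Any, Dict, Sequence


class CachingCollectionProvider:
    """Hypothetical provider: fetches each collection once, then reuses it."""

    def __init__(self, client: Any, names: Sequence[str]) -> None:
        self._client = client          # assumed to expose `async get_collection(name)`
        self._names = list(names)      # names this provider is allowed to serve
        self._cache: Dict[str, Any] = {}

    async def get_collection(self, name: str) -> Any:
        if name not in self._cache:
            self._cache[name] = await self._client.get_collection(name)
        return self._cache[name]

    async def list_collection_names(self) -> Sequence[str]:
        return list(self._names)


class FakeClient:
    """Stand-in async client that counts how often it is asked for a collection."""

    def __init__(self) -> None:
        self.calls = 0

    async def get_collection(self, name: str) -> Dict[str, str]:
        self.calls += 1
        return {"name": name}


async def demo() -> int:
    client = FakeClient()
    provider = CachingCollectionProvider(client, ["coll_01", "coll_02"])
    await provider.get_collection("coll_01")
    await provider.get_collection("coll_01")  # second lookup is served from the cache
    return client.calls


print(asyncio.run(demo()))
```

Two concurrent cache misses for the same name could still trigger two fetches in this sketch; a lock per name would close that gap if it matters.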
Main Function:
- execute_multi_collection() - Orchestrates multi-collection execution
  - Parses query → routes to collections → executes in parallel → merges results
  - Supports both vector and filter-only queries
  - Handles partial failures gracefully
  - Respects LIMIT/OFFSET/ORDER BY after merging
2. Pre-Built Adapters (chromasql/adapters.py)
Three Ready-to-Use Implementations:
- MetadataFieldRouter - Generic metadata-based routing
  - Extracts values from any metadata field path (e.g., ("model",), ("tenant", "id"))
  - Uses query config to map values to collections
  - Configurable fallback behavior
- AsyncMultiCollectionAdapter - Wraps an existing AsyncMultiCollectionQueryClient (see How to Use)
- SimpleAsyncClientAdapter - Wraps raw ChromaDB async clients
  - For simpler setups without query config infrastructure
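The routing idea behind MetadataFieldRouter can be sketched as a plain function: take the values a query constrains the discriminator to, look each one up in a value-to-collections mapping, and fall back to "all collections" (or raise in strict mode) when the field is not constrained. This is an illustration under assumptions, not the adapter's actual code: route_by_values and the mapping below are hypothetical, with the mapping standing in for what query_config.json provides.

```python
from typing import List, Mapping, Optional, Sequence, Set


def route_by_values(
    values: Optional[Set[str]],
    value_to_collections: Mapping[str, Sequence[str]],
    fallback_to_all: bool = True,
) -> Optional[List[str]]:
    """Map discriminator values to collection names.

    `values` is what a helper such as extract_metadata_values() might yield
    for the chosen field; None or empty means the field is not constrained.
    """
    if not values:
        if fallback_to_all:
            return None  # None means "query every collection"
        raise ValueError("query must filter on the discriminator field")

    targets: List[str] = []
    for value in sorted(values):  # sorted for a deterministic order
        for name in value_to_collections.get(value, ()):
            if name not in targets:  # keep order, drop duplicates
                targets.append(name)
    return targets


# Hypothetical mapping from model name to the collections that contain it.
mapping = {"Table": ["coll_01", "coll_02"], "Field": ["coll_02", "coll_03"]}

print(route_by_values({"Table", "Field"}, mapping))
print(route_by_values(None, mapping))
```

Values that are missing from the mapping contribute no collections here, so a query on an unknown value returns an empty list; whether that should be an error is a design choice this sketch leaves open.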
3. Comprehensive Tests (tests/chromasql/test_multi_collection.py)
11 Test Cases Covering:
- Static and dynamic routing
- Metadata-based routing (single value, IN lists)
- Fallback behavior (query all vs. strict mode)
- LIMIT/OFFSET/ORDER BY after merge
- Vector and filter-only query modes
- Partial collection failure handling
- Adapter implementations
Test Coverage: 96% overall (1067 statements, 45 missed)
4. Documentation
Three Documentation Files:
- CONTRIBUTING.md - Updated with multi-collection architecture
- EXAMPLES.md - 6 comprehensive usage examples
- MULTI_COLLECTION_SUMMARY.md - This file
Key Design Decisions
✅ Generic, Not Specific
Decision: Don't hardcode "model" as the discriminator field
Rationale:
- Different developers have different metadata structures
- Field could be: tenant_id, region, category, model, etc.
- Protocol-based design allows unlimited customization
Implementation:
# Generic - works with any field
router = MetadataFieldRouter(config, field_path=("model",))
router = MetadataFieldRouter(config, field_path=("tenant_id",))
router = MetadataFieldRouter(config, field_path=("org", "region"))
✅ Leverage Existing Analysis Module
Decision: Use chromasql.analysis.extract_metadata_values() for routing
Rationale:
- Already existed and was designed for this purpose
- Keeps routing logic separate from query execution
- Easy to test in isolation
Code Reference: chromasql/analysis.py
✅ Protocol-Based Extensibility
Decision: Use Python protocols instead of base classes
Rationale:
- More Pythonic and flexible
- Duck typing enables easy mocking in tests
- No inheritance complexity
Example:
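A minimal sketch of what protocol-based extensibility looks like in practice: any class with a matching route() method satisfies the protocol, with no base class involved. The CollectionRouter defined here is a local stand-in that mirrors the signature listed under Protocol Signatures (with Query loosened to Any so the block is self-contained); StaticRouter and QueryAllRouter are hypothetical examples.

```python
from typing import Any, Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class CollectionRouter(Protocol):
    """Local stand-in for the protocol described in this document."""

    def route(self, query: Any) -> Optional[Sequence[str]]: ...


class StaticRouter:
    """Always routes to a fixed set of collections. Note: no base class."""

    def __init__(self, names: Sequence[str]) -> None:
        self._names = list(names)

    def route(self, query: Any) -> Optional[Sequence[str]]:
        return self._names


class QueryAllRouter:
    """Returns None, which by convention means 'query all collections'."""

    def route(self, query: Any) -> Optional[Sequence[str]]:
        return None


router = StaticRouter(["coll_01", "coll_02"])
# Structural check: passes because StaticRouter has a route() method.
print(isinstance(router, CollectionRouter))
```

The same property makes test doubles cheap: a three-line class with a route() method is enough to drive the executor in a unit test.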
✅ Graceful Partial Failures
Decision: Return results from successful collections even if some fail
Rationale:
- In distributed systems it is common for some nodes to be unavailable
- Better to return partial results than fail entirely
- Errors are logged for monitoring
Behavior:
- If 35 of 37 collections succeed → return merged results from 35
- If ALL collections fail → raise ChromaSQLExecutionError
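One way to get this behavior is asyncio.gather() with return_exceptions=True, which hands back exceptions as values instead of cancelling the whole batch. The sketch below is an assumption about how such handling could look, not the code in chromasql/multi_collection.py: query_with_partial_failures and fake_query are hypothetical, and RuntimeError stands in for ChromaSQLExecutionError.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, List, Sequence

logger = logging.getLogger("multi_collection_sketch")


async def query_with_partial_failures(
    names: Sequence[str],
    query_one: Callable[[str], Awaitable[List[dict]]],
) -> Dict[str, List[dict]]:
    """Query all collections concurrently and keep whatever succeeded."""
    outcomes = await asyncio.gather(
        *(query_one(name) for name in names), return_exceptions=True
    )
    results: Dict[str, List[dict]] = {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            logger.warning("collection %s failed: %r", name, outcome)
        else:
            results[name] = outcome
    if names and not results:
        # Stand-in for ChromaSQLExecutionError in this sketch.
        raise RuntimeError("all collections failed")
    return results


async def fake_query(name: str) -> List[dict]:
    """Hypothetical per-collection query; 'bad' simulates an unavailable node."""
    if name == "bad":
        raise ConnectionError("unavailable")
    return [{"id": f"{name}-1", "distance": 0.1}]


print(asyncio.run(query_with_partial_failures(["a", "bad", "b"], fake_query)))
```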
✅ Result Merging Strategy
Decision: Merge by distance and re-apply ORDER BY/LIMIT/OFFSET
Rationale:
- Vector queries need global ranking across collections
- LIMIT/OFFSET should apply to final merged results, not per-collection
- ORDER BY may include multiple fields (e.g., ORDER BY metadata.year DESC, distance ASC)
Implementation: chromasql/multi_collection.py
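The merge step can be illustrated with a short sketch: flatten the per-collection rows, sort them globally, then apply OFFSET and LIMIT to the merged list. The row layout (dicts with id, distance and metadata keys) and the merge_rows helper are assumptions made for illustration; the real result structures may differ.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def merge_rows(
    per_collection: Iterable[List[Row]],
    key: Callable[[Row], Any] = lambda row: row["distance"],
    limit: Optional[int] = None,
    offset: int = 0,
) -> List[Row]:
    """Flatten per-collection rows, rank them globally, then page the result."""
    merged = [row for rows in per_collection for row in rows]
    merged.sort(key=key)  # O(n log n) over all returned rows
    end = None if limit is None else offset + limit
    return merged[offset:end]


rows_a = [
    {"id": "a1", "distance": 0.10, "metadata": {"year": 2019}},
    {"id": "a2", "distance": 0.40, "metadata": {"year": 2022}},
]
rows_b = [
    {"id": "b1", "distance": 0.25, "metadata": {"year": 2022}},
    {"id": "b2", "distance": 0.30, "metadata": {"year": 2021}},
]

# Global ranking by distance, then OFFSET 1 LIMIT 2 on the merged list.
print([r["id"] for r in merge_rows([rows_a, rows_b], limit=2, offset=1)])

# ORDER BY metadata.year DESC, distance ASC expressed as a composite sort key.
by_year = merge_rows(
    [rows_a, rows_b],
    key=lambda r: (-r["metadata"]["year"], r["distance"]),
)
print([r["id"] for r in by_year])
```

Applying LIMIT per collection instead would be wrong for the same reason the rationale gives: the globally best rows may all come from one collection.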
Integration with Your Setup
Your Current Infrastructure
- 37 collections
- 16M total records
- metadata.model as discriminator
- AsyncMultiCollectionQueryClient already exists
- query_config.json maps models → collections
How to Use
from pathlib import Path

from chromasql.adapters import MetadataFieldRouter
from chromasql.multi_collection import execute_multi_collection
from idxr.query_lib.async_multi_collection_adapter import AsyncMultiCollectionAdapter
from idxr.vectorize_lib.query_client import AsyncMultiCollectionQueryClient
from idxr.vectorize_lib.query_config import load_query_config

# Load config
config = load_query_config(Path("output/query_config.json"))

# Initialize client (your existing code)
client = AsyncMultiCollectionQueryClient(
    config_path=Path("output/query_config.json"),
    client_type="cloud",
    cloud_api_key=api_key,
)
await client.connect()

# Create adapters
adapter = AsyncMultiCollectionAdapter(client)
router = MetadataFieldRouter(
    query_config=config,
    field_path=("model",),   # Your discriminator
    fallback_to_all=True,    # Query all 37 if not specified
)

# Execute ChromaSQL with routing
result = await execute_multi_collection(
    query_str="""
        SELECT id, distance, document
        FROM sap_data
        WHERE metadata.model IN ('Table', 'Field')
        USING EMBEDDING (TEXT 'financial tables')
        TOPK 10;
    """,
    router=router,
    collection_provider=adapter,
    embed_fn=your_embed_function,
)

# Router extracted {'Table', 'Field'} from WHERE clause
# Queried only collections containing those models (e.g., 5 of 37)
# Results merged and ranked globally by distance
Routing Examples
Query with model filter → targeted routing:
SELECT * FROM demo
WHERE metadata.model = 'Table'
USING EMBEDDING (TEXT 'query')
-- Queries only collections containing 'Table'
Query without model filter → all collections:
SELECT * FROM demo
WHERE metadata.year > 2020
USING EMBEDDING (TEXT 'query')
-- Queries all 37 collections (model not constrained)
Filter-only query → works too:
SELECT * FROM demo
WHERE metadata.model = 'Field'
AND metadata.status = 'active'
-- No USING EMBEDDING = filter-only mode
-- Still routes based on metadata.model
Performance Characteristics
Parallel Execution
- All collections queried in parallel using asyncio.gather()
- No sequential bottlenecks
Network Efficiency
- Only queries collections that contain the filtered models
- Example: Filter on 2 models → query 5 collections (not all 37)
Result Merging
- In-memory merge after collection queries complete
- Complexity: O(n log n) where n = total results from all collections
- For TOPK 10 across 5 collections → sorts ~50 items, returns top 10
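Recommended Patterns
- Use specific model filters when possible, so the router can target a subset of collections instead of all 37.
- Fetch more candidates per collection (via n_results_per_collection) for better recall after the merge.
- Monitor routing decisions, for example by logging what router.route(query) returns.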
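For monitoring, one lightweight option is to wrap whichever router is in use in a thin decorator that logs each decision before passing it on. The sketch below assumes only that the wrapped object has a route(query) method; LoggingRouter and FixedRouter are hypothetical names, not part of chromasql.

```python
import logging
from typing import Any, Optional, Sequence

logger = logging.getLogger("routing_sketch")


class LoggingRouter:
    """Wraps any object with a route() method and logs each routing decision."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner

    def route(self, query: Any) -> Optional[Sequence[str]]:
        targets = self._inner.route(query)
        if targets is None:
            logger.info("routing to ALL collections")
        else:
            logger.info("routing to %d collections: %s", len(targets), list(targets))
        return targets  # decision is passed through unchanged


class FixedRouter:
    """Hypothetical inner router used only to exercise the wrapper."""

    def route(self, query: Any) -> Optional[Sequence[str]]:
        return ["coll_01", "coll_02"]


logging.basicConfig(level=logging.INFO)
router = LoggingRouter(FixedRouter())
print(router.route(query=None))
```

Because the wrapper itself has a route() method, it can be passed anywhere a router is expected, which makes it easy to switch on and off.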
Testing & Coverage
Test Suite
- 11 new tests in test_multi_collection.py
- All existing 109 tests still pass
- Total: 120 passing tests
Coverage
chromasql/__init__.py 100%
chromasql/adapters.py 75% (16 miss - edge cases)
chromasql/analysis.py 93% (2 miss - rare branches)
chromasql/ast.py 100%
chromasql/errors.py 100%
chromasql/executor.py 100%
chromasql/explain.py 100%
chromasql/grammar.py 100%
chromasql/multi_collection.py 77% (27 miss - error paths)
chromasql/parser.py 100%
chromasql/plan.py 100%
chromasql/planner.py 100%
-----------------------------------------------------
TOTAL 96% coverage
Key Test Scenarios Covered
- ✅ Routing based on single value (WHERE model = 'Table')
- ✅ Routing based on IN list (WHERE model IN ('Table', 'Field'))
- ✅ Fallback when discriminator absent
- ✅ Strict mode (error if discriminator missing)
- ✅ Partial collection failures
- ✅ Result merging with LIMIT/OFFSET
- ✅ Both vector and filter-only queries
- ✅ Adapter implementations
Files Modified/Created
New Files
- ✨ chromasql/multi_collection.py (393 lines) - Core multi-collection execution
- ✨ chromasql/adapters.py (300 lines) - Pre-built adapters
- ✨ tests/chromasql/test_multi_collection.py (412 lines) - Test suite
- ✨ chromasql/EXAMPLES.md (500+ lines) - Usage examples
- ✨ chromasql/MULTI_COLLECTION_SUMMARY.md (this file)
Modified Files
- 📝 chromasql/__init__.py - Export new APIs
- 📝 chromasql/CONTRIBUTING.md - Add multi-collection patterns section
Unchanged (All Tests Still Pass)
- ✅ chromasql/parser.py
- ✅ chromasql/planner.py
- ✅ chromasql/executor.py
- ✅ chromasql/ast.py
- ✅ All other core modules
Next Steps
For Your Use Case
- Try it out, starting from the How to Use example above.
- Monitor routing behavior:
  - Log router.route(query) results
  - Track which collections are queried
  - Measure query latency improvements
- Tune performance:
  - Adjust n_results_per_collection if needed
  - Consider adding more discriminators (e.g., environment, region)
For Other Developers
- Implement custom routers:
  - Implement the CollectionRouter protocol
  - Use the extract_metadata_values() helper
  - See examples in EXAMPLES.md
- Contribute improvements:
  - Add support for nested metadata paths in analysis module
  - Implement additional merge strategies (e.g., score blending)
  - Add async streaming for large result sets
- Documentation:
  - Add your routing strategy to EXAMPLES.md
  - Share patterns in discussions
API Reference
Public Exports
From chromasql:
from chromasql import (
    # Core API (unchanged)
    parse, build_plan, execute_plan, plan_to_dict,
    ExecutionResult, QueryPlan,
    # Analysis helpers
    extract_metadata_values,
    # Multi-collection support
    CollectionRouter,
    AsyncCollectionProvider,
    execute_multi_collection,
    # Pre-built adapters
    AsyncMultiCollectionAdapter,
    MetadataFieldRouter,
    SimpleAsyncClientAdapter,
)
Main Function Signature
async def execute_multi_collection(
    query_str: str,
    router: CollectionRouter,
    collection_provider: AsyncCollectionProvider,
    *,
    embed_fn: Optional[EmbedFunction] = None,
    merge_strategy: str = "distance",
    n_results_per_collection: Optional[int] = None,
) -> ExecutionResult:
    """Execute a ChromaSQL query across multiple collections."""
Protocol Signatures
class CollectionRouter(Protocol):
    def route(self, query: Query) -> Optional[Sequence[str]]:
        """Return collection names or None for all."""

class AsyncCollectionProvider(Protocol):
    async def get_collection(self, name: str) -> Any:
        """Get collection by name."""

    async def list_collection_names(self) -> Sequence[str]:
        """List all available collections."""
Questions & Support
For questions or issues:
- Check EXAMPLES.md for usage patterns
- Review CONTRIBUTING.md for architecture details
- Run tests: poetry run pytest tests/chromasql/ -v
- Open an issue with reproduction steps
Implementation completed: All tasks done ✅
Test coverage: 96% overall, 120 tests passing
Status: Ready for production use