ROBOTICS DATA CURATION
In Development
Dataset optimization for robotics foundation models
We're building mathematical curation infrastructure for RLDS datasets. The goal: higher policy success rates from roughly an order of magnitude less training data, through coverage optimization and embedding-based selection.
groot-n2, openvla, octo, π0.5
What We're Building
GCP Cloud Batch integration for computing embeddings at scale with state-of-the-art robotics models
Hybrid ILP/greedy solver for subset selection with coverage guarantees and constraint satisfaction
Immutable versioning on GCS with complete lineage tracking and reproducibility
Target Launch: Q1 2026
APPROACH
Mathematical optimization, not manual curation
Traditional dataset curation relies on heuristics and manual selection. We're building infrastructure that treats curation as a constrained optimization problem: maximize coverage diversity while minimizing dataset size, subject to task-specific constraints.
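To make the optimization framing concrete, here is a minimal sketch of one greedy coverage heuristic. It is an illustration only, not the hybrid ILP/greedy solver described on this page: it uses the classic greedy k-center rule, which on a metric space is known to land within a factor of two of the optimal covering radius. The function name `greedy_k_center`, the toy 2-D embeddings, and the fixed budget are all assumptions made for the example.

```python
import math

def greedy_k_center(embeddings, budget, seed_index=0):
    """Pick `budget` points so that every point lies close to a selected one.

    Returns the selected indices and the covering radius (the largest
    distance from any point to its nearest selected point).
    """
    if not 0 < budget <= len(embeddings):
        raise ValueError("budget must be between 1 and the number of points")
    selected = [seed_index]
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(e, embeddings[seed_index]) for e in embeddings]
    while len(selected) < budget:
        # Add the point that is currently covered worst.
        next_index = max(range(len(embeddings)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, e in enumerate(embeddings):
            d = math.dist(e, embeddings[next_index])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected, max(min_dist)

# Toy episode embeddings (hypothetical): two tight clusters and one outlier.
embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
selected, radius = greedy_k_center(embeddings, budget=3)
print(selected, radius)
```

With a budget of three, the sketch keeps one episode from each cluster plus the outlier, which is the behavior a coverage objective is meant to encourage: near-duplicates are dropped before rare cases are.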
CAPABILITIES
End-to-end data optimization pipeline
OBJECTIVES
What we're optimizing for
Our research focuses on solving the dataset curation problem for robotics foundation models trained on diverse manipulation tasks.
Coverage Maximization
Ensure curated subsets maintain representative coverage across the operational distribution, including edge cases and safety-critical scenarios.
Metric: UMAP embedding space diversity
Data Efficiency
Minimize dataset size while preserving downstream policy performance, reducing compute costs and iteration cycles.
Target: Order-of-magnitude reduction
Reproducibility
Every optimization run is fully versioned and traceable, with complete artifact lineage for scientific rigor.
Infrastructure: Immutable GCS storage
Constraint Satisfaction
Support for domain-specific requirements via probabilistic soft logic rules and hard constraints.
Method: PSL + hybrid solver
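As a rough sketch of how soft rules and hard constraints can sit side by side, the example below scores a candidate subset. The soft rule uses the hinge-style "distance to satisfaction" from probabilistic soft logic, max(0, body - head), and the hard constraint is a per-task minimum quota. The rule itself (safety-critical episodes should be selected), the field names, and the helper functions are hypothetical and chosen only for illustration; they do not describe the solver's actual interface.

```python
from collections import Counter

def rule_penalty(body, head, weight=1.0):
    """Weighted distance to satisfaction of a soft rule `body -> head`.

    Truth values lie in [0, 1]; the rule is fully satisfied when head >= body.
    """
    return weight * max(0.0, body - head)

def meets_quotas(selected_tasks, min_per_task):
    """Hard constraint: every task keeps at least its minimum episode count."""
    counts = Counter(selected_tasks)
    return all(counts[task] >= minimum for task, minimum in min_per_task.items())

def soft_cost(episodes, selected_ids, weight=1.0):
    """Total penalty for the hypothetical rule safety_critical(e) -> selected(e)."""
    return sum(
        rule_penalty(ep["safety_critical"], 1.0 if ep["id"] in selected_ids else 0.0, weight)
        for ep in episodes
    )

# Hypothetical episodes with a soft truth value for "safety critical".
episodes = [
    {"id": "ep0", "task": "pick", "safety_critical": 0.9},
    {"id": "ep1", "task": "pick", "safety_critical": 0.1},
    {"id": "ep2", "task": "pour", "safety_critical": 0.5},
]
selected_ids = {"ep1", "ep2"}
selected_tasks = [ep["task"] for ep in episodes if ep["id"] in selected_ids]

feasible = meets_quotas(selected_tasks, {"pick": 1, "pour": 1})
cost = soft_cost(episodes, selected_ids, weight=2.0)
print(feasible, cost)
```

Here the subset satisfies the hard quotas but pays a soft penalty for leaving out `ep0`, the most safety-critical episode. A solver would reject infeasible subsets outright and trade the soft penalty off against coverage and size.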
INFRASTRUCTURE
Private by default, fully reproducible
Data Privacy
- All data stays in your GCP project
- No third-party compute dependencies
- Full IAM control and audit logs
Versioning
- Immutable artifacts on Google Cloud Storage
- Complete lineage tracking for datasets
- Reproducible optimization runs
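One common way to get immutable, traceable versions is content addressing: derive the version ID from a hash of the exact inputs, so identical runs map to the same ID and any change produces a new one. The sketch below assumes a hypothetical manifest layout (episode IDs, solver config, parent version) and is not the product's actual storage format. It uses only the Python standard library and does not touch GCS.

```python
import hashlib
import json

def version_id(episode_ids, config, parent=None):
    """Return a deterministic SHA-256 ID for a curated subset.

    The ID depends on the set of episodes, the solver config, and the
    parent version, but not on the order in which episodes are listed.
    """
    manifest = {
        "episodes": sorted(episode_ids),
        "config": config,
        "parent": parent,  # ID of the version this one was derived from
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hypothetical run: a root version, then a child derived from it.
config = {"budget": 1000, "solver": "greedy"}
root = version_id(["ep2", "ep0", "ep1"], config)
child = version_id(["ep0", "ep1"], config, parent=root)
print(root[:12], child[:12])
```

Because each ID folds in its parent's ID, following the `parent` fields back from any version reconstructs its full lineage, and an artifact stored under its ID can be checked against its own contents.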
Join the waitlist
We're currently in active development. Join our waitlist to be notified when we launch and to provide input on early product direction.