# Architecture
## Layer Overview
```
┌───────────────────────────────────────────────────────┐
│ ProblemDefinition                                     │
│ Variables, Objectives, Constraints, Scenarios         │
└───────────────────────────┬───────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ SurrogateManager                                      │
│ Feature Reduction → Optuna HPO → Ensemble → Conformal │
└───────────────────────────┬───────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Optimizer                                             │
│ ≤15D: pymoo (DE/GA/NSGA-II/III)                       │
│ >15D: TuRBO (local GP, trust regions)                 │
└───────────────────────────┬───────────────────────────┘
                            │
                            ▼
┌───────────────────────────────────────────────────────┐
│ Analysis                                              │
│ Summary (auto) + Detail (lazy): SHAP, PDP, ...        │
└───────────────────────────────────────────────────────┘
```
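The dimensionality switch in the Optimizer layer amounts to a simple dispatch. A minimal sketch, assuming a 15-variable cutoff as shown above; the constant and function names here are illustrative, not the library's actual API.

```python
# Illustrative dispatch mirroring the dual-strategy optimizer described above.
# PYMOO_DIM_LIMIT and select_optimizer are assumed names, not the real API.
PYMOO_DIM_LIMIT = 15  # pymoo (DE/GA/NSGA-II/III) up to 15 variables

def select_optimizer(n_vars: int) -> str:
    """Choose the backend: global evolutionary search in low dimensions,
    TuRBO trust-region optimization over local GPs in high dimensions."""
    return "pymoo" if n_vars <= PYMOO_DIM_LIMIT else "turbo"
```

For example, `select_optimizer(8)` returns `"pymoo"`, while `select_optimizer(40)` returns `"turbo"`.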
## Data Flow
- ProblemDefinition is constructed and validated. It flows through every layer as an immutable reference.
- BoundDataset binds a DataFrame to the problem, validating column presence, dtypes, bounds, and missing values.
- SurrogateManager trains one ensemble per surrogate_column (deduplicated across objectives and data constraints). Optional feature reduction (importance screening + correlation grouping) preprocesses high-dimensional inputs. Each ensemble is an Optuna-selected set of XGBoost, LightGBM, GP, and TabICL models with conformal calibration.
- Optimizer evaluates candidates on surrogates, applies linear and data constraints, detects extrapolation, and returns Pareto-optimal points.
- Analysis produces a summary automatically. The Analyzer object allows lazy, cached detail analyses.
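The extrapolation check in the Optimizer step can be sketched with plain NumPy. This is a minimal illustration of k-NN distance flagging, assuming a mean-distance statistic and a 95th-percentile threshold calibrated on the training set; the function names and defaults are assumptions, not the actual implementation.

```python
import numpy as np

def knn_distance(points, train, k=5):
    """Mean Euclidean distance from each point to its k nearest training rows."""
    d = np.linalg.norm(points[:, None, :] - train[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, :k].mean(axis=1)

def extrapolation_mask(candidates, train, k=5, q=95):
    """Flag candidates whose k-NN distance exceeds the q-th percentile of
    in-sample k-NN distances (self-matches at distance 0 are skipped)."""
    d = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=-1)
    d.sort(axis=1)
    threshold = np.percentile(d[:, 1:k + 1].mean(axis=1), q)
    return knn_distance(candidates, train, k) > threshold
```

Candidates far from every training row are flagged as extrapolation; points inside the sampled region pass.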
## Key Design Decisions
| Decision | Rationale |
| --- | --- |
| Immutable models (frozen Pydantic) | Prevents accidental mutation across layers |
| One surrogate per unique column | Objectives and data constraints sharing a column reuse the same surrogate (ADR-007) |
| Dual-strategy optimizer | Global surrogates + pymoo for ≤15D, TuRBO for high-D (ADR-025) |
| Greedy ensemble selection | Caruana-style forward selection guarantees ensemble ≥ best single model (ADR-009) |
| Conformal prediction | Distribution-free uncertainty intervals with coverage guarantees |
| Extrapolation detection | k-NN distance flags candidates outside the training domain |
| Lazy detail analysis | SHAP/PDP are expensive; computed only when requested and cached |
| Facade functions (run, run_scenarios) | Simple entry points hiding the layer orchestration |
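The greedy ensemble selection decision can be made concrete. Below is a sketch of Caruana-style forward selection with replacement under a simple RMSE criterion; `greedy_ensemble` and its stopping rule are illustrative assumptions, not the library's implementation. The first round necessarily picks the best single model, which is why the final ensemble can never score worse than it.

```python
import numpy as np

def greedy_ensemble(preds, y, max_rounds=10):
    """Caruana-style forward selection with replacement (illustrative sketch).

    preds: {model_name: validation predictions}; y: validation targets.
    Each round adds the model that most reduces ensemble RMSE on the
    validation set; selection stops when no candidate improves. Round 1
    picks the best single model, so the result is never worse than it.
    """
    def rmse(p):
        return float(np.sqrt(np.mean((p - y) ** 2)))

    selected, running, current = [], np.zeros_like(y, dtype=float), np.inf
    for _ in range(max_rounds):
        # Score each model as if averaged into the current ensemble.
        scored = [(rmse((running * len(selected) + p) / (len(selected) + 1)), name)
                  for name, p in preds.items()]
        best_err, best_name = min(scored)
        if best_err >= current:
            break  # no model improves the ensemble; keep current selection
        selected.append(best_name)
        running = (running * (len(selected) - 1) + preds[best_name]) / len(selected)
        current = best_err
    return selected, current
```

With two models whose errors cancel (one biased high, one biased low), the sketch picks the better single model first, then adds the second because the average strictly improves validation RMSE.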