Early Access Preview—AitherOS is in active development. Features may change, break, or disappear.

Live

Connecting to services…

•

Live Demo

Invite Only

Theme

GitHub

Live Demo

Invite Only

Theme

GitHub

Back to blog

engineeringdeep-diveworld-modelmachine-learningarchitectureresearch

From Variance+Covariance to Cramer-Wold: Implementing SIGReg in a Production World Model

April 22, 202614 min readDavid Parkhurst

From Variance+Covariance to Cramer-Wold: Implementing SIGReg in a Production World Model

How a single theorem upgrade gives our latent world model mathematically provable collapse prevention — and opens the door to CEM planning, anomaly detection, and self-improvement.

The Problem

AitherOS manages {{services.total}} microservices across {{architecture.layers}} architectural layers. Services start, degrade, crash, recover. Dependencies cascade — if Redis goes down, half the stack follows. Resource pressure on GPU memory, CPU, or disk can trigger failures minutes before any health check notices.

To handle this, we built a Learned World Model: a JEPA-style latent predictor that learns state transition dynamics from real system data. The pipeline is:

GraphStateEncoder  -->  LatentPredictor  -->  MCTSPlanner
(768-dim state)        (256-dim latent)      (action selection)

The GraphStateEncoder compresses service mesh state into a 768-dimensional embedding. The LatentPredictor takes that embedding plus an action (restart, scale, deploy) and predicts what the next state will look like — in latent space. The MCTSPlanner uses those predictions to plan sequences of actions.

The critical question: how do you prevent the latent space from collapsing? If the encoder learns to map every state to the same point, prediction loss goes to zero trivially — the model has learned nothing.

Our original answer was IsotropicGaussianRegularizer: penalise dimensions with variance far from 1.0, and penalise off-diagonal covariance. Standard VICReg-style regularization:

# Old regularizer (removed)
variance_loss = F.relu(1.0 - var).mean()
covariance_loss = (off_diag ** 2).sum() / D

This works for the obvious collapse modes — constant output, rank-deficient subspaces. But it has a blind spot. Consider a bimodal distribution: half the batch maps to +1, the other half to -1. The variance is exactly 1.0. The covariance is zero. The regularizer reports loss = 0. But the distribution is not Gaussian — it's two point masses. And CEM planning, which samples from a continuous Gaussian in latent space, will sample between the two modes where no real states exist.

The old regularizer was blind to bimodal collapse. We needed something stronger.

The Paper

In early 2026, Maes et al. published LeWM (Learned World Model) — the first stable end-to-end JEPA that trains with only two loss terms. No EMA target network required. No stop-gradient tricks. No pretrained encoder. Just:

L_total = L_pred + lambda * L_SIGReg

The prediction loss is standard MSE between predicted and actual next-state latents. The innovation is entirely in the regularizer: SIGReg (Sketched-Isotropic-Gaussian Regularizer).

The Cramer-Wold Theorem

The mathematical foundation is a classical result from probability theory:

Cramer-Wold Theorem: A probability distribution in R^D is uniquely determined by the collection of all its one-dimensional projections.

Corollary: If for every unit vector v in R^D, the projection v^T z is distributed as N(0, 1), then z ~ N(0, I_D).

This is a theorem, not a heuristic. It means: instead of estimating a D x D covariance matrix and checking higher-order moments, you can enforce full joint Gaussianity by testing 1D projections. Since you can't test every direction, you sample M random unit-norm vectors and test each projection for Gaussianity using the Epps-Pulley test — a characteristic-function comparison that is fully differentiable:

EP statistic = mean_k [ |ECF(t_k)|^2 - |GCF(t_k)|^2 ]^2

where:
  ECF(t) = mean(cos(t*x))^2 + mean(sin(t*x))^2   (empirical)
  GCF(t) = exp(-t^2)                                (standard Gaussian)

The EP test compares the empirical characteristic function of each projected sample to the standard Gaussian characteristic function. Higher statistic = less Gaussian. Aggregate across M projections and you have SIGReg.

LeWM's key result: with SIGReg alone, EMA and stop-gradient are unnecessary. The regularizer is strong enough to prevent all collapse modes, and training converges smoothly with a single hyperparameter: lambda = 0.1.

Hyperparameter Comparison

System	Loss Terms	Free Hyperparameters	Tuning Complexity
PLDM	6+	O(n^6) grid search	Exponential
I-JEPA	3	~10	Large grid
VICReg	3	3 weights + EMA	Medium grid
LeWM	2	1 (lambda)	O(log n) bisection

The Implementation

Here is the actual SIGRegRegularizer from lib/cognitive/LatentPredictor.py:

class SIGRegRegularizer:
    """Sketched-Isotropic-Gaussian Regularizer via the Cramer-Wold theorem.

    Projects embeddings onto M random unit-norm directions and tests each
    projection for Gaussianity using the Epps-Pulley characteristic-function
    test. By Cramer-Wold, matching all 1D marginals to N(0,1) guarantees
    the full joint distribution matches N(0, I_D).
    """

    def __init__(self, latent_dim=256, num_projections=256, num_test_points=20):
        self._latent_dim = latent_dim
        self._num_projections = num_projections
        self._num_test_points = num_test_points
        self._directions = None   # [D, M] fixed random unit-norm directions
        self._test_points = None  # [K] Epps-Pulley evaluation grid

    def compute(self, embeddings):
        """Compute SIGReg loss on a batch of embeddings [B, D]."""
        B = embeddings.shape[0]
        if B < 4:
            return torch.tensor(0.0, device=embeddings.device, requires_grad=True)

        # Standardise to zero mean, unit variance per dimension
        std = embeddings.std(dim=0).clamp(min=1e-8)
        z = (embeddings - embeddings.mean(dim=0)) / std          # [B, D]

        # Project: [B, D] @ [D, M] -> [B, M]
        projections = z @ self._directions

        # Vectorised Epps-Pulley across all M projections at once
        t = self._test_points
        tp = t.unsqueeze(1).unsqueeze(2) * projections.unsqueeze(0)  # [K, B, M]

        cos_part = tp.cos().mean(dim=1)   # [K, M]
        sin_part = tp.sin().mean(dim=1)   # [K, M]
        ecf_sq = cos_part ** 2 + sin_part ** 2          # |ECF|^2
        gcf_sq = (-t ** 2).exp().unsqueeze(1)            # |N(0,1) CF|^2

        ep_stats = ((ecf_sq - gcf_sq) ** 2).mean(dim=0)  # [M]
        return ep_stats.mean()

What's happening step by step

Standardise the batch to zero mean, unit variance per dimension. This normalises the input so the EP test can compare against N(0, 1).
Project onto M=256 random unit-norm directions via a single matrix multiply: [B, D] @ [D, M] = [B, M]. Each column of the result is one 1D projection of the batch.
Vectorised Epps-Pulley test across all 256 projections simultaneously. For each of 20 test points, compute the empirical characteristic function (cos + sin components) and compare to the Gaussian characteristic function exp(-t^2). The squared difference gives the EP statistic.
Aggregate: mean EP statistic across all M projections = SIGReg loss.

The projection matrix is generated once at initialisation (random, fixed, not learned). This is critical: learned projections would optimise to avoid non-Gaussian regions, defeating the purpose. The entire computation is one matrix multiply plus some trigonometric ops — negligible overhead.

We chose M=256 projections (vs the paper's M=1024) because ablation studies show the test is insensitive above M=128, and with D=256 latent dimensions, M=D gives approximately one projection per dimension.

The Results

Collapse Mode Detection

We tested three distributions against both the old regularizer and SIGReg:

Distribution	Description	Old Regularizer	SIGReg
Normal N(0, I)	1000 samples from standard Gaussian	~0	0.002
Bimodal	Half at +1, half at -1 (var=1, cov=0)	~0	0.167
Collapsed	All identical (constant output)	detects	0.436

The critical row is bimodal. The old regularizer reports approximately zero loss for a bimodal distribution — it literally cannot see the problem. SIGReg returns 0.167, correctly identifying that the distribution is non-Gaussian. The loss ordering is provably correct:

normal (0.002)  <  bimodal (0.167)  <  collapsed (0.436)

This ordering is guaranteed by the mathematics: a standard Gaussian minimises the EP statistic (by definition), bimodal deviates from Gaussian but maintains some spread, and full collapse maximises deviation.

Gradient Flow

Gradients flow cleanly through the entire SIGReg computation (cos/sin are smooth, bounded, differentiable). Verified: embeddings.grad is non-None with correct shape [B, 256] after loss.backward().

Training Convergence

Over 100+ training steps on synthetic transition data, total loss (prediction + SIGReg) monotonically decreases. No oscillations, no mode collapse, no warmup hacks needed.

Full Test Suite: 23/23 Passing

7 SIGReg tests: normal loss near-zero, collapsed detection, correlated detection, bimodal detection, small-batch guard, gradient flow, loss ordering
4 LatentEncoder tests: build, single forward, batch forward, state dict
4 TargetEncoder tests: init from source, initial match, EMA divergence, no-gradient
7 LatentPredictor tests: build, predict, target, train step, loss decrease, EMA updates, status
1 singleton test: get_latent_predictor idempotence

What This Enables

SIGReg is Phase 1 of an 8-phase integration roadmap. Each subsequent phase builds on the mathematically guaranteed smooth, continuous latent space that SIGReg provides.

Phase 2: Transformer Encoder with AdaLN

Replace the MLP encoder with a 4-layer Transformer using Adaptive Layer Normalisation. Attention captures relational structure between services — "if Redis is degraded AND Veil is rebuilding, the combined effect is different from the sum of individual effects." The paper's ablation shows encoder architecture is agnostic when SIGReg is present, so even if the Transformer doesn't outperform the MLP, it's a safe bet.

Phase 3: EMA Removal

The paper's clearest ablation result: SIGReg alone prevents collapse. EMA is redundant. Removing the TargetEncoder eliminates a hyperparameter, simplifies the codebase, and reduces memory by dropping a full copy of the encoder.

Configuration	Converges?	Final Loss
SIGReg + EMA + stop-grad	Yes	1.0x
SIGReg only (no EMA)	Yes	1.0x
No SIGReg, no EMA	COLLAPSE	N/A

Phase 4: CEM Latent Planning

Cross-Entropy Method planning in latent space: sample 300 action sequences, roll them forward through the predictor, keep the top 10%, refine. The paper reports 48x speedup over foundation model planning. For AitherOS, this means incident response in milliseconds instead of seconds — CEM rolls out 5 candidate actions in <10ms, while MCTS would take seconds querying faculty graphs.

CEM requires a smooth, continuous latent space. Bimodal collapse (which the old regularizer allowed) would break it. SIGReg is the prerequisite.

Phase 5: Violation-of-Expectation Anomaly Detection

Surprise = prediction error that exceeds a running baseline. If the world model predicts Redis will stay healthy but it suddenly shows high latency, the surprise score spikes. Emit a Flux event. SchedulerLoop receives it. Autoheal preemptively scales Redis before it crashes.

Current anomaly detection is reactive (health check fails, then respond). VoE is predictive (prediction error spikes, then preempt). Statistical thresholds: 3-sigma = 0.27% false positive rate, 5-sigma for critical.

The Research Pipeline

The .JEPA/ directory contains 15 structured analysis files totalling 3,960 lines. Every file follows an 8-part template: Summary, Paper Analysis, AitherOS Current State, Integration Design, Why This Works, Why We'd Use This, Implementation Estimate, and Proof of Concept.

This template is reusable for any paper. The analysis feeds into multiple systems: KnowledgeGraph for queryable cross-references, AitherEvolution for candidate improvements, AutoResearch for hyperparameter recommendations, and Lyra for automated paper analysis.

Existing Infrastructure

SIGReg doesn't exist in a vacuum. AitherOS already has the substrate for a self-improving world model.

5,410 real state transitions already logged in transitions.jsonl. Every 30 minutes, the LearnedWorldModel retrains — and now co-trains the LatentPredictor with SIGReg on the same data batch. Cross-domain contrastive training runs automatically when multiple transition domains are present.

NanoGPT Lab: experiment registry, sweep runner, evaluation harness. When we wire lambda bisection into AutoResearch, the Lab handles the sweep automatically.

AutoResearch lambda bisection: binary search on [0.001, 10.0]. If the world model shows signs of collapse, increase lambda. If prediction loss is too high, decrease lambda. Converges in ~10 experiments — O(log n) tuning for the only free hyperparameter.

Computational Cost

SIGReg adds negligible overhead. For our configuration (B=64, D=256, M=256, K=20):

Operation	FLOPs	GPU Time
Standardise z	16K	<0.001ms
Project z @ V	4M	~0.01ms
EP test (all M)	327K	~0.005ms
Total SIGReg	~4.3M	~0.015ms
Forward pass	~100M	~0.3ms
Overhead		~5%

Five percent compute overhead for provably better regularization. No new dependencies. No new hyperparameters beyond lambda. The projection matrix is a fixed random tensor — zero learned parameters added.

Conclusion

SIGReg is a mathematically principled upgrade with zero downside. The Cramer-Wold theorem guarantees what variance+covariance cannot: if all 1D projections are Gaussian, the full joint distribution is Gaussian. No bimodal collapse. No nonlinear manifold collapse. No higher-order correlation collapse. Provably.

The implementation is 76 lines of vectorised PyTorch. The swap was drop-in — same .compute(embeddings) interface, same file, same training loop. 23 tests pass, including 2 new tests that specifically demonstrate the bimodal detection gap the old regularizer had.

The 8-phase roadmap builds from here: Transformer encoder, EMA removal, CEM planning, VoE anomaly detection, research automation, training integration, knowledge graph. Total: ~1,567 net new lines across 8 files. Each phase delivers standalone value and has a clean rollback path.

15M parameters. Single GPU. One hyperparameter. Provable collapse prevention. That's a good trade.

Implementation: lib/cognitive/LatentPredictor.py. Tests: dev/tests/test_latent_predictor.py (23/23 passing). Research corpus: .JEPA/ (15 files, 3,960 lines). Paper: Maes et al., "LeWM: Learned World Models with SIGReg" (2026).

Enjoyed this post?

All posts Try AitherOS

Back to blog

engineeringdeep-diveworld-modelmachine-learningarchitectureresearch

From Variance+Covariance to Cramer-Wold: Implementing SIGReg in a Production World Model

April 22, 202614 min readDavid Parkhurst

From Variance+Covariance to Cramer-Wold: Implementing SIGReg in a Production World Model

How a single theorem upgrade gives our latent world model mathematically provable collapse prevention — and opens the door to CEM planning, anomaly detection, and self-improvement.

The Problem

To handle this, we built a Learned World Model: a JEPA-style latent predictor that learns state transition dynamics from real system data. The pipeline is:

GraphStateEncoder  -->  LatentPredictor  -->  MCTSPlanner
(768-dim state)        (256-dim latent)      (action selection)

Our original answer was IsotropicGaussianRegularizer: penalise dimensions with variance far from 1.0, and penalise off-diagonal covariance. Standard VICReg-style regularization:

# Old regularizer (removed)
variance_loss = F.relu(1.0 - var).mean()
covariance_loss = (off_diag ** 2).sum() / D

The old regularizer was blind to bimodal collapse. We needed something stronger.

The Paper

L_total = L_pred + lambda * L_SIGReg

The prediction loss is standard MSE between predicted and actual next-state latents. The innovation is entirely in the regularizer: SIGReg (Sketched-Isotropic-Gaussian Regularizer).

The Cramer-Wold Theorem

The mathematical foundation is a classical result from probability theory:

Cramer-Wold Theorem: A probability distribution in R^D is uniquely determined by the collection of all its one-dimensional projections.

Corollary: If for every unit vector v in R^D, the projection v^T z is distributed as N(0, 1), then z ~ N(0, I_D).

EP statistic = mean_k [ |ECF(t_k)|^2 - |GCF(t_k)|^2 ]^2

where:
  ECF(t) = mean(cos(t*x))^2 + mean(sin(t*x))^2   (empirical)
  GCF(t) = exp(-t^2)                                (standard Gaussian)

Hyperparameter Comparison

System	Loss Terms	Free Hyperparameters	Tuning Complexity
PLDM	6+	O(n^6) grid search	Exponential
I-JEPA	3	~10	Large grid
VICReg	3	3 weights + EMA	Medium grid
LeWM	2	1 (lambda)	O(log n) bisection

The Implementation

Here is the actual SIGRegRegularizer from lib/cognitive/LatentPredictor.py:

class SIGRegRegularizer:
    """Sketched-Isotropic-Gaussian Regularizer via the Cramer-Wold theorem.

    Projects embeddings onto M random unit-norm directions and tests each
    projection for Gaussianity using the Epps-Pulley characteristic-function
    test. By Cramer-Wold, matching all 1D marginals to N(0,1) guarantees
    the full joint distribution matches N(0, I_D).
    """

    def __init__(self, latent_dim=256, num_projections=256, num_test_points=20):
        self._latent_dim = latent_dim
        self._num_projections = num_projections
        self._num_test_points = num_test_points
        self._directions = None   # [D, M] fixed random unit-norm directions
        self._test_points = None  # [K] Epps-Pulley evaluation grid

    def compute(self, embeddings):
        """Compute SIGReg loss on a batch of embeddings [B, D]."""
        B = embeddings.shape[0]
        if B < 4:
            return torch.tensor(0.0, device=embeddings.device, requires_grad=True)

        # Standardise to zero mean, unit variance per dimension
        std = embeddings.std(dim=0).clamp(min=1e-8)
        z = (embeddings - embeddings.mean(dim=0)) / std          # [B, D]

        # Project: [B, D] @ [D, M] -> [B, M]
        projections = z @ self._directions

        # Vectorised Epps-Pulley across all M projections at once
        t = self._test_points
        tp = t.unsqueeze(1).unsqueeze(2) * projections.unsqueeze(0)  # [K, B, M]

        cos_part = tp.cos().mean(dim=1)   # [K, M]
        sin_part = tp.sin().mean(dim=1)   # [K, M]
        ecf_sq = cos_part ** 2 + sin_part ** 2          # |ECF|^2
        gcf_sq = (-t ** 2).exp().unsqueeze(1)            # |N(0,1) CF|^2

        ep_stats = ((ecf_sq - gcf_sq) ** 2).mean(dim=0)  # [M]
        return ep_stats.mean()

What's happening step by step

Standardise the batch to zero mean, unit variance per dimension. This normalises the input so the EP test can compare against N(0, 1).
Project onto M=256 random unit-norm directions via a single matrix multiply: [B, D] @ [D, M] = [B, M]. Each column of the result is one 1D projection of the batch.
Vectorised Epps-Pulley test across all 256 projections simultaneously. For each of 20 test points, compute the empirical characteristic function (cos + sin components) and compare to the Gaussian characteristic function exp(-t^2). The squared difference gives the EP statistic.
Aggregate: mean EP statistic across all M projections = SIGReg loss.

The Results

Collapse Mode Detection

We tested three distributions against both the old regularizer and SIGReg:

Distribution	Description	Old Regularizer	SIGReg
Normal N(0, I)	1000 samples from standard Gaussian	~0	0.002
Bimodal	Half at +1, half at -1 (var=1, cov=0)	~0	0.167
Collapsed	All identical (constant output)	detects	0.436

normal (0.002)  <  bimodal (0.167)  <  collapsed (0.436)

Gradient Flow

Training Convergence

Over 100+ training steps on synthetic transition data, total loss (prediction + SIGReg) monotonically decreases. No oscillations, no mode collapse, no warmup hacks needed.

Full Test Suite: 23/23 Passing

7 SIGReg tests: normal loss near-zero, collapsed detection, correlated detection, bimodal detection, small-batch guard, gradient flow, loss ordering
4 LatentEncoder tests: build, single forward, batch forward, state dict
4 TargetEncoder tests: init from source, initial match, EMA divergence, no-gradient
7 LatentPredictor tests: build, predict, target, train step, loss decrease, EMA updates, status
1 singleton test: get_latent_predictor idempotence

What This Enables

SIGReg is Phase 1 of an 8-phase integration roadmap. Each subsequent phase builds on the mathematically guaranteed smooth, continuous latent space that SIGReg provides.

Phase 2: Transformer Encoder with AdaLN

Phase 3: EMA Removal

Configuration	Converges?	Final Loss
SIGReg + EMA + stop-grad	Yes	1.0x
SIGReg only (no EMA)	Yes	1.0x
No SIGReg, no EMA	COLLAPSE	N/A

Phase 4: CEM Latent Planning

CEM requires a smooth, continuous latent space. Bimodal collapse (which the old regularizer allowed) would break it. SIGReg is the prerequisite.

Phase 5: Violation-of-Expectation Anomaly Detection

The Research Pipeline

Existing Infrastructure

SIGReg doesn't exist in a vacuum. AitherOS already has the substrate for a self-improving world model.

NanoGPT Lab: experiment registry, sweep runner, evaluation harness. When we wire lambda bisection into AutoResearch, the Lab handles the sweep automatically.

Computational Cost

SIGReg adds negligible overhead. For our configuration (B=64, D=256, M=256, K=20):

Operation	FLOPs	GPU Time
Standardise z	16K	<0.001ms
Project z @ V	4M	~0.01ms
EP test (all M)	327K	~0.005ms
Total SIGReg	~4.3M	~0.015ms
Forward pass	~100M	~0.3ms
Overhead		~5%

Conclusion

15M parameters. Single GPU. One hyperparameter. Provable collapse prevention. That's a good trade.

Enjoyed this post?

All posts Try AitherOS