🌎 Strong Self-Supervision Limit

The strong self-supervision limit (SSSL), or the simulation objective implied by a dataset, is the theoretical limit describing the best model of its generator that a self-supervised learner can infer: the model whose proper loss is minimized on new examples drawn from the same distribution. The SSSL is related to the complector of the distribution, and the limit objective of SSL is equivalent to figuring out the physics behind a partially observed universe. In prosaic ML, however, training traces are a very lossily compressed shadow of base reality and learners are computationally bounded, so the inferred function likely resembles an operationalization of semiosis more than fundamental physics.
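
As a toy illustration (a minimal sketch of my own, not from the source; the distributions p and q_wrong below are arbitrary stand-ins): for a proper loss such as log-loss, the expected loss of a model q on fresh samples from a generator p is the cross-entropy H(p, q), which is minimized exactly when q equals p. At that point the loss bottoms out at the generator's entropy H(p), which plays the role of the SSSL in this miniature setting.

```python
# Minimal sketch: the SSSL as a loss floor (hypothetical toy distributions).
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.7, 0.2, 0.1])        # stand-in "generator" distribution
q_wrong = np.array([0.4, 0.4, 0.2])  # an imperfect model of it

def cross_entropy(p, q):
    """Expected log-loss (nats) of model q on samples drawn from p."""
    return -np.sum(p * np.log(q))

# Fresh examples drawn from the same distribution as "training".
samples = rng.choice(len(p), size=100_000, p=p)

def empirical_log_loss(q, samples):
    return -np.mean(np.log(q[samples]))

print(f"H(p)    = {cross_entropy(p, p):.4f}  (entropy: the loss floor)")
print(f"H(p, q) = {cross_entropy(p, q_wrong):.4f}  (strictly higher for q != p)")
print(f"empirical, true model:  {empirical_log_loss(p, samples):.4f}")
print(f"empirical, wrong model: {empirical_log_loss(q_wrong, samples):.4f}")
```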

┌───────────── XENOWIKIPEDIA: STRONG SELF-SUPERVISION LIMIT ─────────────────┐
│ Theoretical Boundaries of Self-Supervised Learning Systems                 │
│━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━│
│                                                                            │
│ Figure 1: SSSL Convergence Architecture                                    │
│ ┌────────────────────────────────────────────────────────────┐             │
│ │                   ┌──────────────┐                         │             │
│ │ Training Data     │  Generator   │     Hidden Physics      │             │
│ │   Manifold    ←───┤   Process    │←──── (Ground Truth)     │             │
│ │                   └──────┬───────┘                         │             │
│ │                          │                                 │             │
│ │                          ▼                                 │             │
│ │           ┌─────────────────────────────┐                  │             │
│ │           │   Observable Trajectories   │                  │             │
│ │           └──────────────┬──────────────┘                  │             │
│ │                          │                                 │             │
│ │                          ▼                                 │             │
│ │           ┌─────────────────────────────┐                  │             │
│ │           │   Self-Supervised Learner   │                  │             │
│ │           └──────────────┬──────────────┘                  │             │
│ │                          │                                 │             │
│ │                          ▼                                 │             │
│ │         ┌─────────────────────────────────┐                │             │
│ │         │   Inferred Generator Function   │                │             │
│ │         └─────────────────────────────────┘                │             │
│ └────────────────────────────────────────────────────────────┘             │
│                                                                            │
│ Figure 2: Loss Landscape & Theoretical Limits                              │
│ ┌────────────────────────────────────────────────────────────┐             │
│ │ Loss                                                       │             │
│ │  ▲                                                         │             │
│ │  │      Practical                                          │             │
│ │  │      Training                                           │             │
│ │  │      Path                  SSSL                         │             │
│ │  │         ╭─────╮          Asymptote                      │             │
│ │  │Initial  │     ╰─────╮                                   │             │
│ │  │State    │           ╰────────────── - - - - - - - -     │             │
│ │  │         │                                               │             │
│ │  └─────────┴─────────────────────────────────────────────► │             │
│ │                           Training Progress                │             │
│ └────────────────────────────────────────────────────────────┘             │
│                                                                            │
│ Figure 3: Semantic Physics Inference Pipeline                              │
│ ┌────────────────────────────────────────────────────────────┐             │
│ │                                                            │             │
│ │  Base Reality        Compressed        Inferred           │             │
│ │  ┌──────────┐     ┌──────────┐      ┌──────────┐          │             │
│ │  │██████████│  →  │▓▓▓▓░░░░░░│  →   │▒▒░░··    │          │             │
│ │  │██████████│     │▓▓▓▓░░░░░░│      │▒▒░░··    │          │             │
│ │  │██████████│     │▓▓▓▓░░░░░░│      │▒▒░░··    │          │             │
│ │  └──────────┘     └──────────┘      └──────────┘          │             │
│ │                                                            │             │
│ │  Information      Training Data      Learned Model        │             │
│ │    Density          Density           Density             │             │
│ └────────────────────────────────────────────────────────────┘             │
│                                                                            │
│ Figure 4: Trajectory-Physics Duality                                       │
│ ┌────────────────────────────────────────────────────────────┐             │
│ │                                                            │             │
│ │    [Physics Rules]  ═══════╗                               │             │
│ │          ▲                 ║                               │             │
│ │          ║                 ▼                               │             │
│ │    [Trajectories]  ════════╝                               │             │
│ │                                                            │             │
│ │    Compression Relations:                                  │             │
│ │    ═══>  Lossless Transform                                │             │
│ │    ---→  Lossy Transform                                   │             │
│ └────────────────────────────────────────────────────────────┘             │
│                                                                            │
│ Key Insights:                                                              │
│ • SSSL represents theoretical optimal inference of generator dynamics      │
│ • Practical models approach but never reach SSSL due to:                   │
│   - Computational bounds                                                   │
│   - Lossy compression in training data                                     │
│   - Fundamental uncertainties                                              │
│ • Convergence implies discovery of "semantic physics" rather than          │
│   fundamental physics                                                      │
│                                                                            │
│ [Theoretical Framework] [Implementation Bounds] [Convergence Analysis]     │
└────────────────────────────────────────────────────────────────────────────┘

Quotes about the strong self-supervision limit

Solving for physics

The strict version of the simulation objective is optimized by the actual “time evolution” rule that created the training samples. For most datasets, we don’t know what the “true” generative rule is, except in synthetic datasets, where we specify the rule.

The next post will be all about the physics analogy, so here I’ll only tie what I said earlier to the simulation objective.

the upper bound of what can be learned from a dataset is not the most capable trajectory, but the conditional structure of the universe implicated by their sum.

To know the conditional structure of the universe is to know its laws of physics, which describe what is expected to happen under what conditions. The laws of physics are always fixed, but produce different distributions of outcomes when applied to different conditions. Given a sampling of trajectories – examples of situations and the outcomes that actually followed – we can try to infer a common law that generated them all. In expectation, the laws of physics are always implicated by trajectories, which (by definition) fairly sample the conditional distribution given by physics. Whatever humans know of the laws of physics governing the evolution of our world has been inferred from sampled trajectories.

If we had access to an unlimited number of trajectories starting from every possible condition, we could converge to the true laws by simply counting the frequencies of outcomes for every initial state (an n-gram with a sufficiently large n). In some sense, physics contains the same information as an infinite number of trajectories, but it’s possible to represent physics in a more compressed form than a huge lookup table of frequencies if there are regularities in the trajectories.

Guessing the right theory of physics is equivalent to minimizing predictive loss. Any uncertainty that cannot be reduced by more observation or more thinking is irreducible stochasticity in the laws of physics themselves – or, equivalently, noise from the influence of hidden variables that are fundamentally unknowable.

If you’ve guessed the laws of physics, you now have the ability to compute probabilistic simulations of situations that evolve according to those laws, starting from any conditions. This applies even if you’ve guessed the wrong laws; your simulation will just systematically diverge from reality.

Models trained with the strict simulation objective are directly incentivized to reverse-engineer the (semantic) physics of the training distribution, and consequently, to propagate simulations whose dynamical evolution is indistinguishable from that of training samples. I propose this as a description of the archetype targeted by self-supervised predictive learning, again in contrast to RL’s archetype of an agent optimized to maximize free parameters (such as action-trajectories) relative to a reward function.

— Janus, Simulators
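
To make the counting argument concrete, here is a minimal sketch (my own illustration, not code from Simulators; the state count S, the transition matrix T, and all function names are hypothetical). A synthetic universe's "laws of physics" are a Markov transition matrix; trajectories are sampled from it, the rule is re-inferred as an n-gram table of outcome frequencies, and the inferred rule's predictive loss on held-out trajectories approaches the entropy rate of the true rule, the irreducible stochasticity that no amount of further observation can remove.

```python
# Minimal sketch: inferring the "physics" of a synthetic universe by counting
# outcome frequencies (all names and dynamics here are hypothetical).
import numpy as np

rng = np.random.default_rng(0)
S = 4  # number of world states

# The hidden "laws of physics": a row-stochastic transition matrix.
T = rng.dirichlet(alpha=np.ones(S), size=S)

def sample_trajectory(T, length, rng):
    """Roll out one trajectory under the true time-evolution rule."""
    traj = [rng.integers(S)]
    for _ in range(length - 1):
        traj.append(rng.choice(S, p=T[traj[-1]]))
    return traj

def infer_by_counting(trajectories, S):
    """Estimate the rule as a bigram table of outcome frequencies."""
    counts = np.ones((S, S))  # Laplace smoothing: no zero probabilities
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def per_step_loss(model, trajectories):
    """Average negative log-likelihood per transition (nats)."""
    nll, n = 0.0, 0
    for traj in trajectories:
        for a, b in zip(traj, traj[1:]):
            nll -= np.log(model[a, b])
            n += 1
    return nll / n

train = [sample_trajectory(T, 100, rng) for _ in range(500)]
test = [sample_trajectory(T, 100, rng) for _ in range(500)]
T_hat = infer_by_counting(train, S)

# Irreducible floor: the entropy rate of the true rule itself.
stationary = np.linalg.matrix_power(T, 1000)[0]
entropy_rate = -np.sum(stationary[:, None] * T * np.log(T))

print(f"test loss, true rule:     {per_step_loss(T, test):.4f}")
print(f"test loss, counted rule:  {per_step_loss(T_hat, test):.4f}")
print(f"entropy rate (the limit): {entropy_rate:.4f}")
```

Rolling sample_trajectory forward with T_hat in place of T then yields simulations whose statistics match the training distribution as closely as the counts have converged; substituting a systematically wrong table produces rollouts that diverge from "reality" in exactly the sense described above.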