Part I: Foundations

POMDP Formalization

The situation of a bounded system under uncertainty admits precise formalization as a Partially Observable Markov Decision Process (POMDP).

Existing Theory

The POMDP framework connects this analysis to several established research programs:

  • Active Inference (Friston et al., 2017): Organisms as inference machines that minimize expected free energy through action. The “belief state sufficiency” result below is their “Bayesian brain” hypothesis made formal.
  • Predictive Processing (Clark, 2013; Hohwy, 2013): The brain as a prediction engine, with perception as hypothesis-testing. The world model $\mathcal{W}$ is their “generative model.”
  • Good Regulator Theorem (Conant & Ashby, 1970): Every good regulator of a system must be a model of that system. The belief state sufficiency result below is a POMDP-specific instantiation.
  • Embodied Cognition (Varela, Thompson & Rosch, 1991): Cognition as enacted through sensorimotor coupling. My emphasis on the boundary as the locus of modeling aligns with enactivist insights.

Formally, a POMDP is a tuple $(\mathcal{X}, \mathcal{A}, \mathcal{O}, T, O, R, \gamma)$, whose components are as follows (a toy instance in code appears after the list):

  • $\mathcal{X}$: State space (the true world state, including the system interior)
  • $\mathcal{A}$: Action space
  • $\mathcal{O}$: Observation space
  • $T: \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$: Transition kernel, $T(\mathbf{x}' \mid \mathbf{x}, \mathbf{a})$
  • $O: \mathcal{X} \times \mathcal{O} \to [0,1]$: Observation kernel, $O(\mathbf{o} \mid \mathbf{x})$
  • $R: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$: Reward function
  • $\gamma \in [0,1)$: Discount factor
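
To make the components concrete, here is a minimal sketch of a discrete POMDP as plain NumPy arrays. The numbers are a hypothetical illustration (a simplified two-state “tiger” problem, with the observation depending only on the state so that it matches the $O(\mathbf{o} \mid \mathbf{x})$ kernel above), not anything prescribed by the analysis:

```python
import numpy as np

# States:       0 = tiger-left,  1 = tiger-right
# Actions:      0 = listen,      1 = open-left,  2 = open-right
# Observations: 0 = hear-left,   1 = hear-right
n_states, n_actions, n_obs = 2, 3, 2

# T[a, x, x'] = T(x' | x, a): listening leaves the state unchanged;
# opening a door resets the tiger's position uniformly at random.
T = np.empty((n_actions, n_states, n_states))
T[0] = np.eye(n_states)
T[1] = 0.5
T[2] = 0.5

# O[x, o] = O(o | x): listening is 85% accurate in this toy version.
O = np.array([[0.85, 0.15],
              [0.15, 0.85]])

# R[x, a]: small cost to listen, large penalty for opening the tiger's door.
R = np.array([[-1.0, -100.0,   10.0],
              [-1.0,   10.0, -100.0]])

gamma = 0.95
```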

The agent does not observe $\mathbf{x}_t$ directly but only $\mathbf{o}_t \sim O(\cdot \mid \mathbf{x}_t)$. The sufficient statistic for decision-making is the belief state, i.e., the posterior distribution over world states given the history:

$$b_t(\mathbf{x}) = \mathbb{P}(\mathbf{x}_t = \mathbf{x} \mid \mathbf{o}_{1:t}, \mathbf{a}_{1:t-1})$$

The belief state updates via Bayes’ rule:

$$b_{t+1}(\mathbf{x}') = \frac{O(\mathbf{o}_{t+1} \mid \mathbf{x}') \sum_{\mathbf{x}} T(\mathbf{x}' \mid \mathbf{x}, \mathbf{a}_t)\, b_t(\mathbf{x})}{\sum_{\mathbf{x}''} O(\mathbf{o}_{t+1} \mid \mathbf{x}'') \sum_{\mathbf{x}} T(\mathbf{x}'' \mid \mathbf{x}, \mathbf{a}_t)\, b_t(\mathbf{x})}$$
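
Continuing the toy instance above (so `T`, `O`, and NumPy are already in scope), the update is a predict-then-correct step. `belief_update` is my own naming for this sketch, not a standard API:

```python
def belief_update(b, a, o, T, O):
    """One step of the Bayes filter: predict through T, correct by O.

    b: belief over states at time t, shape (n_states,)
    a: index of the action taken at time t
    o: index of the observation received at time t+1
    """
    predicted = T[a].T @ b           # sum_x T(x' | x, a) b(x), for each x'
    unnorm = O[:, o] * predicted     # numerator: O(o_{t+1} | x') * predicted
    return unnorm / unnorm.sum()     # denominator normalizes to a distribution

b = np.array([0.5, 0.5])                  # uniform prior over the two states
b = belief_update(b, a=0, o=0, T=T, O=O)  # listen, hear "left"
print(b)                                  # [0.85 0.15]
```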

A classical result establishes that $b_t$ is a sufficient statistic for optimal decision-making: any optimal policy $\pi^*$ can be written as $\pi^*: \Delta(\mathcal{X}) \to \mathcal{A}$, mapping belief states to actions.
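
One way to see the force of this: a policy over beliefs never needs to mention the hidden state. The threshold rule below (hand-written for the toy instance above, and certainly not the optimal policy) is one such map $\pi: \Delta(\mathcal{X}) \to \mathcal{A}$:

```python
def policy(b, threshold=0.9):
    """A belief-to-action map for the toy problem above.

    A hand-written rule, not an optimal policy: act only once the
    belief is confident, otherwise pay the small cost to keep listening.
    """
    if b[0] >= threshold:
        return 2   # confident the tiger is left: open the right door
    if b[1] >= threshold:
        return 1   # confident the tiger is right: open the left door
    return 0       # still uncertain: listen again
```

The function reads only `b`; everything the hidden state could contribute to the decision is already summarized there, which is exactly what the sufficiency result says.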

This sufficiency is what licenses the central claim: any system that performs better than random under partial observability is implicitly maintaining and updating something functionally equivalent to a belief state, that is, a model of the world.