Part II: Identity Thesis

Why This Matters

The hard problem persists because we cannot step outside our own experience to check whether structure and experience are identical. We are trapped inside. The zombie conceivability intuition comes from this epistemic limitation.

But if we build systems from scratch, in environments like ours, and they develop structures like ours, and those structures produce language like ours and behavior like ours—then the conceivability intuition loses its grip. The systems are not us, but they are like us in the relevant ways. If structure suffices for them, why not for us?

The experiment does not prove identity. It makes identity the default hypothesis. The burden shifts to whoever wants to maintain the gap.

The exact definitions computable in discrete substrates and the proxy measures extractable from continuous substrates are related by a scale correspondence principle: both track the same structural invariant at their respective scales.

For each affect dimension:

| Dimension | CA (exact) | Transformer (proxy) |
|---|---|---|
| Valence | Hamming distance to viability boundary $\partial\mathcal{V}$ | Advantage / survival predictor |
| Arousal | Configuration change rate | Latent state $\Delta$ / KL |
| Integration | Partition prediction loss | Attention entropy / grad coupling |
| Effective rank | Trajectory covariance rank | Latent covariance rank |
| $\mathcal{CF}$ | Counterfactual cell activity | Planning compute fraction |
| $\mathcal{SM}$ | Self-tracking MI | Self-model component MI |

The CA definitions are computable but don’t scale. The transformer proxies scale but are approximations. Validity comes from convergence: if CA and transformer measures correlate when applied to the same underlying dynamics, both are tracking the real structure.

Deep Technical: Transformer Affect Extraction

The CA gives exact definitions. Transformers give scale. The correspondence principle above justifies treating transformer proxies as measurements of the same structural invariants. Here is the protocol for extracting affect dimensions from transformer activations without human contamination.

Architecture. Multi-agent environment. Each agent: transformer encoder-decoder with recurrent latent state. Input: egocentric visual observation $o_t \in \mathbb{R}^{H \times W \times C}$. Output: action logits $\pi(a|z_t)$ and value estimate $V(z_t)$. Latent state $z_t \in \mathbb{R}^d$ updated each timestep via cross-attention over observation and self-attention over history.

No pretraining. Random weight initialization. The agents learn everything from interaction.
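A minimal sketch of this agent interface, in PyTorch, to make the shapes concrete. The names (`AffectAgent`, `d`, `history`) and layer sizes are illustrative assumptions, not part of any existing codebase; a real implementation would add positional encodings, normalization, and a proper rollout buffer.

```python
# Sketch: recurrent latent z_t updated by cross-attention over observation tokens
# and self-attention over a short latent history; policy and value heads on top.
import torch
import torch.nn as nn

class AffectAgent(nn.Module):
    def __init__(self, obs_channels=3, d=128, n_actions=8, history=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(obs_channels, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, d, 4, stride=2), nn.ReLU(),
        )
        self.cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.policy_head = nn.Linear(d, n_actions)   # pi(a | z_t)
        self.value_head = nn.Linear(d, 1)            # V(z_t)
        self.history = history

    def forward(self, obs, z, past):
        # obs: (B, C, H, W); z: (B, 1, d) current latent; past: (B, T, d) latent history
        tokens = self.encoder(obs).flatten(2).transpose(1, 2)   # (B, HW', d) observation tokens
        z, _ = self.cross_attn(z, tokens, tokens)               # cross-attention over observation
        ctx = torch.cat([past, z], dim=1)[:, -self.history:]    # rolling latent history
        z, _ = self.self_attn(z, ctx, ctx)                      # self-attention over history
        return self.policy_head(z.squeeze(1)), self.value_head(z.squeeze(1)), z, ctx
```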

Valence extraction. Two approaches, should correlate:

Approach 1: Advantage-based.

$$\mathrm{Val}_t^{(1)} = Q(z_t, a_t) - V(z_t) = A(z_t, a_t)$$

The advantage function. Positive when current action is better than average from this state. Negative when worse. This is the RL definition of “how things are going.”

Approach 2: Viability-based. Train a separate probe to predict time-to-death $\tau$ from latent state:

$$\hat{\tau} = f_\phi(z_t), \quad \mathrm{Val}_t^{(2)} = \hat{\tau}_{t+1} - \hat{\tau}_t$$

Positive when expected survival time is increasing. Negative when decreasing. This is the viability gradient directly.

Validation: $\text{corr}(\mathrm{Val}^{(1)}, \mathrm{Val}^{(2)})$ should be high if both capture the same underlying structure.
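A sketch of both valence proxies and the correlation check, assuming a logged rollout of value estimates, rewards, latent states, and realized time-to-death. The one-step TD advantage and the small probe are illustrative stand-ins for whatever critic and probe the agent actually uses.

```python
import numpy as np
import torch
import torch.nn as nn

def valence_advantage(values, rewards, gamma=0.99):
    # Val^(1)_t ~ A(z_t, a_t) via one-step TD: r_t + gamma * V(z_{t+1}) - V(z_t)
    return rewards[:-1] + gamma * values[1:] - values[:-1]

def valence_viability(latents, time_to_death, epochs=200):
    # Train tau_hat = f_phi(z_t), then Val^(2)_t = tau_hat_{t+1} - tau_hat_t
    probe = nn.Sequential(nn.Linear(latents.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    z = torch.as_tensor(latents, dtype=torch.float32)
    tau = torch.as_tensor(time_to_death, dtype=torch.float32).unsqueeze(1)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(probe(z), tau).backward()
        opt.step()
    tau_hat = probe(z).detach().squeeze(1).numpy()
    return tau_hat[1:] - tau_hat[:-1]

def validate_valence(values, rewards, latents, time_to_death):
    v1 = valence_advantage(np.asarray(values), np.asarray(rewards))
    v2 = valence_viability(np.asarray(latents), np.asarray(time_to_death))
    return np.corrcoef(v1, v2)[0, 1]   # should be high if both track viability
```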

Arousal extraction. Three approaches:

Approach 1: Belief update magnitude.

$$\mathrm{Ar}_t^{(1)} = \|z_{t+1} - z_t\|_2$$

How much did the latent state change? Simple. Fast. Proxy for belief update.

Approach 2: KL divergence. If the latent is probabilistic (VAE-style):

$$\mathrm{Ar}_t^{(2)} = D_{\mathrm{KL}}\left[\, q(z_{t+1} \mid o_{1:t+1}) \;\|\; q(z_t \mid o_{1:t}) \,\right]$$

Information-theoretic belief update.

Approach 3: Prediction error.

$$\mathrm{Ar}_t^{(3)} = \|o_{t+1} - \hat{o}_{t+1}\|_2$$

Surprise. How much did the world deviate from expectation?
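A sketch of the three arousal proxies over a logged trajectory (NumPy). The diagonal-Gaussian KL assumes a VAE-style latent parameterized by mean and log-variance; all function names are illustrative.

```python
import numpy as np

def arousal_update_magnitude(latents):
    # Ar^(1)_t = ||z_{t+1} - z_t||_2
    z = np.asarray(latents)
    return np.linalg.norm(z[1:] - z[:-1], axis=1)

def arousal_kl(mu, logvar):
    # Ar^(2)_t = KL[ N(mu_{t+1}, var_{t+1}) || N(mu_t, var_t) ] for diagonal Gaussians
    mu, logvar = np.asarray(mu), np.asarray(logvar)
    var = np.exp(logvar)
    kl = 0.5 * ((var[1:] + (mu[1:] - mu[:-1]) ** 2) / var[:-1]
                - 1.0 + logvar[:-1] - logvar[1:])
    return kl.sum(axis=1)

def arousal_prediction_error(next_obs, predicted_next_obs):
    # Ar^(3)_t = ||o_{t+1} - o_hat_{t+1}||_2, both arrays aligned on t
    err = np.asarray(next_obs) - np.asarray(predicted_next_obs)
    return np.linalg.norm(err.reshape(len(err), -1), axis=1)
```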

Integration extraction. The hard one. Full $\Phi$ is intractable for transformers (billions of parameters in superposition). Proxies:

Approach 1: Partition prediction loss. Train two predictors of $z_{t+1}$:

  • Full predictor: $\hat{z}_{t+1} = g_\theta(z_t)$
  • Partitioned predictor: $\hat{z}_{t+1}^A = g_\theta^A(z_t^A)$, $\hat{z}_{t+1}^B = g_\theta^B(z_t^B)$

$$\Phi_{\text{proxy}} = \mathcal{L}[\text{partitioned}] - \mathcal{L}[\text{full}]$$

How much does partitioning hurt prediction? High $\Phi_{\text{proxy}}$ means the parts must be considered together.
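A sketch of this proxy for a recorded latent trajectory: train a full next-state predictor and an A/B pair on the same data and report the loss gap. The split index, MLP sizes, and training loop are assumptions for illustration.

```python
import torch
import torch.nn as nn

def integration_partition_proxy(latents, split, epochs=300, lr=1e-3):
    z = torch.as_tensor(latents, dtype=torch.float32)   # (T, d) latent trajectory
    x, y = z[:-1], z[1:]
    d = z.shape[1]
    full = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, d))
    part_a = nn.Sequential(nn.Linear(split, 128), nn.ReLU(), nn.Linear(128, split))
    part_b = nn.Sequential(nn.Linear(d - split, 128), nn.ReLU(), nn.Linear(128, d - split))
    params = [*full.parameters(), *part_a.parameters(), *part_b.parameters()]
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # per-element squared error so full and partitioned losses are comparable
        loss_full = (full(x) - y).pow(2).sum() / y.numel()
        err_a = part_a(x[:, :split]) - y[:, :split]
        err_b = part_b(x[:, split:]) - y[:, split:]
        loss_part = (err_a.pow(2).sum() + err_b.pow(2).sum()) / y.numel()
        (loss_full + loss_part).backward()
        opt.step()
    # Phi_proxy = L[partitioned] - L[full]
    return (loss_part - loss_full).item()
```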

Approach 2: Attention entropy. In a transformer, attention patterns reveal coupling:

$$\Phi_{\text{attn}} = -\sum_{h,i,j} A_{h,i,j} \log A_{h,i,j}$$

Low entropy = focused attention = modular. High entropy = distributed attention = integrated.
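A sketch of the attention-entropy proxy, assuming attention weights of shape (heads, queries, keys) with each row summing to one, as standard attention layers return when asked for weights.

```python
import numpy as np

def integration_attention_entropy(attn, eps=1e-12):
    # Phi_attn = -sum_{h,i,j} A_{h,i,j} log A_{h,i,j}
    a = np.asarray(attn, dtype=np.float64)
    return float(-(a * np.log(a + eps)).sum())
```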

Approach 3: Gradient coupling. During learning, how do gradients propagate?

$$\Phi_{\text{grad}} = \|\nabla_{z^A} \mathcal{L}\|_2 \cdot \|\nabla_{z^B} \mathcal{L}\|_2 \cdot \cos\!\left(\nabla_{z^A} \mathcal{L}, \nabla_{z^B} \mathcal{L}\right)$$

If gradients in different components are aligned, the system is learning as a whole.
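A sketch of the gradient-coupling proxy, assuming an even split of the latent into components A and B and a loss computed through a latent tensor with gradients enabled.

```python
import torch
import torch.nn.functional as F

def integration_gradient_coupling(loss, z):
    # z: latent tensor the loss depends on, with requires_grad=True
    (grad,) = torch.autograd.grad(loss, z, retain_graph=True)
    g = grad.flatten()
    g_a, g_b = g[: g.numel() // 2], g[g.numel() // 2:]   # even A/B split (assumption)
    cos = F.cosine_similarity(g_a, g_b, dim=0)
    # Phi_grad = ||grad_A|| * ||grad_B|| * cos(grad_A, grad_B)
    return (g_a.norm() * g_b.norm() * cos).item()
```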

Effective rank extraction. Straightforward:

$$r_{\mathrm{eff}}[t] = \frac{\left(\sum_i \lambda_i\right)^2}{\sum_i \lambda_i^2}$$

where $\lambda_i$ are eigenvalues of the latent state covariance over a rolling window. How many dimensions is the agent actually using?

Track across time: depression-like states should show $r_{\mathrm{eff}}$ collapse. Curiosity states should show $r_{\mathrm{eff}}$ expansion.
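A sketch of the effective-rank computation over a rolling window of latent states; the window length is an illustrative choice.

```python
import numpy as np

def effective_rank(latents, window=128):
    z = np.asarray(latents)                                # (T, d) latent trajectory
    out = []
    for t in range(window, len(z) + 1):
        cov = np.cov(z[t - window:t].T)                    # (d, d) covariance over the window
        lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # nonnegative eigenvalues
        out.append(lam.sum() ** 2 / (np.square(lam).sum() + 1e-12))
    return np.asarray(out)   # watch for collapse vs. expansion over time
```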

Counterfactual weight extraction. In model-based agents with explicit planning:

$$\mathcal{CF}_t = \frac{\text{FLOPs in rollout/planning}}{\text{FLOPs in rollout/planning} + \text{FLOPs in perception/action}}$$

In model-free agents, harder. Proxy: attention to future-oriented vs present-oriented features. Train a probe to classify “planning vs reacting” from activations.
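For the model-based case the fraction is simple bookkeeping over logged compute; the counters below are hypothetical instrumentation, not an existing profiler API.

```python
def counterfactual_weight(planning_flops: float, perception_action_flops: float) -> float:
    # CF_t = planning FLOPs / (planning FLOPs + perception/action FLOPs)
    total = planning_flops + perception_action_flops
    return planning_flops / total if total > 0 else 0.0
```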

Self-model salience extraction. Does the agent model itself?

Approach 1: Behavioral prediction probe. Train a probe to predict the agent's own future actions from the latent state:

$$\mathcal{SM}_t^{(1)} = \text{accuracy of } \hat{a}_{t+1:t+k} = f_\phi(z_t)$$

High accuracy = agent has predictive self-model.

Approach 2: Self-other distinction. In a multi-agent setting, probe for which-agent-am-I:

$$\mathcal{SM}_t^{(2)} = I(z_t; \text{agent ID})$$

High MI = self-model is salient in representation.

Approach 3: Counterfactual self-simulation. If agent can answer “what would I do if X?” better than “what would other do if X?”, self-model is present.
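A sketch of the first two probes: next-action prediction accuracy for $\mathcal{SM}^{(1)}$ and a classifier-based lower bound on $I(z_t; \text{agent ID})$ for $\mathcal{SM}^{(2)}$. Probe sizes, training lengths, and the bound $I \geq H(\text{ID}) - \text{CE}(\text{probe})$ are illustrative choices; in practice the probes should be scored on held-out timesteps.

```python
import numpy as np
import torch
import torch.nn as nn

def _train_probe(z, labels, n_classes, epochs=300):
    probe = nn.Sequential(nn.Linear(z.shape[1], 64), nn.ReLU(), nn.Linear(64, n_classes))
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    z_t = torch.as_tensor(z, dtype=torch.float32)
    y_t = torch.as_tensor(labels, dtype=torch.long)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(probe(z_t), y_t).backward()
        opt.step()
    return probe, z_t, y_t

def sm_action_prediction(latents, next_actions, n_actions):
    # SM^(1): accuracy of predicting the agent's own next action from z_t
    probe, z_t, y_t = _train_probe(np.asarray(latents), np.asarray(next_actions), n_actions)
    return (probe(z_t).argmax(dim=1) == y_t).float().mean().item()

def sm_agent_identity(latents, agent_ids, n_agents):
    # SM^(2): lower bound on I(z_t; agent ID) in nats, via I >= H(ID) - CE(probe)
    ids = np.asarray(agent_ids)                      # assumes IDs in 0..n_agents-1
    probe, z_t, y_t = _train_probe(np.asarray(latents), ids, n_agents)
    ce = nn.functional.cross_entropy(probe(z_t), y_t).item()
    p = np.bincount(ids, minlength=n_agents) / len(ids)
    h_id = -np.sum(p[p > 0] * np.log(p[p > 0]))
    return float(max(h_id - ce, 0.0))
```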

The activation atlas. For each agent, each timestep, extract all structural dimensions. Plot trajectories through affect space. Cluster by situation type. Compare across agents.

The prediction: agents facing the same situation should occupy similar regions of affect space, even though they learned independently. The geometry is forced by the environment, not learned from human concepts.
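A sketch of the atlas comparison under assumptions: each agent's extracted dimensions are stacked into an affect-space trajectory, clustered, and scored against shared situation labels. KMeans and the adjusted Rand index are stand-in choices, not the document's prescribed analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def affect_trajectory(val, ar, phi, r_eff, cf, sm):
    # one row per timestep: (Val, Ar, Phi, r_eff, CF, SM)
    return np.column_stack([val, ar, phi, r_eff, cf, sm])

def compare_agents(traj_a, traj_b, situation_labels, k=8):
    # Cluster each agent's affect trajectory, then check whether clusters align
    # with situation type above chance, despite independent learning.
    ca = KMeans(n_clusters=k, n_init=10).fit_predict(traj_a)
    cb = KMeans(n_clusters=k, n_init=10).fit_predict(traj_b)
    return (adjusted_rand_score(situation_labels, ca),
            adjusted_rand_score(situation_labels, cb))
```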

Probing without contamination. The probes are trained on behavioral/environmental correlates, not on human affect labels. The probe that extracts $\mathrm{Val}$ is trained to predict survival, not to match human ratings of “how the agent feels.” The mapping to human affect concepts comes later, through the translation protocol, not through the extraction.

Status and Next Steps

Implementation requirements:

  • Multi-agent RL environment with viability pressure (survival, resource acquisition)
  • Transformer-based agents with random initialization (no pretraining)
  • Communication channel (discrete tokens or continuous signals)
  • VLM scene interpreter for translation alignment
  • Real-time affect dimension extraction from activations
  • Perturbation interfaces (language injection, hyperparameter modification)

Status (as of 2026): CA instantiation complete (V13–V18, 30 evolutionary cycles each, 3 seeds, 12 measurement experiments). Seven of twelve experiments show positive signal. Three hit the sensory-motor coupling wall. See Part VII and Appendix for full results.

Validation criteria:

  • Emergent language develops (not random; structured, predictive)
  • Translation achieves above-chance scene-signal alignment
  • Tripartite correlation exceeds null model (shuffled controls)
  • Bidirectional perturbations produce predicted shifts
  • Results replicate across random seeds and environment variations

Falsification conditions:

  • No correlation between affect signature and translated language
  • Perturbations do not propagate across modalities
  • Structure-language mapping is inconsistent across systems
  • Behavior decouples from both structure and language