我最近阅读了 Google Research 的两篇论文【1】、【2】,并回溯了最早把线性注意力机制明确表述出来的开创性论文【3】。我在阅读过程中再一次体会到视角转换如何为一个理论框架赋予新的想象空间。如果你没有经历过这种转换,就很容易低估它的深远意义。联想量子力学的发展历史:
\[
\text{矩阵力学 / 波动力学} \Rightarrow \text{量子场论 / 正则量子化} \Rightarrow \text{路径积分}
\]
每次变革不仅仅是用一套新的数学公式重写既有知识,而是用一套全新的视角和语言审视理论。在这个意义上,(数学)语言不仅是思维的边界,它内嵌了思维的空间结构。
线性注意力机制打开了什么空间?
传统的注意力机制本质上做了两件事情:
- 通过可学习的投影将输入 \(X\) 映射为三个矩阵:Query、Key、Value(\(Q=XW_Q,\ K=XW_K,\ V=XW_V\))
- 计算输出:\(\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V\)(为了表述简便,忽略掩码矩阵)
全局注意力的瓶颈在于全对(all-pairs)交互本身:需要计算一个 \(n\times n\) 的 query–key 得分矩阵 \(QK^\top\)(通常还叠加位置编码或相对位置偏置),与之伴随的是随序列长度平方增长的时间与显存开销 \(O(n^2)\),以及超长上下文的工程难题。若将注意力核从 softmax 放宽为可分解的核函数 \(\kappa(q,k)=\phi(q)^\top\phi(k)\)(或用随机特征/显式特征映射近似 softmax kernel),则归一化后的核注意力可以写成:
\[
\mathrm{Attn}(Q,K,V) = \frac{\phi(Q)\big(\phi(K)^\top V\big)}{\phi(Q)\big(\phi(K)^\top \mathbf{1}\big)},
\]
其中 \(\mathbf{1}\) 为全 1 向量,除法逐行进行;实践中通常要求 \(\phi(\cdot)\ge 0\) 以保证分母良定。
在自回归(causal)情形下,令
\[
S_t = \sum_{i \le t} \phi(k_i) v_i^\top, \qquad z_t = \sum_{i \le t} \phi(k_i)
\]
则输出可写成:
\[
o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}
\]
其中 \(S_t\) 是一个随 token 流递推更新、大小与上下文长度无关的中间态。分母项 \(z_t\) 的归一化对于数值稳定和长度增长时的尺度漂移至关重要;但归一化并不能消除信息损失:把不断增长的历史压缩进固定大小的状态,必然带来近因偏置(recency bias)与检索退化,除非更新规则带有显式的选择性写入/遗忘机制。Katharopoulos 等人指出,这种递推实现使线性注意力在因果场景下呈现出明确的 RNN 结构:注意力从"扫描全部 KV 列表"转为"更新固定大小的隐状态并进行读出"【3】。
这种重述带来了新的视角。中间状态 \(S_t\) 描述了一个随着输入流更新的记忆状态,也就是 RNN 里的隐状态。这是一个典型的关联记忆模型(associative memory),它编码了从 Key 到 Value 的映射关系。面对一个新的 Query,输出行为就是从 \(S_t\) 中提取出与 \(Q\) 关联的记忆。关联记忆也正是 2024 年物理学诺贝尔奖对 Hopfield 获奖工作的明确定位,而整个奖项表彰的是 Hopfield 与 Hinton 为神经网络机器学习奠定基础的发现*。启发它的赫布理论以及衍生出的突触可塑性依然是今天主流神经生物学的基础理论。
波函数类比
中间状态 \(S_t\) 的引入,让我想起量子力学历史上矩阵力学与波动力学之争。深受马赫主义影响的海森堡认为物理理论应当避免引入缺乏操作定义的量,比如电子在轨道上的精确位置和经典动量等,只有轨道之间的跃迁信息是可观测量,应当用矩阵来描述。与此同时,德布罗意从光的波粒二象性出发,认为一切物质都具有波动性,需要用波函数描述这种物质波。注意:在玻恩的概率诠释之前,德布罗意认为这是真实存在的实体波(就像电磁波一样),而不仅仅是便于计算的隐喻。虽然两个理论很快被证明在数学上是等价的——从本体论层面上,波函数从物质实体降格为量子态在位置表象下的表征(representation)——但是波函数相较抽象晦涩的矩阵拥有几何直觉与可视化的优势。它在帮助我们想象(而非理解)理论的可能性上仍有价值。
在我看来,线性注意力的中间状态 \(S_t\) 并不是额外引入了新的"本体论对象",更像是一次表象切换:把原本隐含在一次性注意力计算中的线性映射(从 key 空间到 value 空间的关联)显式化为一个可递推更新的算符/状态。就像波函数作为量子态在特定基下的表示,提供了强烈的几何直觉与可视化入口;\(S_t\) 的显式化同样把"记忆是什么、如何写入/遗忘、如何分层"变成了可以被直接设计与分析的对象。
从静态到动态:"惊讶度"的物理学
在此基础上,【1】提出了一个更深刻的视角转换。它将记忆更新视作一个在线学习(online learning)的动力学过程。在 Titans 的视角里,长期记忆的"写入"可以被理解为对某个 associative memory loss 的在线优化:每来一段新数据,就在记忆参数上走一步(或少数几步)梯度更新。更关键的是,更新强度并非恒定,而是由"惊讶度(surprise)"调制:违反模型预期的数据更"值得被记住"。Titans 用梯度作为 surprise 的度量,并结合衰减机制实现记忆管理;作者还指出该机制与带动量与 weight decay 的小批量梯度法存在等价联系。
我们对反过来的表述都很熟悉。传统的机器学习就是基于这样一套逻辑:预设一个目标函数,通过已标注训练集(在线或离线),训练出一个最佳模型。此处,目标函数是具象的(预测与标注的误差),模型参数是抽象的。而在"从关联记忆到在线学习"的图景转换里,模型参数是具象的(记忆),参数的动力学过程也是具象的(记忆更新),我们要反推出一个目标函数,是相对抽象的。这个目标函数被诠释为"惊讶度":人们对习以为常的现象不会加以过多关注,意外事件更容易被记住,即触发记忆更新机制。
任何一个接受过高等物理学训练的人都会对这个逆向思维非常敏感,因为它是分析力学的核心思想。基于速度、加速度、力等矢量概念的牛顿力学表述是动力学过程,它用"动力因"描绘了一整套机械宇宙图景;但是当代力学体系所依赖的是一系列标量:哈密顿量、拉格朗日量、作用量、路径积分、配分函数。这些标量不仅完美兼容动力学表述,还可以简洁地描述系统遵循的对称性,深刻揭示对称与守恒的关系,优雅地处理边界条件和自由度。它们极其晦涩抽象,但是在数学上如此优越,足以让人们抛弃一个更直观的动力学图景。
一旦为线性注意力机制赋予了关联记忆和在线学习的诠释,我们就可以泛化以上两个步骤。【4】(MIRAS)将泛化自由度归为四类:
- 记忆框架:用神经网络取代矩阵来编码 \(K \to V\) 的关联,在容量、归纳偏置与正则化之间做权衡。
- 优化目标:设计损失函数。
- 遗忘门:记忆与遗忘一体两面,学习因子和遗忘因子本身可以是可学习的参数。
- 优化器:从损失函数到记忆更新的优化机制,Gradient Descent, SGD, Adam…
【1】用"顺行性遗忘症(anterograde amnesia)"这个比喻来描述今天的大模型:它拥有在某个时间节点前(预训练语料库截止时间)的所有记忆。在运行时,它也拥有执行该任务所需要的所有短期记忆(上下文)。但是,一旦任务结束,这些上下文被立刻抛弃,无法形成长期记忆。Titans 提出的 test-time memorization / online learning 可以突破这个藩篱,因为它不必保留所有上下文,只需要通过流经的短期记忆形成长期记忆(也就是 \(S_t\))即可。与之相对的基础模型(提供了从输入 \(X\) 到 \(QKV\) 的映射矩阵)除了提供对世界的基本理解以外,还担负着"更有效地形成长期记忆"的职责。
需要区分的是,Titans 的在线学习与Test-Time Training (TTT) layers【7】是不同的概念:TTT 特指 Sun 等人提出的方法,他们将 hidden state 直接做成一个小模型,每个 token/mini-batch 做一步自监督学习更新;而 Titans 关注的是长期记忆模块的训练,用 surprise 和衰减机制来管理记忆。
对于这一点,【2】认为【1】的做法不够精致。从更新频率的视角来看,【1】本质上提供了二元更新频率,即:
- 外层循环(更新频率趋近于 0):基础模型,部署/推理阶段保持静态,不更新。
- 内层循环(逐 token 更新):以 token 流为单位,每流经一个 token 就更新一次记忆状态。
Nested Learning 进一步把"更新频率"提升为一个新的设计轴:对系统中每个组件定义其 update rate(权重多久更新一次),据此将嵌套/并行的内部优化问题排序成 levels,并由此得到"连续谱"的记忆系统(CMS):不同模块在不同频率上压缩与固化其各自的 context flow。Titans 可以被视为只有两级频率的特例。
而人的记忆更像是一个层级结构:从原初的感官信息(sensory registration)开始,部分进入工作记忆(working memory)并被编码为具体的情境记忆(episodic memory)。随着系统整合的发生,这些记忆经历"去情境化"(decontextualization),最终提炼为稳固的语义知识(semantic knowledge),并内化为深层的认知图式(cognitive schemas)与世界观。(当然实际情况比这个单一链条复杂得多,比如我们今天的大部分抽象知识都在语义知识空间里直接生成,并没有对应的情境记忆。)这不仅仅是一个抽象的认识论模型,它更有着神经生物学的基础,在大脑中与高频(gamma)到低频(delta)的神经振荡相对应。这也是当前神经科学研究的核心范式之一。
时间的洋葱:嵌套学习与重整化群
【2】设计了这样一套层级的记忆与学习模型框架。如同深度神经网络和 stacked transformer 在空间维度实现信息的层级抽象,记忆行为在时间维度实现层级抽象。当然它同时也是空间维度的,因为无论【2】还是大脑,不同频率的记忆与学习发生在不同的区域(从后脑到前额叶)。【2】提出了一个洋葱式的内嵌结构:最内层最贴近原初信息流,更新最快;最外层对应最慢的循环(预训练得到的基础模型),部署时基本不更新;中间每一层都在某个特定的频率上对更快的内层信息进行提炼与浓缩。【1】可以被看作只有两层的洋葱。
"层级结构"再一次触动了我的物理学神经。从某种程度上来说,物理学是研究"尺度"的学科。很幸运,我们并没有生活在一个缩放对称的物理世界(不然世界会变成一幅宏大、氤氲、永恒流动的分形抽象画)。指导整个物理世界运作的,是几个有量纲的物理常数,其中最基本的就是普朗克时间和普朗克空间。另一方面,物理学探索跨越时空的永恒性与普世性,那么时空尺度就必然成为任何理论都无法绕过的属性。重整化群(Renormalization Group,RG)便是探索物理理论与尺度关系的数学工具。这个令人困惑的名字来自量子场论。"重整化"最初是用来消除量子电动力学中臭名昭著的紫外发散问题的"奇技淫巧",之后人们发现它背后蕴含某种深刻的不变性,被凝聚态物理借用后发扬光大,成为一门关于尺度的普世方法(Wilsonian RG)。"群"意味着它处理的尺度变换符合结合律。对这部分物理感兴趣的可以读我的科普书【6】。
重整化群是一套普世方法,无论对象是量子电动力学、永磁体还是沙堆,无论尺度变换是连续的还是离散的,它都用同一个框架去审视系统。在这套框架里最核心的概念是不动点(fixed point),或严格来说是非平庸的不动点(称之为"临界点")。临界点针对参数空间而言,它指系统在一套特定的参数下(比如磁铁所处的温度和外磁场强度)不随时间或空间的尺度(比如磁铁原子的自旋块,block spin)的变化而变化。临界点并不是一个"点",而是一切符合尺度不变性的参数子空间。在深度学习模型里,空间尺度坐标是神经网络的深度(从下到上);在【2】里,时间尺度坐标是洋葱的层数(从里到外)。(其实重整化流的图像对扩散模型最直观,因为它的时间和空间尺度变化是绑定的。)
我们可以尝试用"接近某种非平庸不动点/不变流形"来组织现象,检验其是否能解释泛化与遗忘的权衡。作为类比示意,我们可以想象两类平庸的不动点:红外不动点——过拟合,模型仅仅是一本字典,没有泛化能力;紫外不动点——白噪音 / 灾难性遗忘,模型记不住任何东西,至少无法可靠地记住任何东西。而在非平庸的临界点上,模型收敛到了某个固定的参数流形上。它有效地冲刷掉了原初信息中的噪音,保留了记忆和认知中最深层的结构。注意,临界点性质不足以判断模型好坏,但它提供了一套归类方法(普世类,Universality Class),将模型的微扰自由度分为相关算符(relevant operator)和无关算符(irrelevant operator)。在同一个普世类下,所有无关算符都会被积分掉(integrated out),而相关算符则不断被强化。我想提出的是,对于记忆模型而言,特别是在个性化AI的语境下,我们真正关心的不是那些强相关算符,而是在两者边界上的边缘算符(marginal operator)。强相关算符对应的是基础模型中已经习得的通用知识(语法、基础知识、普世价值观),它对记忆模型而言并不重要;真正重要的是从"我"的记忆中提取的关于"我"的偏好、思维方式和价值观,这些信息最有可能编码在边缘算符中。令人期待的是,重整化群不仅为这种概念区分提供了可量化定义的蓝图,还提供了一个系统方案来计算模型在相关算符附近的状态方程(equation of state)。它不是第一性原理层面上的动力学方程(equation of dynamics),但它提供了丰富的模型库,来探索普世类流形的动态平衡关系。
以上是我最近的阅读体验。如果你看完一头雾水,这不怪你(当然也不怪我)。首先我自己并没有想清楚许多细节,特别是如何将【4】里的自由度映射到RG框架里;其次我也没有面向普通读者写作(和我的科普书不同)。我想象中的读者是兼具物理学和AI研究背景的科学家——这在今天的AI圈不在少数。物理学对于AI的滋养远未结束,尤其在关于"尺度"的研究上——这远不是一个"Scaling Law"可以涵盖的丰富宝藏。
I recently read two Google Research works on test-time memory and continual learning [1, 2], and revisited the seminal
paper that first made linear attention explicit [3]. The process reminded me once again how a shift in perspective can
endow a theoretical framework with entirely new imaginative space. If you haven't experienced such a shift, it is easy
to underestimate its significance. Consider the history of quantum mechanics:
\[
\text{Matrix Mechanics} \leftrightarrow \text{Wave Mechanics}
\;\;\rightarrow\;\;
\text{Second Quantization (Operator QFT)} \leftrightarrow \text{Path-Integral Formulation}
\]
(This is intentionally schematic: matrix and wave mechanics are equivalent formulations of the same quantum theory;
and in quantum field theory, the operator/second-quantized and path-integral formulations are complementary languages,
not a strict chronological sequence.)
Each revolution was not merely rewriting existing knowledge with new formulas; it was a re-examination of theory
through a new lens and language. In this sense, mathematical language is not just the boundary of thought; it embeds
part of the "geometry" of thinking—what becomes salient, manipulable, and even askable.
What space does Linear Attention open up?
Traditional attention mechanisms essentially perform two tasks:
- Map the input \(X\) to Query, Key, and Value via learned projections: \(Q=XW_Q,\,K=XW_K,\,V=XW_V\).
- Compute the output (softmax applied row-wise):
\[
\mathrm{Attn}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V,
\]
ignoring masking for brevity.
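As a concrete reference point for the two steps above, here is a minimal NumPy sketch of single-head softmax attention (no masking, no batching); the function name and shapes are illustrative rather than taken from any of the cited papers.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Single-head softmax attention. Q, K: (n, d_k), V: (n, d_v); returns (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) all-pairs query-key scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v)
```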
The bottleneck is not "a global positional matrix," but the all-pairs interaction itself: global attention forms an
\(n\times n\) matrix of query–key scores (often augmented with positional encodings or relative position biases),
which incurs \(O(n^2)\) time and memory in the sequence length.
Kernelized (a.k.a. linear) attention replaces the softmax kernel with a feature map
\(\phi:\mathbb{R}^{d_k}\to\mathbb{R}^{m}\) such that the similarity factorizes:
\(\kappa(q,k)=\phi(q)^\top \phi(k)\).
(A related but distinct line of work approximates the exponential kernel underlying softmax using random/explicit
features; the algebra below covers the kernelized form and clarifies what must be normalized.)
In normalized kernel attention, the attention can be written as:
\[
\mathrm{Attn}(Q,K,V)
\;=\;
\frac{\phi(Q)\big(\phi(K)^\top V\big)}
{\phi(Q)\big(\phi(K)^\top \mathbf{1}\big)},
\]
where \(\mathbf{1}\) is an all-ones vector and the division is row-wise. In practice, one often requires
\(\phi(\cdot)\ge 0\) (or otherwise ensures positivity) so the denominator is well-behaved.
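As a minimal sketch of this normalized kernel attention, assuming the ELU+1 feature map used in [3] to keep \(\phi\) positive (function names here are mine):

```python
import numpy as np

def elu_plus_one(x):
    # phi(x) = ELU(x) + 1 > 0, a common choice of positive feature map
    return np.where(x > 0, x + 1.0, np.exp(x))

def kernel_attention(Q, K, V, phi=elu_plus_one):
    """Non-causal kernelized attention with row-wise normalization.
    Q, K: (n, d_k), V: (n, d_v); returns (n, d_v)."""
    Qf, Kf = phi(Q), phi(K)                # (n, m) feature-mapped queries and keys
    KV = Kf.T @ V                          # (m, d_v): phi(K)^T V
    Z = Kf.sum(axis=0)                     # (m,):    phi(K)^T 1
    return (Qf @ KV) / (Qf @ Z)[:, None]   # row-wise division by the normalizer
```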
In the autoregressive (causal) setting, define the recurrent accumulators
\[
S_t = \sum_{i \le t} \phi(k_i) v_i^\top, \qquad
z_t = \sum_{i \le t} \phi(k_i),
\]
then the per-token readout becomes
\[
o_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}.
\]
Here \(S_t\) has a fixed parameterization (constant memory footprint) that does not grow with the context length.
The normalization accumulator \(z_t\) is essential for numerical stability and preventing scale drift as \(t\) grows.
However, normalization does not eliminate information loss: compressing an ever-growing history into a fixed-size
state inevitably induces recency bias and retrieval degradation unless the update rule has an explicit mechanism for
selective write/forget.
Katharopoulos et al. emphasize that this recursive implementation reveals an explicit RNN structure in causal linear
attention: attention transforms from "scanning the entire KV list" to "updating a fixed-size recurrent state and reading
out" [3].
This reformulation brings a new perspective.
The intermediate state \(S_t\) plays the role of a matrix-valued hidden state: it is a fast-weight
associative memory that encodes a (compressed) mapping from key features to values. Given a new query, the readout
extracts the value associated with \(\phi(q_t)\) via the linear operator \(S_t\).
Associative memory is a central thread here: Hopfield's contribution is explicitly framed as an associative memory in
the 2024 Nobel Prize press materials, while the broader prize citation credits foundational discoveries enabling machine
learning with neural networks*. Hebbian learning and synaptic plasticity—though far richer than
any single toy model—remain foundational concepts in neuroscience.
The Wave Function Analogy
The introduction of the intermediate state \(S_t\) reminds me of the historical debate between Matrix Mechanics and
Wave Mechanics. Heisenberg, deeply influenced by Machian empiricism/positivism, insisted that physical theory should
be formulated in terms of observables—famously, spectral transition data—rather than unobservable classical trajectories.
Meanwhile, de Broglie, starting from the wave–particle duality of light, proposed that matter has a wave nature that
invites a wave-function description. Historically, before Born's probabilistic interpretation became standard, it was
natural to entertain ontological readings of this "matter wave." Although Schrödinger's and Heisenberg's formulations
were soon shown to be mathematically equivalent, the wave function \(\psi(x)=\langle x|\psi\rangle\) retains geometric
intuition as a particular representation of the abstract state.
In my view, the intermediate state \(S_t\) in linear attention is not an additional "ontological object," but a
representation switch:
it makes explicit the linear mapping (the association from key-feature space to value space) that was previously implicit
in the one-shot attention computation, turning it into a recursively updatable operator/state. Just as the wave function
provides an intuition-friendly representation of a quantum state in a chosen basis, making \(S_t\) explicit turns "what
memory is," "how to write/forget," and "how to layer time-scales" into objects that can be directly designed and analyzed.
From Static to Dynamic: The Physics of "Surprise"
Building on this, Titans [1] proposes a deeper perspective shift: treat long-term memory as a parameterized module that
can be updated at test time, and interpret "writing" as online optimization of an associative-memory objective.
In particular, Titans argues that events that violate the model's expectations are more "memorable." It operationalizes
this surprise using gradients of an associative-memory loss (used as a signal for what should be written),
and combines it with an explicit decay mechanism for memory management; the paper further draws connections to
mini-batch gradient descent with momentum and weight decay.
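As a schematic sketch (my own simplification, not the Titans implementation, which uses a neural memory and data-dependent coefficients): one surprise-modulated write step for a matrix memory \(M\) under the associative loss \(\tfrac12\|\phi(k)^\top M - v\|^2\), with momentum carrying past surprise and weight decay acting as forgetting.

```python
import numpy as np

def surprise_write(M, mom, k_feat, v, lr=0.1, eta=0.9, alpha=0.01):
    """One test-time memorization step (schematic).
    M:      (m, d_v) matrix memory mapping key features to values
    mom:    (m, d_v) momentum buffer (accumulated past surprise)
    k_feat: (m,) feature-mapped key; v: (d_v,) target value."""
    err = k_feat @ M - v                   # prediction error of the current memory
    grad = np.outer(k_feat, err)           # gradient of 0.5 * ||k^T M - v||^2 w.r.t. M
    surprise = float(np.linalg.norm(err))  # "surprise": how badly expectations failed

    mom = eta * mom - lr * grad            # momentary surprise plus lingering past surprise
    M = (1.0 - alpha) * M + mom            # weight decay acts as gradual forgetting
    return M, mom, surprise
```

In Titans the learning rate, momentum, and decay are themselves modulated by the data; the sketch only shows where surprise, momentum, and decay enter the update.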
We are used to the forward direction: specify an objective and train parameters offline on a dataset (supervised or
self-supervised) to minimize it. The online-learning lens flips what feels "given" versus "derived": we take the update
dynamics of a particular state/parameter subset (the memory) seriously as a first-class object, and make the objective
and regularization that the dynamics implements explicit. The surprise narrative is one concrete way to interpret that
objective: routine events are compressed away; unexpected events receive disproportionate write bandwidth.
Anyone with training in theoretical physics will be sensitive to this move: modern physics often prefers scalar-valued
functionals—Hamiltonians, Lagrangians, actions, partition functions—as organizing principles. They can encode symmetries
and constraints cleanly, handle boundary conditions elegantly, and support powerful approximation tools. They may feel
less "mechanistic" than Newton's vector dynamics, but they frequently make the structure of a theory more transparent.
Once we endow linear attention with the interpretations of associative memory and online optimization, we can
systematically generalize the two-step picture. MIRAS [4] categorizes key degrees of freedom into four classes (a schematic sketch follows the list):
- Memory Architecture: Replace the simple matrix-valued association (e.g., an outer-product accumulator) with a richer parameterization (e.g., a neural memory), trading off capacity, inductive bias, and regularization.
- Objective: Design the internal loss ("attentional bias") that defines what the memory should fit.
- Retention / Forget Gate: Memory and forgetting are two sides of the same coin; retention/decay can be learnable and can depend on surprise, capacity, or context.
- Optimizer: Choose the online update rule (GD, SGD, Adam-like, learned optimizers, etc.).
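One way to read these four axes is as the pluggable pieces of a single online loop; the sketch below is purely illustrative (the function names and signatures are mine, not MIRAS's API).

```python
import numpy as np

def run_memory(tokens, memory, objective, retention, optimizer, lr=0.1):
    """Generic online memory loop exposing the four degrees of freedom.
    tokens:    iterable of (key_feature, value) pairs
    memory:    dict of parameters                 (memory architecture)
    objective: (memory, k, v) -> (loss, grads)    (internal loss / attentional bias)
    retention: (memory, k, v) -> gamma in [0, 1]  (forget gate)
    optimizer: (memory, grads, lr) -> memory      (update rule)"""
    for k, v in tokens:
        loss, grads = objective(memory, k, v)
        gamma = retention(memory, k, v)
        memory = {name: gamma * p for name, p in memory.items()}  # decay old state
        memory = optimizer(memory, grads, lr)                     # write new info
    return memory

# Example instantiation: matrix memory, squared associative loss,
# constant retention, plain gradient descent.
def sq_objective(mem, k, v):
    err = k @ mem["M"] - v
    return 0.5 * float(err @ err), {"M": np.outer(k, err)}

memory = run_memory(
    tokens=[(np.ones(4) / 2, np.array([1.0, -1.0]))] * 3,
    memory={"M": np.zeros((4, 2))},
    objective=sq_objective,
    retention=lambda mem, k, v: 0.99,
    optimizer=lambda mem, grads, lr: {n: p - lr * grads[n] for n, p in mem.items()},
)
```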
To place this in a broader continual-learning narrative, Nested Learning [2] uses an analogy to
anterograde amnesia: a person can retain long-past memories and experience a transient present, but cannot
consolidate new experiences into long-term memory after onset. The authors argue that today's LLMs exhibit an analogous
pattern: after pre-training, they can adapt within a context window, yet the information in that window typically does not
get consolidated into persistent parameters.
Titans' test-time memorization breaks part of this barrier by allowing a long-term memory module to be updated online
without retaining the full context buffer; it aims to distill the passing short-term context flow into a persistent state.
In this framing, the foundation model (which implements \(X\mapsto Q,K,V\)) is responsible not only for broad world
knowledge, but also for providing representations that make such consolidation efficient.
It is important to distinguish Titans' test-time memorization / online optimization view from
Test-Time Training (TTT) layers [7]. TTT refers to Sun et al.'s approach in which the hidden state is itself a
small learner (e.g., a linear model or an MLP) updated by a self-supervised learning step per token/mini-batch; Titans,
in contrast, emphasizes a neural long-term memory module with surprise- and decay-modulated updates.
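For contrast, here is a caricature of the TTT idea (not Sun et al.'s exact layer; the "training view" projection `P` and the update rule are simplifications of mine): the hidden state is itself a tiny linear learner, trained by one self-supervised gradient step per token and then used for the readout.

```python
import numpy as np

def ttt_style_layer(X, lr=0.1, seed=0):
    """Caricature of a TTT-style layer: the hidden state is the weight matrix W
    of a small learner, updated online with a reconstruction loss.
    X: (n, d) token features; returns (n, d) outputs."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = np.zeros((d, d))                          # hidden state = learner parameters
    P = rng.standard_normal((d, d)) / np.sqrt(d)  # fixed "training view" projection (assumed)
    out = np.zeros_like(X)
    for t in range(n):
        x_view = P @ X[t]                  # self-supervised input view of the token
        err = W.T @ x_view - X[t]          # reconstruction error
        W -= lr * np.outer(x_view, err)    # one inner-loop gradient step per token
        out[t] = W.T @ x_view              # read out with the updated hidden state
    return out
```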
The Onion of Time: Nested Learning and RG
Nested Learning [2] argues that Titans-style designs are still "coarse" along one crucial axis: update frequency.
A convenient caricature of many test-time memory schemes is a two-rate system:
- Slow/Outer loop (≈0 at deployment): foundation parameters are static at inference time.
- Fast/Inner loop (per-token): a memory state/module is updated with the token stream.
Nested Learning elevates update frequency into a design axis: assign each component an update rate and order nested or
parallel learning problems into levels. This yields a memory system that is best viewed as a
spectrum of time scales (often discretized into levels, but conceptually approaching a continuum): different
modules compress and consolidate their respective context flows at different rates. In this view, Titans is a special case
with essentially two update-frequency levels.
Human memory is often described as hierarchical: raw sensory registration feeds working memory,
which can be encoded into episodic memory; through consolidation and abstraction, knowledge can become
increasingly decontextualized into semantic knowledge and deeper cognitive schemas. Reality is
messier than any single chain, but the multi-time-scale motif is robust. Nested Learning grounds this motif in neuroscience
via neural oscillations spanning faster bands (e.g., gamma) to slower bands (e.g., delta/theta) that correlate with different
roles in perception, cognition, and consolidation.
NL proposes an onion-like multi-level structure ordered by update rate (sketched after the list):
- Innermost / closest-to-stream levels: operate directly on the token/context flow and update most frequently.
- Intermediate levels: consolidate and compress information from faster levels at intermediate rates.
- Outermost levels: correspond to the slowest loop (e.g., pre-training) and are effectively static at deployment.
(Titans can be viewed as an onion with only two prominent layers.)
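A minimal sketch of the multi-rate picture (my own schematic, not NL's algorithm): each level owns a state and an update period; the fastest level ingests every token, each slower level periodically consolidates the level below it, and the slowest level is effectively frozen at deployment.

```python
import numpy as np

def nested_memory(stream, num_levels=3, chunk=8, dim=16, lr=0.5):
    """Multi-time-scale memory: level k updates every chunk**k steps by
    compressing (here: exponentially averaging) the level below it."""
    levels = [np.zeros(dim) for _ in range(num_levels)]
    periods = [chunk ** k for k in range(num_levels)]      # 1, chunk, chunk**2, ...
    for t, x in enumerate(stream, start=1):
        levels[0] = (1 - lr) * levels[0] + lr * x          # fastest: per-token write
        for k in range(1, num_levels):
            if t % periods[k] == 0:                        # slower levels fire rarely
                levels[k] = (1 - lr) * levels[k] + lr * levels[k - 1]
    return levels                                          # levels[-1] drifts slowest
```

In this caricature, Titans roughly corresponds to `num_levels=2` with the slow level frozen, and a continuum memory system corresponds to letting the number of levels grow.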
"Hierarchical structure" triggers my physics instincts again. Physics is, to some extent, the study of scale.
Our world is not scale-invariant; it contains characteristic lengths and times. Even when we lack a final theory of quantum
gravity, it is meaningful to note that Planck units (built from \(c\), \(G\), and \(\hbar\)) define natural reference scales such
as Planck length and Planck time.
Renormalization Group (RG) is a mathematical language for reasoning about how effective descriptions change
with scale. Historically misnamed, RG refers to a composable flow across scales; in many coarse-graining settings the
transformation is not strictly invertible, so it behaves more like a semigroup.
RG suggests a concrete style of question-asking: identify a coarse-graining transformation, study how effective parameters
("couplings") flow under repeated application, and characterize invariant structures (fixed points, critical manifolds,
slow/invariant manifolds). Fixed points are special parameter settings that remain unchanged under the RG transformation;
in statistical physics, critical behavior is associated with non-trivial fixed points where correlation lengths become large.
In deep learning, it is tempting—but nontrivial—to treat layer depth as a discrete representation "scale," and in NL to treat
level index (ordered by update rate) as a time-scale axis. I do not claim a literal RG equivalence here; rather, I propose
RG as an organizing language for multi-time-scale memory systems.
As an illustrative caricature (not a theorem), imagine two "trivial" endpoints of a memory system:
- IR-like endpoint (memorization): the model degenerates into a lookup table; weak generalization.
- UV-like endpoint (overwrite/noise): the system fails to retain useful information under continual updates.
(Here "catastrophic forgetting" refers to continual-learning interference across tasks; earlier we used "recency bias"
for within-sequence memory dilution.)
The interesting regime is the middle: an effective balance where the system compresses away nuisance variability while
retaining stable, reusable structure. RG provides a principled vocabulary for such regimes via universality and
the decomposition of perturbations into relevant, irrelevant, and marginal directions.
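To make that vocabulary concrete with a toy calculation (a generic fixed-point example, nothing specific to memory systems): linearize a scale transformation around a fixed point and watch how perturbations behave under repeated coarse-graining; scaling factors above, below, and at one correspond to relevant, irrelevant, and marginal directions.

```python
import numpy as np

# Toy linearized flow near a fixed point: delta -> A @ delta per coarse-graining step.
# The assumed scaling factors are illustrative: >1 grows, <1 shrinks, =1 stays put.
A = np.diag([1.3, 0.7, 1.0])

for lam in np.diag(A):
    if abs(lam - 1.0) < 1e-12:
        kind = "marginal"
    elif lam > 1.0:
        kind = "relevant"
    else:
        kind = "irrelevant"
    print(f"scaling factor {lam:.2f}: {kind} direction")

# Iterating the step: relevant components blow up, irrelevant ones die off,
# marginal ones move only through higher-order terms neglected here.
delta = np.array([0.01, 0.01, 0.01])
for _ in range(10):
    delta = A @ delta
print("after 10 steps:", np.round(delta, 4))   # approx [0.1379, 0.0003, 0.01]
```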
Conjecture (toward personalized AI): in personalization, the most valuable information is often neither "strongly
relevant" (broad world knowledge already in the foundation model) nor "irrelevant" noise. It may live near
marginal directions: aspects of a user's preferences, reasoning style, and values that should persist without
destabilizing general capabilities. Making this precise would require operational definitions of (i) a scale transformation on
the memory system (e.g., rescaling update intervals/retention and compensating parameters) and (ii) an observable that
measures persistence vs interference. RG then suggests an empirical program: estimate which perturbations grow/decay
under scale changes (effective scaling dimensions) and test whether a stable manifold exists for personalization dynamics.
The above is my recent reading experience. If you find it confusing, it's not your fault (and I hope not mine).
I have not yet fully clarified how to map all degrees of freedom in [4] into a rigorous RG framework. I imagine the target
reader as a scientist fluent in both physics and AI—a growing demographic. Physics still has much to offer AI, especially
around "scale"—a treasure trove far richer than what today's "scaling laws" alone capture.