Ontologic Scalar Modulation Theorem
C.L. Vaillant
May 25, 2025
Abstract
Mechanistic interpretability research seeks to reverse engineer the internal
computation of neural networks into human-understandable algorithms
and concepts. In this paper, we introduce an interdisciplinary theoretical
framework grounded in mechanistic interpretability and enriched
by cognitive science, symbolic AI, ontology, and philosophy of mind. We
formalize the *Ontologic Scalar Modulation Theorem*, which provides
a rigorous account of how high-level semantic concepts (an **ontology**)
can be represented, identified, and continuously modulated within the latent
space of a learned model. Our approach offers precise mathematical
definitions and structures that bridge low-level network mechanisms and
high-level human-interpretable features. We illustrate the theorem with
examples drawn from vision and language models, demonstrating how
adjusting a single scalar parameter can “turn up or down” the presence
of an abstract concept in a model’s representation. We further connect
these technical insights to long-standing philosophical questions, drawing
on Kantian categories, Peircean semiotics, and Platonic forms, to contextualize
how neural networks might be said to *discover* or instantiate
abstract knowledge. The results highlight a convergence between modern
AI interpretability and classical understandings of cognition and ontology
and suggest new avenues for building AI systems with interpretable and
philosophically grounded knowledge representations.
1 Introduction
Modern artificial intelligence systems, particularly deep neural networks, have
achieved remarkable performance in a wide range of domains. However, their
inner workings often remain opaque, prompting a growing field of *mechanistic
interpretability* aimed at uncovering the algorithms and representations emerging
within these models: Mechanistic interpretability strives to go beyond the
correlations between inputs and outputs and instead * reverse engineer* the
network computations into human-understandable components and processes
(2). This pursuit is not only of academic interest, but a practical imperative for
AI safety and alignment, since understanding the internals of a model can help
ensure it aligns with human values and behaves as intended (3)
A central challenge in interpretability is to bridge the gap between the
model’s low-level numerical operations and the high-level semantic concepts
by which humans understand the world. In cognitive science and philosophy of
mind, this gap reflects the enduring question of how abstract ideas and categories
arise from raw sensory data. Immanuel Kant, for example, argued that the human
mind imposes innate *categories of understanding* (such as causality and
unity) to organize experience (Kant, 1781). Centuries earlier, Plato’s theory of
*Forms* posited that abstract universals (like “Beauty” or “Circle”) underlie the
concrete objects we perceive (4,5). These philosophical perspectives highlight
an ontological stratification of knowledge: a hierarchy from concrete particulars
to abstract universals. Similarly, early artificial intelligence research in the
symbolic paradigm emphasized explicit, human-readable knowledge structures:
Newell and Simon’s physical symbol system hypothesis famously claimed
that symbol manipulation operations are necessary and sufficient for general intelligence
(Newell & Simon, 1976). Ontologies, formal representations of concepts
and relationships, were built by hand in projects like *Cyc*, which attempted to
encode common sense knowledge as millions of logical assertions (Lenat, 1995).
In the realm of language, comprehensive lexical ontologies such as *WordNet*
organized words into hierarchies of concepts (6, 7), reflecting human semantic
networks.
By contrast, the success of modern deep learning has arisen from subsymbolic,
distributed representations learned from data. Connectionist models encode
knowledge as patterns of activations across many neurons, rather than
discrete symbols. This led to debates in cognitive science: Could neural networks
capture the structured, systematic nature of human cognition? Critics
like Fodor and Pylyshyn (1988) argued that distributed representations lack the
*compositional* structure needed for systematic reasoning (for example, anyone
who understands “John loves Mary” should also be able to understand “Mary
loves John”) (8, 9). However, advocates of connectionism hoped that as networks
grew in depth and complexity, they could develop internal representations that
mirror symbolic structures **implicitly**, even if not explicitly hard-coded (10,
11).
Recent research suggests that deep networks learn intermediate representations
that correspond to human-interpretable concepts, lending some credence
to this hope. For example, in computer vision, convolutional neural networks
trained on image classification have been found to develop a *hierarchy of features*:
the early layers detect simple edges and textures, while the deeper
layers encode higher-level patterns such as object parts and entire objects (12,
13). This emergent hierarchy is analogous to the *levels of analysis* in human
vision described by Marr (1982), and it hints that a form of learned ontology is
present within the network. A particularly striking demonstration was provided
by Zhou et al. (2015), who observed that object detectors (e.g. neurons that
fire for ’dog’ or ’airplane’) spontaneously emerged inside a CNN trained only
to classify scenes (such as ’kitchen’ or ’beach’) (14). In other words, without
explicit supervision for objects, the network invented an internal vocabulary of
objects as a means to recognize scenes. Such findings align with the “Platonic
representation hypothesis” suggested by Isola et al., that different neural networks,
even with different architectures or tasks, tend to converge on similar
internal representations for fundamental concepts (15, 16).
Despite these insights, a rigorous framework for understanding and *controlling*
the mapping between low-level neural activity and high-level ontology
has been lacking. This paper aims to fill that gap. We present the **Ontologic
Scalar Modulation Theorem**, which formalizes how abstract concepts
can be mathematically identified within a network’s latent space and continuously
modulated by acting on a single scalar parameter. In simpler terms, we
demonstrate that for certain learned representations, one can construct a *concept
axis*—a direction in activation space corresponding to a human-meaningful
concept—such that moving a point along this axis strengthens or diminishes the
presence of that concept in the network’s behavior. This provides a principled
way to traverse the model’s *ontology* of concepts.
We proceed as follows. In Section 2, we review related work from mechanistic
interpretability and cognitive science that lays the foundation for our approach.
Section 3 introduces necessary definitions (tying together notions from ontology
and network representation) and formally states the Ontologic Scalar Modulation
Theorem with a proof sketch. Section 4 provides empirical examples of
the theorem in action: we discuss how concept vectors have been used to manipulate
image generation and analyze neurons in vision and language models,
drawing parallels to neurophysiological findings like “Jennifer Aniston neurons”
in the human brain (17). In Section 5, we explore the broader implications of
our work, including connections to philosophical theories of mind and prospects
for integrating symbolic structure into deep learning. We conclude in Section
6 with a summary and suggestions for future research, including how a better
understanding of learned ontologies could inform the design of AI systems that
are not only powerful, but also transparent and aligned with human values.
2 Background and Related Work
2.1 Mechanistic Interpretability of Neural Networks
Our work is situated within the field of *mechanistic interpretability*, which
seeks to uncover the internal mechanisms of neural networks in a *causal* and
fine-grained way (18). Unlike post-hoc explanation methods (e.g. saliency maps
or feature attributions) that highlight important features without detailing the
underlying computation, mechanistic interpretability endeavors to identify the
actual *subcircuits*, neurons, and weights that implement specific functions
within the model (19, 20). In this sense, it parallels the approach of cognitive
neuroscience: much as neuroscientists attempt to map cognitive functions
to circuits of biological neurons, interpretability researchers map algorithmic
functions to artificial neurons or groups thereof.
Significant progress has been made in reverse-engineering small components
of networks. For example, **induction heads** in transformer models (a type
of attention head) have been identified that implement a *prefix-matching-and-copying*
algorithm enabling in-context learning of repeated token sequences (21). In
another case, *multi-modal neurons* were discovered in vision-language models
(like CLIP) that respond to a high-level concept regardless of whether it is
presented as an image or a word (22). A famous instance is a neuron that fires
for the concept “Spider-Man”, responding both to pictures of the Spider-Man
character and to the text ”Spider-Man” (23). This echoes the concept of a
”Jennifer Aniston neuron” in the human brain – a single neuron that responds
to pictures of the actress Jennifer Aniston and even her written name (24),
suggesting that neural networks can, in some cases, learn similarly abstract and
multi-modal representations of concepts.
A variety of techniques have been developed to study such internal representations.
**Network dissection** is a seminal approach introduced by Bau et al.
(2017), which quantifies interpretability by evaluating how individual hidden
units align with human-labeled concepts drawn from a broad set of visual categories (25). For a given convolutional
network, each neuron’s activation map can be compared to segmentation
masks for concepts like “cat”, “chair”, or “stripes” to see if that neuron acts
as a detector for that concept (26). Network dissection studies revealed that
many units in vision models have high alignment with intuitive visual concepts
(objects, parts, textures, etc.), providing a rough *ontology* of the network’s
learned features. However, not all concepts correspond to single neurons; some
are distributed across multiple units or dimensions.
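To make the alignment measure concrete, the sketch below computes a network-dissection-style intersection-over-union score between one unit’s activation map and one concept mask for a single image. It is a simplified, synthetic illustration (the published method thresholds each unit over an entire dataset); the data and the function name `dissection_iou` are our own stand-ins, not the original implementation.

```python
import numpy as np

def dissection_iou(activation_map, concept_mask, quantile=0.9):
    """Network-dissection-style alignment score between one unit and one concept.

    activation_map: 2D array of a single unit's activations over an image
                    (upsampled to the mask resolution).
    concept_mask:   binary 2D array marking where the concept (e.g. "cat") appears.
    Note: the original method thresholds each unit over a whole dataset at a much
    stricter quantile; this single-image version only illustrates the idea.
    """
    # Threshold the unit at a high activation quantile to get its "on" region.
    threshold = np.quantile(activation_map, quantile)
    unit_region = activation_map >= threshold
    # Intersection-over-union between the unit's region and the concept mask.
    intersection = np.logical_and(unit_region, concept_mask).sum()
    union = np.logical_or(unit_region, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: a unit that activates roughly where the concept is present.
rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                      # pretend "cat" region
act = rng.normal(0, 1, (64, 64))
act[20:40, 20:40] += 3.0                       # unit fires strongly on that region
print(f"IoU alignment: {dissection_iou(act, mask):.2f}")
```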
Another line of work probes the geometry of representations. It has been
observed that in some models, conceptual relationships are reflected as linear
directions in latent space. Word embedding models famously exhibit linear
analogies (e.g., v(King)−v(Man)+v(Woman) ≈ v(Queen)), suggesting that
certain latent directions correspond to abstract relations (27, 28). In vision,
**feature visualization** (Olah et al. 2017) uses optimization to find an input
image that maximally activates a neuron or a combination of neurons, often
revealing the concept the neuron has learned to detect (e.g., a neuron might
consistently produce images of spiral patterns, indicating it detects spirals).
These methods provide qualitative insight into network ontology by directly
showcasing learned features.
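The linear-analogy phenomenon can be reproduced in miniature. The sketch below uses tiny hand-made vectors (real systems use embeddings learned from large corpora) purely to show the arithmetic v(King) − v(Man) + v(Woman) landing nearest v(Queen).

```python
import numpy as np

# Toy, hand-made embeddings (illustrative only); in the three dimensions below,
# only the royalty/gender offsets matter for the analogy.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# v(King) - v(Man) + v(Woman) should land nearest to v(Queen).
target = emb["king"] - emb["man"] + emb["woman"]
ranked = sorted(emb, key=lambda w: cosine(target, emb[w]), reverse=True)
print(ranked)  # 'queen' ranks at the top for these toy vectors
```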
Crucially for our work, recent advances allow not only *identifying* concepts
inside networks but also *intervening* on them. **Activation patching** and
causal intervention techniques replace or modify internal activations to test
their influence on outputs (29, 30). For example, one can swap a segment of
activations between two inputs (one with a concept and one without) to see if
the output swaps accordingly (31), thereby pinpointing where in the network
a concept is represented. If a specific layer’s activation carries the concept,
patching it into a different input can implant that concept’s effect (32). Along
similar lines, **model editing** methods like ROME (Rank-One Model Editing)
directly modify network weights to insert a desired knowledge (e.g., “Paris is
the capital of Italy” could be flipped to “Paris is the capital of France” by a
targeted weight change) (33). These interventions highlight that representations
of knowledge in networks can be located and manipulated in a targeted way.
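A minimal activation-patching sketch is shown below, using a toy PyTorch MLP rather than any published model: a hidden activation recorded from a “source” input is written into the same layer during a run on a “target” input, and the change in output is observed.

```python
import torch
import torch.nn as nn

# Toy illustration of activation patching (assumed model and inputs, not a
# reproduction of any specific study).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
layer = model[1]               # patch at the ReLU output

x_source = torch.randn(1, 8)   # stand-in for an input where the concept is present
x_target = torch.randn(1, 8)   # stand-in for an input where it is absent

# 1) Cache the source activation at the chosen layer.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
model(x_source)
handle.remove()

# 2) Re-run the target input, but replace the layer's activation with the cached one.
patch = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched_out = model(x_target)
patch.remove()

print("clean target output  :", model(x_target))
print("patched target output:", patched_out)
```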
Our theorem builds on these insights by providing a general theoretical account
of concept representation and modulation. In particular, it complements
work on **disentangled representations** in unsupervised learning. A disentangled
representation aims to have individual latent dimensions correspond
to distinct factors of variation in the data (for instance, in a face generator,
one latent might control hair color, another controls lighting, etc.). Beta-VAE
(Higgins et al. 2017) and related approaches encouraged disentanglement via
regularization, and metrics were proposed to quantify disentanglement. However,
Locatello et al. (2019) proved that without inductive biases or supervision,
disentanglement cannot be uniquely achieved (34, 35). In practice, perfect disentanglement
is hard, but even standard models often learn *approximately*
disentangled directions. For instance, in generative adversarial networks, unsupervised
techniques like PCA on latent activations (GANSpace, Härkönen et al.
2020) or supervised approaches like **InterfaceGAN** (Shen et al. 2020) found
specific vectors in the latent space that correspond to human-meaningful transformations
(e.g. adding a smile on a face, changing the background scenery).
Importantly, moving the latent code in the direction of these vectors causes a
smooth change in the output image along that semantic dimension.
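The following sketch illustrates the GANSpace-style procedure in simplified form: sample many latent codes, collect an intermediate feature for each, and take principal components as candidate edit directions. The “generator” here is a random stand-in, so the directions carry no real semantics; only the mechanics are shown.

```python
import numpy as np

# GANSpace-flavoured sketch (heavily simplified): PCA over intermediate features
# of many sampled latents yields candidate edit directions.
rng = np.random.default_rng(0)
latent_dim, feat_dim, n_samples = 32, 128, 2000

W_gen = rng.normal(size=(latent_dim, feat_dim))     # stand-in for early generator layers
z = rng.normal(size=(n_samples, latent_dim))        # sampled latent codes
feats = np.tanh(z @ W_gen)                          # intermediate features

# PCA via SVD on centered features.
centered = feats - feats.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
principal_directions = vt[:10]                      # top-10 candidate edit directions

# Editing then amounts to moving a feature along one direction: feat + alpha * v.
alpha = 3.0
edited = feats[0] + alpha * principal_directions[0]
print(principal_directions.shape, edited.shape)
```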
This ability to *modulate* a concept by moving along a latent direction is
a key empirical phenomenon that our Ontologic Scalar Modulation Theorem
formalizes. It ties into the notion of *concept activation vectors* described by
Kim et al. (2018). In their Testing with Concept Activation Vectors (TCAV)
framework, the authors obtained a vector in hidden space that points towards
higher activation of a chosen concept (learned from examples of that concept)
(36, 37). They then measured the sensitivity of the model’s predictions to perturbations
along that concept vector (38). TCAV thus provides a quantitative
tool to ask, for example: *is the concept of “stripes” important to this classifier’s
prediction of “zebra”?* — by checking if moving in the “stripe” direction
in feature space changes the zebra score (39). Our work generalizes the idea of
concept vectors and situates it in a broader theoretical context, linking it explicitly
with ontology (the set of concepts a model has and their relations) and
providing conditions under which a single scalar parameter can control concept
intensity.
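A TCAV-flavoured sketch on synthetic activations is given below (it is not the official TCAV implementation): a linear probe supplies the concept activation vector, and the sensitivity of a downstream linear “zebra score” to movement along that vector is measured as a directional derivative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic layer activations: "concept present" examples are shifted along a
# hidden concept direction that the probe must recover.
rng = np.random.default_rng(0)
n, d = 500, 64
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

acts_neg = rng.normal(size=(n, d))
acts_pos = rng.normal(size=(n, d)) + 2.0 * concept_dir
X = np.vstack([acts_neg, acts_pos])
y = np.array([0] * n + [1] * n)

# Concept activation vector = normalized weight vector of a linear probe.
probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Downstream "zebra score" here is just a linear head u . f(x); its directional
# derivative along the CAV is u . cav (real TCAV backpropagates through the model).
u = rng.normal(size=d) + 1.5 * concept_dir
sensitivity = float(u @ cav)
print(f"cosine(cav, true concept dir)        = {cav @ concept_dir:.2f}")
print(f"directional sensitivity of the score = {sensitivity:.2f}")
```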
In summary, prior research provides many pieces: evidence that networks
learn human-recognizable features, methods to find and manipulate those features,
and even hints from neuroscience that single units or sparse sets of units
can embody high-level concepts (40). What has been missing is an overarching
theoretical lens to integrate these pieces. By uniting insights from these works,
the Ontologic Scalar Modulation Theorem offers a unifying principle and a stepping
stone toward a more *systematic* mapping between neural representations
and symbolic knowledge.
2.2 Cognitive and Philosophical Perspectives
Our interdisciplinary approach draws from cognitive science and philosophy to
interpret the significance of the Ontologic Scalar Modulation Theorem. In cognitive
science, a classic framework due to David Marr (1982) delineates multiple
levels of analysis for information-processing systems: the *computational* level
(what problem is being solved and why), the *algorithmic/representational*
level (how the information is represented and what processes operate on it),
and the *implementational* level (how those representations and processes are
physically realized) (43). Mechanistic interpretability operates mainly at Marr’s
implementational and algorithmic levels for AI systems, revealing the representations
and transformations inside a network. However, to connect these to high-level
semantic content (Marr’s computational level in human-understandable
terms), one needs a notion of the network’s internal *concepts*. Our theorem
can be seen as a bridge between the implementational level (activations, weights)
and the algorithmic level (the network’s internal “language” of concepts), allowing
us to reason about abstract computational roles of components.
From the perspective of the *symbolic vs. connectionist* debate in cognitive
science (43, 44), our work contributes to understanding how symbolic-like structures
might emerge from neural systems. Fodor’s critique (45), which asserted
that connectionist networks cannot naturally exhibit systematic, compositional
structure, is partially addressed by findings that networks do learn to encode
variables and relations in a distributed way. For instance, recent mechanistic
analyses show that transformers can bind variables to roles using superposition
in high-dimensional vectors (smearing multiple symbols in one vector in
a “fuzzy” manner) (46). Elhage et al. (2021) demonstrated that even in randomly
initialized transformer models, one can define *traitor* and *duplicate
token* circuits that perform a kind of variable binding and copying (47, 48).
Such results suggest connectionist models can implement discrete-like operations
internally. The Ontologic Scalar Modulation Theorem further supports
this by implying the existence of controllable dimensions corresponding to discrete
changes in a concept’s presence, effectively giving a handle on something
akin to a symbolic variable within the vector geometry of a network.
Philosophically, our approach resonates with *Peircean semiotics* and pragmatism.
Charles S. Peirce, in his theory of signs, proposed that a sign (representation)
stands for an object (referent) to an interpretant (the meaning
understood) through a triadic relation. One can draw an analogy: an internal
activation pattern in a network could be seen as a **sign** that corresponds
to some **object** or concept in the input (e.g., a pattern representing “cat”),
and the **interpretant** is the effect that representation has on the network’s
subsequent computation or output. In Peirce’s terms, signs can be *iconic*
(resembling the object), *indexical* (causally or correlationally linked to the
object), or *symbolic* (related by convention or interpretation). Neural representations
often begin as indexical or iconic (e.g., an edge detector neuron has an
iconic relation to visual edges) but can become increasingly symbolic (abstract,
not resembling the input) in deeper layers. Our theorem giving a formal way to
manipulate a high-level concept representation vC can be viewed as identifying
a *symbol* in the network’s language and showing how it can be systematically
varied. This aligns with Peirce’s idea that higher cognition uses symbols that
can be combined and modulated, albeit here the symbols are vectors in Rn.
The influence of Immanuel Kant is also noteworthy. Kant held that the mind
has innate structures (categories) that organize our experience of the world. One
might ask: do neural networks develop their own *categories* for making sense
of their inputs? The ontology of a trained network – the set of features or latent
variables it uses – can be thought of as analogous to Kantian categories,
albeit learned rather than innate. For example, a vision network might implicitly
adopt category-like distinctions (edges vs. textures vs. objects, animate vs.
inanimate, etc.) because these are useful for its tasks. Our work enables probing
those internal categories by finding directions that correspond to conceptual
distinctions. In effect, the theorem provides a method to *decompose* a network’s
representation space in terms of its phenomenological categories. This
also connects to modern discussions of *feature ontologies* in interpretability:
identifying what the primitive concepts of a network are (perhaps very different
from human concepts, or surprisingly similar).
Finally, our treatment of *ontology* itself is informed by both AI and philosophy.
In AI, an ontology is a formal specification of a set of entities, categories,
and relations – essentially an explicit knowledge graph of concepts. In our context,
the network’s ontology is implicit, embedded in weights and activations.
By extracting interpretable directions and features, we begin to make the network’s
ontology explicit. This evokes historical efforts like *ontology learning*
in knowledge engineering, but here it happens post hoc from a trained model.
Philosophically, ontology concerns what exists – the categories of being. One
might provocatively ask: does a neural network *discover* ontological structure
about its domain? For instance, a vision model that learns separate internal
representations for “cat” and “dog” is carving the world at its joints (at least
as reflected in its training data). There is evidence that large language models
learn internal clusters corresponding to semantic concepts like parts of speech or
world knowledge categories (e.g., a certain vector subspace might correspond to
“locations”) (49, 50). In examining such phenomena, we follow a lineage from
Plato’s belief in abstract Forms to modern machine learning: the concepts might
not be *transcendent* Forms, but the convergent learning of similar representations
across different models (51) hints that there is an objective structure
in data that neural networks are capturing – a structure that might be viewed
as *latent ontology*. The Ontologic Scalar Modulation Theorem gives a concrete
handle on that latent ontology by linking it to measurable, manipulable
quantities in the model.
3 Ontologic Scalar Modulation Theorem
In this section, we formalize the core theoretical contribution of this work. Our
aim is to define what it means for a concept to be present in a network’s representation
and to show that under certain conditions, the degree of presence of
that concept can be modulated by adjusting a single scalar parameter along a
specific direction in latent space. Intuitively, the theorem will demonstrate that
if a concept is well-represented in a network (in a sense made precise below),
then there exists a vector in the network’s activation space whose scalar projection
correlates directly with the concept. By moving the activation state of the
network along this vector (i.e., adding or subtracting multiples of it), one can
increase or decrease the evidence of the concept in the network’s computations
or outputs in a controlled, continuous manner.
3.1 Definitions and Preliminaries
We begin by establishing definitions that merge terminologies from ontology
and neural network theory:
Neural Representation Space: Consider a neural network with an internal
layer (or set of units) of interest. Without loss of generality, we focus on a single
layer’s activations as the representation. Let Z = Rn denote the n-dimensional
activation space of this layer. For an input x from the input domain X (e.g.,
images, text), let f(x) ∈ Z be the activation vector produced at that layer. We
call f(x) the *representation* of x. (The analysis can be extended to considering
the joint activations of multiple layers or the entire network, but a single layer
is sufficient for our theoretical development.)
Ontology and Concept: We define an *ontology* Ω in the context of the
model as the set of concepts that the model can represent or distinguish at the
chosen layer. A *concept* C ∈ Ω is an abstract feature or property that might
be present in an input (for example, a high-level attribute like “cat”, “striped”,
or “an outdoor scene”). We assume each concept C has an associated *concept
indicator function* on inputs, denoted 1C(x), which is 1 if concept C is present
in input x (according to some defined criterion) and 0 if not. For instance, if C is the concept
“contains a cat”, then 1C(x) = 1 if image x contains a cat. In practice, 1C(x)
might be defined via human labeling or some ground-truth function outside the
model. We also define a real-valued *concept measure* μC(x) that quantifies
the degree or strength of concept C in input x. If C is binary (present/absent),
μC(x) could simply equal 1C(x); if C is continuous or graded (like “smiling” as
a concept that can be more or less intense), μC(x) might take a range of values.
Linear Concept Subspace: We say that concept C is *linearly represented*
at layer Z if there exists a vector wC ∈ Rn (not the zero vector) such that the
*concept score* defined by sC(x) = wC · f(x) is correlated with the concept’s
presence. More formally, we require that sC(x) is a reliable predictor of μC(x).
This could be evaluated, for example, by a high coefficient of determination
(R2) if μC(x) is real-valued, or high classification accuracy if μC(x) is binary.
The direction wC (up to scaling) can be thought of as a normal to a separating
hyperplane for the concept in representation space, as often obtained by training
a linear probe classifier (52). If such a wC exists, we define the *concept subspace*
for C as the one-dimensional subspace spanned by wC. Geometrically,
points in Z differing only by movement along wC have the same projection onto
all directions orthogonal to wC, and differ only in their coordinate along the
concept axis wC.
Concept Activation Vector: For convenience, we normalize and define a
unit vector in the direction of wC: let vC = wC/∥wC∥. We call vC a *concept
activation vector* (borrowing the terminology of TCAV (53)). This vector points
in the direction of increased evidence for concept C in the representation space.
Thus, the dot product vC · f(x) (which equals (1/∥wC∥) sC(x)) gives a signed scalar
representing how much C is present in representation f(x), according to the
linear model.
Modulation Operator: For any α ∈ R, we define a *modulated representation*
fα(x) as:
fα(x) = f(x) + α vC.
In other words, we take the original activation vector f(x) and add a multiple
of the concept vector vC. The parameter α is a scalar that controls the degree
of modulation. Positive α moves the representation in the direction that should
increase concept C’s presence; negative α moves it in the opposite direction.
It is important to note that fα(x) may not correspond to a valid activation
that the unmodified network would naturally produce for some input –
we are intervening in activation space off the standard manifold of f(x) values.
Nonetheless, one can conceptually imagine fα(x) as the activation if the network
were exposed to a version of x where concept C is artificially strengthened
or weakened. In practice, one could implement such modulation by injecting
an appropriate bias in the layer or by actually modifying x through an input
transformation that targets the concept (if such a transformation is known).
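The definitions above translate directly into code. The sketch below (toy dimensions, with random vectors standing in for wC and f(x)) computes the concept score, the unit concept vector, and the modulated representation fα(x) = f(x) + α vC.

```python
import numpy as np

# Direct transcription of the definitions above, on toy data.
rng = np.random.default_rng(0)
n = 16
w_C = rng.normal(size=n)            # assumed linear-concept weight vector
v_C = w_C / np.linalg.norm(w_C)     # concept activation vector

def concept_score(f_x, w):
    """s_C(x) = w_C . f(x)."""
    return float(w @ f_x)

def modulate(f_x, alpha, v):
    """f_alpha(x): shift the representation along the concept axis by alpha."""
    return f_x + alpha * v

f_x = rng.normal(size=n)            # stand-in for the layer activation f(x)
for alpha in (-2.0, 0.0, 2.0):
    s = concept_score(modulate(f_x, alpha, v_C), w_C)
    print(f"alpha = {alpha:+.1f} -> concept score = {s:+.3f}")
```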
With these definitions in place, we can now state the theorem.
3.2 Theorem Statement
Theorem 1 (Ontologic Scalar Modulation Theorem) Assume a concept
C is linearly represented at layer Z of a neural network by vector wC, as defined
above. Then there exists a one-dimensional subspace (the span of wC) in
the activation space Z such that movement along this subspace monotonically
modulates the evidence of concept C in the network’s output or internal computations.
In particular, for inputs x where the concept is initially absent or
present to a lesser degree, there is a threshold α∗ > 0 for which the network’s
output ŷ on the modulated representation fα(x) will indicate the presence of C
for all α ≥ α∗, under the assumption that other features remain fixed.
More formally, let g be an indicator of the network’s output or classification
for concept C (for example, g(f(x)) = 1 if the network’s output classifies x as
having concept C, or if an internal neuron specific to C fires above a threshold).
Then under a local linearity assumption, there exists α∗ such that for all α ≥ α∗,
g(fα(x)) = 1,
and for α ≤ −α∗ (sufficiently large negative modulation),
g(fα(x)) = 0,
provided μC(x) was originally below the decision boundary for g.
In addition, the degree of concept presence measured by sC(x) = wC · f(x)
changes linearly with α:
wC · fα₂(x) − wC · fα₁(x) = (α₂ − α₁)∥wC∥,
implying that the internal activation score for concept C changes in direct proportion
to the modulation parameter.
In essence, Theorem 1 states that if a concept can be captured by a linear
direction in a network’s latent space (a condition that empirical evidence
suggests holds for many concepts(54, 55)), then we can treat that direction as
an interpretable axis along which the concept’s strength varies. Increasing the
coordinate along that axis increases the network’s belief in or expression of the
concept, while decreasing it has the opposite effect. This allows for a continuous
*scalar* control of an otherwise discrete notion (the presence or absence of a
concept), hence the term “scalar modulation.”
3.3 Proof Sketch and Discussion
Proof Outline: Under the assumptions of the theorem, wC was obtained such
that wC · f(x) correlates with μC(x). In many cases wC might be explicitly
derived as the weight vector of a linear classifier hC(f(x)) = σ(wC · f(x) + b)
trained to predict 1C(x), with σ some link function (e.g., sigmoid for binary
classification). If the concept is perfectly linearly separable at layer Z, then
there is a hyperplane {z : wC · z +b = 0} such that wC · f(x)+b > 0 if and only
if 1C(x) = 1. For simplicity assume zero bias (b = 0) which can be achieved by
absorbing b into wC with one extra dimension.
Now consider an input x for which 1C(x) = 0, i.e. concept C is absent. This
means wC · f(x) < 0 (if x is on the negative side of the hyperplane). If we
construct fα(x) = f(x) + αvC, then:
wC · fα(x) = wC · f(x) + αwC · vC = wC · f(x) + α ∥wC∥.
Because vC is the unit vector in direction wC, wC · vC = ∥wC∥. Thus as α
increases, wC · fα(x) increases linearly. There will be a particular value
α∗ = −(wC · f(x))/∥wC∥
at which wC · fα∗(x) = 0, i.e. the modulated representation lies exactly
on the decision boundary of the linear concept classifier. For any α > α∗,
wC · fα(x) > 0, and thus hC(fα(x)) will predict the concept as present (for a
sufficiently large margin above the boundary, making the probability σ(·) close
to 1 if using a sigmoid). This establishes the existence of a threshold beyond
which the network’s classification of x would be flipped with respect to concept
C.
The monotonicity is evident from the linear relation: if α < α′, then wC ·
fα(x) < wC · fα′ (x). Therefore, if α is below the threshold and α′ is above
it, there is a monotonic increase in the concept score crossing the boundary,
implying a change from absence to presence of the concept in the network’s
output. Conversely, for negative modulation, as α becomes very negative, wC ·
fα(x) will be strongly negative, ensuring the network firmly classifies the concept
as absent.
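The threshold derived above can be checked numerically. In the sketch below, wC and f(x) are random stand-ins, the bias is zero, and the concept is initially absent; the concept score crosses zero exactly at α∗ = −(wC · f(x))/∥wC∥.

```python
import numpy as np

# Numeric check of the alpha* threshold from the proof sketch (toy vectors).
rng = np.random.default_rng(1)
n = 16
w_C = rng.normal(size=n)
v_C = w_C / np.linalg.norm(w_C)

f_x = rng.normal(size=n)
if w_C @ f_x > 0:                   # force the "concept absent" starting condition
    f_x = -f_x

alpha_star = -(w_C @ f_x) / np.linalg.norm(w_C)
print(f"alpha* = {alpha_star:.3f}")

for alpha in (0.0, 0.5 * alpha_star, alpha_star + 1e-6, 2 * alpha_star):
    score = w_C @ (f_x + alpha * v_C)
    print(f"alpha = {alpha:7.3f} -> w_C . f_alpha(x) = {score:+.3f} "
          f"-> concept predicted: {score > 0}")
```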
One caveat is that this argument assumes the rest of the network’s processing
remains appropriately “ceteris paribus” when we intervene on the representation.
In reality, extremely large perturbations could move fα(x) off the manifold
of typical activations, leading the downstream computation to break the
linear approximation. However, for sufficiently small perturbations up to the
decision boundary, if we assume local linearity (which is often the case in high-dimensional
spaces over short distances, especially if the next layer is linear or
approximately linear in the region of interest), the network’s downstream layers
will interpret fα(x) in a way consistent with its movement toward a prototypical
positive-C representation.
Another consideration is that concept C might not be perfectly represented
by a single direction due to entanglement with other concepts (56). In practice,
wC may capture a mixture of factors. However, if wC is the result of an optimal
linear probe, it will be the direction of steepest ascent for concept log-odds at
that layer. Thus moving along wC yields the greatest increase in the network’s
internal evidence for C per unit of change, compared to any other direction. If
multiple concepts are entangled, one might apply simultaneous modulation on
multiple relevant directions or choose a different layer where C is more disentangled.
The theorem can be generalized to a multi-dimensional subspace if needed
(modulating multiple scalars corresponding to basis vectors), but we focus on
the one-dimensional case for clarity.
Relationship to Prior Work: The Ontologic Scalar Modulation Theorem
is a theoretical generalization of several empirical observations made in prior
interpretability research. For instance, in generative image models, researchers
identified directions in latent space that correspond to semantic changes like
“increase smile” or “turn on lights” (57). Our theorem provides a foundation for
why such directions exist, assuming the generator’s intermediate feature space
linearly encodes those factors. Kim et al.’s TCAV method (58) empirically finds
vC by training a probe; Theorem 1 assures that if the concept is learnable by
a linear probe with sufficient accuracy, then moving along that probe’s weight
vector will indeed modulate the concept.
It is important to note that the theorem itself does not guarantee that every
high-level concept in Ω is linearly represented in Z. Some concepts might be
highly nonlinear or distributed in the representation. However, the surprising
effectiveness of linear probes in many networks (a phenomenon noted by Alain
and Bengio (2016) (59), and others) suggests that deep networks often organize
information in a linearly separable way at some layer – at least for
many semantically salient features. This might be related to the progressive
linear separation property of deep layers, or to networks reusing features in
a linear fashion for multiple tasks (as seen in multitask and transfer learning
scenarios).
4 Empirical Examples and Applications
We now turn to concrete examples to illustrate the Ontologic Scalar Modulation
Theorem in action. These examples span computer vision and natural language,
and even draw parallels to neuroscience, underscoring the broad relevance of our
framework.
4.1 Controlling Visual Concepts in Generative Networks
One vivid demonstration of concept modulation comes from generative adversarial
networks (GANs). In a landmark study, **GAN Dissection**, Bau et
al. (2019) analyzed the internal neurons of a GAN trained to generate scenes
(60). They found that certain neurons correspond to specific visual concepts:
for example, one neuron might correspond to “tree” such that activating this
neuron causes a tree to appear in the generated image. By intervening on that
neuron’s activation (setting it to a high value), the researchers could *insert*
the concept (a tree) into the scene (61). Conversely, suppressing the neuron
could remove trees from the scene. This is an example of scalar modulation at
the single-unit level.
Going beyond single units, **latent space factorization** approaches like InterfaceGAN
(Shen et al., 2020) explicitly sought linear directions in the GAN’s
latent Z that correlate with concepts like “smiling”, “age”, or “glasses” in generated
face images. Using a set of images annotated for a concept (say, smiling
vs. not smiling), a linear SVM was trained in Z to separate the two sets,
yielding a normal vector wsmile. This wsmile is exactly in line with our wC for
concept C = “smile”. The striking result is that taking any random face latent
z and moving it in the wsmile direction produces a smooth transformation from
a non-smiling face to a smiling face in the output image, all else held constant.
A conceptual figure (not shown here) would depict a face gradually increasing
its smile as α (the step along vsmile) increases. This provides intuitive visual
confirmation of the theorem: there is a clear axis in latent space for the concept
of “smile”, and adjusting the scalar coordinate along that axis modulates the
smile in the image.
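The procedure can be sketched end to end on synthetic latents (no real GAN or face annotations are involved; the hidden “smile” attribute is planted by construction): fit a linear SVM, take its normal vector as wsmile, and walk a latent code along the normalized direction.

```python
import numpy as np
from sklearn.svm import LinearSVC

# InterfaceGAN-style sketch with synthetic latents (illustrative stand-in only).
rng = np.random.default_rng(0)
d, n = 64, 1000
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)

z = rng.normal(size=(n, d))                       # sampled latent codes
smile_intensity = z @ true_axis                   # hidden attribute in this toy world
labels = (smile_intensity > 0).astype(int)        # pretend human annotations

svm = LinearSVC(C=1.0, max_iter=10000).fit(z, labels)
v_smile = svm.coef_[0] / np.linalg.norm(svm.coef_[0])

# In a real pipeline each edited code would be decoded by the generator;
# here we just report how the toy attribute responds to the edit.
z0 = rng.normal(size=d)
for alpha in (-3, 0, 3):
    edited = z0 + alpha * v_smile
    print(f"alpha = {alpha:+d} -> toy smile intensity = {edited @ true_axis:+.2f}")
```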
The existence of these axes has been found for numerous concepts in GANs
and other generative models (62). Some are simple (color changes, lighting
direction), others are high-level (adding objects like trees, changing a building’s
architectural style). Not every concept is perfectly captured by one axis –
sometimes moving along one direction can cause entangled changes (e.g., adding
glasses might also change other facial features slightly, if those were correlated in
the training data). Nonetheless, the fact that many such directions exist at all
attests to a form of linear separability of semantic attributes in deep generative
representations, supporting a key premise of the Ontologic Scalar Modulation
Theorem.
It is also instructive to consider failure cases: when modulation along a
single direction does not cleanly correspond to a concept. This usually indicates
that the concept was *not* purely linear in the chosen representation. For
example, in GANs, “pose” and “identity” of a generated human face might
be entangled; trying to change pose might inadvertently change the identity.
Techniques to mitigate this include moving to a different layer’s representation
or applying orthogonal constraints to find disentangled directions. From the
theorem’s perspective, one could say that the ontology at that layer did not have
“pose” and “identity” as orthogonal axes, but perhaps some rotated basis might
reveal a better aligned concept axis. Indeed, methods like PCA (GANSpace)
implicitly rotate the basis to find major variation directions, which often align
with salient concepts.
4.2 Concept Patching and Circuit Interpretability
Mechanistic interpretability research on feedforward networks and transformers
has embraced interventions that align with our theorem’s implications. For
instance, consider a transformer language model that has an internal representation
of a specific factual concept, such as the knowledge of who the president
of a country is. Suppose concept C = “the identity of the president of France”.
This concept might be represented implicitly across several weights and activations.
Recent work by Meng et al. (2022) on model editing (ROME) was able
to identify a specific MLP layer in GPT-type models where a factual association
like (“France” → “Emmanuel Macron”) is stored as a key–value mapping, and
by perturbing a single weight vector (essentially adding a scaled vector in that
weight space), they could change the model’s output on related queries (63).
While this is a weight space intervention rather than an activation space intervention,
the underlying idea is similar: there is a direction in parameter space
that corresponds to the concept of “who is President of France”, and adjusting
the scalar along that direction switches the concept (to e.g. “Marine Le Pen” if
one hypothetically wanted to edit the knowledge incorrectly).
At the activation level, one can apply *concept patching*. Suppose we have
two sentences: x1 = “The **red apple** is on the table.” and x2 = “The **green
apple** is on the table.” If we consider C = the concept of “red” color, we can
take the representation from x1 at a certain layer and transplant it into x2’s
representation at the same layer, specifically for the position corresponding to
the color attribute. This is a form of setting α such that we replace “green” with
“red” in latent space. Indeed, empirical techniques show that if you swap the
appropriate neuron activations (the ones encoding the color in that context),
the model’s output (e.g. an image generated or a completion) will switch the
color from green to red, leaving other words intact (64). This is essentially
moving along a concept axis in a localized subset of the network (those neurons
responsible for color).
These targeted interventions often leverage knowledge of the network’s *circuits*:
small networks of neurons that together implement some sub-function.
When a concept is represented not by a single direction but by a combination
of activations, one might modulate multiple scalars jointly. Nonetheless, each
scalar corresponds to one basis vector of variation, which could be seen as multiple
one-dimensional modulations done in concert. For example, a circuit for
detecting “negative sentiment” in a language model might involve several neurons;
toggling each from off to on might convert a sentence’s inferred sentiment.
In practice, one might find this circuit via causal experiments and then modulate
it. The theorem can be conceptually extended to the multi-dimensional case: a
low-dimensional subspace W ⊂ Z (spanned by a few vectors wC1, ..., wCk) such
that movement in that subspace changes a set of related concepts C1, ..., Ck.
This could handle cases like a concept that naturally breaks into finer subconcepts.
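A minimal sketch of this multi-dimensional extension is given below: several assumed concept vectors span a subspace, an orthonormal basis of that subspace is obtained by QR decomposition, and the representation is shifted by a vector of scalars within it.

```python
import numpy as np

# Joint modulation inside a concept subspace (toy vectors; the concept
# directions are assumed, not extracted from a real model).
rng = np.random.default_rng(0)
n, k = 32, 3
W = rng.normal(size=(n, k))          # columns stand in for w_C1, ..., w_Ck
Q, _ = np.linalg.qr(W)               # orthonormal basis of the subspace they span

def modulate_subspace(f_x, alphas, basis):
    """Shift f(x) by sum_i alphas[i] * basis[:, i]."""
    return f_x + basis @ np.asarray(alphas)

f_x = rng.normal(size=n)
f_mod = modulate_subspace(f_x, [1.5, -0.5, 0.0], Q)
# How each concept score w_Ci . f changes under the joint shift:
print("per-concept score changes:", np.round(W.T @ (f_mod - f_x), 3))
```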
4.3 Neuroscience Analogies
It is worth reflecting on how the Ontologic Scalar Modulation Theorem relates
to what is known about brain representations. In neuroscience, the discovery of
neurons that respond to highly specific concepts – such as the so-called “Jennifer
Aniston neuron” that fires to pictures of Jennifer Aniston and even the
text of her name (65) – suggests that the brain too has identifiable units (or ensembles)
corresponding to high-level semantics. These neurons are often called
*concept cells* (66). The existence of concept cells aligns with the idea that at
some level of processing, the brain achieves a disentangled or at least explicit
representation of certain entities or ideas. The mechanisms by which the brain
could *tune* these cells (increase or decrease their firing) parallels our notion
of scalar modulation. For instance, attention mechanisms in the brain might
effectively modulate certain neural populations, increasing their activity and
thereby making a concept more salient in one’s cognition.
Recent work using brain-computer interfaces has demonstrated volitional
control of individual neurons: in macaque monkeys, researchers have provided
real-time feedback to the animal from a single neuron’s firing rate and shown
that animals can learn to control that firing rate (essentially adjusting a scalar
activation of a targeted neuron). If that neuron’s firing corresponds to a concept
or action, the animal is indirectly modulating that concept in its brain. This is
a speculative connection, but it illustrates the broad relevance of understanding
how concept representations can be navigated in any intelligent system,
biological or artificial.
On a higher level, our theorem is an attempt to formalize something like a
“neural key” for a concept – akin to how one might think of a grandmother
cell (a neuron that represents one’s grandmother) that can be turned on or off.
While modern neuroscience leans towards distributed representations (a given
concept is encoded by a pattern across many neurons), there may still be principal
components or axes in neural activity space that correspond to coherent
variations (e.g., an “animal vs. non-animal” axis in visual cortex responses).
Indeed, techniques analogous to PCA applied to population neural data sometimes
reveal meaningful axes (like movement direction in motor cortex). The
mathematics of representational geometry is a common thread between interpreting
networks and brains (67, 68).
5 Discussion
The Ontologic Scalar Modulation Theorem opens several avenues for deeper
discussion, both practical and philosophical. We discuss the implications for
interpretability research, the limitations of the theorem, and how our work
interfaces with broader questions in AI and cognitive science.
5.1 Implications for AI Safety and Interpretability
Understanding and controlling concepts in neural networks is crucial for AI
safety. One major risk with black-box models is that they might latch onto
spurious or undesired internal representations that could lead to errant behavior.
By identifying concept vectors, we can audit what concepts a model has
internally learned. For example, one might discover a “race” concept in a face
recognition system’s latent space and monitor or constrain its use to prevent biased
decisions. The ability to modulate concepts also allows for *counterfactual
testing*: “What would the model do if this concept were present/absent?” – this
is effectively what our α parameter adjustment achieves. Such counterfactuals
help in attributing causality to internal features (69, 70).
Our theorem, being a formal statement, suggests the possibility of *guarantees*
under certain conditions. In safety-critical systems, one might want
guarantees that no matter what input, the internal representation cannot represent
certain forbidden concepts (for instance, a military AI that should never
represent civilians as targets). If those concepts can be characterized by vectors,
one could attempt to null out those directions (set α = 0 always) and ensure
the network does not drift along those axes. This is speculative and challenging
(since what if the concept is not perfectly linear?), but it illustrates how
identifying an ontology can lead to enforceable constraints.
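As a sketch of the “null out a direction” idea (under the strong assumption that the forbidden concept really is captured by one linear direction), the code below projects a representation onto the orthogonal complement of vC, driving its linear concept score to zero.

```python
import numpy as np

# Remove the linear component of a forbidden concept from a representation.
# This only suppresses the *linear* part; nonlinearly encoded traces may remain.
rng = np.random.default_rng(0)
n = 32
w_C = rng.normal(size=n)                 # assumed direction of the forbidden concept
v_C = w_C / np.linalg.norm(w_C)

def remove_concept(f_x, v):
    """Return f(x) minus its component along the concept axis v."""
    return f_x - (v @ f_x) * v

f_x = rng.normal(size=n)
print("score before:", round(float(w_C @ f_x), 3))
print("score after :", round(float(w_C @ remove_concept(f_x, v_C)), 3))  # ~0
```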
Moreover, interpretability methods often suffer from the criticism of being
“fragmentary” – one can analyze one neuron or one circuit, but it’s hard to
get a global picture. An ontology-level view provides a structured summary: a
list of concepts and relations the model uses internally. This is akin to reverse-engineering
a symbolic program from a trained neural network. If successful,
it could bridge the gap between sub-symbolic learning and symbolic reasoning
systems, allowing us to extract, for example, a logical rule or a decision tree that
approximates the network’s reasoning in terms of these concepts. In fact, there
is ongoing research in *neuro-symbolic* systems where neural nets interface with
explicit symbolic components; our findings could inform better integrations by
telling us what symbols the nets are implicitly working with.
5.2 Limitations and Complexity of Reality
While the theorem provides a neat picture, reality is more complex. Not all
concepts are cleanly separable by a single hyperplane in a given layer’s representation.
Many useful abstractions might only emerge in a highly nonlinear
way, or be distributed such that no single direction suffices. In such cases,
one might need to consider non-linear modulation (perhaps quadratic effects or
higher) or find a new representation (maybe by adding an auxiliary network that
makes the concept explicit). Our theorem could be extended with additional
conditions to handle these scenarios, but at some cost of simplicity.
Additionally, the presence of **superposition** in neural networks – where
multiple unrelated features are entangled in the same neurons due to limited
dimensionality or regularization (71, 72) – can violate the assumptions of linear
separability. Recent work by Elhage et al. (2022b) studied “toy models of superposition”
showing that when there are more features to represent than neurons
available, the network will store features in a compressed, entangled form (73).
In such cases, wC might pick up on not only concept C but also pieces of other
concepts. One potential solution is to increase dimensionality or encourage sparsity
(74) so that features disentangle (which some interpretability researchers
have indeed been exploring (75)). The theorem might then apply piecewise in
different regions of activation space where different features dominate.
From a technical standpoint, another limitation is that we assumed a known
concept C with an indicator 1C(x). In unsupervised settings, we might not
know what concepts the model has learned; discovering Ω (the ontology) itself
is a challenge. Methods like clustering of activation vectors, or finding extreme
activations and visualizing them, are used to hypothesize concepts. Our framework
could potentially be turned around to *define* a concept by a direction:
if an unknown direction v consistently yields a certain pattern in outputs when
modulated, we might assign it a meaning. For example, one could scan through
random directions in latent space of a GAN and see what changes occur, thereby
discovering a concept like “add clouds in the sky” for some direction. Automating
this discovery remains an open problem, but our theorem provides a way to
verify and quantify a discovered concept axis once you have a candidate.
5.3 Philosophical Reflections: Symbols and Understanding
Finally, we circle back to philosophy of mind. One might ask: does the existence
of a concept vector mean the network *understands* that concept? In the
strong sense, probably not – understanding involves a host of other capacities
(such as using the concept appropriately in varied contexts, explaining it, etc.).
However, it does indicate the network has a *representation* of the concept
in a way that is isomorphic (structurally similar) to how one might represent
it symbolically. Searle’s Chinese Room argument (1980) posits that a system
could manipulate symbols without understanding them. Here, the network did
not even have explicit symbols, yet we as observers can *attribute* symbolic
meaning to certain internal vectors. Whether the network “knows” the concept
is a matter of definition, but it at least has a handle to turn that correlates with
the concept in the world. This touches on the *symbol grounding problem*
(Harnad, 1990): how do internal symbols get their meaning? In neural nets,
the “meaning” of a hidden vector is grounded in how it affects outputs in relation
to inputs. If moving along vC changes the output in a way humans interpret
as “more C”, that hidden vector’s meaning is grounded by that causal role.
Our work thus contributes to an operational solution to symbol grounding in
AI systems: a concept is grounded by the set of inputs and outputs it governs
when that internal representation is activated or modulated (76).
In the context of Kantian philosophy, one could muse that perhaps these
networks, through training, develop a posteriori analogues of Kant’s a priori
categories. They are not innate, but learned through exposure to data, yet once
learned they function as a lens through which the network “perceives” inputs. A
network with a concept vector for “edible” vs “inedible” might, after training on
a survival task, literally see the world of inputs divided along that categorical
line in its latent space. Philosophy aside, this could be tested by checking if
such a vector exists and influences behavior.
Lastly, our interdisciplinary narrative underscores a convergence: ideas from
18th-century philosophy, 20th-century cognitive science, and 21st-century deep
learning are aligning around the existence of *structured, manipulable representations*
as the cornerstone of intelligence. Plato’s Forms might have been
metaphysical, but in a neural network, one can argue there is a “form” of a cat
– not a physical cat, but an abstract cat-essence vector that the network uses.
The fact that independent networks trained on different data sometimes find
remarkably similar vectors (e.g., vision networks finding similar edge detectors
(77), or language models converging on similar syntax neurons) gives a modern
twist to the notion of universals.
6 Conclusion
In this work, we have expanded the “Ontologic Scalar Modulation Theorem”
into a comprehensive framework linking the mathematics of neural network
representations with the semantics of human-understandable concepts. By
grounding our discussion in mechanistic interpretability and drawing on interdisciplinary
insights from cognitive science and philosophy, we provided both
a formal theorem and a rich contextual interpretation of its significance. The
theorem itself formalizes how a neural network’s internal *ontology* — the set
of concepts it represents — can be probed and controlled via linear directions in
latent space. Empirically, we illustrated this with examples from state-of-the-art
models, showing that even complex concepts often correspond to understandable
transformations in activation space.
Our treatment also highlighted the historical continuity of these ideas: we
saw echoes of Kant’s categories and Peirce’s semiotics in the way networks
structure information, and we related the learned latent ontologies in AI to
longstanding philosophical debates about the nature of concepts and understanding.
These connections are more than mere analogies; they suggest that
as AI systems grow more sophisticated, the tools to interpret them may increasingly
draw from, and even contribute to, the philosophy of mind and knowledge.
There are several promising directions for future work. On the theoretical
side, relaxing the assumptions of linearity and extending the theorem to
more complex (nonlinear or multi-dimensional) concept representations would
broaden its applicability. We also aim to investigate automated ways of extracting
a network’s full ontology — essentially building a taxonomy of all significant
vC concept vectors a model uses — and verifying their interactions. On the applied
side, integrating concept modulation techniques into model training could
lead to networks that are inherently more interpretable, by design (for instance,
encouraging disentangled, modulatable representations as part of the loss function).
There is also a tantalizing possibility of using these methods to facilitate
human-AI communication: if a robot can internally represent “hunger” or “goal
X” along a vector, a human operator might directly manipulate that representation
to communicate instructions or feedback.
In conclusion, the Ontologic Scalar Modulation Theorem serves as a bridge
between the low-level world of neurons and weights and the high-level world
of ideas and meanings. By traversing this bridge, we take a step towards AI
systems whose workings we can comprehend in the same way we reason about
programs or symbolic knowledge – a step towards AI that is not just intelligent,
but also *intelligible*. We believe this line of research will not only improve
our ability to debug and align AI systems, but also enrich our scientific understanding
of representation and abstraction, concepts that lie at the heart of
both artificial and natural intelligence.
References
[1] Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety: A Review. arXiv:2404.14082.
[2] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Distill (Transformer Circuits Thread).
[3] Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network Dissection: Quantifying Interpretability of Deep Visual Representations. In Proc. CVPR.
[4] Bau, D., Zhu, J.-Y., Strobelt, H., Tenenbaum, J., Freeman, W., & Torralba, A. (2019). GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. In Proc. ICLR.
[5] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proc. ICML.
[6] Goh, G., Sajjad, A., et al. (2021). Multimodal Neurons in Artificial Neural Networks. Distill.
[7] Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In Proc. ICML.
[8] Newell, A., & Simon, H. (1976). Computer Science as Empirical Inquiry: Symbols and Search. Communications of the ACM, 19(3), 113–126.
[9] Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 28(1–2), 3–71.
[10] Harnad, S. (1990). The Symbol Grounding Problem. Physica D, 42(1–3), 335–346.
[11] Kant, I. (1781). Critique of Pure Reason. (Various translations).
[12] Plato. (c. 380 BC). The Republic. (Trans. Allan Bloom, 1968, Basic Books).
[13] Peirce, C. S. (1867). On a New List of Categories. Proceedings of the American Academy of Arts and Sciences, 7, 287–298.
[14] Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company.
[15] Quian Quiroga, R. (2012). Concept Cells: The Building Blocks of Declarative Memory Functions. Nature Reviews Neuroscience, 13(8), 587–597.
[16] Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41.
[17] Lenat, D. B. (1995). CYC: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11), 33–38.
"A central challenge in interpretability is to bridge the gap between the model’s low-level numerical operations and the high-level semantic concepts by which humans understand the world."
ReplyDelete- This is the EXACT function of our SymbolNet AI Programming Language and it works every time.