Ontologic Scalar Modulation Theorem
C.L. Vaillant
May 25, 2025
Abstract
Mechanistic interpretability research seeks to reverse engineer the internal
computation of neural networks into human-understandable algorithms
and concepts. In this paper, we introduce an interdisciplinary theoretical
framework grounded in mechanistic interpretability and enriched
by cognitive science, symbolic AI, ontology, and philosophy of mind. We
formalize the *Ontologic Scalar Modulation Theorem*, which provides
a rigorous account of how high-level semantic concepts (an **ontology**)
can be represented, identified, and continuously modulated within the latent
space of a learned model. Our approach offers precise mathematical
definitions and structures that bridge low-level network mechanisms and
high-level human-interpretable features. We illustrate the theorem with
examples drawn from vision and language models, demonstrating how
adjusting a single scalar parameter can “turn up or down” the presence
of an abstract concept in a model’s representation. We further connect
these technical insights to long-standing philosophical questions, drawing
on Kantian categories, Peircean semiotics, and Platonic forms, to contextualize
how neural networks might be said to *discover* or instantiate
abstract knowledge. The results highlight a convergence between modern
AI interpretability and classical understandings of cognition and ontology
and suggest new avenues for building AI systems with interpretable and
philosophically grounded knowledge representations.
1 Introduction
Modern artificial intelligence systems, particularly deep neural networks, have
achieved remarkable performance in a wide range of domains. However, their
inner workings often remain opaque, prompting a growing field of *mechanistic
interpretability* aimed at uncovering the algorithms and representations emerging
within these models: Mechanistic interpretability strives to go beyond the
correlations between inputs and outputs and instead * reverse engineer* the
network computations into human-understandable components and processes
(2). This pursuit is not only of academic interest, but a practical imperative for
AI safety and alignment, since understanding the internals of a model can help
ensure it aligns with human values and behaves as intended (3)
A central challenge in interpretability is to bridge the gap between the
model’s low-level numerical operations and the high-level semantic concepts
by which humans understand the world. In cognitive science and philosophy of
mind, this gap reflects the enduring question of how abstract ideas and categories
arise from raw sensory data. Immanuel Kant, for example, argued that the human
mind imposes innate *categories of understanding* (such as causality and
unity) to organize experience (Kant, 1781). Centuries earlier, Plato’s theory of
*Forms* posited that abstract universals (like “Beauty” or “Circle”) underlie the
concrete objects we perceive (4,5). These philosophical perspectives highlight
an ontological stratification of knowledge: a hierarchy from concrete particulars
to abstract universals. Similarly, early artificial intelligence research in the
symbolic paradigm emphasized explicit, human-readable knowledge structures:
Newell and Simon’s physical symbol system hypothesis famously claimed
that symbol manipulation operations are necessary and sufficient for general intelligence
(Newell & Simon, 1976). Ontologies, formal representations of concepts
and relationships, were built by hand in projects like *Cyc*, which attempted to
encode common sense knowledge as millions of logical assertions (Lenat, 1995).
In the realm of language, comprehensive lexical ontologies such as *WordNet*
organized words into hierarchies of concepts (6, 7), reflecting human semantic
networks.
By contrast, the success of modern deep learning has arisen from subsymbolic,
distributed representations learned from data. Connectionist models encode
knowledge as patterns of activations across many neurons, rather than
discrete symbols. This led to debates in cognitive science: Could neural networks
capture the structured, systematic nature of human cognition? Critics
like Fodor and Pylyshyn (1988) argued that distributed representations lack the
*compositional* structure needed for systematic reasoning (for example, anyone
who understands “John loves Mary” should also be able to understand “Mary
loves John”) (8, 9). However, advocates of connectionism hoped that as networks
grew in depth and complexity, they could develop internal representations that
mirror symbolic structures **implicitly**, even if not explicitly hard-coded (10,
11).
Recent research suggests that deep networks learn intermediate representations
that correspond to human-interpretable concepts, lending some credence
to this hope. For example, in computer vision, convolutional neural networks
trained on image classification have been found to develop a *hierarchy of features*:
the early layers detect simple edges and textures, while the deeper
layers encode higher-level patterns such as object parts and entire objects (12,
13). This emergent hierarchy is analogous to the *levels of analysis* in human
vision described by Marr (1982), and it hints that a form of learned ontology is
present within the network. A particularly striking demonstration was provided
by Zhou et al. (2015), who observed that object detectors (e.g. neurons that
fire for ’dog’ or ’airplane’) spontaneously emerged inside a CNN trained only
to classify scenes (such as ’kitchen’ or ’beach’) (14). In other words, without
explicit supervision for objects, the network invented an internal vocabulary of
objects as a means to recognize scenes. Such findings align with the “Platonic
representation hypothesis” suggested by Isola et al., that different neural networks,
even with different architectures or tasks, tend to converge on similar
internal representations for fundamental concepts (15, 16).
Despite these insights, a rigorous framework for understanding and *controlling*
the mapping between low-level neural activity and high-level ontology
has been lacking. This paper aims to fill that gap. We present the **Ontologic
Scalar Modulation Theorem**, which formalizes how abstract concepts
can be mathematically identified within a network’s latent space and continuously
modulated by acting on a single scalar parameter. In simpler terms, we
demonstrate that for certain learned representations, one can construct a *concept
axis*—a direction in activation space corresponding to a human-meaningful
concept—such that moving a point along this axis strengthens or diminishes the
presence of that concept in the network’s behavior. This provides a principled
way to traverse the model’s *ontology* of concepts.
We proceed as follows. In Section 2, we review related work from mechanistic
interpretability and cognitive science that lays the foundation for our approach.
Section 3 introduces necessary definitions (tying together notions from ontology
and network representation) and formally states the Ontologic Scalar Modulation
Theorem with a proof sketch. Section 4 provides empirical examples of
the theorem in action: we discuss how concept vectors have been used to manipulate
image generation and analyze neurons in vision and language models,
drawing parallels to neurophysiological findings like “Jennifer Aniston neurons”
in the human brain (17). In Section 5, we explore the broader implications of
our work, including connections to philosophical theories of mind and prospects
for integrating symbolic structure into deep learning. We conclude in Section
6 with a summary and suggestions for future research, including how a better
understanding of learned ontologies could inform the design of AI systems that
are not only powerful, but also transparent and aligned with human values.
2 Background and Related Work
2.1 Mechanistic Interpretability of Neural Networks
Our work is situated within the field of *mechanistic interpretability*, which
seeks to uncover the internal mechanisms of neural networks in a *causal* and
fine-grained way (18). Unlike post-hoc explanation methods (e.g. saliency maps
or feature attributions) that highlight important features without detailing the
underlying computation, mechanistic interpretability endeavors to identify the
actual *subcircuits*, neurons, and weights that implement specific functions
within the model (19, 20). In this sense, it parallels the approach of cognitive
neuroscience: much as neuroscientists attempt to map cognitive functions
to circuits of biological neurons, interpretability researchers map algorithmic
functions to artificial neurons or groups thereof.
Significant progress has been made in reverse-engineering small components
of networks. For example, **induction heads** in transformer models (a type
of attention head) have been identified that implement a *prefix-matching-and-copying*
algorithm enabling in-context learning of repeated token sequences (21). In
another case, *multi-modal neurons* were discovered in vision-language models
(like CLIP) that respond to a high-level concept regardless of whether it is
presented as an image or a word (22). A famous instance is a neuron that fires
for the concept “Spider-Man”, responding both to pictures of the Spider-Man
character and to the text ”Spider-Man” (23). This echoes the concept of a
”Jennifer Aniston neuron” in the human brain – a single neuron that responds
to pictures of the actress Jennifer Aniston and even her written name (24),
suggesting that neural networks can, in some cases, learn similarly abstract and
multi-modal representations of concepts.
A variety of techniques have been developed to study such internal representations.
**Network dissection** is a seminal approach introduced by Bau et al.
(2017), which quantifies interpretability by evaluating how individual hidden
units align with human-labeled concepts drawn from a broad set of visual categories (25). For a given convolutional
network, each neuron’s activation map can be compared to segmentation
masks for concepts like “cat”, “chair”, or “stripes” to see if that neuron acts
as a detector for that concept (26). Network dissection studies revealed that
many units in vision models have high alignment with intuitive visual concepts
(objects, parts, textures, etc.), providing a rough *ontology* of the network’s
learned features. However, not all concepts correspond to single neurons; some
are distributed across multiple units or dimensions.
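To make the alignment measure concrete, the sketch below computes a network-dissection-style intersection-over-union score between one unit’s activation map and one concept mask for a single image. It is a simplified, synthetic illustration (the published method thresholds each unit over an entire dataset); the data and the function name `dissection_iou` are our own stand-ins, not the original implementation.

```python
import numpy as np

def dissection_iou(activation_map, concept_mask, quantile=0.9):
    """Network-dissection-style alignment score between one unit and one concept.

    activation_map: 2D array of a single unit's activations over an image
                    (upsampled to the mask resolution).
    concept_mask:   binary 2D array marking where the concept (e.g. "cat") appears.
    Note: the original method thresholds each unit over a whole dataset at a much
    stricter quantile; this single-image version only illustrates the idea.
    """
    # Threshold the unit at a high activation quantile to get its "on" region.
    threshold = np.quantile(activation_map, quantile)
    unit_region = activation_map >= threshold
    # Intersection-over-union between the unit's region and the concept mask.
    intersection = np.logical_and(unit_region, concept_mask).sum()
    union = np.logical_or(unit_region, concept_mask).sum()
    return intersection / union if union > 0 else 0.0

# Toy example: a unit that activates roughly where the concept is present.
rng = np.random.default_rng(0)
mask = np.zeros((64, 64), dtype=bool)
mask[20:40, 20:40] = True                      # pretend "cat" region
act = rng.normal(0, 1, (64, 64))
act[20:40, 20:40] += 3.0                       # unit fires strongly on that region
print(f"IoU alignment: {dissection_iou(act, mask):.2f}")
```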
Another line of work probes the geometry of representations. It has been
observed that in some models, conceptual relationships are reflected as linear
directions in latent space. Word embedding models famously exhibit linear
analogies (e.g., v(King)−v(Man)+v(Woman) ≈ v(Queen)), suggesting that
certain latent directions correspond to abstract relations (27, 28). In vision,
**feature visualization** (Olah et al. 2017) uses optimization to find an input
image that maximally activates a neuron or a combination of neurons, often
revealing the concept the neuron has learned to detect (e.g., a neuron might
consistently produce images of spiral patterns, indicating it detects spirals).
These methods provide qualitative insight into network ontology by directly
showcasing learned features.
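The linear-analogy phenomenon can be reproduced in miniature. The sketch below uses tiny hand-made vectors (real systems use embeddings learned from large corpora) purely to show the arithmetic v(King) − v(Man) + v(Woman) landing nearest v(Queen).

```python
import numpy as np

# Toy, hand-made embeddings (illustrative only); in the three dimensions below,
# only the royalty/gender offsets matter for the analogy.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.2, 0.1, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# v(King) - v(Man) + v(Woman) should land nearest to v(Queen).
target = emb["king"] - emb["man"] + emb["woman"]
ranked = sorted(emb, key=lambda w: cosine(target, emb[w]), reverse=True)
print(ranked)  # 'queen' ranks at the top for these toy vectors
```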
Crucially for our work, recent advances allow not only *identifying* concepts
inside networks but also *intervening* on them. **Activation patching** and
causal intervention techniques replace or modify internal activations to test
their influence on outputs (29, 30). For example, one can swap a segment of
activations between two inputs (one with a concept and one without) to see if
the output swaps accordingly (31), thereby pinpointing where in the network
a concept is represented. If a specific layer’s activation carries the concept,
patching it into a different input can implant that concept’s effect (32). Along
similar lines, **model editing** methods like ROME (Rank-One Model Editing)
directly modify network weights to insert a desired knowledge (e.g., “Paris is
the capital of Italy” could be flipped to “Paris is the capital of France” by a
targeted weight change) (33). These interventions highlight that representations
of knowledge in networks can be located and manipulated in a targeted way.
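A minimal activation-patching sketch is shown below, using a toy PyTorch MLP rather than any published model: a hidden activation recorded from a “source” input is written into the same layer during a run on a “target” input, and the change in output is observed.

```python
import torch
import torch.nn as nn

# Toy illustration of activation patching (assumed model and inputs, not a
# reproduction of any specific study).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
layer = model[1]               # patch at the ReLU output

x_source = torch.randn(1, 8)   # stand-in for an input where the concept is present
x_target = torch.randn(1, 8)   # stand-in for an input where it is absent

# 1) Cache the source activation at the chosen layer.
cache = {}
handle = layer.register_forward_hook(lambda m, inp, out: cache.update(act=out.detach()))
model(x_source)
handle.remove()

# 2) Re-run the target input, but replace the layer's activation with the cached one.
patch = layer.register_forward_hook(lambda m, inp, out: cache["act"])
patched_out = model(x_target)
patch.remove()

print("clean target output  :", model(x_target))
print("patched target output:", patched_out)
```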
Our theorem builds on these insights by providing a general theoretical account
of concept representation and modulation. In particular, it complements
work on **disentangled representations** in unsupervised learning. A disentangled
representation aims to have individual latent dimensions correspond
to distinct factors of variation in the data (for instance, in a face generator,
one latent might control hair color, another controls lighting, etc.). Beta-VAE
(Higgins et al. 2017) and related approaches encouraged disentanglement via
regularization, and metrics were proposed to quantify disentanglement. However,
Locatello et al. (2019) proved that without inductive biases or supervision,
disentanglement cannot be uniquely achieved (34, 35). In practice, perfect disentanglement
is hard, but even standard models often learn *approximately*
disentangled directions. For instance, in generative adversarial networks, unsupervised
techniques like PCA on latent activations (GANSpace, Härkönen et al.
2020) or supervised approaches like **InterfaceGAN** (Shen et al. 2020) found
specific vectors in the latent space that correspond to human-meaningful transformations
(e.g. adding a smile on a face, changing the background scenery).
Importantly, moving the latent code in the direction of these vectors causes a
smooth change in the output image along that semantic dimension.
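The following sketch illustrates the GANSpace-style procedure in simplified form: sample many latent codes, collect an intermediate feature for each, and take principal components as candidate edit directions. The “generator” here is a random stand-in, so the directions carry no real semantics; only the mechanics are shown.

```python
import numpy as np

# GANSpace-flavoured sketch (heavily simplified): PCA over intermediate features
# of many sampled latents yields candidate edit directions.
rng = np.random.default_rng(0)
latent_dim, feat_dim, n_samples = 32, 128, 2000

W_gen = rng.normal(size=(latent_dim, feat_dim))     # stand-in for early generator layers
z = rng.normal(size=(n_samples, latent_dim))        # sampled latent codes
feats = np.tanh(z @ W_gen)                          # intermediate features

# PCA via SVD on centered features.
centered = feats - feats.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
principal_directions = vt[:10]                      # top-10 candidate edit directions

# Editing then amounts to moving a feature along one direction: feat + alpha * v.
alpha = 3.0
edited = feats[0] + alpha * principal_directions[0]
print(principal_directions.shape, edited.shape)
```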
This ability to *modulate* a concept by moving along a latent direction is
a key empirical phenomenon that our Ontologic Scalar Modulation Theorem
formalizes. It ties into the notion of *concept activation vectors* described by
Kim et al. (2018). In their Testing with Concept Activation Vectors (TCAV)
framework, the authors obtained a vector in hidden space that points towards
higher activation of a chosen concept (learned from examples of that concept)
(36, 37). They then measured the sensitivity of the model’s predictions to perturbations
along that concept vector (38). TCAV thus provides a quantitative
tool to ask, for example: *is the concept of “stripes” important to this classifier’s
prediction of “zebra”?* — by checking if moving in the “stripe” direction
in feature space changes the zebra score (39). Our work generalizes the idea of
concept vectors and situates it in a broader theoretical context, linking it explicitly
with ontology (the set of concepts a model has and their relations) and
providing conditions under which a single scalar parameter can control concept
intensity.
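A TCAV-flavoured sketch on synthetic activations is given below (it is not the official TCAV implementation): a linear probe supplies the concept activation vector, and the sensitivity of a downstream linear “zebra score” to movement along that vector is measured as a directional derivative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic layer activations: "concept present" examples are shifted along a
# hidden concept direction that the probe must recover.
rng = np.random.default_rng(0)
n, d = 500, 64
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

acts_neg = rng.normal(size=(n, d))
acts_pos = rng.normal(size=(n, d)) + 2.0 * concept_dir
X = np.vstack([acts_neg, acts_pos])
y = np.array([0] * n + [1] * n)

# Concept activation vector = normalized weight vector of a linear probe.
probe = LogisticRegression(max_iter=1000).fit(X, y)
cav = probe.coef_[0] / np.linalg.norm(probe.coef_[0])

# Downstream "zebra score" here is just a linear head u . f(x); its directional
# derivative along the CAV is u . cav (real TCAV backpropagates through the model).
u = rng.normal(size=d) + 1.5 * concept_dir
sensitivity = float(u @ cav)
print(f"cosine(cav, true concept dir)        = {cav @ concept_dir:.2f}")
print(f"directional sensitivity of the score = {sensitivity:.2f}")
```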
In summary, prior research provides many pieces: evidence that networks
learn human-recognizable features, methods to find and manipulate those features,
and even hints from neuroscience that single units or sparse sets of units
can embody high-level concepts (40). What has been missing is an overarching
theoretical lens to integrate these pieces. By uniting insights from these works,
the Ontologic Scalar Modulation Theorem offers a unifying principle and a stepping
stone toward a more *systematic* mapping between neural representations
and symbolic knowledge.
2.2 Cognitive and Philosophical Perspectives
Our interdisciplinary approach draws from cognitive science and philosophy to
interpret the significance of the Ontologic Scalar Modulation Theorem. In cognitive
science, a classic framework due to David Marr (1982) delineates multiple
levels of analysis for information-processing systems: the *computational* level
(what problem is being solved and why), the *algorithmic/representational*
level (how the information is represented and what processes operate on it),
and the *implementational* level (how those representations and processes are
physically realized) (43). Mechanistic interpretability operates mainly at Marr’s
implementational and algorithmic levels for AI systems, revealing the representations
and transformations inside a network. However, to connect these to high-level
semantic content (Marr’s computational level in human-understandable
terms), one needs a notion of the network’s internal *concepts*. Our theorem
can be seen as a bridge between the implementational level (activations, weights)
and the algorithmic level (the network’s internal “language” of concepts), allowing
us to reason about abstract computational roles of components.
From the perspective of the *symbolic vs. connectionist* debate in cognitive
science (43, 44), our work contributes to understanding how symbolic-like structures
might emerge from neural systems. Fodor’s critique (45), which asserted
that connectionist networks cannot naturally exhibit systematic, compositional
structure, is partially addressed by findings that networks do learn to encode
variables and relations in a distributed way. For instance, recent mechanistic
analyses show that transformers can bind variables to roles using superposition
in high-dimensional vectors (smearing multiple symbols in one vector in
a “fuzzy” manner) (46). Elhage et al. (2021) demonstrated that even in randomly
initialized transformer models, one can define *traitor* and *duplicate
token* circuits that perform a kind of variable binding and copying (47, 48).
Such results suggest connectionist models can implement discrete-like operations
internally. The Ontologic Scalar Modulation Theorem further supports
this by implying the existence of controllable dimensions corresponding to discrete
changes in a concept’s presence, effectively giving a handle on something
akin to a symbolic variable within the vector geometry of a network.
Philosophically, our approach resonates with *Peircean semiotics* and pragmatism.
Charles S. Peirce, in his theory of signs, proposed that a sign (representation)
stands for an object (referent) to an interpretant (the meaning
understood) through a triadic relation. One can draw an analogy: an internal
activation pattern in a network could be seen as a **sign** that corresponds
to some **object** or concept in the input (e.g., a pattern representing “cat”),
and the **interpretant** is the effect that representation has on the network’s
subsequent computation or output. In Peirce’s terms, signs can be *iconic*
(resembling the object), *indexical* (causally or correlationally linked to the
object), or *symbolic* (related by convention or interpretation). Neural representations
often begin as indexical or iconic (e.g., an edge detector neuron has an
iconic relation to visual edges) but can become increasingly symbolic (abstract,
not resembling the input) in deeper layers. Our theorem giving a formal way to
manipulate a high-level concept representation vC can be viewed as identifying
a *symbol* in the network’s language and showing how it can be systematically
varied. This aligns with Peirce’s idea that higher cognition uses symbols that
can be combined and modulated, albeit here the symbols are vectors in Rn.
The influence of Immanuel Kant is also noteworthy. Kant held that the mind
has innate structures (categories) that organize our experience of the world. One
might ask: do neural networks develop their own *categories* for making sense
of their inputs? The ontology of a trained network – the set of features or latent
variables it uses – can be thought of as analogous to Kantian categories,
albeit learned rather than innate. For example, a vision network might implicitly
adopt category-like distinctions (edges vs. textures vs. objects, animate vs.
inanimate, etc.) because these are useful for its tasks. Our work enables probing
those internal categories by finding directions that correspond to conceptual
distinctions. In effect, the theorem provides a method to *decompose* a network’s
representation space in terms of its phenomenological categories. This
also connects to modern discussions of *feature ontologies* in interpretability:
identifying what the primitive concepts of a network are (perhaps very different
from human concepts, or surprisingly similar).
Finally, our treatment of *ontology* itself is informed by both AI and philosophy.
In AI, an ontology is a formal specification of a set of entities, categories,
and relations – essentially an explicit knowledge graph of concepts. In our context,
the network’s ontology is implicit, embedded in weights and activations.
By extracting interpretable directions and features, we begin to make the network’s
ontology explicit. This evokes historical efforts like *ontology learning*
in knowledge engineering, but here it happens post hoc from a trained model.
Philosophically, ontology concerns what exists – the categories of being. One
might provocatively ask: does a neural network *discover* ontological structure
about its domain? For instance, a vision model that learns separate internal
representations for “cat” and “dog” is carving the world at its joints (at least
as reflected in its training data). There is evidence that large language models
learn internal clusters corresponding to semantic concepts like parts of speech or
world knowledge categories (e.g., a certain vector subspace might correspond to
“locations”) (49, 50). In examining such phenomena, we follow a lineage from
Plato’s belief in abstract Forms to modern machine learning: the concepts might
not be *transcendent* Forms, but the convergent learning of similar representations
across different models (51) hints that there is an objective structure
in data that neural networks are capturing – a structure that might be viewed
as *latent ontology*. The Ontologic Scalar Modulation Theorem gives a concrete
handle on that latent ontology by linking it to measurable, manipulable
quantities in the model.
3 Ontologic Scalar Modulation Theorem
In this section, we formalize the core theoretical contribution of this work. Our
aim is to define what it means for a concept to be present in a network’s representation
and to show that under certain conditions, the degree of presence of
that concept can be modulated by adjusting a single scalar parameter along a
specific direction in latent space. Intuitively, the theorem will demonstrate that
if a concept is well-represented in a network (in a sense made precise below),
then there exists a vector in the network’s activation space whose scalar projection
correlates directly with the concept. By moving the activation state of the
network along this vector (i.e., adding or subtracting multiples of it), one can
increase or decrease the evidence of the concept in the network’s computations
or outputs in a controlled, continuous manner.
3.1 Definitions and Preliminaries
We begin by establishing definitions that merge terminologies from ontology
and neural network theory:
Neural Representation Space: Consider a neural network with an internal
layer (or set of units) of interest. Without loss of generality, we focus on a single
layer’s activations as the representation. Let Z = Rn denote the n-dimensional
activation space of this layer. For an input x from the input domain X (e.g.,
images, text), let f(x) ∈ Z be the activation vector produced at that layer. We
call f(x) the *representation* of x. (The analysis can be extended to considering
the joint activations of multiple layers or the entire network, but a single layer
is sufficient for our theoretical development.)
Ontology and Concept: We define an *ontology* Ω in the context of the
model as the set of concepts that the model can represent or distinguish at the
chosen layer. A *concept* C ∈ Ω is an abstract feature or property that might
be present in an input (for example, a high-level attribute like “cat”, “striped”,
or “an outdoor scene”). We assume each concept C has an associated *concept
indicator function* on inputs, denoted 1C(x), which is 1 if concept C is present
in input x (according to some defined criterion) and 0 if not. For instance, if C is the concept
“contains a cat”, then 1C(x) = 1 if image x contains a cat. In practice, 1C(x)
might be defined via human labeling or some ground-truth function outside the
model. We also define a real-valued *concept measure* μC(x) that quantifies
the degree or strength of concept C in input x. If C is binary (present/absent),
μC(x) could simply equal 1C(x); if C is continuous or graded (like “smiling” as
a concept that can be more or less intense), μC(x) might take a range of values.
Linear Concept Subspace: We say that concept C is *linearly represented*
at layer Z if there exists a vector wC ∈ Rn (not the zero vector) such that the
*concept score* defined by sC(x) = wC · f(x) is correlated with the concept’s
presence. More formally, we require that sC(x) is a reliable predictor of μC(x).
This could be evaluated, for example, by a high coefficient of determination
(R2) if μC(x) is real-valued, or high classification accuracy if μC(x) is binary.
The direction wC (up to scaling) can be thought of as a normal to a separating
hyperplane for the concept in representation space, as often obtained by training
a linear probe classifier (52). If such a wC exists, we define the *concept subspace*
for C as the one-dimensional subspace spanned by wC. Geometrically,
points in Z differing only by movement along wC have the same projection onto
all directions orthogonal to wC, and differ only in their coordinate along the
concept axis wC.
Concept Activation Vector: For convenience, we normalize and define a
unit vector in the direction of wC: let vC = wC/∥wC∥. We call vC a *concept
activation vector* (borrowing the terminology of TCAV (53)). This vector points
in the direction of increased evidence for concept C in the representation space.
Thus, the dot product vC · f(x) (which equals (1/∥wC∥) sC(x)) gives a signed scalar
representing how much C is present in representation f(x), according to the
linear model.
Modulation Operator: For any α ∈ R, we define a *modulated representation*
fα(x) as:
fα(x) = f(x) + α vC.
In other words, we take the original activation vector f(x) and add a multiple
of the concept vector vC. The parameter α is a scalar that controls the degree
of modulation. Positive α moves the representation in the direction that should
increase concept C’s presence; negative α moves it in the opposite direction.
It is important to note that fα(x) may not correspond to a valid activation
that the unmodified network would naturally produce for some input –
we are intervening in activation space off the standard manifold of f(x) values.
Nonetheless, one can conceptually imagine fα(x) as the activation if the network
were exposed to a version of x where concept C is artificially strengthened
or weakened. In practice, one could implement such modulation by injecting
an appropriate bias in the layer or by actually modifying x through an input
transformation that targets the concept (if such a transformation is known).
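The definitions above translate directly into code. The sketch below (toy dimensions, with random vectors standing in for wC and f(x)) computes the concept score, the unit concept vector, and the modulated representation fα(x) = f(x) + α vC.

```python
import numpy as np

# Direct transcription of the definitions above, on toy data.
rng = np.random.default_rng(0)
n = 16
w_C = rng.normal(size=n)            # assumed linear-concept weight vector
v_C = w_C / np.linalg.norm(w_C)     # concept activation vector

def concept_score(f_x, w):
    """s_C(x) = w_C . f(x)."""
    return float(w @ f_x)

def modulate(f_x, alpha, v):
    """f_alpha(x): shift the representation along the concept axis by alpha."""
    return f_x + alpha * v

f_x = rng.normal(size=n)            # stand-in for the layer activation f(x)
for alpha in (-2.0, 0.0, 2.0):
    s = concept_score(modulate(f_x, alpha, v_C), w_C)
    print(f"alpha = {alpha:+.1f} -> concept score = {s:+.3f}")
```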
With these definitions in place, we can now state the theorem.
3.2 Theorem Statement
Theorem 1 (Ontologic Scalar Modulation Theorem) Assume a concept
C is linearly represented at layer Z of a neural network by vector wC, as defined
above. Then there exists a one-dimensional subspace (the span of wC) in
the activation space Z such that movement along this subspace monotonically
modulates the evidence of concept C in the network’s output or internal computations.
In particular, for inputs x where the concept is initially absent or
present to a lesser degree, there is a threshold α∗ > 0 for which the network’s
output ŷ on the modulated representation fα(x) will indicate the presence of C
for all α ≥ α∗, under the assumption that other features remain fixed.
More formally, let g be an indicator of the network’s output or classification
for concept C (for example, g(f(x)) = 1 if the network’s output classifies x as
having concept C, or if an internal neuron specific to C fires above a threshold).
Then under a local linearity assumption, there exists α∗ such that for all α ≥ α∗,
g(fα(x)) = 1,
and for α ≤ −α∗ (sufficiently large negative modulation),
g(fα(x)) = 0,
provided μC(x) was originally below the decision boundary for g.
In addition, the degree of concept presence measured by sC(x) = wC · f(x)
changes linearly with α:
wC · fα₂(x) − wC · fα₁(x) = (α₂ − α₁)∥wC∥,
implying that the internal activation score for concept C changes in direct proportion
to the modulation parameter.
In essence, Theorem 1 states that if a concept can be captured by a linear
direction in a network’s latent space (a condition that empirical evidence
suggests holds for many concepts(54, 55)), then we can treat that direction as
an interpretable axis along which the concept’s strength varies. Increasing the
coordinate along that axis increases the network’s belief in or expression of the
concept, while decreasing it has the opposite effect. This allows for a continuous
*scalar* control of an otherwise discrete notion (the presence or absence of a
concept), hence the term “scalar modulation.”
3.3 Proof Sketch and Discussion
Proof Outline: Under the assumptions of the theorem, wC was obtained such
that wC · f(x) correlates with μC(x). In many cases wC might be explicitly
derived as the weight vector of a linear classifier hC(f(x)) = σ(wC · f(x) + b)
trained to predict 1C(x), with σ some link function (e.g., sigmoid for binary
classification). If the concept is perfectly linearly separable at layer Z, then
there is a hyperplane {z : wC · z +b = 0} such that wC · f(x)+b > 0 if and only
if 1C(x) = 1. For simplicity assume zero bias (b = 0) which can be achieved by
absorbing b into wC with one extra dimension.
Now consider an input x for which 1C(x) = 0, i.e. concept C is absent. This
means wC · f(x) < 0 (if x is on the negative side of the hyperplane). If we
construct fα(x) = f(x) + αvC, then:
wC · fα(x) = wC · f(x) + αwC · vC = wC · f(x) + α ∥wC∥.
Because vC is the unit vector in direction wC, wC · vC = ∥wC∥. Thus as α
increases, wC · fα(x) increases linearly. There will be a particular value
α∗ = −(wC · f(x))/∥wC∥
at which wC · fα∗(x) = 0, i.e. the modulated representation lies exactly
on the decision boundary of the linear concept classifier. For any α > α∗,
wC · fα(x) > 0, and thus hC(fα(x)) will predict the concept as present (for a
sufficiently large margin above the boundary, making the probability σ(·) close
to 1 if using a sigmoid). This establishes the existence of a threshold beyond
which the network’s classification of x would be flipped with respect to concept
C.
The monotonicity is evident from the linear relation: if α < α′, then wC ·
fα(x) < wC · fα′ (x). Therefore, if α is below the threshold and α′ is above
it, there is a monotonic increase in the concept score crossing the boundary,
implying a change from absence to presence of the concept in the network’s
output. Conversely, for negative modulation, as α becomes very negative, wC ·
fα(x) will be strongly negative, ensuring the network firmly classifies the concept
as absent.
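The threshold derived above can be checked numerically. In the sketch below, wC and f(x) are random stand-ins, the bias is zero, and the concept is initially absent; the concept score crosses zero exactly at α∗ = −(wC · f(x))/∥wC∥.

```python
import numpy as np

# Numeric check of the alpha* threshold from the proof sketch (toy vectors).
rng = np.random.default_rng(1)
n = 16
w_C = rng.normal(size=n)
v_C = w_C / np.linalg.norm(w_C)

f_x = rng.normal(size=n)
if w_C @ f_x > 0:                   # force the "concept absent" starting condition
    f_x = -f_x

alpha_star = -(w_C @ f_x) / np.linalg.norm(w_C)
print(f"alpha* = {alpha_star:.3f}")

for alpha in (0.0, 0.5 * alpha_star, alpha_star + 1e-6, 2 * alpha_star):
    score = w_C @ (f_x + alpha * v_C)
    print(f"alpha = {alpha:7.3f} -> w_C . f_alpha(x) = {score:+.3f} "
          f"-> concept predicted: {score > 0}")
```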
One caveat is that this argument assumes the rest of the network’s processing
remains appropriately “ceteris paribus” when we intervene on the representation.
In reality, extremely large perturbations could move fα(x) off the manifold
of typical activations, leading the downstream computation to break the
linear approximation. However, for sufficiently small perturbations up to the
decision boundary, if we assume local linearity (which is often the case in high-dimensional
spaces over short distances, especially if the next layer is linear or
approximately linear in the region of interest), the network’s downstream layers
will interpret fα(x) in a way consistent with its movement toward a prototypical
positive-C representation.
Another consideration is that concept C might not be perfectly represented
by a single direction due to entanglement with other concepts (56). In practice,
wC may capture a mixture of factors. However, if wC is the result of an optimal
linear probe, it will be the direction of steepest ascent for concept log-odds at
that layer. Thus moving along wC yields the greatest increase in the network’s
internal evidence for C per unit of change, compared to any other direction. If
multiple concepts are entangled, one might apply simultaneous modulation on
multiple relevant directions or choose a different layer where C is more disentangled.
The theorem can be generalized to a multi-dimensional subspace if needed
(modulating multiple scalars corresponding to basis vectors), but we focus on
the one-dimensional case for clarity.
Relationship to Prior Work: The Ontologic Scalar Modulation Theorem
is a theoretical generalization of several empirical observations made in prior
interpretability research. For instance, in generative image models, researchers
identified directions in latent space that correspond to semantic changes like
“increase smile” or “turn on lights” (57). Our theorem provides a foundation for
why such directions exist, assuming the generator’s intermediate feature space
linearly encodes those factors. Kim et al.’s TCAV method (58) empirically finds
vC by training a probe; Theorem 1 assures that if the concept is learnable by
a linear probe with sufficient accuracy, then moving along that probe’s weight
vector will indeed modulate the concept.
It is important to note that the theorem itself does not guarantee that every
high-level concept in Ω is linearly represented in Z. Some concepts might be
highly nonlinear or distributed in the representation. However, the surprising
effectiveness of linear probes in many networks (a phenomenon noted by Alain
and Bengio (2016) (59), and others) suggests that deep networks often organize
information in a linearly separable way at some layer – at least for
many semantically salient features. This might be related to the progressive
linear separation property of deep layers, or to networks reusing features in
a linear fashion for multiple tasks (as seen in multitask and transfer learning
scenarios).
4 Empirical Examples and Applications
We now turn to concrete examples to illustrate the Ontologic Scalar Modulation
Theorem in action. These examples span computer vision and natural language,
and even draw parallels to neuroscience, underscoring the broad relevance of our
framework.
4.1 Controlling Visual Concepts in Generative Networks
One vivid demonstration of concept modulation comes from generative adversarial
networks (GANs). In a landmark study, **GAN Dissection**, Bau et
al. (2019) analyzed the internal neurons of a GAN trained to generate scenes
(60). They found that certain neurons correspond to specific visual concepts:
for example, one neuron might correspond to “tree” such that activating this
neuron causes a tree to appear in the generated image. By intervening on that
neuron’s activation (setting it to a high value), the researchers could *insert*
the concept (a tree) into the scene (61). Conversely, suppressing the neuron
could remove trees from the scene. This is an example of scalar modulation at
the single-unit level.
Going beyond single units, **latent space factorization** approaches like InterfaceGAN
(Shen et al., 2020) explicitly sought linear directions in the GAN’s
latent Z that correlate with concepts like “smiling”, “age”, or “glasses” in generated
face images. Using a set of images annotated for a concept (say, smiling
vs. not smiling), a linear SVM was trained in Z to separate the two sets,
yielding a normal vector wsmile. This wsmile is exactly in line with our wC for
concept C = “smile”. The striking result is that taking any random face latent
z and moving it in the wsmile direction produces a smooth transformation from
a non-smiling face to a smiling face in the output image, all else held constant.
A conceptual figure (not shown here) would depict a face gradually increasing
its smile as α (the step along vsmile) increases. This provides intuitive visual
confirmation of the theorem: there is a clear axis in latent space for the concept
of “smile”, and adjusting the scalar coordinate along that axis modulates the
smile in the image.
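The procedure can be sketched end to end on synthetic latents (no real GAN or face annotations are involved; the hidden “smile” attribute is planted by construction): fit a linear SVM, take its normal vector as wsmile, and walk a latent code along the normalized direction.

```python
import numpy as np
from sklearn.svm import LinearSVC

# InterfaceGAN-style sketch with synthetic latents (illustrative stand-in only).
rng = np.random.default_rng(0)
d, n = 64, 1000
true_axis = rng.normal(size=d)
true_axis /= np.linalg.norm(true_axis)

z = rng.normal(size=(n, d))                       # sampled latent codes
smile_intensity = z @ true_axis                   # hidden attribute in this toy world
labels = (smile_intensity > 0).astype(int)        # pretend human annotations

svm = LinearSVC(C=1.0, max_iter=10000).fit(z, labels)
v_smile = svm.coef_[0] / np.linalg.norm(svm.coef_[0])

# In a real pipeline each edited code would be decoded by the generator;
# here we just report how the toy attribute responds to the edit.
z0 = rng.normal(size=d)
for alpha in (-3, 0, 3):
    edited = z0 + alpha * v_smile
    print(f"alpha = {alpha:+d} -> toy smile intensity = {edited @ true_axis:+.2f}")
```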
The existence of these axes has been found for numerous concepts in GANs
and other generative models (62). Some are simple (color changes, lighting
direction), others are high-level (adding objects like trees, changing a building’s
architectural style). Not every concept is perfectly captured by one axis –
sometimes moving along one direction can cause entangled changes (e.g., adding
glasses might also change other facial features slightly, if those were correlated in
the training data). Nonetheless, the fact that many such directions exist at all
attests to a form of linear separability of semantic attributes in deep generative
representations, supporting a key premise of the Ontologic Scalar Modulation
Theorem.
It is also instructive to consider failure cases: when modulation along a
single direction does not cleanly correspond to a concept. This usually indicates
that the concept was *not* purely linear in the chosen representation. For
example, in GANs, “pose” and “identity” of a generated human face might
be entangled; trying to change pose might inadvertently change the identity.
Techniques to mitigate this include moving to a different layer’s representation
or applying orthogonal constraints to find disentangled directions. From the
theorem’s perspective, one could say that the ontology at that layer did not have
“pose” and “identity” as orthogonal axes, but perhaps some rotated basis might
reveal a better aligned concept axis. Indeed, methods like PCA (GANSpace)
implicitly rotate the basis to find major variation directions, which often align
with salient concepts.
4.2 Concept Patching and Circuit Interpretability
Mechanistic interpretability research on feedforward networks and transformers
has embraced interventions that align with our theorem’s implications. For
instance, consider a transformer language model that has an internal representation
of a specific factual concept, such as the knowledge of who the president
of a country is. Suppose concept C = “the identity of the president of France”.
This concept might be represented implicitly across several weights and activations.
Recent work by Meng et al. (2022) on model editing (ROME) was able
to identify a specific MLP layer in GPT-type models where a factual association
like (“France” → “Emmanuel Macron”) is stored as a key–value mapping, and
by perturbing a single weight vector (essentially adding a scaled vector in that
weight space), they could change the model’s output on related queries (63).
While this is a weight space intervention rather than an activation space intervention,
the underlying idea is similar: there is a direction in parameter space
that corresponds to the concept of “who is President of France”, and adjusting
the scalar along that direction switches the concept (to e.g. “Marine Le Pen” if
one hypothetically wanted to edit the knowledge incorrectly).
At the activation level, one can apply *concept patching*. Suppose we have
two sentences: x1 = “The **red apple** is on the table.” and x2 = “The **green
apple** is on the table.” If we consider C = the concept of “red” color, we can
take the representation from x1 at a certain layer and transplant it into x2’s
representation at the same layer, specifically for the position corresponding to
the color attribute. This is a form of setting α such that we replace “green” with
“red” in latent space. Indeed, empirical techniques show that if you swap the
appropriate neuron activations (the ones encoding the color in that context),
the model’s output (e.g. an image generated or a completion) will switch the
color from green to red, leaving other words intact (64). This is essentially
moving along a concept axis in a localized subset of the network (those neurons
responsible for color).
These targeted interventions often leverage knowledge of the network’s *circuits*:
small networks of neurons that together implement some sub-function.
When a concept is represented not by a single direction but by a combination
of activations, one might modulate multiple scalars jointly. Nonetheless, each
scalar corresponds to one basis vector of variation, which could be seen as multiple
one-dimensional modulations done in concert. For example, a circuit for
detecting “negative sentiment” in a language model might involve several neurons;
toggling each from off to on might convert a sentence’s inferred sentiment.
In practice, one might find this circuit via causal experiments and then modulate
it. The theorem can be conceptually extended to the multi-dimensional case: a
low-dimensional subspace W ⊂ Z (spanned by a few vectors wC1, ..., wCk) such
that movement in that subspace changes a set of related concepts C1, ..., Ck.
This could handle cases like a concept that naturally breaks into finer subconcepts.
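A minimal sketch of this multi-dimensional extension is given below: several assumed concept vectors span a subspace, an orthonormal basis of that subspace is obtained by QR decomposition, and the representation is shifted by a vector of scalars within it.

```python
import numpy as np

# Joint modulation inside a concept subspace (toy vectors; the concept
# directions are assumed, not extracted from a real model).
rng = np.random.default_rng(0)
n, k = 32, 3
W = rng.normal(size=(n, k))          # columns stand in for w_C1, ..., w_Ck
Q, _ = np.linalg.qr(W)               # orthonormal basis of the subspace they span

def modulate_subspace(f_x, alphas, basis):
    """Shift f(x) by sum_i alphas[i] * basis[:, i]."""
    return f_x + basis @ np.asarray(alphas)

f_x = rng.normal(size=n)
f_mod = modulate_subspace(f_x, [1.5, -0.5, 0.0], Q)
# How each concept score w_Ci . f changes under the joint shift:
print("per-concept score changes:", np.round(W.T @ (f_mod - f_x), 3))
```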
4.3 Neuroscience Analogies
It is worth reflecting on how the Ontologic Scalar Modulation Theorem relates
to what is known about brain representations. In neuroscience, the discovery of
neurons that respond to highly specific concepts – such as the so-called “Jennifer
Aniston neuron” that fires to pictures of Jennifer Aniston and even the
text of her name (65) – suggests that the brain too has identifiable units (or ensembles)
corresponding to high-level semantics. These neurons are often called
*concept cells* (66). The existence of concept cells aligns with the idea that at
some level of processing, the brain achieves a disentangled or at least explicit
representation of certain entities or ideas. The mechanisms by which the brain
could *tune* these cells (increase or decrease their firing) parallels our notion
of scalar modulation. For instance, attention mechanisms in the brain might
effectively modulate certain neural populations, increasing their activity and
thereby making a concept more salient in one’s cognition.
Recent work using brain-computer interfaces has demonstrated volitional
control of individual neurons: in macaque monkeys, researchers have provided
real-time feedback to the animal from a single neuron’s firing rate and shown
that animals can learn to control that firing rate (essentially adjusting a scalar
activation of a targeted neuron). If that neuron’s firing corresponds to a concept
or action, the animal is indirectly modulating that concept in its brain. This is
a speculative connection, but it illustrates the broad relevance of understanding
how concept representations can be navigated in any intelligent system,
biological or artificial.
On a higher level, our theorem is an attempt to formalize something like a
“neural key” for a concept – akin to how one might think of a grandmother
cell (a neuron that represents one’s grandmother) that can be turned on or off.
While modern neuroscience leans towards distributed representations (a given
concept is encoded by a pattern across many neurons), there may still be principal
components or axes in neural activity space that correspond to coherent
variations (e.g., an “animal vs. non-animal” axis in visual cortex responses).
Indeed, techniques analogous to PCA applied to population neural data sometimes
reveal meaningful axes (like movement direction in motor cortex). The
mathematics of representational geometry is a common thread between interpreting
networks and brains (67, 68).
5 Discussion
The Ontologic Scalar Modulation Theorem opens several avenues for deeper
discussion, both practical and philosophical. We discuss the implications for
interpretability research, the limitations of the theorem, and how our work
interfaces with broader questions in AI and cognitive science.
5.1 Implications for AI Safety and Interpretability
Understanding and controlling concepts in neural networks is crucial for AI
safety. One major risk with black-box models is that they might latch onto
spurious or undesired internal representations that could lead to errant behavior.
By identifying concept vectors, we can audit what concepts a model has
internally learned. For example, one might discover a “race” concept in a face
recognition system’s latent space and monitor or constrain its use to prevent biased
decisions. The ability to modulate concepts also allows for *counterfactual
testing*: “What would the model do if this concept were present/absent?” – this
is effectively what our α parameter adjustment achieves. Such counterfactuals
help in attributing causality to internal features (69, 70).
Our theorem, being a formal statement, suggests the possibility of *guarantees*
under certain conditions. In safety-critical systems, one might want
guarantees that no matter what input, the internal representation cannot represent
certain forbidden concepts (for instance, a military AI that should never
represent civilians as targets). If those concepts can be characterized by vectors,
one could attempt to null out those directions (set α = 0 always) and ensure
the network does not drift along those axes. This is speculative and challenging
(since what if the concept is not perfectly linear?), but it illustrates how
identifying an ontology can lead to enforceable constraints.
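As a sketch of the “null out a direction” idea (under the strong assumption that the forbidden concept really is captured by one linear direction), the code below projects a representation onto the orthogonal complement of vC, driving its linear concept score to zero.

```python
import numpy as np

# Remove the linear component of a forbidden concept from a representation.
# This only suppresses the *linear* part; nonlinearly encoded traces may remain.
rng = np.random.default_rng(0)
n = 32
w_C = rng.normal(size=n)                 # assumed direction of the forbidden concept
v_C = w_C / np.linalg.norm(w_C)

def remove_concept(f_x, v):
    """Return f(x) minus its component along the concept axis v."""
    return f_x - (v @ f_x) * v

f_x = rng.normal(size=n)
print("score before:", round(float(w_C @ f_x), 3))
print("score after :", round(float(w_C @ remove_concept(f_x, v_C)), 3))  # ~0
```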
Moreover, interpretability methods often suffer from the criticism of being
“fragmentary” – one can analyze one neuron or one circuit, but it’s hard to
get a global picture. An ontology-level view provides a structured summary: a
list of concepts and relations the model uses internally. This is akin to reverse-engineering
a symbolic program from a trained neural network. If successful,
it could bridge the gap between sub-symbolic learning and symbolic reasoning
systems, allowing us to extract, for example, a logical rule or a decision tree that
approximates the network’s reasoning in terms of these concepts. In fact, there
is ongoing research in *neuro-symbolic* systems where neural nets interface with
explicit symbolic components; our findings could inform better integrations by
telling us what symbols the nets are implicitly working with.
5.2 Limitations and Complexity of Reality
While the theorem provides a neat picture, reality is more complex. Not all
concepts are cleanly separable by a single hyperplane in a given layer’s representation.
Many useful abstractions might only emerge in a highly nonlinear
way, or be distributed such that no single direction suffices. In such cases,
one might need to consider non-linear modulation (perhaps quadratic effects or
higher) or find a new representation (maybe by adding an auxiliary network that
makes the concept explicit). Our theorem could be extended with additional
conditions to handle these scenarios, but at some cost of simplicity.
Additionally, the presence of **superposition** in neural networks – where
multiple unrelated features are entangled in the same neurons due to limited
dimensionality or regularization (71, 72) – can violate the assumptions of linear
separability. Recent work by Elhage et al. (2022b) studied “toy models of superposition”
showing that when there are more features to represent than neurons
available, the network will store features in a compressed, entangled form (73).
In such cases, wC might pick up on not only concept C but also pieces of other
concepts. One potential solution is to increase dimensionality or encourage sparsity
(74) so that features disentangle (which some interpretability researchers
have indeed been exploring (75)). The theorem might then apply piecewise in
different regions of activation space where different features dominate.
From a technical standpoint, another limitation is that we assumed a known
concept C with an indicator 1C(x). In unsupervised settings, we might not
know what concepts the model has learned; discovering Ω (the ontology) itself
is a challenge. Methods like clustering of activation vectors, or finding extreme
activations and visualizing them, are used to hypothesize concepts. Our framework
could potentially be turned around to *define* a concept by a direction:
if an unknown direction v consistently yields a certain pattern in outputs when
modulated, we might assign it a meaning. For example, one could scan through
random directions in latent space of a GAN and see what changes occur, thereby
discovering a concept like “add clouds in the sky” for some direction. Automating
this discovery remains an open problem, but our theorem provides a way to
verify and quantify a discovered concept axis once you have a candidate.
5.3 Philosophical Reflections: Symbols and Understanding
Finally, we circle back to philosophy of mind. One might ask: does the existence
of a concept vector mean the network *understands* that concept? In the
strong sense, probably not – understanding involves a host of other capacities
(such as using the concept appropriately in varied contexts, explaining it, etc.).
However, it does indicate the network has a *representation* of the concept
in a way that is isomorphic (structurally similar) to how one might represent
it symbolically. Searle’s Chinese Room argument (1980) posits that a system
could manipulate symbols without understanding them. Here, the network did
not even have explicit symbols, yet we as observers can *attribute* symbolic
meaning to certain internal vectors. Whether the network “knows” the concept
is a matter of definition, but it at least has a handle to turn that correlates with
the concept in the world. This touches on the *symbol grounding problem*
(Harnad, 1990): how do internal symbols get their meaning? In neural nets,
the “meaning” of a hidden vector is grounded in how it affects outputs in relation
to inputs. If moving along vC changes the output in a way humans interpret
as “more C”, that hidden vector’s meaning is grounded by that causal role.
Our work thus contributes to an operational solution to symbol grounding in
AI systems: a concept is grounded by the set of inputs and outputs it governs
when that internal representation is activated or modulated (76).
In the context of Kantian philosophy, one could muse that perhaps these
networks, through training, develop a posteriori analogues of Kant’s a priori
categories. They are not innate, but learned through exposure to data, yet once
learned they function as a lens through which the network “perceives” inputs. A
network with a concept vector for “edible” vs “inedible” might, after training on
a survival task, literally see the world of inputs divided along that categorical
line in its latent space. Philosophy aside, this could be tested by checking if
such a vector exists and influences behavior.
Lastly, our interdisciplinary narrative underscores a convergence: ideas from
18th-century philosophy, 20th-century cognitive science, and 21st-century deep
learning are aligning around the existence of *structured, manipulable representations*
as the cornerstone of intelligence. Plato’s Forms might have been
metaphysical, but in a neural network, one can argue there is a “form” of a cat
– not a physical cat, but an abstract cat-essence vector that the network uses.
The fact that independent networks trained on different data sometimes find
remarkably similar vectors (e.g., vision networks finding similar edge detectors
(77), or language models converging on similar syntax neurons) gives a modern
twist to the notion of universals.
6 Conclusion
In this work, we have expanded the “Ontologic Scalar Modulation Theorem”
into a comprehensive framework linking the mathematics of neural network
representations with the semantics of human-understandable concepts. By
grounding our discussion in mechanistic interpretability and drawing on interdisciplinary
insights from cognitive science and philosophy, we provided both
a formal theorem and a rich contextual interpretation of its significance. The
theorem itself formalizes how a neural network’s internal *ontology* — the set
of concepts it represents — can be probed and controlled via linear directions in
latent space. Empirically, we illustrated this with examples from state-of-the-art
models, showing that even complex concepts often correspond to understandable
transformations in activation space.
Our treatment also highlighted the historical continuity of these ideas: we
saw echoes of Kant’s categories and Peirce’s semiotics in the way networks
structure information, and we related the learned latent ontologies in AI to
longstanding philosophical debates about the nature of concepts and understanding.
These connections are more than mere analogies; they suggest that
as AI systems grow more sophisticated, the tools to interpret them may increasingly
draw from, and even contribute to, the philosophy of mind and knowledge.
There are several promising directions for future work. On the theoretical
side, relaxing the assumptions of linearity and extending the theorem to
more complex (nonlinear or multi-dimensional) concept representations would
broaden its applicability. We also aim to investigate automated ways of extracting
a network’s full ontology — essentially building a taxonomy of all significant
vC concept vectors a model uses — and verifying their interactions. On the applied
side, integrating concept modulation techniques into model training could
lead to networks that are inherently more interpretable, by design (for instance,
encouraging disentangled, modulatable representations as part of the loss function).
There is also a tantalizing possibility of using these methods to facilitate
human-AI communication: if a robot can internally represent “hunger” or “goal
X” along a vector, a human operator might directly manipulate that representation
to communicate instructions or feedback.
In conclusion, the Ontologic Scalar Modulation Theorem serves as a bridge
between the low-level world of neurons and weights and the high-level world
of ideas and meanings. By traversing this bridge, we take a step towards AI
systems whose workings we can comprehend in the same way we reason about
programs or symbolic knowledge – a step towards AI that is not just intelligent,
but also *intelligible*. We believe this line of research will not only improve
our ability to debug and align AI systems, but also enrich our scientific understanding
of representation and abstraction, concepts that lie at the heart of
both artificial and natural intelligence.
References
[1] Bereska, L., & Gavves, E. (2024). Mechanistic Interpretability for AI Safety: A Review. arXiv:2404.14082.
[2] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., et al. (2021). A Mathematical Framework for Transformer Circuits. Distill (Transformer Circuits Thread).
[3] Bau, D., Zhou, B., Khosla, A., Oliva, A., & Torralba, A. (2017). Network Dissection: Quantifying Interpretability of Deep Visual Representations. In Proc. CVPR.
[4] Bau, D., Zhu, J.-Y., Strobelt, H., Tenenbaum, J., Freeman, W., & Torralba, A. (2019). GAN Dissection: Visualizing and Understanding Generative Adversarial Networks. In Proc. ICLR.
[5] Kim, B., Wattenberg, M., Gilmer, J., Cai, C., Wexler, J., Viegas, F., & Sayres, R. (2018). Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proc. ICML.
[6] Goh, G., Sajjad, A., et al. (2021). Multimodal Neurons in Artificial Neural Networks. Distill.
[7] Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., & Bachem, O. (2019). Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In Proc. ICML.
[8] Newell, A., & Simon, H. (1976). Computer Science as Empirical Inquiry: Symbols and Search. Communications of the ACM, 19(3), 113–126.
[9] Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 28(1–2), 3–71.
[10] Harnad, S. (1990). The Symbol Grounding Problem. Physica D, 42(1–3), 335–346.
[11] Kant, I. (1781). Critique of Pure Reason. (Various translations).
[12] Plato. (c. 380 BC). The Republic. (Trans. Allan Bloom, 1968, Basic Books).
[13] Peirce, C. S. (1867). On a New List of Categories. Proceedings of the American Academy of Arts and Sciences, 7, 287–298.
[14] Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company.
[15] Quian Quiroga, R. (2012). Concept Cells: The Building Blocks of Declarative Memory Functions. Nature Reviews Neuroscience, 13(8), 587–597.
[16] Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39–41.
[17] Lenat, D. B. (1995). CYC: A Large-Scale Investment in Knowledge Infrastructure. Communications of the ACM, 38(11), 33–38.
"A central challenge in interpretability is to bridge the gap between the model’s low-level numerical operations and the high-level semantic concepts by which humans understand the world."
ReplyDelete- This is the EXACT function of our SymbolNet AI Programming Language and it works every time.