Bounded Alignment: What (Not) To Expect from AGI Agents
On July 2, I presented a paper entitled “Position Paper: Bounded Alignment: What (Not) To Expect From AGI Agents” at the International Joint Conference on Neural Networks (IJCNN’2025) in Rome. Though constrained by the conference’s page limit, the paper is a good summary of my thinking on alignment in AI systems.
The full paper can be accessed through the conference proceedings on IEEE Xplore or on arXiv. Here, I use my presentation slides to discuss the main points of the paper. This article also elaborates on many ideas that are not explicit in the slides.
The AI community defines artificial general intelligence (AGI) in a way that is too anthropocentric, too utilitarian, and too focused on cognitive functions. This definition may work well for commercial purposes but is of dubious scientific validity. However it is defined, AGI does not yet exist; the focus in this article is to look at how it might come to exist and what the consequences might be.
AI alignment is typically defined as making AI agents consistent with humans along several dimensions – notably values. The term “alignment” is confusing because it implies a target with which the system must align itself, but human preferences, objectives, and values are hard to define objectively. What AI alignment actually aims to do – or should aim to do – is to embed a set of ethical principles, or values, within the AI system that constrain its objectives and behaviors.
The main point made in the paper is that, to be meaningful, general intelligence should refer to the kind of intelligence seen in biological agents, and AGI should be given a corresponding meaning in artificial ones. Importantly, any general intelligence must have three attributes – autonomy, self-motivation, and continuous learning – that make it inherently uncertain and uncontrollable. As such, it is no more possible to perfectly align an AI agent with human preferences than it is to align the preferences of individual humans with each other. The best that can be achieved is bounded alignment, defined as demonstrating behavior that is almost always acceptable – though not necessarily agreeable – to almost everyone who encounters the AI agent. This is the degree of alignment we expect from human peers, and it is typically developed through consent and socialization rather than coercion.

A crucial point is that, while alignment may refer in the abstract to values and objectives, it can only be validated in terms of behavior, which is the only observable. The idea of probing the internals of an autonomous AI agent via mechanistic interpretability to ascertain alignment at the brain level is not practical because: a) we cannot assume to always have access to the internals of all AI agents; and b) future agents will, in general, be too diverse, nonlinear, and complex to analyze.

However, this should not preclude us from caring about how the behavior is produced. While we cannot hope to measure and explain this in AGI agents in the field, we should try to build agents in a way that makes them more likely to be aligned. One approach is to give them perceptual and cognitive mechanisms that are closer to biological ones, making it likelier that the motivations and processes underlying the resulting behavior are comprehensible to us.
A good starting point for this is to acknowledge that, in trying to build AGI agents, we are creating new species of animals. While they will necessarily be very different from biological animals, we can (try to) control the degree of this difference, thus making bounded alignment possible. Without this, we risk creating intelligences so alien that no understanding is possible between them and humans.
The next slide proposes a biological, agent-centric definition of general intelligence, and lists some of its characteristics. The most important one of these is that the purpose of intelligence is to benefit the agent, not its users. Much of the apparent difficulty in AI alignment arises out of confusion over this point. A system that always privileges the purposes of its users over its own is, at best, a complex tool, not an intelligent agent.
The next slide lists (some of) the reasons why aligning AGI will be hard. Subsequent slides discuss these in more detail, but the focus in the rest of the presentation is on the last point: the role of an agent’s form in shaping its mental experience. Though not discussed further in the presentation, the paper also elaborates on the third and fourth points regarding the emergence and non-stationarity of intelligence – and thus of values – in any conceivable AGI agent. Almost all the work on AI alignment and safety has focused on AI systems that are still within human reach, but eventually, AGI agents will have to be autonomous, and the objectives and values that they develop emergently as a result of continuous learning from experience will be beyond human control.
The next slide emphasizes the point made earlier that the very attributes that make an AGI agent useful also make it inherently risky. These attributes – termed P-attributes – include autonomy, creativity, self-motivation, imagination, continuous learning, and others, all of which can enable behaviors such as deception, dissembling, and disobedience. Trying to explicitly eliminate these behaviors while retaining the essential attributes of intelligence is impossible. The goal, instead, should be to seek limitation of risk through bounded alignment – possibly by means of voluntary self-control, as in humans.
Discussions on mitigating existential risk from AI often turn to comparisons with the control of nuclear, chemical, and biological weapons (NCBW). This slide argues why the comparison is not valid. AGI, if it comes to pass, will be a technology unlike any other ever invented by humans. NCBW are well-understood, narrow-use technologies that require human agency for their deployment and use. AGI will be autonomous, self-replicating, self-improving, and extremely broad in scope of application. Humans will voluntarily insert it into every part of their lives, creating a pervasive and fundamentally uncontrollable risk.
The rest of the presentation focuses on the primary factor that makes AGI alignment inherently difficult. This next slide lays out the abstract mental architecture that any AGI agent must have – whether by deliberate design or emergently as a result of evolution and learning. It will have perceptual and behavioral affordances, a world model, internal drives, a cognitive hierarchy, and various memory systems. All of these will be organized at multiple scales and have complex dynamics across this spectrum of organization.
As implied in Thomas Nagel’s seminal paper, “What is it like to be a bat?”, the experience of an agent is shaped fundamentally by its perceptual and behavioral affordances, and can, at best, be simulated or emulated – not replicated – by agents with different affordances. The agent’s world model too is grounded through these affordances, and the affordances, in turn, derive fundamentally from the form of the agent. Indeed, all aspects of its mental experience – including meanings and values – are shaped by form. Since any AGI agent will have a form very different from that of humans or other animals, the mind that emerges from it will also necessarily be different – an alternative intelligence. How different it may be is a critically important question.
The word “form” is used here instead of the more common “embodiment” for two reasons: a) Embodiment typically refers to the macro-scale structure of the agent, but form covers organization at all scales down to the cellular and molecular – all of which are relevant; and b) AGI agents will often be virtual and won’t have physical embodiment, but they will still have form and the same canonical mental architecture as shown in the previous slide.
The next slide simply illustrates two ways in which meaning derives fundamentally from form. In the first example, both the human and dog are “catching” a ball, but the difference in their behavioral affordances makes the actual experiences very different, with convergence only at an abstract level. This difference matters little between a human and a dog, but may result in grave misunderstandings in the case of human-AI interaction where both agents have significant cognitive complexity.
The second example illustrates the more obvious but equally important case of misunderstanding arising due to stark difference in mental capacity. In principle, an AGI agent that is much more intelligent than humans could think thoughts and try to communicate information that a human simply could not understand because of inferior mental capacity, making alignment very difficult.
To understand why humans and AGI may have irreconcilable misunderstandings, it’s useful to think about why humans understand each other so well. A big part of the answer is that humans have a very good theory of mind for fellow humans due to a number of things they share. Some of these are biological and others experiential, but together, they give a human agent an excellent basis to interpret – and thus understand – the behavior of other humans, including linguistic communication. This mutual comprehension is absolutely essential for any alignment between agents in terms of values, objectives, and behavior.
But what about human comprehension of other species? This is a very important question, given that building AGI agents is, in a real sense, the creation of new non-biological living species.
This next slide argues that, for all the issues raised in Nagel’s paper, there is still a basis for some limited mutual comprehensibility between humans and other animals with brains because of factors rooted in our common biological nature. In addition to the obvious commonalities such as a shared biochemistry, homologous genes and tissues, etc., there is also a deep kinship based on the fact that all animals are products of the same evolutionary process, and thus share the same fundamental drives and objectives: Self-preservation, a desire for homeostasis, sustenance of organization, and reproduction. Thus, while we cannot share their experience of the world, it is possible for us to partially understand the behaviors and motivations, not only of a fellow mammal such as a bat, but even of more distantly related species, because we can identify with those behaviors in human terms. Whether the other animals comprehend our behavior as well is less clear, but also matters less because of the gulf in mental capacity between them and us.
Clearly, we do have sufficient mutual comprehension with some animals such as dogs, cats, and horses, and expect them to be able to interpret our behavior….
….. and with other animals, not so much.
This deep basis of mutual comprehension rooted in biology is nonexistent between humans and AI today. Instead, we have to rely on much more abstract and explicit frameworks of communication such as language – and perhaps mathematics – to establish comprehension. But, given that such communication does not encompass all behavior, is quite expensive, and is, in many cases, physically impractical, such frameworks may prove inadequate as the basis of mutual theories of mind between humans and more advanced AGI agents, making alignment superficial at best. Humans are not aligned with each other based purely on language or mathematics, but on the deeper factors discussed above. There is a real danger that any AGI we create based purely on maximization of intelligence or linguistic competence may have a mind completely alien to humans, even though both speak the same language and appear to agree verbally. This is illustrated in the slide after this one.
NB: “Permutation City” and “Diaspora” are very prescient books by the Australian sci-fi writer, Greg Egan. They depict worlds where humans can inhabit robot bodies or upload themselves into cyberspace, and explore many interesting issues that such transformations might raise.
The next slide illustrates a central element in the paper’s critique of AI models, comparing naturally intelligent systems and today’s ML-based AI systems.
In the former, abstract, slower and deliberative higher cognitive “System 2” intelligence emerges through experience-driven learning on a substrate of faster, instinctive, affective, and embodied “System 1” intelligence configured through evolution, development and physical experience. In the case of humans, this is augmented with human-generated data such as text, video, code, etc. Supervised learning mainly plays a fine-tuning role. The strong inductive biases of System 1 – produced by 3 billion years of evolutionary engineering in the real world – enable System 2 to be efficient and grounded in the causal structure of reality.
In contrast, non-embodied AI systems begin with generic network architectures, and are trained only on higher cognitive functions such as language, reasoning, analysis, coding, etc. Most importantly, such a system learns from second-hand data – text, code, images, video – rather than data derived from its own direct sensorimotor experience of the world. Nor does it have deep inductive biases to inform it about real-world causality. As such, it is not like System 2 in a human, and though it may become more powerful than human cognition in some ways, its wisdom is grounded entirely in patterns present in its training data and not in experienced reality. It would be a mistake, however, to think that the system’s lack of prior biases somehow makes it more objective. In fact, the inductive nature of its training ensures that, as we embed more information into the model – and, in particular, as we give it more autonomy and agency – a latent set of biases implying instincts, drives, and affect will necessarily develop emergently like a phantom limb, and influence the system’s overt behavior in unexpected ways. And, while the instincts, drives, and affect of natural agents are grounded in evolution and experience, those of the artificial system will be unconstrained – depending only on the biases implicit in the training data and network architecture, and on the vagaries of the training process. Whether something like instrumental convergence might ensure that the latter type of emergent system is comprehensible to us, or whether it is utterly alien, is an open question.
So is there no hope? The answer to this question is mixed. First, we must temper our expectations for alignment, acknowledging that perfect alignment is a pipe dream. But second – more constructively – we could transform our framework for building AGI from one that focuses purely on performance as measured through objective (or loss) functions to one grounded more in biology and psychology, thus building agents with minds more like our own. Within this framework, the paper proposes some concrete steps towards building a safer AI, including the following: 1) Explicitly trying to build AGI agents equipped to comprehend humans through a good theory of mind; 2) Building in instinctive modes of expression that expose dangerous misalignment without requiring invasive probing; 3) Making AGI agents physically and psychologically dependent on humans in ways that cannot be altered without self-harm; and 4) Training AGI agents through a developmental learning process that integrates ethics and values at all stages of learning, so that they become inextricably integrated into all aspects of the agent’s mind, and it becomes psychologically intolerable for the agent to violate them – as we expect in humans of character. More succinctly, the proposal is to explicitly build a prior system of instincts, drives, and affect that can provide strong inductive biases for higher cognitive learning, rather than allowing such a system to emerge in an unconstrained way.
There may be many ways to implement these recommendations, but one is to learn from biological inspiration. This applies immediately to the process used for aligning today’s AI systems. Though the details vary between systems, the broad approach is to first train a “wild” foundation model on a very large amount of data without much quality control, and then to “civilize” this model through various stages of fine-tuning on curated high-quality data, reinforcement learning from human preferences, learning from constrained generative models, and embedding ethics into system prompts. Other approaches involve using post-hoc control to detect and deter misaligned behavior. The problem with these approaches is how unnatural they are – like trying to civilize an adult raised by wolves. It is very likely to leave dangerous atavistic tendencies latent in the model, ready to be activated via malicious attacks or an accidental situation.
In part, these approaches to alignment derive from trying to adapt the classic engineering frameworks of design, optimization, testing, and control to AI agents, which is a category error. Complex adaptive systems such as living organisms and artificial intelligent agents do not fit into this framework, and the standard engineering concepts of stability, predictability, controllability, optimality, long-term performance guarantees, etc., simply do not apply. It would be much more useful to think of AGI agents – and even today’s AI agents – as we think of humans and other animals, rather than the way we think of software and automobiles.
The natural alternative to today’s alignment and safety approaches is to train the model the way humans raise children – through a staged developmental process, where ethics are embedded into learning from the beginning, moving gradually from simpler to more complex cases as the model becomes more capable of dealing with complexity. This is also likely to make the learning process more data efficient because the simpler models, behaviors, preferences, and values learned in earlier stages can act as inductive biases for the more complex stages: A child trained to behave well needs much less direction to be a responsible adult than a child without such training.
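As a loose illustration (not from the paper), the staged idea above resembles what machine learning calls curriculum learning. The toy sketch below shows the mechanism in miniature: a tiny linear scorer is trained on simple examples first, and the weights it acquires there persist as priors when a later, more complex stage is added. All names and data here are hypothetical, and a real system would of course be vastly more complex.

```python
# Toy curriculum-learning sketch (illustrative only, not the paper's method).
# A tiny perceptron-style scorer is trained in stages of increasing complexity;
# weights learned in earlier stages carry over as priors for later stages.

def train_stage(model, examples, lr=0.1, epochs=50):
    """Fit a linear scorer on (features, label) pairs; labels are +1/-1."""
    for _ in range(epochs):
        for x, y in examples:
            score = sum(w * xi for w, xi in zip(model, x))
            pred = 1 if score > 0 else -1
            if pred != y:  # update only on mistakes, nudging toward the label
                model = [w + lr * y * xi for w, xi in zip(model, x)]
    return model

# Each stage adds complexity, but training reuses the weights (the "priors")
# learned so far instead of starting from scratch.
stages = [
    [([1, 0], 1), ([-1, 0], -1)],                # stage 1: one salient feature
    [([1, 1], 1), ([-1, 1], -1), ([1, -1], 1)],  # stage 2: distractor added
]

model = [0.0, 0.0]
for stage in stages:
    model = train_stage(model, stage)

print(model)  # the weight on feature 0, learned in stage 1, persists
```

The point of the sketch is the analogy in the paragraph above: because stage 1 already settled the "value" attached to feature 0, stage 2 requires no corrective updates at all, just as a well-raised child needs less direction to become a responsible adult.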
The perspective on general intelligence and alignment argued in this paper leads naturally to several open questions. Not all of them may be answerable, but they are worth asking. One question (no. 3 on this slide) deserves much more attention than it is getting today, probably because the field of AI has been captured by a single class of models. However, there is every likelihood that, ultimately, AI agents will take many highly disparate forms – both embodied and virtual – with extremely diverse affordance spaces. The interaction of this complex ecosystem of intelligent species with humans will pose very serious challenges in terms of alignment and safety. A plausible way to mitigate this would be to limit AI agents to a small number of canonical mental and physical architectures, with diversity emerging via variations on these rather than through ad-hoc construction of an arbitrarily diverse menagerie of agents.
The final slide lists a set of conclusions arguing for a realistic view of alignment and AGI. A critical item in this list is to think seriously about a world where a vast number of AI agents coexist with billions of humans in an extremely complex ecosystem – to think not only from the perspectives of engineering and business, but also those of ecology, sociology, psychology, philosophy, and – most importantly – biology. Experts in the basic sciences, humanities, and social sciences must take as active a role in building an AGI future as engineers and business leaders.
Many of the arguments made in this paper go against the conventional wisdom in AI today. Some may be vindicated over time; others repudiated as AI progresses. The field of AI today is trapped in a rather narrow vision of artificial general intelligence and how to achieve it. For pragmatic and commercial reasons as much as scientific ones, this vision centers learning from data rather than experience, and – in line with Sutton’s “Bitter Lesson” – is oblivious to biological inspiration. In some knowledge-centric domains, this may be sufficient, but not in all, and that is why the approach is unlikely to lead to truly general intelligence. For that, we will need systems that begin with priors grounded in reality and learn from direct experience of the world. But, even with this change in approach, deliberate choices will need to be made to ensure that the AGI systems we build are comprehensible to us, and we to them, so that mutual understanding, trust, and cooperation may follow. It will not happen automatically.