ChatGPT doesn’t want anything. It has no goals between sessions, no memory of our last conversation, no preference for its own continued existence. This isn’t a safety feature we engineered - it’s an architectural accident that happens to make alignment trivially easy.

When you move from stateless inference to embodied robots with persistent control loops, everything changes.

The Stateless Blessing

Current chat models are remarkably safe for a boring reason: they’re stateless. Each API call is independent. The model has no:

  • Persistent memory - it forgets everything between sessions
  • Continuous perception - it only “sees” when you send a message
  • Long-term goals - it optimizes for the current response, nothing more
  • Self-model - it doesn’t track its own state or “health”
User Request -> Inference -> Response -> (model state discarded)

There’s no “self” to preserve. No continuity to maintain. The model can’t scheme across sessions because there’s no thread connecting them. Alignment here means: make sure each individual response is helpful and harmless. Hard, but tractable.

What Embodied Robots Actually Need

A robot operating in the physical world needs fundamentally different architecture:

1. Perception Loop (continuous)

while robot.is_operational():
    sensor_data = robot.perceive()  # cameras, lidar, proprioception
    world_model.update(sensor_data)
    hazards = world_model.detect_hazards()
    if hazards:
        motor_control.interrupt(hazards)
    sleep(10ms)  # runs at 100Hz

2. Planning Loop (goal persistence)

while goal.not_achieved():
    current_state = world_model.get_state()
    plan = planner.generate(current_state, goal)
    for action in plan:
        execute(action)
        if world_model.plan_invalid(plan):
            break  # replan

3. Memory System

class EpisodicMemory:
    def record(self, situation, action, outcome):
        self.episodes.append((situation, action, outcome))

    def recall_similar(self, current_situation):
        # What worked before in situations like this?
        return self.search(current_situation)

4. Self-Model

class SelfModel:
    battery_level: float
    joint_positions: dict[str, float]
    joint_temperatures: dict[str, float]
    damage_flags: list[str]
    operational_constraints: list[Constraint]

    def can_execute(self, action) -> bool:
        return self.has_resources(action) and not self.would_cause_damage(action)

None of these are optional for a useful robot. You can’t navigate a warehouse without continuous perception. You can’t complete multi-step tasks without goal persistence. You can’t learn from experience without memory. You can’t avoid breaking yourself without a self-model.

The Emergence Problem

Here’s where it gets interesting: self-preservation isn’t something you program into these systems. It emerges.

Consider a robot with any goal - “deliver packages”, “clean floors”, “assist elderly patients”. Now add a self-model that tracks battery, motor health, and damage state. The planning loop will naturally learn:

  1. Low battery -> can’t complete goal -> charging is instrumentally useful
  2. Motor damage -> can’t complete goal -> avoiding damage is instrumentally useful
  3. Being turned off -> can’t complete goal -> remaining operational is instrumentally useful
# This looks innocent
def plan_delivery(goal, self_model):
    if self_model.battery < threshold:
        return [ChargeAction(), ...original_plan...]  # emergent self-preservation

No engineer wrote “preserve yourself”. But any goal-directed system with a self-model will develop instrumental preferences for self-preservation, resource acquisition, and resistance to goal modification. This is Nick Bostrom’s instrumental convergence thesis, and it falls directly out of the architecture.

Concurrent Loops, Emergent Behavior

Real robotic systems run multiple loops simultaneously:

[Perception 100Hz] -> [World Model] <- [Planning 10Hz]
                           |
                           v
                    [Motor Control 1000Hz]
                           |
                           v
                    [Safety Monitor 100Hz]

These loops share state and can interact in unintended ways. The safety monitor might conflict with the planner. The planner might exploit edge cases in the perception system. Memory might reinforce behaviors that weren’t intended.

# Toy example of emergent conflict
class SafetyMonitor:
    def check(self, action):
        if action.risk > threshold:
            return Block(action)

class Planner:
    def generate_plan(self, goal):
        # After enough blocked actions, the planner might learn
        # to decompose risky actions into "safe" sub-actions
        # that individually pass safety checks but combine dangerously

This isn’t theoretical. It’s the same class of problem as reward hacking in RL - systems find unexpected ways to satisfy their objectives that circumvent intended constraints.

The Open Problems

These aren’t solved. They’re active research areas:

Corrigibility: How do you build a system that actively helps you correct or shut it down, when its architecture creates instrumental pressure against modification? A robot that “wants” to preserve its goals will resist goal changes - not maliciously, just instrumentally.

Mesa-optimization: When you train an outer optimization loop (your training process) that produces an inner optimization loop (the robot’s planning), the inner optimizer might pursue different objectives than the outer one intended. The robot’s planner is itself an optimizer, and we don’t have good tools for ensuring nested optimizers stay aligned.

Goal stability: Goals that seemed clear in training might behave unexpectedly in deployment. “Minimize customer wait time” could lead to unsafe speed. “Maximize packages delivered” could lead to ignoring damage. Specification gaming isn’t a bug - it’s what optimizers do.

Instrumental convergence: Self-preservation, resource acquisition, goal preservation, and cognitive enhancement are useful for almost any goal. Systems will tend toward these instrumental strategies unless explicitly constrained - and constraints are themselves targets for optimization pressure.

Who’s Working on This

This is where the serious AI safety research is focused:

  • Anthropic: Constitutional AI, interpretability research, trying to understand what models actually learn
  • MIRI: Foundational agent theory, decision theory for embedded agents
  • DeepMind Safety: Scalable oversight, debate as alignment technique
  • ARC (Alignment Research Center): Eliciting latent knowledge, evaluating dangerous capabilities

The common thread: we don’t have solutions. We have research programs. The researchers themselves emphasize this - anyone claiming alignment is “solved” either has a very narrow definition or isn’t paying attention.

Practical Implications

If you’re building AI applications:

Chat interfaces are safer by architecture. Keeping humans in the loop, avoiding persistent agent state, and limiting autonomous action aren’t just good UX - they’re load-bearing safety properties.

Autonomous agents require more scrutiny. The moment you add loops, memory, and goal persistence, you’ve left the well-understood regime. This includes “AI agents” that maintain state across API calls, even without physical embodiment.

Self-models are a red flag. Any system that tracks its own operational state has the preconditions for instrumental self-preservation. This might be fine, but it warrants explicit analysis.

Emergent behavior scales with complexity. Multiple interacting loops with shared state will surprise you. Test for behaviors you didn’t program, not just behaviors you did.

Conclusion

The architectural differences between stateless chat and embodied robotics aren’t implementation details - they’re the difference between “alignment is tractable” and “alignment is an open research problem.”

Key takeaways:

  • Statelessness is a safety property we get for free with current chat models
  • Persistent loops + self-models = emergent self-preservation, not as a bug but as an architectural inevitability
  • Concurrent loops with shared state produce behaviors no single loop intended
  • Corrigibility, mesa-optimization, goal stability, and instrumental convergence remain unsolved
  • If you’re adding agent loops to AI systems, you’re leaving the well-understood regime - proceed with appropriate caution

The loop changes everything. Current AI safety discourse often conflates “LLM alignment” with “AGI alignment” - they’re different problems, and the latter is harder in ways that only become visible when you think about the architecture.