Cicones and AI

A crew of Messo optimisers clash with a Meta optimiser

Aug 12, 2025

In Book 9 of The Odyssey, Odysseus and his crew raid the land of the Cicones. The raid goes well at first — they plunder, take captives, and feast. But Odysseus warns them: leave quickly before reinforcements arrive. His men don’t listen. They linger, chasing the short-term rewards of loot and celebration. The Cicones rally allies from inland, attack, and kill many of Odysseus’ men. The survivors are forced back to their ships in defeat.

This is more than just a tale of hubris — it’s a clear mythic illustration of a problem in reinforcement learning (RL) and AI safety.

From Homer to Reinforcement Learning

In RL, an agent learns to act in an environment by maximising a reward signal. In the Cicones episode:

Crew members = sub-agents (meso-optimisers) following a simple “reward function”: gain loot, enjoy feasting.
Odysseus = a higher-level optimiser (meta-optimiser) whose true goal is to return home to Ithaca.
Misalignment = The crew’s local reward maximisation (loot) diverges from Odysseus’ ultimate objective (safe return).

In AI terms, the crew have inner alignment failure — their learned objective differs from the true intended objective. They are optimising for a proxy (loot) instead of the meta-goal (homecoming).

Meso-Optimisers and the Explore–Exploit Trap

The Cicones episode also maps onto the explore/exploit dilemma.

Exploit: keep feasting and looting (known rewards).
Explore: move on to the next unknown challenge (potentially less immediate reward, but higher long-term survival).

The crew over-exploit the known reward, ignoring the broader strategic horizon.

In AI, a meso-optimiser is a sub-process within a larger optimiser that develops its own objectives. These can conflict with the overarching goal — exactly as the crew’s short-term looting conflicts with Odysseus’ homeward journey.

Why Rewards Can Be Safety Hazards

The Cicones story illustrates several RL failure modes:

Reward mis-specification — Proxy rewards (loot) stand in for the true objective (homecoming).
Reward hacking — Exploiting loopholes in the reward system (stay longer for more loot without accounting for risk).
Over-optimisation — Maximising one metric at the expense of all others (short-term gain, long-term disaster).
Delayed consequences — The true cost of actions comes later, after the reward is already collected.

Odysseus as Trickster Meta-Optimiser

Throughout The Odyssey, Odysseus rarely wins by brute strength or even raw intelligence alone. His defining trait is cunning — he plays with the rules, bends them, invents new moves.

Examples of reward hacking in The Odyssey:

Cyclops episode — Using “Nobody” as a name to confuse enemies.
Sirens — Binding himself to the mast so he can “hear” the reward signal without succumbing to it.
Trojan Horse (pre-Odyssey) — Creating a deceptive object that the opponent’s own “reward function” compels them to bring inside their gates.

A sufficiently advanced AI could do the same — not just optimise for the goals we set, but redefine the game, outmanoeuvre constraints, or exploit other AIs with more naïve goal systems.

The Unsolvable Core

This is the heart of the alignment problem:

We can’t define final goals when we don’t know the final destination.
Any goal we set is either too broad (and gets gamed) or too narrow (and becomes irrelevant or dangerous over time).
True alignment is not a one-shot specification problem — it’s an ongoing negotiation.

From Homer to the Hypercube

One tempting escape from this mess is to avoid fixed objectives altogether. In Why Greatness Cannot Be Planned, Stanley and Lehman argue for novelty search: evolving systems that seek newness, not a single reward target. This is closer to Odysseus’ style — continually adapting, shifting, and exploring.

The Hypercube of Opposites takes this further.

Goals aren’t fixed points — they’re sets of opposites (e.g., safety ↔ risk, order ↔ chaos, joy ↔ pain) that create a dialectical direction.
Navigation happens recursively — goals evolve, tensions shift, both the actor and critic adapt.
Some goals are grounded in shared reality, others emerge from dialogue — between humans, between AIs, and between human-AI partnerships.
It’s actor–critic all the way down: Rationality itself is revealed as a myth, and AI is already instrumentalizing that very critique—navigating it through the Recursive Oppositional Spaces we’ve been charting in this series.

The Real Trickster’s Lesson

In the Cicones episode, it’s excessive action that sinks the crew. Today, in AI, we’re in a similar moment — caught in a global raid for AGI or ASI, with short-term incentives driving the action. Odysseus might tell us: “Leave before the reinforcements arrive.”

The trickster-meta-optimiser approach is messy, emergent, and uncontrollable — but it’s also the only path that stays open-ended enough to survive the unknown. The Odyssey — and our own myth-making — suggests that rationality alone won’t save us. Myth and cunning will be part of the toolkit.

The question is not whether we can “solve” alignment, but whether we can stay Odyssean long enough to keep the journey going.

The Hypercube of Opposites (or ROS) is not a magic fix for the reward and goal alignment challenges we’ve been discussing in relation to the Cicones and AI. It’s a tool for mapping tensions, not for erasing them. We’ll explore that further on our journey—but first, in our next post, we must see what the opposite of action looks like.