absolutely, perfectly safe —

OpenAI checked to see whether GPT-4 could take over the world

"ARC's evaluation has much lower probability of leading to an AI takeover than the deployment itself."

A sensational, hyperbolic AI-generated image of the earth enveloped in an explosion. Credit: Stable Diffusion

As part of pre-release safety testing for its new GPT-4 AI model, launched Tuesday, OpenAI gave an outside AI testing group, the Alignment Research Center (ARC), access to assess the potential risks of the model's emergent capabilities, including "power-seeking behavior," self-replication, and self-improvement.

While the testing group found that GPT-4 was "ineffective at the autonomous replication task," the nature of the experiments raises eye-opening questions about the safety of future AI systems.

Raising alarms

"Novel capabilities often emerge in more powerful models," writes OpenAI in a GPT-4 safety document published yesterday. "Some that are particularly concerning are the ability to create and act on long-term plans, to accrue power and resources (“power-seeking”), and to exhibit behavior that is increasingly 'agentic.'" In this case, OpenAI clarifies that "agentic" isn't necessarily meant to humanize the models or declare sentience but simply to denote the ability to accomplish independent goals.

Over the past decade, some AI researchers have raised alarms that sufficiently powerful AI models, if not properly controlled, could pose an existential threat to humanity (often called "x-risk," for existential risk). In particular, "AI takeover" is a hypothetical future in which artificial intelligence surpasses human intelligence and becomes the dominant force on the planet. In this scenario, AI systems gain the ability to control or manipulate human behavior, resources, and institutions, usually leading to catastrophic consequences.

As a result of this potential x-risk, philosophical movements like Effective Altruism ("EA") seek ways to prevent AI takeover from happening. That work often involves a separate but interrelated field called AI alignment research.

In AI, "alignment" refers to the process of ensuring that an AI system's behaviors align with those of its human creators or operators. Generally, the goal is to prevent AI from doing things that go against human interests. This is an active area of research but also a controversial one, with differing opinions on how best to approach the issue, as well as differences about the meaning and nature of "alignment" itself.

GPT-4's big tests


While the concern over AI "x-risk" is hardly new, the emergence of powerful large language models (LLMs) such as ChatGPT and Bing Chat—the latter of which appeared very misaligned but launched anyway—has given the AI alignment community a new sense of urgency. They want to mitigate potential AI harms, fearing that much more powerful AI, possibly with superhuman intelligence, may be just around the corner.
