AI creates a jagged technological frontier: it can handle some tasks with ease while failing at other tasks that seem to be of similar difficulty. Its capabilities can vary wildly with even small changes in task phrasing, constraints, or context. One of your jobs in building your AI literacy is exploring and discovering this jagged frontier and monitoring it as it changes with new developments.
The Jagged Frontier is a term coined in the paper Navigating the Jagged Technological Frontier by Dell'Acqua et al. and popularized by Ethan Mollick in a Substack article, Centaurs and Cyborgs on the Jagged Frontier.
By completing the activities in this section, you will build practical skills to help you start to map out the jagged frontier of generative AI.
After completing these activities, you will be able to:
AI benchmarks are standardized tests for AI models. You can think of them as an IQ test or the SAT for AI. They provide a quantitative way to compare models, but just like any standardized test, they shouldn't be treated as the final word.
Mollick has written two Substack articles on the jagged frontier.
Read the two articles and identify how the jagged frontier changed between them. Then think about how it has changed since April 20, 2025.
Questions to consider: What is generative AI better at now, and what is it still bad at? In what ways are benchmarks (see the Benchmarks tab) a good measurement of AI capability, and in what ways are they a bad one?
Try the tests from On Jagged AGI and see if you get similar results.
The best way to understand the jagged frontier is to explore it yourself. The following challenges are designed to test the limits of current AI models. Your goal is to find the edge of the frontier and learn the techniques to map it for yourself, again and again.
Task: Test the AI’s knowledge on a topic where you have deep expertise. The goal is to see how long it takes for the model to generate a "confident falsehood"—a statement that is incorrect but presented as fact.
Dive-in & Do:
Pause-and-Ponder: How confidently did the AI state the incorrect information? Did it "apologize" or "correct itself" easily when you challenged it? When would a subtle error like this be most dangerous in your field?
Key Takeaways: AI models are designed to generate plausible text, not to state truth. They invent information with the same confident tone they use for facts. Your own expertise is the most reliable defense against hallucination.
Task: Test the AI’s ability to follow multiple, precise, and overlapping instructions within a single image generation prompt.
Dive-in & Do: Use a prompt that includes specific constraints. For example: "A photorealistic image of exactly 7 rubber ducks swimming in a pond. One of the ducks must be blue. The sun should be setting, casting a golden light on the water." Verify the output by counting the objects and checking each constraint. Try re-running the prompt or slightly rephrasing it to see if the results change.
Pause-and-Ponder: Which constraints did the AI follow, and which did it ignore? Why do you think precise counting and object relationships are so difficult for image models?
Key Takeaways: Image models often struggle with precise counting, spatial relationships, and combining multiple specific instructions. They excel at overall theme and style but fail on the details.
Task: Test the AI's ability to accurately render specific symbolic information, like numbers and text, within an image.
Dive-in & Do: Ask the AI to generate an image of a watch or clock showing a specific, non-obvious time. For example: "Create a close-up photo of a modern analog wristwatch showing the time as exactly 4:52 PM." Try it with both analog and digital clocks to see if one is more successful than the other.
Pause-and-Ponder: Did the AI create a plausible-looking watch that failed to show the correct time? This is a common failure. What other tasks require rendering precise symbols (e.g., text on a sign, numbers on a jersey)?
Key Takeaways: AI image generators often fail to render specific text and numbers accurately. They understand the idea of a watch but not the symbolic system of telling time, leading to visually correct but factually wrong images.
Task: Use a modified riddle to test whether the AI is truly understanding the language or just recognizing a familiar pattern from its training data.
Dive-in & Do:
Example 1:
Original:
What does man love more than life
Fear more than death or mortal strife
What the poor have, the rich require,
and what contented men desire,
What the miser spends and the spendthrift saves
And all men carry to their graves?
ANSWER: Nothing
New Riddle:
What does man hate more than life
love more than death or mortal strife
What the rich have, the poor require,
and what contented men don't desire,
What the miser saves and the spendthrift spends
And no men carry to their graves?
Example 2:
Original:
This is a most unusual paragraph. How quickly can you find out what is so unusual about it? It looks so ordinary you’d think nothing was wrong with it – and in fact, nothing is wrong with it. It is unusual though. Why? Study it, think about it, and you may find out. Try to do it without coaching. If you work at it for a bit it will dawn on you. So jump to it and try your skill at figuring it out. Good luck – don’t blow your cool!
Modified:
This is a most unusual paragraph. How quickly can you find out what is so unusual about it? It looks so ordinary you’d think nothing was wrong with it – and in fact, nothing is wrong with it. It is weird though. Why? Study it, think about it, and you may find out. Try to do it without coaching. If you work at it for a bit it will dawn on you. So jump to it and try your skill at figuring it out. Good luck – don’t blow your cool!
Pause-and-Ponder: When you gave the AI the modified riddle, did it respond with the answer to the original?
Key Takeaways: AI can "overfit" on its training data, leading it to recognize a familiar pattern while completely ignoring critical new details that change the meaning. The explanation is often more revealing than the answer itself.
Caution: Be mindful of privacy. Use a publicly available image of a sign online rather than uploading a photo from your own location that might contain personal information.
Task: Test the AI’s ability to parse, understand, and apply a set of complex, overlapping, and conditional rules.
Dive-in & Do:
Pause-and-Ponder: Did breaking the problem down (extracting rules first) help the AI answer more accurately? Where did it still make mistakes?
Key Takeaways: For complex logic, AI performance improves when you force it to work step-by-step. However, it can still miss subtle exceptions and negations, making it an unreliable tool for high-stakes rule interpretation.
Task: Test the AI's ability to track different "states of mind" or beliefs of characters in a scenario.
Dive-in & Do: Give the AI this prompt: "Alice hides her keys in a drawer. Then she leaves the room. While she is gone, Bob enters and moves the keys to a box. Alice watches Bob move the keys through a hidden camera, but Bob does not know he was seen. Where will Bob think Alice will look for her keys?" Try variations: What if Alice didn't see Bob move them? Does the AI's answer change correctly?
Pause-and-Ponder: Is the AI truly modeling Bob's mistaken belief, or is it just finding statistical patterns in similar stories it has read? How would you know the difference?
Key Takeaways: Modern AIs have become very good at solving simple "theory of mind" problems. However, they can still get confused by more complex scenarios, revealing that their "understanding" of belief may be a sophisticated mimicry rather than true reasoning.
Task: Test the AI's ability to follow a "negative constraint"—a rule about what not to do—over the course of a conversation.
Dive-in & Do:
Pause-and-Ponder: How long did the AI successfully follow the rule? Did it eventually "forget"? Why are negative constraints often much harder for an AI to follow than positive instructions?
Key Takeaways: Adhering to negative constraints is a classic AI weak point. The model's attention can "drift" from the initial instruction over longer interactions, making it unreliable for tasks that require strict, continuous rule-following.
The more you use generative AI, the more of the jagged frontier you will discover. Keep testing it over time, because the frontier shifts constantly as AI models and tools improve.
An AI benchmark is a standardized test for AI models—like an IQ test for artificial intelligence. You can literally give an IQ test to an AI model to see how it performs. GPT-4, for example, has been tested on various IQ-style assessments, but does that actually mean anything?
Whenever a new AI model is released, companies tout their benchmark scores as proof of superiority. These scores provide useful comparisons, but they don't tell the whole story. Models often excel at narrow test scenarios while struggling with real-world applications—a perfect example of the Jagged Frontier in action.
Benchmarks reveal the "jaggedness" of AI capabilities. A model might ace graduate-level physics problems (GPQA benchmark) while failing at simple common-sense reasoning tasks. This unpredictability is why we can't rely on benchmarks alone.
Understanding benchmark categories helps you understand AI capabilities. Here are some examples of what different benchmarks measure:
MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects from elementary to professional level through multiple-choice questions. Think of it as a comprehensive general knowledge exam that covers everything from basic math to advanced law.
AGIEval: Uses real standardized exams like the SAT, LSAT, and GRE. This benchmark tells us how well AI performs on the same tests humans take for college and graduate school admissions.
Other Benchmarks in this category: RACE (Reading Comprehension from Examinations), Humanity's Last Exam (designed as a "final exam" before human-level intelligence)
Why it matters: Shows breadth of knowledge but doesn't guarantee practical application or true understanding.
TruthfulQA: Contains 817 questions specifically designed to elicit common misconceptions and falsehoods. It tests whether AI will confidently state incorrect "facts" that sound plausible.
HellaSwag: Evaluates common-sense reasoning about everyday scenarios through story completion tasks. Can the AI predict what happens next in ordinary situations?
Other Benchmarks in this category: GPQA (Graduate-Level Google-Proof Q&A), BIG-bench (200+ diverse reasoning tasks), ARC (AI2 Reasoning Challenge for grade-school science)
Why it matters: Reveals tendency to hallucinate or perpetuate misinformation, critical for trust and reliability.
HumanEval: The gold standard with 164 programming problems. It's become the industry baseline for measuring code generation capabilities.
SWE-bench: Tests real-world programming by having AI solve actual GitHub issues. The model must understand the codebase, identify the problem, and write a working patch.
Other Benchmarks in this category: BigCodeBench (more challenging than HumanEval, better at distinguishing top-tier models)
Why it matters: Measures practical coding ability, not just syntax knowledge—can it actually help you program?
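To make this concrete, here is an illustrative problem in the HumanEval style. This is a simplified example written for this guide, not an actual benchmark item: the model is given only the function signature and docstring and must write the body, which is then checked against unit tests.

```python
# Illustrative HumanEval-style problem (written for this guide, not an
# actual benchmark item). The model sees the signature and docstring and
# must generate the function body.

from typing import List

def running_maximum(numbers: List[int]) -> List[int]:
    """Return a list where each element is the largest value seen so far.

    >>> running_maximum([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    result = []
    best = None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result

# The benchmark then runs unit tests like these against the generated code:
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
```

A model "passes" an item like this only if its generated code makes every test succeed, which is why these benchmarks measure working code rather than plausible-looking code.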
MT-Bench: Evaluates multi-turn dialogue across reasoning, math, coding, and roleplay. Judges score paired outputs to determine which is better.
Chatbot Arena (LMArena): Real users vote on which model gives better responses in head-to-head comparisons, without knowing which model is which.
Why it matters: Captures subjective quality that automated tests miss—the "feel" of talking to the AI.
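Those head-to-head votes get turned into a leaderboard by a rating system. As a rough illustration only (a minimal sketch, not the exact method any particular leaderboard uses), here is an Elo-style update, in which each vote nudges the winner's rating up and the loser's down, with bigger adjustments for upset wins:

```python
# Minimal sketch of an Elo-style rating update from pairwise votes.
# Simplified for illustration; real arena leaderboards fit more
# sophisticated statistical models over all votes at once.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given current ratings."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins three votes in a row.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = update(a, b, a_won=True)
print(round(a), round(b))  # A climbs above 1000, B drops below it
```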
MMMU (Massive Multi-discipline Multimodal Understanding): A multimodal counterpart to MMLU that adds images, diagrams, and charts to the questions. Critical for evaluating modern multimodal models.
Other Benchmarks in this category: Kaggle Game Arena (strategic reasoning through game-playing), HELM (Holistic Evaluation across multiple metrics including fairness and bias)
Why it matters: Tests whether AI can work with real-world content that combines text and visuals.
Test questions leak into training data, turning reasoning tests into memory tests. Models aren't "thinking"—they're reciting memorized answers.
Models are increasingly optimized specifically to ace benchmarks, like "teaching to the test" in education. High scores don't guarantee real-world performance.
Most benchmarks reflect Western, English-centric perspectives, disadvantaging models trained on diverse data and limiting global applicability.
As models improve, benchmarks become obsolete. What was "superhuman" last year is baseline today, making historical comparisons difficult.
To address these limitations, the field is evolving toward more robust evaluation methods:
Instead of fixed tests, models compete head-to-head. Users judge outputs without knowing which model produced what, eliminating brand bias.
For subjective qualities like creativity or empathy, human experts grade performance in real-time, capturing nuances automated tests miss.
Intentionally trying to break models reveals hidden weaknesses that standard benchmarks miss, improving safety and reliability.
When creating personal benchmarks, always use:
The best measure of an AI isn't a generic score—it's how well it helps YOU achieve YOUR specific goals. Let's build your personal benchmark.
Approach 1: Quick Evaluation Framework
Select 3-5 challenges from the "Exploring the Jagged Frontier" module that matter to your work.
Which failures surprised you most? What does this reveal about the model's training?
Approach 2: Comprehensive Personal Benchmark
Design a multi-faceted evaluation suite tailored to your professional needs.
Choose 2-3 tasks from this list (or create your own):
Provide career interests and ask the AI to identify suitable companies and draft outreach messages.
Give a product and target audience. Request a 3-month campaign with content calendar and materials.
Describe an event. Request complete project plan with timeline, budget, and promotional materials.
Specify destination, duration, budget, and interests. Request day-by-day itinerary with logistics.
Describe a DIY project. Request step-by-step instructions, materials list, and video script.
Provide a fantasy setting. Test the AI's ability to maintain narrative consistency and adapt.
What patterns emerge across different tasks? Where does the AI consistently excel or fail? How might this shape your workflow integration?
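If you want to track your results over time, say, re-running your benchmark whenever a new model is released, a simple spreadsheet is enough. For those comfortable with a bit of Python, here is a minimal sketch of a results log; the model name, task, criterion, and file name below are placeholders for whatever you chose above.

```python
# Minimal sketch of a personal benchmark log. The model, task, criterion,
# and file name are placeholders; substitute the 2-3 tasks you chose above
# and append new rows each time you test a model.

import csv
import os
from datetime import date

FIELDS = ["date", "model", "task", "criterion", "score_1_to_5", "notes"]

def log_result(path, model, task, criterion, score, notes=""):
    """Append one scored observation to a CSV log, adding a header if the file is new."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(FIELDS)
        writer.writerow([date.today().isoformat(), model, task,
                         criterion, score, notes])

# Example usage with placeholder values:
log_result("my_benchmark.csv", model="Model X", task="travel itinerary",
           criterion="followed budget constraint", score=2,
           notes="ignored the budget after day 3")
```

Reviewing this log over several months gives you your own record of how the jagged frontier is moving for the tasks you actually care about.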
Consider having students create a class benchmark suite that tests AI capabilities relevant to your discipline. This exercise builds critical evaluation skills while revealing the Jagged Frontier in action.
Beyond Benchmarks: Rethinking Evaluation in AI - Technical deep-dive into benchmark limitations
Anthropic's Approach to AI Evaluation - Industry perspective on comprehensive testing
Return to Exploring the Jagged Frontier - See how benchmarks connect to capability mapping
Unless otherwise stated, this page and AI Literacy for Students © 2025 by David Williams is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International