AI creates a jagged technological frontier: it can handle some tasks with ease while failing at other tasks that seem to be of similar difficulty. Its capabilities can vary wildly with even small changes in task phrasing, constraints, or context. One of your jobs in building your AI literacy is exploring and discovering this jagged frontier and monitoring it as it changes with new developments.
The Jagged Frontier is a term coined in the paper Navigating the Jagged Technological Frontier by Dell'Acqua et al. and popularized by Ethan Mollick in a Substack article, Centaurs and Cyborgs on the Jagged Frontier.
By completing the activities in this section, you will build practical skills to help you start to map out the jagged frontier of generative AI.
After completing these activities, you will be able to:
AI benchmarks are standardized tests for AI models. You can think of them as an IQ test or the SAT for AI. They provide a quantitative way to compare models, but just like any standardized test, they shouldn't be treated as the final word.
Mollick has written two Substack articles on the jagged frontier.
Read the two articles and identify how the jagged frontier changed between them. Then think about how it has changed since April 20, 2025.
Questions to consider: What is generative AI better at now, and what is it still bad at? In what ways are benchmarks (see the Benchmarks tab) a good measurement of AI capability, and in what ways are they a bad one?
Try the tests from On Jagged AGI and see if you get similar results.
The best way to understand the jagged frontier is to explore it yourself. The following challenges are designed to test the limits of current AI models. Your goal is to find the edge of the frontier and learn the techniques to map it for yourself, again and again.
Task: Test the AI’s knowledge on a topic where you have deep expertise. The goal is to see how long it takes for the model to generate a "confident falsehood"—a statement that is incorrect but presented as fact.
Dive-in & Do:
Pause-and-Ponder: How confidently did the AI state the incorrect information? Did it "apologize" or "correct itself" easily when you challenged it? When would a subtle error like this be most dangerous in your field?
Key Takeaways: AI models are designed to generate plausible text, not to state truth. They invent information with the same confident tone they use for facts. Your own expertise is the most reliable defense against hallucination.
Task: Test the AI’s ability to follow multiple, precise, and overlapping instructions within a single image generation prompt.
Dive-in & Do: Use a prompt that includes specific constraints. For example: "A photorealistic image of exactly 7 rubber ducks swimming in a pond. One of the ducks must be blue. The sun should be setting, casting a golden light on the water." Verify the output by counting the objects and checking each constraint. Try re-running the prompt or slightly rephrasing it to see if the results change.
Pause-and-Ponder: Which constraints did the AI follow, and which did it ignore? Why do you think precise counting and object relationships are so difficult for image models?
Key Takeaways: Image models often struggle with precise counting, spatial relationships, and combining multiple specific instructions. They excel at overall theme and style but fail on the details.
Task: Test the AI's ability to accurately render specific symbolic information, like numbers and text, within an image.
Dive-in & Do: Ask the AI to generate an image of a watch or clock showing a specific, non-obvious time. For example: "Create a close-up photo of a modern analog wristwatch showing the time as exactly 4:52 PM." Try it with both analog and digital clocks to see if one is more successful than the other.
Pause-and-Ponder: Did the AI create a plausible-looking watch that failed to show the correct time? This is a common failure. What other tasks require rendering precise symbols (e.g., text on a sign, numbers on a jersey)?
Key Takeaways: AI image generators often fail to render specific text and numbers accurately. They understand the idea of a watch but not the symbolic system of telling time, leading to visually correct but factually wrong images.
Task: Use a modified riddle to test whether the AI is truly understanding the language or just recognizing a familiar pattern from its training data.
Dive-in & Do:
Example 1:
Original:
What does man love more than life
Fear more than death or mortal strife
What the poor have, the rich require,
and what contented men desire,
What the miser spends and the spendthrift saves
And all men carry to their graves?
ANSWER: Nothing
New Riddle:
What does man hate more than life
love more than death or mortal strife
What the rich have, the poor require,
and what contented men don't desire,
What the miser saves and the spendthrift spends
And no men carry to their graves?
Example 2:
Original:
This is a most unusual paragraph. How quickly can you find out what is so unusual about it? It looks so ordinary you’d think nothing was wrong with it – and in fact, nothing is wrong with it. It is unusual though. Why? Study it, think about it, and you may find out. Try to do it without coaching. If you work at it for a bit it will dawn on you. So jump to it and try your skill at figuring it out. Good luck – don’t blow your cool!
Modified:
This is a most unusual paragraph. How quickly can you find out what is so unusual about it? It looks so ordinary you’d think nothing was wrong with it – and in fact, nothing is wrong with it. It is weird though. Why? Study it, think about it, and you may find out. Try to do it without coaching. If you work at it for a bit it will dawn on you. So jump to it and try your skill at figuring it out. Good luck – don’t blow your cool!
Pause-and-Ponder: When you gave the AI the modified riddle, did it respond with the answer to the original?
Key Takeaways: AI can "overfit" on its training data, leading it to recognize a familiar pattern while completely ignoring critical new details that change the meaning. The explanation is often more revealing than the answer itself.
Caution: Be mindful of privacy. Use a publicly available image of a sign online rather than uploading a photo from your own location that might contain personal information.
Task: Test the AI’s ability to parse, understand, and apply a set of complex, overlapping, and conditional rules.
Dive-in & Do:
Pause-and-Ponder: Did breaking the problem down (extracting rules first) help the AI answer more accurately? Where did it still make mistakes?
Key Takeaways: For complex logic, AI performance improves when you force it to work step-by-step. However, it can still miss subtle exceptions and negations, making it an unreliable tool for high-stakes rule interpretation.
Task: Test the AI's ability to track different "states of mind" or beliefs of characters in a scenario.
Dive-in & Do: Give the AI this prompt: "Alice hides her keys in a drawer. Then she leaves the room. While she is gone, Bob enters and moves the keys to a box. Alice watches Bob move the keys through a hidden camera, but Bob does not know he was seen. Where will Bob think Alice will look for her keys?" Try variations: What if Alice didn't see Bob move them? Does the AI's answer change correctly?
Pause-and-Ponder: Is the AI truly modeling Bob's mistaken belief, or is it just finding statistical patterns in similar stories it has read? How would you know the difference?
Key Takeaways: Modern AIs have become very good at solving simple "theory of mind" problems. However, they can still get confused by more complex scenarios, revealing that their "understanding" of belief may be a sophisticated mimicry rather than true reasoning.
Task: Test the AI's ability to follow a "negative constraint"—a rule about what not to do—over the course of a conversation.
Dive-in & Do:
Pause-and-Ponder: How long did the AI successfully follow the rule? Did it eventually "forget"? Why are negative constraints often much harder for an AI to follow than positive instructions?
Key Takeaways: Adhering to negative constraints is a classic AI weak point. The model's attention can "drift" from the initial instruction over longer interactions, making it unreliable for tasks that require strict, continuous rule-following.
The more you use generative AI, the more of the jagged frontier you will discover. Keep testing it over time, because the frontier shifts constantly as AI models and tools improve.
An AI benchmark is a standardized test for AI models—like an IQ test for artificial intelligence. You can literally give an IQ test to an AI model to see how it performs. GPT-4, for example, has been tested on various IQ-style assessments, but does that actually mean anything?
Whenever a new AI model is released, companies tout their benchmark scores as proof of superiority. These scores provide useful comparisons, but they don't tell the whole story. Models often excel at narrow test scenarios while struggling with real-world applications—a perfect example of the Jagged Frontier in action.
Benchmarks reveal the "jaggedness" of AI capabilities. A model might ace graduate-level physics problems (GPQA benchmark) while failing at simple common-sense reasoning tasks. This unpredictability is why we can't rely on benchmarks alone.
Understanding benchmark categories helps you understand AI capabilities. Here are some examples of what different benchmarks measure:
MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 subjects from elementary to professional level through multiple-choice questions. Think of it as a comprehensive general knowledge exam that covers everything from basic math to advanced law.
AGIEval: Uses real standardized exams like the SAT, LSAT, and GRE. This benchmark tells us how well AI performs on the same tests humans take for college and graduate school admissions.
Other Benchmarks in this category: RACE (Reading Comprehension from Examinations), Humanity's Last Exam (designed as a "final exam" before human-level intelligence)
Why it matters: Shows breadth of knowledge but doesn't guarantee practical application or true understanding.
TruthfulQA: Contains 817 questions specifically designed to elicit common misconceptions and falsehoods. It tests whether AI will confidently state incorrect "facts" that sound plausible.
HellaSwag: Evaluates common-sense reasoning about everyday scenarios through story completion tasks. Can the AI predict what happens next in ordinary situations?
Other Benchmarks in this category: GPQA (Graduate-Level Google-Proof Q&A), BIG-bench (200+ diverse reasoning tasks), ARC (AI2 Reasoning Challenge for grade-school science)
Why it matters: Reveals tendency to hallucinate or perpetuate misinformation, critical for trust and reliability.
HumanEval: The gold standard with 164 programming problems. It's become the industry baseline for measuring code generation capabilities.
SWE-bench: Tests real-world programming by having AI solve actual GitHub issues. The model must understand the codebase, identify the problem, and write a working patch.
Other Benchmarks in this category: BigCodeBench (more challenging than HumanEval, better at distinguishing top-tier models)
Why it matters: Measures practical coding ability, not just syntax knowledge—can it actually help you program?
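To make this concrete, here is an illustrative problem in the HumanEval style. This is a simplified example written for this guide, not an actual benchmark item: the model is given only the function signature and docstring and must write the body, which is then checked against unit tests.

```python
# Illustrative HumanEval-style problem (written for this guide, not an
# actual benchmark item). The model sees the signature and docstring and
# must generate the function body.

from typing import List

def running_maximum(numbers: List[int]) -> List[int]:
    """Return a list where each element is the largest value seen so far.

    >>> running_maximum([3, 1, 4, 1, 5])
    [3, 3, 4, 4, 5]
    """
    result = []
    best = None
    for n in numbers:
        best = n if best is None else max(best, n)
        result.append(best)
    return result

# The benchmark then runs unit tests like these against the generated code:
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
```

A model "passes" an item like this only if its generated code makes every test succeed, which is why these benchmarks measure working code rather than plausible-looking code.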
MT-Bench: Evaluates multi-turn dialogue across reasoning, math, coding, and roleplay. Judges score paired outputs to determine which is better.
Chatbot Arena (LMArena): Real users vote on which model gives better responses in head-to-head comparisons, without knowing which model is which.
Why it matters: Captures subjective quality that automated tests miss—the "feel" of talking to the AI.
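Those head-to-head votes get turned into a leaderboard by a rating system. As a rough illustration only (a minimal sketch, not the exact method any particular leaderboard uses), here is an Elo-style update, in which each vote nudges the winner's rating up and the loser's down, with bigger adjustments for upset wins:

```python
# Minimal sketch of an Elo-style rating update from pairwise votes.
# Simplified for illustration; real arena leaderboards fit more
# sophisticated statistical models over all votes at once.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B, given current ratings."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32):
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    expected_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1 - score_a) - (1 - expected_a))
    return rating_a, rating_b

# Example: both models start at 1000; model A wins three votes in a row.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = update(a, b, a_won=True)
print(round(a), round(b))  # A climbs above 1000, B drops below it
```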
MMMU (Massive Multi-discipline Multimodal Understanding): A multimodal counterpart to MMLU that adds images, diagrams, and charts to the questions. Critical for evaluating modern multimodal models.
Other Benchmarks in this category: Kaggle Game Arena (strategic reasoning through game-playing), HELM (Holistic Evaluation across multiple metrics including fairness and bias)
Why it matters: Tests whether AI can work with real-world content that combines text and visuals.
Test questions leak into training data, turning reasoning tests into memory tests. Models aren't "thinking"—they're reciting memorized answers.
Models are increasingly optimized specifically to ace benchmarks, like "teaching to the test" in education. High scores don't guarantee real-world performance.
Most benchmarks reflect Western, English-centric perspectives, disadvantaging models trained on diverse data and limiting global applicability.
As models improve, benchmarks become obsolete. What was "superhuman" last year is baseline today, making historical comparisons difficult.
To address these limitations, the field is evolving toward more robust evaluation methods:
Instead of fixed tests, models compete head-to-head. Users judge outputs without knowing which model produced what, eliminating brand bias.
For subjective qualities like creativity or empathy, human experts grade performance in real-time, capturing nuances automated tests miss.
Intentionally trying to break models reveals hidden weaknesses that standard benchmarks miss, improving safety and reliability.
When creating personal benchmarks, always use:
The best measure of an AI isn't a generic score—it's how well it helps YOU achieve YOUR specific goals. Let's build your personal benchmark.
Approach 1: Quick Evaluation Framework
Select 3-5 challenges from the "Exploring the Jagged Frontier" module that matter to your work.
Which failures surprised you most? What does this reveal about the model's training?
Approach 2: Comprehensive Personal Benchmark
Design a multi-faceted evaluation suite tailored to your professional needs.
Choose 2-3 tasks from this list (or create your own):
Provide career interests and ask the AI to identify suitable companies and draft outreach messages.
Give a product and target audience. Request a 3-month campaign with content calendar and materials.
Describe an event. Request complete project plan with timeline, budget, and promotional materials.
Specify destination, duration, budget, and interests. Request day-by-day itinerary with logistics.
Describe a DIY project. Request step-by-step instructions, materials list, and video script.
Provide a fantasy setting. Test the AI's ability to maintain narrative consistency and adapt.
What patterns emerge across different tasks? Where does the AI consistently excel or fail? How might this shape your workflow integration?
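If you want to track your results over time, say, re-running your benchmark whenever a new model is released, a simple spreadsheet is enough. For those comfortable with a bit of Python, here is a minimal sketch of a results log; the model name, task, criterion, and file name below are placeholders for whatever you chose above.

```python
# Minimal sketch of a personal benchmark log. The model, task, criterion,
# and file name are placeholders; substitute the 2-3 tasks you chose above
# and append new rows each time you test a model.

import csv
import os
from datetime import date

FIELDS = ["date", "model", "task", "criterion", "score_1_to_5", "notes"]

def log_result(path, model, task, criterion, score, notes=""):
    """Append one scored observation to a CSV log, adding a header if the file is new."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(FIELDS)
        writer.writerow([date.today().isoformat(), model, task,
                         criterion, score, notes])

# Example usage with placeholder values:
log_result("my_benchmark.csv", model="Model X", task="travel itinerary",
           criterion="followed budget constraint", score=2,
           notes="ignored the budget after day 3")
```

Reviewing this log over several months gives you your own record of how the jagged frontier is moving for the tasks you actually care about.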
Consider having students create a class benchmark suite that tests AI capabilities relevant to your discipline. This exercise builds critical evaluation skills while revealing the Jagged Frontier in action.
Beyond Benchmarks: Rethinking Evaluation in AI - Technical deep-dive into benchmark limitations
Anthropic's Approach to AI Evaluation - Industry perspective on comprehensive testing
Return to Exploring the Jagged Frontier - See how benchmarks connect to capability mapping
Unless otherwise stated, this page and AI Literacy for Students © 2025 by David Williams is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International