Prompt Engineering Unearths Hidden AI Skills

Recent episodes show prompt engineering’s growing role in surfacing unexpected capabilities in large multimodal models. A high-profile case involved OpenAI’s o3 model: a tailored “GeoGuessr” prompt appeared to enable surprisingly precise photo geolocation, but a follow-up benchmark on 200 real-world images found that it performed slightly worse than a basic prompt. The episode shows how iterative prompting can seem to expose latent abilities, and why such claims need controlled evaluation and careful stewardship. In a parallel development, a community-built Claude skill (nidhinjs/prompt-master) that crafts efficient, context-aware prompts for any AI tool reflects efforts to standardize prompt design and conserve tokens. Together, these stories signal a trend toward tooling and techniques that systematically unlock and manage hidden model behaviors.

Latest Changes

A benchmark found the famous o3 GeoGuessr prompt did not improve geolocation and performed slightly worse than a basic prompt.

Initial reports had highlighted o3's unusually strong photo geolocation after a tailored prompt demonstration in April.

A community Claude skill, nidhinjs/prompt-master, emerged to generate efficient, context-aware prompts that conserve tokens and memory.

Timeline

2026-04-01 — Kelsey Piper highlighted o3's precise geolocation ability using a tailored GeoGuessr-style prompt.

2026-05-18 — nidhinjs released prompt-master, a Claude skill to craft accurate prompts and conserve tokens.

2026-05-21 — A benchmark study reported the o3 GeoGuessr prompt did not improve geolocation over a basic prompt.

2026-05-21 — Follow-up tests across 200 real-world images confirmed the tailored prompt underperformed versus a simple prompt.

Recent News (4)

The famous O3 "GeoGuessr" prompt did not work

A benchmark study found that the celebrated GeoGuessr-style prompt for OpenAI’s o3 model didn’t improve geolocation accuracy and in fact performed slightly worse than a simple prompt. The author evaluated 200 images from Wikimedia Commons, Geograph, and iNaturalist, comparing a detailed, iteratively refined “GeoGuessr” prompt against a basic instruction to infer location. Median and mean error distances were both higher with the elaborate prompt, which also modestly increased inference time. The piece argues that iterative prompt engineering can create an illusion of improved capability, because a model asked about a proposed prompt addition will tend to endorse it, and stresses the importance of controlled benchmarks to validate claimed gains.
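The comparison described above comes down to scoring each guess by its distance from the true location and then summarizing those distances per prompt. The following is a minimal sketch of that scoring step, not the benchmark author's code: the haversine formula is an assumed choice of distance measure, the names `haversine_km` and `summarize_errors` are hypothetical, and the coordinates are placeholder values rather than results from the study.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def summarize_errors(truth, guesses):
    """Median and mean error in km over paired (lat, lon) truths and guesses."""
    errors = [haversine_km(*t, *g) for t, g in zip(truth, guesses)]
    return {"median_km": median(errors), "mean_km": mean(errors)}


# Placeholder coordinates for illustration only (hypothetical data).
truth = [(48.8584, 2.2945), (40.6892, -74.0445), (-33.8568, 151.2153)]
guesses_by_prompt = {
    "basic": [(48.8606, 2.3376), (40.7128, -74.0060), (-33.8688, 151.2093)],
    "elaborate": [(48.8566, 2.3522), (40.7306, -73.9352), (-37.8136, 144.9631)],
}

# Score both prompts on the same images so the comparison is controlled.
for name, guesses in guesses_by_prompt.items():
    print(name, summarize_errors(truth, guesses))
```

Reporting the median alongside the mean is useful here because a single wildly wrong guess can dominate the mean while leaving the median almost unchanged.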

20 pts · Zeli · ingve · 2h ago

The famous O3 "GeoGuessr" prompt did not work

A benchmark test shows that the celebrated “GeoGuessr” prompt for OpenAI’s o3 model didn’t improve geolocation performance. Across 200 real-world images, the author ran o3 with a basic prompt and with Kelsey Piper’s elaborate GeoGuessr prompt, and the basic prompt yielded slightly better median and mean distance-to-true-location scores. Both prompts performed well overall, and the long prompt increased inference time only modestly. The experiment suggests prompt engineering can create persuasive narratives about capability gains without real improvement, especially when researchers iterate on prompts with the model itself. The author notes that such checks are straightforward and cheap but weren’t performed when the prompt became widely shared.

36 pts · HN · ingve · 3h ago
