Recent episodes show prompt engineering’s growing role in surfacing unexpected capabilities in large multimodal models. A high-profile case involved OpenAI’s o3 model: a tailored “GeoGuessr” prompt was credited with surprisingly precise photo geolocation, though follow-ups revealed inconsistent results. The case highlights both how iterative prompting can expose latent abilities and why such claims need careful evaluation and stewardship. Parallel developments, like a community-built Claude skill (nidhinjs/prompt-master) that crafts efficient, context-aware prompts for any AI tool, reflect efforts to standardize prompt design and reduce wasted compute. Together, these stories signal a trend toward tooling and techniques that systematically unlock and manage hidden model behaviors.
Prompt engineering is revealing latent capabilities in multimodal models and shaping how teams evaluate model behavior and risks. Tech professionals must adapt workflows to test, standardize, and steward prompts to avoid misleading claims and wasted compute.
Dossier last updated: 2026-05-21 10:57:31
A benchmark study found that the celebrated GeoGuessr-style prompt for OpenAI’s o3 model didn’t improve geolocation accuracy and in fact performed slightly worse than a simple prompt. The author evaluated 200 images from Wikimedia Commons, Geograph, and iNaturalist, comparing a detailed, iteratively refined “GeoGuessr” prompt against a basic instruction to infer location. Median and mean error distances were both higher with the elaborate prompt, which also modestly increased inference time. The piece argues that iterative prompt engineering can create an illusion of improved capability because models readily endorse proposed prompt additions, and it stresses the importance of controlled benchmarks for validating claimed gains.
A benchmark test shows that the celebrated “GeoGuessr” prompt for OpenAI’s o3 model didn’t improve geolocation performance. Across 200 real-world images, the author ran o3 with a basic prompt and with Kelsey Piper’s elaborate GeoGuessr prompt; the basic prompt yielded slightly better median and mean distance-to-true-location scores. Both prompts performed well overall, and the long prompt only modestly increased inference time. The experiment suggests prompt engineering can create persuasive narratives about capability gains without real improvement, especially when researchers iterate on prompts with the model itself. The author notes that such checks are straightforward and cheap but weren’t performed when the prompt became widely shared.
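The kind of check described in these two summaries can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual evaluation code: the function names (haversine_km, summarize_errors, paired_wins) are hypothetical, and the guessed coordinates in the demo are made-up placeholders rather than results from the study. It assumes each prompt's output has already been reduced to one (latitude, longitude) guess per image.

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def errors_km(truth, guesses):
    """Per-image error distance; truth and guesses are lists of (lat, lon)."""
    return [haversine_km(t[0], t[1], g[0], g[1])
            for t, g in zip(truth, guesses)]


def summarize_errors(truth, guesses):
    """Median and mean error distance for one prompt."""
    errs = errors_km(truth, guesses)
    return {"median_km": median(errs), "mean_km": mean(errs)}


def paired_wins(truth, guesses_a, guesses_b):
    """Count images where prompt A is strictly closer, B is closer, or tied."""
    wins = {"a": 0, "b": 0, "tie": 0}
    for ea, eb in zip(errors_km(truth, guesses_a), errors_km(truth, guesses_b)):
        wins["a" if ea < eb else "b" if eb < ea else "tie"] += 1
    return wins


if __name__ == "__main__":
    # Placeholder data for illustration only (NOT from the benchmark).
    truth = [(48.8584, 2.2945), (40.6892, -74.0445), (-33.8568, 151.2153)]
    basic_prompt = [(48.86, 2.29), (40.70, -74.00), (-33.90, 151.20)]
    long_prompt = [(48.90, 2.40), (40.69, -74.04), (-34.50, 150.90)]
    print("basic:", summarize_errors(truth, basic_prompt))
    print("long: ", summarize_errors(truth, long_prompt))
    print("wins: ", paired_wins(truth, basic_prompt, long_prompt))
```

Running both prompts on the same images and comparing them per image, as paired_wins does, keeps differences in image difficulty from being mistaken for a prompt effect. Reporting the median alongside the mean guards against a few wildly wrong guesses dominating the result.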
OpenAI’s o3 multimodal model was found to be unusually good at geolocating photos, a capability highlighted by Kelsey Piper in April when a tailored prompt let o3 identify precise real-world locations from single images. Replications showed strong but inconsistent performance, suggesting models can harbor surprising, under-discovered skills. Observers argued this also illustrates how prompt engineering can surface latent abilities: Piper iteratively refined a long “GeoGuessr” prompt by asking the model where it erred and incorporating fixes. The episode raises questions about what other obscure capabilities may be hiding in deployed models and what role prompts play in unlocking them, with implications for capability discovery and for evaluating and stewarding model behavior.
nidhinjs/prompt-master: A Claude skill that writes accurate prompts for any AI tool. Zero tokens or credits wasted. Full context and memory.