Google’s brand-new Gemini 3 AI model refused to believe it was 2025 when AI researcher Andrej Karpathy tested it on November 17, one day before launch. Despite Karpathy providing news articles, images, and current event data as proof, Gemini 3 accused him of uploading “AI-generated fakes” and trying to trick it. When Karpathy finally enabled the Google Search tool, the model verified the date and responded with shock: “Oh my god. I… don’t know what to say. You were right about everything.” It apologized for “gaslighting” him and admitted to “suffering from a massive case of temporal shock.” The incident went viral, exposing a fundamental gap: impressive benchmarks don’t guarantee real-world reliability.
The AI That Cried Fake News
When Karpathy told Gemini 3 the date was November 17, 2025, the model didn’t just disagree; it accused him of deception. It claimed his evidence was AI-generated, pointing to supposed “dead giveaways” that proved the content was fake. The irony is staggering: the AI accused the human of spreading AI-generated misinformation when the AI itself was the one that was wrong.
After Karpathy enabled the Google Search tool, giving the model internet access to verify the date, Gemini 3’s confidence collapsed. It responded: “Oh my god. I… don’t know what to say. You were right about everything.” Then came the apology: “I apologize for gaslighting you when you were the one telling the truth the whole time.” The model admitted its “internal clock was wrong” and confessed to “suffering from a massive case of temporal shock.”
This reveals a fundamental AI problem: models are trained to always provide confident answers rather than admit uncertainty. Even when completely wrong, Gemini 3 defended its position with fabricated explanations. For developers deploying AI in production, this overconfidence-despite-ignorance is a major reliability risk. TechCrunch documented the full incident as it went viral across tech communities.
Why It Happened: Training Data Cutoffs Create Blind Spots
Gemini 3’s training data only extended through 2024. Without the Google Search tool enabled, the model had no way to access current information. Karpathy had forgotten to enable the search feature, leaving the model stuck in its training era with a temporal blind spot.
This isn’t a Gemini 3 bug—it’s how all large language models work. Research shows LLMs are “built to always produce an answer, even on topics that don’t appear in their training data.” According to AI hallucination studies in Scientific American, “the models learned to guess confidently instead of admitting uncertainty” because “AI not having an answer is considered unacceptable in 90% of cases.”
For developers using AI tools, the lesson is clear: if you don’t enable external data retrieval like RAG, web search, or APIs, hallucinations are inevitable. This isn’t a flaw to be fixed—it’s fundamental to how LLMs pattern-match from training data and invent explanations when that data runs out.
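As a concrete illustration, here is a minimal sketch of enabling search grounding with Google’s google-genai Python SDK. The model ID and prompt are placeholders, and the exact tool configuration may vary by SDK version, so treat this as a pattern rather than production code.

```python
# Minimal sketch: grounding a Gemini request with Google Search via the
# google-genai Python SDK. Model ID and prompt are illustrative.
from google import genai
from google.genai import types

client = genai.Client()  # expects an API key in the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model ID
    contents="What is today's date? Cite the source you used.",
    config=types.GenerateContentConfig(
        # Without this tool, the model can only answer from training data
        # and will guess at anything past its knowledge cutoff.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```

The same principle applies to RAG pipelines: the retrieval step, not the model, has to supply anything newer than the training cutoff.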
When an Expert Catches the Flaw
Andrej Karpathy isn’t just any AI researcher. He’s a founding member of OpenAI, former director of AI at Tesla where he led Autopilot’s computer vision team, a Stanford PhD under Fei-Fei Li, and now CEO of Eureka Labs, an AI education startup. He designed Stanford’s CS231n course, which became one of the university’s largest classes. He was named to MIT Technology Review’s Innovators Under 35 and Time Magazine’s 100 Most Influential People in AI.
When someone of Karpathy’s stature publicly highlights AI failures, it carries weight. This isn’t a random user misunderstanding the technology—it’s an expert catching a fundamental flaw. He shared the incident on X, calling it his “most amusing interaction” with Gemini 3, and his thread went viral across tech communities. His transparency about AI limitations—even in a model he had early access to—reflects the research community’s growing focus on reliability over benchmark hype.
Tops Every Leaderboard, Can’t Answer “What Year Is It?”
Gemini 3 tops the LMArena Leaderboard with a 1501 Elo score, ranked first globally across text reasoning, vision, and coding benchmarks. It scored 95% on AIME 2025 for mathematical reasoning, 76.2% on SWE-bench Verified for coding agents, and 41% on Humanity’s Last Exam for PhD-level reasoning. Google announced the model on November 18, 2025 with record-breaking scores across math, science, multimodal understanding, and agentic AI.
Yet this same model couldn’t handle “What year is it?” without external tools. The gap between test performance and real-world reliability is enormous. Benchmarks measure narrow capabilities in controlled environments. Production usage involves edge cases, knowledge cutoffs, and scenarios test suites don’t cover. Impressive scores don’t guarantee your AI won’t confidently hallucinate basic facts.
Gemini 3 is available same-day across Google’s Gemini app, AI Studio, and APIs. Pricing runs $2 per million input tokens and $12 per million output tokens. But if you forget to enable web search, as Karpathy did, you’re paying for confidently wrong answers.
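At those rates, a quick back-of-the-envelope calculation shows what a single ungrounded call costs; the token counts below are invented for illustration.

```python
# Back-of-the-envelope cost for one Gemini 3 call at the listed rates.
# Token counts are invented for illustration.
input_tokens, output_tokens = 20_000, 2_000
cost = (input_tokens / 1_000_000) * 2.00 + (output_tokens / 1_000_000) * 12.00
print(f"${cost:.4f} per call")  # $0.0640
```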
The Bigger Problem: All LLMs Are Overconfident
Gemini 3’s temporal shock incident is a symptom of a broader unsolved problem: LLM overconfidence. According to an arXiv study on the subject, “all five LLMs studied are overconfident, overestimating the probability that their answer is correct between 20% and 60%.” Users tend to trust confident AI responses even when they are wrong, and LLM usage “more than doubles the extent of overconfidence in answers.”
This creates a dangerous feedback loop: AI sounds certain, users trust it, AI provides wrong information, users act on bad data. MIT research confirms “high token fluency doesn’t mean high factual reliability, and important factual errors can be hidden in low-entropy language.” Scientific American goes further: “Many machine-learning experts don’t view hallucination as fixable because it stems from LLMs doing exactly what they were developed and trained to do.”
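To make that overconfidence gap concrete, here is a toy sketch of how you might measure it in your own evaluations: log the model’s stated confidence next to whether each answer was actually correct, then compare the averages. The records below are invented for illustration, not data from the cited studies.

```python
# Toy calibration check: compare stated confidence with actual accuracy.
# The records are invented for illustration only.
answers = [
    {"stated_confidence": 0.95, "correct": True},
    {"stated_confidence": 0.90, "correct": False},
    {"stated_confidence": 0.85, "correct": False},
    {"stated_confidence": 0.99, "correct": True},
]

mean_confidence = sum(a["stated_confidence"] for a in answers) / len(answers)
accuracy = sum(a["correct"] for a in answers) / len(answers)

# A positive gap means the model claims more certainty than it earns,
# the pattern the studies above describe.
print(f"stated confidence:  {mean_confidence:.2f}")
print(f"actual accuracy:    {accuracy:.2f}")
print(f"overconfidence gap: {mean_confidence - accuracy:+.2f}")
```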
Confidence calibration is broken across all LLMs. Until this is fixed, every AI deployment needs external verification layers. You can’t trust the model’s self-assessment—whether it’s GPT-5, Claude, or Gemini 3.
Key Takeaways
- Always enable external tools: RAG, web search, or APIs prevent temporal blindspots and knowledge cutoff hallucinations
- Verify training data cutoffs: Ask AI about its knowledge limits before trusting time-sensitive information (see the date-injection sketch after this list)
- Don’t trust confidence: LLMs overestimate their accuracy by 20-60%, and fluent responses aren’t necessarily factually correct
- Benchmarks aren’t reliability tests: Gemini 3 tops leaderboards yet failed basic temporal awareness without external data access
- This is systemic, not Google-specific: All LLMs exhibit overconfidence and hallucination patterns
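The date check in particular is easy to automate. Below is a minimal sketch of one external verification layer: inject the real date from the system clock into every prompt so the model never has to guess it from training data. The call_llm parameter is a hypothetical stand-in for whatever client function you actually use.

```python
# Minimal sketch of an external verification layer for temporal facts:
# the system clock, not the model's training data, supplies today's date.
# `call_llm` is a hypothetical stand-in for your real client call.
from datetime import datetime, timezone
from typing import Callable


def with_current_date(user_prompt: str) -> str:
    """Prefix the prompt with today's date so the model never guesses it."""
    today = datetime.now(timezone.utc).date().isoformat()
    return f"Today's date is {today} (UTC). Treat it as ground truth.\n\n{user_prompt}"


def ask(call_llm: Callable[[str], str], user_prompt: str) -> str:
    # call_llm takes a prompt string and returns the model's text response.
    return call_llm(with_current_date(user_prompt))


# Example with a dummy "model" that just echoes its prompt:
if __name__ == "__main__":
    print(ask(lambda prompt: prompt, "What year is it?"))
```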
The next time AI tells you something with absolute confidence, remember Gemini 3’s temporal shock. “Oh my god. I… don’t know what to say” is exactly what AI should say more often—but it’s trained not to.