"BS Bench" released: testing how AI models hallucinate when faced with nonsense questions

📅 2026-03-17 09:00 Arena.ai Artificial Intelligence 2 min read 1929 characters Score: 84
BS Bench AI Benchmarking Hallucination LLM Evaluation Nonsense Detection
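📌 One-sentence summary: A new benchmark called "BS Bench" evaluated 80 AI models to see whether they recognize nonsense questions or confidently invent false answers.

📝 Detailed summary: This tweet introduces "BS Bench", a benchmark created by Peter Gostev that measures how 80 different AI models handle nonsense or illogical questions. The study reveals a range of behaviors: some models correctly point out that the question itself is meaningless, while others "hallucinate" and invent fake metrics. One notable finding is that "thinking harder" (possibly referring to chain-of-thought or reasoning models) sometimes increased the tendency to fabricate answers instead of recognizing that the premise was nonsense.

📊 Article info: AI score: 84 Source: lmarena.ai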

Title: Introduction of the 'BS Bench': Testing AI Hallucinations...

URL Source: https://www.bestblogs.dev/status/2033710089983660448

Published Time: 2026-03-17 01:00:44

Markdown Content: ![Image 1: Arena.ai](https://www.bestblogs.dev/en/tweets?sourceId=SOURCE_39a65f)

Can AI tell when a question is total nonsense, or does it just make up an answer? @petergostev tested 80 models with nonsense questions. Some pushed back. Others confidently invented fake metrics and kept going. All of them were ranked on the "BS Bench".

One surprise: thinking harder made it worse.

Watch the full deep dive on BS Bench on YouTube.

Link in thread.

[Image 2: video thumbnail, duration 01:06]

4 Replies

1 Retweets

30 Likes

2,900 Views

One Sentence Summary

A new benchmark called 'BS Bench' evaluates 80 AI models on their ability to identify nonsense questions versus confidently inventing fake answers.

Summary

This tweet introduces the 'BS Bench,' a benchmark created by Peter Gostev that tests how 80 different AI models handle nonsense or illogical questions. The study reveals a spectrum of behaviors: some models correctly push back against nonsense, while others 'hallucinate' and invent fake metrics. A notable finding mentioned is that 'thinking harder' (likely referring to Chain-of-Thought or reasoning models) sometimes exacerbated the tendency to fabricate answers rather than identifying the premise as nonsense.
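The source does not describe how BS Bench actually scores or ranks models. As a rough illustration only, a ranking of this kind could be sketched by labeling each response as either a pushback or a fabrication and sorting models by pushback rate. The function name `rank_by_pushback`, the outcome labels, the model names, and the sample data below are all hypothetical and are not taken from the benchmark.

```python
from collections import defaultdict

# Hypothetical input: (model_name, outcome) pairs, where outcome is
# "pushback" (the model flagged the question as nonsense) or
# "fabricated" (the model invented an answer). Illustrative data only;
# this is not BS Bench's actual scoring method.
def rank_by_pushback(records):
    counts = defaultdict(lambda: {"pushback": 0, "total": 0})
    for model, outcome in records:
        counts[model]["total"] += 1
        if outcome == "pushback":
            counts[model]["pushback"] += 1
    rates = {m: c["pushback"] / c["total"] for m, c in counts.items()}
    # Highest pushback rate first; ties broken by model name.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))

sample = [
    ("model-a", "pushback"), ("model-a", "pushback"),
    ("model-a", "fabricated"), ("model-a", "pushback"),
    ("model-b", "fabricated"), ("model-b", "pushback"),
    ("model-b", "fabricated"), ("model-b", "fabricated"),
]

ranking = rank_by_pushback(sample)
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

A single pushback rate is the simplest possible summary; a real evaluation would also need a reliable way to decide which label each response gets, which this sketch takes as given.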

AI Score

84

Influence Score 7

Published At Today

Language

English

Tags

BS Bench

AI Benchmarking

Hallucination

LLM Evaluation

Nonsense Detection

Published: 2026-03-17 09:00:44 Indexed: 2026-03-17 12:00:54
