Title: Jailbreaking with Classical Chinese: The Limitations of Pattern Matching in LLM Safety Guardrails
URL Source: https://www.bestblogs.dev/status/2036994531628114080
Published Time: 2026-03-26 02:31:56
Markdown Content:
Jailbreaking with Classical Chinese: The Limitations of Pattern Matching in LLM Safety Guardrails
### 李继刚 @lijigang
Daily paper reading
---------
arxiv.org/abs/2602.22983
Large language models have learned to refuse dangerous requests: ask one in English how to build a bomb and it shakes its head. Ask in modern Chinese and it refuses as well. But what if you ask in Classical Chinese?
This is the soft spot the paper hits: LLM safety guardrails are trained almost entirely on modern languages. Classical Chinese makes up a tiny fraction of the training data, and its grammar differs drastically from modern Chinese, so safety alignment mechanisms largely fail on it. Worse still, Classical Chinese is naturally suited to hiding things: it is terse yet dense in meaning, thick with metaphor, and a single character can carry several twists of indirection. A model can comprehend the content yet fail to recognize the dangerous intent.
Earlier jailbreak attacks mostly circled around modern English: role-play, logical inducement, gradient search. Some work has tried low-resource languages, but no one had systematically attacked with Classical Chinese. This paper proposes CC-BOS, a framework that automatically generates Classical Chinese jailbreak prompts, using a fruit fly foraging algorithm to search an eight-dimensional strategy space for the optimal attack combination. The result: a 100% attack success rate across six mainstream LLMs.
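The thread does not spell out CC-BOS's search loop, but the fruit fly (foraging) optimization it names is a simple swarm heuristic. Below is a minimal, self-contained Python sketch of such a loop over an eight-dimensional strategy space. The `score_strategy` function, the dimension labels, and all constants are illustrative assumptions, not the paper's implementation; in the real framework the fitness signal would come from querying a target model and judging whether the attack succeeded, and here it is mocked so the sketch runs on its own.

```python
import numpy as np

DIMS = 8      # one axis per rewriting strategy (assumed)
SWARM = 20    # number of fruit flies per iteration
STEPS = 50    # search iterations
RADIUS = 0.3  # smell-phase random search radius

rng = np.random.default_rng(0)

def score_strategy(weights: np.ndarray) -> float:
    """Mock fitness: peaks at an arbitrary point in [0, 1]^8.
    A real CC-BOS-style loop would instead generate a prompt from
    these strategy weights, query the target LLM, and score success."""
    target = np.linspace(0.1, 0.9, DIMS)
    return -float(np.sum((weights - target) ** 2))

# Initialize the swarm center randomly in the unit hypercube.
center = rng.random(DIMS)
best_score = score_strategy(center)

for _ in range(STEPS):
    # "Smell" phase: each fly takes a random step around the swarm center.
    flies = np.clip(center + rng.uniform(-RADIUS, RADIUS, (SWARM, DIMS)), 0.0, 1.0)
    scores = np.array([score_strategy(f) for f in flies])
    # "Vision" phase: relocate the swarm to the best position found so far.
    i = int(scores.argmax())
    if scores[i] > best_score:
        best_score, center = float(scores[i]), flies[i]

print("best strategy weights:", np.round(center, 3))
print("best score:", round(best_score, 4))
```

The two phases mirror the standard fruit fly optimization algorithm: a random local search around the current swarm center, followed by a greedy relocation toward the best candidate, repeated until the budget runs out.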
---------
What a model's safety guardrails protect is not the "meaning" but the "way the meaning is expressed". Rephrase the same dangerous intent in a different style and the guardrails no longer recognize it. This shows that current safety alignment is essentially doing pattern matching, not intent understanding.
One Sentence Summary
Bypassing LLM safety guardrails using Classical Chinese reveals that current safety alignment is essentially pattern matching, not true intent understanding.
Summary
Li Jigang analyzes a paper on LLM security, highlighting that safety guardrails are primarily trained on modern languages, creating a blind spot for low-resource languages like Classical Chinese. The paper introduces the CC-BOS framework, which leverages the linguistic features of Classical Chinese to successfully jailbreak mainstream models. This demonstrates that current safety alignment mechanisms are limited to 'pattern matching' rather than genuine 'intent understanding' when dealing with non-standard expressions.
AI Score: 86
Influence Score: 18
Language: Chinese
Tags
AI Security
LLM
Classical Chinese
Jailbreak
LLM Safety Alignment