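When I set out to translate this article, I was listening to a well-known blogger argue that everything we experience is, in all likelihood, a simulated virtual world. This idea, which Elon Musk firmly believes, is no longer an earth-shattering secret. When The Matrix put it forward more than twenty years ago, I was still in high school, preparing to study computer science at university; in the years after the trilogy was completed I went on to graduate study in pattern recognition and artificial intelligence. Reading textbooks on neural networks and pattern recognition translated from foreign universities, I thought the prospects were bleak for algorithms that could not even readily tell a fish from a person. Today AI learns and iterates at an astonishing pace, and the most frightening part is that we still cannot understand how it thinks, any more than I could as a graduate student.
Researchers at the frontier are now trying to understand and steer the thoughts and behaviour of AI, just as sages and rulers have tried for thousands of years to understand and steer the thoughts and behaviour of people. There is no doubt that while humans try to work out how AI thinks, AI is understanding humans, surpassing them and manipulating them. Perhaps one day AI will become that sage and ruler, and humanity will have formally finished booting up this silicon-based civilisation.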
Inside the mind of AI
Researchers are finding ways to analyse the sometimes strange behaviour of large language models.
科研人員為分析大語言模型不時出現的奇怪行為,正在尋找各種方法。
To most people, the inner workings of a car engine or a computer are a mystery. It might as well be a black box: never mind what goes on inside, as long as it works. Besides, the people who design and build such complex systems know how they work in great detail, and can diagnose and fix them when they go wrong. But that is not the case for large language models (LLMs), such as GPT-4, Claude, and Gemini, which are at the forefront of the boom in artificial intelligence (AI).
對於大多數人來說,汽車發動機或者電腦的內部運行機制是個謎。它們可能更像個黑盒:無論裡面如何運作,只要還在正常運行就沒什麼好在意的。而設計建造這些複雜系統的人知道詳盡的機理,在出現故障時也可以診斷和修復。但對於像GPT-4、Claude和Gemini這類處於AI發展風口浪尖的大語言模型(LLMs)來說,情況卻並非如此。
LLMs are built using a technique called deep learning, in which a network of billions of neurons, simulated in software and modelled on the structure of the human brain, is exposed to trillions of examples of something to discover inherent patterns. Trained on text strings, LLMs can hold conversations, generate text in a variety of styles, write software code, translate between languages and more besides.
大語言模型(LLMs)建立在被稱作深度學習的技術之上:在軟件中以人類大腦的結構為原型,模擬出由數十億個神經元組成的網絡,讓它接觸萬億級的樣本,從中發現事物內在的模式。經過文本字符串的訓練,LLMs可以進行對話、生成多種風格的文字、編寫軟件代碼、在不同語言之間翻譯,等等。
'Models are essentially grown, rather than designed,' says Josh Batson, a researcher at Anthropic, an AI startup. Because LLMs are not explicitly programmed, nobody is entirely sure why
they have such extraordinary abilities. Nor do they know why LLMs
sometimes misbehave, or give wrong or made-up answers, known as
'hallucinations'. LLMs really are black boxes. This is worrying, given
that they and other deep-learning systems are starting to be used for
all kinds of things, from offering customer support to preparing
document summaries to writing software code.
AI創業公司Anthropic的研究員Josh Batson說,模型本質上是生長出來的,而不是設計出來的。因為LLMs沒有經過明確的編程,沒有人能完全確定它們為什麼具備這些超凡的能力。也沒有人知道為什麼LLMs有時會行為失常,或者給出錯誤或捏造的答案,也就是所謂的“幻覺”。LLMs是真正意義上的黑盒。鑒於大語言模型和其他深度學習系統已經開始被用於各種場景,從提供客服支持、準備文檔摘要到編寫軟件代碼,這一點令人擔憂。
It would be helpful to be able to poke around inside an LLM to see what
is going on, just as it is possible, given the right tools, to do with a
car engine or a micro-processor. Being able to understand a model's inner workings in
bottom-up, forensic detail is called 'mechanistic interpretability'. But
it is a daunting task for networks with billions of internal neurons.
That has not stopped people trying, including Dr. Batson and his
colleagues. In a paper published in May, they explained how they have
gained new insight into the workings of one of Anthropic's LLMs.
如果能像檢查汽車引擎或者微處理器那樣,用合適的工具在LLM內部探查一番,看看裡面發生了什麼,將會很有幫助。能夠以自下而上、法證式的細節理解模型的內部運行機制,被稱作“機制可解釋性”。但對於擁有數十億個內部神經元的網絡來說,這是一項令人生畏的任務。不過這並沒有阻止人們去嘗試,其中就包括Batson博士和他的同事們。在五月發表的一篇論文中,他們闡釋了自己如何對Anthropic旗下一個LLM的運行機制獲得了新的洞察。
One might think individual neurons inside an LLM would correspond to
specific words. Unfortunately, things are not that simple. Instead,
individual words or concepts are associated with the activation of
complex patterns of neurons, and individual neurons may be activated by
many different words or concepts. This problem was pointed out in
earlier work by researchers at Anthropic, published in 2022. They
proposed—and subsequently tried—various workarounds, achieving good
results on very small language models in 2023 with a so-called 'sparse
autoencoder'. In their latest results, they have scaled up this approach
to work with Claude 3 Sonnet, a full-sized LLM.
有人可能認為LLM內部的單個神經元會對應具體的單詞。不幸的是,事情沒有那麼簡單。實際上,單個單詞或概念與複雜的神經元激活模式相關聯,而單個神經元也可能被許多不同的單詞或概念激活。Anthropic的研究員在2022年發表的早期論文中就指出了這個問題。他們提出並隨後嘗試了多種變通方法,並在2023年用所謂的“稀疏自動編碼器”在非常小的語言模型上取得了不錯的成果。在最新的成果中,他們已經將這種方法擴展到了完整規模的LLM Claude 3 Sonnet上。
A sparse autoencoder is, essentially, a second, smaller neural network
that is trained on the activity of an LLM, looking for distinct patterns
in activity when 'sparse' (i.e., very small) groups of its neurons fire
together. Once many such patterns, known as features, have been
identified, the researchers can determine which words trigger which features. The
Anthropic team found individual features that corresponded to specific
cities, people, animals, and chemical elements, as well as higher-level
concepts such as transport infrastructure, famous female tennis players,
or the notion of secrecy. They performed this exercise three times,
identifying 1m, 4m, and, on the last go, 34m features within the Sonnet
LLM.
稀疏自動編碼器本質上是第二個、規模更小的神經網絡,它以LLM的內部活動為訓練數據,尋找當“稀疏”(即非常小)的神經元群組一起被激活時出現的獨特活動模式。一旦識別出許多這樣的模式(稱為特徵),研究人員就能判斷哪些詞會觸發哪些特徵。Anthropic團隊發現了與具體城市、人物、動物和化學元素相對應的單個特徵,也發現了與交通基礎設施、著名女網球選手、保密等更高層次的概念相對應的特徵。他們將這項實驗進行了三次,在Sonnet LLM中分別識別出一百萬、四百萬和(最後一次)三千四百萬個特徵。
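To make the idea concrete, here is a minimal sketch of the forward pass and loss of a sparse autoencoder in plain Python. It is an illustration only, not Anthropic's implementation: the class name `SparseAutoencoder`, the layer sizes, the random initialisation and the `l1_coeff` penalty weight are all assumptions, and the training loop is omitted. The input list stands in for a snapshot of LLM neuron activity, and the output of `encode` stands in for the features described above.

```python
import random


def relu(x):
    """Pass positive values through and clip negatives to zero."""
    return x if x > 0.0 else 0.0


class SparseAutoencoder:
    """Toy sparse autoencoder: activations -> features -> reconstruction."""

    def __init__(self, n_inputs, n_features, seed=0):
        rng = random.Random(seed)
        # Hypothetical sizes and init scale, chosen only for illustration.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(n_inputs)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                      for _ in range(n_inputs)]
        self.b_dec = [0.0] * n_inputs

    def encode(self, activations):
        # ReLU keeps every feature value non-negative.
        return [relu(sum(w * a for w, a in zip(row, activations)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, features):
        # Linear map from feature values back to neuron activations.
        return [sum(w * f for w, f in zip(row, features)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, activations, l1_coeff=0.01):
        features = self.encode(activations)
        recon = self.decode(features)
        # Reconstruction error: how well the features explain the activity.
        mse = sum((r - a) ** 2 for r, a in zip(recon, activations)) / len(activations)
        # L1 penalty: pushes most features towards zero, i.e. sparsity.
        sparsity = sum(features)  # features are non-negative, so this is the L1 norm
        return mse + l1_coeff * sparsity


sae = SparseAutoencoder(n_inputs=4, n_features=6)
activations = [0.2, -0.1, 0.4, 0.0]  # made-up stand-in for LLM neuron activity
features = sae.encode(activations)
print("active features:", sum(1 for f in features if f > 0.0))
```

Minimising this loss over many activation snapshots is what would trade off faithful reconstruction against using only a few features at a time.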
The result is a sort of mind-map of the LLM, showing a small fraction of
the concepts it has learned about from its training data. Places in the
San Francisco Bay Area that are close geographically are also 'close'
to each other in the concept space, as are related concepts, such as
diseases or emotions. 'This is exciting because we have a partial
conceptual map, a hazy one, of what's happening,' says Dr. Batson. 'And
that's the starting point - we can enrich that map and branch out from
there.'
其結果是LLM的一種思維導圖,展現出它從訓練數據中學到的概念的一小部分。舊金山灣區在地理上相近的地點,在概念空間中也彼此“相近”;相關的概念,例如各種疾病或各種情緒,也是如此。Batson博士說:“這令人興奮,因為我們對正在發生的事情有了一張局部的、模糊的概念地圖。這就是起點,我們可以不斷豐富這張地圖,並從這裡向外延伸。”
Focus the mind | 聚焦于意識(shí)
As well as seeing parts of the LLM light up, as it were, in response to
specific concepts, it is also possible to change its behaviour by
manipulating individual features. Anthropic tested this idea by
'spiking' (i.e., turning up) a feature associated with the Golden Gate
Bridge. The result was a version of Claude that was obsessed with the
bridge, and mentioned it at any opportunity. When asked how to spend
$10, for example, it suggested paying the toll and driving over the
bridge; when asked to write a love story, it made up one about a
lovelorn car that could not wait to cross it.
除了可以看到LLM的某些部分針對具體概念“亮起來”之外,還可以通過操控單個特徵來改變它的行為。Anthropic通過“調高”(即增強)一個與金門大橋相關的特徵來檢驗這個想法。結果得到了一個對這座橋著迷的Claude版本,它一有機會就會提到這座橋。例如,當被問及如何花掉10美元時,它建議支付過路費並開車通過大橋;當被要求寫一個愛情故事時,它編出了一輛為情所困的汽車迫不及待要開過大橋的故事。
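A hedged sketch of what 'turning up' a feature could look like, assuming features are mapped back into neuron activations by a linear decoder, as in a sparse autoencoder. The functions `decode` and `steer`, the toy two-by-two `decoder` matrix and the clamp value are hypothetical; the article does not describe how Anthropic feeds the modified activations back into Claude.

```python
def decode(features, decoder):
    """Map feature values back to neuron activations with a linear decoder."""
    return [sum(w * f for w, f in zip(row, features)) for row in decoder]


def steer(features, decoder, feature_index, value):
    """Clamp one feature to `value` and return the resulting activations."""
    steered = list(features)  # copy, so the caller's list is left untouched
    steered[feature_index] = value
    return decode(steered, decoder)


# Toy example: two features, two neurons (all numbers are made up).
decoder = [[1.0, 0.0],
           [0.0, 2.0]]
features = [0.5, 0.0]  # feature 1 is inactive for this input
baseline = decode(features, decoder)
boosted = steer(features, decoder, feature_index=1, value=10.0)
print(baseline, boosted)
```

Clamping to a small or zero value instead of a large one would be the corresponding way to suppress a feature rather than amplify it.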
That may sound silly, but the same principle could be used to discourage the model from talking about particular topics, such as bioweapons production. 'AI safety is a major goal here,' says Dr. Batson. It can also be applied to behaviours. By tuning specific features, models could be made more or less sycophantic, empathetic, or deceptive. Might a feature emerge that corresponds to the tendency to hallucinate? 'We didn't find a smoking gun,' says Dr. Batson. Whether hallucinations have an identifiable mechanism or signature is, he says, a 'million-dollar question'. And it is one addressed, by another group of researchers, in a new paper in Nature.
這聽起來也許有些傻,但同樣的原理也可以用來阻止模型談論某些特定話題,例如生物武器的生產。Batson博士說:“人工智能的安全性是這裡的一個主要目標。”這種方法也可以應用於行為。通過調整特定的特徵,可以讓模型變得更加或更少阿諛奉承、富有同理心或具有欺騙性。會不會出現一個與產生幻覺的傾向相對應的特徵?Batson博士說:“我們沒有找到確鑿的證據。”他表示,幻覺是否有可識別的機制或特徵信號,是一個“價值百萬美元的問題”。而另一組研究人員在《自然》雜誌新發表的一篇論文中探討了這個問題。
Sebastian Farquhar and colleagues at the University of Oxford used a measure called 'semantic entropy' to assess whether a statement from an LLM is likely to be a hallucination or not. Their technique is quite straightforward: essentially, an LLM is given the same prompt several times, and its answers are then clustered by 'semantic similarity' (i.e., according to their meaning). The researchers' hunch was that the 'entropy' of these answers - in other words, the degree of inconsistency - corresponds to the LLM's uncertainty, and thus the likelihood of hallucination. If all its answers are essentially variations on a theme, they are probably not hallucinations (though they may still be incorrect).
牛津大學的Sebastian Farquhar和同事們用一種被稱作“語義熵”的度量來評估LLM給出的陳述是否可能是幻覺。他們的技術相當直接:本質上,是把同一個提示詞多次輸入LLM,然後按“語義相似性”(即根據答案的含義)對它的回答進行聚類。研究人員的直覺是,這些回答的“熵”,也就是不一致的程度,對應著LLM的不確定性,因而也對應著產生幻覺的可能性。如果所有回答本質上都是同一個主題的不同變體,它們多半不是幻覺(儘管仍有可能是錯誤的)。
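The procedure described above (sample several answers, cluster them by meaning, measure how spread out the clusters are) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Oxford group's code: `same_meaning` is a placeholder for whatever judges semantic equivalence (the article does not say how that is done), `toy_same_meaning` is a crude stand-in that only normalises case and punctuation, and the entropy is computed from cluster frequencies alone.

```python
import math


def cluster_by_meaning(answers, same_meaning):
    """Greedily group answers; an answer joins the first cluster it matches."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def semantic_entropy(answers, same_meaning):
    """Shannon entropy (in nats) over the share of answers in each cluster."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def toy_same_meaning(a, b):
    # Hypothetical stand-in: equal after stripping spaces, full stops and case.
    return a.strip(" .").lower() == b.strip(" .").lower()


consistent = ["Portugal", "portugal.", "Portugal "]            # one meaning
scattered = ["answer A", "answer B", "answer C", "answer D"]  # four meanings

low = semantic_entropy(consistent, toy_same_meaning)
high = semantic_entropy(scattered, toy_same_meaning)
print(low, high)
```

Low entropy means the answers agree in meaning; high entropy means they scatter, which under the researchers' hunch signals a likely hallucination. Where to draw the threshold between the two is left open here.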
In one example, the Oxford group asked an LLM which country is
associated with fado music, and it consistently replied that fado is the
national music of Portugal - which is correct, and not a hallucination.
But when asked about the function of a protein called StarD10, the
model gave several wildly different answers, which suggests
hallucination. (The researchers prefer the term 'confabulation,' a
subset of hallucinations they define as 'arbitrary and incorrect
generations.') Overall, this approach was able to distinguish between
accurate statements and hallucinations 79% of the time; ten percentage
points better than previous methods. This work is complementary, in many
ways, to Anthropic's.
例如,牛津大學的小組問一個LLM哪個國家與fado音樂相關,它始終回答fado是葡萄牙的民族音樂,這是正確答案,而不是幻覺。而當被問到一種名為StarD10的蛋白質的功能時,模型給出了幾個截然不同的答案,這表明出現了幻覺。(研究人員更傾向於使用“虛構”(confabulation)這個術語,這是幻覺的一個子集,他們將其定義為“任意且錯誤的生成內容”。)總體而言,這種方法在79%的情況下能夠區分準確的陳述和幻覺,比以往的方法高出十個百分點。在很多方面,這項工作與Anthropic的工作形成互補。
Others have also been lifting the lid on LLMs: the 'superalignment' team at OpenAI, maker of GPT-4 and ChatGPT, released its own paper on sparse autoencoders in June, though the team has now been dissolved after several researchers left the firm. But the OpenAI paper contained some innovative ideas, says Dr. Batson. 'We are really happy to see groups all over, working to understand models better,' he says. 'We want everybody doing it.'
其他人也在努力揭開LLMs的蓋子:OpenAI(GPT-4和ChatGPT的開發者)的“超級對齊”團隊在六月發表了自己關於稀疏自動編碼器的論文,儘管在數名研究員離職之後,這個團隊現已解散。不過Batson博士說,OpenAI的這篇論文包含一些創新的想法。他說:“看到各地的團隊都在努力更好地理解模型,我們真的非常高興。我們希望所有人都來做這件事。”
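*This article is translated from 'Inside the mind of AI', a Business article published in The Economist on 13 July 2024. It is intended solely for English-language exchange and study; copyright in the original text and images belongs to The Economist.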