Fascinating, if you’re a nerdy translator in the tech field:

There’s a whole debate on how the word "token" has landed in Chinese. TLDR: although "词元" (word element) is the recommended translation, it isn’t quite right, says commentator Wang Zijian on Jiemian, a media blog. Tokens are more like "符元" (symbol elements), since they represent much more than text and work across multimodal systems.

————————————————————————

China’s National Committee for the Examination and Approval of Scientific and Technological Terms has recommended translating "token" as "词元" (word element). Backing the recommendation, People's Daily published a lengthy defense of the decision. Internet commentator Wang Zijian finds the case for "词元" weak.

Writing on Jiemian, Wang Zijian argues that "token" is being mistranslated in Chinese, and warns that the cost of a misaligned term may only appear years later, when every explanation has to begin by correcting the first impression the name created.

Chen Xilin of the Chinese Academy of Sciences Institute of Computing Technology argues that the token's original role was as a "basic semantic unit of language," so "词元" captures its essence. But initial application is not structural identity. Tokens now represent text, images, speech, and physical signals across multimodal systems. Naming the concept after its origins follows the same logic that would have named the internet the "Cold War military network" because that was its first use.

Dong Yuxiao, associate professor in Tsinghua's Computer Science Department, argues that image patches and audio segments can be understood as "words in a broad sense," citing "词云" (word cloud) and "词袋" (bag of words) as supporting analogies. But when token usage runs from tens of billions to trillions of calls a day, and tokens are embedded in compute billing, model training, and academic measurement, the name needs to map to what they actually are, not rest on a metaphor that requires explanation to stay coherent.

Not to mention, "词元" already means something in linguistics: lemma, the normalized root form of a word. The overlap is bound to cause a documentation problem in textbooks, APIs, and academic papers.

"符元" (symbol element) avoids these issues. It maps to the token's computational identity as a discrete symbolic unit, carries no prior semantic baggage in Chinese academic usage, and needs no extended metaphor to cover multimodal applications.

Apr 13 at 6:38 AM