Alphabetized Chinese writing

According to this article, more than 95% of Chinese written in pinyin is perfectly comprehensible even without tone marks. For the remaining 5%, the context is often sufficient to determine the intended word. To prevent any remaining ambiguities, marking the less-used homophones with tone marks would suffice:

    Pinyin   No Tones       Tone to be Indicated
    wo       我 (16,790)    窩 (96)    握 (91)   臥 (30)
    kan      看 (4,682)     刊 (5)     砍 (81)   坎 (7)
    shang    上 (10,602)    傷 (132)   賞 (12)   尚 (42)

The numbers which appear after each character refer to their relative numbers of occurrence as noted in the Xiandai Hanyu Pinyin Cidian (Beijing Language Institute Press, 1986). Based on these figures, only 1.3% of occurrences of Pinyin monosyllabic words spelled wo need to have their tones marked to distinguish them from the word wo which refers to first person singular (the unmarked case); only 2% of cases of Pinyin monosyllabic kan require tone markings to distinguish them from the most commonly occurring word meaning “to see”; only 1.8% of occurrences of shang require tone markings to differentiate them from the word meaning “on, above, mount”, and so on. Obviously, beginning students need only learn the unmarked forms which constitute the majority of cases as these are the easiest to remember.
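The percentages above can be rechecked from the counts in the table. The following is a minimal sketch; the dictionary layout and the function name `marked_share` are my own choices for illustration, and the only data used are the counts quoted above. The results should land close to the quoted figures, with small differences that may come down to rounding.

```python
# Occurrence counts as quoted in the table above. In each list, the largest
# number belongs to the most frequent (unmarked) word; the others are the
# homophones that would carry a tone mark.
counts = {
    "wo": [16790, 96, 91, 30],
    "kan": [4682, 5, 81, 7],
    "shang": [10602, 132, 12, 42],
}


def marked_share(freqs):
    """Return the percentage of occurrences that are NOT the most frequent word."""
    total = sum(freqs)
    return 100 * (total - max(freqs)) / total


for syllable, freqs in counts.items():
    print(f"{syllable}: {marked_share(freqs):.1f}% of occurrences would need a tone mark")
```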

The same method will be used in dealing with bisyllabic words. In any group of words which have the same segmental Pinyin spelling (that is, the same pronunciation except for tone), the tone markings will be omitted from the most common (unmarked) case, and only added to the remaining cases as distinguishers. For example:

    zhidao   知道   (1,603)
    zhǐdǎo   指導    (189)
    zhídào   直到     (110)
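The rule for a group of same-spelled words can be sketched in a few lines of code. This is only an illustration under my own assumptions: the function names are hypothetical, the input is assumed to be fully tone-marked pinyin (I use zhīdào for 知道), and the frequencies are the ones listed above. It covers only the basic rule (strip the tones from the most common word), not the refinement for shared morphemes.

```python
import unicodedata

# Combining marks used for the four pinyin tones: macron, acute, caron, grave.
# The diaeresis of "ü" is deliberately not in this set, so it survives.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tones(pinyin):
    """Remove tone diacritics from a pinyin string, e.g. 'zhǐdǎo' -> 'zhidao'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def spell_group(words):
    """Map each fully toned word to the spelling it would get.

    `words` is a list of (toned_pinyin, frequency) pairs that share the same
    spelling apart from tone. The most frequent word loses its tone marks;
    all the others keep theirs as distinguishers.
    """
    most_common = max(words, key=lambda pair: pair[1])[0]
    return {
        toned: strip_tones(toned) if toned == most_common else toned
        for toned, _ in words
    }


group = [("zhīdào", 1603), ("zhǐdǎo", 189), ("zhídào", 110)]
for toned, spelling in spell_group(group).items():
    print(f"{toned} is written as {spelling}")
```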

If there is one morpheme which is the same in a group of bisyllabic words which have the same segmental pronunciation and spelling except for tone, then that morpheme need not be marked for tone; for example:

    zhongxin   中心    193
    zhōngxin   忠心    6
    zhòngxin   重心    7
    quanli     權利    21
    quanlì     權力    16

After this, there remain homophones that are identical in tone as well. A number of those are very frequent one-syllable words. I’ve noticed that even in Dutch (my native language, and not a language that is particularly encumbered with homophones) there are frequently used short words that have homophones. Context always seems to eliminate ambiguity. However, in theory it is possible to remove all remaining ambiguity from those words by spelling them differently.

Suggestions for Chinese:

    Basic Forms             Variant Spelling Forms
    bei    north            bey (coverb)
    de     (verb suffix)    d (possessive); di 地 (-ly)
    guo    country          -go (experiential suffix)
    mai    buy              may “sell”
    mei    not              moi “each”
    men    door             -mn (plural suffix)
    shi    ten              sh “to be”
    ta     he               taa “she”; to 它 “it”
    xiang  think            xang “towards”; xanq 像 “like”
    yi     take             i “one”
    you    have             yeu “from”; iu 又 “again”
    zai    at               zay “again”
    zhe    this             -zh “-ing”
    zi     character        -z (noun suffix)

With this, Chinese would be even clearer than Dutch. Isn’t that remarkable? It is often stated that Chinese could never be alphabetized because of its huge problem of homophonous words. In fact, according to the article, even pinyin with hardly any accents (tone marks) or variant spellings is perfectly comprehensible.¹

1. Full tone marks would of course be useful for foreign students learning the language. But only in a very limited sense. In order to be able to speak and understand Chinese, the student has to learn the correct pronunciation of words anyway.


The problem of loanwords from English that not only deviate from our (Dutch) spelling, but also have to be conjugated or declined.

Dus we onderzoeken nu wat er gebeurt als je onderwerpen anders framet. (“So we are now investigating what happens when you frame topics differently.”)

So that doesn’t work. What then? frame-t? Or freemt?

Or perhaps after all “... we onderzoeken nu wat er gebeurt als je onderwerpen anders inkleedt” (“... what happens when you dress up topics differently”).

Is there a better equivalent? For me the association with Erving Goffman disappears, but not everyone will have that association. In een ander kader plaatst (“places in a different frame”)?

The context of the original sentence matters too, of course. The writer of the article saved himself time and energy by not bothering with a translation.

What would my mother think of the verb frame-en? She almost certainly knows the noun:

  • frame
    Loanword from English, first attested in the sense of ‘raamwerk’ (framework) in the year 1886. [1]

By association with the noun, frame-en probably evokes the intended meaning effectively enough. So we are back at the spelling. I reject framet.

Most people still use typewriter-style quote marks.

Sometime in the 1990s, a bunch of people hammered out the International Standard ISO/IEC 10646, which was surely a good thing.

I became aware of this standard when at one time long ago I upgraded my Linux system and got the latest fonts. Suddenly all left quote characters and right quote characters looked different. They no longer matched.

Markus Kuhn, who designed the new Linux fonts that were on my new system, warned:

Please do not use the ASCII grave accent (0x60) as a left quotation mark together with the ASCII apostrophe (0x27) as the corresponding right quotation mark (as in `quote'). Your text will otherwise appear rather strange with most modern fonts (e.g., on Windows and Mac systems). Only old X Window System fonts and some old video terminals show ASCII 0x60/0x27 as left and right quotation marks, while most modern systems follow the ISO and Unicode standards instead. If you can use only ASCII’s typewriter characters, then use the apostrophe character (0x27) as both the left and right quotation mark (as in 'quote'). If you can use Unicode characters, nice directional quotation marks are available in the form of characters U+2018, U+2019, U+201C, and U+201D (as in ‘quote’ or “quote”).¹
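To make the code points in Kuhn’s warning a bit more tangible, here is a small sketch that prints their official Unicode names and rewrites the old grave-accent style of quoting into directional quote marks. The function name `fix_legacy_quotes` is hypothetical, and the pattern is deliberately naive: it would also mangle legitimate backticks (in shell snippets, for instance), so treat it as an illustration rather than a tool.

```python
import re
import unicodedata

# The two ASCII characters Kuhn warns about, followed by the four
# directional quotation marks he recommends.
for code_point in (0x60, 0x27, 0x2018, 0x2019, 0x201C, 0x201D):
    ch = chr(code_point)
    print(f"U+{code_point:04X}  {ch}  {unicodedata.name(ch)}")


def fix_legacy_quotes(text):
    """Rewrite `word' (grave accent ... apostrophe) as ‘word’ (U+2018 ... U+2019)."""
    return re.sub(r"`([^`']*)'", "\u2018\\1\u2019", text)


print(fix_legacy_quotes("as in `quote'"))
```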

Modern computer fonts look really great. Often it’s like reading a book on the screen. Except, almost twenty years later, still very few people use quote marks that match the design of these fonts. Instead they use the straight apostrophe or the double straight quote mark. Unlike the regular letters of most fonts, which are designed with nice print shapes, these standard quote signs look like they are from an old typewriter. In context they look horrible, especially when the font is a serif font.

So why don’t people use the “nice directional quotation marks” that Kuhn talks about? Simple, those signs are not on the keyboard. In fact, most physical keyboards strangely still emulate the keyboard of a typewriter.

For a while software solutions were created to solve this (like Microsoft’s “smart quote marks”), but those solutions exist only in specific programs. There is no global solution. In fact, with the introduction of lots and lots of new devices, those “nice directional quotation marks” have often become harder to find or use. For a long time, my Android keyboard didn’t even have directional quote marks. And while some time ago Android’s English keyboard introduced them as an option that appears when you long-press the apostrophe or the double quote mark, hardly anyone seems to use them. (My multilingual English/Japanese keyboard didn’t receive this update. If I wanted to type a directional quote mark, I’d have to change keyboards first, which is not convenient.) On the PC, with its physical keyboard, no such solution exists. For some reason, the layout of the US computer keyboard has hardly changed in decades. The only new key on the keyboard that I am typing on right now is a key with Microsoft’s symbol on it.
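For illustration, the kind of heuristic behind such “smart quote” features can be sketched as follows. This is not Microsoft’s actual algorithm, just a common naive rule that I am assuming for the example: a straight quote at the start of the text, or after whitespace or an opening bracket, becomes an opening mark, and every other straight quote becomes a closing mark (which also takes care of apostrophes). The function name `smarten` is hypothetical.

```python
import re


def smarten(text):
    """Replace straight quotes with directional ones using a naive rule."""
    # A quote at the start, or after whitespace or an opening bracket, opens.
    text = re.sub(r'(^|[\s(\[{])"', "\\1\u201c", text)
    # Every remaining double quote closes.
    text = text.replace('"', "\u201d")
    # Same rule for single quotes; the leftovers double as apostrophes.
    text = re.sub(r"(^|[\s(\[{])'", "\\1\u2018", text)
    text = text.replace("'", "\u2019")
    return text


print(smarten('He said "don\'t" and left.'))
```

A rule this simple gets some cases wrong. A leading apostrophe, as in '90s, is turned into an opening quote mark, which is one reason why such features tend to stay tied to one program and its own corrections.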

How do I type the directional quote marks in this text? When I type in gvim I use hotkeys that I choose myself. Additionally I use the standard Linux hotkeys that work in gtk applications (on how to activate the latter: see below).

Change the keyboard

Changing the current keyboard layout is not straightforward, because almost all the symbols that are available now have been adopted into programming languages. That doesn’t leave much room for change. Nevertheless, changing the keyboards, both physical and software keyboards, seems to be the only way to get “nice directional quotation marks” widely adopted alongside our otherwise beautiful screen fonts.

Type hard to reach symbols in Linux

A lot of Linux distros seem not to have defined a compose key by default (including my distro, Slackware). That is probably because keyboards have different layouts, but why not make an exception for the rather language agnostic US keyboard? Well, I don’t know, you tell me. It seems we lack a sane standard for keyboards.

Anyway, step one is to define a compose key. You can even do that temporarily on someone else’s box. Find a key on your keyboard that would be a convenient compose key (it should be a key that you don’t use to type symbols directly). I use the Menu key. Open a terminal and type xev. Press and release the key you intend to use, and note the number after keycode. On my keyboard, the Menu key has keycode 135. Close xev and enter the following command (using your own chosen keycode):

xmodmap -e "keycode 135 = Multi_key"

Now you can type lots of fancy symbols, including directional quote marks.

Press and release sequentially Multi_key, the first key and then the second key of a key combination.

Multi_key < ' or Multi_key ' < gives ‘
Multi_key > ' or Multi_key ' > gives ’

Multi_key < " or Multi_key " < gives “
Multi_key > " or Multi_key " > gives ”

Multi_key , " or Multi_key " , gives „

A few examples of other keys. Lots of combinations are quite predictable.

Multi_key ' e gives é
Multi_key ` e gives è
Multi_key " e gives ë
Multi_key / o gives ø
Multi_key - u gives ū
Multi_key 8 8 gives ∞
Multi_key v / gives √
Multi_key y - gives ¥
Multi_key c = gives €
Multi_key ^ 1 gives ¹

A complete list of compose key combinations lives in the folder of the locale you are using, for example /usr/share/X11/locale/en_US.UTF-8/Compose. Use your own locale (type echo $LANG) and create a list with:

grep Multi_key /usr/share/X11/locale/en_US.UTF-8/Compose > Multi_key.txt

On my system it has 4041 combinations, including lots of weird ones (like the Korean alphabet).
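If you would rather look up a combination than scroll through thousands of lines, the Compose file is easy enough to parse. The sketch below assumes the usual line layout of that file (key names in angle brackets, a colon, then the result in double quotes); the function name `parse_compose` and the sample lines are mine, and escape sequences inside the quoted result are left as they are.

```python
import re

# One Compose entry: a run of <keysym> names, a colon, then a quoted result.
ENTRY = re.compile(r'^((?:<\w+>\s*)+):\s*"((?:[^"\\]|\\.)*)"')


def parse_compose(lines):
    """Return {(key, key, ...): result} for all entries that start with Multi_key."""
    table = {}
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # comments, includes, blank lines
        keys = tuple(re.findall(r"<(\w+)>", match.group(1)))
        if keys[0] == "Multi_key":
            table[keys[1:]] = match.group(2)
    return table


# A few sample lines in the style of the Compose file.
sample = [
    "# a comment line",
    '<Multi_key> <less> <apostrophe> : "\u2018" U2018 # LEFT SINGLE QUOTATION MARK',
    '<Multi_key> <apostrophe> <e> : "\u00e9" eacute # LATIN SMALL LETTER E WITH ACUTE',
    '<dead_acute> <e> : "\u00e9" eacute # not a Multi_key sequence, so skipped',
]

table = parse_compose(sample)
print(len(table), "combinations")
print(table[("less", "apostrophe")])

# For the real file, something like this should work (path as mentioned above):
# with open("/usr/share/X11/locale/en_US.UTF-8/Compose", encoding="utf-8") as f:
#     table = parse_compose(f)
```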

You can make your Multi_key selection permanent by simply adding the command xmodmap -e "keycode 135 = Multi_key" to your .xinitrc or another startup script that your system may use.²

  1. ASCII and Unicode quotation marks

Screenshot of a font test of 繫.

It seems that the fonts 'Sazanami Gothic','さざなみゴシック' and 'Sazanami Mincho','さざなみ明朝' have the wrong glyph at U+7E6B (繫), which should have thread 糸 at the bottom, not hand 手.


A more recent version of the font may have been fixed, I'll try it later. Accessing the (possible) webpage of the developer failed:

Update: still wrong in Slackware 14.2.