UTF-8
From Gentoo Linux Wiki
The drawback is that this limits the number of characters that can be represented by the table. As long as the table contains all the characters you need, there are no problems. The moment one shares a file with someone who uses a different character table, things start going wrong.
Some tables (such as the ISO-8859-* tables) overlap, with the same byte value representing the same character in each of them. Other characters may exist in only one of the tables. These, naturally, are the main point of contention.
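As an illustration of such a conflict, the following sketch decodes the same byte under two of the ISO-8859 tables and prints the result as UTF-8 hex bytes. It assumes that iconv and od are installed; the helper name decode_byte is made up for this example. Byte 0xE4 is "ä" in both ISO-8859-1 and ISO-8859-15, while byte 0xA4 is the generic currency sign in ISO-8859-1 but the euro sign in ISO-8859-15.

```shell
# decode_byte OCTAL CHARSET  (hypothetical helper)
# Interpret one byte (given in octal) under CHARSET and print the
# resulting character as UTF-8 hex bytes.
decode_byte() {
    printf "\\$1" | iconv -f "$2" -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# 0xE4 (octal 344): the same character in both tables
e4_latin1=$(decode_byte 344 ISO-8859-1)
e4_latin9=$(decode_byte 344 ISO-8859-15)

# 0xA4 (octal 244): a different character in each table
a4_latin1=$(decode_byte 244 ISO-8859-1)
a4_latin9=$(decode_byte 244 ISO-8859-15)

echo "0xE4: $e4_latin1 vs $e4_latin9"
echo "0xA4: $a4_latin1 vs $a4_latin9"
```

A file containing byte 0xA4 therefore shows a different symbol depending on which table the reader assumes.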
There are two solutions to this problem. Either one must have information about the character table used in each file that contains text, or have a table that incorporates each and every character in the world.
Unicode is an implementation of the latter. It allows users to write and exchange information without compatibility worries, and with falling prices for storage it has become very popular. Users only have to make sure that their software supports Unicode and that they have fonts installed that can display all the characters they wish to use (no single font implements all the characters in Unicode).
USE flags
Add the USE flags "unicode" and "nls" to your /etc/make.conf
| File: /etc/make.conf |
USE="unicode nls" |
To rebuild all changed packages, do a world upgrade:
| Code: |
emerge world --update --newuse |
Kernel Stuff
To activate Unicode support in the kernel, set the following options:
| Linux Kernel Configuration: Unicode support |
File systems --->
Native Language Support --->
(utf8) Default NLS Option
<*> NLS UTF8 |
After you recompile your kernel, your filenames will be encoded in UTF-8 by default.
If you compiled it as a module, be sure to load it:
modprobe nls_utf8
To avoid doing this every time you boot, add "nls_utf8" to your /etc/modules.autoload.d/kernel-2.6 or -2.4 file.
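To see whether NLS UTF8 ended up built in or as a module, the kernel configuration can be inspected. The sketch below is only an illustration: the function name nls_utf8_state is made up, and the path in the usage comment assumes the usual /usr/src/linux symlink.

```shell
# nls_utf8_state  (hypothetical helper)
# Read a kernel .config on stdin and print how CONFIG_NLS_UTF8 is set:
# "builtin", "module" or "missing".
nls_utf8_state() {
    case $(grep '^CONFIG_NLS_UTF8=' | head -n 1) in
        CONFIG_NLS_UTF8=y) echo builtin ;;
        CONFIG_NLS_UTF8=m) echo module ;;
        *) echo missing ;;
    esac
}

# Typical use (path is an assumption):
#   nls_utf8_state < /usr/src/linux/.config
```

If the result is "module", the modprobe step above applies.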
Kernel Bugs
Please note that there exists a bug in some Linux kernel versions which affects UTF-8 locales using dead keys. The issue has reportedly been solved since kernel version 2.6.11.
Installing locales
The system locales come with the glibc package. By default almost all possible locales are installed, though you can choose to install only the locales you need. If umlauts and special characters are not displayed properly, it might help to rebuild your locales with locale-gen. If locale-gen is not available, upgrade to the most current version of glibc. Once the locales have been generated, delete /etc/locales.build, which is obsolete. An alternative to locale-gen is to create the *.UTF-8 locales manually with localedef.
| Code: |
English: localedef -c -f UTF-8 -i en_US en_US.UTF-8
German:  localedef -c -f UTF-8 -i de_DE de_DE.UTF-8 |
locale -a will give you a list of all installed locales. Note: In its output, utf8 means the same as UTF-8.
Now you need to set the LANG, LC_ALL and GDM_LANG variables for your language. Note: /etc/env.d/02locale is case-sensitive!
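Because both spellings occur, a filter for the installed UTF-8 locales should accept either one. A minimal sketch, with a made-up function name:

```shell
# filter_utf8_locales  (hypothetical helper)
# Reduce a locale list (one name per line on stdin) to the UTF-8 entries.
# Accepts "utf8" as well as "UTF-8", in any letter case, optionally
# followed by a modifier such as "@latin".
filter_utf8_locales() {
    grep -i -E '\.utf-?8(@.*)?$'
}

# Typical use:
#   locale -a | filter_utf8_locales
```

An empty result means that no UTF-8 locale has been generated yet.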
| File: /etc/env.d/02locale |
LANG="de_DE.UTF-8"
LC_ALL="de_DE.UTF-8"
GDM_LANG="de_DE.UTF-8" |
To get a list of all UTF-8 supported locales, check the output of:
grep UTF-8 /usr/share/i18n/SUPPORTED
Lines in /usr/share/i18n/SUPPORTED have the format:
<locale> <charmap>
You only need the <locale> part for your /etc/env.d/02locale.
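Given that two-column format, the locale names can be extracted directly. This is a small sketch; the function name utf8_locale_names is an assumption made for the example.

```shell
# utf8_locale_names  (hypothetical helper)
# Read "<locale> <charmap>" lines on stdin and print the <locale>
# field of every line whose charmap is UTF-8.
utf8_locale_names() {
    awk '$2 == "UTF-8" { print $1 }'
}

# Typical use:
#   utf8_locale_names < /usr/share/i18n/SUPPORTED
```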
Now do the following:
| Code: |
# env-update
>>> Regenerating /etc/ld.so.cache...
* Caching service dependencies ...
# source /etc/profile |
- See http://www.gentoo.org/doc/en/guide-localization.xml for further information.
- See TIP Specifying only needed locales for instructions.
Console setup
Add the following to ~/.bashrc in order to switch the console to Unicode mode on login (use "unicode_start foo_font" to set your custom font):
| File: ~/.bashrc |
if [[ $TERM = "linux" ]]; then
    unicode_start
fi
|
If you have a multi-user system, you need to do this for every single user.
But, since "unicode_start" requires root privileges, you can instead configure your Gentoo system to default to unicode consoles for all logins. For this to work, you must have a recent version of sys-apps/baselayout installed (>=sys-apps/baselayout-1.11.9).
First, change the unicode setting in /etc/rc.conf
| File: /etc/rc.conf |
UNICODE="yes" |
Mind the case. UNICODE="YES" will NOT work.
Then install Terminus, a good font for UTF-8 consoles:
| Code: emerge terminus |
emerge -av media-fonts/terminus-font |
Also edit the following files, according to their comments:
/etc/conf.d/consolefont
/etc/conf.d/keymaps
You change the font in /etc/conf.d/consolefont.
| File: /etc/conf.d/consolefont |
CONSOLEFONT=LatArCyrHeb-16 # Latin, Arabic (only isolated forms), Cyrillic, Hebrew
# take a look at /usr/bin/unicode_start (shell script)
|
Here are the settings for the German keyboard:
| File: /etc/conf.d/keymaps |
KEYMAP="de-latin1"
#alternatively: KEYMAP="de-latin1-nodeadkeys"
|
You must no longer use "-u" in KEYMAP with current versions of baselayout.
One example for setting the console font is
| File: /etc/conf.d/consolefont |
CONSOLEFONT="ter-v16b"
#CONSOLETRANSLATION=""
|
Now, reboot the system, and the system INIT will automatically enable UTF-8 capability on all console logins. However, a particular console login won't actually display in UTF-8 until receiving a switch-to-unicode escape sequence.
The last step is to make the following change so that the switch-to-unicode escape sequence executes at each login
| File: ~/.bash_profile |
if test -t 1 -a -t 2 ; then
echo -n -e '\033%G'
fi
|
This code instructs the console to switch to unicode if running from a console TTY (and not a terminal emulator or remote shell). In fact, this code block is directly from the internals of the "unicode_start" command.
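For reference, the escape sequence has a counterpart that switches the console back. The sketch below wraps both in functions; the function names are made up, while the sequences themselves (ESC % G and ESC % @) are the ones listed in console_codes(4). Unlike the block above, it also checks $TERM, the same test used in the ~/.bashrc example earlier.

```shell
# Console escape sequences from console_codes(4):
#   ESC % G  selects UTF-8 mode
#   ESC % @  returns to the default (ISO 8859-1) mode
# The function names are hypothetical.
utf8_console_on()  { printf '\033%%G'; }
utf8_console_off() { printf '\033%%@'; }

# Only touch the terminal when stdout is a TTY on the Linux console.
if [ -t 1 ] && [ "${TERM:-}" = "linux" ]; then
    utf8_console_on
fi
```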
Or, to make the switch to UTF-8 global for all users (could be problematic)
| File: /etc/profile |
if test -t 1 -a -t 2 ; then
echo -n -e '\033%G'
fi
|
As a final, last-ditch alternative, you can use this init.d script to set all consoles into unicode mode on bootup:
| File: /etc/init.d/unicode |
#!/sbin/runscript
conf=/etc/env.d/02locale
# Using devfs?
if [ -e /dev/.devfsd ] || [ -e /dev/.udev -a -d /dev/vc ]; then
    device=/dev/vc/
else
    device=/dev/tty
fi
depend() {
    need localmount
    after keymaps
    before consolefont
}
# Read the configured encoding (the part after the dot) from ${conf}
checkconfig() {
    if [ -r ${conf} ]; then
        . ${conf}
        encoding=
        [ -n "${LC_ALL}" ] && encoding=${LC_ALL#*.} && return 0
        [ -n "${LC_MESSAGES}" ] && encoding=${LC_MESSAGES#*.} && return 0
        [ -n "${LANG}" ] && encoding=${LANG#*.} && return 0
    fi
    eend 1 "Locale is not configured, Please fix ${conf}"
    return 1
}
start() {
    ebegin "setting consoles to UTF-8"
    checkconfig
    if [[ "${encoding}" =~ [uU][tT][fF]-?8 ]]; then
        dumpkeys | loadkeys --unicode
        # Send the switch-to-UTF-8 escape sequence to every console
        for ((i=1; i <= "${RC_TTY_NUMBER}"; i++)); do
            echo -ne "\033%G" > ${device}${i}
        done
        eend 0
    else
        eend 1 "UTF-8 is not required"
    fi
}
|
| Code: to make script executable |
chmod +x /etc/init.d/unicode |
and then
| Code: add the script |
rc-update add unicode default |
Sometimes you may also need to set the LC_ALL and LANG environment variables. This is easy to do by following the instructions in the Gentoo Linux Localization Guide.
Converting old files
Once Unicode support has been added, old files may need to be re-encoded to display properly.
To re-encode the contents of plain text files, you can choose between iconv, recode, and enconv (which is in app-i18n/enca).
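With iconv, for example, a single file could be converted along the following lines. This is a sketch rather than a recommended tool: the function name to_utf8 is made up, and you must know the file's current encoding yourself, since iconv does not detect it.

```shell
# to_utf8 FROM_CHARSET FILE  (hypothetical helper)
# Re-encode FILE in place from FROM_CHARSET to UTF-8. The result is
# written to a temporary file first, so the original survives if
# iconv rejects the input.
to_utf8() {
    tmp=$(mktemp) || return 1
    if iconv -f "$1" -t UTF-8 "$2" > "$tmp"; then
        cat "$tmp" > "$2"   # overwrite the contents, keep owner and mode
        rm -f "$tmp"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Typical use:
#   to_utf8 ISO-8859-15 notes.txt
```

Running it twice on the same file would encode the text twice, so convert each file only once.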
app-text/convmv is a perl script utility that re-encodes filenames, directory names, and entire subtrees. Emerge it with
| Code: |
emerge -av app-text/convmv |
To test re-encoding a filename from ISO-8859-15 to UTF-8, try
| Code: |
convmv -f iso-8859-15 -t utf8 file-name-with-ä |
If the reported rename looks sane, add --notest to actually re-encode the name.
Applications
To enter Unicode characters that are not available on your keyboard, hold down Ctrl+Shift and type the hex value nnnn of the character. Note: Use the value from the Unicode notation U+nnnn, not the UTF-8 encoded value.
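The difference between the two values can be seen by converting a character to UTF-32, which stores the plain code point. The following sketch assumes iconv and od are available; the function name codepoint_of is made up for the example.

```shell
# codepoint_of  (hypothetical helper)
# Read one UTF-8 encoded character on stdin and print its code point
# in U+nnnn notation. UTF-32BE holds the code point as a plain number.
codepoint_of() {
    hex=$(iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n')
    printf 'U+%04X\n' "0x$hex"
}

# The euro sign is stored as the three bytes e2 82 ac in UTF-8,
# but the value to type is its code point:
printf '\342\202\254' | codepoint_of   # prints U+20AC
```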
Terminal emulators
xterm
xterm runs in Unicode mode when started with either of:
| Code: |
|
xterm -u8
uxterm |
If you want xterm to support Unicode without starting it with the "-u8" option, you can also add this to your ~/.Xresources:
| Code: xterm Unicode |
XTerm*locale: true |
After having added this line, you need to run xrdb -merge ~/.Xresources.
URxvt
URxvt from x11-terms/rxvt-unicode always runs in Unicode mode. If you want it to use UTF-8, set your LANG accordingly (e.g. LANG="en_US.UTF-8").
GNU Screen
GNU Screen must be invoked with the -U command line option.
If you are using it as a login shell you will have to write a wrapper that calls screen with the -U option and the options that are called when screen is used as a login shell:
| Code: GNU Screen wrapper |
#!/bin/sh
exec /usr/bin/screen -xRR -U |
For people using it for irssi and so on, making an alias is enough.
| File: ~/.bashrc |
alias screen="screen -U" |
However, if you are running screen from an SSH or RSH session, then editing the screen configuration should be enough.
Add the following to ~/.screenrc
| File: ~/.screenrc |
defutf8 on |
Players
XMMS
XMMS isn't able to handle UTF-8 characters. A replacement is the Beep-Media-Player. It's a GTK v2.0-based XMMS-clone which supports Unicode.
emerge -av beep-media-player
Of course there are many themes and plugins for BMP (Beep Media Player):
emerge -s bmp
Another way to get XMMS running is to create a GTK v1 configuration file: gtkrc.utf8. We copy an existing one (/etc/gtk/gtkrc.iso-8859-14) to /etc/gtk/gtkrc.utf8
cp /etc/gtk/gtkrc.iso-8859-14 /etc/gtk/gtkrc.utf8
Now we need to edit it and replace every "iso-8859-14" with "utf8":
nano /etc/gtk/gtkrc.utf8
Afterwards it should look like:
style "gtk-default-utf8" {
fontset = "-*-helvetica-medium-r-normal--12-*-*-*-*-*-utf8,\
-*-arial-medium-r-normal--12-*-*-*-*-*-utf8,\
-*-helvetica-medium-r-normal--12-*-*-*-*-*-utf8,\
-*-arial-medium-r-normal--12-*-*-*-*-*-utf8,*-r-*"
}
class "GtkWidget" style "gtk-default-utf8"
Now XMMS shouldn't have any problems displaying UTF-8 encoded characters.
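The copy-and-edit steps above can also be done in one pass with sed instead of an editor. A minimal sketch, assuming the source file exists as described; the function name make_utf8_gtkrc is made up.

```shell
# make_utf8_gtkrc SRC DST  (hypothetical helper)
# Write a copy of SRC to DST with every "iso-8859-14" replaced by "utf8".
make_utf8_gtkrc() {
    sed 's/iso-8859-14/utf8/g' "$1" > "$2"
}

# Typical use, as root:
#   make_utf8_gtkrc /etc/gtk/gtkrc.iso-8859-14 /etc/gtk/gtkrc.utf8
```

It is still worth comparing the result with the listing above.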
Editors
vim
Unicode should work out of the box, since version 6.3. To make vim display files in UTF-8, add this to your .vimrc:
| File: ~/.vimrc |
set enc=utf-8 |
Add this
| File: ~/.vimrc |
set fenc=utf-8 |
to make vim use UTF-8 for writing files.
If your terminal is also using UTF-8, add this:
| File: ~/.vimrc |
set termencoding=utf-8 |
nano
Versions prior to 1.3.6 can't handle UTF-8 properly. At the time of writing, the keyword entry below is only needed on the alpha and ppc-macos platforms.
| Code: |
echo "=app-editors/nano-1.3.6 ~alpha" >> /etc/portage/package.keywords
emerge -uDav nano |
Emacs
When run in console mode, Emacs can be configured to handle Unicode by adding the following Lisp instructions to its configuration file:
| File: ~/.emacs |
(setq locale-coding-system 'utf-8)
(set-terminal-coding-system 'utf-8)
(set-keyboard-coding-system 'utf-8)
(set-selection-coding-system 'utf-8)
(prefer-coding-system 'utf-8)
|
Notice, however, that the console must handle Unicode too.
Kwrite/Kate
Start Kwrite/Kate, go to Settings -> Configure Kate -> Editor Component -> Open/Save. Select UTF-8 in "Encoding". (tested on KDE 4.1 SVN)
LaTeX
Merge Unicode support for LaTeX with
| Code: |
emerge dev-tex/latex-unicode |
News reader
slrn
slrn needs either an ISO-8859-1 based console or luit:
| Code: |
|
LC_ALL=de_DE.iso88591 LANG=en_US.iso88591 xterm -fn "-misc-fixed-medium-r-normal--18-120-100-100-c-90-iso10646-1" -e slrn
|
| Code: |
|
LC_ALL=de_DE.iso88591 LANG=en_US.iso88591 luit slrn
|
Mail
Kmail
Open Kmail and go to Settings -> Configure KMail -> Composer -> Charset. There you find a list of charsets which will be checked from top to bottom until one is found. Move "utf-8" to the first position. Then go to Settings -> Configure KMail -> Appearance -> Message Window. Select in the menu list "Fallback message encoding" the item "Unicode (UTF-8)".
Mutt printing
Mutt should work flawlessly on a Unicode console. But if you want to use pretty-printing you need a few tricks, as a2ps does not support UTF-8. Your best bet may be app-misc/muttprint, as it seems to work perfectly in both Unicode and single-byte environments and produces very elegant output. However, it requires LaTeX to be installed on your system.
Emerge the package and put this in your ~/.muttrc
| File: ~/.muttrc |
set print_command=muttprint |
Otherwise you may emerge recode and a2ps:
emerge recode a2ps
and use this in
| File: ~/.muttrc |
set print_command="recode UTF-8..Latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy -P printername" |
You may also use u2ps from the gnome-u2ps package, which has native Unicode support. (It is packaged for Debian; it is not known whether it is also available in Gentoo.)
/bin/mail
mail-client/mailx is not able to handle UTF-8; mail-client/nail is.
| Code: |
emerge --unmerge mailx
emerge nail |
You can still use /bin/mail from mail-client/mailx to send a UTF-8 encoded mail, but you have to set the charset in the Content-Type header manually (tested with v8.1.2.20050715-r1):
| Code: |
echo "äöü€" | mail -a "Content-Type: text/plain; charset=utf-8" -s "${subject}" ${recipient} |
Sylpheed-Claws printing
For printing with Sylpheed, we need to use a2ps and recode.
| Code: |
emerge recode a2ps |
| Code: |
Print command: cat %s | recode ..latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy -P <printername> |
Substitute "<printername>" with the name of your printer.
If you're using KDE:
| Code: |
Print command: cat %s | recode ..latin-1 | a2ps -1 --portrait --borders=no -X latin1 --pretty-print=mail --strip 1 --highlight-level=heavy | kprinter --stdin |
Shells
bash
Bash has been Unicode-aware since version 2.05b when used with readline version 4.3. Both are in Portage.
emerge bash sys-libs/readline
revdep-rebuild --soname libreadline.so.4
rm /lib/libreadline.so.4*
Be sure you know what you are doing when you perform the last step (see the info from the readline ebuild).
You will also need to have the package gentoolkit installed as it contains the revdep-rebuild tool.
Note: the manual deletion of libreadline.so.4 recommended above needs to be double-checked. On one system:
# qfile /lib/libreadline.so.4
sys-libs/readline (/lib/libreadline.so.4)
# eix -s readline
sys-libs/readline-5.2_p12-r1
Apparently, libreadline.so.4 belongs to readline-5*. This can be verified further with:
# qlist readline
Note: Some of the configuration changes recommended in this article may no longer be needed; much of this is probably already handled by /etc/rc.conf and the unicode USE flag. See the Talk/Discussion page of this article for further information on these issues.
zsh
Zsh has handled UTF-8 well since version 4.3.1. Older versions are not Unicode-aware; they still work as long as you don't use Backspace on Unicode characters. (This deletes part of the UTF-8 character byte by byte and confuses zle's assumptions about the cursor position.)
mc
Mc must be compiled with the sys-libs/slang library for full Unicode support.
emerge gentoolkit
euse -E slang
emerge -avDN mc
X
Applications such as Fluxbox and Sylpheed-Claws might cause problems when they are not merged with the cjk USE flag enabled: the affected applications take a long time to start and consume a lot of CPU. Meanwhile, there are release candidates of Fluxbox 1.0 with better UTF-8 support.
X usually obeys the LC_* environment variables; however, X is picky about how you spell your locale settings. What works in the console may not work in X. You can find a list of all acceptable locale aliases in /usr/lib/X11/locale/locale.alias. As always, case matters. You should make sure that the locale you choose corresponds to one of the glibc locales listed by locale -a.
If you're doing advanced troubleshooting you may also be interested in the locale.dir file, in the same directory. It maps locale names to files. Make sure it maps your locale correctly (it usually does).
So to sum it up, the chain goes like this, and all of its links must be intact: LC_* -> locale.alias -> locale.dir -> [X locale definition file]
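The first link of that chain can be checked with a small lookup in locale.alias. The sketch below is hedged: the function name x_locale_alias is made up, and the assumed line format (an alias, optionally followed by a colon, then whitespace and the canonical name, with "#" starting a comment) should be compared against your own file.

```shell
# x_locale_alias NAME [FILE]  (hypothetical helper)
# Print the canonical X locale that NAME maps to in locale.alias.
# Assumed line format: "<alias>[:] <canonical name>"; "#" starts a comment.
x_locale_alias() {
    awk -v name="$1" '
        /^#/ { next }
        { alias = $1; sub(/:$/, "", alias) }
        alias == name { print $2; exit }
    ' "${2:-/usr/lib/X11/locale/locale.alias}"
}

# Typical use:
#   x_locale_alias de_DE.utf8
```

No output means that X does not know the spelling you passed in.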
Microsoft Windows partitions
If you're using VFAT partitions, you need to modify the mount options.
| File: /etc/fstab |
/dev/hdxY /mnt/windows1 vfat iocharset=utf8,codepage=850 0 0
/dev/hdxY /mnt/windows2 ntfs nls=utf8 0 0
//samba2/share /mnt/windows3 smbfs iocharset=utf8,codepage=cp850 0 0
//samba3/share /mnt/windows4 smbfs iocharset=utf8 0 0
/dev/cdrom /media/cdrom udf,iso9660 iocharset=utf8,ro,user,noauto 0 0 |
There are differences between Samba v2.2 (DOS, Microsoft Windows 9x and Microsoft Windows Millennium) and Samba v3 (Microsoft Windows 2000 and Microsoft Windows XP). See http://us5.samba.org/samba/docs/man/Samba3-HOWTO/unicode.html
Note: VFAT requires codepage to be 850 and smbfs cp850. You can also use these values in the kernel configuration, so that they can also be used by HAL.
Samba
Disable the following option in the kernel, if set:
File systems --->
Network File Systems --->
<M> SMB file system support (to mount Windows shares etc.)
[ ] Use a default NLS
Microsoft Windows NT/2000/2003/XP are able to handle UTF-8, but DOS, Microsoft Windows 9x and Millennium clients need this option because those operating systems cannot handle UTF-8 and need cp850.
Alternatively you can use CIFS:
File systems --->
Network File Systems --->
<M> CIFS support (advanced network filesystem for Samba, Window and other CIFS compliant servers)
Fluxbox
BUG 1 Fluxbox doesn't fully support Unicode yet. Some of its styles select fonts that are not suitable for Unicode. To fix this you will have to edit Fluxbox's style file(s) in /usr/share/fluxbox/styles and add something like:
| File: /usr/share/fluxbox/styles/$YourStyle |
window.font: -*-*-*-*-*-*-*-*-*-*-*-*-*-u |
to at least fix the window title bug.
Solution by user Holms:
Another solution is to set the locale in ~/.xinitrc. For example, I use Cyrillic most of the time. If you put this in your ~/.xinitrc
| File: ~/.xinitrc |
export LANG="ru_RU.UTF-8"
export LC_ALL="ru_RU.UTF-8" |
then all window titles will be in Unicode and your locale will be Russian; set this to your own country. It may be wiser to use en_US.UTF-8 instead, because otherwise all programs will display everything in your language instead of English. The UTF-8 suffix tells the system which encoding you use by default. Some people also add the same two lines to ~/.bashrc (this did not help me). Do not forget to configure your locales in /etc/locale.gen; if you haven't configured them yet, read about locales in the Gentoo handbook. If this doesn't help, read HOWTO_Xorg_and_Fonts and do everything described in its "Emerging the necessary packages" section; at least that helped me.
BUG 2 Fluxbox takes very long to load on a utf-8 locale http://bugs.gentoo.org/show_bug.cgi?id=71747
patch for fluxbox-0.9.11 here: http://www.fluxmod.org.ua/ (patch has been merged with mainline as of 0.9.14)
A workaround is available with this alias in your ~/.bashrc:
| File: ~/.bashrc |
alias startx="LC_ALL='C' startx" |
Reset the locale to UTF-8 after Fluxbox has started:
| File: ~/.fluxbox/startup |
export LC_ALL="en_US.utf8" |
Another workaround is:
# USE="disablexmb" emerge fluxbox
It's best to add this to the /etc/portage/package.use:
x11-wm/fluxbox disablexmb
OpenOffice.org
To force OpenOffice.org to use UTF-8 (otherwise you'll have problems when entering Unicode characters) you have to set the LANGUAGE variable to an appropriate value:
| File: /etc/env.d/02locale |
LANG="de_DE.UTF-8"
...a lot of LC-Variables...
# For OpenOffice.org
LANGUAGE="en_GB:en" |
Don't forget to run env-update && source /etc/profile after changing files in /etc/env.d/. You may need to log in again to apply the changes to your current environment.

