UTF-8

UTF-8 can encode every Unicode character, so it supports all languages and alphabets, including Asian languages with their large character sets. It is a widely supported and flexible character encoding.

It's fairly simple to enable UTF-8 on your wiki pages. Current PmWiki versions ship with the UTF-8 script, and the sample-config.php file enables it by default.

Enabling UTF-8 on a new wiki

If you start a new wiki in any language with the latest PmWiki version, it is highly recommended to enable UTF-8. In the future, PmWiki will change to use the UTF-8 encoding by default, so if you already use it, you will not need a complex "migration" to UTF-8 later.

To enable UTF-8 for a new wiki, add this line near the beginning of config.php (the docs/sample-config.php file has this line already):

  include_once("scripts/xlpage-utf-8.php");

This line should come before a call to the XLPage() function in international wikis.
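
As a sketch of that ordering in an international wiki, the fragment below loads UTF-8 support first and the translation page second. The language code 'fr' and the page name PmWikiFr.XLPage are example values only; substitute whatever translation page your wiki actually uses.

```php
<?php if (!defined('PmWiki')) exit();
# config.php (sketch): load UTF-8 support first ...
include_once("scripts/xlpage-utf-8.php");

# ... and only then load the translation strings.
# 'fr' and 'PmWikiFr.XLPage' are example values, not required names.
XLPage('fr', 'PmWikiFr.XLPage');
```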

Save your config.php file encoded as UTF-8 without a byte order mark (BOM). This lets you enter UTF-8 encoded characters in it. Make sure your editor supports this, and test by adding some non-ASCII characters and checking that they display correctly in the text editor (see Notes).
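
If you are unsure whether your editor wrote a BOM, a small standalone script can check the first three bytes of the file. This is a minimal sketch and not part of PmWiki: the function name has_utf8_bom is made up for illustration, and the path local/config.php assumes the usual PmWiki layout.

```php
<?php
// Returns true when the file starts with the UTF-8 byte order mark (EF BB BF).
function has_utf8_bom(string $path): bool {
    // Read only the first three bytes of the file.
    $head = file_get_contents($path, false, null, 0, 3);
    return $head === "\xEF\xBB\xBF";
}

// Hypothetical usage: adjust the path to where your config.php lives.
$path = 'local/config.php';
if (is_file($path)) {
    echo has_utf8_bom($path)
        ? "BOM found: re-save the file as UTF-8 without BOM\n"
        : "No BOM found\n";
}
```

Run it from the wiki's root directory on the command line, for example with the php CLI.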

With UTF-8 enabled, you can also use the classes rtl and ltr, which set the text direction to right-to-left or left-to-right. This is useful for including right-to-left scripts such as Arabic, Farsi (Persian), Hebrew, Urdu and others.

Enabling UTF-8 on existing wikis

Currently, this is possible only if your group and page names, as well as upload names, don't contain international characters. The names of wiki pages are used as file names, and we don't yet have an easy way to rename the files on disk.

If your wiki doesn't have international page/file names, first upgrade to the latest PmWiki version.

To enable UTF-8:

  1. Delete the file wiki.d/.pageindex. This file contains a cache of links and words from your pages and is used for searches and pagelists. PmWiki will rebuild it automatically with the new encoding.
  2. Add these lines near the beginning of config.php:
  include_once("scripts/xlpage-utf-8.php");
  $DefaultPageCharset = array(''=>'ISO-8859-1'); # see below

These lines should come before a call to the XLPage() function in international wikis.

The $DefaultPageCharset line is there to fix and correctly handle some pages with missing or wrong attributes, created by older PmWiki versions.

  • Most wikis in European languages are likely to be in the ISO-8859-1 encoding and should use:
    $DefaultPageCharset = array(''=>'ISO-8859-1');
  • Wikis in Czech and Hungarian are likely to be in the ISO-8859-2 encoding and should use this line instead:
    $DefaultPageCharset = array(''=>'ISO-8859-2', 'ISO-8859-1'=>'ISO-8859-2');
  • Wikis in Turkish are likely to be in the ISO-8859-9 encoding and should use this line instead:
    $DefaultPageCharset = array(''=>'ISO-8859-9', 'ISO-8859-1'=>'ISO-8859-9');
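
To see why the source charset matters, consider how one stored byte is read under each of the three encodings above. The sketch below uses PHP's iconv() purely as an illustration; it is not a description of how PmWiki converts pages internally, and the helper name legacy_to_utf8 is made up.

```php
<?php
// Convert bytes from a legacy single-byte charset to UTF-8.
// Requires the iconv extension, which most PHP builds include.
function legacy_to_utf8(string $bytes, string $charset) {
    return iconv($charset, 'UTF-8', $bytes);
}

// The same byte 0xFE is a different letter in each charset:
// ISO-8859-1: þ   ISO-8859-2: ţ   ISO-8859-9: ş
foreach (['ISO-8859-1', 'ISO-8859-2', 'ISO-8859-9'] as $charset) {
    echo $charset, ': ', legacy_to_utf8("\xFE", $charset), "\n";
}
```

Declaring the wrong charset would therefore silently turn one letter into another, which is why the Czech, Hungarian and Turkish cases need their own line.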

Support for right-to-left (RTL) languages

Languages like Arabic, Hebrew, Farsi (Persian), Urdu and others are written in scripts that flow from right to left. The classes rtl and ltr can be used to set the direction of a block of text independently of the general text direction of the page, for example:

>>rtl<<
يتدفق هذا النص من اليمين إلى اليسار
>>ltr<<
This text flows left to right.
>><<

يتدفق هذا النص من اليمين إلى اليسار

This text flows left to right.

To set text direction for a wiki generally to RTL, you could add to config.php a line like:

$HTMLStylesFmt['rtl'] = " body { direction:rtl; }";

but the skin you use may need other modifications, for instance to swap the search box and the page actions to the other side etc.

Some skins have full support for RTL, see for instance Amber.

Using UTF-8 in page names and URLs

Enabling UTF-8 allows you to use international characters in page names (for file names, see $UploadNameChars).

There are good reasons to use UTF-8 in page names:

  • Easier configuration that works out of the box.
  • Easier management of page titles (no need to add a (:title ...:) directive).
  • The possibility to have distinct pages for differently accented words, for example in a dictionary or vocabulary wiki.
  • Better SEO if your URLs match certain search terms.

Also, while the URLs may contain URL-encoded international characters when you copy them, modern browsers display the actual characters in the URL bar, and major search engines understand international URLs and show them decoded in the search results.

On the other hand, some people may prefer to restrict page names and file names to plain ASCII characters, especially if the language mostly uses the Latin alphabet.

One reason may be to have plain URLs like your-wiki.org/Francais/Champs-Elysees instead of your-wiki.org/Fran%C3%A7ais/Champs-%C3%89lys%C3%A9es (for a page named [[Français.Champs-Élysées]]). See also ISO8859MakePageNamePatterns.
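
The encoded form shown above is ordinary percent-encoding of the UTF-8 bytes of each path segment. A minimal sketch in plain PHP reproduces it; the helper name encode_page_path is made up for illustration and is not a PmWiki function.

```php
<?php
// Percent-encode each segment of a page path, keeping the "/" separators.
function encode_page_path(string $path): string {
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Save this script as UTF-8 so the literal below holds UTF-8 bytes.
echo encode_page_path('Français/Champs-Élysées'), "\n";
// prints: Fran%C3%A7ais/Champs-%C3%89lys%C3%A9es
```

Each accented letter takes two bytes in UTF-8, so it becomes two %XX groups in the URL.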

Today there shouldn't be many problems in using international characters in page and file names in UTF-8. But if some day you change servers or operating systems, plain Latin letters are more portable and there is less risk that something breaks.

PmWiki automatically converts page text and metadata between encodings, but at the moment cannot automatically rename page files and attachments.

If you already have international characters in file names (page names, uploads), after enabling UTF-8, review your wiki pages and links - you may need to rename some of the files on the server.
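
To find the files that may need attention, you could list the page files whose names contain non-ASCII bytes. The following standalone sketch is not part of PmWiki; the function name non_ascii_names is made up, and it assumes pages are stored in the default wiki.d directory. It only reports names and renames nothing.

```php
<?php
// Return the entries of $dir whose names contain bytes outside plain ASCII.
function non_ascii_names(string $dir): array {
    $found = [];
    if (!is_dir($dir)) return $found;
    foreach (scandir($dir) as $name) {
        if ($name === '.' || $name === '..') continue;
        // Any byte above 0x7F means the name is not pure ASCII.
        if (preg_match('/[^\x00-\x7F]/', $name)) $found[] = $name;
    }
    return $found;
}

// Assumed default page store; run from the wiki's root directory.
foreach (non_ascii_names('wiki.d') as $name) {
    echo $name, "\n";
}
```

Uploads are typically stored in subdirectories, so checking them the same way would need one call per subdirectory.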

Notes

  • You need to save your config.php file in the UTF-8 encoding, and "Without Byte Order Mark (BOM)". See Character encoding of config.php.
  • This page concerns the most recent versions of PmWiki. See Cookbook:UTF-8 for tips on older versions.
  • If your PmWiki installation displays the wrong encoding, or saves a UTF-8 page in another encoding for no apparent reason, double-check any custom .htaccess settings at the root of your served pages.


This page may have a more recent version on pmwiki.org: PmWiki:UTF-8, and a talk page: PmWiki:UTF-8-Talk.

Page last modified on February 10, 2023, at 08:04 PM