
6 Encodings

a2ps tries to support the encodings its users commonly need. This chapter presents what an encoding is, how encoding support is handled within a2ps, and some of the encodings it supports.

6.1 What is an Encoding

This section is actually taken from the web pages of Alis Technologies inc.

Document encoding is the most important but also the most sensitive and explosive topic in Internet internationalization. It is an essential factor since most of the information distributed over the Internet is in text format. But the history of the Internet is such that the predominant - and in some cases the only possible - encoding is the very limited ASCII, which can represent only a handful of languages, only three of which are used to any great extent: English, Indonesian and Swahili.

All the other languages, spoken by more than 90% of the world's population, must fall back on other character sets. And there is a plethora of them, created over the years to satisfy writing constraints and constantly changing technological limitations. The ISO international character set registry contains only a small fraction; IBM's character registry is over three centimeters thick; Microsoft and Apple each have a bunch of their own, as do other software manufacturers and editors.

The problem is not that there are too few but rather too many choices, at least whenever Internet standards allow them. And the surplus is a real problem; if every Arabic user made his own choice among the three dozen or so codes available for this language, there is little likelihood that his "neighbor" would do the same and that they would thus be able to understand each other. This example is rather extreme, but it does illustrate the importance of standards in the area of internationalization. For a group of users sharing the same language to be able to communicate,

  1. the code used in the shared document must always be identified (labeling);
  2. they must agree on a small number of codes - only one, if possible (standards);
  3. their software must recognize and process all codes (versatility).

Certain character sets stand out either because of their status as an official national or international standard, or simply because of their widespread use.

First off, there is the ISO 8859 series of standards, which standardizes a dozen character sets that are useful for a large number of languages using the Latin, Cyrillic, Arabic, Greek and Hebrew alphabets. These standards have a limited range of application (8 bits per character, a maximum of 190 characters, no combining) but where they suffice (as they do for 10 of the 20 most widely used languages), they should be used on the Internet in preference to other codes. For all other languages, national standards should preferably be chosen or, if none are available, a well-known and widely-used code should be the second choice.
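
To see why labeling matters even within the ISO 8859 series, the following sketch decodes one and the same byte under four parts of the standard. It is only an illustration and is unrelated to a2ps itself: it relies on the codecs shipped with Python's standard library, and the byte value 0xF1 is an arbitrary choice.

```python
import unicodedata

# One byte, four ISO 8859 parts, four different characters.
raw = bytes([0xF1])  # arbitrary sample byte from the upper half of the table

decoded = {}
for label in ("iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"):
    char = raw.decode(label)
    decoded[label] = char
    # Print the code point and the official Unicode name of the character.
    print(f"{label}: U+{ord(char):04X} {unicodedata.name(char)}")
```

Without a label saying which part was used, a reader of the file cannot tell which of the four letters the author meant.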

Even when we limit ourselves to the most widely used standards, the overabundance remains considerable, and this significantly complicates life for truly international software developers and users of several languages, especially when such languages can only be represented by a single code. It was to resolve this problem that both Unicode and the ISO 10646 International standard were created. Two standards? Oh no! Their designers soon realized the problem and were able to cooperate to the extent of making the character set repertoires and coding identical.

ISO 10646 (and Unicode) contain over 30,000 characters capable of representing most of the living languages within a single code. All of these characters, except for the Han (Chinese characters also used in Japanese and Korean), have a name. And there is still room to encode the missing languages as soon as enough of the necessary research is done. Unicode can be used to represent several languages, using different alphabets, within the same electronic document.
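
As a small illustration of that last point, the sketch below checks which encodings can hold a string that mixes Latin, Greek and Cyrillic letters. It is an assumption-laden example rather than anything a2ps does: the sample text and the helper `can_encode` are made up for the occasion, and UTF-8 stands in as one common way to serialize Unicode.

```python
# Latin, Greek and Cyrillic letters in a single string.
text = "café, Ελλάδα, Москва"


def can_encode(s, encoding):
    """Return True if every character of s exists in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for encoding in ("ascii", "iso-8859-1", "iso-8859-5", "iso-8859-7", "utf-8"):
    print(encoding, can_encode(text, encoding))
```

Each 8-bit set covers at most one of the three alphabets used here, so only the Unicode-based encoding accepts the whole string.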

6.2 Encoding Files

Support for encodings in a2ps is kept entirely outside the program code. That is to say, adding, removing or changing anything in its support for an encoding requires no programming, and can be done by someone who is not a programmer.

See section 6.1 What is an Encoding, if you want to know more about this.

6.2.1 Encoding Map File

See section 5.2 Map Files, for a description of the map files.

The meaningful lines of the `encoding.map' file have the form:

alias      key
iso-8859-1 latin1
latin1     latin1
l1         latin1

where

alias
specifies any name under which the encoding may be used. It influences the option `--encoding', but also the encodings dynamically required, as for instance in the mail style sheet (support for MIME). When an encoding is requested, the lower case version of the requested name must be equal to alias.
key
specifies the prefix of the file describing the encoding (`key.edf', section 6.2.2 Encoding Description Files).
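
The lookup described above can be sketched as follows. This is not the code a2ps uses: the function names `parse_encoding_map` and `edf_file_for` are hypothetical, the sample text is the three example lines shown earlier, and treating `#' as a comment marker is an assumption.

```python
def parse_encoding_map(text):
    """Return a dict mapping each alias to its key."""
    aliases = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            continue  # not an "alias key" line; ignored in this sketch
        alias, key = fields
        aliases[alias] = key
    return aliases


def edf_file_for(requested, aliases):
    """Lower-case the requested name, then look up the description file."""
    key = aliases.get(requested.lower())
    return None if key is None else key + ".edf"


SAMPLE = """\
iso-8859-1 latin1
latin1     latin1
l1         latin1
"""

aliases = parse_encoding_map(SAMPLE)
print(edf_file_for("ISO-8859-1", aliases))
print(edf_file_for("L1", aliases))
```

Both requests resolve to the same description file, because the requested name is folded to lower case before it is compared with the aliases.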

6.2.2 Encoding Description Files

The encoding description file describing the encoding key is named `key.edf'. It is subject to the same rules as any other a2ps file.

The entries are:

`Name:'
Specifies the full name of the encoding. Please try to use the official name if there is one.
Name: ISO-8859-1
`Documentation/EndDocumentation'
Introduces the documentation on the encoding (see section 5.1 Documentation Format). It typically gives the other important names of the encoding and the languages it covers.
Documentation
Also known as ISO Latin 1, or Latin 1.  It is a superset
of ASCII, and covers most West-European languages.
EndDocumentation
`Substitute:'
Introduces a font substitution. The most common fonts (e.g., Courier, Times-Roman...) do not support many encodings (for instance, they do not support Latin 2). So that Latin 2 users need not replace every call to Courier, a2ps lets you specify that whenever a given font is requested in this encoding, another font should be used instead. For instance, in `iso2.edf' one can read:
# Fonts from Ogonkify offer full support of ISO Latin 2
Substitute: Courier              Courier-Ogonki
Substitute: Courier-Bold         Courier-Bold-Ogonki
Substitute: Courier-BoldOblique  Courier-BoldOblique-Ogonki
Substitute: Courier-Oblique      Courier-Oblique-Ogonki
`Default:'
Introduces the name of the font to use when a requested font (one not substituted as per the previous item) provides too poor a support of the encoding. The Courier equivalent is the best choice.
Default: Courier-Ogonki
`Vector:'
Introduces the PostScript encoding vector, that is, the list of the 256 PostScript names of the characters. Note that only printable characters are named in PostScript (e.g., `bell' in ASCII (^G) should not be named). Use the special name `.notdef' when a character is not printable. Warning: make sure to use real, official PostScript names. Names such as `c123' are probably a sign that you are using unofficial names. On the other hand, PostScript names such as `afii8879' are common.
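
The entries above can be read with a few lines of code. The sketch below is only an illustration, not the parser a2ps uses: `parse_edf' and `resolve_font' are hypothetical names, the sample file is abbreviated, and it assumes that `#' starts a comment, that the 256 names follow `Vector:' separated by white space, and that `Vector:' is the last entry. The `supported' argument, a set of fonts known to cover the encoding, is also an assumption, since how a2ps decides that a font's support is too poor is not described here.

```python
def parse_edf(text):
    """Parse the entries of an encoding description into a dict."""
    edf = {"name": None, "documentation": [], "substitute": {},
           "default": None, "vector": []}
    section = None  # "doc" or "vector" while inside a multi-line entry
    for raw in text.splitlines():
        if section == "doc":
            if raw.strip() == "EndDocumentation":
                section = None
            else:
                edf["documentation"].append(raw)
            continue
        line = raw.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if not line:
            continue
        if section == "vector":
            edf["vector"].extend(line.split())
        elif line == "Documentation":
            section = "doc"
        elif line.startswith("Name:"):
            edf["name"] = line[len("Name:"):].strip()
        elif line.startswith("Substitute:"):
            font, replacement = line[len("Substitute:"):].split()
            edf["substitute"][font] = replacement
        elif line.startswith("Default:"):
            edf["default"] = line[len("Default:"):].strip()
        elif line.startswith("Vector:"):
            section = "vector"  # assumption: the vector is the last entry
            edf["vector"].extend(line[len("Vector:"):].split())
    return edf


def resolve_font(font, edf, supported):
    """Apply Substitute first, then fall back to Default if needed."""
    if font in edf["substitute"]:
        return edf["substitute"][font]
    if font in supported:  # hypothetical set of fonts covering the encoding
        return font
    return edf["default"]


# Abbreviated sample: every character is .notdef except A, B and C.
names = [".notdef"] * 256
names[0x41:0x44] = ["A", "B", "C"]
SAMPLE = "\n".join([
    "Name: ISO-8859-2",
    "Documentation",
    "Shortened sample text for ISO Latin 2.",
    "EndDocumentation",
    "# Fonts from Ogonkify offer full support of ISO Latin 2",
    "Substitute: Courier       Courier-Ogonki",
    "Substitute: Courier-Bold  Courier-Bold-Ogonki",
    "Default: Courier-Ogonki",
    "Vector:",
    " ".join(names),
])

edf = parse_edf(SAMPLE)
print(edf["name"], len(edf["vector"]))
print(resolve_font("Courier-Bold", edf, supported=set()))
print(resolve_font("Helvetica", edf, supported=set()))
```

In this sketch a substituted font always wins, and the default is used only for fonts that are neither substituted nor listed as supported, which mirrors the order in which the two entries are described.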

6.2.3 Some Encodings

Most of the following information is a courtesy of Alis Technologies inc. and of Roman Czyborra's page about The ISO 8859 Alphabet Soup. Section 6.1 What is an Encoding gives an instructive presentation of encodings.

The known encodings are:

Encoding: ASCII (`ascii.edf')
US-ASCII.

Encoding: HPRoman (`hp.edf')
The 8-bit Roman encoding for HP.

Encoding: IBM-CP437 (`ibm-cp437.edf')
This encoding is meant to be used for PC files that contain line-drawing characters.

Encoding: IBM-CP850 (`ibm-cp850.edf')
Several characters may be missing, especially Greek letters and some mathematical symbols.

Encoding: ISO-8859-1 (`iso1.edf')
The ISO-8859-1 character set, often simply referred to as Latin 1, covers most West European languages, such as French, Spanish, Catalan, Basque, Portuguese, Italian, Albanian, Rhaeto-Romanic, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish, and English, incidentally also Afrikaans and Swahili, thus in effect also the entire American continent, Australia and the southern two-thirds of Africa. The lack of the ligatures Dutch IJ, French OE and „German“ quotation marks is considered tolerable.

The lack of the new C=-resembling Euro currency symbol U+20AC has opened the discussion of a new Latin0.

Encoding: ISO-8859-2 (`iso2.edf')
The Latin 2 character set supports the Slavic languages of Central Europe which use the Latin alphabet. The ISO-8859-2 set is used for the following languages: Czech, Croatian, German, Hungarian, Polish, Romanian, Slovak and Slovenian.

Support is provided thanks to Ogonkify.

Encoding: ISO-8859-3 (`iso3.edf')
This character set is used for Esperanto, Galician, Maltese and Turkish.

Support is provided thanks to Ogonkify.

Encoding: ISO-8859-4 (`iso4.edf')
Some letters were added to the ISO-8859-4 to support languages such as Estonian, Latvian and Lithuanian. It is an incomplete precursor of the Latin 6 set.

Support is provided thanks to Ogonkify.

Encoding: ISO-8859-5 (`iso5.edf')
The ISO-8859-5 set is used for various forms of the Cyrillic alphabet. It supports Bulgarian, Byelorussian, Macedonian, Serbian and Ukrainian.

The Cyrillic alphabet was created by St. Cyril in the 9th century from the upper case letters of the Greek alphabet. The more ancient Glagolitic (from the ancient Slav glagol, which means "word"), was created for certain dialects from the lower case Greek letters. These characters are still used by Dalmatian Catholics in their liturgical books. The kings of France were sworn in at Reims using a Gospel in Glagolitic characters attributed to St. Jerome.

Note that Russians seem to prefer the KOI8-R character set to the ISO set for computer purposes. KOI8-R is composed using the lower half (the first 128 characters) of the corresponding American ASCII character set.

Encoding: ISO-8859-7 (`iso7.edf')
ISO-8859-7 was formerly known as ELOT-928 or ECMA-118:1986. It is meant for modern Greek.

Encoding: ISO-8859-9 (`iso9.edf')
The ISO 8859-9 set, or Latin 5, replaces the rarely used Icelandic letters from Latin 1 with Turkish letters.

Support is provided thanks to Ogonkify.

Encoding: ISO-8859-10 (`iso10.edf')
Latin 6 (or ISO-8859-10) adds the last letters from Greenlandic and Lapp which were missing in Latin 4, and thereby covers all Scandinavia.

Support is provided thanks to Ogonkify.

Encoding: ISO-8859-15 (`iso15.edf')
The new Latin9 nicknamed Latin0 aims to update Latin1 by replacing some less needed symbols (some fractions and accents) with forgotten French and Finnish letters and placing the U+20AC Euro sign in the cell of the former international currency sign.

Very few fonts can print the Euro sign yet.

Encoding: KOI8 (`koi8.edf')
KOI-8 (+Λλ) is a subset of ISO-IR-111 that can be used in Serbia, Belarus etc.

Encoding: MS-CP1250 (`ms-cp1250.edf')
Microsoft's CP-1250 encoding (aka CeP).

Encoding: Macintosh (`mac.edf')
The Macintosh encoding. Support is incomplete, and many characters (especially Greek letters) may be missing from the printed output.
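
Several of the entries above describe one set as a variant of another, for instance ISO-8859-15 as an update of ISO-8859-1. A quick way to see exactly which cells differ is sketched below. It is independent of a2ps and of its `.edf' files: it uses the codecs of Python's standard library, and `differing_bytes' is a hypothetical helper name.

```python
def differing_bytes(encoding_a, encoding_b):
    """Return the byte values that the two 8-bit encodings decode differently."""
    return [
        value
        for value in range(256)
        if bytes([value]).decode(encoding_a) != bytes([value]).decode(encoding_b)
    ]


changed = differing_bytes("iso-8859-1", "iso-8859-15")
print([hex(value) for value in changed])

# The cell of the former international currency sign now holds the Euro sign.
print(hex(ord(bytes([0xA4]).decode("iso-8859-15"))))
```

The same helper can be pointed at any other pair of 8-bit codecs known to Python, for example ISO-8859-1 and ISO-8859-9.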
