Unicode characters and corresponding LaTeX macros

Note

The following discussion covers LaTeX text mode only. For math mode, see “Unicode characters and corresponding LaTeX math mode commands”.

Aim:

  • Unique “canonical” LaTeX representation (LICR) for every Unicode character that can be represented by LaTeX (standard + commonly used packages)

  • Answer to the question: “Where can I find an official list of LaTeX macros for Unicode characters?”

Rationale:

Define a 7-bit ASCII representation of Unicode text that allows lossless conversion in both directions between LaTeX code and Unicode.

Use cases:
  • *.dfu files for the utf8 inputenc option,

  • xunicode for XeTeX, EU2 font encoding for LuaTeX,

  • LaTeX frontends (LyX, Docutils, …) and editors,

  • encoding converters (recode, Python codec module, …).

inputenc’s extensible utf8 support is coupled to the declared font encodings (one *.dfu file per encoding). Because font encodings overlap, the same character can be covered by more than one of them; canonical LICRs help to avoid incompatible definitions.
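The round trip named in the rationale can be sketched in a few lines of Python. The table below is a hypothetical four-entry excerpt, not an official list, and the function names `to_licr` and `to_unicode` are made up for this illustration. The sketch assumes that every macro is written with explicit braces, so that no entry is a prefix of another.

```python
# Hypothetical excerpt of a Unicode <-> LICR table (illustration only).
LICR = {
    "ä": r'\"{a}',
    "ß": r"\ss{}",
    "œ": r"\oe{}",
    "€": r"\texteuro{}",
}
# Reverse table; this only works because each macro maps to one character.
TO_UNICODE = {macro: char for char, macro in LICR.items()}


def to_licr(text):
    """Replace every character found in the table by its macro."""
    return "".join(LICR.get(char, char) for char in text)


def to_unicode(source):
    """Replace every macro found in the table by its character."""
    result = []
    pos = 0
    while pos < len(source):
        for macro, char in TO_UNICODE.items():
            if source.startswith(macro, pos):
                result.append(char)
                pos += len(macro)
                break
        else:
            # No macro starts here: copy the character unchanged.
            result.append(source[pos])
            pos += 1
    return "".join(result)


text = "Maßstäbe: 5 €"
encoded = to_licr(text)
print(encoded)
print(to_unicode(encoded) == text)
```

The conversion is only lossless as long as the input does not already contain one of the macro strings literally; a real converter would also have to escape the backslash itself.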

1 State of the Art

2 LaTeX internal character representation (LICR)

3 Comparison with Unicode character codes

LICR                                  Unicode
------------------------------------  ------------------------------------
macros with descriptive names         code point (natural number) +
                                      Unicode Character Name

∃ aliases for convenience or due to   unique (with few exceptions)
“historic reasons” or different
naming by different packages

decorations via accent macros         decorations via combining chars
(\accent{\basechar})                  (basechar + combining char) or with
                                      pre-composed chars

features “programmed” into macros     features described by character
or via “knowledge” of the macro       classes
in other parts/packages
(e.g. @uclclist)
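The “decorations” row of the comparison can be made concrete with Python's `unicodedata` module: canonical decomposition (NFD) turns a pre-composed character into base character plus combining characters, and each combining character then corresponds to one accent macro. The three-entry table `ACCENT_MACROS` and the function name `accent_licr` are assumptions for this sketch; only the standard accent macros \' \" and \v are used.

```python
import unicodedata

# Assumed excerpt: combining character -> standard LaTeX accent macro.
ACCENT_MACROS = {
    "\u0301": r"\'",  # COMBINING ACUTE ACCENT
    "\u0308": r'\"',  # COMBINING DIAERESIS
    "\u030C": r"\v",  # COMBINING CARON
}


def accent_licr(char):
    """Express one (pre-composed or decomposed) character via accent macros."""
    decomposed = unicodedata.normalize("NFD", char)
    result = decomposed[0]  # the base character
    for mark in decomposed[1:]:
        # Wrap from the inside out; raises KeyError for unknown marks.
        result = ACCENT_MACROS[mark] + "{" + result + "}"
    return result


print(accent_licr("\u00e9"))   # pre-composed é
print(accent_licr("e\u0301"))  # e + combining acute accent
```

Both calls yield the same macro sequence, which is the point of a canonical representation: the pre-composed and the decomposed Unicode spelling collapse to one LICR.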

 

4 Naming

Some rules/suggestions for naming LICR macros:

Letters and Symbols

  • derive from the Adobe glyph name or from XML Entity Definitions for Characters, if necessary disambiguated with the prefix “text”.

  • for new names, use the Unicode Character Name as arbiter for spelling variants and alternative names (e.g. “perispomeni”, not “perispomene”)
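Using the Unicode Character Name as arbiter can be automated. The following sketch looks up the official name with Python's `unicodedata` module and reports which of several candidate spellings occurs in it; the helper name `preferred_spelling` is hypothetical.

```python
import unicodedata


def preferred_spelling(char, candidates):
    """Return the candidate spellings that occur in the Unicode name of char."""
    name = unicodedata.name(char).lower()
    return [word for word in candidates if word.lower() in name]


# U+0342 is COMBINING GREEK PERISPOMENI
print(unicodedata.name("\u0342"))
print(preferred_spelling("\u0342", ["perispomeni", "perispomene"]))
```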

Decorations/Diacritics/Accents

  • standard accent macros (\DeclareTextAccent definitions in latex/base/...) are single-character macros (\' \" ... \u \v ...).

  • tipa.sty and ucs use the “text” prefix also for accents. However, the Adobe Glyph List For New Fonts maps, e.g., “tonos” and “dieresistonos” to U+0384 GREEK TONOS and U+0385 GREEK DIALYTIKA TONOS, which are spacing characters; hence \texttonos and \textdieresistonos should be spacing characters as well.

  • textcomp (ts1enc.def) defines \capital... accents (i.e. without text prefix).

  • How about a common prefix \accent... or postfix \...Accent?
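Whether a name should denote an accent macro or a spacing character can be checked against the Unicode general category of the character it is mapped to. A minimal sketch, with the hypothetical helper `is_spacing`:

```python
import unicodedata


def is_spacing(char):
    """True unless char is a combining mark (general category Mn, Mc or Me)."""
    return not unicodedata.category(char).startswith("M")


for char in ("\u0384", "\u0385", "\u0301"):
    print(f"U+{ord(char):04X}", unicodedata.name(char), is_spacing(char))
```

GREEK TONOS and GREEK DIALYTIKA TONOS are reported as spacing, whereas COMBINING ACUTE ACCENT is not.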

References

XML Entity Definitions for Characters

W3C Recommendation 01 April 2010

This document defines several sets of names, so that to each name is assigned a Unicode character or sequence of characters.

Adobe Glyph List For New Fonts (AGLFN)

List of base glyph names that are compatible with the Adobe Glyph Naming convention.

The Adobe Glyph List also lists legacy glyph names used in existing fonts.