The file unimathsymbols.txt contains a mapping between Unicode math characters and LaTeX math control sequences. For historical reasons and because of conceptual differences, this mapping is sometimes ambiguous and incomplete.
This file explains the data format, using a parser written as a Python module as an example. Load it in the usual Python way:
>>> import parse_unimathsymbols
>>> from parse_unimathsymbols import *
Paths to the data files:
datafilename = '../data/unimathsymbols.txt'
mathtypes_filename = '../data/category2mathtype.txt'
packages_filename = '../data/packages.txt'
Lines starting with a NUMBER SIGN are ignored. Number signs at later positions do not start a comment.
comment_char = '#'
Some packages provide (almost) all symbols of another package. These "superpackages" are listed in the file packages.txt (cf. datafilenames) in ConfigParser format:
import ConfigParser
superpackages = ConfigParser.RawConfigParser()
superpackages.optionxform = str # case sensitive options
superpackages.read(packages_filename)
Sections correspond to (sub)packages:
>>> s = superpackages.sections()
>>> s.sort()
>>> s[:4]
['amsfonts', 'amsmath', 'amssymb', 'bbold']
Options are packages providing the commands of the containing section:
>>> superpackages.options('txfonts')
['kmath', 'kpfonts', 'pxfonts']
Option values are exceptions, i.e. commands not provided by the superpackage:
>>> superpackages.items('txfonts')
[('kmath', ''), ('kpfonts', '\\mathcent\n\\invamp; mirrored instead of turned'), ('pxfonts', '')]
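The section/option/value layout can be tried out without the real packages.txt. The following is a minimal Python 3 sketch (the module itself is Python 2 code) that feeds a small inline sample to `configparser`. Only the option names and values are taken from the doctest output above; the line layout of `SAMPLE` is an assumption.

```python
import configparser

# Assumed excerpt in the layout of packages.txt; only the values are taken
# from the doctest output, the exact line layout is a guess.
SAMPLE = r"""
[txfonts]
kmath =
kpfonts = \mathcent
    \invamp; mirrored instead of turned
pxfonts =
"""

parser = configparser.RawConfigParser()
parser.optionxform = str  # keep option names case sensitive
parser.read_string(SAMPLE)

# Options are the superpackages, values are the excepted commands.
options = parser.options('txfonts')
exceptions = dict(parser.items('txfonts'))
print(options)
print(repr(exceptions['kpfonts']))
```

An indented continuation line becomes part of the preceding value, which is how one superpackage can list several exceptions.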
The is_supported() and provided_by() methods of a UniMathEntry can be used to query the superpackage data.
Classes for storing metadata for one character (UniMathEntry) and a set of characters (Table) and a function to generate a new UniMathEntry with default data (new_entry).
Data structure representing one character. Initialized with a string in the format of the data file.
The function read_data() (below) creates UniMathEntry instances for the lines in the datafile (data fields separated by the delimiter and optional whitespace):
>>> data = read_data()
>>> type(data[0x24])
<class 'parse_unimathsymbols.UniMathEntry'>
class UniMathEntry(object):
    def __init__(self, line):
        """Parse one `line` of unimathsymbols.txt"""
        self.delimiter = delimiter
        fields = [i.strip() for i in line.split(delimiter)]
        (self.codepoint,    # Unicode Number
         self.utf8,         # literal character in UTF-8 encoding
         self.cmd,          # LaTeX command
         self.unicode_math, # macro of the unicode-math package
         self.math_class,   # Unicode math character class
         self.category,     # math category of the symbol
         self.requirements, # package(s) providing the command
         self.comment       # aliases and comments
        ) = fields
Convert code point to integer (other values remain strings):
        self.codepoint = int(self.codepoint, 16)
Add missing literal characters (e.g. the delimiter):
        if not self.utf8:
            self.utf8 = unichr(self.codepoint).encode('utf8')
Example:
>>> data[0x24].codepoint, data[0x24].utf8, data[0x24].cmd
(36, '$', '\\$')
Missing literal characters are regenerated:
>>> data[ord(delimiter)].utf8
'^'
The string representation of a data entry is the data line (without the optional whitespace):
    def __str__(self):
        # do not include the delimiter as literal character
        if self.utf8 == delimiter:
            self.utf8 = ''
        return self.delimiter.join(('%05X' % self.codepoint,
                                    self.utf8,
                                    self.cmd,
                                    self.unicode_math,
                                    self.math_class,
                                    self.category,
                                    self.requirements,
                                    self.comment
                                   ))
Examples:
>>> str(data[0x24])
'00024^$^\\$^\\mathdollar^N^mathord^^= \\mathdollar, DOLLAR SIGN'
>>> print data[0xA5]
000A5^¥^\yen^\yen^N^mathord^amsfonts^YEN SIGN
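The parsing and formatting steps described so far can be condensed into a small stand-alone sketch. It is written for Python 3 (the module itself is Python 2 code), and the names `Fields`, `parse_line`, `format_fields` and `DELIMITER` are made up for this illustration. The delimiter `'^'` is assumed from the examples above.

```python
from collections import namedtuple

DELIMITER = '^'  # assumed from the examples; hypothetical constant name

Fields = namedtuple('Fields', 'codepoint utf8 cmd unicode_math math_class '
                              'category requirements comment')

def parse_line(line):
    """Split one data line into its eight fields (whitespace is stripped)."""
    parts = [part.strip() for part in line.split(DELIMITER)]
    if len(parts) != 8:
        raise ValueError('expected 8 fields, got %d' % len(parts))
    codepoint = int(parts[0], 16)
    # regenerate a missing literal character from the code point
    utf8 = parts[1] or chr(codepoint)
    return Fields(codepoint, utf8, *parts[2:])

def format_fields(fields):
    """Return the data line for `fields` (without optional whitespace)."""
    # the delimiter itself is never written as literal character
    utf8 = '' if fields.utf8 == DELIMITER else fields.utf8
    return DELIMITER.join(('%05X' % fields.codepoint, utf8) + tuple(fields[2:]))

yen = parse_line(r'000A5 ^ ¥ ^ \yen ^ \yen ^N^mathord^amsfonts^YEN SIGN')
print(yen.codepoint, yen.cmd, yen.requirements)
print(format_fields(yen))
```

Unlike the `__str__` method above, `format_fields` leaves its argument unchanged; that is a choice of this sketch, not a description of the module.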
Some symbols are provided by LaTeX packages which must be loaded to prevent an Undefined control sequence error.
Recursively expand the package list in self.requirements with superpackages:
    def provided_by(self, providers=None):
        """Return sorted list of packages providing the symbol via `self.cmd`.

        (The optional argument `providers` is used for recursion.)
        """
        if not providers:
            providers = [pkg for pkg in self.requirements.split()
                         if not pkg.startswith('-')]
        # Add "superpackages" providing the command:
        for pkg in providers[:]:
            if pkg not in superpackages.sections():
                continue
            for (superpkg, exceptions) in superpackages.items(pkg):
                # skip, if `self.cmd` contains a cmd from the exceptions
                # (check all cmds in a sequence, trim cmd arguments):
                if [1 for match in re.finditer(r'\\.[a-zA-Z]+', self.cmd)
                    if match.group(0) in exceptions]:
                    continue
                # append and recurse
                providers.append(superpkg)
                providers.extend(self.provided_by([superpkg]))
        # Return sorted list of unique entries:
        return Table([(pkg, True) for pkg in providers]).sortedkeys()
Packages providing the \yen command:
>>> print data[0xA5].provided_by() # ¥
['MnSymbol', 'amsfonts', 'amssymb', 'oz']
Recursion: eufrak is provided by amsfonts, which in turn is provided by amssymb and related packages:
>>> print data[0x210C].requirements # ℌ
eufrak
>>> data[0x210C].provided_by()[:4]
['MnSymbol', 'amsfonts', 'amssymb', 'eufrak']
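The expansion amounts to a transitive closure over the superpackage relation. The Python 3 sketch below illustrates the idea with an explicit work list instead of recursion. The `SUPERPACKAGES` mapping is made up for this example and is not the content of packages.txt; the per-package exceptions are ignored.

```python
# Made-up superpackage relation for illustration; the real data lives in
# packages.txt and also carries per-package exceptions, ignored here.
SUPERPACKAGES = {
    'eufrak': ['amsfonts'],
    'amsfonts': ['amssymb'],
    'amssymb': ['MnSymbol'],
}

def expand(packages, superpackages=SUPERPACKAGES):
    """Return the sorted transitive closure of `packages`."""
    seen = set()
    todo = list(packages)
    while todo:
        pkg = todo.pop()
        if pkg in seen:
            continue  # already expanded, also guards against cycles
        seen.add(pkg)
        todo.extend(superpackages.get(pkg, []))
    return sorted(seen)

print(expand(['eufrak']))
```

Tracking the visited packages in a set keeps the result free of duplicates and lets the loop terminate even if two packages list each other.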
A package name in the requirements field preceded by - indicates that the package redefines the command so it no longer produces the symbol.
    def conflicts_with(self, clashes=None):
        """Return sorted list of packages redefining `self.cmd`.

        (The optional argument `clashes` is used for recursion.)
        """
        if not clashes:
            clashes = [pkg[1:] for pkg in self.requirements.split()
                       if pkg.startswith('-')]
        # Add "superpackages" providing the command:
        for pkg in clashes[:]:
            if pkg not in superpackages.sections():
                continue
            for (superpkg, exceptions) in superpackages.items(pkg):
                # skip, if `self.cmd` contains a cmd from the exceptions
                # (check all cmds in a sequence, trim cmd arguments):
                if [1 for match in re.finditer(r'\\.[a-zA-Z]+', self.cmd)
                    if match.group(0) in exceptions]:
                    continue
                # append and recurse
                clashes.append(superpkg)
                clashes.extend(self.provided_by([superpkg]))
        # Return sorted list of unique entries:
        return Table([(pkg, True) for pkg in clashes]).sortedkeys()
marvosym redefines the standard LaTeX cmd \Rightarrow to a bold arrow:
>>> data[0x21D2].requirements # ⇒
'-marvosym'
>>> data[0x21D2].conflicts_with()
['marvosym']
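How the requirements field separates providing from conflicting packages can be shown in isolation. This is a Python 3 sketch with a hypothetical helper name, `split_requirements`; it covers only the `-` prefix convention and does no superpackage expansion.

```python
def split_requirements(requirements):
    """Split a requirements field into (providers, clashes).

    Package names preceded by '-' redefine the command and are returned
    (without the prefix) as clashes; all other names are providers.
    """
    providers, clashes = [], []
    for pkg in requirements.split():
        if pkg.startswith('-'):
            clashes.append(pkg[1:])
        else:
            providers.append(pkg)
    return providers, clashes

print(split_requirements('amssymb -wrisym'))
print(split_requirements('-marvosym'))
```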
The is_supported method can be used to test whether a given set of LaTeX packages provides the symbol via self.cmd. (The supported_cmd() method also checks alias commands.)
    def is_supported(self, packages):
        """Check if entry is supported by the list of `packages`.
        """
        if not self.cmd: # no LaTeX cmd
            return False
        providers = self.provided_by() or ['']
        clashes = self.conflicts_with()
        for pkg in packages[::-1]:
            if pkg in clashes:
                return False
            if pkg in providers:
                return True
        return False
Examples:
The yen sign is not supported by lmodern or inputenc but by amssymb and its superpackages:
>>> data[0xA5].is_supported(['lmodern', 'inputenc'])
False
>>> data[0xA5].is_supported(['lmodern', 'inputenc', 'amssymb'])
True
>>> data[0xA5].is_supported(['MnSymbol']) # superpackage of amssymb
True
Standard commands are only listed as supported if the empty string is part of the package list:
>>> data[0x20D7].is_supported(['amssymb']) # \vec
False
>>> data[0x20D7].is_supported(['', 'amssymb']) # \vec
True
However, if the empty string is followed by a conflicting package, the entry is not supported.
>>> data[0x20D7].is_supported(['', 'wrisym'])
False
As in a LaTeX document, the last package (re)defining the command "wins":
>>> data[0x20D7].is_supported(['wrisym', ''])
True
Name clash between two packages:
>>> print data[0x03DC]
003DC^Ϝ^\digamma^\upDigamma^A^mathalpha^amssymb -wrisym^= \Digamma (wrisym), capital digamma
>>> print data[0x03DD]
003DD^ϝ^\digamma^\updigamma^A^mathalpha^wrisym -amssymb^GREEK SMALL LETTER DIGAMMA
>>> data[0x03DC].is_supported(['wrisym'])
False
>>> data[0x03DC].is_supported(['amssymb', 'wrisym'])
False
>>> data[0x03DC].is_supported(['wrisym', 'amssymb'])
True
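The "last package wins" rule can be reproduced with plain lists. The following Python 3 sketch uses a hypothetical free function, `last_wins`, that takes the provider and clash lists as arguments; it skips the superpackage expansion and the check for an empty command that the method above performs.

```python
def last_wins(packages, providers, clashes):
    """Return True if the last relevant package in `packages` is a provider.

    The package list is scanned from the end: the first package found
    that either redefines or provides the command decides the result.
    """
    # a command without providers is a standard command, "provided" by ''
    providers = providers or ['']
    for pkg in reversed(packages):
        if pkg in clashes:
            return False
        if pkg in providers:
            return True
    return False

# capital digamma: provided by amssymb, redefined by wrisym
print(last_wins(['amssymb', 'wrisym'], ['amssymb'], ['wrisym']))
print(last_wins(['wrisym', 'amssymb'], ['amssymb'], ['wrisym']))
```

The two calls differ only in the loading order of the packages, which is exactly what decides the outcome.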
The function new_entry returns a UniMathEntry instance for the given code point with default data for number, utf8, mathtype (converted from unicodedata.category), and comment (Unicode name).
>>> print new_entry(0x00AE)
000AE^®^^^^mathord^^REGISTERED SIGN
>>> unicodedata.category(unichr(0x02c6))
'Lm'
>>> print new_entry(0x02C6)
002C6^ˆ^^^^mathalpha^^MODIFIER LETTER CIRCUMFLEX ACCENT
def new_entry(number, delimiter=delimiter):
    """Return a new UniMathEntry for Unicode char with `number`

    Raise ValueError, if there is no Unicode character with that number.
    """
    # Mapping from Unicode category to LaTeX math category:
    mathtypes = ConfigParser.SafeConfigParser()
    mathtypes.read(mathtypes_filename)
    uc = unichr(number)
    utf8 = uc.encode('utf-8')
    try:
        mathtype = mathtypes.get('mathtypes', unicodedata.category(uc))
    except ConfigParser.NoOptionError:
        mathtype = ''
    # special cases:
    if utf8 == delimiter:
        utf8 = ' '
    if unicodedata.combining(uc):
        utf8 = 'x' + utf8
        # mathtype = 'mathaccent'
    line = delimiter.join(('%05X' % number,
                           # do not print the literal char if it is the delimiter:
                           utf8,
                           '', # command,
                           '', # unicode-math,
                           '', # math character class
                           mathtype,
                           '', # requirements,
                           unicodedata.name(uc) # comment
                          ))
    return UniMathEntry(line)
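The default values come straight from the `unicodedata` module, which makes them easy to reproduce. Below is a Python 3 sketch with the hypothetical names `default_line`, `MATHTYPES` and `DELIMITER`. The two category mappings are read off the doctests above; category2mathtype.txt presumably defines more, and the delimiter `'^'` is assumed from the examples.

```python
import unicodedata

DELIMITER = '^'  # assumed from the examples
# Two mappings read off the doctests; category2mathtype.txt presumably
# contains more. Unknown categories map to the empty string.
MATHTYPES = {'So': 'mathord', 'Lm': 'mathalpha'}

def default_line(number):
    """Return a data line with default values for code point `number`.

    Raises ValueError if the character has no Unicode name.
    """
    char = chr(number)
    mathtype = MATHTYPES.get(unicodedata.category(char), '')
    # the delimiter is never written as literal character
    literal = '' if char == DELIMITER else char
    if unicodedata.combining(char):
        literal = 'x' + literal  # give combining characters a base letter
    return DELIMITER.join(('%05X' % number, literal, '', '', '',
                           mathtype, '', unicodedata.name(char)))

print(default_line(0x00AE))
print(default_line(0x02C6))
```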
As the math characters are a subset of Unicode with "gaps" between the character numbers, the Python representation uses a dictionary with additional sorting methods (while, e.g., Lua could use the standard Table data type):
>>> ts = Table({'zero': 0, 'one': 1})
>>> tn = Table({0: 'zero', 1: 'one'})
To get a list of sorted keys, do
>>> print ts.sortedkeys(), tn.sortedkeys()
['one', 'zero'] [0, 1]
Iterating over the Table is done using the sorted keys:
>>> print [(key, value) for key, value in ts]
[('one', 1), ('zero', 0)]
class Table(dict):
    def sortedkeys(self):
        """Return sorted list of keys"""
        keys = self.keys()
        keys.sort()
        return keys
    def __iter__(self):
        """Return iterator over sorted (key, value) pairs"""
        for key in self.sortedkeys():
            yield key, self[key]
The add_unique function is used to avoid overwriting existing values:
>>> ts.add_unique('one', 1.)
>>> print [(key, value) for key, value in ts]
[('one', 1), ('one~', 1.0), ('zero', 0)]
    def add_unique(self, key, value):
        """Add `value`. Make `key` unique.

        Expects a string key. Appends "~" until the key is unique
        ('~' sorts after letters).
        """
        while key in self:
            key += '~'
        self[key] = value
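For readers on Python 3, where `dict.keys()` returns a view without a `sort()` method, the same idea might look as follows. This is a sketch under a different name (`SortedTable`), not the module's class. As a deliberate simplification it leaves `__iter__` untouched and offers a separate `sorted_items()` method instead.

```python
class SortedTable(dict):
    """Dictionary with sorted access (Python 3 variant of the idea)."""

    def sortedkeys(self):
        """Return sorted list of keys."""
        return sorted(self.keys())

    def sorted_items(self):
        """Yield (key, value) pairs in key order."""
        for key in self.sortedkeys():
            yield key, self[key]

    def add_unique(self, key, value):
        """Store `value` under `key`, appending '~' until the key is unused."""
        while key in self:
            key += '~'
        self[key] = value

table = SortedTable({'zero': 0, 'one': 1})
table.add_unique('one', 1.0)
print(list(table.sorted_items()))
```

Keeping the standard iteration protocol means that `for key in table` and `list(table)` still behave like they do for any other dictionary.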
Functions for reading and writing the data file.
def read_data(path=datafilename):
    """Return Table of data entries in the data file.
    """
    datafile = file(path, 'r')
    data = Table()
Read lines and add UniMathEntry instances to the data table. Skip comments and empty lines. Use the Unicode character number as key:
    for line in datafile:
        if line.startswith(comment_char) or not line.strip():
            continue
        try:
            entry = UniMathEntry(line)
        except:
            print "error in line", line
            raise
        data[entry.codepoint] = entry
Close and return:
    datafile.close()
    return data
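The reading loop can be exercised on in-memory data. The Python 3 sketch below uses a hypothetical function, `read_entries`, that stores plain field lists instead of UniMathEntry instances. The sample lines are partly made up; in particular the NUMBER SIGN line only serves to show that a `#` at a later position does not start a comment. The delimiter `'^'` is assumed from the examples.

```python
import io

COMMENT_CHAR = '#'
DELIMITER = '^'  # assumed from the examples

SAMPLE = io.StringIO(
    '# comment line, ignored\n'
    '\n'
    '00023^#^\\#^^^^^NUMBER SIGN\n'  # made-up entry: "#" is data here
    '000A5^¥^\\yen^\\yen^N^mathord^amsfonts^YEN SIGN\n'
)

def read_entries(lines):
    """Return a dict mapping code point to the list of stripped fields."""
    entries = {}
    for line in lines:
        # skip comment lines and empty lines
        if line.startswith(COMMENT_CHAR) or not line.strip():
            continue
        fields = [field.strip() for field in line.split(DELIMITER)]
        entries[int(fields[0], 16)] = fields
    return entries

entries = read_entries(SAMPLE)
print(sorted(entries))
```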
def read_header(path=datafilename):
    """Return leading comment block of the data file as list of lines.
    """
    datafile = file(path, 'r')
    header = []
    for line in datafile:
        if not line.startswith(comment_char):
            break
        header.append(line)
    datafile.close()
    return header
Write a Table instance with UniMathEntry data records (like the one returned by read_data()) to the file-like object outfile:
def write_data(data, outfile):
    """Write `data` to `outfile`"""
    try:
        outfile.write(''.join(data.header))
    except AttributeError:
        print "no data.header"
    for (key, value) in data:
        outfile.write(str(value) + '\n')
Functions to sort and filter.
Return a Table with LaTeX command as key. For each Unicode char, insert records for command and aliases.
>>> cmds = sort_by_command(data)
>>> print cmds[r'\ast']
02217^∗^\ast^\ast^B^mathbin^^ASTERISK OPERATOR (Hodge star operator)
Aliases get separate entries:
>>> print cmds[r'\neg']
000AC^¬^\neg^\neg^U^mathord^^= \lnot, NOT SIGN
>>> print cmds[r'\lnot']
000AC^¬^\lnot^\neg^U^mathord^^= \neg, NOT SIGN
def sort_by_command(data):
    """Return `data` as a Table with LaTeX command as key.
    """
    commands = Table()
    for (key, entry) in data:
        if entry.cmd: # and entry.cmd != entry.utf8:
            commands.add_unique(entry.cmd, entry)
        for alias in entry.related_commands('='):
            alias.comment = '= %s, %s' % (entry.cmd_comment(), alias.comment)
            commands.add_unique(alias.cmd, alias)
    return commands
if __name__ == '__main__':
    import sys, difflib
    header = read_header()
    data = read_data()
Add new entries:
    def add_entry(cp, cmd='', requirements=''):
        data[cp] = new_entry(cp)
        data[cp].cmd = cmd
        data[cp].requirements = requirements
    # add_entry(0x2620, r'\skull', 'arevmath')
Process data:
    # for key, entry in data:
    #     if entry.requirements != 'omlmathit':
    #     if not entry.cmd.startswith(r'\mathit'):
    #     if not re.match(r'\\[A-Z]', entry.cmd):
    #     if not entry.cmd.startswith(r'\mathbfit'):
    #     if not (re.search(r'\\mathrm\{\\[A-Z]', entry.comment)):
    #         continue
    #     print entry
    #     if not entry.comment.startswith(r'= \mathbold'):
    #         cc = '= ' + entry.cmd_comment()
    #         cc = cc.replace('mathbfit', 'mathbold')
    #         cc = cc.replace('isomath', 'fixmath')
    #         entry.comment = cc + ', ' + entry.comment
    #     entry.comment = re.sub(r'(\\mathit\{\\[A-Z][a-z]+\})', r'\1 (-fourier)',
    #                            entry.comment)
    #     cc = '= ' + entry.cmd_comment()
    #     entry.comment = cc + ', ' + entry.comment
    #     entry.cmd = cmd
    #     entry.requirements += '-fourier'
    #     # normalize white-space in comments
    #     entry.comment = ' '.join(entry.comment.split())
    # Write to outfile?:
    outfile = None
    # outfile = file('../data/unimathsymbols.txt', 'w')
    # outfile = sys.stdout
Test for differences after a read-write cycle. Whitespace adjacent to the delimiter is not significant.
    in_lines = file(datafilename, 'r').readlines()
    in_lines = [# '^'.join([field.strip(' \t') for field in line.split('^')])
                re.sub(r'[ \t]*\^[ \t]*', '^', line).rstrip() + '\n'
                for line in in_lines]
    header = [re.sub(r' *\^ *', '^', line) for line in header]
    out_lines = [str(v)+'\n' for (k,v) in data]
    diff = ''.join(difflib.unified_diff(in_lines, header + out_lines,
                                        datafilename, '*round trip*'))
    if diff:
        print diff
    else:
        print 'no differences after round trip'
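The round-trip test boils down to normalizing the whitespace around the delimiter and comparing line lists with `difflib`. Here is a self-contained Python 3 sketch of that idea; the function names `normalize` and `round_trip_diff` and the file label `'input'` are made up for the illustration.

```python
import difflib
import re

def normalize(line):
    """Remove insignificant whitespace around the '^' delimiter."""
    return re.sub(r'[ \t]*\^[ \t]*', '^', line).rstrip() + '\n'

def round_trip_diff(in_lines, out_lines):
    """Return a unified diff (empty string if the lines agree)."""
    in_lines = [normalize(line) for line in in_lines]
    return ''.join(difflib.unified_diff(in_lines, out_lines,
                                        'input', '*round trip*'))

original = ['00024 ^ $ ^ \\$ ^^^^^ DOLLAR SIGN\n']
rewritten = ['00024^$^\\$^^^^^DOLLAR SIGN\n']
print(round_trip_diff(original, rewritten) or 'no differences after round trip')
```

An empty diff means that parsing and re-serializing changed nothing but the optional whitespace.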
Write back to outfile:
    if outfile:
        data.header = header
        write_data(data, outfile)
        if outfile != sys.stdout:
            print "Output written to", outfile.name
    # for (key, entry) in sort_by_command(data):
    #     print entry
New entries:
    # for i in range(0x2336, 0x237A):
    #     print new_entry(i)
    print '%d characters' % len(data)