What are catcodes?

Rather than defining primitives for tasks as common as entering math mode or setting a superscript or subscript, Donald Knuth chose to reserve certain characters for them: for example, the dollar sign $ to enter math mode and the underscore _ to set a subscript. Other characters have a special meaning when you write a TeX document: braces and brackets delimit the arguments of commands, and the backslash \ introduces commands. In short, TeX does not treat all characters alike.

When the TeX engine reads a file, it associates two numbers with each character: a “character code” and a “category code”. TeX knows nothing about glyphs; it works purely with numbers, and that is one of its strengths. Give it a character code and it looks up that position in the table of the current font, then prints whatever glyph it finds there.

TeX uses the category code, or catcode, to parse its input intelligently. It is what lets TeX, for example, look for the matching closing brace when an opening brace { appears in a particular part of the document. This could have been hard-coded (“the opening brace always does such and such”), but Knuth chose to add a level of abstraction: any character can play any role provided it has the appropriate catcode, which makes the engine very flexible.

So when TeX reads a file, it assigns a catcode to each character it reads. How TeX then interprets the input depends on both the character and its catcode. There are 16 catcodes available to the programmer, plus one special internal code. The 16 standard codes are numbered from 0 to 15:

Code	Meaning	Typical example
0	Escape character	`\`
1	Begin group	`{`
2	End group	`}`
3	Math shift	`$`
4	Alignment tab	`&`
5	End of line	`^^M`
6	Macro parameter	`#`
7	Math superscript	`^`
8	Math subscript	`_`
9	Ignored character	`^^@` (ASCII null)
10	Space	` `
11	Letter	the alphabet: `A`, `B`… `a`, `b`…
12	“Other” character	everything else: `.`, `1`, `:`, etc.
13	Active character, interpreted as a control sequence	`~`
14	Start of comment	`%`
15	Invalid character	`^^?` (ASCII delete)

(^^M stands for the invisible character that TeX puts at the end of each input line, in place of whatever end-of-line character was already there, which depends on the operating system.)
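To check the table against a running TeX, one can ask for the catcode currently assigned to a character. The following is a minimal sketch, assuming plain TeX with its default settings (other formats may assign different values); the numbers are written to the terminal and the log, and the values in the comments are the plain TeX defaults.

```tex
% Plain TeX sketch: print the current catcode of a few characters.
\message{backslash: \the\catcode`\\}% 0
\message{open brace: \the\catcode`\{}% 1
\message{dollar: \the\catcode`\$}% 3
\message{letter a: \the\catcode`a}% 11
\message{at sign: \the\catcode`@}% 12
\message{tilde: \the\catcode`\~}% 13
\message{percent: \the\catcode`\%}% 14
\bye
```

The special characters are written as control symbols (for instance `\%`) so that they reach `\catcode` instead of acting in their usual role.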

As TeX assigns a catcode to each character, it also splits the input into lexical units (tokens). For example, if it reads:

$ 1^{23}_a $

it sees:

A math shift token, so it enters math mode
A space, which is ignored in math mode
An “other” token 1, which is simply typeset
A superscript token, meaning that the next item will be set as a superscript
A begin-group token {
The “other” tokens 2 and 3, which cannot be typeset until the group ends
The end-group token }, which lets TeX typeset the superscript
A subscript token, so the next item will be set as a subscript
The letter a, which has no special meaning and is simply typeset
A space, again ignored
A math shift token (these tokens act as a toggle: TeX enters math mode if it was not in it and leaves it if it was), so TeX returns to horizontal mode
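The pairing of a character with its category can be made visible. The sketch below, assuming plain TeX with default catcodes, copies single tokens with \let and asks \meaning to describe them; the names \tokA to \tokE are hypothetical and chosen only for this illustration.

```tex
% Plain TeX sketch: \let copies one token, \meaning describes it.
\let\tokA=$
\let\tokB=^
\let\tokC=_
\let\tokD=a
\let\tokE=1
\message{\meaning\tokA}% math shift character $
\message{\meaning\tokB}% superscript character ^
\message{\meaning\tokC}% subscript character _
\message{\meaning\tokD}% the letter a
\message{\meaning\tokE}% the character 1
\bye
```

Each description names the category first and the character second, which is the pair TeX works with once the input has been tokenized.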

In general, for catcodes 0 to 8 there is only one character per category:

\  {  }  $  &  ^^M  #  ^  _

This uniqueness is not mandatory, but it is natural enough: why would anyone need several escape characters with exactly the same effect? It would usually be a waste of characters. (An exception is described further down.)

Category 10 contains the ordinary space and also the tab character <TAB>, which behaves like a space; category 10 characters are ignored at the start of a line. Category 5 (end of line) is very special: it turns into a space unless it is immediately followed by another category 5 character, in which case it becomes the command \par (this is the trick that lets a blank line end a paragraph). In general, any unbroken run of category 10 characters is reduced to a single one, whether it consists of spaces, tabs, or converted end-of-line characters.

All letters have category 11, and punctuation characters such as ?, ( and ) have category 12. The distinction exists because of the rule for command names: a command name is either any sequence of letters (more precisely, category 11 characters) or a single character of any other category, preceded in both cases by a category 0 character. Characters of category 11 and 12 may be printed when they are not part of a command name; this is not the case for any other category code. However, a category 11 or 12 character may still not show up in print, because it is discarded during processing (for example in keywords, package options, and package or file names).

Categories 9 and 15 were put into TeX because some characters are “dangerous” (ASCII “null” and ASCII “delete”) and could be misinterpreted by editors. Category 9 also has other uses: in LaTeX3 style files the space is assigned category 9, which helps programmers avoid the dreaded “spurious spaces”.
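The effect of giving the space category 9 can be sketched in plain TeX. This illustrates only the change to the space character, not the full set of changes that LaTeX3 makes, and the macro name \demo is hypothetical.

```tex
% Plain TeX sketch: ignore spaces while a definition is being read.
\begingroup
\catcode`\ =9 % from here on, spaces in the input are ignored
\gdef\demo{a b c}
\endgroup
\message{\meaning\demo}% macro:->abc
\bye
```

The group confines the catcode change, and \gdef makes the definition survive the end of the group.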

Category 14 is the well-known percent sign %, which starts a comment and makes TeX ignore everything that follows on the line (including the end-of-line character).
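The treatment of spaces, line ends and comments can be observed by storing a few lines in macros and printing their meaning. This is a minimal sketch assuming plain TeX with default catcodes; the macro names are hypothetical.

```tex
% Plain TeX sketch: what line ends, comments and spaces turn into.
\def\joined{x
y}
\def\glued{x%
y}
\def\broken{x

y}
\def\spaced{x     y}
\message{\meaning\joined}% macro:->x y      (line end became a space)
\message{\meaning\glued}% macro:->xy        (comment removed the line end)
\message{\meaning\broken}% macro:->x \par y (empty line became \par)
\message{\meaning\spaced}% macro:->x y      (run of spaces reduced to one)
\bye
```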

Category 13 is very special. Plain TeX and the LaTeX kernel use only one active character, namely ~. An active character is treated as if it were a command and must have a definition before it can be used; the LaTeX definition is

\catcode`~=13
\def~{\nobreakspace{}}

so that typing ~ is exactly the same as writing \nobreakspace{}. The LaTeX inputenc package makes other characters active as well, so that, for instance, ü is translated into \"u.
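Any character can be made active in the same way. The following plain TeX sketch gives + a hypothetical definition inside a group, so that the change cannot disturb code that uses + as a sign elsewhere.

```tex
% Plain TeX sketch: a user-defined active character.
\begingroup
\catcode`\+=13 % make + active inside this group only
\def+{ plus }% hypothetical definition, for illustration
1+2 % typesets "1 plus 2"
\endgroup
\bye
```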

When we want to typeset TeX code verbatim, many of the special characters are assigned category code 12. When we type \verb+\xyz+, LaTeX reads \verb, starts a group and prepares everything for verbatim typesetting. The first + is swallowed and made active (category 13), with a definition that closes the group, so that when TeX finds the second + the group ends and all assignments revert to their normal values (including the category code of +). It's a bit magic, but it works, provided \verb+\xyz+ doesn't appear in the argument of a command.

This restriction is a real problem: when TeX scans the argument of a command, the category codes are frozen. As soon as a character enters TeX it is transformed into a pair (category code, character code), which is no longer the original character, so its category code can't be changed afterwards (well, not quite: there's \scantokens, but that would require a very long discussion).
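The freezing can be demonstrated with a small experiment. In this plain TeX sketch, \store, \tok and \direct are hypothetical names: \store changes the catcode of $ and then records the first token of its argument, but by then the argument has already been tokenized.

```tex
% Plain TeX sketch: a catcode change comes too late for an argument.
\def\store#1{\begingroup \catcode`\$=12 \global\let\tok= #1\endgroup}
\store{$}
\message{\meaning\tok}% math shift character $
% With no argument in between, the change takes effect in time:
\begingroup \catcode`\$=12 \global\let\direct= $\endgroup
\message{\meaning\direct}% the character $
\bye
```

In the first case the $ was read as part of the argument while it still had category 3; in the second it is read only after the assignment has been carried out.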

The LaTeX commands \makeatletter and \makeatother work by changing the category code of @, which is usually 12; the first one puts it into category 11, so that it can appear in command names, the second one reverts this assignment. But how can

\makeatletter
\newcommand{\xyz}{...\@xyz...}
\makeatother

work? One might expect that when TeX expands \xyz it finds the “illegal” command name \@xyz. This doesn't happen: just as a simple character is transformed into a pair of numbers, when TeX scans it, a command name becomes a symbolic token, an internal representation of the command which is independent of characters and their category codes.
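The same behaviour can be reproduced in plain TeX without the LaTeX commands. This is a minimal sketch with hypothetical macro names; it shows that the stored token \my@helper keeps working after @ has stopped being a letter.

```tex
% Plain TeX sketch: a command name survives a later catcode change.
\catcode`\@=11 % @ is a letter from here on
\def\my@helper{internal text}% hypothetical internal macro
\def\public{\my@helper}% the token \my@helper is stored here
\catcode`\@=12 % @ is an "other" character again
\public % still works: typesets "internal text"
\bye
```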

If we assign category code 0 to |, we can type \LaTeX and |LaTeX: they would mean just the same thing. But having different characters sharing the same category code 4 might turn out to be useful for aligning decimal numbers at the decimal separator in a tabular. If we assign . category code 4, we may type a decimal number as 123.456 and LaTeX will interpret it as if it were 123&456, producing two table cells that to the end user appear as one; some trickery in the definition of the table column structure is required, though.
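The decimal alignment idea can be sketched in plain TeX with \halign instead of a LaTeX tabular. This is only one possible arrangement, under the assumption that every entry contains a decimal separator: the second column's template prints the full stop itself, since the one typed in the row is consumed as a column separator.

```tex
% Plain TeX sketch: the full stop as a second alignment tab.
\begingroup
\catcode`\.=4 % . now separates columns, like &
\halign{\hfil#&\char`\.#\hfil\cr
1.5\cr
12.25\cr
123.456\cr}
\endgroup
\bye
```

The integer parts are set flush right and the fractional parts flush left, so the separators line up.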

Control sequences

Category codes often become important when TeX is deciding what is and is not a control sequence. With only the alphabet as 'letters', something like

\hello@

is the control sequence \hello followed by the 'other' token @. On the other hand, if I make @ a letter with \catcode, the command that changes the category code of a character:

\catcode`\@=11\relax
\hello@

then TeX will look for a macro called \hello@. This is commonly used in TeX code to isolate 'code' macros from 'user' ones. So you find programming macros such as \@for. Without changing the category code, this is effectively 'hidden'. The idea of this is to 'protect the user from themselves': it's hard to break the code if you cannot even get at it!

Suppose you wish to replace the curly brackets {} with square brackets []. This can be achieved with the following simple code:

% Let [ and ] act as group delimiters alongside { and }
\catcode `[=1
\catcode `]=2

\def\test[This is a test]
\test % typesets the replacement text
\bye

Try it without the catcode changes and it fails with a “Runaway definition” error.

Similarly, another character can take over the role of the backslash \, which is useful when the backslash itself must appear in verbatim text.

\catcode `[=1
\catcode `]=2
\def\test[This is a test]
% Let * act as an escape character alongside \
\catcode `*=0
*def*test[This is another test]
*test
\bye

In the last example you can type * in place of \. Run both minimal examples through pdfTeX. Authors rarely need such changes, but they are invaluable to package developers.

There are many effects that can be achieved using category codes. An obvious one is the non-breaking space ~ used throughout the TeX world. This works because ~ has category code 13, and is therefore 'active'. When TeX reads ~, it looks for a definition for ~ in the same way it would for a macro. That's a lot more convenient than using a macro for these cases.

We can use different category codes to make 'private' code areas. For example, plain TeX and LaTeX2e use @ as an extra 'letter', whereas LaTeX3 uses : and _. That effectively isolates internal LaTeX3 code from LaTeX2e when the two are used together (as at present).

Verbatim material is another area where category codes are vital (if complex!). The reason you can't nest verbatim material inside anything else is that once TeX has assigned category codes, the assignment is only partially reversible. Anything which is 'ignored' or 'comment' is thrown away: you can't get it back. (With e-TeX, you can reassign category codes, but anything that is already gone stays 'lost'.)

The 'special' category code is 16, which is used in the \ifcat test, amongst other things. It is assigned to unexpandable control sequences in this situation, so that they do not match anything else other than other unexpandable control sequences.
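The behaviour of \ifcat with unexpandable control sequences can be tried directly. This is a minimal sketch assuming plain TeX with default catcodes; the words "same" and "different" are just the text printed by each branch.

```tex
% Plain TeX sketch: \ifcat compares the categories of two tokens.
\message{\ifcat\relax\hbox same\else different\fi}% same
\message{\ifcat\relax a same\else different\fi}% different
\message{\ifcat ab same\else different\fi}% same
\message{\ifcat a1 same\else different\fi}% different
\bye
```

\relax and \hbox are both unexpandable primitives, so they match each other but not the letter a; the last two lines compare a letter with a letter (11 and 11) and a letter with a digit (11 and 12).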

Sources: