Chomsky Hierarchy

The Chomsky hierarchy (known as the Chomsky–Schützenberger hierarchy) is a containment hierarchy of classes of formal grammars. This hierarchy of grammars was described by Noam Chomsky in 1956. It is fundamental to computer science, in the study of programming language design, compilers, and computability.

Grammars define the rules by which strings in a language can be generated. In the production rules below:

$a$ is a terminal symbol.
$A, B, C$ are non-terminal symbols.
$\alpha, \beta, \gamma$ are strings of terminals and/or non-terminals.
$\rightarrow$ indicates a production rule (can be rewritten as).

Each level of the hierarchy is a strict superset of the level below it (e.g., all Regular grammars are Context-free, but not all Context-free grammars are Regular).

Overview Summary

Type	Grammar / Language	Recognizing Automaton	Production Rules	Example Language
0	Recursively enumerable	Turing Machine	$\gamma \rightarrow \alpha$ (Unrestricted)	Halting problem
1	Context-sensitive	Linear-Bounded Automaton (LBA)	$\alpha A \beta \rightarrow \alpha \gamma \beta$	$a^n b^n c^n$
2	Context-free	Non-deterministic Pushdown Automaton	$A \rightarrow \gamma$	$a^n b^n$
3	Regular	Finite State Automaton (FSA / NFA / DFA)	$A \rightarrow a$ or $A \rightarrow aB$	$a^* b^*$

Type-3: Regular Grammars

Type-3 grammars are the most restricted form of grammar. They are used in lexical analysis and string matching operations (e.g., standard regular expressions).

Recognizing automaton: Finite-State Automaton (FSA), either deterministic (DFA) or non-deterministic (NFA).

Production rules: Rules must be right-regular or left-regular.

$A \rightarrow a$ (A single terminal)
$A \rightarrow aB$ (Right regular)
$A \rightarrow \epsilon$ ($\epsilon$ is the empty string)

Or (Left regular):

$A \rightarrow a$
$A \rightarrow Ba$

Limitations: Regular languages cannot count or balance symbols. For instance, the language $a^n b^n$ (a string of $n$ 'a's followed by an equal number $n$ of 'b's) cannot be expressed with a regular grammar.

Type-2: Context-Free Grammars (CFG)

Type-2 grammars are important in computer science, as most programming languages and structural data formats (like HTML/XML) can be described using a CFG. They form the basis of most parser generators like Yacc or Bison.

Recognizing automaton: Non-deterministic Pushdown Automaton (NPDA). (An FSA equipped with a single stack memory).

Production rules:

$A \rightarrow \gamma$

A single non-terminal on the left-hand side can be replaced by any string of terminals and/or non-terminals $\gamma$ on the right-hand side. The replacement of $A$ is independent of the symbols surrounding it (hence "context-free").

Examples: The language $a^n b^n$ (paired brackets, matched opening and closing tags) is context-free.

Type-1: Context-Sensitive Grammars (CSG)

Type-1 grammars are uncommon in practical parsing because determining if a string belongs to a context-sensitive language is expensive to compute (PSPACE-complete). However, they model cross-serial dependencies found in natural human languages.

Recognizing automaton: Linear-Bounded Non-deterministic Turing Machine (LBA). A Turing machine bounded by a tape whose length is proportional to the input word.

Production rules:

$\alpha A \beta \rightarrow \alpha \gamma \beta$

The rule means the non-terminal $A$ is rewritten to $\gamma$ in the context of being preceded by $\alpha$ and followed by $\beta$. The length of the right-hand side must be greater than or equal to the left-hand side ($|\alpha A \beta| \le |\alpha \gamma \beta|$), with the exception of an $S \rightarrow \epsilon$ rule.

Examples: The language $a^n b^n c^n$ (an equal but unbounded number of as, bs, and cs) is context-sensitive but not context-free.

Type-0: Recursively Enumerable Grammars

Type-0 grammars are unrestricted. They represent all languages that can be recognized by a general-purpose computer.

Recognizing automaton: Turing Machine.

Production rules:

$\gamma \rightarrow \alpha$

There are no restrictions, other than the left-hand side ($\gamma$) must contain at least one non-terminal.

Notes: Any language generated by a Type-0 grammar is a recursively enumerable language. The problem of determining if a string belongs to a given Type-0 grammar is undecidable (it is equivalent to the Halting problem).