My background in grammars derives mostly from computer science, so my examples will probably be coloured in that direction, but here are a few...
A token (the natlang term is lexeme) is a category into which each smallest subdivision of an utterance falls. In computer langs, for example, 'lparen' might be a token whose only member is the single character '(', while 'ident' might be any alphanumeric combination that doesn't begin with a number (or something like that). In natlangs, the tokens would be things like 'noun stem' and 'verb stem' as well as 'aspect affix' and the like.
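To make that concrete, here is a minimal sketch of a tokenizer in Python; the token names and patterns (lparen, number, ident) are only illustrative assumptions, not taken from any particular language.

import re

# Hypothetical token categories: each name is paired with a pattern
# describing which strings belong to that category.
TOKEN_PATTERNS = [
    ('lparen', r'\('),                    # only member: the single character '('
    ('rparen', r'\)'),
    ('number', r'[0-9]+'),                # digits
    ('ident',  r'[A-Za-z][A-Za-z0-9]*'),  # alphanumeric, not beginning with a number
]

def tokenize(text):
    """Split text into (token, lexeme) pairs, skipping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError('no token matches at position %d' % pos)
    return tokens

print(tokenize('foo(42)'))
# [('ident', 'foo'), ('lparen', '('), ('number', '42'), ('rparen', ')')]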
A formal grammar is a document which specifies the syntax of a language. It consists of any number of lines (aka productions, rules), each of which is divided into two parts, the Left-Hand Side and the Right-Hand Side (yes, those are the technical terms!). The LHS of the first line of the grammar is the start symbol, usually something like <sentence>. A sentence in this lang is derived by starting from the start symbol and repeatedly replacing the LHS of a line with its RHS until only tokens remain.
Example:
<sentence> => <noun phr> verb .
<noun phr> => noun
<noun phr> => the noun
This simple grammar could generate sentences of the form "noun verb." or "the noun verb." Things to note: there are two types of item in a grammar, nonterminals and terminals. Nonterminals are things which can be replaced, i.e. they appear as the LHS of at least one line. Terminals are tokens, and cannot be replaced. In the above grammar, the nonterminals are the <bracketed> items; "the" and "." are literal tokens which stand only for themselves, while "noun" and "verb" are class tokens which stand for any member of their categories.
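For instance, the second form comes from the following derivation, one replacement per step:
<sentence>
=> <noun phr> verb .   (by the first line)
=> the noun verb .     (by the third line)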
Context free grammars are grammars where the LHS of each line is exactly one nonterminal and nothing else. They are "context free" because no matter what the rest of the sentence looks like, that nonterminal can be replaced with the RHS of that line; the replacement is independent of context. The above grammar is a (relatively degenerate) CFG.
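To show how mechanical the replacement process is, here is a sketch in Python which stores the example grammar as a table and generates random sentences from it; the dict-of-lists representation is just one convenient choice, not standard notation.

import random

# The example CFG: each nonterminal maps to the list of its possible
# right-hand sides; items in angle brackets are nonterminals,
# everything else is a terminal token.
GRAMMAR = {
    '<sentence>': [['<noun phr>', 'verb', '.']],
    '<noun phr>': [['noun'], ['the', 'noun']],
}

def generate(symbol):
    """Expand symbol by repeatedly replacing nonterminals with a random RHS."""
    if symbol not in GRAMMAR:        # terminals are left as they are
        return [symbol]
    expansion = []
    for item in random.choice(GRAMMAR[symbol]):
        expansion.extend(generate(item))
    return expansion

print(' '.join(generate('<sentence>')))
# prints either "noun verb ." or "the noun verb ."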
Context sensitive grammars are grammars where the LHS of a line may contain more than one item. The replacement can only occur if the surrounding context is correct. An example of a CSG might be:
<sentence> => <a> <b>
<b> => abc
<b> => efg
<a> => xyz
<a> abc => 123 abc
This grammar can generate the following sentences: "xyz abc", "xyz efg", "123 abc". The last line allows replacement of <a> with 123 only if the <a> is followed by "abc". Languages represented by CSGs can sometimes be represented with a CFG as well... the above lang, for instance, could also be written:
<sentence> => <c>
<c> => 123 abc
<c> => xyz <d>
<d> => abc
<d> => efg
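Since every production, context sensitive or not, just says "replace this sequence of symbols with that one", the claim that the two grammars generate the same three sentences can be checked by brute-force rewriting. The sketch below is only an illustration of that idea (it terminates only because neither grammar is recursive), not a general parsing algorithm.

# Each production is a (LHS sequence, RHS sequence) pair;
# angle-bracketed symbols are nonterminals.
CSG = [
    (['<sentence>'], ['<a>', '<b>']),
    (['<b>'], ['abc']),
    (['<b>'], ['efg']),
    (['<a>'], ['xyz']),
    (['<a>', 'abc'], ['123', 'abc']),   # <a> may become 123 only before abc
]

CFG = [
    (['<sentence>'], ['<c>']),
    (['<c>'], ['123', 'abc']),
    (['<c>'], ['xyz', '<d>']),
    (['<d>'], ['abc']),
    (['<d>'], ['efg']),
]

def sentences(grammar, start='<sentence>'):
    """Apply productions at every possible position until only terminals remain."""
    finished, todo, seen = set(), [[start]], set()
    while todo:
        form = todo.pop()
        for lhs, rhs in grammar:
            for i in range(len(form) - len(lhs) + 1):
                if form[i:i + len(lhs)] != lhs:
                    continue              # the context does not match here
                new = form[:i] + rhs + form[i + len(lhs):]
                if tuple(new) in seen:
                    continue
                seen.add(tuple(new))
                if any(sym.startswith('<') for sym in new):
                    todo.append(new)      # still contains a nonterminal
                else:
                    finished.add(' '.join(new))
    return finished

print(sorted(sentences(CSG)))            # ['123 abc', 'xyz abc', 'xyz efg']
print(sentences(CSG) == sentences(CFG))  # True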
Now, a line is directly recursive if its LHS is also included in its RHS, e.g.
<noun phr> => adjective <noun phr>
which would let a noun phrase pile up any number of adjectives.
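If the little Python generator sketched above is given such a rule, the recursion has to be cut off somehow or a derivation might never finish; here is one crude way, where the adjective token and the depth cap are my own assumptions for the sake of the example.

import random

# The earlier example grammar plus one directly recursive line.
GRAMMAR = {
    '<sentence>': [['<noun phr>', 'verb', '.']],
    '<noun phr>': [['noun'], ['the', 'noun'], ['adjective', '<noun phr>']],
}

def generate(symbol, depth=0, max_depth=5):
    """As before, but past max_depth refuse right-hand sides that recurse."""
    if symbol not in GRAMMAR:
        return [symbol]
    choices = GRAMMAR[symbol]
    if depth >= max_depth:
        choices = [rhs for rhs in choices if symbol not in rhs]
    expansion = []
    for item in random.choice(choices):
        expansion.extend(generate(item, depth + 1, max_depth))
    return expansion

print(' '.join(generate('<sentence>')))
# e.g. "adjective adjective the noun verb ."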
You may find it unwieldy to write out some of this, and you're right. The notation described above is known as Backus-Naur Form (BNF); there is also a frequently used Extended BNF which allows the following constructions: curly braces {} can bracket an item or series of items which may occur zero or more times; square brackets [] can bracket an optional item; a vertical bar | signifies an "or" (and can be grouped within parentheses).
Also, a superscripted plus sign after an item indicates that the item occurs one or more times. Using the vertical bar, for example, the CFG from the last example can be rewritten much more compactly:
<sentence> => <c>
<c> => 123 abc | xyz <d>
<d> => abc | efg
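The square brackets and curly braces shorten things in the same way: the very first example grammar collapses into a single line,
<sentence> => [the] noun verb .
and a line like
<noun phr> => {adjective} [the] noun
describes any number of adjectives without needing a directly recursive rule.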
The technical definition of a regular language is that it can be generated by a CFG with at most one nonterminal in each RHS, and if there is one it is the rightmost item. More practically, a language is regular if you can represent it with a regular expression (of the variety used by grep, vi, and perl). And while I'm at it: formally speaking, a language is the set of all possible sentences allowed by a given grammar.
According to this definition, a number of nonsense sentences are technically part of the language, even though they don't really mean anything... that problem (meaning) is dealt with in semantics.
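As a quick illustration of the regular-expression point above, the language of the very first example grammar is regular; here is a check with Python's re module (any regex-capable tool of the sort mentioned above would do).

import re

# "noun verb ." or "the noun verb ." -- the language of the first example grammar.
pattern = re.compile(r'^(the )?noun verb \.$')

for sentence in ['noun verb .', 'the noun verb .', 'the the noun verb .']:
    print(sentence, '->', bool(pattern.match(sentence)))
# prints True, True, False for the three test sentences.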
Copyright © 1997, Jack Durst,
Last updated: 25 April, 1997