My background in grammars derives mostly from computer science, so my examples will probably be coloured in that direction, but here are a few...
A token (the natlang term is lexeme) is a category into which each smallest subdivision of an utterance falls. In computer langs, for example, 'lparen' might be a token whose only member is the single character '(', while 'ident' might be any alphanumeric combination that doesn't begin with a number (or something like that). In natlangs, the tokens would be things like 'noun stem' and 'verb stem' as well as 'aspect affix' and the like.
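To make that concrete, here is a minimal sketch of a tokenizer in Python; the token names and patterns (lparen, number, ident) are only illustrative assumptions, not taken from any particular language.

import re

# Hypothetical token categories: each name is paired with a pattern
# describing which strings belong to that category.
TOKEN_PATTERNS = [
    ('lparen', r'\('),                    # only member: the single character '('
    ('rparen', r'\)'),
    ('number', r'[0-9]+'),                # digits
    ('ident',  r'[A-Za-z][A-Za-z0-9]*'),  # alphanumeric, not beginning with a number
]

def tokenize(text):
    """Split text into (token, lexeme) pairs, skipping whitespace."""
    pos, tokens = 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m:
                tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise ValueError('no token matches at position %d' % pos)
    return tokens

print(tokenize('foo(42)'))
# [('ident', 'foo'), ('lparen', '('), ('number', '42'), ('rparen', ')')]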
A formal grammar is a document which specifies the syntax of a language. It consists of any number of lines (aka productions, rules), each of which is divided into two parts, the Left-Hand Side and the Right-Hand Side (yes, those are the technical terms!). The LHS of the first line of the grammar is the start symbol, usually something like <sentence>. A sentence in this lang is derived by starting from the start symbol and repeatedly replacing the LHS of a line with its RHS until only tokens remain.
Example:
<sentence> => <noun phr> verb .
<noun phr> => noun
<noun phr> => the noun
This simple grammar could generate sentences of the form "noun verb." or "the noun verb." Things to note: there are two types of item in a grammar, nonterminals and terminals. Nonterminals are things which can be replaced, i.e. they appear as the LHS of at least one line. Terminals are tokens, and cannot be replaced. In the above grammar, the nonterminals are the <bracketed> items; "the" and "." are literal tokens which stand only for themselves, while "noun" and "verb" are class tokens which stand for any member of their categories.
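For instance, the second form comes from the following derivation, one replacement per step:
<sentence>
=> <noun phr> verb .   (by the first line)
=> the noun verb .     (by the third line)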
Context free grammars are grammars where the LHS of each line is exactly one nonterminal and nothing else. They are "context free" because no matter what the rest of the sentence looks like, that nonterminal can be replaced with the RHS of that line; the replacement is independent of context. The above grammar is a (relatively degenerate) CFG.
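To show how mechanical the replacement process is, here is a sketch in Python which stores the example grammar as a table and generates random sentences from it; the dict-of-lists representation is just one convenient choice, not standard notation.

import random

# The example CFG: each nonterminal maps to the list of its possible
# right-hand sides; items in angle brackets are nonterminals,
# everything else is a terminal token.
GRAMMAR = {
    '<sentence>': [['<noun phr>', 'verb', '.']],
    '<noun phr>': [['noun'], ['the', 'noun']],
}

def generate(symbol):
    """Expand symbol by repeatedly replacing nonterminals with a random RHS."""
    if symbol not in GRAMMAR:        # terminals are left as they are
        return [symbol]
    expansion = []
    for item in random.choice(GRAMMAR[symbol]):
        expansion.extend(generate(item))
    return expansion

print(' '.join(generate('<sentence>')))
# prints either "noun verb ." or "the noun verb ."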
Context sensitive grammars are grammars where the LHS of a line may contain more than one item. The replacement can only occur if the surrounding context is correct. An example of a CSG might be:
<sentence> => <a> <b>
<b> => abc
<b> => efg
<a> => xyz
<a> abc => 123 abc
This grammar can generate the following sentences: "xyz abc", "xyz efg", "123 abc". The last line allows replacement of <a> with 123 only if the <a> is followed by "abc". Languages represented by CSGs can sometimes be represented with a CFG as well... the above lang, for instance, could also be written:
<sentence> => <c>
<c> => 123 abc
<c> => xyz <d>
<d> => abc
<d> => efg
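Since every production, context sensitive or not, just says "replace this sequence of symbols with that one", the claim that the two grammars generate the same three sentences can be checked by brute-force rewriting. The sketch below is only an illustration of that idea (it terminates only because neither grammar is recursive), not a general parsing algorithm.

# Each production is a (LHS sequence, RHS sequence) pair;
# angle-bracketed symbols are nonterminals.
CSG = [
    (['<sentence>'], ['<a>', '<b>']),
    (['<b>'], ['abc']),
    (['<b>'], ['efg']),
    (['<a>'], ['xyz']),
    (['<a>', 'abc'], ['123', 'abc']),   # <a> may become 123 only before abc
]

CFG = [
    (['<sentence>'], ['<c>']),
    (['<c>'], ['123', 'abc']),
    (['<c>'], ['xyz', '<d>']),
    (['<d>'], ['abc']),
    (['<d>'], ['efg']),
]

def sentences(grammar, start='<sentence>'):
    """Apply productions at every possible position until only terminals remain."""
    finished, todo, seen = set(), [[start]], set()
    while todo:
        form = todo.pop()
        for lhs, rhs in grammar:
            for i in range(len(form) - len(lhs) + 1):
                if form[i:i + len(lhs)] != lhs:
                    continue              # the context does not match here
                new = form[:i] + rhs + form[i + len(lhs):]
                if tuple(new) in seen:
                    continue
                seen.add(tuple(new))
                if any(sym.startswith('<') for sym in new):
                    todo.append(new)      # still contains a nonterminal
                else:
                    finished.add(' '.join(new))
    return finished

print(sorted(sentences(CSG)))            # ['123 abc', 'xyz abc', 'xyz efg']
print(sentences(CSG) == sentences(CFG))  # True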
Now, a line is directly recursive if its LHS is also included in its RHS, e.g.
<noun phr> => adjective <noun phr>
which would let a noun phrase pile up any number of adjectives.
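If the little Python generator sketched above is given such a rule, the recursion has to be cut off somehow or a derivation might never finish; here is one crude way, where the adjective token and the depth cap are my own assumptions for the sake of the example.

import random

# The earlier example grammar plus one directly recursive line.
GRAMMAR = {
    '<sentence>': [['<noun phr>', 'verb', '.']],
    '<noun phr>': [['noun'], ['the', 'noun'], ['adjective', '<noun phr>']],
}

def generate(symbol, depth=0, max_depth=5):
    """As before, but past max_depth refuse right-hand sides that recurse."""
    if symbol not in GRAMMAR:
        return [symbol]
    choices = GRAMMAR[symbol]
    if depth >= max_depth:
        choices = [rhs for rhs in choices if symbol not in rhs]
    expansion = []
    for item in random.choice(choices):
        expansion.extend(generate(item, depth + 1, max_depth))
    return expansion

print(' '.join(generate('<sentence>')))
# e.g. "adjective adjective the noun verb ."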
You may find it unwieldy to write out some of this, and you're right. The notation described above is known as Backus-Naur Form (BNF); there is also a frequently used Extended BNF which allows the following constructions: curly braces {} can bracket an item or series of items which may occur zero or more times; square brackets [] can bracket an optional item; a vertical bar | signifies an "or" (and can be grouped within parentheses).
Also, a superscripted plus sign after an item indicates that the item occurs one or more times. Using the vertical bar, for example, the CFG from the last example can be rewritten much more compactly:
<sentence> => <c>
<c> => 123 abc | xyz <d>
<d> => abc | efg
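The square brackets and curly braces shorten things in the same way: the very first example grammar collapses into a single line,
<sentence> => [the] noun verb .
and a line like
<noun phr> => {adjective} [the] noun
describes any number of adjectives without needing a directly recursive rule.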
The technical definition of a regular language is that it can be generated by a CFG with at most one nonterminal in each RHS, and if there is one it is the rightmost item. More practically, a language is regular if you can represent it with a regular expression (of the variety used by grep, vi, and perl). And while I'm at it: formally speaking, a language is the set of all possible sentences allowed by a given grammar.
According to this definition, a number of nonsense sentences are technically part of the language, even though they don't really mean anything... that problem (meaning) is dealt with in semantics.
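As a quick illustration of the regular-expression point above, the language of the very first example grammar is regular; here is a check with Python's re module (any regex-capable tool of the sort mentioned above would do).

import re

# "noun verb ." or "the noun verb ." -- the language of the first example grammar.
pattern = re.compile(r'^(the )?noun verb \.$')

for sentence in ['noun verb .', 'the noun verb .', 'the the noun verb .']:
    print(sentence, '->', bool(pattern.match(sentence)))
# prints True, True, False for the three test sentences.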
Copyright © 1997, Jack Durst,
Last updated: 25 April, 1997