web.cs.city.ac.uk
Topic 4: Context Free Languages and Normalisation

Context Free Grammars
Chomsky Normal Form
Normalisation Step 1: Removal of Null Productions
Normalisation Step 2: Removal of Unit Productions
Normalisation Step 3: Removal of Mixed Right Hand Sides
Normalisation Step 4: Generation of Chomsky Normal Form

Context-Free Grammars

We have seen that all regular languages, even those containing an infinite number of strings, can be defined using REs with a finite number of terms.
Non-regular languages can also be finitely defined, but not using REs.
The first class of non-regular languages that we consider comprises the Context Free Languages (CFL), which can be defined using Context Free Grammars (CFG).
A CFG consists of a finite sequence of simple rewrite rules (also known as production rules), each of which has a single non-terminal symbol on the left hand side and an expression on the right.

Example

S -> AB | B
A -> B | CC
B -> b
C -> c

The Ô|Õ character indicates an alternative (ÔorÕ). Symbols given in upper case conventionally denote non-terminals, the lower case symbols denote terminals, which are the elements of the languageÕs alphabet.

The notation used above is known as Backus-Naur, or Backus Normal, Form (BNF) and was first used by Backus and Naur to define the syntax of the programming language Algol 60.

The language represented by a CFG is generated by applying all applicable production rules to the Ôstart symbolÕ (conventionally S) and to the non-teminals in the strings so produced until only terminals are left.
This construction generates a (possibly infinite) tree with the start symbol at the root and the (possibly infinite number of) words in the language at the leaves.
Each intermediate node in this tree contains a ÔmixedÕ string of terminals and non-terminals from which a branch exits for each production rule applicable to a non-terminal in it.
If we follow this procedure for the grammar in the example above, we generate the following tree:

which shows clearly that the language defined by the grammar is the set of words {bb, ccb, b}.
In this case, the language is finite and, of course, regular but the same notation may be used to define non-regular languages.

Example

The language {aⁿbⁿ| n>0}, which we know to be non-regular, may be defined thus:

S -> AB | ASB
A -> a
B -> b

Self Assessment Question

Draw the first few levels of the infinite tree generated by this grammar.
Note the recursion here, the non-terminal S appearing on both the left and right hand sides.

BNF also allows us to refer to the null word, denoted by the symbol /\.
This symbol is not a member of the alphabet of any language, nor is it a non-terminal (so it may not appear on the left hand side of any production rule).

Example

The language {aⁿbⁿ| n*0}, which includes the null word, might be defined thus:

S -> /\ | ASB
A -> a
B -> b

Clearly, for such infinite languages, the tree-generation procedure that we used earlier does not terminate.
In order to reason about them, and to construct machines that recognise them, we need to operate on the finite set of production rules and not with the infinite set of words that they denote.
Back to Topic 4

Chomsky Normal Form

CFGs come in all shapes and sizes.
By definition, any finite string of terminals and non-terminals may appear on the right hand side of a production, eg:

X -> YaaYbaYXZabYb.

However, as with REs, the same CFL may be defined by several different CFGs.
For the purposes of comparison and analysis, it is convenient to define a common form into which all CFGs could be transformed.
This technique, which is widely used in algebra (you will meet it the Database and Formal Methods modules, for example), is called normalisation and its result is called a normal form.

Definition

If a CFG has only productions of the form:

non-terminal -> string of exactly two non-terminals,

or of the form:

non-terminal -> one terminal,

it is said to be in Chomsky Normal Form (CNF).

Theorem

For every CFG, there is a CFG in CNF that defines the same language.

Proof

As with the equivalence theorems of the regular languages, this proof consists of a constructive algorithm that transforms any CFG into an equivalent CNF one.
The algorithm is in four steps:
Back to Topic 4

Normalisation Step 1: Removal of Null Productions

A null production is one whose right hand side contains, or is reduceable to, /\.

Theorem

If L is a CFL generated by a CFG that includes null productions, then there is a different CFG for L that has no null productions.

The proof of this theorem is a constructive algorithm that will convert any CFG that contains null productions into a CFG that does not contain them, but that generates the same language, with the possible exception of the null word, /\.
Two versions of this null-removal transformation are presented below.
The first seems, at first sight, to be correct but it has a serious flaw which is corrected in the second version.

Proof? (Version 1)

Given a CFG with start symbol S and containing a production of the form N -> /\, where N is any non-terminal, perform the following transformation:

For all productions of the form: X -> a Nb, where

X is any non-terminal (even S or N) and
a and b are any strings of teminals and non-terminals (even involving N),

add the production X -> ab and
delete the production N -> /\ but
do not delete the production X -> a Nb

Example 1:
We transform

S -> aNbNa
N -> /\

S -> abNa
S -> aNba
S -> aba (deleting both Ns).

Example 2:
We transform

S -> aSa | bSb | /\

S -> aSa | bSb | aa | bb

So far so good, but consider this grammar:
Example 3:

S -> a | Xb | aYa
X -> Y | /\
Y -> b | X

We have the null production X -> /\.
By replacement we create

S -> b (from S -> Xb) and
Y -> /\ (from Y -> X).

Then using the replacement rule to get rid of Y -> /\, we get

S -> aa (from S -> aYa) and
X -> /\ (from X -> Y)

We have failed to remove the null production X -> /\!
This version of the null-removal transformation clearly doesn't work. We need to eliminate all null productions at once.

Definition

In a given CFG, a non-terminal N is nullable if

there is a production N -> /\ or
there is a derivation that starts at N and leads to /\, i.e: N -> ... -> /\

Proof (Version 2: delete all null productions)

For every production X -> RHS,
add enough new productions of the form X -> ...
so that the right hand side will account for any modification of the RHS that can be formed by deleting all possible subsets of nullable non-terminals,
except that we do not allow X -> /\ to be formed even if all the characters in the RHS are nullable.

Example 3 (revisited):

In the CFG:

S -> a | Xb | aYa
X -> Y | /\
Y -> b | X

when we delete X -> /\, we have to see what new productions to add:

Old productions with nullables	Productions newly formed by the rule
X -> Y	Nothing
X -> /\	Nothing
Y -> X	Nothing
S -> Xb	S -> b
S -> aYa	S -> aa

The transformed CFG becomes:

S -> a | Xb | aYa | b | aa
X -> Y
Y -> b | X

This has no null productions, but generates the same language.
Back to Topic 4

Normalisation Step 2: Removal of Unit Productions

Definition:

A unit production is of the form X->Y (with just one non-terminal on each side).

We can get rid of these too, as follows:

If A -> B is a unit production and
all productions starting with B are of the form B -> a | b | ... where a, b,... are strings, then
drop the production A -> B and
include as new productions: A -> a | b | ... .

Again, this first, apparently adequate, version is problematic, this time because of circular definitions.

Self Seessment Question

Apply this transformation to any unit production in the following grammar:

S -> A | bb
A -> B | b
B -> S | a

As you see, you only generate more unit productions, and they won't go away!

So we need the following modified version of the unit production rule:

For every pair of nonterminals A and B, for which the CFG has either

a unit production A -> B or
a chain of unit productions leading from A to B, such as A -> X -> Y -> ...-> B,

if the non-unit productions from B are B -> a | b | g | ..., then

create the productions A -> a | b | g | ... and
do this for A and B simultaneously.

Example 4:

Given the following CFG:

S -> A | bb
A -> B | b
B -> S | a

construct another one that defines the same language, but has no unit productions.

First separate the units from the non-units.

Unit Productions	Others
S -> A	S -> bb
A -> B	A -> b
B -> S	B -> a

Now list all unit productions and sequences of unit productions, one non-terminal at a time, tracing each non-terminal through each sequence it heads.
Then create the new productions that allow the first non-terminal to be replaced by any of the strings that could replace the last non-terminal in the sequence:

S -> A gives S -> b
S -> A -> B gives S -> a
A -> B gives A -> a
A -> B -> S gives A -> bb
B -> S gives B -> bb
B -> S -> A gives B -> b

The transformed CFG, with no unit productions, for this language is:

S -> bb | b | bb
A -> b | a | bb
B -> a | bb | b

Back to Topic 4

Normalisation Step 3: Removal of Mixed Right Hand Sides

We have to arrive at productions that contain one of two basic forms:

non-terminal -> string of only non-terminals or
non-terminal -> one terminal.

This procedure is straightforward:

For each terminal symbol, say a and b,

add a new non-terminal, say A and B, respectively, and
add a new production, thus: A -> b and B -> b.

Now for each production involving terminals we replace each with its new non-terminal.

For example: P -> QaRSbbTa becomes

P -> QARSBBTA, a string of solid non-terminals, together with the productions
A -> a
B -> b

Back to Topic 4

Normalisation Step 4: Generation of Chomsky Normal Form

Now each production with a string of more than two non-terminals on its right hand side must be replaced with a chain of productions each of which has exactly two non-terminals on its right hand side.
For example,

S -> WXYZ should be replaced by
S -> WR where

R -> XS and
S -> YZ

And that's the end of the algorithm.
The application of these four steps, in strict sequence, to any CFG, will convert it to CNF.
That is, of course, also the end of the constructive proof of the original theorem.

Example

The following CFG defines the language Palindrome.

S -> aSa | bSb | a | b | /\

Convert it to CNF:
Step 1: Remove null productions

There is one null production: S -> /\
Removing it introduces the productions S -> aa | bb so the full grammar is now

S -> aSa | bSb | a | b | aa | bb

together with the production S -> /\, which we will ignore for the moment but reintroduce at the end.

Step 2: Remove unit productions

There are no unit productions here.
Note that S -> a and S -> b are not unit productions, because each has a single terminal symbol on the RHS.

Step 3: Remove mixed RHS

We need to define two new non-terminals for the terminals a and b with their productions:

A -> a
B -> a

and introduce them into the grammar thus:

S -> ASA | BSB | AA | BB | a | b

Note that the rules S -> a | b are already in CNF so we do not introduce the rules S -> A | B.
That would merely introduce new unit productions that would have to be removed in turn.

Step 4: Generation of CNF

We have two productions that are not in CNF:

S -> ASA | BSB

Define two new non-terminals with their productions:

K -> SA
L -> SB

and restore the original null production for S, so that the grammar becomes

S -> AK | BL | AA | BB | a | b | /\
K -> SA
L -> SB
A -> a
B -> b

which is a CNF grammar for Palindrome, as required.

Self Assessment Questions

Check that both the grammars above recognise Palindrome by choosing some strings that are, and some that are not, palindromes and finding their parse trees in each grammar.
Convert the following CFG to CNF:

S -> AB
A -> B
B -> aB | Bb | /\

Back to Topic 4
Next Topic

Start of Module

If you have comments or suggestions, email to author at b.cohen@city.ac.uk
or me Vid-Com-Ram / members.tripod.com/bummerang

หมายเหตุ : เอกสารเวปหน้านี้นำมาจากมหาวิทยาลัยแห่งหนึ่งในประเทศอังกฤษ เป็นหลักสูตรคล้ายคลึงกับ CT313 นำมาเพื่อเป็นเอกสารเปรียบเทียบและอ้างอิง มิได้หวังผลประโยชน์ใดๆ และขอบอกว่าเค้ามีลิขสิทธิ์นะ

ทฤษฏีคอมพิวเตชั่น CT313 Original form MODULE P220 Theory of Computation 2001 http://web.cs.city.ac.uk Topic 4: Context Free Languages and Normalisation

Contents

Context-Free Grammars

Example

Example

Self Assessment Question

Example

Chomsky Normal Form

Definition

Theorem

Proof

Normalisation Step 1: Removal of Null Productions

Theorem

Proof? (Version 1)

Definition

Proof (Version 2: delete all null productions)

Example 3 (revisited):

Normalisation Step 2: Removal of Unit Productions

Definition:

Self Seessment Question

Example 4:

Normalisation Step 3: Removal of Mixed Right Hand Sides

Normalisation Step 4: Generation of Chomsky Normal Form

Example

Self Assessment Questions

ทฤษฏีคอมพิวเตชั่น CT313
Original form MODULE P220
Theory of Computation 2001 http://web.cs.city.ac.uk
Topic 4: Context Free Languages and Normalisation