We have seen that all regular languages,
even those containing an infinite number of strings, can be defined using REs
with a finite number of terms.
Non-regular languages can also be finitely defined, but not using REs.
The first class of non-regular languages that we consider comprises the Context
Free Languages (CFL), which can be defined using Context Free Grammars
(CFG).
A CFG consists of a finite sequence of simple rewrite rules (also known
as production rules), each of which has a single non-terminal
symbol on the left hand side and an expression on the right.
S -> AB | B
A -> B | CC
B -> b
C -> c
The Ô|Õ character indicates an alternative (ÔorÕ). Symbols given in upper case conventionally denote non-terminals, the lower case symbols denote terminals, which are the elements of the languageÕs alphabet.
The notation used above is known as Backus-Naur, or Backus Normal, Form (BNF) and was first used by Backus and Naur to define the syntax of the programming language Algol 60.
The language represented by a CFG is generated
by applying all applicable production rules to the Ôstart symbolÕ
(conventionally S) and to the non-teminals in the strings so produced until
only terminals are left.
This construction generates a (possibly infinite) tree with the start symbol
at the root and the (possibly infinite number of) words in the language at the
leaves.
Each intermediate node in this tree contains a ÔmixedÕ string of
terminals and non-terminals from which a branch exits for each production rule
applicable to a non-terminal in it.
If we follow this procedure for the grammar in the example above, we generate
the following tree:
which shows clearly that the language defined by the grammar is the set of words
{bb, ccb, b}.
In this case, the language is finite and, of course, regular but the same notation
may be used to define non-regular languages.
The language {anbn| n>0}, which we know to be non-regular, may be defined thus:
S -> AB | ASB
A -> a
B -> b
Draw the first few levels of the infinite
tree generated by this grammar.
Note the recursion here, the non-terminal S appearing on both the left
and right hand sides.
BNF also allows us to refer to the null
word, denoted by the symbol /\.
This symbol is not a member of the alphabet of any language, nor is it a non-terminal
(so it may not appear on the left hand side of any production rule).
The language {anbn| n*0}, which includes the null word, might be defined thus:
S -> /\ | ASB
A -> a
B -> b
Clearly, for such infinite languages, the
tree-generation procedure that we used earlier does not terminate.
In order to reason about them, and to construct machines that recognise them,
we need to operate on the finite set of production rules and not with the infinite
set of words that they denote.
Back
to Topic 4
CFGs come in all shapes and sizes.
By definition, any finite string of terminals and non-terminals may appear on
the right hand side of a production, eg:
X -> YaaYbaYXZabYb.
However, as with REs, the same CFL may
be defined by several different CFGs.
For the purposes of comparison and analysis, it is convenient to define a common
form into which all CFGs could be transformed.
This technique, which is widely used in algebra (you will meet it the Database
and Formal Methods modules, for example), is called normalisation and
its result is called a normal form.
If a CFG has only productions of the form:
non-terminal -> string of exactly two non-terminals,
or of the form:
non-terminal -> one terminal,
it is said to be in Chomsky Normal Form (CNF).
For every CFG, there is a CFG in CNF that defines the same language.
As with the equivalence theorems of the
regular languages, this proof consists of a constructive algorithm that transforms
any CFG into an equivalent CNF one.
The algorithm is in four steps:
Back
to Topic 4
A null production is one whose right hand side contains, or is reduceable to, /\.
If L is a CFL generated by a CFG that includes null productions, then there is a different CFG for L that has no null productions.
The proof of this theorem is a constructive
algorithm that will convert any CFG that contains null productions into a CFG
that does not contain them, but that generates the same language, with the possible
exception of the null word, /\.
Two versions of this null-removal transformation are presented below.
The first seems, at first sight, to be correct but it has a serious flaw which
is corrected in the second version.
Given a CFG with start symbol S and containing a production of the form N -> /\, where N is any non-terminal, perform the following transformation:
For all productions of the form: X -> a Nb, where
X is any non-terminal (even S or N) and
a and b are any strings of teminals and non-terminals (even involving N),add the production X -> ab and
delete the production N -> /\ but
do not delete the production X -> a Nb
Example 1:
We transform
S -> aNbNa
N -> /\
to
S -> abNa
S -> aNba
S -> aba (deleting both Ns).
Example 2:
We transform
S -> aSa | bSb | /\
to
S -> aSa | bSb | aa | bb
So far so good, but consider this grammar:
Example 3:
S -> a | Xb | aYa
X -> Y | /\
Y -> b | X
We have the null production X -> /\.
By replacement we create
S -> b (from S -> Xb) and
Y -> /\ (from Y -> X).
Then using the replacement rule to get rid of Y -> /\, we get
S -> aa (from S -> aYa) and
X -> /\ (from X -> Y)
We have failed to remove the null production
X -> /\!
This version of the null-removal transformation clearly doesn't work.
We need to eliminate all null productions at once.
In a given CFG, a non-terminal N is nullable if
there is a production N -> /\ or
there is a derivation that starts at N and leads to /\, i.e: N -> ... -> /\
For every production X -> RHS,
add enough new productions of the form X -> ...
so that the right hand side will account for any modification of the RHS that
can be formed by deleting all possible subsets of nullable non-terminals,
except that we do not allow X -> /\ to be formed even
if all the characters in the RHS are nullable.
In the CFG:
S -> a | Xb | aYa
X -> Y | /\
Y -> b | X
when we delete X -> /\, we have
to see what new productions to add:
Old productions with nullables |
Productions newly formed by the rule |
X -> Y |
Nothing |
X -> /\ |
Nothing |
Y -> X |
Nothing |
S -> Xb |
S -> b |
S -> aYa |
S -> aa |
The transformed CFG becomes:
S -> a | Xb | aYa | b | aa
X -> Y
Y -> b | X
This has no null productions, but generates
the same language.
Back
to Topic 4
A unit production is of the form X->Y (with just one non-terminal on each side).
We can get rid of these too, as follows:
If A -> B is a unit production and
all productions starting with B are of the form B -> a | b | ... where a, b,... are strings, then
drop the production A -> B and
include as new productions: A -> a | b | ... .
Again, this first, apparently adequate, version is problematic, this time because of circular definitions.
Apply this transformation to any unit production in the following grammar:
S -> A | bb
A -> B | b
B -> S | a
As you see, you only generate more unit productions, and they won't go away!
So we need the following modified version of the unit production rule:
For every pair of nonterminals A and B, for which the CFG has either
a unit production A -> B or
a chain of unit productions leading from A to B, such as A -> X -> Y -> ...-> B,
if the non-unit productions from B are B -> a | b | g | ..., then
create the productions A -> a | b | g | ... and
do this for A and B simultaneously.
Given the following CFG:
S -> A | bb
A -> B | b
B -> S | a
construct another one that defines the same language, but has no unit productions.
First separate the units from the non-units.
Unit Productions | Others |
S -> A | S -> bb |
A -> B | A -> b |
B -> S | B -> a |
Now list all unit productions and sequences
of unit productions, one non-terminal at a time, tracing each non-terminal through
each sequence it heads.
Then create the new productions that allow the first non-terminal to be replaced
by any of the strings that could replace the last non-terminal in the sequence:
S -> A gives S -> b
S -> A -> B gives S -> a
A -> B gives A -> a
A -> B -> S gives A -> bb
B -> S gives B -> bb
B -> S -> A gives B -> b
The transformed CFG, with no unit productions, for this language is:
S -> bb | b | bb
A -> b | a | bb
B -> a | bb | b
We have to arrive at productions that contain one of two basic forms:
non-terminal -> string of only non-terminals or
non-terminal -> one terminal.
This procedure is straightforward:
For each terminal symbol, say a and b,
add a new non-terminal, say A and B, respectively, and
add a new production, thus: A -> b and B -> b.Now for each production involving terminals we replace each with its new non-terminal.
For example: P -> QaRSbbTa becomes
P -> QARSBBTA, a string of solid non-terminals, together with the productions
A -> a
B -> b
Now each production with a string of more
than two non-terminals on its right hand side must be replaced with a chain
of productions each of which has exactly two non-terminals on its right hand
side.
For example,
S -> WXYZ should be replaced by
S -> WR whereR -> XS and
S -> YZ
And that's the end of the algorithm.
The application of these four steps, in strict sequence, to any
CFG, will convert it to CNF.
That is, of course, also the end of the constructive proof of the original theorem.
The following CFG defines the language Palindrome.
S -> aSa | bSb | a | b | /\
Convert it to CNF:
Step 1: Remove null productions
There is one null production: S -> /\
Removing it introduces the productions S -> aa | bb so the full grammar is nowS -> aSa | bSb | a | b | aa | bb
together with the production S -> /\, which we will ignore for the moment but reintroduce at the end.
Step 2: Remove unit productions
There are no unit productions here.
Note that S -> a and S -> b are not unit productions, because each has a single terminal symbol on the RHS.
Step 3: Remove mixed RHS
We need to define two new non-terminals for the terminals a and b with their productions:
A -> a
B -> aand introduce them into the grammar thus:
S -> ASA | BSB | AA | BB | a | b
Note that the rules S -> a | b are already in CNF so we do not introduce the rules S -> A | B.
That would merely introduce new unit productions that would have to be removed in turn.
Step 4: Generation of CNF
We have two productions that are not in CNF:
S -> ASA | BSB
Define two new non-terminals with their productions:
K -> SA
L -> SBand restore the original null production for S, so that the grammar becomes
S -> AK | BL | AA | BB | a | b | /\
K -> SA
L -> SB
A -> a
B -> bwhich is a CNF grammar for Palindrome, as required.
S -> AB
A -> B
B -> aB | Bb | /\
If you have comments or
suggestions, email to author at b.cohen@city.ac.uk
or me Vid-Com-Ram
/ members.tripod.com/bummerang
หมายเหตุ : เอกสารเวปหน้านี้นำมาจากมหาวิทยาลัยแห่งหนึ่งในประเทศอังกฤษ
เป็นหลักสูตรคล้ายคลึงกับ CT313 นำมาเพื่อเป็นเอกสารเปรียบเทียบและอ้างอิง มิได้หวังผลประโยชน์ใดๆ
และขอบอกว่าเค้ามีลิขสิทธิ์นะ