Formal Language : Basic terms
Symbols,
Alphabets, Strings, and Languages
So, what
exactly is a language, in the
formal sense? It’s not quite the same as English or Spanish. We’re going to
break it down into its most basic components.
2.1 Symbol
The most
fundamental part is the symbol.
A symbol is just a basic, indivisible unit. Think of it as a letter in the
alphabet, or a digit in a number system.
Examples:
The letter ‘a’
The digit ‘1’
The symbol ‘#’
Even emojis can be symbols!
2.2 Alphabets
Next, we
have an alphabet. An
alphabet is simply a finite, non-empty
set of symbols. We often use the Greek letter Sigma (Σ) to represent
an alphabet.
Examples:
i.
Σ = {‘a’, ‘b’, ‘c’} (A small alphabet of lowercase letters)
ii.
Σ = {0, 1} (The binary alphabet – super important for computers!)
iii.
Σ = {A, B, C, …, Z}
o Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
2.3 Strings
Now,
things get more interesting. A string
is a finite sequence of symbols
chosen from an alphabet. The symbols are placed one after another.
Examples (using the alphabet Σ = {a, b}):
i.
"a" (A string of length
1)
ii.
"ab"
(A string of length 2)
iii.
"baa" (A string of
length 3)
iv.
"aabbaba"
(A longer string)
The Empty String:
There’s
also a special string called the empty
string, which has no
symbols. We represent it with the Greek letter Epsilon (ε) or sometimes lambda (λ). It’s like an empty word.
The key
is that a string is always
formed from a single,
well-defined alphabet.
·
Valid String: A string is valid if all of its
symbols are drawn from the same, predefined alphabet.
·
Invalid String (in this context): A string is invalid if it contains symbols that do not belong to the
specified alphabet.
Example to illustrate your point:
Let’s
say we have two alphabets:
·
Σ1 = {a, b}
·
Σ2 = {0, 1}
If we
try to create a string "a1", it would be considered invalid with
respect to either Σ1 or Σ2, because it contains symbols from both
alphabets. To be valid, "a1" would need to be constructed from an
alphabet that includes {a, b, 0, 1}.
Length
of a String:
The
length of a string is simply the number of symbols in the string. We denote the
length of a string ‘w’ as |w|.
i.
|ε| = 0
ii.
|a| = 1
iii.
|ab| = 2
iv.
|baa| = 3
Lexicographic Order
Lexicographic
order for strings is essentially the same as dictionary order.
Strings are
compared character by character from left to right.
The order is
determined by the underlying order of the alphabet’s symbols.
For example, if
the alphabet is {a, b, c} and a< b < c then "abc"
comes before "acb" in lexicographic order.
Shorter strings
always come before longer strings.
If two strings
have the same length, they are ordered lexicographically.
Example: Over
the alphabet {0, 1}, the order is (epsilon, 0, 1, 00, 01, 10, 11, 000, …), where epsilon represents the empty string.
Prefix
i.
String x is a prefix of string y
if there exists a string z such that xz = y.
a. In other words, y starts with x.
ii.
Example: "abc"
is a prefix of "abcdef".
Proper Prefix
String x is a
proper prefix of string y if x is a prefix of y and x y.
The prefix cannot be the entire string itself.
Example: "abc" is a proper prefix of "abcdef",
but "abcdef" is not a proper prefix of
"abcdef".
What is Language?
Languages
Finally,
we get to the definition of a formal language. A formal language is simply a
set of strings formed from a specific alphabet.
·
Important: A language can be
finite or infinite!
·
A language L is a
subset of Σ*, where Σ is an alphabet and Σ* is the Kleene star of Σ. This means
that every string in the language L is formed by concatenating symbols from
the alphabet Σ.
Key Concepts:
Powers of an Alphabet (Σk): "The set Σk is the set of all strings of length
exactly k formed using symbols from Σ."
Example: If Σ = {a, b}, then Σ2 = {aa, ab, ba, bb}.
Given an
alphabet Σ, the set Σk represents the set of all strings of length exactly k, formed using symbols from Σ." In other words, we’re taking the alphabet and combining its
symbols to make strings of a specific length k.
More Examples:
i.
Σ = {0, 1}
ii.
Σ0 = {ε} (The set of all strings of length 0. Only the empty string.)
iii.
Σ1 = {0, 1}
iv.
Σ2 = {00, 01, 10, 11}
v.
Σ3 = {000, 001, 010, 011, 100, 101, 110, 111}
The Kleene
Star (Σ*): "The Kleene star Σ* is the set of
all possible strings (of any finite length, including
the empty string ε) that can be formed using symbols from Σ."
Mathematically: Σ* = Σ0 ∪ Σ1 ∪ Σ2 ∪ Σ3 ∪ …
The Kleene
Plus (Σ+): "The Kleene plus Σ+ is the set of
all possible strings (of length 1 or more) that can be formed using symbols
from Σ."
Mathematically: Σ+ = Σ1 ∪ Σ2 ∪ Σ3 ∪ …
Σ+ = Σ* – {ε}