Formal Language : Basic terms

Symbols, Alphabets, Strings, and Languages

So, what exactly is a language, in the formal sense? It’s not quite the same as English or Spanish. We’re going to break it down into its most basic components.

2.1 Symbol

The most fundamental part is the symbol. A symbol is just a basic, indivisible unit. Think of it as a letter in the alphabet, or a digit in a number system.

Examples:

The letter ‘a’

The digit ‘1’

The symbol ‘#’

Even emojis can be symbols!

2.2 Alphabets

Next, we have an alphabet. An alphabet is simply a finite, non-empty set of symbols. We often use the Greek letter Sigma (Σ) to represent an alphabet.

Examples:

                  i.            Σ = {‘a’, ‘b’, ‘c’} (A small alphabet of lowercase letters)

                ii.            Σ = {0, 1} (The binary alphabet – super important for computers!)

              iii.            Σ = {A, B, C, …, Z}

o    Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}  

2.3 Strings

Now, things get more interesting. A string is a finite sequence of symbols chosen from an alphabet. The symbols are placed one after another.

Examples (using the alphabet Σ = {a, b}):

                  i.            "a" (A string of length 1)

                ii.            "ab" (A string of length 2)

              iii.            "baa" (A string of length 3)

              iv.            "aabbaba" (A longer string)

The Empty String:

There’s also a special string called the empty string, which has no symbols. We represent it with the Greek letter Epsilon (ε) or sometimes lambda (λ). It’s like an empty word.

The key is that a string is always formed from a single, well-defined alphabet.

·         Valid String: A string is valid if all of its symbols are drawn from the same, predefined alphabet.

·         Invalid String (in this context): A string is invalid if it contains symbols that do not belong to the specified alphabet.

Example to illustrate your point:

Let’s say we have two alphabets:

·         Σ1 = {a, b}

·         Σ2 = {0, 1}

If we try to create a string "a1", it would be considered invalid with respect to either Σ1 or Σ2, because it contains symbols from both alphabets. To be valid, "a1" would need to be constructed from an alphabet that includes {a, b, 0, 1}.

Length of a String:

The length of a string is simply the number of symbols in the string. We denote the length of a string ‘w’ as |w|.

         i.            |ε| = 0

        ii.            |a| = 1

      iii.            |ab| = 2

      iv.            |baa| = 3

  Lexicographic Order

Lexicographic order for strings is essentially the same as dictionary order.

Strings are compared character by character from left to right.

The order is determined by the underlying order of the alphabet’s symbols.

For example, if the alphabet is {a, b, c} and a< b < c then "abc" comes before "acb" in lexicographic order.

Shorter strings always come before longer strings.

If two strings have the same length, they are ordered lexicographically.

Example: Over the alphabet {0, 1}, the order is (epsilon, 0, 1, 00, 01, 10, 11, 000, …), where epsilon represents the empty string.

Prefix

                  i.            String x is a prefix of string y if there exists a string z such that xz = y.

a.     In other words, y starts with x.

                ii.            Example: "abc" is a prefix of "abcdef".

Proper Prefix

String x is a proper prefix of string y if x is a prefix of y and x  y.

The prefix cannot be the entire string itself.

Example: "abc" is a proper prefix of "abcdef", but "abcdef" is not a proper prefix of "abcdef".

    

What is Language?

Languages

Finally, we get to the definition of a formal language. A formal language is simply a set of strings formed from a specific alphabet.

·         Important: A language can be finite or infinite!

·         A language L is a subset of Σ*, where Σ is an alphabet and Σ* is the Kleene star of Σ. This means that every string in the language L is formed by concatenating symbols from the alphabet Σ.

Key Concepts:

Powers of an Alphabet (Σk): "The set Σk is the set of all strings of length exactly k formed using symbols from Σ."

Example: If Σ = {a, b}, then Σ2 = {aa, ab, ba, bb}.

Given an alphabet Σ, the set Σk represents the set of all strings of length exactly k, formed using symbols from Σ." In other words, we’re taking the alphabet and combining its symbols to make strings of a specific length k.

More Examples:

                  i.            Σ = {0, 1}

                ii.            Σ0 = {ε} (The set of all strings of length 0. Only the empty string.)

              iii.            Σ1 = {0, 1}

              iv.            Σ2 = {00, 01, 10, 11}

                v.            Σ3 = {000, 001, 010, 011, 100, 101, 110, 111}

 

The Kleene Star (Σ*): "The Kleene star Σ* is the set of all possible strings (of any finite length, including the empty string ε) that can be formed using symbols from Σ."

Mathematically: Σ* = Σ0 Σ1 Σ2 Σ3

The Kleene Plus (Σ+): "The Kleene plus Σ+ is the set of all possible strings (of length 1 or more) that can be formed using symbols from Σ."

Mathematically: Σ+ = Σ1 Σ2 Σ3

Σ+ = Σ* – {ε}