This blog post is meant to help my friends get started in reading my
other blog posts, that is, this post is a crash course on the notation
used in programming language theory ("PL theory" for short). For a
much more thorough introduction, I recommend Types and
Programming Languages by Benjamin C. Pierce and Semantics
Engineering with PLT Redex by Felleisen, Findler, and Flatt. I'll
assume the reader is an experienced programmer but not an experienced
mathematician or PL theorist. I'll start with the most basic
definitions and try to build up quickly.
Sets, Tuples, Relations, and Definition by Rules
I suspect many readers will already be familiar with
sets, tuples, and relations, but if you are not familiar
with inductive definitions, then please make sure to read the
subsection below titled Definition by Rules.
Sets
The main building block that we use in PL theory is the set, a
collection of objects (also called elements), such as the set
containing the first three natural numbers:

$$\{ 0, 1, 2 \}$$

The only thing that matters is whether an object is in the set or not;
it doesn't make sense to ask if there are duplicates of an object or
in what order the objects appear in the set. For example, the set
$\{ 2, 0, 1 \}$ is the same set as the set listed above. The notation
$\in$ means "in", so $1 \in \{ 0, 1, 2 \}$ is true and
$3 \in \{ 0, 1, 2 \}$ is false. Sets may have an infinite number of
elements, such as the set of all natural numbers (non-negative
integers), written $\mathbb{N}$.
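If code helps make this concrete, Python's built-in sets behave the same
way. A quick sketch of mine (not part of the standard math notation):

s = {0, 1, 2}
print(s == {2, 0, 1, 1})  # True: duplicates and order don't matter
print(1 in s)             # True: 1 is in the set
print(3 in s)             # False: 3 is not in the set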
Tuples
Another building block is the tuple, which is an ordered collection of
objects. So $(1, 2, 3)$ is a tuple of three elements and it is different
from the tuple $(3, 2, 1)$. The subscript notation $t_i$
retrieves the $i$th element of tuple $t$.
For example, if $t = (1, 2, 3)$, then
$t_2 = 2$.
Tuples contain only a finite number of
elements and usually less than a handful. Sometimes angle brackets
are used for tuples instead of parentheses, such as $\langle 1, 2, 3 \rangle$.
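Python's tuples work the same way, except that indexing starts at 0. A
small sketch of mine:

t = (1, 2, 3)
print(t == (3, 2, 1))  # False: order matters in a tuple
print(t[1])            # 2, which is the mathematical t_2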
Relations
Putting tuples and sets together we get relations. That is, a
relation is a set of tuples, such as the following set of pairs:

$$\{ (0, 1), (1, 2), (2, 3), \ldots \}$$

We often use relations to represent a mapping from input to output.
For example, the above relation can be thought of as mapping a natural
number to its successor, that is, to the next greater natural number.
The above definition is rather imprecise because of the ellipses ($\ldots$).
Fortunately, there are more precise notations for describing infinite
sets and relations.
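By the way, in code a relation is nothing more than a set of tuples.
Here's a finite fragment of the successor relation in Python (my sketch):

successor = {(0, 1), (1, 2), (2, 3)}  # only a finite fragment, of course
print((1, 2) in successor)  # True
print((1, 3) in successor)  # False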
Definition by Rules
The main way that we define infinite sets in PL theory is by giving a
list of rules for what is in the set. Let's use the name $R$ for the
above relation. Then the following two rules give a precise definition
of $R$. Notice that the second rule is recursive in that it refers to
$R$ itself. That's ok and quite common.

1. $(0, 1) \in R$.
2. For any natural numbers $n$ and $m$, if $(n, m) \in R$,
   then $(n + 1, m + 1) \in R$.

When we use rules to define a set, we implicitly mean that
an element is not in the set if there is no way to use the given rules
to justify that the element should be in the set. So $(1, 3)$ is not
in $R$ because there is no way to use the above two rules to conclude
that $(1, 3)$ is in $R$.
Some sets of rules are nonsensical and do not define a set. For
example, rules should not be contradictory as in the following.

- If $(0, 0) \notin R'$, then $(0, 0) \in R'$.
A textbook on set theory will give the restrictions on what
constitutes a "good" bunch of rules, but we won't go into that here,
other than to point out that you need at least one non-recursive rule
and that logical negation should be avoided.
A common notation for rules such as the above uses a horizontal line
in place of "if" and "then". For example, an equivalent definition
of $R$ is given by the following.

$$\frac{}{(0, 1) \in R} \ (1)
\qquad\qquad
\frac{(n, m) \in R}{(n + 1, m + 1) \in R} \ (2)$$
We have dropped the "For any natural numbers $n$ and $m$" part of
rule 2. The convention is that variables such as $n$ and $m$ that
appear in a rule can be replaced by any object of the correct
type, in this case, a natural number. Often the "correct type" is
something you can deduce from the context of the discussion, in this
case, the natural numbers.
Suppose that I claim that a particular element is in $R$, say
$(2, 3)$. You might respond by saying that you don't believe me. To
convince you, I need to show you how the rules justify that
$(2, 3) \in R$, that is, I need to show you a derivation. A derivation is
a chaining together of rules, replacing variables such as $n$ and
$m$ with particular objects and replacing premises such as
$(n, m) \in R$ with sub-derivations.

$$\frac{\dfrac{\dfrac{}{(0, 1) \in R} \ 1}{(1, 2) \in R} \ 2}{(2, 3) \in R} \ 2$$

I've labelled each step in the derivation with the rule number.
The fancy name for what I'm calling Definition by Rules is
inductive definition. (My daughter loves the Fancy Nancy series of
books.)
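For programmers, a rule-based definition can also be read as a recursive
membership test, with rule 2 applied from right to left. A Python sketch
of mine:

def in_R(n, m):
    if (n, m) == (0, 1):     # rule 1
        return True
    if n > 0 and m > 0:      # rule 2, read backwards
        return in_R(n - 1, m - 1)
    return False             # no rule applies, so not in R

print(in_R(2, 3))  # True, as the derivation above shows
print(in_R(1, 3))  # False: no derivation exists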
Language Syntax and Grammars
It turns out that using rules to define sets, as we did above, is how
we define the syntax of a programming language. Suppose we'd like to
define a simple language of integer arithmetic, call it
$\mathit{Arith}$, including expressions such as 1 + 3 and
-(5 + 2). Recall that $\mathbb{Z}$ is the set of all
integers. Then here's a bunch of rules that we might use to define
$\mathit{Arith}$:
- For any $n \in \mathbb{Z}$, $n \in \mathit{Arith}$.
- For any $e$, if $e \in \mathit{Arith}$, then $-e \in \mathit{Arith}$.
- For any $e_1$ and $e_2$, if $e_1 \in \mathit{Arith}$ and
  $e_2 \in \mathit{Arith}$, then $e_1 + e_2 \in \mathit{Arith}$.
- For any $e$, if $e \in \mathit{Arith}$, then $(e) \in \mathit{Arith}$.
Backus-Naur Form (BNF) is another common notation for writing
rules that define the syntax of a language, but the meaning is the same.
(There are several variations on BNF; I forget which one I'm using here.)
The bunch of rules is referred to as a grammar.
Arith ::= integer
Arith ::= "-" Arith
Arith ::= Arith "+" Arith
Arith ::= "(" Arith ")"
A vertical bar (meaning "or") is often used to make such
syntax definitions more concise, as follows.
Arith ::= integer | "-" Arith | Arith "+" Arith | "(" Arith ")"
In PL theory, we use a peculiar variation on BNF that replaces the
name of the language being defined, in this case $\mathit{Arith}$,
with the variable that is used to range over elements of
$\mathit{Arith}$. So suppose we are using the variable $n$ as a
placeholder for any integer and $e$ as a placeholder for elements of
$\mathit{Arith}$. Then we would write the above grammar as

$$e ::= n \mid -e \mid e + e$$

Note that I've dropped the parentheses. It's generally understood that
parentheses are allowed in any language.
The notion of derivation coincides with that of a parse tree; they
both demonstrate why a particular element is in a set.
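For programmers, a grammar also corresponds to a datatype definition for
abstract syntax trees: one variant per grammar rule. Here's one possible
Python rendering (the class names are my own invention):

from dataclasses import dataclass

@dataclass
class Num:            # e ::= n
    value: int

@dataclass
class Neg:            # e ::= -e
    operand: 'Expr'

@dataclass
class Add:            # e ::= e + e
    left: 'Expr'
    right: 'Expr'

Expr = Num | Neg | Add   # needs Python 3.10+

example = Neg(Add(Num(5), Num(2)))   # the tree for -(5 + 2)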
Operational Semantics
A language is brought to life by defining what it means to run a
program in the language, that is, the operational semantics of a
language. In the case of $\mathit{Arith}$, we just need to specify the
integer output of each program. As discussed above, relations can be
used to map inputs to outputs, and indeed, we typically use relations
for this purpose in PL theory. There are several different styles of
relations; the first we'll discuss is a big-step style of
semantics that maps a program directly to its output.
Big-step Semantics
Let's define a relation $\mathit{eval}$ that maps elements of
$\mathit{Arith}$ to integers. For example, we'd like to have
$(-(5 + 2), -7) \in \mathit{eval}$. This relation will be
infinite (because there are an infinite number of programs in
$\mathit{Arith}$), so again we'll use a bunch of rules to define
$\mathit{eval}$. But before we start, it's common to introduce some
shorthand: $e \Downarrow n$ means
$(e, n) \in \mathit{eval}$. Below we state the rules that define
$\mathit{eval}$ using the horizontal notation. To make
sure we don't leave out any programs, we create one rule for each
syntactic rule of $\mathit{Arith}$ (there are three). We say that the
rules are syntax-directed when there is one rule for each
syntactic rule in the language.

$$\frac{}{n \Downarrow n}
\qquad
\frac{e \Downarrow n}{-e \Downarrow -n}
\qquad
\frac{e_1 \Downarrow n_1 \quad e_2 \Downarrow n_2}{e_1 + e_2 \Downarrow n_1 + n_2}$$
It may seem a little odd that I'm defining the language's - in terms of
the mathematical $-$, and similarly for +. Isn't that circular? No, the
$-$ and $+$ on the right-hand sides of the rules
are the usual arithmetic operators for integers that everyone
learned in grade school. In this way, the $\mathit{Arith}$ language is
rather odd in not using 32 or 64-bit arithmetic. An implementor of
$\mathit{Arith}$ would have to use a big-integer package to properly
handle the arithmetic.
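The big-step rules transcribe almost directly into a recursive
interpreter. A sketch of mine, reusing the Python classes from above
(Python 3.10+ for match):

def evaluate(e):
    match e:
        case Num(n):                            # n ⇓ n
            return n
        case Neg(e1):                           # if e ⇓ n, then -e ⇓ -n
            return -evaluate(e1)
        case Add(e1, e2):                       # sum the two results
            return evaluate(e1) + evaluate(e2)

print(evaluate(Neg(Add(Num(5), Num(2)))))  # -7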
Small-step Semantics
The second, and perhaps more common, style of operational semantics is
small-step semantics. In this style, the relation doesn't map a
program to its output, but instead it maps a program to a slightly
simplified program in which one subexpression has been computed. This
style of semantics can be thought of as textual rewriting. To give an
example of this style, let's define a relation named
$\longmapsto$. We'll want this relation to have the following
elements, among many others:

$$(-(5 + 2),\ -(7)) \qquad (-(7),\ -7)$$

Again, we'll introduce shorthand:
$e \longmapsto e'$ means $(e, e') \in \;\longmapsto$.
Also, we'll chain together steps, so
$e_1 \longmapsto e_2 \longmapsto e_3$ means
$e_1 \longmapsto e_2$ and $e_2 \longmapsto e_3$.
The term reduce is a synonym for step.
The above example of two steps can now be written as

$$-(5 + 2) \longmapsto -(7) \longmapsto -7$$

OK, on to the rules that define the $\longmapsto$ relation.
There are five rules, which we explain below.

$$\frac{}{n_1 + n_2 \longmapsto n_3} \ (1)
\qquad
\frac{}{-(n_1) \longmapsto n_2} \ (2)$$

where, in rule (1), $n_3$ is the sum of $n_1$ and $n_2$ and, in
rule (2), $n_2$ is the negation of $n_1$;

$$\frac{e \longmapsto e'}{-(e) \longmapsto -(e')} \ (3)
\qquad
\frac{e_1 \longmapsto e_1'}{e_1 + e_2 \longmapsto e_1' + e_2} \ (4)
\qquad
\frac{e \longmapsto e'}{n + e \longmapsto n + e'} \ (5)$$
Rules (1) and (2) are the most interesting; they perform the
arithmetic. We call them computational reduction rules. Rules
(3-5) allow us to reach inside of sub-expressions to perform
computation. They are often called congruence rules for
reasons we won't go into. The use of the variable $n$ in rule (5)
means that reduction proceeds from left to right. In particular,
we're not allowed to reduce the right-hand expression of a plus until
we've already reduced the left-hand side to an integer.
Aside: This left-to-right ordering is a choice that I made as the
designer of this example language. I could have not specified an ordering,
letting it be non-deterministic. This example language doesn't have
side-effects, so the ordering doesn't matter! However, most
languages do have side-effects, and most of them do specify
an ordering (but not all!), so I thought to include an
example of how ordering is typically specified.
Time for an example: let's see the derivation of the step
$-(5 + 2) \longmapsto -(7)$.

$$\frac{\dfrac{}{5 + 2 \longmapsto 7} \ (1)}{-(5 + 2) \longmapsto -(7)} \ (3)$$
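A single step is easy to express in code as well: a function that
rewrites an expression once, trying the rules in order. My sketch,
continuing with the classes from above:

def step(e):
    match e:
        case Add(Num(n1), Num(n2)):      # rule (1): do the addition
            return Num(n1 + n2)
        case Neg(Num(n1)):               # rule (2): do the negation
            return Num(-n1)
        case Neg(e1):                    # rule (3): step inside a negation
            return Neg(step(e1))
        case Add(Num(_) as n, e2):       # rule (5): left side already done
            return Add(n, step(e2))
        case Add(e1, e2):                # rule (4): step the left side
            return Add(step(e1), e2)

e = Neg(Add(Num(5), Num(2)))
e = step(e); print(e)  # Neg(Num(7)), i.e. -(7)
e = step(e); print(e)  # Num(-7)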
We've defined a single step of computation, the $\longmapsto$
relation,
but we're not quite done. We still need to specify
what it means to run a program to completion.
We'll do this by defining a relation $\mathit{run}$
in terms of the $\longmapsto$ relation
as follows. In plain English, the
$\mathit{run}$ relation will contain any pair $(e, n)$
if expression $e$ reduces to
integer $n$ in zero or more steps.

$$\mathit{run} = \{ (e, n) \mid e \longmapsto^{*} n \}$$

Some of the notation here is new and is explained below.
The notation
$\{ \ldots \mid \ldots \}$
is the set builder or set comprehension notation for
defining a set. The stuff to the left of the vertical bar is
a template for a typical element of the set and the stuff to the right
of the vertical bar places restrictions on the elements in the set.
The notation
$e \longmapsto^{*} e'$
means zero or more steps. I like to define this multi-step relation with the
following rules:

$$\frac{}{e \longmapsto^{*} e}
\qquad
\frac{e_1 \longmapsto e_2 \quad e_2 \longmapsto^{*} e_3}{e_1 \longmapsto^{*} e_3}$$

(My brain is good at reasoning about Lisp-style lists, so
I think of the first rule as nil and the second rule as cons.)
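In code, running to completion is just a loop that keeps calling step
until the expression is a bare integer. My sketch:

def run(e):
    while not isinstance(e, Num):  # zero or more steps
        e = step(e)
    return e.value

print(run(Neg(Add(Num(5), Num(2)))))  # -7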
Type Systems (with the Lambda Calculus as an example)
Many programming languages are statically typed, that is, the compiler
performs some sanity checking before proceeding with the actual work
of compiling. The checking usually involves making sure that objects are
only used as intended, for example, not trying to treat an integer as
if it were a function. The way a programming language designer (PL
theorist) specifies what kind of sanity checking should be performed
is by defining a type system for the language. The
$\mathit{Arith}$ language is so simple that there is no interesting
type checking to be performed. Let's consider a slightly larger
language that also happens to be used over and over again in PL
theory, the lambda calculus (technically, the simply-typed lambda
calculus). The lambda calculus just consists of first-class anonymous
functions. Here we'll extend the lambda calculus to also include our
arithmetic expressions. So now our example language is defined by the
following grammar.

$$e ::= n \mid -e \mid e + e \mid x \mid \lambda x{:}T.\ e \mid e \; e$$
The variable $x$ ranges over parameter names, such as foo and
g. Two expressions right next to each other denote function
application (i.e., function call). So if you're familiar with the C
language, read $e_1 \; e_2$ as $e_1(e_2)$. In the lambda calculus,
functions only take one parameter, so function calls only require one
argument. The syntax $\lambda x{:}T.\ e$ creates a function with one
parameter named $x$ of type $T$ (types will be defined shortly) and
whose body is the expression $e$. (A common point of confusion is to
think that $x$ is the name of the function. It instead is the
parameter name. The function is anonymous, i.e. it doesn't have a
name.) The return value of the function will be whatever the
expression $e$ evaluates to.
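The earlier Python sketch extends to the three new expression forms in
the obvious way (again, the class names are mine; 'Type' is sketched
just below):

@dataclass
class Var:            # e ::= x
    name: str

@dataclass
class Lam:            # e ::= λx:T. e
    param: str
    param_type: 'Type'
    body: 'Expr'

@dataclass
class App:            # e ::= e e
    fun: 'Expr'
    arg: 'Expr'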
Now let's consider what kind of objects will be alive when we run the
program: there's integers and functions. We'll create a set of
types to describe the kinds of objects, using $T$ to range over
the set of types.

$$T ::= \mathtt{Int} \mid T \to T$$

In a function type $T_1 \to T_2$, the $T_1$ is the type of the
parameter and $T_2$ is the return type.
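Types themselves can be represented in code just like expressions. A
sketch of mine:

@dataclass
class IntType:        # T ::= Int
    pass

@dataclass
class FunType:        # T ::= T -> T
    param: 'Type'
    ret: 'Type'

Type = IntType | FunType          # needs Python 3.10+

int_to_int = FunType(IntType(), IntType())  # the type Int -> Int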
The job of a type system is to predict which type of value will be
produced by an expression. For example, the expression -(5 +
2) should have the type Int because the result of
-(5 + 2) is -7, which is an integer. As with the
syntax and operational semantics of a language, PL theorists use
relations and rules to define a type system. We'll define a relation
named $\mathit{wt}$ (for "well typed") that, as a first approximation,
maps expressions to types, so for example, we'll have
$(-(5 + 2), \mathtt{Int}) \in \mathit{wt}$.
However, because the lambda calculus includes variables, we'll need
something analogous to a symbol table, a relation called a
type environment, to keep track of which variables have which
types. The Greek letter $\Gamma$ (gamma) is traditionally used for
this purpose. We'll need to be able to create new type environments
from old ones, potentially overshadowing variable definitions from
outer scopes. To set up the mathematical machinery for that, we define
$\Gamma - x$ to be the relation just
like $\Gamma$ except that any tuple starting with $x$ is removed.
(The way the type system will be defined,
there may be 0 or 1 tuple that starts with $x$, making the type
environment a special kind of relation
called a partial function.)
We'll write $\Gamma, x{:}T$ for the operation of extending a type environment
with variable x, possibly overriding a previous definition,
and define it as follows:

$$\Gamma, x{:}T = (\Gamma - x) \cup \{ (x, T) \}$$
Suppose we have

$$\Gamma = \{ (x, \mathtt{Int}), (y, \mathtt{Int}) \}$$

Then

$$\Gamma, y{:}\mathtt{Int} \to \mathtt{Int} =
\{ (x, \mathtt{Int}), (y, \mathtt{Int} \to \mathtt{Int}) \}$$
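A dictionary with copy-on-extend gives the same behavior in Python. My
sketch:

def extend(gamma, x, t):
    new = dict(gamma)  # copy, so the old environment is untouched
    new[x] = t         # override any previous binding of x
    return new

gamma = {'x': IntType(), 'y': IntType()}
gamma2 = extend(gamma, 'y', FunType(IntType(), IntType()))
print(gamma['y'])   # IntType(): the old environment is unchanged
print(gamma2['y'])  # FunType(param=IntType(), ret=IntType())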
One way in which type environments are different from the global
symbol table in a compiler is that there isn't just one type environment,
there will be lots of them, one for each scope. Further, we won't ever update
a type environment in place, we'll keep creating new ones that differ a little
bit from the old ones. From a programming perspective, the mathematical metalanguage
we're using here is purely functional, that is, it doesn't use
state or side effects. The reader might worry that this could lead to
inefficiency, but remember, we're not writing a program here, we're
writing a specification! Clarity is what matters most in this setting,
and staying pure helps to make things clear.
Getting back to the $\mathit{wt}$
relation, instead of containing 2-tuples (pairs) it will contain
3-tuples (triples) of the form $(\Gamma, e, T)$,
so we'll be assigning types to expressions in the context of
a type environment.
As yet more shorthand (PL theorists love shorthand!), we'll write
$\Gamma \vdash e : T$ instead of $(\Gamma, e, T) \in \mathit{wt}$.
We're now ready to write down the rules that define $\mathit{wt}$.

$$\frac{}{\Gamma \vdash n : \mathtt{Int}}
\qquad
\frac{\Gamma \vdash e : \mathtt{Int}}{\Gamma \vdash -e : \mathtt{Int}}
\qquad
\frac{\Gamma \vdash e_1 : \mathtt{Int} \quad \Gamma \vdash e_2 : \mathtt{Int}}
     {\Gamma \vdash e_1 + e_2 : \mathtt{Int}}$$

$$\frac{(x, T) \in \Gamma}{\Gamma \vdash x : T}
\qquad
\frac{\Gamma, x{:}T_1 \vdash e : T_2}
     {\Gamma \vdash \lambda x{:}T_1.\ e : T_1 \to T_2}
\qquad
\frac{\Gamma \vdash e_1 : T_1 \to T_2 \quad \Gamma \vdash e_2 : T_1}
     {\Gamma \vdash e_1 \; e_2 : T_2}$$
To sum up the above rules, the arithmetic operators work on integers,
variables get their types from the environment, lambdas are given
function types based on their parameter type and their deduced return
type, the body of a lambda is checked using the environment from the
point of creation (this is lexical scoping) extended with the lambda's
parameter, and function application is sane so long as the argument's type
is the same as the parameter type.
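Putting the pieces together, the rules become a recursive type checker:
one branch per rule, with the environment threaded through. My sketch:

def type_check(gamma, e):
    match e:
        case Num(_):
            return IntType()                       # integer literals
        case Neg(e1) if type_check(gamma, e1) == IntType():
            return IntType()                       # negation of an Int
        case Add(e1, e2) if (type_check(gamma, e1) == IntType()
                             and type_check(gamma, e2) == IntType()):
            return IntType()                       # addition of Ints
        case Var(x):
            return gamma[x]                        # look up the variable
        case Lam(x, t1, body):                     # extend, check the body
            return FunType(t1, type_check(extend(gamma, x, t1), body))
        case App(e1, e2):
            t = type_check(gamma, e1)
            if isinstance(t, FunType) and type_check(gamma, e2) == t.param:
                return t.ret                       # argument matches parameter
    raise TypeError('ill-typed expression')

print(type_check({}, Lam('x', IntType(), Add(Var('x'), Num(1)))))
# FunType(param=IntType(), ret=IntType())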
Conclusion
Thus ends this crash course on the notation used in programming
language theory. This blog post only scratches the surface, but much
of the additional notation that you'll need is variations on what's
covered here. Happy reading! And P.S., feel free to ask questions in
the form of comments to this blog.