Notes
Links
Introduction
compiler(source) -> destination
Source can be any programming language. Destination can be machine code,
byte code, or even more source code, even in the same language
(code obfuscators are one such example).
- lexical scanning tokenization
- context-free parsing
- AST (abstract tree)
- functions as a parse tree, but is a more convenient representation
- two types of parsing
- top down (LL)
- recursive descent
- bottom up (LR)
- generates a table
- humans do not write LR parsers by hand
- top down (LL)
- semantic analysis
- type checking / type inference
- static semantics
-
dynamic analysis
int a = b + c + d; // can be represented by // mov b tmp add c tmp add d tmp mov tmp a
- dataflow (graph)
- optimization
- forward
- backward
- optimization
- code generation
- multiple approaches
- tree-based
- munch
- graph-based
- peep-hole
- bytecode
- jit
- tree-based
- multiple approaches
- AST (abstract tree)
Tokenization
java
class A { int x; if (x == 0) ... }
- regular expressions
A in Alphabet
x a
l == { a, b, ..., z, A, B, ..., Z, _ }
d == { 0, 1, 2, ..., 9 }
ID == l(l|d)*
class
andif
both matchID
, so you either make explicit expressions for them beforeID
, or you just tokenize them as identifiers and parse them separately later- reason not to like Fortran, number 7813
-
language cannot be defined with regular expressions
fortran C this is legal do i = 3,6 C so is this doi=3
-
- literals
- int lit (integer literal)
- do we treat negatives and explicit positives as a unary operator
(
+
or-
) followed by a number, or as a single entity (e.g.-1
,+3
)d+
(+|-)?d+
- what about hexadecimal, binary, and octal literals?
- do we treat negatives and explicit positives as a unary operator
(
- float literal
- Some languages make you do
0.1
, while others allow.1
- What about
1.
? - Assuming we allow all of these, we could write the regular
expression
(d*\.d+)|(d+\.d*)
- what about scientific notation? (e.g.
1.3e5
) you could do(d*'.'d+)|(d+'.'d*)(('e'|'E')('-'|'+')?d+)?
- some languages let you add an
f
ord
afterwards to specify single vs double precision, so(d*'.'d+)|(d+'.'d*)(('e'|'E')('-'|'+')?d+)?(f|d)?
- Some languages make you do
- char literal
- looks like
'a'
- we were using
'
to denote literal characters in regexes already, so I guess we’ll have to escape it to mean the literal'
char \'[^\']|\\.\'
- looks like
- string literal
- assuming multi-line strings are not allowed,
"([^\\"]|\\.)*"
- assuming multi-line strings are not allowed,
- int lit (integer literal)
- non-deterministic finite state automaton (NDFA)
- for any NDFA there is a DFA which accepts the same language
- for any DFA there is a table
- for any table you can write an interpreter
- some machines for regexes
x
xy
x|y
-
x+
n l d . 0 {b, c, e, g} cl(b) ⊥ ⊥ 1 {d, g, b, c, e} cl(d)/2 cl(f)/3 ⊥ 2 {f, g, b, c, e} 2 3 ⊥
- lexical analysis
- yacc
-
lex, flex, parser generator
grep [0-9]+ { return INTLIT; } [a-z][a-z0-9]* { return screen(yynxt); } // { eatto('\n'); }
- parsing
- assume we have a language parsable by a CFG (though this is usually a lie)
S -> A B | C d
A -> A x | ε
A -> ( A )
A -> x*
E -> E * E | E + E | v
- ambiguous statement (order of operations)
- moral: every CFG can have ambiguities
- every language should not have ambiguities
- Fortress was a language that let people program math in TeX
- project failed because it was too hard with all the ambiguities
- dangling else problem
if ... if ... else ...
- top down parsing
S -> A B C | D
S() { A(); B(); C(); }
A -> a
A() { match('a'); }
match
is a call to your lexical analyzer
- operations
check(token)
advance
match
check
thenadvance
java if (check('a')) { // I think DL might have meant `check('s')` A(); B(); C(); } else { D(); }
- ll-parsing
- recursive descent
- for every left-hand-side of the grammer, write a potentially recursive method
-
S -> A | xA | ε
antlr # ambiguous precedence E -> E + E | E * E | ( E ) | v # break it apart for unambiguity E -> E + T | T T -> T * F | F F -> ( E ) | v # C does 9 levels of this trick
- head grammar with
S -> E $
, where$
denotesEOF
- this makes sure everything ends with the end of file
- defining something like
E -> E + E ...
would mean writing a method which doesE() { E() }
, which is baad. -
Left Recursion Transformations remove the recursion problem
antlr # originally A -> A α1 | A α2 | ... | A αN | β # transform to A -> β A' A' -> α1 A' | α2 A' | ... | αN A' | ε
-
using this transformation on the previous example
antlr E -> T E' E' -> + T E' | ε T -> F T' T' -> * F T' | ε # unchanged F -> ( E ) | v
-
Left Factoring
antlr A -> α β1 | α β2 | γ # transforms into A -> α A' | γ A' -> β1 | β2
antlr A -> B x | B y # transforms into A -> B A' A' -> x | y
- yet another grammar example
antlr
E : E E
| E '*'
| E '|' E
| '(' E ')'
| ID
;
antlr
E : E '|' T
| T
;
T : T F
| F
;
F : F '*'
| A
;
A : '(' E ')'
| ID
;
antlr
E : T E_
;
E_ : '|' T E_
| ε
;
T : F T_
;
T_ : F T_
| ε
;
F : A F_
;
F_ : '*' F_
| ε
;
A : '(' E ')'
| ID
;
- first and follows
A : a | b;
- pick
a
if next token is infirst(a)
ifa = t
and next token is inFollow(A)
- algorithm
- add
S' -> S $
A -> α B β
addfirst(β)
tofollow(B)
A -> α B
orA -> α B β
nullableβ
addFollow(A)
toFollow(B)
S -> E $
A = { (, ID }
f(F')
F' = { ε, A }
f(T')
F = { (, ID }
f(T')
T' = { ε, (, ID }
F(T)
T = { (, ID }
f(E')
E' = { ε, | }
F(T)
E = { (, ID }
F(T)
)
$
- add
- LL parsers
- transform to be LL-parsable
- find first/follows
- for each lhs:
A -> B | C
- for each symbol
x
infirst(B)
- if
B
is nullable, for eachx
inFollows(B)
, check and callB
- otherwise, throw errors
- error recovery
- add
- delete
- replace
- error recovery
- if
- for each symbol
-
LR parsers
S' -> S $ S -> 'int' L L -> L ',' ID L -> ID
C int a, b
-
first thing we see is
'int'
, which is a prefix which is part of a viable right hand side (S
), so we push it onto a stackstack int
-
then we see
a
which is anID
, (though we don’t know which RHS it goes in), so we push it onto a stackstack int
ID
-
the stack now contains
'int' ID
, which matches'int' L
, so we reducestack int
L
-
now we read in
,
, which is part ofL
, and push it onto the stackstack int
L
,
-
now we read
b
, which is anotherID
stack int
L
,
ID
-
we now have
L ',' ID
, which is another validL
, so we reducestack int
L
-
we have
'int' L
, which is a validS
, so we reducestack S
-
now we read in
$
, so we push it onto the stackstack S
$
S $
is a validS'
, so we reduce toS'
, and we’re done!- Item Sets:
- Set 0 (core)
S' -> S $
- Closure
S -> 'int' L
- the two above are together one item set
Set Core Closure 0 S' -> S $
(1)S -> 'int' L
(2)1 S' -> S . $
(acc) 2 S -> int . L
(3)L -> . L ',' ID
(3)L -> . ID
(4)3 S -> 'int' L .
(-)S -> L . , ID
(3)4 L -> ID
(-)5 S -> L ',' . ID
6 S -> L ',' ID .
set int
','
ID
$
S
L
0 +2 goto1 1 acc 2 +4 goto3 3 +5 S -> int L
4 L -> ID
L -> ID
5 +6 6 S -> L, int
- Set 0 (core)
-
now our parse looks something like
0 1 2 ' ' 0
' ' 0
' ' 0
'int' 2
'int L'
'S'
'ID' 4
-
-
parser tools
- another table
antlr
S : E $
;
E : E + E
| E * E
| ( E )
| LIT
;
label | ||
---|---|---|
a | S : . E $ (b) |
E : . E + E (b) |
E : . E * E (b) |
||
E : . ( E ) (c) |
||
E : . LIT (d) |
||
b | S : E . $ (acc) |
|
E : E . + E (e) |
||
E : E . * E (f) |
||
c | E : ( . E ) (g) |
E : . E + E (g) |
E : . E * E (g) |
||
E : . ( E ) (c) |
||
E : . LIT (d) |
||
d | E : LIT . (-) |
|
e | E : E + . E (h) |
E : . E + E (h) |
E : . E * E (h) |
||
E : . ( E ) (c) |
||
E : . LIT (d) |
||
f | E : E * . E (i) |
E : . E + E (i) |
E : . E * E (i) |
||
E : . ( E ) (c) |
||
E : . LIT (d) |
||
g | E : ( E . ) (j) |
|
E : E . + E (e) |
||
E : E . * E (f) |
||
h | E : E + E . (-) |
|
E : E . + E (e) |
||
E : E . * E (f) |
||
i | E : E * E . (-) |
|
E : E . + E (e) |
||
E : E . * E (f) |
||
j | E : ( E ) . (-) |
LIT |
( |
+ |
* |
) |
$ |
S |
E |
|
---|---|---|---|---|---|---|---|---|
a | +d | +c | b | |||||
b | +e | +f | acc | |||||
c | +c | +d | g | |||||
d | E:ID |
E:ID |
E:ID |
E:ID |
||||
e | +d | +c | h | |||||
f | +d | +c | ||||||
g | +e | +f | +j | |||||
h | +e | +f | ||||||
E:E+E |
E:E+E |
E:E+E |
E:E+E |
|||||
i | +e | +f | ||||||
E:E*E |
E:E*E |
E:E*E |
E:E*E |
|||||
j | E:(E) |
E:(E) |
E:(E) |
E:(E) |
Errors
- detect
- the science part
- report
- the art part
- explosion and failure are the 2 sides of the report specturm
- recovery
- where things get pretty insane
- add / remove / replace
- add phantom productions to your grammar
- not legal code, but accepted by the parser
- lookaheads for sync tokens like
;
help escape from errors, because errors tend to be separated by them-
partly for this reason, error messages for spurious sync tokens are almost universally bad
$ javac Foo.java Foo.java:2: error: illegal start of type public { static void main(String[] args) { ^ Foo.java:2: error: ';' expected public { static void main(String[] args) { ^ 2 errors
-
Static Semantics
- checks for things which are syntactically legal, but not allowed
- type checks
- declaration before use
- write before read
- exception throws
- unreachable code
- returns
- clashes
- scopes
- overloads
- overrides
- need some program representation
- we have a parser therefore we have a parse tree
- we want something better than a parse tree, which isn’t too far from it
- enter the AST
- bottom-up tree building
- start with the children
- called synthetic or s-attributed
- parent nodes are synthesized from their children
- can be done without ever actually building the tree
- all compilers built 20+ years ago were bottom-up, as they could not fit the tree in memory
- top-down tree building
- start from the root
- called L-attributed
- program representations mainly need to do 2 things
- bindings
- constraints
klass
klass super
(or null)- is reference?
set<klass> interfaces
- attributes (e.g.
abstract
) String name
set<var> statics
set<var> fields
set<method> methods
Var
String name
(orSymbol
)klass type
set<bit> attributes
(e.g.static
,final
,constant
)Value value
method
String name
List<Var> params
Var return-type
klass owner
attributes
(final
,public
,private
, etc.)Block body
Block
List<Stat> stats
Stat
ifStat
cond
then stat
else stat?
assignStat
lhs
rhs
declStat
- just like assign but introduces a new variable
callStat
ExpStat
Exp
AddExp
MulExp
NegateExp
- sub-categories:
nullary
unary
binary
ternary
- we need a mapping of
Symbol
toStorage
Symbol
should be a thin venier overID
Compilation process:
- create
- resolve / install / check
- type-check
- static semantics
Essentially we have a big table, where the columns represent types of nodes, and the rows represent checks we want to do.
Anode |
Bnode |
… | |
---|---|---|---|
type-check | x | x | … |
exception-check | x | x | … |
eq-hash | x | x | … |
… | … | … | … |
Two ways to handle this are to make a class for each type of node, with methods for each of the checks you want to perform, or you can make a function for each type of check, with a big switch for each type of node.
Another approach is the visitor pattern.
analyze(TypeCheck a, Node p)
analyze(ExceptionCheck a, Node p)
Visitor Pattern (double dispatch)
class Visitor {
visit PlusNode(Node p) {}
}
class Node {
class PlusNode {
int left, right;
void accept(Visitor v) {
v.visit(this);
}
}
}
- resolution
- exactly once
class C {
int a;
int a; /* illegal */
}
- scoping
class C {
int a;
void f() {
int a = 1;
{
int a = 2; /* legal */
f(a);
}
f(a)
}
}
- shadowing
class C {
static int a;
}
class D extends C {
static int a;
}
- overloading / overriding
Resolution and Type Checking
- single pass
install
on production- look up or add
- “If it’s not there, put it. If it is, use it.”
- side-stack
- to help with name resolution, when you enter a scope, you push onto a stack
- context
- the only languages that allow for single-pass these days are dinosaur languages like C, Fortran, and Algol
- multi pass
install
- add it
- use it
- error
- build up the attributes of a class as you parse it
lookup(name, arglist, rettype)
- search through the current scope for
name
- if
name
is not found, recur in the next scope up - if the top scope is reached,
lookup
returnsnil
- tip: when implementing this for mini-java, start with class declarations, as that’s almost no work
- search through the current scope for
isAssignable(lhs, rhs)
- same type on both sides (good)
lhs
is same class (or super class) ofrhs
- coercible
- (warning?)
- type inference
- legal scala:
val one = 1
- (Damas-)Hindley-Milner type system
- legal scala:
Macros
- C preprocessor style
- it’s a separate program, so that’s the end of the story
- other styles
- lisp
- builds a tree node when it finds a macro during compilation
- lisp
- generics
- C++ style templates are actually macros
- Java uses common code
- treat all types as
Object
at first - insert casts to the specific type
- uses boxing on primitives
- treat all types as
Locations and Variables
int a = 3;
- where does
3
live? - in recent decades:
- locals
- stack frames?
- registers?
- statics
- allocated before program starts
- do not go away
- heaps
- that’s where java puts
new
things
- that’s where java puts
- other
- scoped memory location
- device-level stuff
- locals
- where to put things?
int f() {
/* we're in a function, so obviously it's a local */
int a = 3;
/* the following is actually represented as
int tmp = c + d;
int b = a + tmp;
*/
int b = a + c + d;
}
load c r1
load d r2
load *something* r3
add r1 r3 r4
store r4 tmp
In memory, every object starts at a certain location, and each of its fields exist at a certain multiple of the offset from that location. In most OO languages, the first field is always an object header. That header usually contains a pointer to class itself, which is in the static memory. The class has a table of its methods, which are just function pointers. This is the bare minimum for the header, but most languages that aren’t C++ also keep some bookkeeping.
Array access:
- conceptually, we have to do
origin + (index * width)
- however, multiplication is expensive
- but wait, all of our widths should be powers of two!
- if width is
4
, sincelg(4) = 2
, the operation becomesorigin + (index << 2)
- there is usually a specific machine instruction for this, since it is so common
Subclassing:
class Point { int x,y; }
class Point3 extends Point { int z; }
In memory, Point
would have a header,
and then x
and y
at certain offsets.
Class Point3
has to add z
at an offset after x
and y
.
This scheme makes multiple inheritance a nightmare.
Resolving multiple inheritance can only be done with adding layers of
indirection.
main() {
int a;
int tmp = f(a);
int b = tmp;
}
int f(int x) {
int y = x + 1;
int z = g(y);
return z;
}
int g(int x) {
return x + 1;
}
Representing and Generating Code
Representations
- instructions (x86, ARM, etc)
- CISC
- large instruction set
op (mem)++ -> mem
- RISC
- very small instruction set
load mem -> reg
store reg -> mem
op reg -> reg
+misc
- CISC
- bytecodes
- machine independent code which is trivially translatable to machine code
- internal representation
- some kind of graph
- eventually transformed into one of the above representations
Depending on the compiler, some or all of these representations may be used sequentially
JVM
Java class files
- constant pool
- a class file which is basically a big table of all compile-time constants
- integer literals (e.g.
13
) - string literals (e.g.
"hi mom"
) - class names (e.g.
Ljava/lang/Object;
)
- integer literals (e.g.
- every constant needs to be referenced by its index in the pool
- tools exist to do this for you
- Jasmin
- not used much anymore, not really maintained
- ASM
- used extensively in enterprise, well maintained
- clojure example
- BCEL
- Jasmin
- tools exist to do this for you
- a class file which is basically a big table of all compile-time constants
Optimization
- a misnomer
- better term is “improvement”
- better coding
- usually means (expected) faster
- smaller
- these can be contradictory
- internal-only
- better coding
- constraints
- equivalent behavior?
- compatible behavior w.r.t. spec
- representation
- bytecode
- target machine code
- quads
(add, v1, v2, v3)
(neg, v1, _, v3)
(load, add, _, dest)
(if, v1, _, something)
- basic block
- forms
- (forward) symbolic execution
- “never do at runtime what you can do at compile time”
- (backward) liveness analysis
- dead code removal
- strength reduction
- algebraic
a + 0 -> a
2 * x -> x << 1
- redundancies
a = 1; a = 1; -> a = 1
- algebraic
- SSA form
- every time a variable is altered, instead store it in a new name
a = 1; a++; -> a = 1; a' = a + 1;
- “pseudo-functional programming”
- (forward) symbolic execution
Data Flow
- forward, reverse
- graphs
- split
- merge
- cycles
- fixed point
- special cases
- loops
- hoisting
- induction variables