Notes
Introduction
compiler(source) -> destination
Source can be any programming language. Destination can be machine code,
byte code, or even more source code, possibly in the same language
(code obfuscators are one such example).
- lexical scanning / tokenization
- context-free parsing
- abstract syntax tree (AST)
  - functions as a parse tree, but is a more convenient representation
- two types of parsing
  - top down (LL)
    - recursive descent
  - bottom up (LR)
    - generates a table
    - humans do not write LR parsers by hand
- semantic analysis
  - type checking / type inference
  - static semantics
- dynamic analysis
  - `int a = b + c + d;` can be represented by:

    ```
    mov b tmp
    add c tmp
    add d tmp
    mov tmp a
    ```

  - dataflow (graph)
- optimization
  - forward
  - backward
- code generation
  - multiple approaches
    - tree-based
      - munch
    - graph-based
    - peep-hole
    - bytecode
    - JIT
Tokenization
```java
class A { int x; if (x == 0) ... }
```
- regular expressions
  - `l == { a, b, ..., z, A, B, ..., Z, _ }`
  - `d == { 0, 1, 2, ..., 9 }`
  - `ID == l(l|d)*`
  - `class` and `if` both match `ID`, so you either make explicit expressions for them before `ID`, or you just tokenize them as identifiers and parse them separately later
  - reason not to like Fortran, number 7813: the language cannot be tokenized with regular expressions, since whitespace is insignificant

    ```fortran
    C this is legal
          do i = 3,6
    C so is this
          doi=3
    ```
- literals
  - int literal
    - do we treat negatives and explicit positives as a unary operator (`+` or `-`) followed by a number (`d+`), or as a single entity (e.g. `-1`, `+3`: `(+|-)?d+`)?
    - what about hexadecimal, binary, and octal literals?
  - float literal
    - some languages make you write `0.1`, while others allow `.1`
    - what about `1.`?
    - assuming we allow all of these, we could write the regular expression `(d*'.'d+)|(d+'.'d*)`
    - what about scientific notation (e.g. `1.3e5`)? you could do `((d*'.'d+)|(d+'.'d*))(('e'|'E')('-'|'+')?d+)?`
    - some languages let you add an `f` or `d` afterwards to specify single vs double precision, so `((d*'.'d+)|(d+'.'d*))(('e'|'E')('-'|'+')?d+)?(f|d)?` (see the Java sketch after this list)
  - char literal
    - looks like `'a'`
    - we were using `'` to denote literal characters in regexes already, so I guess we'll have to escape it to mean the literal `'` char: `\'([^\']|\\.)\'`
  - string literal
    - assuming multi-line strings are not allowed, `"([^\\"]|\\.)*"`
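As a sanity check, the float-literal regex can be written in Java's regex syntax. A minimal sketch (the class name and test strings are just for illustration):

```java
import java.util.regex.Pattern;

class FloatLit {
    // the float-literal regex from above: (d*'.'d+)|(d+'.'d*),
    // with optional exponent and optional f/d suffix
    static final Pattern FLOAT =
        Pattern.compile("(\\d*\\.\\d+|\\d+\\.\\d*)([eE][+-]?\\d+)?[fd]?");

    public static void main(String[] args) {
        System.out.println(FLOAT.matcher(".1").matches());    // true
        System.out.println(FLOAT.matcher("1.").matches());    // true
        System.out.println(FLOAT.matcher("1.3e5").matches()); // true
        System.out.println(FLOAT.matcher("13").matches());    // false: no dot
    }
}
```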
- non-deterministic finite state automaton (NDFA)
  - for any NDFA there is a DFA which accepts the same language
  - for any DFA there is a table
  - for any table you can write an interpreter (see the sketch below)
  - some machines for regexes: `x`, `xy`, `x|y`, `x+` (state diagrams omitted)
  - subset-construction table (⊥ = dead state):

    | n | NFA states | l | d | . |
    |---|---|---|---|---|
    | 0 | {b, c, e, g} | cl(b) | ⊥ | ⊥ |
    | 1 | {d, g, b, c, e} | cl(d)/2 | cl(f)/3 | ⊥ |
    | 2 | {f, g, b, c, e} | 2 | 3 | ⊥ |
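The NDFA → DFA → table → interpreter chain ends in very little code. A minimal sketch of a table-driven interpreter, here with a hypothetical two-state table recognizing `ID == l(l|d)*`:

```java
class DfaInterp {
    // character classes: 0 = letter/underscore, 1 = digit, 2 = other
    static int cls(char c) {
        if (Character.isLetter(c) || c == '_') return 0;
        if (Character.isDigit(c)) return 1;
        return 2;
    }

    // TABLE[state][class] = next state; -1 is the dead state
    // state 0: start; state 1: accepting (we have seen l(l|d)*)
    static final int[][] TABLE = {
        { 1, -1, -1 },  // state 0
        { 1,  1, -1 },  // state 1
    };

    static boolean acceptsId(String s) {
        int state = 0;
        for (char c : s.toCharArray()) {
            state = TABLE[state][cls(c)];
            if (state < 0) return false; // fell into the dead state
        }
        return state == 1;
    }

    public static void main(String[] args) {
        System.out.println(acceptsId("x2"));  // true
        System.out.println(acceptsId("2x"));  // false
    }
}
```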
- lexical analysis tools
  - lex, flex (yacc is the companion parser generator)

    ```
    [0-9]+          { return INTLIT; }
    [a-z][a-z0-9]*  { return screen(yytext); }
    "//"            { eatto('\n'); }
    ```
- parsing
  - assume we have a language parsable by a CFG (though this is usually a lie)

    ```
    S -> A B | C d
    A -> A x | ε
    A -> ( A )
    A -> x*
    E -> E * E | E + E | v
    ```

  - the last grammar is ambiguous (order of operations)
  - moral: any CFG can have ambiguities
  - no language should have ambiguities
  - Fortress was a language that let people program math in TeX
    - the project failed because it was too hard with all the ambiguities
  - dangling else problem: `if ... if ... else ...` (see the sketch below)
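Concretely, in Java (a hedged illustration; like most languages, Java resolves the ambiguity by binding the `else` to the nearest `if`):

```java
class DanglingElse {
    static void f() {}
    static void g() {}

    // the grammar allows two parses of this statement; Java, like most
    // languages, binds the else to the nearest if
    static void example(boolean a, boolean b) {
        if (a)
            if (b) f();
            else g(); // attaches to `if (b)`, not `if (a)`
    }
}
```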
- top down parsing
  - `S -> A B C | D` becomes `S() { A(); B(); C(); }`
  - `A -> a` becomes `A() { match('a'); }`
    - `match` is a call to your lexical analyzer
  - operations
    - `check(token)`
    - `advance`
    - `match` = `check` then `advance`

    ```java
    if (check('a')) { // I think DL might have meant `check('s')`
        A(); B(); C();
    } else {
        D();
    }
    ```

- LL parsing
  - recursive descent
    - for every left-hand side of the grammar, write a potentially recursive method
    - `S -> A | xA | ε`

    ```antlr
    # ambiguous precedence
    E -> E + E | E * E | ( E ) | v
    # break it apart for unambiguity
    E -> E + T | T
    T -> T * F | F
    F -> ( E ) | v
    # C does 9 levels of this trick
    ```

  - head the grammar with `S -> E $`, where `$` denotes `EOF`
    - this makes sure everything ends with the end of file
  - defining something like `E -> E + E ...` would mean writing a method which does `E() { E() }`, which is bad: it recurses forever without consuming any input
  - Left Recursion Transformations remove the recursion problem

    ```antlr
    # originally
    A -> A α1 | A α2 | ... | A αN | β
    # transform to
    A -> β A'
    A' -> α1 A' | α2 A' | ... | αN A' | ε
    ```

  - using this transformation on the previous example (a recursive-descent sketch for the result follows below)

    ```antlr
    E -> T E'
    E' -> + T E' | ε
    T -> F T'
    T' -> * F T' | ε
    # unchanged
    F -> ( E ) | v
    ```

  - Left Factoring

    ```antlr
    A -> α β1 | α β2 | γ
    # transforms into
    A -> α A' | γ
    A' -> β1 | β2
    ```

    ```antlr
    A -> B x | B y
    # transforms into
    A -> B A'
    A' -> x | y
    ```
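A minimal recursive-descent sketch for the transformed expression grammar, using the `check`/`advance`/`match` operations above; the token encoding and class name are assumptions for illustration:

```java
class ExprParser {
    // hypothetical token kinds from the lexer
    static final int PLUS = 0, STAR = 1, LPAREN = 2, RPAREN = 3, V = 4, EOF = 5;
    private final int[] tokens;
    private int pos = 0;

    ExprParser(int[] tokens) { this.tokens = tokens; }

    boolean check(int kind) { return tokens[pos] == kind; }
    void advance() { pos++; }
    void match(int kind) { // check, then advance
        if (!check(kind)) throw new RuntimeException("syntax error at " + pos);
        advance();
    }

    void S() { E(); match(EOF); }          // S  -> E $
    void E() { T(); E_(); }                // E  -> T E'
    void E_() {                            // E' -> + T E' | ε
        if (check(PLUS)) { advance(); T(); E_(); }
    }
    void T() { F(); T_(); }                // T  -> F T'
    void T_() {                            // T' -> * F T' | ε
        if (check(STAR)) { advance(); F(); T_(); }
    }
    void F() {                             // F  -> ( E ) | v
        if (check(LPAREN)) { advance(); E(); match(RPAREN); }
        else match(V);
    }
}
```

For example, `new ExprParser(new int[]{V, PLUS, V, STAR, V, EOF}).S()` accepts `v + v * v`.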
- yet another grammar example (a grammar for regular expressions)

  ```antlr
  E : E E
    | E '*'
    | E '|' E
    | '(' E ')'
    | ID
    ;
  ```

  disambiguated:

  ```antlr
  E : E '|' T
    | T
    ;
  T : T F
    | F
    ;
  F : F '*'
    | A
    ;
  A : '(' E ')'
    | ID
    ;
  ```

  with left recursion removed:

  ```antlr
  E : T E_
    ;
  E_ : '|' T E_
     | ε
     ;
  T : F T_
    ;
  T_ : F T_
     | ε
     ;
  F : A F_
    ;
  F_ : '*' F_
     | ε
     ;
  A : '(' E ')'
    | ID
    ;
  ```
- first and follow sets
  - given `A : a | b;`, pick `a` if the next token is in `first(a)`, or if `a` is nullable and the next token is in `Follow(A)`
  - algorithm
    - add `S' -> S $`
    - for `A -> α B β`, add `first(β)` to `Follow(B)`
    - for `A -> α B`, or `A -> α B β` with `β` nullable, add `Follow(A)` to `Follow(B)`
  - first sets for the regex grammar above (headed with `S -> E $`):

    ```
    first(A)  = { (, ID }
    first(F_) = { *, ε }
    first(F)  = { (, ID }
    first(T_) = { (, ID, ε }
    first(T)  = { (, ID }
    first(E_) = { |, ε }
    first(E)  = { (, ID }
    ```
- LL parsers
  - transform the grammar to be LL-parsable
  - find first/follow sets
  - for each lhs `A -> B | C`:
    - for each symbol `x` in `first(B)`, check for `x` and call `B`
    - if `B` is nullable, for each `x` in `Follow(B)`, check and call `B`
    - otherwise, throw errors
  - error recovery
    - add
    - delete
    - replace
- LR parsers

  ```antlr
  S' -> S $
  S -> 'int' L
  L -> L ',' ID
  L -> ID
  ```

  parsing the C declaration `int a, b`:

  - the first thing we see is `'int'`, which is a prefix of a viable right-hand side (`S`), so we push it onto a stack; stack: `int`
  - then we see `a`, which is an `ID` (though we don't know which RHS it goes in), so we push it onto the stack; stack: `int ID`
  - the `ID` on top of the stack matches `L -> ID`, so we reduce; stack: `int L`
  - now we read in `,`, which is part of `L ',' ID`, and push it onto the stack; stack: `int L ,`
  - now we read `b`, which is another `ID`; stack: `int L , ID`
  - we now have `L ',' ID`, which is another valid `L`, so we reduce; stack: `int L`
  - we have `'int' L`, which is a valid `S`, so we reduce; stack: `S`
  - now we read in `$`, so we push it onto the stack; stack: `S $`
  - `S $` is a valid `S'`, so we reduce to `S'`, and we're done!
  - Item Sets:
    - Set 0 core: `S' -> . S $`; closure: `S -> . 'int' L`
    - the two above are together one item set

    | Set | Core | Closure |
    |---|---|---|
    | 0 | `S' -> . S $` (1) | `S -> . 'int' L` (2) |
    | 1 | `S' -> S . $` (acc) | |
    | 2 | `S -> 'int' . L` (3) | `L -> . L ',' ID` (3), `L -> . ID` (4) |
    | 3 | `S -> 'int' L .` (-), `L -> L . ',' ID` (5) | |
    | 4 | `L -> ID .` (-) | |
    | 5 | `L -> L ',' . ID` (6) | |
    | 6 | `L -> L ',' ID .` (-) | |

    the corresponding table (`+n` = shift to state n):

    | set | int | ',' | ID | $ | S | L |
    |---|---|---|---|---|---|---|
    | 0 | +2 | | | | goto 1 | |
    | 1 | | | | acc | | |
    | 2 | | | +4 | | | goto 3 |
    | 3 | | +5 | | `S -> 'int' L` | | |
    | 4 | | `L -> ID` | | `L -> ID` | | |
    | 5 | | | +6 | | | |
    | 6 | | `L -> L ',' ID` | | `L -> L ',' ID` | | |

  - with the states interleaved on the stack, our parse of `int a, b` looks something like (the driver loop for such tables is sketched below):

    ```
    stack: 0
    shift 'int':          stack: 0 'int' 2
    shift ID (a):         stack: 0 'int' 2 ID 4
    reduce L -> ID:       stack: 0 'int' 2 L 3
    shift ',':            stack: 0 'int' 2 L 3 ',' 5
    shift ID (b):         stack: 0 'int' 2 L 3 ',' 5 ID 6
    reduce L -> L ',' ID: stack: 0 'int' 2 L 3
    reduce S -> 'int' L:  stack: 0 S 1
    accept
    ```
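The driver loop that consumes such tables is small. A sketch, with the `Action` encoding and lexer hooks assumed rather than taken from the tables above:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// skeleton of a table-driven LR driver; the tables and lexer hooks here
// are assumed scaffolding, not the exact tables from the trace above
class LrDriver {
    enum Kind { SHIFT, REDUCE, ACCEPT, ERROR }
    static class Action { Kind kind; int target, rhsLength, lhs; }

    Action[][] ACTION;   // ACTION[state][terminal]
    int[][] GOTO;        // GOTO[state][nonterminal]
    int lookahead() { /* ask the lexer */ return 0; }
    void advance()  { /* consume one token */ }

    void parse() {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0); // start state
        while (true) {
            Action a = ACTION[stack.peek()][lookahead()];
            switch (a.kind) {
                case SHIFT:   // push the next state and consume the token
                    stack.push(a.target); advance(); break;
                case REDUCE:  // pop |rhs| states, then push GOTO[state][lhs]
                    for (int i = 0; i < a.rhsLength; i++) stack.pop();
                    stack.push(GOTO[stack.peek()][a.lhs]); break;
                case ACCEPT:
                    return;
                default:
                    throw new RuntimeException("parse error");
            }
        }
    }
}
```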
- parser tools
  - another table, for the grammar:

    ```antlr
    S : E $
      ;
    E : E + E
      | E * E
      | ( E )
      | LIT
      ;
    ```
    item sets (the letter after each item is the state reached on that symbol):

    | label | items |
    |---|---|
    | a | `S : . E $` (b); `E : . E + E` (b); `E : . E * E` (b); `E : . ( E )` (c); `E : . LIT` (d) |
    | b | `S : E . $` (acc); `E : E . + E` (e); `E : E . * E` (f) |
    | c | `E : ( . E )` (g); `E : . E + E` (g); `E : . E * E` (g); `E : . ( E )` (c); `E : . LIT` (d) |
    | d | `E : LIT .` (-) |
    | e | `E : E + . E` (h); `E : . E + E` (h); `E : . E * E` (h); `E : . ( E )` (c); `E : . LIT` (d) |
    | f | `E : E * . E` (i); `E : . E + E` (i); `E : . E * E` (i); `E : . ( E )` (c); `E : . LIT` (d) |
    | g | `E : ( E . )` (j); `E : E . + E` (e); `E : E . * E` (f) |
    | h | `E : E + E .` (-); `E : E . + E` (e); `E : E . * E` (f) |
    | i | `E : E * E .` (-); `E : E . + E` (e); `E : E . * E` (f) |
    | j | `E : ( E ) .` (-) |

    action/goto table (`+n` = shift; since the grammar is ambiguous, states h and i have shift/reduce conflicts on `+` and `*`):

    | | LIT | ( | + | * | ) | $ | S | E |
    |---|---|---|---|---|---|---|---|---|
    | a | +d | +c | | | | | | b |
    | b | | | +e | +f | | acc | | |
    | c | +d | +c | | | | | | g |
    | d | | | `E:LIT` | `E:LIT` | `E:LIT` | `E:LIT` | | |
    | e | +d | +c | | | | | | h |
    | f | +d | +c | | | | | | i |
    | g | | | +e | +f | +j | | | |
    | h | | | +e / `E:E+E` | +f / `E:E+E` | `E:E+E` | `E:E+E` | | |
    | i | | | +e / `E:E*E` | +f / `E:E*E` | `E:E*E` | `E:E*E` | | |
    | j | | | `E:(E)` | `E:(E)` | `E:(E)` | `E:(E)` | | |
Errors

- detect
  - the science part
- report
  - the art part
  - explosion and failure are the 2 sides of the report spectrum
- recovery
  - where things get pretty insane
  - add / remove / replace
  - add phantom productions to your grammar
    - not legal code, but accepted by the parser
  - lookaheads for sync tokens like `;` help escape from errors, because errors tend to be separated by them (a panic-mode sketch follows below)
  - partly for this reason, error messages for spurious sync tokens are almost universally bad

    ```
    $ javac Foo.java
    Foo.java:2: error: illegal start of type
        public { static void main(String[] args) {
               ^
    Foo.java:2: error: ';' expected
        public { static void main(String[] args) {
               ^
    2 errors
    ```
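A sketch of the sync-token idea (panic mode), reusing the hypothetical `check`/`advance` helpers from the recursive-descent sketch; `SEMI` is an assumed token kind:

```java
// panic-mode recovery sketch: on a syntax error, skip tokens until a
// sync token like ';' (or end of input), then resume parsing
void recoverToSemicolon() {
    while (!check(SEMI) && !check(EOF)) {
        advance();              // discard tokens in the broken statement
    }
    if (check(SEMI)) advance(); // consume the sync token, then resume
}
```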
Static Semantics
- checks for things which are syntactically legal, but not allowed
  - type checks
  - declaration before use
  - write before read
  - exception throws
  - unreachable code
  - returns
  - clashes
  - scopes
  - overloads
  - overrides
- need some program representation
  - we have a parser, therefore we have a parse tree
  - we want something better than a parse tree, which isn't too far from it
  - enter the AST
- bottom-up tree building
  - start with the children
  - called synthesized or S-attributed
    - parent nodes are synthesized from their children
  - can be done without ever actually building the tree
    - all compilers built 20+ years ago were bottom-up, as they could not fit the tree in memory
- top-down tree building
  - start from the root
  - called L-attributed
- program representations mainly need to do 2 things
  - bindings
  - constraints
- `klass`
  - `klass super` (or null)
  - is reference?
  - `set<klass> interfaces`
  - attributes (e.g. `abstract`)
  - `String name`
  - `set<var> statics`
  - `set<var> fields`
  - `set<method> methods`
- `Var`
  - `String name` (or `Symbol`)
  - `klass type`
  - `set<bit> attributes` (e.g. `static`, `final`, `constant`)
  - `Value value`
- `method`
  - `String name`
  - `List<Var> params`
  - `Var return-type`
  - `klass owner`
  - `attributes` (`final`, `public`, `private`, etc.)
  - `Block body`
- `Block`
  - `List<Stat> stats`
- `Stat`
  - `ifStat`: `cond`, `then stat`, `else stat?`
  - `assignStat`: `lhs`, `rhs`
  - `declStat`: just like assign but introduces a new variable
  - `callStat`
  - `ExpStat`
- `Exp`
  - `AddExp`, `MulExp`, `NegateExp`
  - sub-categories: nullary, unary, binary, ternary
- we need a mapping of `Symbol` to `Storage`
  - `Symbol` should be a thin veneer over `ID` (a Java sketch of these classes follows below)
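A hedged Java sketch of this representation; the names follow the notes, and the concrete field types are assumptions:

```java
import java.util.List;
import java.util.Set;

// sketch of the program representation above; fields trimmed for brevity
class Klass {
    Klass superKlass;        // or null
    boolean isReference;
    Set<Klass> interfaces;
    Set<String> attributes;  // e.g. "abstract"
    String name;
    Set<Var> statics;
    Set<Var> fields;
    Set<Method> methods;
}

class Var {
    String name;             // or a Symbol
    Klass type;
    Set<String> attributes;  // e.g. "static", "final", "constant"
    Object value;
}

class Method {
    String name;
    List<Var> params;
    Var returnType;
    Klass owner;
    Set<String> attributes;  // "final", "public", "private", ...
    Block body;
}

class Block { List<Stat> stats; }
abstract class Stat {}
class IfStat extends Stat { Exp cond; Stat then; Stat els; /* els may be null */ }
class AssignStat extends Stat { Exp lhs, rhs; }
abstract class Exp {}
```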
Compilation process:
- create
- resolve / install / check
- type-check
- static semantics
Essentially we have a big table, where the columns represent types of nodes, and the rows represent checks we want to do.
| | `Anode` | `Bnode` | … |
|---|---|---|---|
| type-check | x | x | … |
| exception-check | x | x | … |
| eq-hash | x | x | … |
| … | … | … | … |
Two ways to handle this: make a class for each type of node, with a method for each check you want to perform, or make a function for each type of check, with a big switch over the node types.
Another approach is the visitor pattern.
- `analyze(TypeCheck a, Node p)`
- `analyze(ExceptionCheck a, Node p)`

Visitor Pattern (double dispatch):

```java
abstract class Node {
    abstract void accept(Visitor v);
}

interface Visitor {
    void visit(PlusNode p);
    // ... one visit overload per node type
}

class PlusNode extends Node {
    Node left, right;
    void accept(Visitor v) {
        v.visit(this); // double dispatch: dynamic on the node, overload on the visitor
    }
}
```
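Each check then becomes one `Visitor` implementation, with no changes to the node classes. A hypothetical example:

```java
// one check = one visitor; node classes stay untouched as checks are added
class TypeCheckVisitor implements Visitor {
    public void visit(PlusNode p) {
        // e.g. check that both operands of '+' have numeric types
    }
}
// dispatch: node.accept(new TypeCheckVisitor());
```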
- resolution
  - exactly once

    ```java
    class C {
        int a;
        int a; /* illegal */
    }
    ```

  - scoping
    ```java
    class C {
        int a;
        void f(int p) {
            int a = 1;     /* shadows the field: fine */
            {
                int a = 2; /* legal in the notes' language, though Java
                              itself forbids shadowing a local */
                f(a);
            }
            f(a);
        }
    }
    ```

  - shadowing
    ```java
    class C {
        static int a;
    }
    class D extends C {
        static int a; /* shadows C.a */
    }
    ```

  - overloading / overriding
Resolution and Type Checking
- single pass
  - `install` on production
    - look up or add
    - "If it's not there, put it. If it is, use it."
  - side-stack
    - to help with name resolution, when you enter a scope, you push onto a stack
    - context
  - the only languages that allow for single-pass these days are dinosaur languages like C, Fortran, and Algol
- multi pass
  - `install`
    - add it
    - use it
    - error
  - build up the attributes of a class as you parse it
- `lookup(name, arglist, rettype)` (a sketch follows below)
  - search through the current scope for `name`
  - if `name` is not found, recur in the next scope up
  - if the top scope is reached, `lookup` returns `nil`
  - tip: when implementing this for mini-java, start with class declarations, as that's almost no work
- `isAssignable(lhs, rhs)`
  - same type on both sides (good)
  - `lhs` is the same class (or a superclass) of `rhs`
  - coercible (warning?)
- type inference
  - legal scala: `val one = 1`
  - (Damas-)Hindley-Milner type system
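A sketch of `install`/`lookup` over nested scopes as described above, reusing the `Var` sketch from earlier; the `Scope` class itself is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

// nested scopes: install adds to the current scope, lookup walks outward
class Scope {
    final Scope parent;  // null at the top scope
    final Map<String, Var> names = new HashMap<>();

    Scope(Scope parent) { this.parent = parent; }

    void install(String name, Var v) {
        if (names.containsKey(name))           // "exactly once" per scope
            throw new RuntimeException("duplicate declaration: " + name);
        names.put(name, v);
    }

    Var lookup(String name) {
        if (names.containsKey(name)) return names.get(name); // found here
        if (parent == null) return null;       // top scope reached: nil
        return parent.lookup(name);             // recur in the next scope up
    }
}
```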
Macros
- C preprocessor style
  - it's a separate program, so that's the end of the story
- other styles
  - lisp
    - builds a tree node when it finds a macro during compilation
- generics
  - C++ style templates are actually macros
  - Java generics share one compiled copy of the code (erasure; sketched below)
    - treat all types as `Object` at first
    - insert casts to the specific type
    - uses boxing on primitives
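Roughly what javac does under erasure (a sketch):

```java
import java.util.ArrayList;
import java.util.List;

class Erasure {
    public static void main(String[] args) {
        List<Integer> xs = new ArrayList<>();
        xs.add(1);          // boxing: xs.add(Integer.valueOf(1))
        int x = xs.get(0);  // erased: ((Integer) xs.get(0)).intValue()
        System.out.println(x);
    }
}
```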
Locations and Variables
```java
int a = 3;
```

- where does `3` live?
- in recent decades:
  - locals
    - stack frames?
    - registers?
  - statics
    - allocated before program starts
    - do not go away
  - heaps
    - that's where java puts `new` things
  - other
    - scoped memory locations
    - device-level stuff
- where to put things?

  ```java
  int f() {
      /* we're in a function, so obviously it's a local */
      int a = 3;
      /* assume c and d are visible here; the following is actually
         represented as
         int tmp = c + d;
         int b = a + tmp;
      */
      int b = a + c + d;
      return b;
  }
  ```

  ```
  load c r1
  load d r2
  load *something* r3
  add r1 r3 r4
  store r4 tmp
  ```

In memory, every object starts at a certain location, and each of its fields lives at a fixed offset from that location. In most OO languages, the first field is always an object header. That header usually contains a pointer to the class itself, which lives in static memory. The class has a table of its methods, which are just function pointers. This is the bare minimum for the header, but most languages that aren't C++ also keep some extra bookkeeping there.
Array access:
- conceptually, we have to do `origin + (index * width)`
- however, multiplication is expensive
- but wait, all of our widths should be powers of two!
- if width is `4`, since `lg(4) = 2`, the operation becomes `origin + (index << 2)` (sketched below)
- there is usually a specific machine instruction for this, since it is so common
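In Java terms (a trivial sketch of the strength reduction):

```java
class Offsets {
    public static void main(String[] args) {
        int index = 7;
        // width 4 => lg(4) = 2, so index * 4 becomes index << 2
        System.out.println((index << 2) == index * 4); // true
    }
}
```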
Subclassing:
```java
class Point { int x, y; }
class Point3 extends Point { int z; }
```

In memory, `Point` has a header, and then `x` and `y` at certain offsets. `Point3` has to add `z` at an offset after `x` and `y`. This scheme makes multiple inheritance a nightmare: resolving it can only be done by adding layers of indirection.
An example call chain (each call gets its own stack frame for its locals):

```c
main() {
    int a;
    int tmp = f(a);
    int b = tmp;
}
int f(int x) {
    int y = x + 1;
    int z = g(y);
    return z;
}
int g(int x) {
    return x + 1;
}
```

Representing and Generating Code
Representations

- instructions (x86, ARM, etc.)
  - CISC
    - large instruction set
    - `op (mem)++ -> mem`
  - RISC
    - very small instruction set
    - `load mem -> reg`, `store reg -> mem`, `op reg -> reg`, plus misc
- bytecodes
  - machine-independent code which is trivially translatable to machine code
- internal representation
  - some kind of graph
  - eventually transformed into one of the above representations

Depending on the compiler, some or all of these representations may be used sequentially.
JVM

Java class files

- constant pool
  - a big table in the class file of all compile-time constants
    - integer literals (e.g. `13`)
    - string literals (e.g. `"hi mom"`)
    - class names (e.g. `Ljava/lang/Object;`)
  - every constant needs to be referenced by its index in the pool
  - tools exist to do this for you (see the ASM sketch below)
    - Jasmin
      - not used much anymore, not really maintained
    - ASM
      - used extensively in enterprise, well maintained
      - clojure example
    - BCEL
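A minimal ASM sketch (assumes `org.objectweb.asm` on the classpath): generating a `Hello` class whose `main` prints a string. Note that `visitLdcInsn` puts `"hi mom"` into the constant pool and references it by index for us:

```java
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import static org.objectweb.asm.Opcodes.*;

class AsmExample {
    static byte[] generate() {
        ClassWriter cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES);
        cw.visit(V1_8, ACC_PUBLIC, "Hello", null, "java/lang/Object", null);
        MethodVisitor mv = cw.visitMethod(ACC_PUBLIC | ACC_STATIC, "main",
                "([Ljava/lang/String;)V", null, null);
        mv.visitCode();
        mv.visitFieldInsn(GETSTATIC, "java/lang/System", "out",
                "Ljava/io/PrintStream;");
        mv.visitLdcInsn("hi mom"); // string literal goes into the constant pool
        mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println",
                "(Ljava/lang/String;)V", false);
        mv.visitInsn(RETURN);
        mv.visitMaxs(0, 0); // recomputed automatically with COMPUTE_FRAMES
        mv.visitEnd();
        cw.visitEnd();
        return cw.toByteArray();
    }
}
```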
Optimization

- a misnomer
  - a better term is "improvement"
- better coding
  - usually means (expected) faster
  - smaller
  - these can be contradictory
  - internal-only
- constraints
  - equivalent behavior?
  - compatible behavior w.r.t. the spec
- representation
  - bytecode
  - target machine code
  - quads
    - `(add, v1, v2, v3)`
    - `(neg, v1, _, v3)`
    - `(load, addr, _, dest)`
    - `(if, v1, _, something)`
  - basic blocks
- forms
  - (forward) symbolic execution
    - "never do at runtime what you can do at compile time"
  - (backward) liveness analysis
    - dead code removal
  - strength reduction
    - algebraic: `a + 0 -> a`, `2 * x -> x << 1`
    - redundancies: `a = 1; a = 1; -> a = 1;`
  - SSA form
    - every time a variable is altered, instead store it in a new name: `a = 1; a++; -> a = 1; a' = a + 1;`
    - "pseudo-functional programming"
Data Flow

- forward, reverse
- graphs
  - split
  - merge
  - cycles
    - fixed point
- special cases
  - loops
    - hoisting (sketched below)
    - induction variables
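For example, hoisting pulls loop-invariant work out of the loop (a sketch):

```java
class Hoist {
    static void fill(int[] x, int n, int a, int b) {
        // a * b never changes inside the loop, so compute it once
        int t = a * b;                          // hoisted out of the loop
        for (int i = 0; i < n; i++) x[i] = t;   // was: x[i] = a * b
    }
}
```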