Assembler Organization
Two important steps are involved in designing software: dividing the
software into smaller, more manageable components, and determining how
the components will communicate.
For both of these steps, the designer needs to keep the major kinds of
software functionality in mind.
In this web page we will first look at a simple assembler: one that does
not need to expand pseudoinstructions and does not need to deal with the
problems that arise when a program is assembled from multiple assembly
language files.
This gives rise to a design that was typical of early assemblers.
It was especially well-suited for machines with limited memory.
Later, we will look at how an assembler needs to be redesigned to deal
with more complicated requirements.
The resulting organization is closer to that needed for more general
language translation applications such as compilers and web browsers.
Responsibilities
A simple assembler has four major responsibilities.
-
Generate machine code
-
Provide error information for the assembly language programmer
-
Provide machine code information for the assembly language programmer
-
Assign memory for instructions and data
Generate Machine Code
This is the primary purpose of an assembler.
The assembler usually generates a file copy of data and machine
instructions that will later be loaded into a computer in preparation
for execution.
The file copy must also contain a starting address - the address of the
first instruction to be executed.
Assign Memory
Early assemblers forced programmers to assign memory addresses for all
data and keep track of addresses assigned for instructions.
Modern assemblers allow programmers to use symbols (usually statement
labels) to represent addresses for data and for branch or jump targets.
This makes the programmer's life much simpler.
However, addresses are required for machine code generation.
Thus the assembler must pick up the responsibility of assigning addresses
to program symbols.
These assignments must be remembered for use when the symbols appear in
instruction operands.
Assembler Design
Much of the breakdown of an assembler into components is driven by three
major considerations.
-
the need for multiple passes through the code
-
the need for a separate component to deal with low-level character
processing
-
the need for saving information discovered in early passes through the
code for use in later passes
Assembly Passes
The simplest organization for an assembler is a two-pass organization.
The need for this organization arises from consideration of the
assembler functions and the nature of assembly language code.
-
the assembler generates machine code output
-
the assembler assigns memory addresses for program labels
-
assembly language control structures cannot be written without forward
references.
A forward reference occurs when a use of a label in an
operand appears earlier in the code than its definition.
The memory address for a label cannot be assigned until its definition
has been read, but code generation for jumps and branches requires
knowing the address of the jump or branch target.
Thus the assembler makes two passes through the input.
During the first pass, the assembler is assigning memory addresses to
labels.
Machine code is generated during the second pass.
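The two-pass idea can be sketched on a toy language. This is a minimal illustration, not a real assembler: the mnemonics, the one-word-per-statement rule, and the helper names are all assumptions made for the example. Note that `jmp done` on the first line is a forward reference, which is why pass 2 cannot run until pass 1 has seen the whole input.

```python
# Toy source: every statement occupies one word; labels end with ":".
SOURCE = """\
start:  jmp done
loop:   nop
        jmp loop
done:   halt
"""


def split_statement(line):
    """Return (label or None, remaining fields) for one source line."""
    label = None
    fields = line.split()
    if fields and fields[0].endswith(":"):
        label = fields[0][:-1]
        fields = fields[1:]
    return label, fields


def pass1(lines):
    """Assign a word address to every label."""
    symbols, address = {}, 0
    for line in lines:
        label, fields = split_statement(line)
        if label is not None:
            symbols[label] = address
        if fields:
            address += 1
    return symbols


def pass2(lines, symbols):
    """Replace label operands with addresses (a stand-in for real encoding)."""
    output = []
    for line in lines:
        _, fields = split_statement(line)
        if not fields:
            continue
        opcode, operands = fields[0], fields[1:]
        output.append((opcode, [symbols[name] for name in operands]))
    return output


lines = SOURCE.splitlines()
symbols = pass1(lines)
code = pass2(lines, symbols)
```

After both passes, `symbols` maps `done` to word 3, and the first `jmp` carries that address even though `done` is defined last.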
Dealing with Characters
The input to an assembler consists of a stream of characters which
represent assembly information in several different ways: integers, real
numbers, labels, quoted characters and strings, register names, and
various kinds of punctuation.
If character processing is mixed in with algorithms for building symbol
tables or algorithms for code generation, the result is a complex mess
that would challenge even the best programmers.
The problem is that this organization (or disorganization) forces
programmers to deal with two distinct kinds of abstraction
simultaneously.
The result is difficult algorithm development, a large number of errors,
and difficulty in debugging.
These problems are magnified if the assembler requires maintenance at a
later time.
A good general design principle is to assign responsibility for
different kinds of abstractions to different program components.
This principle is crucial for large designs, but is a good practice for
smaller designs as well.
Dividing responsibilities for different kinds of abstractions into
different components allows programmers to focus their attention on one
aspect of the problem at a time.
For an assembler, the implication is that there should be a component
dedicated to handling text at the level of characters.
This component is called a lexical analyser or scanner.
Assembler Components
Thus an assembler has four primary components: a scanner, a symbol
table, a pass 1 component, and a pass 2 component.
These are coordinated by a fifth component, a main program.
Lexical analysis: the scanner
The primary purpose of the scanner component is processing characters into
higher level units that are more meaningful for the pass 1 and pass 2
components.
These units are called tokens.
The process of forming these groups is called lexical analysis.
The need for a scanner arises in most programs that deal with complex
input.
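A scanner of this kind can be sketched with regular expressions. The token kinds and the syntax below (`$` register names, `#` comments, labels ending in `:`) are assumptions borrowed from MIPS-style assembly, not something this page prescribes, and real numbers are left out to keep the sketch short.

```python
import re

# (token kind, pattern) pairs; earlier entries win when several could match.
TOKEN_SPEC = [
    ("COMMENT",  r"#[^\n]*"),
    ("STRING",   r'"(?:[^"\\]|\\.)*"'),
    ("REGISTER", r"\$[A-Za-z0-9]+"),
    ("NUMBER",   r"-?(?:0x[0-9A-Fa-f]+|\d+)"),
    ("LABELDEF", r"[A-Za-z_.][A-Za-z0-9_.]*:"),
    ("NAME",     r"[A-Za-z_.][A-Za-z0-9_.]*"),
    ("PUNCT",    r"[,()]"),
    ("NEWLINE",  r"\n"),
    ("SKIP",     r"[ \t]+"),
    ("ERROR",    r"."),  # any character no other pattern accepts
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})"
                               for kind, pattern in TOKEN_SPEC))


def scan(text):
    """Yield (kind, text) tokens, dropping whitespace and comments."""
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind in ("SKIP", "COMMENT"):
            continue
        yield kind, match.group()


tokens = list(scan("loop: addi $t0, $t0, -1  # count down\n"))
```

The pass components then work with pairs such as `("REGISTER", "$t0")` and never look at individual characters.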
The Symbol Table
An assembler needs to assign memory for its instructions and explicitly
declared data.
An assembly language program contains label definitions that mark the
memory locations.
The labels can be used elsewhere in the program to refer to the marked
locations.
The symbol table is a simple table structure whose entries contain
memory addresses keyed by the program labels.
When the assembler determines the address for a label it adds an entry
to the symbol table.
When it encounters a use of the label it can look up the address in the
symbol table.
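A minimal sketch of such a table, assuming a dictionary is an acceptable underlying structure. The class, method, and exception names are hypothetical. Duplicate and undefined labels are reported as errors because both are mistakes in the assembly source.

```python
class SymbolError(Exception):
    """Raised for duplicate definitions and undefined uses of a label."""


class SymbolTable:
    def __init__(self):
        self._addresses = {}  # label -> memory address

    def define(self, label, address):
        """Record the address of a label the first time it is defined."""
        if label in self._addresses:
            raise SymbolError(f"duplicate label: {label}")
        self._addresses[label] = address

    def lookup(self, label):
        """Return the address recorded for a label."""
        if label not in self._addresses:
            raise SymbolError(f"undefined label: {label}")
        return self._addresses[label]
```

Pass 1 would call `define` at each label definition, and pass 2 would call `lookup` at each use.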
Populating the Symbol Table: Pass 1
During pass 1, the input is read and memory addresses are assigned to
program labels.
Memory is allocated sequentially, so the pass 1 component can track the
next free address with a location counter.
For each input statement, this counter is incremented by the size of
memory allocated.
Whenever a label is encountered, it is recorded in the symbol table.
The address assigned is the value of the location counter at the
beginning of the statement.
Most assemblers need to do some processing of assembler directives to
determine the size of the data involved.
RISC processors typically have fixed instruction lengths, so machine
instructions require very little processing during pass 1.
Some assemblers keep data and instructions in separate regions of memory.
If this is done then two separate location counters are used, one for
data and one for instructions.
The symbol table could be treated as a subcomponent of the pass 1 component.
This choice has little if any effect on the complexity of coding, but a
separately compiled subcomponent does facilitate separate testing.
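The location counter logic can be sketched as follows. The sketch assumes statements have already been split into `(label, mnemonic, operands)` tuples, that instructions are 4 bytes long, and that only three directives exist (`.word`, `.byte`, `.space`); it uses a single counter and ignores alignment. All of these are simplifying assumptions for illustration.

```python
INSTRUCTION_SIZE = 4  # bytes; assumes a fixed-length encoding


def statement_size(mnemonic, operands):
    """Return the number of bytes one statement occupies."""
    if mnemonic is None:  # a label on a line by itself
        return 0
    if mnemonic == ".word":
        return 4 * len(operands)
    if mnemonic == ".byte":
        return len(operands)
    if mnemonic == ".space":
        return int(operands[0], 0)
    if mnemonic.startswith("."):
        raise ValueError(f"unknown directive {mnemonic}")
    return INSTRUCTION_SIZE


def pass1(statements, origin=0):
    """Return (symbols, errors) after assigning an address to each label."""
    symbols, errors = {}, []
    location = origin
    for line_number, (label, mnemonic, operands) in enumerate(statements, 1):
        if label is not None:
            if label in symbols:
                errors.append(f"line {line_number}: duplicate label {label}")
            else:
                # The label gets the counter value at the start of the statement.
                symbols[label] = location
        try:
            location += statement_size(mnemonic, operands)
        except ValueError as error:
            errors.append(f"line {line_number}: {error}")
    return symbols, errors


statements = [
    ("main", "lw", ["$t0", "count"]),
    (None, "addi", ["$t0", "$t0", "1"]),
    ("count", ".word", ["5"]),
    ("buf", ".space", ["10"]),
    ("flags", ".byte", ["1", "2", "3"]),
    ("end", None, []),
]
symbols, errors = pass1(statements)
```

Only the directives need their operands inspected; every machine instruction simply advances the counter by the same amount.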
Generating Output: Pass 2
The primary effort in pass 2 is translating instructions into machine
code.
If the assembler is mixing data and instructions in the same area of
memory then translation of data must be done in pass 2.
For assemblers that use separate areas of memory for data and
instructions, translation of data could be moved into pass 1.
This is somewhat advantageous in that it results in a better balance of
the complexities of the two pass components.
Most of the error reports generated by an assembler are generated during
pass 2.
These reports can be interleaved with assembler listing output so that
the assembly language programmer can readily associate an error report
with the code that caused it.
While pass 2 is running, machine code is saved in a byte array (two
arrays if data and instructions are kept separate).
If there are no errors then at the end of pass 2 each array is written
to a file in binary form.
It can also be displayed in hexadecimal form as a listing for the
assembly language programmer.
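A short sketch of the byte array and its two output forms. It assumes 32-bit instruction words stored in big-endian order; both choices depend on the target machine, and the function names are hypothetical.

```python
import struct


def append_word(image, word):
    """Append one 32-bit word to the byte array (big-endian assumed)."""
    image.extend(struct.pack(">I", word))


def hex_listing(image, origin=0):
    """Return one 'address  word' line per 4 bytes of machine code."""
    lines = []
    for offset in range(0, len(image), 4):
        word = bytes(image[offset:offset + 4])
        lines.append(f"{origin + offset:08x}  {word.hex()}")
    return lines


def write_image(path, image):
    """Write the machine code to a file in binary form."""
    with open(path, "wb") as output:
        output.write(image)


image = bytearray()
append_word(image, 0x0822FFFF)
append_word(image, 0x0C00000C)
listing = hex_listing(image)
```

In this sketch `write_image` would be called only when neither pass reported an error, while the listing can be produced either way.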
For large instruction sets, a table-driven design is useful.
In this approach, instructions are classified according to their operand
types.
This classification information, along with other coding information, is
stored in a table.
The pass 2 component uses the information in the table to determine the
kind of information it needs from the scanner, and the order of that
information.
A table-driven design could also be used for handling directives.
This could be used in pass 1 as well as pass 2.
However, if the number of assembler directives is small then a table is
less important for directives than for machine instructions.
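A table-driven encoder might look like the sketch below. The instruction format is invented for the example (a 6-bit opcode followed by operand fields packed left to right in a 32-bit word), as are the opcodes, register names, and function names; it is not the encoding of any real processor.

```python
REGISTERS = {f"r{number}": number for number in range(32)}

# mnemonic -> (opcode, operand kinds in source order)
INSTRUCTION_TABLE = {
    "add":  (0x01, ("reg", "reg", "reg")),
    "addi": (0x02, ("reg", "reg", "imm")),
    "jmp":  (0x03, ("addr",)),
}

FIELD_WIDTH = {"reg": 5, "imm": 16, "addr": 26}  # widths in bits


def operand_value(kind, text, symbols):
    """Convert one operand to the integer stored in its field."""
    if kind == "reg":
        return REGISTERS[text]
    if kind == "imm":
        return int(text, 0) & 0xFFFF  # two's complement, 16 bits
    return symbols[text] & 0x3FFFFFF  # "addr": symbol table lookup


def encode(mnemonic, operands, symbols):
    """Build a 32-bit word using only the information in the table."""
    opcode, kinds = INSTRUCTION_TABLE[mnemonic]
    if len(operands) != len(kinds):
        raise ValueError(f"{mnemonic} expects {len(kinds)} operands")
    word, bits_used = opcode, 6
    for kind, text in zip(kinds, operands):
        width = FIELD_WIDTH[kind]
        word = (word << width) | operand_value(kind, text, symbols)
        bits_used += width
    return word << (32 - bits_used)  # pad unused low bits with zeros


addi_word = encode("addi", ["r1", "r2", "-1"], {})
jmp_word = encode("jmp", ["done"], {"done": 12})
```

Supporting a new instruction with an existing operand pattern then means adding one table row rather than writing new code.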
Coordinating the Primary Components: The Main Program
The main program in an assembler is not complex.
The primary work is providing file parameters for function calls to the
other components and dealing with errors.
Typically pass 1 returns an error boolean that can be forwarded to pass
2 so that an executable output is not generated when there is an error
in the assembly language source.
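The coordination can be sketched in a few lines. Here the two passes are supplied as functions so the control flow stands alone; the signatures and the tiny stand-in passes are assumptions made for the example.

```python
def assemble(lines, pass1, pass2):
    """Run both passes in order; return machine code, or None on error."""
    symbols = {}  # shared symbol table: filled by pass 1, read by pass 2
    pass1_failed = pass1(lines, symbols)
    # Pass 2 still runs after a pass 1 error so it can report its own
    # errors, but the flag tells it not to produce executable output.
    return pass2(lines, symbols, pass1_failed)


# Stand-in passes, just enough to exercise the control flow.
def demo_pass1(lines, symbols):
    """Record 'label:' prefixes; return True if a label is duplicated."""
    failed = False
    for address, line in enumerate(lines):
        label = line.split(":")[0] if ":" in line else None
        if label in symbols:
            failed = True
        elif label is not None:
            symbols[label] = address
    return failed


def demo_pass2(lines, symbols, suppress_output):
    """Return one zero byte per statement unless output is suppressed."""
    if suppress_output:
        return None
    return bytes(len(lines))


good = assemble(["a: nop", "b: nop"], demo_pass1, demo_pass2)
bad = assemble(["a: nop", "a: nop"], demo_pass1, demo_pass2)
```

Here `good` holds two bytes of output, while `bad` is `None` because the duplicate label found in pass 1 suppressed code generation.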
Communication between Components
The communication between the main program and the two pass components
is quite simple - the main program just calls pass 1 or 2 functions
directing them to do their work in the proper order.
The functions do not return any data except for an error indication.
The communication between the main program and the scanner is also
simple.
The name of the file to be assembled is known directly in the main
program.
The main program either passes the name to the scanner or opens the file
and passes it to the scanner.
The file could also be passed indirectly through the pass 1 and pass 2 components.
The communication between pass 1 and pass 2 is indirect through the
symbol table.
Pass 2 needs to get addresses for labels and values of defined constants
from the symbol table.
Although there is a fair amount of communication, the interface is
simple.
It is a standard table interface.
Complexities
The design described above is suitable for a simple assembly language.
It has the advantage that the only stored information is the symbol
table.
This reduces the amount of memory used, which was crucial on early
computers.
Today, reducing the memory footprint is not as important.
Using more memory to save results of earlier processing can simplify
design.
This is especially important when software evolves to support more
complex functionality.
We will consider one kind of change that is common to many kinds of
applications that deal with language translation: retaining an internal
representation of the structures represented by a language.
The need for this is exemplified by a need for pseudoinstruction
expansion in most modern assembly languages.
Pseudoinstruction expansion results in a need for additional passes
through the assembly language code.
The strategy for handling these additional passes leads to a design that
can be adapted for more complex language translation applications.
Pseudoinstructions
Modern RISC processors have a limited instruction set.
Assemblers augment the instruction set with pseudoinstructions.
These pseudoinstructions can differ from the machine instructions in
at least three ways:
-
new instructions
For example, many assemblers add a load address instruction that
translates into an immediate load instruction with the immediate
operand determined by symbol table lookup.
-
new operand syntax
Most arithmetic instructions on RISC processors have three register
operands.
Most assemblers support using a literal constant as one of the
source operands.
This requires translating the source code instruction into two
instructions:
-
an immediate load instruction to put the literal constant into a
temporary register and
-
an arithmetic instruction with three register operands
-
relaxed restrictions on literal constants
RISC processor instructions have limited size for immediate
operands, typically 16 bits.
To deal with a 32-bit literal constant the assembler must load the
constant into a register in two steps, one for the lower half and
one for the upper half.
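The second and third kinds of expansion can be sketched together. The mnemonics (`li`, `lui`, `ori`), the `$at` temporary register, and the `$zero` register follow MIPS conventions and are assumptions here; other assemblers use different names and sequences.

```python
TEMP_REGISTER = "$at"  # register reserved for the assembler (MIPS convention)


def expand(mnemonic, operands):
    """Return the list of machine instructions for one source instruction."""
    if mnemonic == "li":  # load a literal of up to 32 bits
        target, literal = operands
        value = int(literal, 0) & 0xFFFFFFFF
        upper, lower = value >> 16, value & 0xFFFF
        if upper == 0:  # the literal fits in one 16-bit immediate
            return [("ori", [target, "$zero", hex(lower)])]
        return [("lui", [target, hex(upper)]),           # upper half
                ("ori", [target, target, hex(lower)])]   # lower half
    if mnemonic == "add" and not operands[2].startswith("$"):
        # Literal source operand: load it into the temporary register,
        # then use the three-register form of the instruction.
        target, source, literal = operands
        return expand("li", [TEMP_REGISTER, literal]) + [
            ("add", [target, source, TEMP_REGISTER])]
    return [(mnemonic, operands)]  # already a machine instruction


wide = expand("li", ["$t0", "0x12345678"])
literal_add = expand("add", ["$t0", "$t1", "100"])
```

One source statement can become one, two, or three machine instructions, which is why expansion has to happen before addresses are assigned.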
Strategy
The general strategy for dealing with complexities is to add a structure
for representing partially processed source code.
Each assembler pass is designed to modify this representation in various
ways:
-
A scanner pass tokenizes the source code and builds the initial
representation.
-
A pseudoinstruction expansion pass expands source code instructions
into machine instructions.
-
A memory allocation pass assigns addresses for static data and
instructions, recording the addresses in a symbol table.
-
A code generation pass generates the executable output file, replacing
symbols with addresses from the symbol table.
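The four passes above can be sketched as functions over a shared list of statement records. Everything here is a simplifying assumption: the `Statement` record, the hypothetical `nop2` pseudoinstruction (which expands to two `nop` instructions), the 4-byte statement size, and the rule that a label shares a line with its statement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Statement:
    label: Optional[str]
    mnemonic: str
    operands: list
    address: Optional[int] = None  # filled in by the allocation pass


def scan_pass(text):
    """Tokenize the source and build the initial representation."""
    statements = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        label = None
        if fields and fields[0].endswith(":"):
            label, fields = fields[0][:-1], fields[1:]
        if fields:
            statements.append(Statement(label, fields[0], fields[1:]))
    return statements


def expand_pass(statements):
    """Replace each pseudoinstruction with machine instructions."""
    expanded = []
    for statement in statements:
        if statement.mnemonic == "nop2":  # hypothetical pseudoinstruction
            expanded.append(Statement(statement.label, "nop", []))
            expanded.append(Statement(None, "nop", []))
        else:
            expanded.append(statement)
    return expanded


def allocate_pass(statements, size=4):
    """Assign addresses and return the symbol table."""
    symbols = {}
    for index, statement in enumerate(statements):
        statement.address = index * size
        if statement.label is not None:
            symbols[statement.label] = statement.address
    return symbols


def generate_pass(statements, symbols):
    """Replace symbols with addresses (a stand-in for real encoding)."""
    return [(s.address, s.mnemonic, [symbols.get(op, op) for op in s.operands])
            for s in statements]


program = expand_pass(scan_pass("start: nop2\n jmp end\nend: halt\n"))
symbols = allocate_pass(program)
output = generate_pass(program, symbols)
```

Because the statement list is kept in memory, no pass has to reread or rescan the source text.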
Adaptation
The multiple pass organization is used by most modern compilers.
Most high-level language compilers use a parse tree as the structure for
representing partially processed code.
The parse tree is not only useful in compilers.
Modern integrated development environments (IDEs) also build a parse
tree to represent the source code.
The parse tree can be used for more than just code generation.
Modern IDEs can also use a parse tree for automatic formatting of source
code.
Parse trees are usually designed to support a Visitor design pattern.
This lets designers add new kinds of functionality without having to
modify the parse tree code.
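A minimal sketch of the Visitor pattern on a tiny expression tree. The node and visitor classes are hypothetical, and the tree is far smaller than a real parse tree. The node classes are written once, and each new kind of functionality is a new visitor class.

```python
class Num:
    def __init__(self, value):
        self.value = value

    def accept(self, visitor):
        return visitor.visit_num(self)


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def accept(self, visitor):
        return visitor.visit_add(self)


class Evaluator:
    """One kind of functionality: compute the value of the expression."""

    def visit_num(self, node):
        return node.value

    def visit_add(self, node):
        return node.left.accept(self) + node.right.accept(self)


class Formatter:
    """Another kind, added without touching Num or Add: pretty-printing."""

    def visit_num(self, node):
        return str(node.value)

    def visit_add(self, node):
        return f"({node.left.accept(self)} + {node.right.accept(self)})"


tree = Add(Num(1), Add(Num(2), Num(3)))
value = tree.accept(Evaluator())
text = tree.accept(Formatter())
```

The same tree yields a number from one visitor and formatted source text from the other, which mirrors how an IDE can reuse a parse tree for formatting.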