All information that we can communicate to other people is coded in some way. This is readily apparent when you consider our concepts of the physical world. To someone in France, our "two" is coded as "deux". Here we have two different codings for the same number. We have a single idea, but two different words for expressing it. One person can even use two codings for the same idea: "2" and "two" are both common English codes for the same idea.
Moreover, we use two different common media for coding. All of our words have both a written and a spoken version. We have a graphical system for coding information as visual images, and a phonetic system for coding information as sounds.
Like humans, computers need to have coding schemes for representing information. The following subsections describe some coding schemes for computer information.
Boolean data has two possible values: true or false. In some contexts, you can also think of boolean data as having values yes or no.
For computers, boolean data is the fundamental type of data because it is so easy to code. Most computer circuits just use two different voltage levels, one for true and one for false. Since it is so important for computers, a unit of boolean data has a name: it is called a bit.
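As a small illustration (in Java, the language this article turns to later; the class and variable names are invented just for the example), a boolean variable holds exactly one of these two values:

    public class BitDemo {
        public static void main(String[] args) {
            boolean lightIsOn = true;    // one bit of information: true or false
            boolean doorIsOpen = false;  // a second, independent bit

            // A boolean value can drive a yes/no decision directly.
            if (lightIsOn) {
                System.out.println("The light is on.");
            }
            System.out.println("Door open? " + doorIsOpen);
        }
    }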
Characters in a computer are coded as a group of bits. Over the history of computing, several different schemes have been used. Most of these schemes died out over time, so that by the early 1990s there was only one major survivor, called ASCII coding. ASCII character coding used a group of 8 bits to code a character. With 8 bits, each true or false, you can have 2^8 = 256 distinct possible values. This is more than enough to have a separate code for each upper and lower case English letter, each digit, and each common punctuation mark. Since the ASCII code was used so widely, the 8 bit grouping got its own name: it is called a byte.
More recently, software developers have made an effort to deal with alphabets other than the English alphabet. This has led to a new character coding scheme called Unicode. The Java programming language was the first major language to use Unicode for character coding.
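As a rough Java sketch of the idea (Java codes characters in Unicode, which agrees with ASCII for common English characters; the class name below is just for the example), casting a char to an int reveals the numeric code behind a character:

    public class CharCodes {
        public static void main(String[] args) {
            char letter = 'A';
            int code = letter;    // the numeric code behind 'A' is 65
            System.out.println("'" + letter + "' is stored as code " + code);

            char digit = '7';     // the character '7', not the number 7
            System.out.println("'" + digit + "' is stored as code " + (int) digit);  // 55
        }
    }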
String data refers to sequences of characters. Computers make use of the same idea that we use in written text: just put character codes into a sequence of memory locations.
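A small Java sketch of this idea (names are illustrative only): looping over a string shows the individual character codes sitting in sequence.

    public class StringCodes {
        public static void main(String[] args) {
            String word = "Hi!";
            // Each position in the string holds one character code.
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                System.out.println("position " + i + ": '" + c + "' has code " + (int) c);
            }
        }
    }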
From the beginning, computers have dealt with whole numbers, or integers, as an important part of their work. Except for the very earliest days of computing, a binary coding scheme has been used for dealing with integers. This scheme is similar to our decimal coding except that it is based on powers of 2 instead of powers of 10. In binary coding, each binary "digit" has two possible values, 0 or 1. These binary digits play roles similar to our decimal digits 0 through 9, each being a multiplier for some power of 2.
There are two powerful reasons for using binary coding for integers. First, since each binary digit has two possible values, we can code each binary digit as a bit. In fact, the name "bit" originated as an abbreviation of "binary digit". The second reason is that the circuitry for doing binary arithmetic is much simpler and faster than circuitry for doing decimal arithmetic.
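The place-value idea can be sketched in a few lines of Java: each binary digit multiplies a power of 2, just as each decimal digit multiplies a power of 10. (The digits and names below are chosen only for illustration; binary 1101 is decimal 13.)

    public class BinaryPlaceValue {
        public static void main(String[] args) {
            // The binary digits of 1101, listed from the lowest power of 2 upward.
            int[] binaryDigits = {1, 0, 1, 1};   // 1*1 + 0*2 + 1*4 + 1*8

            int value = 0;
            int powerOfTwo = 1;                  // 2^0
            for (int digit : binaryDigits) {
                value += digit * powerOfTwo;
                powerOfTwo *= 2;                 // move to the next power of 2
            }
            System.out.println("binary 1101 = decimal " + value);  // prints 13
        }
    }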
Just as humans use different ways of coding integers, so do computers. Part of the need for this arises from the fact that computers must be able to deal with human coding in order to be useful. When we send integer data to a program, we are sending in a string of decimal digits. When we look at computer integer output, we expect to see a string of decimal digits. In modern computers, there is no built-in capability for converting between strings of decimal digits and binary code, but it is easy to do the conversion using basic arithmetic. Compilers can automatically insert code for doing the conversion where it is needed.
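In Java, for example, this conversion is available in both directions through standard library calls; a minimal sketch:

    public class DecimalBinary {
        public static void main(String[] args) {
            // Input arrives as a string of decimal digits...
            String input = "1999";
            int n = Integer.parseInt(input);   // ...and is converted to the internal binary code

            // The internal binary coding can be displayed directly:
            System.out.println(n + " in binary is " + Integer.toBinaryString(n));

            // For output, the binary code is converted back to a string of decimal digits:
            System.out.println("back to decimal digits: " + Integer.toString(n));
        }
    }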
Another reason for using different codings is dealing with numbers of different sizes. The size of the largest integer that you can represent depends on the number of bits that you use in the binary coding. To simplify the arithmetic circuitry, computers are designed to work with a small number of fixed size binary codings. The most common size in use today is 4 bytes or 32 bits. This size gives you the capability of dealing with any 9 digit integer. Most modern computers can also handle 8 byte (64 bit) integers. This size gives you the capability of dealing with any 18 digit integer.
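In Java, for instance, the 4 byte int type and the 8 byte long type have the following limits, which is why any 9 digit integer fits in an int and any 18 digit integer fits in a long:

    public class IntegerSizes {
        public static void main(String[] args) {
            // 4 byte (32 bit) integers: roughly plus or minus 2.1 billion
            System.out.println("largest int:  " + Integer.MAX_VALUE);   // 2147483647

            // 8 byte (64 bit) integers: roughly plus or minus 9.2 * 10^18
            System.out.println("largest long: " + Long.MAX_VALUE);      // 9223372036854775807
        }
    }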
In early computers there were numerous ways of coding decimal numbers (numbers with fractional parts), which created problems when there was a need for transferring data from one machine to another. In 1985, the Institute of Electrical and Electronics Engineers (IEEE) published a standard, IEEE 754, with the aim of standardizing the coding of these numbers. By the mid 1990s this standard was in use in all of the major commercial computers.
The IEEE 754 standard defines several coding schemes with varying precision. All are based on a binary version of scientific notation. The two most widely used are single precision numbers with a 4 byte (32 bit) code and double precision with an 8 byte (64 bit) code.
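Java's float and double types use these two codings; a brief sketch of the difference in precision (roughly 7 significant decimal digits for single precision versus 15 to 16 for double precision):

    public class FloatingPoint {
        public static void main(String[] args) {
            float  single = 1.0f / 3.0f;   // single precision: 4 bytes
            double dbl    = 1.0  / 3.0;    // double precision: 8 bytes

            System.out.println("float  1/3 = " + single);   // 0.33333334
            System.out.println("double 1/3 = " + dbl);      // 0.3333333333333333
        }
    }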
Machine instructions are the basic operations that can be performed by a computer processor. Most processors today support only a small set of simple machine instructions that can be combined in various ways to perform more complex operations. For example, with a single instruction you can add two numbers or multiply two numbers. Combined with instructions to support repetitive operations, these instructions can be used to do complex processing of tabular information. There are also machine instructions that can cause the processor to execute or not execute a sequence of instructions, based on a boolean condition. These instructions allow a program to take different actions depending on the values of the data that it is processing.
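A rough Java sketch of the idea (the data and names are invented for the example): a loop and a comparison combine a handful of simple operations, add, compare, branch, into more complex processing of a table of numbers.

    public class SimpleInstructions {
        public static void main(String[] args) {
            int[] table = {4, -2, 7, 0, 9};

            int sum = 0;
            for (int value : table) {    // repetition: the same short instruction sequence runs again and again
                if (value > 0) {         // a boolean condition decides whether the next step executes
                    sum = sum + value;   // a single add operation
                }
            }
            System.out.println("sum of the positive entries: " + sum);  // 20
        }
    }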
There are several popular families of computer processors in use today. Among the most popular are Sun Microsystems' SPARC family, Intel's Pentium family, and the Motorola-IBM PowerPC family. Each of these families uses a different scheme of machine instruction coding. The modern philosophy of computer processor design dictates that machine instruction coding be chosen to make the processor run as fast as possible. Translation technology, described in the next section, has advanced to the point that there is little need to standardize the machine language.
Modern programming languages attempt to give programmers the capability of doing complex things with a computer, while writing instructions for the computer in a language close to their own natural language. For example, most programmers are comfortable enough with standard mathematical language to use expressions such as 1/3*l*w*h (the volume of a pyramid with base length l, base width w, and height h). They do not care to know the sequence of machine instructions that is needed to evaluate this expression, much less the machine coding for those instructions. Also, when programmers want to specify a numerical value, they prefer to specify it as a string of decimal digits rather than machine binary code. Furthermore, a programmer has no interest in the character coding used for the decimal digits. Thus programming languages require powerful mechanisms for translating the language of a programmer into a language that the machine understands.
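For instance, the pyramid volume expression could be written in Java roughly as follows (the method and variable names are just for the example); the compiler supplies the machine instructions that actually evaluate it. Note the 1.0/3.0: in Java, the integer expression 1/3 would evaluate to 0.

    public class PyramidVolume {
        // The programmer writes the formula in near-mathematical notation;
        // the compiler generates the machine instructions that evaluate it.
        static double pyramidVolume(double l, double w, double h) {
            return 1.0 / 3.0 * l * w * h;
        }

        public static void main(String[] args) {
            System.out.println(pyramidVolume(3.0, 4.0, 6.0));   // 24.0
        }
    }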
Historically, there have been two primary approaches to the problem of translating programming languages into machine language: compilers and interpreters. These two approaches can also be combined into a two-step approach using virtual machines. These approaches, and an optimization technique, are described in the following subsections.
The distinguishing characteristic of a compiler is that it is used once to translate a program into machine language. The machine language is saved in a file so that it can be executed as many times as you want without having to be translated again. This allows program execution at the full speed of the computer processor by eliminating the translation overhead. The drawback is that the translated program can only run on the family of processors for which the compiler was designed.
The distinguishing characteristic of an interpreter is that it translates a program into machine language every time that the program is executed. There is no saved translation. If an interpreter is written for each machine then you can move the program from one machine to another and still expect it to execute correctly. It is important to note that even though you need a different interpreter on each machine, the interpreter is the only software that is not portable. Programs written in the language for which the interpreter is designed are portable.
The drawback of interpreters is that the program is translated into machine instructions every time it is executed. This can slow down execution by a factor of a few hundred.
A virtual machine is a machine that is implemented in software rather than hardware. The idea is to introduce an artificial machine language that is close to the language of real processors. Then the translation overhead for converting from the artificial (or virtual) machine language to a real machine language can be reduced significantly.
With a virtual machine, program translation is a two-step process. First, a program is compiled into the language of the virtual machine. When the program is executed, the virtual machine language is interpreted into the language of the real machine. This gives you the portability of an interpreter with significantly reduced execution overhead. With a virtual machine interpreter, the execution slowdown can be reduced to a factor of 5 to 25, depending on the nature of the program.
The only non-portable software in a programming language system that uses a virtual machine is the virtual machine interpreter. It is considerably simpler than an interpreter that translates directly from the programming language to real machine language, because the virtual machine language is already close to real machine language. The compiler from the programming language to the virtual machine language can itself be compiled to virtual machine language, so that it is portable to any machine with a virtual machine interpreter.
Historically, the idea of a virtual machine has been around for about 30 years. In the 1970s, the Pascal programming language used virtual machines on some platforms to allow easy porting of Pascal programs from one machine to another. One virtual machine language, called P-code, was used to support development of Pascal programs on early microprocessors. In addition to the portability advantage, P-code was exceptionally compact, an important consideration for processors at that time.
Portability considerations are a powerful motivation for using interpreters. However, even with a virtual machine, the translation overhead for interpretation may be too much for some programs. This has led to a lot of research on how to reduce the overhead. Within the past few years, one idea has moved from the realm of research to the practical realm: Just-In-Time (JIT) compilation.
A JIT compiler is a component of an interpreter, compiling virtual machine language (usually) into real machine language. Unlike a normal compiler, it does not save the compilation results in a file. Instead, they are saved in memory only for the duration of a program's execution. When a portion of code is executed for the first time, it is translated to real machine language, just in time to be executed. The next time that portion of code is executed, the saved machine language instructions are executed directly.
You can get a pretty good idea of why JIT compilation is so effective by considering the speed of modern processors and the size of the programs that they execute. Recent processors are capable of executing about a billion instructions per second. A large program could contain 10 million instructions. This means that in order to execute for one second, each instruction must be executed 100 times on the average. Generally, programs that take a reasonable time to develop can only have significant run time problems if they are executing some portions of their code a large number of times.
With JIT compilation, you only have translation overhead the first time a portion of code is executed. After that, you are executing the code at full speed. When you average the run time over all of the times that the code portion is executed, the overhead is just a small percentage.
The Java programming language was designed so that programs could be incorporated into web pages. In Java this idea is supported by the applet concept. An applet is a program that executes from within a browser.
To be useful, an applet should be able to run on any machine, regardless of its processor or operating system. Towards this goal, Java was designed to use a virtual machine, which is incorporated into the browser. Most sophisticated browsers, including Netscape Communicator and Microsoft Internet Explorer, have incorporated the Java virtual machine to support Java applets.
Early Java programs were slower than programs written in compiled languages due to the interpreter overhead. In the late 1990s, Java virtual machines began incorporating Just-In-Time compilation technology. Today, virtual machines with this technology are available on all major platforms. For most programs, there is very little difference between the speed of a Java version and a version in a simple compiled language.
With speed issues handled by its sophisticated translation model, Java has seen its popularity driven by other factors. Its simplified language and extensive standard (that means free) class library have made it one of the most popular program development languages.
Even when other factors dictate a different language, the Java virtual machine can be used. For example, the Smalltalk and Eiffel languages both have compilers that compile into Java's virtual machine byte code. This allows programmers to use Smalltalk and Eiffel for the development of applets. Your browser doesn't care what language the applet was originally written in as long as it gets byte code to work with.