History leading to threaded codeA straightforward implementation approach is to translate a program to machine code. The resulting code is typically fast but nonportable and relatively large. Another approach is to translate a program to a virtual machine instruction set. A historically common approach is a bytecode, using 8-bit opcodes and, often, a stack-based virtual machine. The program is translated into the virtual machine instruction set; then that instruction set is interpreted. A typical interpreter is called a "decode and dispatch" interpreter, and follows the form bytecode: top: pushA: pushB: add: 0 /*pushA*/ i = decode(vpc++) *sp++ = A *sp++ = B *sp++ = *--sp + *--sp 1 /*pushB*/ addr = table[i] jump top jump top jump top 2 /*add*/ jump *addr If the virtual machine uses only byte-size instructions, As long as the individual operations are simple, such as "push" and "add", the overhead of figuring out what to execute is larger than the actual cost of executing it, so such interpreters are often much slower than machine code. Early computers had relatively little memory. For example, most Data General Nova, IBM 1130, and many Apple II computers had only 4 K words of RAM installed. Consequently a lot of time was spent trying to find ways to reduce the size of programs so they would fit in the memory available. At the same time, computers were slow, so simple interpretation was very visibly slower than executing machine code. Instead of writing out every step of an operation in every part of the program where it was needed, programmers saved memory by writing out each step of such operations exactly once (see "Don't repeat yourself") and placing it in a subroutine. This process — code refactoring — is still commonly used today by all programmers, although today they do it for different reasons. The top-level application usually consists of nothing but subroutine calls. And many of those subroutines, in turn, also consist of nothing but subroutine calls. Some early computers such as the RCA 1802 required several instructions to call a subroutine. In the top-level application and in many subroutines, that sequence is repeated over and over again, only the subroutine address changing from one call to the next. Using expensive memory to store the subroutine-call instructions over and over again seemed wasteful. Threaded code was invented to save this space. Threaded codeTo save space, programmers squeezed those lists of subroutine calls into simple lists of subroutine addresses (leaving out the "call" instruction on each one), and used a small interpreter (later called a virtual machine) to call each subroutine in turn. This is called direct threaded code (DTC). Although the technique is older, the first widely circulated use of the term "threaded code" is probably Bell's article "Threaded Code" from 1973.2 Charles H. Moore invented an even more compact notation in 1970 for his Forth virtual machine: indirect threaded code (ITC). Originally, Moore invented this because it was easy and fast on NOVA minicomputers, which have an indirection bit in every address. He said (in published remarks, Byte Magazine's Forth Issue) that he found it so convenient that he propagated it into all later Forth designs. Some Forth compilers compile Forth programs into direct-threaded code, while others make indirect-threaded code. The programs act the same either way. Threading modelsPractically all executable threaded code uses one or another of these methods for invoking subroutines (each method is called a "threading model"). Indirect threadingThe list of addresses in code point to an address at the start of a data area. The address at the start of the data area points at machine code to use the data. This "extra" pointer in indirect threading serves as an "executable data type" to define custom interpreters for data. The address at the start of each data area points to the code shared by all data areas of that type. More compact, yet slightly slower than direct-threaded code. Faster than byte code. No type interpretation is usually needed. Older Forth systems produced indirect-threaded code. As example, if the goal is to execute "push A, push B, add", the following might be used. Here,
thread: i_pushA: push: add:
&i_pushA &push *sp++ = *(*tp + 1) *sp++ = *--sp + *--sp
&i_pushB &A jump *(*tp++) jump *(*tp++)
&i_add i_pushB:
&push
&A
i_add:
&add
Direct threadingAddresses in the thread are the addresses of machine language. This is a compromise between speed and space. The indirect data pointer is lost, at some loss in the language's flexibility, and this may need to be corrected by a type tag in the data areas, with an auxiliary table. Some Forth systems produce direct-threaded code. On many machines direct-threading is faster than subroutine threading (see reference below). As example, a stack machine might execute the sequence "push A, push B, add". That might be translated to the following thread and routines, where thread: pushA: *sp++ = A pushB: *sp++ = B add: *sp++ = *--sp + *--sp &pushA jump *tp++ jump *tp++ jump *tp++ &pushB &add ... Alternatively, operands may be included in the thread: thread: push: *sp++ = *tp++ add: *sp++ = *--sp + *--sp &push jump *tp++ jump *tp++ &A &push &B &add Subroutine threadingSo-called "subroutine-threaded code" (also "call-threaded code") consists of a series of machine-language "call" instructions (or addresses of functions to "call", as opposed to direct threading's use of "jump"). Early compilers for ALGOL, Fortran, Cobol and some Forth systems often produced subroutine-threaded code. The code in many of these systems operated on a last-in-first-out (LIFO) stack of operands, which had well-developed compiler theory. Most modern processors have special hardware support for subroutine "call" and "return" instructions, so the overhead of one extra machine instruction per dispatch is somewhat diminished; but according to measurements by Anton Ertl, "in contrast to popular myths, subroutine threading is usually slower than direct threading."3 Ertl's most recent tests show that direct threading is the fastest threading model on Xeon, Opteron, and Athlon processors; indirect threading is the fastest threading model on Pentium M processors; and subroutine threading is the fastest threading model on Pentium 4, Pentium III, and PPC processors. As an example of call threading "push A, push B, add": thread: pushA: pushB: add: call pushA *sp++ = A *sp++ = B *sp++ = *--sp + *--sp call pushB ret ret ret call add Token threadingToken threaded code uses lists of 8 or 12-bit indexes to a table of pointers. Token threaded code is notably compact, without much special effort by a programmer. It is usually half to three-fourths the size of other threaded-codes, which are themselves a quarter to an eighth the size of compiled code. The table's pointers can either be indirect or direct. Some Forth compilers produce token threaded code. Some programmers consider the "p-code" generated by some Pascal compilers, as well as the byte codes used by .NET, Java, Basic and some C compilers to be token-threading. Huffman threadingHuffman threaded code consists of lists of Huffman codes. A Huffman code is a variable length bit string used to identify a unique item. A Huffman-threaded interpreter locates subroutines using an index table or tree of pointers that can be navigated by the Huffman code. Huffman threaded code is one of the most compact representations known for a computer program. Basically the index and codes are organized by measuring the frequency that each subroutine occurs in the code. Frequent calls are given the shortest codes. Operations with approximately equal frequencies are given codes with nearly equal bit-lengths. Most Huffman-threaded systems have been implemented as direct-threaded Forth systems, and used to pack large amounts of slow-running code into small, cheap microcontrollers. Most published uses have been in toys, calculators or watches. Lesser used threading
Common amenitiesSeparating the data and return stacks in a machine eliminates a great deal of stack management code, substantially reducing the size of the threaded code. The dual-stack principle was originated three times independently: for Burroughs large systems, Forth and PostScript, and is used in some Java virtual machines. Three registers are often present in a threaded virtual machine. Another one exists for passing data between subroutines ('words'). These are:
Often, threaded virtual machines such as implementations of Forth have a simple virtual machine at heart, consisting of three primitives. Those are:
In an indirect-threaded virtual machine, the one given here, the operations are: next: (ip)+ -> w ; jmp (w)+ nest: ip -> -(rp) ; w -> ip ; next unnest: (rp)+ -> ip ; next This is perhaps the simplest and fastest interpreter or virtual machine. See also
References
Further reading
| |