[Web Editor's Note: Article is continued from here.]
The output of the compiler
The compiled application has a very simple format. The bytecode starts with .P10
representing the version of the script, 4 bytes follow as reserved, 4 more bytes representing the number of global variables and 4 more bytes to indicate the start of the string table of the application (or, depending on how we view it, the size of the compiled bytecode).
The constant string table, which is populated by the compiler upon encountering a string in the source, is to be found in the generated code after the application bytes. The string representation in this part always starts with an 8-bit value, representing the length of the string followed by the ASCII characters themselves.
This output is then read and interpreted by the virtual machine, which sets up its structures and starts executing the application code from the first application bytecode.
Adding a new keyword to the language
In its current incarnation, the script is quite … primitive, and lacks support for most of the keywords and provided functionality we are used to from other scripting languages. But extending the script is actually quite easy. When introducing a new keyword, we will need to provide the parsing implementation and the compilation ourselves. Let’s consider implementing the while
keyword as an exercise.
We can get some inspiration from the if
keyword, since the difference between while
and if
is just a jump back to testing again the trueness of an expression, so let’s start and declare the header file for while
in the keywords directory of the compiler, with the content in Listing 5 (I have omitted things like include guard and base class includes).
namespace primal { class kw_while : public sequence, public keyword { public: static constexpr const char* N= "WHILE"; explicit kw_while(source>amp; src) : sequence(src) {} sequence::prepared_type prepare(std::vector<token>>amp; tokens) override; bool compile(compiler* c) override; std::string name() override { return N; } private: std::vector<std::shared_ptr<sequence>> m_while_body; }; } |
Listing 5 |
And some explanation:
- A keyword must inherit from sequence in order to have access to the
prepare
andcompile
methods that actually perform the parsing of the keyword and the compilation of the keyword. - A keyword must inherit from the
keyword
class in order to be included in the keyword store of the application. - The
kw_while::N
must have the name of the keyword as present in the scripting language’s syntax in uppercase. At some point it is used to compare in a cases insensitive comparison, which means the scripting language itself is case insensitive. Just like Pascal. - Since this keyword will have a body, we need a list of sequences that will represent the body of it, hence the
m_while_body
.
Now the implementation of the while
keyword can commence with the parsing of the new syntax (see Listing 6).
sequence::prepared_type kw_while::prepare(std::vector<token> >amp;tokens) { if(tokens.empty()) { return sequence::prepared_type::PT_INVALID; } parser p; std::string last; auto seqs = p.parse(m_src,[>amp;](std::string s) { return util::to_upper(s) == kw_end::N; }, last); m_while_body = std::get<0>(seqs); return sequence::prepared_type::PT_NORMAL; } |
Listing 6 |
The prepare
of the while
keyword is very similar to the prepare
of the if
keyword, with the difference being that since we have no then
in the while
, that part is missing. Parsing out the body of the while
is identical to the if
, ie. read and parse lines till we encounter the END
keyword. At this level, you don’t have to worry about the expression of the while
(like in: while expression
) as parsing it is taken care of by the layer calling it; however, if you wish to inspect it, feel free to use and consume the tokens
parameter. If you wish to notify the parsing layer not to try to parse the expressions after your keyword has done its work on it, simply return sequence::prepared_type::PT_CONSUMED
instead of PT_NORMAL
as we do here.
And the compilation of the while
keyword is shown in Listing 7.
bool kw_while::compile(compiler* c) { label lbl_while = label::create(c->get_source()); (*c->generator()) << declare_label(lbl_while); sequence::compile(c); label lbl_after_while = label::create(c->get_source()); label lbl_while_body = label::create(c->get_source()); comp* comparator = dynamic_cast<comp*> (operators[m_root->data.data()].get()); if(!comparator) { throw syntax_error("Invalid WHILE statement condition. Nothing to compare"); } (*c->generator()) << comparator->jump << lbl_while_body; (*c->generator()) << DJMP() << lbl_after_while; (*c->generator()) << declare_label(lbl_while_body); for(const auto>amp; seq : m_while_body) { seq->compile(c); } (*c->generator()) << DJMP() << lbl_while; (*c->generator()) << declare_label(lbl_after_while); return false; } |
Listing 7 |
The compilation begins with declaring a label to the location where the evaluation of the while
’s expression will be performed. Then we do the actual compilation of the expression via sequence::compile
. The rest till (*c->generator()) << DJMP() << lbl_while;
is actually the looping in the while
loop itself, and is identical to the if
keyword, as presented above.
Now, only one step remains: writing a unit test for it. The project uses Catch2 as its unit test framework, so into the tests.cpp found in the tests directory add the contents of Listing 8.
TEST_CASE("Compiler compiles, while test", "[compiler]") { auto c = primal::compiler::create(); c->compile(R"code( var a,b let a = 5 let b = 0 while a > 0 let a = a - 1 let b = b + 1 end )code" ); auto vm = primal::vm::create(); REQUIRE(vm->run(c->bytecode())); REQUIRE(vm->get_mem(0) == 0); REQUIRE(vm->get_mem(word_size) == 5); } |
Listing 8 |
The virtual machine
The general design and architecture of the virtual machine is loosely based on the workings of a Turing machine: it relentlessly behaves in a similar manner to a computer adhering to the von Neumann architecture, it feels more like a real mode x86 machine thus suffering from all its deficiencies and efficiencies, it has a dedicated assembly level programming language and it contains the following important components:
- A set of 256 registers, integers, 32 bit by default represented in the source code by the type
word_t
found in numeric_decl.h.- Reg 255 is reserved for the stack pointer
- Reg 254 is frequently used by the compiler
- Reg 253 holds the value of the LOF flag (see below)
- Reg 252 is initialized to the start of the stack segment
- Reg 251 is initialized to the entry point of the application upon the start of the application
- Reg 250 is reserved for future endeavours
All other registers are freely available for you to play with.
- A flag (called LOF – Last Operation Flag) which is set to a non-zero value by the last operation if the value after the operation evaluated to non-zero and is cleared to zero by the first conditional jump command.
- A code segment which is the application’s compiled byte-code. The size of this is flexible and depends on the application compiled. This is a read-only section and its content populated upon start-up by the virtual machine ‘loader’. For the sake of completeness, we have to mention that the virtual machine places the code segment in memory after the memory segments 1MB limit, continuously.
The machine has an Instruction Pointer to track the next instruction that needs to be executed. Initially, this is placed on the very first byte-code, and on stepping through the code, this can increase or decrease. When the virtual machine encounters errors in byte-code being executed, it issues a PANIC call and exits.
- A free-memory segment, which is used by the virtual machine and the code of the programmer. Usually it’s 1MB in size; however, it can be modified by specifying the size in the hal.h header. At this moment, only assembly level access (r/w) is granted to this area via the
MOV
andCOPY
instructions, since the compiled script makes no use of it. The virtual machine will allocate this memory segment on the heap of the real machine.- The memory can be accessed by loading the desired address value into a register and then performing the required memory access operation: either a full 32 bit value will be loaded from the memory into the required register either a byte value will be moved in the required register target.
- A memory area (usually beginning after the end of the global variables) is designated to be functioning as a stack which can be controlled via the
PUSH
/POP
/MOV
commands. The assembly instructionsCALL
andRET
are also utilizing the stack,CALL
pushes the address where the code should continue when aRET
was issued.POP
orRET
in case of the stack pointer being 0 will result inPANIC
.There is a Stack Pointer (same variable type as the registers, referenced in register with index 255) that is used to access the stack. Upon a push, the stack pointer increases its value, and upon pop, it decreases it. The assembler syntax for the stack pointer is:
$sp
- An important compatibility note: due to the way the direct jump addresses are calculated, both the compiler and vm need to be compiled using the same memory size.
- A set of 256 interrupts, but only one of them is used at the moment to handle writing to the screen, the other ones are available for adventurous programmers to experiment with.
The opcodes of the virtual machine
In order to perform the lowest possible level of operations, the primitive virtual machine has a set of opcodes that are acted upon by the VM. Following is a brief presentation of the opcodes which for the technically versed should be immediately familiar due to the heavy influence of the Intel family of opcodes.
Target specifiers
Please note that in the binary stream (some of) the values (the parameters to the assembly commands) are prepended by a type specifier which indicates how to interpret the bytes following the specifier. There is a specific assembly syntax for the various specifiers.
The specifier is always 1 byte, and it can be:
0x00
– the following 4 bytes are to be interpreted as an immediate number, negative or positive. Assembler syntax:12345
(ie: just a normal number)0x01
– the following 1 byte is the index of a given register. Assembler syntax: $r123
for register 123.0x02
– this represents the value of the memory at the register’s value in the following byte. The syntax:[$r123]
will give the byte value stored at the value of register 123 in the memory.0x4X
– the following 1 byte is the index of a register, and the nibble marked with X is the index of the byte in the given register. Assembler syntax:$r12@1
meaning byte 1 from register 12. For 32 bit virtual machines@0
,@1
,@2
,@3
are valid byte indexes.0x05
– the coming 4 bytes represent a number, and its value represents an address in the memory. Syntax:[1234]
. This is mostly used by the compiler to index in variables from the memory.0x06
– the coming 4 bytes represent a number, and its value represents the address of 1 byte in the memory. Syntax:[@1234]
0x07
– this represents the value of the memory at the register’s value in the following byte + the added offset following that. The syntax:[$r123+4]
will give the byte value stored at the value of register 123 + 4 in the memory and[$r123-4]
will give the byte value stored at the value of register 123 - 4 in the memory. The four basic operations (addition, multiplication, division, subtraction) are possible in this context.0x08
represents the value of the memory at the register’s value in the following byte + the following register’s value. The syntax:[$r123+$r245]
will give the byte value stored at the value of register 123 + value of register 245. The four basic operations (addition, multiplication, division, subtraction) are possible in this context.
MOV
The MOV
opcode moves data from the given source to the given destination.
The following is the syntax of the command:
MOV <DST> <SRC>
where the syntax of <DST>
and <SRC>
correspond to specific syntax in the target specifier list. MOV
makes sure that only sane operations are permitted, and the machine will stop execution with a PANIC
command if an invalid operation is attempted for execution (for example trying to load a value into an immediate number). One of <DST>
or <SRC>
must be a register. For example, see Listing 9.
MOV $r1 12 # initialize register 1 to 12 MOV [0] $r1 # load into memory, at location 0 the value of register 1 MOV $r6@0 12 # load 12 into register 6’s lowest byte MOV $r1@1 [@$r2] # load into byte 1 of reg 1 the value in memory at register 2’s value MOV $r3@0 [@0] # load into byte 0 of reg 3 the byte value in memory at addr 0 |
Listing 9 |
ADD, SUB, MUL, DIV, MOD, OR, XOR, AND and NOT
The assembly opcodes with these very descriptive names perform the requested operation on two parameters following them. The syntax of the parameters is identical to the parameters of MOV
, so you can just refer to that section when writing your assembly code should you require it.
NOT
requires only one parameter and it will do a binary negation of it, which cannot be an immediate value. If any of the operations results in a zero value, the LOF
flag will be set to false, otherwise to true.
COPY
COPY
is used to copy data between two addresses in the memory segment. COPY
can be imagined as multiple MOV
commands where both the destination and source are at increasing locations.
The syntax of the command is:
COPY <DEST> <SRC> <COUNT>
where DEST
is interpreted as a numerical value representing the address of the area in the memory, SRC
represents the source address and COUNT
is the number of bytes to copy. If DEST
+ COUNT
> MEM_SIZE
, the virtual machine will issue a PANIC
call and not perform the operation. At the implementation level, std::memmove
is used so it is possible to have overlapping memory areas for this command. This operation does not modify the content of the LOF
flag.
EQ, NEQ, LT, GT, LTE and GTE
The comparison operators perform two operations. The first one is to compare the value of the two parameters they get, and the second one is to set the value of the LOF
flag to be either true
or false
.
JMP, JT and JNT
The direct jump commands will set the IP of the virtual machine to the specified parameter and will continue the execution of the code from the new location.
- The
JMP
command will jump to the given address regardless of any surrounding conditions. - The
JT
will jump only if the VM’sLOF
flag was set totrue
by the previous comparison operation, and it will clear the flag simultaneously. - The
JNT
command will jump to the given location only if the flag was set tofalse
, and it also will clear the flag tofalse
.
The syntax of the commands is:
JMP | JT | JNT <ADDRESS>
where ADDRESS
will be interpreted as the address where the execution will be continued. The programmer needs to take into consideration that ADDRESS
is actually a valid destination, where valid, executable code is to be found.
DJMP, DJT and DJNT
The delta jump commands, such as the ‘direct jump’, ‘jump if last comparison was true’ and ‘jump if last comparison was not true’ commands, jump from the current location in various directions and the new address determined as being at a specific bytes away from the current location. The syntax is:
DJMP | DJT | DJNT <DELTA>
where DELTA
is to be interpreted as the difference between the current IP (instruction pointer) and the upcoming new location. At this point of execution (when DST
was evaluated by the jump command) the IP points to the next executable operation, this needs to be taken into consideration when manually calculating jump addresses.
As a side note, the jump commands cannot jump to labels since they all expect a number. In order to jump to a specific label, you will have to use goto
.
PUSH and POP
As guessed from the name, PUSH
and POP
work with the stack. PUSH
always pushes an immediate value onto it, with the type modifier 0 for numeric data or 7 for string indexes, and POP
pops the value into the parameter it was required to pop into. The syntax is: PUSH <SOMETHING>
and POP <SOMETHING>
.
Internals of the VM
Like every other code interpreter out there, our VM is performing very similar operations:
- Load the next instruction
- The code for the instruction loads its parameters
- The code for the instruction executes the instruction
- Check for failure
- Repeat from step 1.
A big part of the virtual machine’s code (specifically automatic handling of the registered opcodes) is generated by the build system (see below at the extending the VM) and the remaining of the operations are distributed between the data types representing the VM’s data structures and the actual operations the VM can perform.
The virtual machine’s basic data structures are declared in the header registers.h. This header has a basic data type primal::valued
. The class acts as a common aggregate for the entities that can be handled by the virtual machine (see Figure 7).
Figure 7 |
Upon each execution step, the instruction (opcode) needs to fetch its parameters from the memory of the virtual machine (the opcode can have zero, 1, 2 or 3 parameters for now) and perform the most basic of operations on the data types. The overloaded operators in the valued class more act as commodities, the most important operations are the word_t value()
and the void set_value(word_t)
. This class hierarchy makes it very easy to implement basic operations, for example this is the implementation of the ADD
opcode (found in impl_ADD.h):
bool primal::impl_ADD(primal::vm* v) { primal::valued* dest = v->fetch(); primal::valued* src = v->fetch(); *dest += *src; v->set_flag(dest->value() != 0); return true; }
Extending the virtual machine
The project was created with extendibility in mind, and nothing is easier than adding a new binary opcode and the corresponding implementation, or providing your own interrupt to deal with tasks.
Adding a new opcode
As you might have observed, we have not introduced the INC operation when designing the virtual machine (along with other missing operations, which hopefully will be implemented at a later stage), but no worries – let’s read through the following paragraph in order to see how you can implement your own opcodes to extend the virtual machine.
We will need to do modifications to the build system and also add a new file to it in order to have the new opcode up and running. You have to open the CMakeLists.txt from the opcodes directory around line 35 to find the registration of the opcodes section.
Go to the end of the section, and after the last register_opcode
call insert yours:
register_opcode("INC" 0xEE 1 OF_ARITH)
With this you have told the build system the following facts:
- It will have an opcode called
INC
- When in the binary stream, it will be represented by the value
0xEE
- It expects one parameter, ie. the one which will be incremented.
- It is of type arithmetic
What happens behind the scenes is that CMake
generates two files for you:
- One is the opcodes header file INC.h in the build/opcodes directory (see Listing 10), which has the declaration of the class INC and also two functions.
- The other file is compile_INC.cpp, generated in the same directory, which will be used by the compiler when encountering the
INC
opcode in the assembly code of your source file.
struct INC final : public opcode { uint8_t bin() const override {return 0xEE; } std::string name() const override {return "INC";} size_t paramcount() const override { return 1; } virtual opcode_family family() const override { return primal::opcodes::opcode_family::OF_ARITH; }; }; bool impl_INC(vm*); std::vector<uint8_t> compile_INC(std::vector<token>>amp;); |
Listing 10 |
The two functions in the header file are:
compile_INC
, which is responsible for compiling the assembly statements into corresponding (virtual) machine code and it will be provided by the build system in the compile_INC.cpp fileimpl_INC
, which actually tells the virtual machine what to do upon encountering theINC
in the binary stream, needs to be provided by us in the opcodes/impl/impl_INC.cpp file, because this is where the build system will look for it.So create that file and paste in the following:
#include <INC.h> #include <vm.h> bool impl_INC(vm* v) { valued* dest = v->fetch(); dest->set_value(dest->value() + 1); return true; }
After you regenerate your CMake cache, and once the build of the system is ready, everything should be up and running, and you can use INC
in your assembly commands. So maybe it’s time to write a unit test for this purpose too (Listing 11).
TEST_CASE("ASM compiler - INC opcode", "[asm-compiler]") { auto c = compiler::initalize(); c->compile(R"code( asm MOV $r1 10 asm INC $r1 )code" ); auto vm = vm::create(); REQUIRE(vm->run(c->bytecode())); REQUIRE(vm->r(1).value() == 11); } |
Listing 11 |
Now if you run make test
, your newly created opcode should be up, running and incrementing values you want to.
How the opcodes are registered into the VM
When you have declared a new opcode, with the register_opcode
function in the build system in the background a CPP file (opcode-impl.cpp) is generated with the contents similar to Listing 12 for each of the opcodes.
namespace primal { void register_opcodes() { vm_impl::register_opcode (primal::opcodes::MOV(), [>amp;](primal::vm* v) -> bool { return primal::impl_MOV(v); } ); vm_impl::register_opcode (primal::opcodes::ADD(), [>amp;](primal::vm* v) -> bool { return primal::impl_ADD(v); } ); |
Listing 12 |
In the VM’s implementation class (vm_impl.h) this boils down Listing 13.
struct executor { std::function<bool(vm*)> runner; }; template<class OPC, class EXECUTOR> static void register_opcode(OPC>amp;>amp; o, EXECUTOR>amp;>amp; ex) { auto f = [>amp;](vm* machina) -> bool { return ex(machina);}; executor t; t.runner = std::function<bool(vm*)>(f); opcode_runners[o.bin()] = t; }; static std::map<uint8_t, executor> opcode_runners; |
Listing 13 |
Now in its turn, the VM’s opcode runner has the code in Listing 14.
while(opcode_runners.count(ms[static_cast<size_t> (m_ip)])) { // read in an opcode uint8_t opc = ms[static_cast<size_t>(m_ip++)]; try { // is there a registered opcode runner for // the given opcode? if(!opcode_runners[ opc ].runner(v)) { panic(); } } catch(...) { panic(); } if(ms[static_cast<size_t>(m_ip)] == 0xFF) { return true; } } |
Listing 14 |
And in short, this sums up how the VM is handling the execution of the various opcodes.
Writing a new interrupt
The interrupts are pieces of code that are compiled C++ code, and they can be used by the VM to communicate with the world outside the sandbox.
For the moment, only one interrupt is in use, which is INT 1
used to write to the screen. In order to implement a new interrupt, you will have to modify the CMakeLists.txtof the virtual machine in the section where it states ‘Register the interrupts’ add the interrupt with a new call to enable_interrupt
, for example: enable_interrupt(2)
.
Now, create the implementation file for the interrupt in vm/intr and call it intr_2.cp As a start, the content of it can be:
#include <vm.h> namespace primal { bool intr_2(vm* v) { return true; } }
And from this point on, you can provide the implementation for your interrupt. You can use the interface exposed by vm.h in order to get access to the VM internals and memory.
Binding it together
As I mentioned in the introduction, our ultimate goal in this article is to have it all in one place, ie: in one CPP file we compile a piece of primal script code and let it run in the primal VM.
Using the compiler to compile and the vm in your own source files is pretty easy, you just need a few instructions.
To properly use it in your own code, you will need a sequence of instructions like:
std::shared_ptr<primal::compiler> c = primal::compiler::create();
This will create for you the object c
which is a primal script compiler that you can use to compile code in the following manner:
c->compile(R"code( let x = 40 )code" );
The next step you need to do is to create a virtual machine:
std::shared_ptr<primal::vm> vm = primal::vm::create();
It is possible to use auto
for the return type: I just wanted to show the actual return type. Now that you have the compiled code in the compiler and the virtual machine up and running, nothing is easier than to run it:
vm->run(c->bytecode())
After this point, you have full access to the memory and registers of the virtual machine with the proper function calls found in class primal::vm
located in vm.h:
word_t get_mem(word_t address)
– to get the 32/64 bit numeric value from the memory of the virtual machine from the specified address.uint8_t get_mem_byte(word_t address)
– to get the 8 bit value from the memory of the virtual machine from the specified address.const reg& r(uint8_t i) const
– to get the object representing the ith. register of the VM.
Conclusion
Working on this project was really fun, and I have learned a lot during the implementation. However, progress does not stop here. There are a lot of features in the compiler to be implemented, new keywords, proper register handling, and – why not – even some optimizations that can be introduced. The VM needs good profiling in order to pinpoint the bits that can be optimized, and a lot of other features can be included too in the entire ecosystem. While designing this whole universe, I made it as modular as possible with the thought that some day, someone might have time to work on it, and include the features he/she considers nice to have. So, if you feel that you could contribute in any way, feel free to get in touch and start coding on it in the spirit of open source.
References
[CompOpt] Compiler optimizations: http://compileroptimizations.com/
[Dijkstra61] EdsgerW.Dijkstra (1961) ALGOL Bulletin Supplement no. 10: http://www.cs.utexas.edu/~EWD/MCReps/MR35.PDF
[Dragon] Alfred V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman (2006) Compilers: Principles, Techniques, and Tools
[GCCArchitecture] GNU C Compiler Architecture: https://en.wikibooks.org/wiki/GNU_C_Compiler_Internals/GNU_C_Compiler_Architecture
[Guntli11] Christopher Guntli (2011) ‘Architecture of clang’, https://wiki.ifs.hsr.ch/SemProgAnTr/files/Clang_Architecture_-_ChristopherGuntli.pdf
[Knuth65] Donald E. Knuth (July 1965). ‘On the translation of languages from left to right’. Information and Control 8(6): 607 - 639. doi: 10.1016/S0019 - 9958(65)90426-2
[LLVM] The LLVM Compiler Infrastructure:https://llvm.org/
[Wikipedia-1] Compiler optimizations: https://en.wikipedia.org/wiki/Category:Compiler_optimizations
[Wikipedia-2] Lexical analysis: https://en.wikipedia.org/wiki/Lexical_analysis
[Wikipedia-3] Multi-pass compiler: https://en.wikipedia.org/wiki/Multi-pass_compiler
[Wikipedia-4] Parser generators: https://en.wikipedia.org/wiki/Comparisonofparser_generators
[Wikipedia-5] Regular expression: https://en.wikipedia.org/wiki/Regular_expression
Appendix – The write function
As promised, Listing 15 is the write function you can use to display data. Save it under the name write.prim and place it in the directory where your compiler and source file is, for the moment. Don’t forget to include it in your project.
fun write(...) asm MOV $r249 $r255 # Decrease the stack pointer to skip the pushed # R254 and the return address. # This is for 32 bit builds. For 64 bit builds, # use 16 asm SUB $r255 8 # First: the number of parameters that came in asm POP $r10 :next_var # fetch the value that needs to be printed asm POP $r2 # This $r1 will contain the type of the variable: # 1 for string, 0 for number asm POP $r1 # Is this a numeric value we want to print? asm EQ $r1 0 # If yes, goto the print number location asm JT print_number # else goto the print string location asm JMP print_string :print_number # print it out asm INTR 1 # Move to the next variable asm SUB $r10 1 # JT is logically equivalent to JNZ asm JT next_var # Done here, just return asm MOV $r255 $r249 asm JMP leave :print_string # Here $r2 contains the address of the string, #first character is the length # Initialize $r1 with the length asm MOV $r1 0 asm MOV $r1@0 [$r2+$r251] # Get the address of the actual character data asm ADD $r2 1 # Print it asm INTR 1 # Move to the next variable asm SUB $r10 1 # JT is logically equivalent to JNZ asm JT next_var # Done here, just return asm MOV $r255 $r249 :leave end |
Listing 15 |
Ferenc has wanted to be a better programmer for the last 15 years. Right now he tries to accomplish this goal by working at Maritime Robotics as a system programmer, and in his free time, by exploring the hidden corners of the C++ language in search of new quests.
Overload Journal #149 - February 2019 + Design of applications and programs
Browse in : |
All
> Journals
> Overload
> o149
(8)
All > Topics > Design (236) Any of these categories - All of these categories |