Title: Programming Your Own Language in C++

Author: Martin Moene

Date: 03 June 2016 14:42:20 +01:00 or Fri, 03 June 2016 14:42:20 +01:00

Summary: Scripting languages allow dynamic features easily. Vassili Kaplan writes his own in C++ allowing keywords in any human language.

Body:

To iterate is human, to recurse divine.
~ L. Peter Deutsch

Why is yet another scripting language needed? Not only are there already quite a few scripting languages present but there are also a few frameworks to build new languages and parsers, notably ANTLR Framework [ANTLR] and the Boost Spirit library [Spirit]. Nevertheless, I see two main advantages in writing a language like the one presented in this article: the possibility of adding a new function, a control flow statement or a new data structure on the fly and to have keywords translated to any language with just a few configuration changes (or to add synonyms to existing keywords, so that, for instance, ls and dir could be used interchangeably). Also, as you will see from this article, the learning curve is much shorter to get started adding your own code in C++.

In C Vu [Kaplan15a, Kaplan15b] I published the split-and-merge algorithm to parse a string containing a mathematical expression. Later, in MSDN Magazine [Kaplan15c, Kaplan16] I implemented this algorithm in C# and showed how one can implement a scripting language in C# based on the split-and-merge algorithm.

In this article, I am going to extend the split-and-merge algorithm presented earlier and show how one can use it to write a scripting language in C++. In MSDN Magazine [Kaplan16] the scripting language was called CSCS (Customized Scripting in C#). For simplicity, I will continue calling the language I discuss here CSCS (Customized Scripting in C++ using the split-and-merge algorithm) as well.

The CSCS language, as described in MSDN Magazine [Kaplan16], was still not very mature. In particular, there is a section towards the end of the article mentioning some of the important features usually present in a programming language that CSCS was still missing. In this article Iâ€™ll generalize the CSCS language and show how to implement most of those missing features and a few others in C++.

All the code discussed here is available for download [Downloads].

The split-and-merge algorithm to parse a language statement

Here weâ€™ll generalize the split-and-merge algorithm to parse not only a mathematical expression but any CSCS language statement. A separation character must separate all CSCS statements. It is defined in the Constants.h file as const char END_STATEMENT = ';'.

The algorithm consists of two steps.

In the first step, we split a given string into a list of objects, called â€˜Variablesâ€™. Each Variable consists of an intermediate result (a number, a string, or an array of other Variables) and an â€˜actionâ€™ that must be applied to this Variable. Previously we called this Variable a â€˜Cellâ€™ and it could hold only a numerical value.

The last element of the created list of Variables has so called â€˜null actionâ€™, which, for convenience, we denote by the character ")". It has the lowest priority of 0.

For numbers, an action can be any of +, -, *, /, %, &&, ||, and so on. For strings, only + (concatenation), and logical operators <, <=, >, >=, ==, != are defined.

Listing 1 contains an excerpt from the Variable class definition.

class Variable {
public:
  Variable(): type(Constants::NONE) {}
  Variable(double val): numValue(val),
    type(Constants::NUMBER) {}
  Variable(string str): strValue(str),
    type(Constants::STRING) {}

  string toString() const;
  bool canMergeWith(const Variable& right);
  void merge(const Variable& right);
  void mergeNumbers(const Variable& right);
  void mergeStrings(const Variable& right);

private:
  double numValue;
  string strValue;
  vector<Variable> tuple;
  string action;
  string varname;
  Constants::Type type;
};

Listing 1

The separation criteria for splitting a string into a list of Variables are: an action, an expression in parentheses, or a function (including a variable, which is also treated as a function), previously registered with the parser. We are going to talk how to register a function with the parser in the next section. In Listing 2 you can see all of the actions defined for numbers and their priorities.

unordered_map<string, int> prio;
prio["++"] = 10;
prio["--"] = 10;
prio["^"]  = 9;
prio["%"]  = 8;
prio["*"]  = 8;
prio["/"]  = 8;
prio["+"]  = 7;
prio["-"]  = 7;
prio["<"]  = 6;
prio[">"]  = 6;
prio["<="] = 6;
prio[">="] = 6;
prio["=="] = 5;
prio["!="] = 5;
prio["&&"] = 4;
prio["||"] = 3;
prio["+="] = 2;
prio["-="] = 2;
prio["*="] = 2;
prio["/="] = 2;
prio["%="] = 2;
prio["="]  = 2;

Listing 2

In the case of an expression in parentheses or a function, we apply recursively the whole split-and-merge algorithm to that expression in parentheses or the function argument in order to get a Variable object as a result. At the end of the first step we are going to have a list of Variables, each one having an action to be applied to the next Variable in the list. Thanks to the recursion there will be no functions left after step 1, just numbers, strings or lists of numbers, strings or lists of numbers, strings, â€¦ and so on recursively. We call these lists â€˜tuplesâ€™. Internally they are implemented as vectors (see Listing 1).

The data structure holding the string with the expression to parse and a pointer to the character currently being parsed is ParsingScript. An excerpt from its definition is shown in Listing 3.

class ParsingScript
{
public:
  ParsingScript(const string& d): data(d),
    from(0) {}
  inline char operator()(size_t i) const {
    return data[i]; }
  inline size_t size() const             {
    return data.size(); }
  inline bool stillValid() const         {
    return from < data.size(); }
  inline size_t getPointer() const       {
    return from; }
  inline char current() const            {
    return data[from]; }
  inline char currentAndForward()        {
    return data[from++]; }
  inline char tryNext() const            {
    return from+1 < data.size() ?
           data[from+1] : Constants::NULL_CHAR; }
  inline void forward(size_t delta  = 1) {
    from += delta; }
private:
  string data; // contains the whole script
  size_t from; // a pointer to the 
               // script data above
};

Listing 3

The main parsing cycle of the first part of the algorithm is shown in Listing 4.

do // Main processing cycle of the first part.
{
  char ch = script.currentAndForward (); 
  // get the next character and move one forward

  bool keepCollecting = stillCollecting(script,
    parsingItem, to, action);
  if (keepCollecting)
  { // The char still belongs to the 
    // previous operand.
    parsingItem += ch;
    // Parse until the next extracted character is
    // in the "to" string.
    bool goForMore = script.stillValid() &&
      !Utils::contains(to, script.current()));
    if (goForMore) {
      continue;
    }
  }
  // Done getting the next token. The getValue()
  // call below may recursively call this method if
  // extracted item is a function or is starting
  // with a '('.
  ParserFunction func(script, parsingItem, ch,
    action);
  Variable current = func.getValue(script);

  if (action.empty()) { // find the next "action"
                     // token or a null action '('
    action = updateAction(script, to);
  }

  char next = script.current(); // we've already
                                // moved forward
  bool done = listToMerge.empty() && 
    (next == END_STATEMENT ||
    (action == Constants::NULL_ACTION &&
     current.getType() != NUMBER);
  if (done) {
    // If there is no numerical result, we are not
    // in a math expression.
    listToMerge.push_back(current);
    return listToMerge;
  }
  current.action = action;
  listToMerge.push_back(current);
  parsingItem.clear();
} while (script.stillValid() &&
    !Utils::contains(to, script.current());

Listing 4

The second step consists in merging the list of Variables created in the first step according to their priorities. Two Variable objects can be merged together only if the priority of the action of the Variable on the left is greater or equal than the priority of the action of the Variable on the right. If not, we jump to the Variable on the right and merge it with the Variable on its right first, and so on, recursively. As soon as the Variable on the right has been merged with the Variable next to it, we return back to the original Variable, the one we werenâ€™t able to merge before, and try to remerge it with the newly created Variable on its right. Note that eventually we will be able to merge the list since the last Variable in this list has a null action with zero priority.

Check out the implementation of the second step of the algorithm in Listing 5. The function merge() is called from outside with the mergeOneOnly parameter set to false. That parameter is set to true when the function calls itself recursively, because we need to merge the Variable on the right with the Variable on its right just once, and return back in order to remerge the current Variable.

Variable Parser::merge(Variable& current, 
  size_t& index, vector<Variable>& listToMerge,
  bool mergeOneOnly)
{
  while (index < listToMerge.size()) {
    Variable& next = listToMerge[index++];
    while (!current.canMergeWith(next)) {
      merge(next, index, listToMerge,
            true/*mergeOneOnly*/);
    }
    current.merge(next);
    if (mergeOneOnly) {
      break;
    }
  }
  return current;
}

Listing 5

Letâ€™s see an example of applying the split-and-merge algorithm to the following CSCS language statements:

  a = "blah"; b = 10; c = a == "blue" || b == 1;
  print(c);

We have four statements above, separated by the ; character.

The parsing of the first two statements is analogous â€“ as soon as the parser gets the = token, it will register the parameters on the left (a and b), mapping them to their corresponding values ("blah" and 10). We will see how to register variables and functions with the parser in the next section (actually for the parser it doesnâ€™t matter whether it is a variable or a function â€“ they are all treated the same). Note that there is no need to declare any variables â€“ they are all deduced from the context.

More interesting is the third statement:

  c = a == "blue" || b == 1;

When parsing the right part of this statement (after the assignment), as soon as the parser gets a function or a variable (this can be anything that is not in quotes and is not a number) it will try to find if there is already a corresponding variable or a function registered with the parser. If not, a ParsingException will be thrown, but in our case we have already defined a and b in the previous two statements, so their values will be substituted, transforming the right side of the statement above to:

  "blah" == "blue" || 10 == 1;

The first step of the split-and-merge algorithm produces the following list of Variables from the above statements:

Split("blah" == "blue" || 10 == 1) â†’
1. Variable(strValue = "blah", action = "==")
2. Variable(strValue = "blue", action = "||")
3. Variable(numValue = 10, action = "==")
4. Variable(numValue = 1, action = ")")

In the second part of the algorithm we merge the list of Variables above. The Variables 1 and 2 can be merged on the first pass since the priority of the "==" action is higher than the priority of the "||" action according to the Listing 2. The merging of Variables 1 and 2 consists in applying the action "==" of the first variable to the values of both variables, leading to a new Variable having the action of the Variable on its right:

  Merge(Variable(strValue = "blah", action = "=="),
  Variable(strValue = "blue", action = "||")) = 
  Variable("blah" == "blue", 
    action = "||") = Variable(numValue = 0, 
    action = "||")

Next, we merge Variable(numValue = 0, action = "||") with the next Variable in the list, Variable(numValue = 10, action = "=="). But now the action = "||" has a lower priority than the action = "==", therefore we need to merge first the Variable(numValue = 10, action = "==") with the next Variable(numValue = 1, action = ")"). Since the null action ")" has the lowest priority, the merge can be done, producing a new Variable:

  Merge(Variable(numValue = 10, 
    action = "=="),Variable(numValue = 1,
    action = ")")) =
  Variable(10 == 1, 
    action = ")") = Variable(numValue = 0, 
    action = ")")

Since this merge was successful we must now get back to the Variable(numValue = 0, action = "||") and remerge it with the newly created Variable(numValue = 0, action = ")"), producing:

  Merge(Variable(numValue = 0,
    action = "||"),Variable(numValue = 0, 
    action = ")")) =
  Variable(0 || 0,
    action = ")") = Variable(numValue = 0,
    action = ")")

Since there are no more Variables left in the list, the result of merging is 0. This value will be assigned to the variable "c" and registered with the parser.

The last statement is "print(c);". After extracting the token print the parser will look if there is a function named print already registered with the parser. Since there is one, the parser will recursively call the whole split-and-merge algorithm on the argument of the print() function, "c". Since "c" was registered with the parser in the previous step, the parser will return back its value, 0, to the print function. Letâ€™s see more closely how variables and functions can be implemented and registered with the parser taking as an example the print() function.

Registering functions and variables with the parser

All the functions that can be added to the parser must derive from the ParserFunction class.

The Identity is a special function which is called when we have an argument in parentheses. It just calls the main entry method of the split-and-merge algorithm to load the whole expression in parentheses:

 class IdentityFunction : public ParserFunction
 {
 public:
   virtual Variable evaluate(ParsingScript& script)
   {
     return Parser::loadAndCalculate(script);
   }
 };

All split-and-merge functions and variables are implemented similarly.

There are three basic steps to register a function with the parser:

Define a function keyword token, i.e. the name of the function in the scripting language, CSCS, e.g.:

    static const string PRINT; // in Constants.h

    const string Constants::PRINT = "print"; 
                               // in Constants.cpp

Implement the class to be mapped to the keyword from the previous step. Basically just the evaluate() method must be overridden. E.g. for the print() function:

    Variable 
      PrintFunction::evaluate(ParsingScript&
        script)
    {
      vector<Variable> args =
      Utils::getArgs(script, Constants::START_ARG,
      Constants::END_ARG);
      for (size_t i = 0; i < args.size(); i++) {
        cout << args[i].toString();
      }
      cout << endl;
      return Variable::emptyInstance;
    }

Map an object of the class implemented in the previous step with the previously defined keyword as follows:
```
    ParserFunction::addGlobalFunction
           (Constants::PRINT, new  PrintFunction());
```

Utils::getArgs() auxiliary function parses the arguments of the print function and first gets a list of strings, separated by commas from print(string1, string2, ..., stringN). Then it recursively applies the split-and-merge algorithm to each of these string1, string2, ..., stringN. Finally, it prints them out to the terminal.

The addGlobalFunction() method (see Listing 6) just adds a new entry to the global dictionary s_functions (implemented as an unordered_map) used by the parser to map keywords to functions.

Variable IfStatement::evaluate(ParsingScript& script)
{
  size_t startIfCondition = script.getPointer();
  
  Variable result = Parser::loadAndCalculate
    (script, Constants::END_ARG_STR);
  bool isTrue = result.numValue != 0;
    
  if (isTrue) {
    result = processBlock(script);
        
    if (result.type == 
          Constants::BREAK_STATEMENT ||
        result.type ==
          Constants::CONTINUE_STATEMENT) {
        // Got here from the middle of the if-block.
       // Skip it.
         script.setPointer(startIfCondition);
      skipBlock(script);
    }
    skipRestBlocks(script);
    return result;
  }
    
  // We are in Else. Skip everything in the 
  // If statement.
  skipBlock(script);
    
  ParsingScript nextData(script.getData(),
    script.getPointer());
  string nextToken =
    Utils::getNextToken(nextData);
    
  if (Constants::ELSE_IF_LIST.find(nextToken) !=
      Constants::ELSE_IF_LIST.end()) {
    script.setPointer(nextData.getPointer() + 1);
    result = processIf(script);
  }
  if (Constants::ELSE_LIST.find(nextToken) !=
      Constants::ELSE_LIST.end()) {
    script.setPointer(nextData.getPointer() + 1);
    result = processBlock(script);
  }
  return Variable::emptyInstance;
}

Listing 6

  void ParserFunction::addGlobalFunction
    (const string& name, ParserFunction* function)
  {
    auto tryInsert = 
      s_functions.insert({name, function});
    if (!tryInsert.second) {
      // The variable or function already exists.
      // Delete it and replace with the new one.
      delete tryInsert.first->second;
      tryInsert.first->second = function;
    }
  }

As weâ€™ve mentioned before, we treat a variable as a function: as soon as the parser gets an expression like "a = something", it will register a function with the keyword "a" and map it to the Variable "something" (which will be calculated applying recursively the split-and-merge algorithm). In C++ the code for this is:

  ParserFunction::addGlobalFunction("a",
    something /*Variable*/
);

Implementing the if â€“ else if â€“ else control flow statements

Letâ€™s see how to implement the if() â€“ else if() â€“ else() functionality in CSCS.

The first and the third steps are clear: define the if constant and register a class implementing the if control flow statement with the parser, the same way we registered the print() function above.

The second step, the implementation of the if statement, is shown in Listing 6.

First we evaluate the condition inside of if() by recursively calling the split-and-merge algorithm on that condition. If the condition is true we process the whole if() block, recursively calling the split-and-merge algorithm on each statement inside of the processBlock() method. If the condition is false we first skip the whole if() block in the skipBlock() method. Then we evaluate the else() and else if() statements. The evaluation of else if() is basically same as the evaluation of the if() statement itself, so for else if() we recursively call the if() statement evaluation.

Note that we enhanced the execution of the if-statement here â€“ as soon as there is a break or a continue statement, we get out of the if () block â€“ same way we get out from the while() block. This can be useful in case of nested ifs.

Similarly, we can register any function with the parser, e.g. while(), for(), try(), throw(), include(), etc. We can also define local or global variables in the same way. In the next section we are going to see how to define functions in CSCS and add passed arguments as local variables to CSCS.

Implementing custom functions in CSCS

To write a custom function in the scripting language, letâ€™s introduce two functions in C++, FunctionCreator and CustomFunction, both deriving from the ParserFunction base class. A FunctionCreator object is registered with the parser in the same way we registered if() and print() functions above.

As soon as the parser gets a token with the "function" keyword, an instance of the FunctionCreator will be executed, namely, its evaluate() method, see Listing 7.

Variable FunctionCreator::evaluate(ParsingScript&
    script) {
  string funcName = Utils::getToken(script,
     TOKEN_SEPARATION);
  vector<string> args =
     Utils::getFunctionSignature(script);
  string body = Utils::getBody(script, '{', '}');

  CustomFunction* custFunc = 
     new CustomFunction(funcName, body, args);
  ParserFunction::addGlobalFunction(funcName,
     custFunc);

  return Variable(funcName);
}

Listing 7

Basically, it just creates a new object, an instance of the CustomFunction, and initializes it with the extracted function body and the list of parameters. It also registers the name of the custom function with the parser, so the parser maps that name with the new CustomFunction object which will be called as soon as the parser encounters the function name keyword.

So all of the functions that we implement in the CSCS code correspond to different instances of the CustomFunction class. The custom function does primarily two things, see Listing 8. First, it extracts the function arguments and adds them as local variables to the parser (they will be removed from the parser as soon as the function execution is finished or an exception is thrown). It also checks that the number of actual parameters is equal to the number of the registered ones (this part is skipped for brevity).

Variable CustomFunction::evaluate(ParsingScript& script)
{
  // 0. Extract function arguments.
  vector<Variable> args = Utils::getArgs(script);
  // 1. Add passed arguments as local variables.
  StackLevel stackLevel(m_name);
  for (size_t i = 0; i < m_args.size(); i++) {
    stackLevel.variables[m_args[i]] = args[i];
  }
  ParserFunction::addLocalVariables(stackLevel);
  // 2. Execute the body of the function.
  Variable result;
  ParsingScript funcScript(m_body);
  while (funcScript.getPointer() <
     funcScript.size()-1 && !result.isReturn) {
    result =
       Parser::loadAndCalculate(funcScript);
    Utils::goToNextStatement(funcScript);
  }
  // 3. Return the last result of the execution.
  ParserFunction::popLocalVariables();
  return result;
}

Listing 8

Second, the body of the function is evaluated, using the main parser entry point, the loadAndCalculate() method.

If the function body contains calls to other functions, or to itself, the calls to the CustomFunction can be recursive.

Letâ€™s see this with an example in CSCS. It calculates recursively the Fibonacci numbers, see Listing 9.

cache["fib"] = 0;
function fibonacci(n) {
  if (!isInteger(n)) {
    exc = "Fibonacci is for integers only"
      "(n="+ n +")";
    throw (exc);
  }
  if (n < 0) {
    exc = "Negative number (n="+ n +") supplied";
    throw (exc);
  }
  if (n <= 1) {
    return n;
  }
  if (contains(cache["fib"], n)) {
    result = cache["fib"][n];
    print("  found in cache fib(", n, ")=",
      result);
    return result;
  }
  result = fibonacci(n - 2) + fibonacci(n - 1);
  cache["fib"][n] = result;
  return result;
}

Listing 9

The Fibonacci function above uses an auxiliary isInteger() function, also implemented in CSCS:

  function isInteger(candidate) {
    return candidate == round(candidate);
  }

The isInteger() function calls yet another, round() function. The implementation of the round() function is already in the C++ code and is analogous to the implementation of the print() function that we saw in the previous section.

To execute the Fibonacci function with different arguments we can use the following CSCS code:

  n = ...;
  print("Calculating Fibonacci(", n, ")...");
  try {
    f = fibonacci(n);
    print("fibonacci(", n, ")=", f);
  } catch(exc) {
    print("Caught: " + exc);
  }

We get the output in Figure 1 for different values of n.

Calculating Fibonacci(10)...
  found in cache fib(2)=1
  found in cache fib(3)=2
  found in cache fib(4)=3
  found in cache fib(5)=5
  found in cache fib(6)=8
  found in cache fib(7)=13
  found in cache fib(8)=21
Fibonacci(10)=55

Calculating Fibonacci(-10)...
Caught: Negative number (n=-10) supplied at
  fibonacci()

Calculating Fibonacci(1.500000)...
Caught: Fibonacci is for integers only
(n=1.500000) at
  fibonacci()

Figure 1

Since the exceptions happened at the global level, the exception stacks printed consisted only of the fibonacci() function itself. To keep track of the execution stack, i.e. CSCS functions being called, internally we use a C++ stack data structure, where we add every executing CustomFunction object as soon as we start function execution and remove it as soon as the execution is over. In case of an exception we just print out the contents of this stack. Then we clear the execution stack up to the level where the exception was caught. The implementation of the execution stack and of the try() and throw() functions can be consulted in [Downloads].

The implementation of the Fibonacci function in Listing 9 uses caching of already calculated Fibonacci numbers by using dictionaries â€“ we will see how one can implement dictionaries next.

Arrays and other data structures

To declare an array and initialize it with some data we use the same CSCS language statement. The array elements for the initialization are declared between the curly braces. Here is an example in CSCS:

  a = 10;
  arr = {++a-a--, ++a*exp(0)/a--, -2*(--a - ++a)};
  i = 0;
  while(i < size(arr)) {
    print("a[", i, "]=", arr[i], ", expecting ",
      i);
    i++;
  }

The number of elements in the array is not explicitly declared since it can be deduced from the assignment.

The function size() is implemented in C++. It returns the number of elements in an array. In case the passed argument is not an array, it will return the number of characters in it.

Internally an array is implemented as a vector, so you can add elements to it on the fly. In CSCS we access elements of an array, or modify them, by using the squared brackets. As soon as the parser gets a token containing an opening squared bracket, it knows that it is for an array index, so it applies recursively the whole split-and-merge algorithm to the string between the squared brackets to extract the index value. There can be unlimited number of dimensions of an array (well, limited by the system resources) because the array is implemented as a vector<Variable>: the Variable class has a member called "tuple" and declared as vector<Variable>. For instance, accessing a[i][j][k] in the CSCS code means a.tuple[i].tuple[j].tuple[k] in the C++ underlying code ("a" is a Variable, see Listing 1 for Variable definition).

In other words, for each consequent index in squared brackets the parser will create a new Variable of type array or use an existing "tuple" member.

If we access an element of an array and that element has not been initialized yet, an exception will be thrown by the parser. However, itâ€™s possible to assign a value to just one element of an array, even if the index used is greater than the number of elements in the array and even if the array has not been initialized yet. In this case the non-existing elements of the array will be initialized with the empty values. The CSCS code below is legal, even if the array has not been initialized before:

  i = 10;
  while(--i > 0) {
    array[i] = 2*i;
  }
  print("array[9]=", array[9]);       <CodeComment>
// prints 18</CodeComment>

  print("size(array)=", size(array)); <CodeComment>
// prints 10</CodeComment>

We can also add other data structures to the CSCS language. Letâ€™s see an example of adding a dictionary, implemented internally as an unordered_map. We add the following member to the Variable class:

  unordered_map<string, size_t> dictionary;

This is the mapping between a dictionary key and an index of already existing member vector<Variable> tuple, where the actual dictionary value will be stored. So every time a new key-value pair is added to the dictionary the following code is executed:

  tuple.emplace_back(var);
  dictionary[key] = tuple.size() - 1;

Every time an existing key is accessed the following code is executed (a check for existence is skipped):

  auto it = dictionary.find(key);
  size_t ptr = it->second;
  return tuple[ptr];

With a few changes one can use not only strings, but anything else as a dictionary key. Similarly, we can add other data structures to CSCS â€“ as long as a data structure exists, or can be implemented in C++, it can be added to CSCS as well.

In Listing 6 we saw the implementation of the if() - else if() - else control flow statements. Towards the end of the listing you might have scratched your head, asking why we didnâ€™t compare the extracted token with the "else" string, but we did a comparison with the ELSE_LIST? The reason is that the ELSE_LIST contains all possible synonyms and translations of the "else" keyword to any of the languages that the user might have supplied in the configuration file. How is a keyword translation added to the parser?

How to add keyword synonyms and language translations

One of the advantages of writing a custom programming language is a possibility to have the keywords in any language (besides the â€˜baseâ€™ language, understandably chosen to be English). Also we can create our language in such a way, that there will be no need to remember that one must use dir, copy, move, findstr on Windows, but ls, cp, mv, and grep on Unix; and that a very nice shortcut cd.. works only on Windows: in our language we can have both! And with just a few configuration changes.

Here is how we can add keyword synonyms and translations to the CSCS language.

First, we define them in a configuration file; Figure 2 is an incomplete example of a configuration file with Spanish translations. The same configuration file may contain an arbitrary number of languages. For example, we can also include the keyword synonyms in the same file (see Figure 3).

  function    = funciÃ³n
  include     = incluir
  if          = si
  else        = sino
  elif        = sino_si
  return      = regresar
  print       = imprimir
  size        = tamaÃ±o
  while       = mientras

Figure 2

  ls          = dir
  cp          = copy
  mv          = move
  rm          = rmdir
  grep        = findstr

Figure 3

After reading the keyword translations we add them to the parser one by one (see Listing 10).

void addTranslation(const string& origName,
                    const string& translation)
{
  ParserFunction* origFunction =
    ParserFunction::getFunction(origName);
  if (origFunction != 0) {
    ParserFunction::addGlobalFunction
       (translation, origFunction);
  }
  tryAddToSet(originalName, translation, CATCH,
    CATCH_LIST);
  tryAddToSet(originalName, translation, ELSE,
    ELSE_LIST);
  //â€¦ other sets
}

Listing 10

First, we try to add a translation to one of the registered functions (like print(), sin(), cos(), round(), if(), while(), try(), throw(), etc.). Basically, we map the new keyword to the ParserFunction, corresponding to the original keyword. Therefore, as soon as the parser extracts either the original keyword (say "cp"), or the one added from the configuration file (e.g. "copy"), the same C++ function CopyFunction, deriving from the ParserFunction class, will be called.

Then we try to add new keywords to the sets of additional keywords, that are not functions (e.g. a "catch" is processed only together with the try-block, "else" and "else if" are processed together with the if-block, etc.) The tryAddToSet() is an auxiliary template function that adds a translation to a set, in case the original keyword name belongs to that set (e.g. CATCH = "catch" belongs to the CATCH_LIST).

Listing 11 is an example of the CSCS code using Spanish keywords and functions.

palabras = {"Este", "sentido", "es", "en",
 "EspaÃ±ol"};
tam = tamaÃ±o(palabras);
i = 0;
mientras(i < tam) {
  si (i % 2 == 0) {
    imprimir(palabras[i]);
  }
  i++;
}

Listing 11

Conclusions

Using the techniques presented in this article and consulting the source code in [Downloads] you can develop your own fully customized language using your own keywords and functions. The resulting language will be interpreted at runtime directly, statement by statement.

We saw that implementing a printing function and a control flow statement is basically the same: one needs to write a new class, deriving from the ParserFunction class and override its evaluate() method. Then one needs to register that function with the parser, mapping it to a keyword. The evaluate() method will be called by the parser as soon as the parser extracts the keyword corresponding to this function. For the lack of space we didnâ€™t show how to implement the while(), try, throw, break, continue, and return control flow statements but they are all implemented analogously. The same applies to the prefix and postfix ++ and -- operators that we did not have space to show but you can consult in [Downloads].

Using the above approach of adding a new function to the parser, anything can be added to the CSCS language as long as it can be implemented in C++.

You can also use this custom language as a shell language (like bash on Unix or PowerShell on Windows) to perform different file or operating system commands (find files, list directories or running processes, kill or start a new process from the command line, and so on). Stay tuned to see how to do that in our next article.

References

[ANTLR] ANTLR Framework, http://www.antlr.org

[Downloads] Implementation of the CSCS language in C++,http://www.ilanguage.ch/p/downloads.html

[Kaplan15a] V. Kaplan, â€˜Split and Merge Algorithm for Parsing Mathematical Expressionsâ€™, ACCU CVu, 27-2, May 2015, http://accu.org/var/uploads/journals/CVu272.pdf

[Kaplan15b] V. Kaplan, â€˜Split and Merge Revisitedâ€™, ACCU CVu, 27-3, July 2015, http://accu.org/var/uploads/journals/CVu273.pdf

[Kaplan15c] V. Kaplan, â€˜A Split-and-Merge Expression Parser in C#â€™, MSDN Magazine, October 2015, https://msdn.microsoft.com/en-us/magazine/mt573716.aspx

[Kaplan16] V. Kaplan, â€˜Customizable Scripting in C#â€™, MSDN Magazine, February 2016, https://msdn.microsoft.com/en-us/magazine/mt632273.aspx

[Spirit] Boost Spirit Library, http://boost-spirit.com

Acknowledgements

Iâ€™d like to thank Frances Buontempo and the Overload review team for providing their feedback which enabled me to elevate the content presented in this article.

Notes:

More fields may be available via dynamicdata ..

Journal Articles