Journal Articles

CVu Journal Vol 27, #2 - May 2015 + Programming Topics
Browse in : All > Journals > CVu > 272 (9)
All > Topics > Programming (877)
Any of these categories - All of these categories

Note: when you create a new publication type, the articles module will automatically use the templates user-display-[publicationtype].xt and user-summary-[publicationtype].xt. If those templates do not exist when you try to preview or display a new article, you'll get this warning :-) Please place your own templates in themes/yourtheme/modules/articles . The templates will get the extension .xt there.

Title: Split and Merge – Another Algorithm for Parsing Mathematical Expressions

Author: Martin Moene

Date: 05 May 2015 16:53:19 +01:00 or Tue, 05 May 2015 16:53:19 +01:00

Summary: Vassili Kaplan presents an alternative to Dijkstra’s algorithm.

Body: 

Split and Merge – Another Algorithm for Parsing Mathematical Expressions

To parse a string containing a mathematical expression (for example, 3 + 4 * 2 / (1-5)2) a Shunting-yard algorithm [1] created by Edsger Dijkstra is often used. It converts the expression to Reverse Polish notation and then uses the postfix algorithm [2] to evaluate it.

In this article we present another algorithm for parsing a mathematical expression. It consists of two steps as well – firstly creating a list of structures, each containing a number and an action, and secondly merging all the structures together to get the expression result.

Even though this algorithm has same time complexity as Dijkstra’s, we believe it is simpler to code. We will discuss some examples of using it and present its implementation in C++.

Splitting an expression to a list of structures

The result of the first step of our algorithm is a list of structures. Each structure consists of two elements – a real number and an action. An action can be any of add, subtract, multiply, divide, power, or ‘no action’. The special ‘no action’ is assigned to the last member of the whole expression or to the last member before a closing parenthesis. For convenience we denote this ‘no action’ with the symbol ‘)’.

Example 1

Suppose the expression string is "3 + 2 * 6 - 1". Then the result of the first step will be:

Splitting ("3 + 2 * 6 - 1") → [3, '+'], [2, '*'], [6, '-'], [1, ')']

Note that action (operation) priorities are not taken into account at this step.

If one of the tokens is a function (e.g. sine, cosine, log, etc.), a special constant (e.g. pi) or an expression in parentheses then the whole algorithm (both steps) will be applied to this token first, which is then replaced by the calculated result. This will be done recursively to all nested functions and expressions.

Therefore, after finalizing this first step, all of the functions, constants and parentheses will be eliminated.

Merging the elements of the created list of structures

The second step is an actual calculation done by merging the elements of the list of structures (each one consisting of a number and an action) created in the first step.

We attempt to merge each structure in the list with the next, taking into account the priority of actions.

Priorities (from highest to lowest) of the actions are:

Merging can be done if and only if the priority of the action in the first structure is not lower than the priority of the second structure. If this is the case, merging consists of applying the action of the first structure to the numbers of both structures, thus creating a new structure. The resulting action of this new structure will be the one from the second structure.

E.g. merging [5, '-'] and [4, '+'] will produce [5 - 4, '+'] = [1, '+'].

But if the second structure priority is higher than the first one we cannot merge the current structure with the next one. What happens then?

Then we merge the next (second) structure with the structure next to it, and so on, recursively. Eventually the merging will take place since the last element in the stack always has a ‘no action’ which has the lowest priority. After merging the next element with the structures following it, we re-merge the current element with the newly created next structure (see example 2 below).

Example 2

Let’s merge the elements created in example 1:

[3, '+'], [2, '*'], [6, '-'], [1, ')']

We always start with first element in the list, in this case element [3, '+']. Note that [3, '+'] cannot be merged with [2, '*'] since the action ‘+’ has a lower priority than ‘*’.

So we go to the next element, [2, '*'], and try to merge it with [6, '-']. In this case the multiplication has higher priority than the subtraction so we can merge the cells. The action of the resulting structure will be the one of the second structure, i.e. ‘-’. So the result of merging [2, '*'] and [6, '-'] will be [2 * 6, '-'] = [12, '-'].

This resulting structure will be merged with the next structure, which is [1, ')']. Since the multiplication has a higher priority than ‘no action’, we can merge these two elements:

Merging ([12, '-'], [1, ')']) → [12 - 1, ')'] = [11, ')'].

Now we can remerge this result with the first structure [3, '+']. Remember that we could not do the merge first because the addition action of the first structure had a lower priority than the multiplication action of the second. But after merging the second structure with the rest, its action became the ‘no action’, so now the merging may take place:

Merging ([3, '+'], [11, ')']) → [3 + 11, ')'] = [14, ')'].

We are done since only one structure is left after merging.

So the final result of applying two steps of the Split and Merge algorithm is:

"3 + 2 * 6 - 1" → [3, '+'], [2, '*'], [6, '-'], [1, ')'] → [14, ')'].

Therefore, 3 + 2 * 6 – 1 = 14.

Complete processing examples

If the expression contains functions or other expressions in parentheses then those expressions will be calculated recursively first. Let’s see how it is done it in the following two examples.

Example 3

Let’s calculate the expression "2 ^ (3 - 1)" using this algorithm.

1. Splitting ("2 ^ (3 - 1)") → [2, '^'], [Calculate "3 - 1", ')']

Now we need to calculate "3 – 1" and then return to where we were.

To calculate it we need to apply both steps of the algorithm, merging and splitting:

1.1 Splitting ("3 - 1") → [3, '-'], [1, ')'].

1.2 Merging ([3, '-'], [1, '^']) → [3 - 1, ')'] = [2, ')'] → 2.

Now we put this result back to 1:

[2, '^'], [Calculate "3 - 1", ')'] → [2, '^'], [2, ')']

2. Merging ([2, '^'], [2, ')']) → [2^2, ')'] = [4, ')'] → 4.

Therefore the result of the expression is: "2 ^ (3 - 1)" → 4.

Example 4

Let’s calculate the expression "2 * cos (pi)" using this algorithm.

1. Splitting ("2 * cos (pi)") → [2, '*'], [Calculate "cos(pi)", ')']

Calculating "cos(pi)":

1.1 Splitting ("cos (pi)") → [cos(pi), ')'] → [cos(3.14…), ')'] → [-1, ')']

1.2 Merging ([-1, ')']) → -1

Note that pi is one of the ‘special’ constants hardcoded in the splitting module. It is replaced by its numerical value in the splitting part of the algorithm.

Replacing the expression we’ve split with the result of -1, we get:

[2, '*'], [Calculate "cos (pi)", ')'] → [2, '*'], [-1, ')']

2. Merging ([2, '*'], [-1, ')']) → [2 * (-1), ')'] = [-2, ')'] → -2.

Therefore the result of the expression is: "2 * cos (pi)" → -2.

Implementation in C++

The entry point is the following process() function which calls preprocessing of the passed string and then loadAndCalculate() function which actually implements the splitting of the passed string into a list of structures and then merging the elements of this list.

double EZParser::process(const string& data)
{
    string cleanData;
    EZParser::preprocess(data, cleanData);
    size_t from = 0;
    return loadAndCalculate(cleanData, from, '\n');
}

The preprocessing just removes the blank spaces from the passed string and checks that the number of opening parenthesis matches the number of closing ones. We will skip this function for brevity. We’ll also skip the CalculationException class which is a simple wrapper over the standard exception.

The main function that performs splitting into the list of structures and then calling the merging function is loadAndCalculate(). First it creates and fills the list, implemented as a vector, and then calls the merge() function to process it further (see Listing 1).

// data contains the whole expression, from is an
// index pointing from which character the
// processing should start. The end character is
// either '\n' if we want to calculate the whole
// expression or ')' if we want to calculate an
// expression in parentheses.
double EZParser::loadAndCalculate(
    const string& data, size_t& from, char end)
{
    if (from > = data.size() || data[from] == end)
    {
        throw CalculationException(
            "Loaded invalid data " + data);
    }
    
    vector<Cell> listToMerge;
    listToMerge.reserve(16);
    string item;
    item.reserve(16);

    do
    {
        // Main processing cycle of the first part.
        char ch = data[from++];
        if (stillCollecting(item, ch))
        {
            // the char still belongs to
            // the previous operand
            item += ch;
            if (from < data.size() & & data[from]
                    != end)
            {
                continue;
            }
        }

        // We finished getting the next meaningful
        // token. The call below may recursively
        // call the function we are in. This will
        // happen if the extracted item is a
        // function itself (sin, exp, etc.) or if
        // the next item is starting with a '('.
        double value = m_allFunctions.getValue(data, from, item, ch);
        char action = validAction(ch) ? ch : updateAction(data, from, ch);
        listToMerge.push_back(Cell(value, action));
        item.clear();
    }

    while (from < data.size() & & data[from] != end)
        ;

    if (from < data.size() & & data[from] == ')')
    {
        // End of parenthesis: move one char forward.
        // This happens when called recursively.
        from++;
    }

    Cell & base = listToMerge.at(0);
    size_t index = 1;
    return merge(base, index, listToMerge);
}
Listing 1

The stillCollecting() function just checks that we are still collecting the current token (i.e. there is no new action and no new parentheses in the next character) (see Listing 2).

bool EZParser::stillCollecting(const string& item, char ch)
{
    return (item.size() == 0 &&
           (ch == '-' || ch == ')')) || !isSpecialChar(ch);
}

bool EZParser::isSpecialChar(char ch)
{
    return validAction(ch) || ch == '(' || ch == ')';
}

bool EZParser::validAction(char ch)
{
    return ch == '*' || ch == '/' || ch == '+' ||
           ch == '-' || ch == '^';
}
Listing 2

UpdateAction() function looks for the actual action corresponding to the last extracted token. This is not always the next character since there can be one or more parenthesis after the token. In case there is no action, the closing parentheses will be returned. (See Listing 3.)

char EZParser::updateAction(const string& item, size_t& from, char ch)
{
    if (from > = item.size() || item[from] == ')')
    {
        return ')';
    }

    size_t index = from;
    char res = ch;
    while (!validAction(res) & & index < item.size())
    {
        // We look to the next character in string
        // until we find a valid action.
        res = item[index++];
    }

    from = validAction(res) ? index : index > from ? index - 1 : from;
    return res;
}
Listing 3

The implementation of the second part of the algorithm, merging the cells created in the first part is below. Note that this function calls itself recursively in case the priority of the current element is less than the priority of the next (Listing 4).

double EZParser::merge(
    Cell& current, size_t& index, vector< Cell> & listToMerge)
{
    if (index > = listToMerge.size())
    {
        return current.value;
    }

    while (index < listToMerge.size())
    {
        Cell&
        next = listToMerge.at(index++);
        if (!canMergeCells(current, next))
        {
            // If we cannot merge cells yet, go to the
            // next cell and merge next cells first.
            // E.g. if we have 1+2*3, we first merge
            // next cells, i.e. 2*3, getting 6, and then
            // we can merge 1+6.
            merge(next, index, listToMerge);
        }
        mergeCells(current, next);
    }
    return current.value;
}

bool EZParser::canMergeCells(const Cell& base, const Cell& next)
{
    return getPriority(base.action) > = getPriority(next.action);
}
Listing 4

When extending this implementation, e.g. with ‘||’, ‘&&’, etc., the two functions in Listing 5 must be adjusted.

size_t EZParser::getPriority(char action)
{
    switch (action)
    {
    case '^':
        return 4;
    case '*':
    case '/':
        return 3;
    case '+':
    case '-':
        return 2;
    }
    return 0;
}

void EZParser::mergeCells(Cell& base,
                          const Cell& next)
{
    switch (base.action)
    {
    case '^':
        base.value = ::pow(base.value, next.value);
        break;
    case '*':
        base.value *= next.value;
        break;
    case '/':
        if (next.value == 0)
        {
            throw CalculationException("Division by zero");
        }
        base.value /= next.value;
        break;
    case '+':
        base.value += next.value;
        break;
    case '-':
        base.value -= next.value;
        break;
    }
    base.action = next.action;
}
Listing 5

An arbitrary number of functions and constants may be added to the implementation. In this implementation we use an ‘identity’ function to calculate an expression in parenthesis. Listing 6 contains just some of the functions.

class EZParserFunctions
{
public:
    typedef double (EZParserFunctions::*ptrFunc)(const string& arg, size_t& from);
    EZParserFunctions();
    double getValue(const string& data, size_t& from, const string& item, char ch);
    double identity(const string& arg, size_t& from);
    double sin(const string& arg, size_t& from);
    double pi(const string& arg, size_t& from);
    // …
    static double strtod(const string& str);
protected:
    map<
    string, ptrFunc>
    m_functions;
};

EZParserFunctions::EZParserFunctions()
{
    m_functions["sin"] = &EZParserFunctions::sin;
    m_functions["pi"]  = &EZParserFunctions::pi;
    // …
}

// This function is used to process the contents
// between ().
double EZParserFunctions::identity(const string& arg, size_t& from)
{
    return EZParser::loadAndCalculate(arg, from, ')');
}

double EZParserFunctions::sin(const string& arg, size_t& from)
{
    return ::sin(EZParser::loadAndCalculate(arg, from, ')'));
}

double EZParserFunctions::pi(const string& arg, size_t& from)
{
    return 3.141592653589793;
}
Listing 6

The functions in Listing 7 are used in the first step of the algorithm via the getValue function, when creating the list of structures (see function loadAndCalculate() in Listing 1 above).

double EZParserFunctions::getValue(
    const string& data, size_t& from, const string& item, char ch)
{
    if (item.empty() & & ch == '(')
    {
        return identity(data, from);
    }
    
    map<
    string, ptrFunc>
    ::const_iterator it = m_functions.find(item);
    // The result will be either a function or a
    // real number if there is no function.
    return it == m_functions.end() 
            ? strtod(item)
            : (this-> *(it-> second))(data, from);
}

double EZParserFunctions::strtod(const string& str)
{
    char* x;
    double num = ::strtod(str.c_str(), & x);

    if (::strlen(x) > 0)
    {
        throw CalculationException("Could not parse token [" + str + "]");
    }
    return num;
}
Listing 7

Complexity and conclusions

Each token in the string will be read only once during the first, splitting step.

Suppose that there are n characters in the expression string and that the first step leads to m structures for the second step. Then in the second step there will be at most 2(m-1) - 1 = 2m - 3 comparisons (in the worst case) to merge all of the structures. Therefore the time complexity will be:

O(n + 2m - 3) = O(n).

The main difference between the Split and Merge and the Dijkstra algorithm is that the former uses potentially many recursive calls whereas the later uses no recursion. So the actual performance gain will depend on a particular implementation language and the compiler, in particular how well the recursion is managed.

References

[1] Shunting-yard algorithm, http://en.wikipedia.org/wiki/Shunting-yard_algorithm

[2] Reverse Polish notationhttp://en.wikipedia.org/wiki/Reverse_Polish_notation

Notes: 

More fields may be available via dynamicdata ..