Title: Split and Merge Revisited

Author: Martin Moene

Date: 07 July 2015 21:49:04 +01:00 or Tue, 07 July 2015 21:49:04 +01:00

Summary: Vassili Kaplan makes improvements to the Expression Parser.

Body:

Silas S. Brown (ssb22@cam.ac.uk) spotted a defect in the Merge part of the Split and Merge algorithm published in CVu [1] â€“ you can see Silasâ€™s letter here. We fix this defect here and also provide a slightly different C++ implementation of the Split and Merge algorithm, using the â€˜Virtual Constructorâ€™ idiom [2].

Right associativity defect

As Silas Brown pointed out, â€œthe Split and Merge algorithm, as described in the article, does not consistently treat operators as right-associative either. If the operator is preceded by another of the same or lower precedence level, then it will be left-associative: 3-12-1 would get -10. But if the operator is preceded by one of a higher precedence level, it will become right-associative with the operators before that: 3-2*6-1 would get -8.â€

To fix this, we need to break out of the merge loop and go back to where we were after any successful merge.

Split and Merge revisited

The split part remains the same: the result of the first step of our algorithm is a list of structures. Each structure consists of two elements â€“ a real number and an action, e.g.

Splitting (â€œ3 - 2 * 6 - 1â€) → [3, â€˜-â€™], [2, â€˜*â€™], [6, â€˜-â€™], [1, â€˜)â€™]

In the merge step the actual calculation is done by merging the elements of the list of structures created in the first step. We attempt to merge each structure in the list with the next, taking into account the priority of actions.

Merging can be done if and only if the priority of the action in the first structure is not lower than the priority of the second structure. If this is the case merging consists in applying the action of the first structure to the numbers of both structures, thus creating a new structure, e.g. merging [5, â€˜-â€™] and [4, â€˜+â€™] will produce [5 - 4, â€˜+â€™] = [1, â€˜+â€™].

What if the second structure priority is higher than the first one?

Then we merge the next (second) structure with the structure next to it, and so on, recursively. We break out of the merge loop and go back where we were after any successful merge.

Continuing with our example of â€œ3 - 2 * 6 - 1â€:

Merging ([3, â€˜-â€™], [2, â€˜*â€™], [6, â€˜-â€™], [1, â€˜)â€™]) →

Merging ([3, â€˜-â€™], Merging ([2, â€˜*â€™], [6, â€˜-â€™]), [1, â€˜)â€™]) →

Merging ([3, â€˜-â€™], [12, â€˜-â€™], [1, â€˜)â€™]) →

Merging ([-9, â€˜-â€™], [1, â€˜)â€™]) → [-9 - 1, â€˜)â€™] = [-10, â€˜)â€™].

Therefore, 3 - 2 * 6 - 1 = -10, which is now correct.

Listing 1 contains the implementation of the fixed Merge part in C++ (the parts different from the earlier version [1] are in bold).

double EZParser::merge(Cell& current, 
  size_t& index, vector<Cell>& listToMerge,
  bool mergeOneOnly)
{
  if (index >= listToMerge.size())
  {
    return current.value;
  }

  while (index < listToMerge.size())
  {
    Cell& next = listToMerge.at(index++);
    while (!canMergeCells(current, next))
    { // If we cannot merge cells yet, go to the
      // next cell and merge next cells first. 
      // E.g. if we have 1+2*3, we first merge 
      // next cells, i.e. 2*3, getting 6, and
      // then we can merge 1+6.
      merge(next, index, listToMerge, true 
        /* mergeOneOnly */);
    }

    mergeCells(current, next);
    if (mergeOneOnly) {
      return current.value;
    }
  }

  return current.value;
}

Listing 1

Implementation using the â€˜Virtual Constructorâ€™ idiom

The implementation of the Split and Merge algorithm in [1] was using pointers to functions. As James Coplien points out, a better, more object-oriented way, would be using the â€˜Virtual Constructorâ€™ idiom [2]:

â€œThe virtual constructor idiom is used when the type of an object needs to be determined from the context in which the object is constructed.â€

In our case these objects are actual functions to be called (e.g. sine, cosine, logarithm, etc.). They are determined only at run time. The advantage of this idiom is that a new function can be added easily to the framework without modifying the main algorithm implementation code.

Listing 2 is how the base class of such functions would look.

class EZParserFunction
{
public:
  static double calculate(const string& data,
      size_t& from, const string& item, char ch) {
    EZParserFunction func(data, from, item, ch);
    return func.getValue(data, from);
  }
  static void addFunction(const string& name,
     EZParserFunction* function) {
    m_functions_[name] = function;
  }

protected:
  EZParserFunction() : impl_(0) {}
  virtual ~EZParserFunction() {}
  virtual double getValue(const string& data,
     size_t& from) {
    return impl_->getValue(data, from);
  }
  double loadArg(const string& data, 
    size_t& from)
    {
      return EZParser::loadAndCalculate(data,
        from, ')');
    }

private:
  EZParserFunction(const string& data, 
    size_t& from, const string& item, char ch);

  EZParserFunction* impl_;
  static map<string, 
    EZParserFunction*> m_functions_;
};

Listing 2

Here is what a derived class for a sine function would look like:

  class SinFunction : public EZParserFunction
  {
    public:
      double getValue(const string& data,
      size_t& from)
    {
      return ::sin(loadArg(data, from));
    }
  }

The functions derived from the EZParserFunction class are used in the first, splitting process. The split process itself will remain the same, just the call to process the last extracted token will be now

  double value = EZParserFunction::calculate(data,
     from, item, ch);

instead of

  double value = m_allFunctions.getValue(data,
     from, item, ch);

Listing 3 is the (virtual) constructor of the base class.

EZParserFunction::EZParserFunction
   (const string& data, size_t& from,
    const string& item, char ch)
{
  if (item.empty() && ch == '(') 
  {
    impl_ = identityFunction;
    return;
  }
  map<string, EZParserFunction*>::
  const_iterator it
  = m_functions_.find(item);

  if (it == m_functions_.end())
  {
    strtodFunction->setItem(item);
    impl_ = strtodFunction;
    return;
  }

  impl_ = it->second;
}

Listing 3

As we can see the two functions, StrtodFunction and IdentityFunction, have a special meaning and must be implemented (see Listing 4).

class StrtodFunction : public EZParserFunction
{
public:
  double getValue(const string& data,
    size_t& from)
  {
    char* x;
    double num = ::strtod(item_.c_str(), &x);
    if (::strlen(x) > 0)
    {
      throw CalculationException("Could not parse
         token [" + item_ + "]");
    }
    return num;
  }
  void setItem(const string& item) {
    item_ = item; }

private:
  string item_;
};

class IdentityFunction : public EZParserFunction
{
public:
  double getValue(const string& data, 
  size_t& from)
  {
    return loadArg(data, from);
  }
};

static StrtodFunction* 
  strtodFunction = new StrtodFunction();

static IdentityFunction*
  identityFunction = new IdentityFunction();

Listing 4

The rest of the functions do not have to be implemented for the algorithm to work and can be added ad hoc (e.g. log, exp, pi, etc.)

Listing 5 is how the user can add functions that she implemented somewhere in her program initialization code:

EZParserFunction::addFunction("exp",
  new ExpFunction());
EZParserFunction::addFunction("log",
  new LogFunction());
EZParserFunction::addFunction("sin",
  new SinFunction());
EZParserFunction::addFunction("cos",
  new CosFunction());
EZParserFunction::addFunction("fabs",
  new FabsFunction());
EZParserFunction::addFunction("pi",
  new ExpFunction());
EZParserFunction::addFunction("sqrt",
  new SqrtFunction());

Listing 5

References

[1] Vassili Kaplan, â€˜Split and Merge â€“ another Algorithm for Parsing Mathematical Expressionsâ€™, CVu, Volume 27, Issue 2, May 2015.

[2] James Coplien, Advanced C++ Programming Styles and Idioms (p. 140), Addison-Wesley, 1992.

Notes:

More fields may be available via dynamicdata ..