Lexing Python Indentation using Spirit.Lex
Python is well known for its whitespace-sensitive grammar: the indentation level provides block scoping in the language. While it is possible to parse Python using straight Spirit.Qi alone, mixing low-level indentation handling into the parsing level gets messy. Fortunately, Spirit.Lex can help here by transforming the indentation into proper BEGIN and END tokens, which simplifies the parsing logic later.
Here's how to deal with indentation: for each line of the input, the current line's indentation level is compared with the previous one stored on a stack:
- If the indentation increases, the new level is pushed onto the stack and a BEGIN token is generated.
- If the indentation is unchanged, nothing happens and lexing just continues.
- If it decreases compared with the top indentation number on the stack, the top number is popped off the stack and a corresponding END token is generated, repeatedly, until the same indentation number appears on top of the stack. If that number cannot be found on the stack, it is an indentation error.
At the end of the input, an END token is generated for each indentation level remaining on the stack (until only 0 remains).
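Stripped of all Spirit.Lex machinery, the stack algorithm above can be sketched in plain C++. This is only an illustration of the token-emission logic; `IndentTokens` and its string tokens are made-up names, not part of the lexer developed below:

```cpp
#include <cassert>
#include <stack>
#include <stdexcept>
#include <string>
#include <vector>

// Turn a sequence of per-line indentation widths into a stream of
// BEGIN/END pseudo-tokens, following the stack algorithm described above.
std::vector<std::string> IndentTokens(const std::vector<int>& levels) {
    std::stack<int> scopes;
    scopes.push(0);  // the input starts at indentation level 0
    std::vector<std::string> tokens;
    for (int level : levels) {
        if (level > scopes.top()) {          // deeper: open a new block
            scopes.push(level);
            tokens.push_back("BEGIN");
        } else {
            while (level < scopes.top()) {   // shallower: close blocks
                scopes.pop();
                tokens.push_back("END");
            }
            if (level != scopes.top())       // no matching level on the stack
                throw std::runtime_error("inconsistent dedent");
        }
    }
    while (scopes.top() != 0) {              // end of input: close what remains
        scopes.pop();
        tokens.push_back("END");
    }
    return tokens;
}
```

For the indentation pattern of the sample program lexed later (levels 0, 4, 8, 0, 4), this produces BEGIN, BEGIN, END, END, BEGIN, END.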
For simplicity, let's handle just spaces for now and ignore tabs.
The first shot is to match zero or more spaces at the beginning of the line and then calculate the total length of the spaces in a semantic action:
token_def<> indent_("^[ ]*");
token_def<> identifier_("[a-zA-Z_]\\w*");

this->self
    = indent_[_val = _end - _begin]
    | identifier_
    ;
However, this won't work for lines without spaces at the beginning.
a_variable_without_any_indent
The above will only match identifier_ but never a zero-length indent_. It seems there is no way to match a pseudo beginning-of-line anchor with Spirit.Lex.
The workaround is to match the newline manually and switch to another lexer state dedicated to indentation handling.
#include <stack>
#include <boost/spirit/include/lex_lexertl.hpp>

namespace lex = boost::spirit::lex;

template <typename Lexer>
class Tokens : public lex::lexer<Lexer> {
public:
    Tokens()
        : indent_("[ ]*")
        , newline_("[\\n\\r\\f]+")
        , whitespace_("[ \\t]+")
        , if_("if")
        , identifier_("[a-zA-Z_]\\w*")
        , equal_("==")
        , colon_(':')
        , assign_('=')
        , integer_("\\d+")
    {
        this->self
            = indent_[HandleIndent(scopeLevels_)]
            ;
        this->self("NORMAL")
            = newline_[HandleNewline(scopeLevels_)]
            | whitespace_[lex::_pass = lex::pass_flags::pass_ignore]
            | if_
            | equal_
            | colon_ | assign_
            | integer_
            | identifier_
            ;
    }

private:
    lex::token_def<> indent_;
    lex::token_def<> newline_, whitespace_;
    // keywords
    lex::token_def<> if_;
    lex::token_def<> identifier_;
    // operators
    lex::token_def<> equal_;
    // delimiters
    lex::token_def<> colon_, assign_;
    // literals
    lex::token_def<> integer_;
    std::stack<int> scopeLevels_;
};
Here we use the default INITIAL lexer state for indentation handling and the NORMAL state for all other token matching. The indentation handling must live in the INITIAL state in order to handle the first line, since there is no newline before it.
We use a stack, scopeLevels_, to store the indentation levels:
enum TokenIds {
    kTokenBegin = 1000,
    kTokenEnd = 1001,
};
class HandleIndent {
public:
    HandleIndent(std::stack<int>& scopeLevels)
        : scopeLevels_(scopeLevels)
    {}

    template <typename Iterator, typename IdType, typename Context>
    void operator() (
        Iterator& start,
        Iterator& end,
        BOOST_SCOPED_ENUM(lex::pass_flags)& passFlag,
        IdType& id, Context& context) {
        int const level = std::distance(start, end);
        if (scopeLevels_.empty()) {
            if (level != 0) {
                throw std::runtime_error("Indent must start from 0!");
            }
            scopeLevels_.push(level);
        }
        // If the level is the same, just ignore it and switch to the normal state
        if (level == scopeLevels_.top()) {
            passFlag = lex::pass_flags::pass_ignore;
            context.set_state_name("NORMAL");
        }
        // If the level is larger, emit BEGIN and push the new level onto the stack
        else if (level > scopeLevels_.top()) {
            scopeLevels_.push(level);
            id = kTokenBegin;
            context.set_state_name("NORMAL");
        }
        // If the level is smaller, emit END and pop the top level off the stack
        else { // dedent, end of scope
            scopeLevels_.pop();
            // The level must be one of the numbers occurring on the stack
            if (scopeLevels_.empty() || level > scopeLevels_.top()) {
                throw std::runtime_error("Dedenting failed, could not find a matching previous indent");
            } else {
                // Make it a zero-length match; this trick lets us generate multiple END tokens
                end = start;
                id = kTokenEnd;
            }
        }
    }

private:
    std::stack<int>& scopeLevels_;
};
class HandleNewline {
public:
    HandleNewline(std::stack<int>& scopeLevels)
        : scopeLevels_(scopeLevels)
    {}

    template <typename Iterator, typename IdType, typename Context>
    void operator() (
        Iterator& start,
        Iterator& end,
        BOOST_SCOPED_ENUM(lex::pass_flags)& passFlag,
        IdType& id, Context& context) {
        if (context.get_eoi() == end) {
            // The end of input is reached: pop all the non-zero levels,
            // generating a corresponding END token for each of them.
            // Here we apply the zero-length match trick again to emit
            // multiple END tokens from the same position.
            if (scopeLevels_.top() != 0) {
                scopeLevels_.pop();
                end = start;
                id = kTokenEnd;
                return;
            }
        }
        passFlag = lex::pass_flags::pass_ignore;
        context.set_state_name("INITIAL");
    }

private:
    std::stack<int>& scopeLevels_;
};
Notice how the zero-length match trick is applied to generate multiple tokens from the same position.
Now, let's try lexing some Python code:
if a == 1:
    if b == 0:
        c
if d == 3:
    e
Here's the result:
if
a
==
1
:
BEGIN
if
b
==
0
:
BEGIN
c
END
END
if
d
==
3
:
BEGIN
e
END
Spirit.Qi can then treat the BEGIN and END tokens as block delimiters, exactly like curly braces in C/C++.
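Once indentation has been flattened into BEGIN/END pairs, block structure becomes ordinary delimiter matching, just like braces. A minimal standard-library sketch of what the downstream consumer sees (token strings and the `CheckBlocks` helper are illustrative, not Spirit.Qi API):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Check that BEGIN/END tokens nest properly, exactly as a parser
// would match '{' and '}' in C/C++. Returns the maximum nesting
// depth, or -1 if the stream is unbalanced.
int CheckBlocks(const std::vector<std::string>& tokens) {
    int depth = 0, maxDepth = 0;
    for (const std::string& tok : tokens) {
        if (tok == "BEGIN") {
            maxDepth = std::max(maxDepth, ++depth);
        } else if (tok == "END") {
            if (--depth < 0) return -1;  // END without a matching BEGIN
        }
    }
    return depth == 0 ? maxDepth : -1;   // unclosed BEGIN at end of input
}
```

Running this over the token stream above yields a maximum nesting depth of 2, confirming the two BEGIN/END pairs around the nested `if` bodies are balanced.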
There's still a limitation, however: there must be a newline at the end of the input, otherwise the end of input cannot be detected in the HandleNewline function. A possible workaround I could think of is an iterator adapter that appends the missing newline at the end of the input when needed.
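Short of writing a full iterator adapter, a pragmatic alternative is a pre-pass over the input buffer that guarantees the trailing newline before the lexer ever sees it. This is a simpler stand-in for the adapter idea, not the adapter itself; `EnsureTrailingNewline` is a hypothetical helper name:

```cpp
#include <cassert>
#include <string>

// Ensure the buffer ends with a newline so the end-of-input check in
// the newline handler can fire; a cheap stand-in for an iterator
// adapter that would synthesize the newline lazily.
std::string EnsureTrailingNewline(std::string input) {
    if (input.empty() || (input.back() != '\n' && input.back() != '\r'))
        input += '\n';
    return input;
}
```

The trade-off is an extra copy of the input, which the iterator-adapter approach would avoid.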
It would make lexing Python much easier if Spirit.Lex could match pseudo tokens such as:
- Beginning of a line
- End of a line
- Beginning of the input
- End of the input
Currently there is no clean way in Spirit.Lex to fail the whole lexing process from inside a semantic action. Assigning the pass_failed flag won't work, since it only fails the current token's match. In the code above an exception is thrown to stop the lexing process, but I do hope for built-in support for this.
Finally, Spirit.Lex seems to lack the kind of debug support Spirit.Qi provides; something comparable would be very handy for troubleshooting. (Spirit.Lex actually does have debug support via the macro BOOST_SPIRIT_LEXERTL_DEBUG.)