Disclaimer
          Myth's lexer is built with
          ocamllex,
          since that was the first meaningful result that came up when I searched "lexing with
          OCaml". It's nice enough to use. Either way, the code I've written is pure
garbage. The way newlines, INDENTs, and DEDENTs
are handled is annoying, and the parser and lexer are way too tightly
coupled. But it's the code that's currently running, so whatever.
        
A Whitespace-Sensitive Lexer
          Myth's syntax is heavily inspired by Python. I wanted something that was
          whitespace-sensitive and used indentation to specify code blocks. So I decided
          to look at the Python 3.7
          Grammar to see how they handled whitespace-sensitivity. I saw that a
          lot of the rules depended on INDENT and DEDENT
          tokens. So I'd have to make my lexer figure out when to emit those. However,
          it's often the case that after reading a single \n character
          followed by a non-whitespace character, we will want to emit multiple
          DEDENT tokens. For example,
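Consider a nested snippet like this one (an illustrative Myth example, reconstructed here since the original snippet isn't shown):

```
if a:
  if b:
    if c:
      pass
print(1)
```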
        
          when the lexer sees the \n followed by the p in
          print, three DEDENT tokens need to be emitted,
          one for each if block. So I had to find a way to handle this
          multiple DEDENT nonsense.
        
The Lexer Code
This isn't really the place for an ocamllex tutorial, so we'll jump straight to the code. The meat of the lexer (ignoring the uninteresting rules) looks something like this:
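Here's a hedged sketch of the shape of the main rule (token names, the paren_count variable, and the check_keyword helper are assumptions about the repo's code, not the actual definitions):

```
rule token = parse
  | '\n'
      { if !paren_count = 0 then newline lexbuf (* go count indentation *)
        else token lexbuf }                     (* inside parens: ignore *)
  | '('  { incr paren_count; LPAREN }
  | ')'  { decr paren_count; RPAREN }
  | [' ' '\t']+ { token lexbuf }
  | ['a'-'z' 'A'-'Z' '_']+ as s { check_keyword s }
  | eof  { EOF }
```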
Some important points to notice:
- We have some special logic when handling \n.
- We track parenthesis usage, in order to ignore indents within parenthesized pieces of code across multiple lines.
- Keywords are checked in a function, not in the lexer rules.
          We have a special state variable that has the following
          type and initial value:
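Something along these lines (the constructor names are my guesses; the actual definition may differ):

```ocaml
type lexer_state =
  | CODE            (* lexing ordinary tokens *)
  | RECENT_NEWLINE  (* just saw a '\n'; next step is to count indentation *)

let state = ref CODE
```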
        
We'll see why this is useful soon. We also have a variable
paren_count, initialized to ref 0, which keeps track of whether or not
we are inside a parenthesized expression; if we are, indents and newlines
are ignored. Lastly, keywords are checked in a separate function,
check_keyword, because, per the ocamllex documentation, this should be
done to keep the generated transition table small. The definition is
something like
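Roughly like this (the keyword set and token names here are illustrative; the ocamllex manual's suggestion is a lookup table like this rather than one lexer rule per keyword):

```ocaml
(* Look identifiers up in a keyword table; fall back to a plain name. *)
let keyword_table = Hashtbl.create 16
let () =
  List.iter (fun (k, tok) -> Hashtbl.add keyword_table k tok)
    [ ("if", IF); ("elif", ELIF); ("else", ELSE); ("def", DEF) ]

let check_keyword s =
  try Hashtbl.find keyword_table s
  with Not_found -> NAME s
```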
        
          In addition to the rule token, we also have the rule
          newline, which is defined as follows:
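A hedged sketch of the rule (assuming count_indent is the helper that inspects the space stack):

```
and newline = parse
  | ' '* as spaces { count_indent (String.length spaces) }
```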
        
This counts the spaces immediately after a newline, in order to determine
which indentation level we are currently on. Specifically, we want to determine
          if we should emit an INDENT token, or several
          DEDENT tokens followed by a newline. Here's the definition
          of count_indent:
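A minimal sketch of the idea, assuming a space_stack : int list ref and the token names used earlier (the real definition in the repo is messier):

```ocaml
let space_stack = ref [0]

let count_indent n =
  match !space_stack with
  | top :: _ when n > top ->
      (* deeper than the enclosing block: open a new one *)
      space_stack := n :: !space_stack;
      INDENT
  | top :: _ when n = top ->
      (* same level: just a line break *)
      NEWLINE
  | _ ->
      (* shallower: pop until we land on a preexisting indent width,
         counting how many blocks we closed along the way *)
      let rec pop k =
        match !space_stack with
        | top :: rest when n < top ->
            space_stack := rest;
            pop (k + 1)
        | _ -> k
      in
      DEDENTMANY (pop 0)
```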
        
This is some awful code... Essentially, we keep a stack of integers,
space_stack, that records the number of spaces used to indent each
block. For example, if the lexer is given this piece of code:
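Something like the following (an illustrative snippet chosen so the indent widths are 0, 2, 6, 7, and 8; the original example isn't shown here):

```
if a:
  if b:
      if c:
       if d:
        pass
```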
        
          when it reaches pass, the space stack will have the contents
          [0, 2, 6, 7, 8].  These numbers are the number of spaces used
          in each indent block. So, when we dedent, we need to return to a preexisting
number of spaces; the space stack ensures that we do. Furthermore, when
dedenting, count_indent counts how many DEDENT tokens
need to be output, and reports this using a DEDENTMANY token,
which contains an int (the number of DEDENTs
to output).
        
          Finally, the last part of the lexer is the part that handles outputting
          multiple DEDENT tokens. It works like this: inside the lexer,
          there is a function called token_cache that has the following
          definition:
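A hedged sketch of the mechanism (helper and token names are assumptions):

```ocaml
let cache = ref []

let token_cache lexbuf =
  match !cache with
  | tok :: rest ->
      (* drain any pending tokens first *)
      cache := rest;
      tok
  | [] ->
      (match token lexbuf with
       | DEDENTMANY n ->
           (* expand into n DEDENTs: emit one now, queue the rest *)
           cache := List.init (n - 1) (fun _ -> DEDENT);
           DEDENT
       | tok -> tok)
```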
        
          It essentially keeps a reference to a list (cache) that gets
          filled with DEDENT tokens once a DEDENTMANY
token is output by the lexer. Notice that the parser never expects a
DEDENTMANY token; it is used only internally by the lexer.
        
Using The Lexer
          When looking at ocamllex tutorials, you'll usually see the main lexing rule
          defined as token. Then, in the main program, or wherever the
          lexer needs to be used, you'll see a call to Lexer.token. In
order to use this whitespace-sensitive lexer, you'll call
Lexer.token_cache as your "rule" instead of token.
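So, hooking it up to a Menhir-generated parser looks something like this (Parser.prog is an assumed start-symbol name, not necessarily the one in the repo):

```ocaml
let parse_string s =
  let lexbuf = Lexing.from_string s in
  Parser.prog Lexer.token_cache lexbuf
```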
        
The Parser (parser.mly)
I'm using Menhir for parsing, since it supports incremental parsing reasonably nicely. This is so I can parse multiline code inside a REPL, which will hopefully be touched on in a later blog post. The parser isn't anything too fancy. The most interesting thing is how operator precedence is handled, which I'll detail here. If you're only interested in the whitespace-sensitivity bit, skip this part I guess.
Somewhere deep in my parser, I have the following bit of BNF grammar:
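It's roughly of this shape (an illustrative Menhir sketch using the standard-library list and pair combinators, not the actual rule):

```
expr:
  | e = atom; rest = list(pair(binop, atom))
      { resolve_op_list e rest }
```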
Basically, it's meant to match expressions of the following nature
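For instance, a flat chain of binary operators such as:

```
1 + 2 * 3 - 4
```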
          And the resolve_op_list call on this expression would
          look something like
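Something in this vein (the Ast.Num constructor and the operator/expression list shape are my guesses at the internal representation):

```ocaml
resolve_op_list (Ast.Num 1)
  [ ("+", Ast.Num 2); ("*", Ast.Num 3); ("-", Ast.Num 4) ]
```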
        
          The Ast.Num stuff is just how my language internally represents
          its abstract
          syntax tree.
        
The resolve_op_list code is a bit too long to post here, I think, so check
it out on the
repo.
The way it works isn't too complicated, though. I defined a precedence
for each operator: there's a function that takes in an operator and
outputs a precedence and an associativity.
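A sketch of such a function (the precedence numbers and operator set are invented for illustration):

```ocaml
type assoc = Left | Right

(* i is the operator's index in the flat list of operators; it lets a
   minimum search prefer the left-most or right-most occurrence
   depending on associativity. *)
let prec op i =
  let p, a =
    match op with
    | "^"       -> (8, Right)
    | "*" | "/" -> (7, Left)
    | "+" | "-" -> (6, Left)
    | _         -> (5, Left)
  in
  (* For left-associative operators, bias the key so the right-most
     minimum wins; for right-associative ones, the left-most. *)
  match a with
  | Left  -> (p, -i)
  | Right -> (p, i)
```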
        
          You'll notice it also takes in an integer i. This is the index of the
          operator in a list of operators. The index is passed in so we can also take
          into account
associativity. Namely, when I search for the minimum-precedence element
in a list, I will find either the left-most or the right-most operator of
minimum precedence, depending on the associativity.
        
The rest of resolve_op_list just finds the minimum-precedence operator
and splits the lists of operators and expressions at that operator.
Then it recurses on the two smaller left and right lists.
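In outline, the recursion looks something like this (a simplified sketch, not the repo code; min_prec_index, take, and drop are assumed helpers, and Ast.BinOp is a guessed constructor):

```ocaml
(* Build a tree from the first expression and a list of
   (operator, expression) pairs by splitting at the
   minimum-precedence operator. *)
let rec resolve_op_list e rest =
  match rest with
  | [] -> e
  | _ ->
      let i = min_prec_index rest in             (* split point *)
      let (op, rhs_head) = List.nth rest i in
      let left  = resolve_op_list e (take i rest) in
      let right = resolve_op_list rhs_head (drop (i + 1) rest) in
      Ast.BinOp (op, left, right)
```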
        
The ability to handle operators like this is super useful for two reasons:
- Custom operators are easily added by the user.
- I don't need to write BNF rules for specific operators.
Conclusion
My code is garbage. That's the end of the lexing/parsing bit. The rest of the lexer and parser is just standard stuff that is specific to Myth, and not that generalizable. The next post will probably be on the type system, or my serious issues with mutability, or Hindley-Milner, or idk.