How to Build a RegEx Engine in Python (Part 1: The Grammar)
How to build a Regular Expressions Engine Complete With All The Major Features.
Part 1: The Grammar. Building a Regular Expressions Engine Complete With All The Major Features.
In this series of articles, we will follow the steps to build a RegEx Engine in Python. I chose Python over the many alternatives for a few reasons:
- It is well integrated with the Linux command line
- Because yes.
Please let me explain before get mad at me.
Before starting this project I had no idea how things should be done, how to code the various modules, and so on.
I had to figure everything out by myself.
And, when I don’t know how to do things exactly, I find JS a bit too confusing, it’s just a feeling, not a universal truth, and I find that with Java I can organize my ideas better.
But, since I don’t like how verbose and limiting Java is in some way, but wanted classes, and I love having functions as first-class citizens, Python seemed like the perfect blend to me.
Anyway, it’s now time to start with the project itself.
What We Need
To match a string with a regular expression of our will we need to complete a few steps:
- understand the regular expression
- create an internal representation of it (so that we can see if a passed string match with it)
- match the string with the regex, using the internal representation we built of it.
To complete the first two steps we need two components:
- a Lexer (or scanner) to parse the regex in tokens (“words”, generic “pieces of information”). As an example in the sentence “I am beautiful, very beautiful” each word (“I”, “am”, …) is a token of type “word” and the comma is a token of type “punctuation”
- a Parser that reads all the tokens in sequence and “understands” them, building while doing so an internal representation of the regex (which will be a tree, called Abstract Syntax Tree, AST for the friends).
To accomplish the third step we need to build a component able to take as input the AST built by the Parser and a string, and visiting the tree to match each leaf of the tree with some part of the string (and something more, we will discuss later).
This component is named:
So, now that we know the modules we will have to code, let’s build a roadmap:
- define the RegEx grammar we want to recognize
- build the lexer
- build the parser
- build the engine
It’s now time to discuss the grammar we want to recognize.
I assume you know what a formal grammar is and that you understand the EBNF notation, if not, DON’T PANIC, google it.
The regex grammar we will recognize is the following:
Top-level production RE ::= RE_SEQ is indeed useless but is there because I used it in one of the early versions of the project and I was too lazy to remove it later. Anyway, it is completely harmless, I promise, and it will cost you just a couple of lines of code and only one more call on the stack during execution.
The Grammar Explained
The features that the described grammar will be able to implement are the following:
Check out the gist @ https://gist.github.com/lorenzofelletti/6ffd29e669cd6eb7d015305882f680ab
So, valid regexes are:
You can browse the code of the final result here.
I think that can be enough for the first part, more in the next articles of the series.
In the next article, we will set up the environment and start coding, starting from the lexer.
See you soon! I hope you enjoyed this read!
Please feel free to reply if you don’t understand something, need some hints, whatever. I’ll be glad to answer you.
Cover Image by Kevin Horvat on Unsplash
Computer Engineering Student @ UniBo