How we built a Lucene-inspired parser in Go
At Mailgun we have numerous systems generating a ton of events every hour of the day. It’s a number so large that no team of people could sort through the Elasticsearch results and expect consistent answers, or maintain their sanity while trying.
Like many companies, we still need to find a way to deal with all the data. This makes for a substantial task: create a system that allows our internal customers to react to events in real-time.
Traditionally, options for solving such a problem include:
Feeding the data into a database and then running database queries, often on a fixed time interval
Map-reduce solutions like Hadoop
Mechanical Turk
Real-time filters
We pride ourselves on doing a lot with very little, so we set out to see how we could make real-time filters work for us. We focused on building a system that allows essentially anyone in the engineering and support teams to write rules that react and perform actions on their behalf.
Brainstorming
When designing this system, we tried to build something that would give everyone the tools to rapidly iterate and respond to threats to our system. It also had to conform to the following rules:
The users should be able to add their own rules at will
The domain-specific language, henceforth referred to as a DSL, should be as simple as possible and ideally look like something they already know.
People should be able to test their rules on the historical data before applying them.
We use Kibana internally, so something that could be tested against it would satisfy point three nicely. This led us to a straightforward conclusion: we should write a Lucene-inspired DSL parser!
The first iteration of the system relied on regular expressions and lots of character-by-character parsing. It got the job done, but subsequent iterations rapidly grew in complexity. If you've ever done any large string parsing project, you know what we mean.
Moving forward, we knew we had to build something that could be iterated upon quickly, and ideally, conformed to a known standard.
Getting grammatical
Those of us who have taken Automata Theory courses in college probably remember grammars in the context of programming languages. It’s also a given that some readers will have far more expertise on the subject than your humble author, so for any in-depth explanations we’ll refer you to your friendly neighborhood search engine.
It's still useful to talk briefly about what makes a grammar and why they're useful. To quote Wikipedia, a Formal Grammar is "...a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form."
At the risk of oversimplifying, it lets us take a string like "You've got mail!" and break it down into tokens that we can semantically analyze. In other words, if our grammar is well defined, we can then write additional code to determine the meaning of that string.
The history and types of grammars can fill tons of books and are thus well beyond the scope of a blog post. Instead, we'll zero in on context-free grammars and more specifically, parsing expression grammars.
Context-free grammars and parsing expression grammars
Quoting Wikipedia again, a Context-Free Grammar is "a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements."
To put it another way, a context-free grammar defines a graph which in turn defines how to parse a language. As mentioned above, it does not tell us how to understand the language. It’s also worth mentioning, as we’ll see below, that the graph may have cycles, but parsing still eventually terminates.
Context-free grammars have a cousin of sorts in the Parsing Expression Grammar, or PEG. They look a lot like a context-free grammar, or CFG, but come with a very useful distinction: when confronted with an ambiguous choice, the PEG will always choose the first production rule defined. The CFG instead leaves the choice ambiguous, and why bother with ambiguity when we don’t have to?
What this really means is that it's easier to write tools to define PEGs. Fortunately, many people have already done that for us.
Taking flight
After some research, we settled on Pigeon as the basis of our PEG. Pigeon follows the same paradigm as a lot of Go tools by generating code that compiles alongside your program.
Using it is easy enough: You define both the grammar and the Go code to handle each of the rules in the same package, and simply call Parse() on the generated parser. The challenge is to write handlers for each of those rules. Like grammars, the topic of compilation could fill a library with books. Fortunately for us, writing an interpreter is less complex and the resulting code is sufficiently fast for our needs.
Furthermore, the grammar syntax aligns closely with Go itself, which makes it easier to write your PEG alongside the Go program that uses it.
Like a leaf on the wind
Say you have an event generated by sending an email, encoded in JSON, that looks like the following:
{
"event": "sent",
"subject": "A special offer just for you!",
"account": "12345",
}
There are a couple of ways to look at this event: you could decide this is an innocuous email, or you could say "This might be spam, but there's not enough information to decide". This is a very common problem in our system, and always requires lots of aggregated data to come to a conclusion.
So, let's write a rule that will match the above event:
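event:"sent" AND subject:"A special offer just for you!"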
This is fairly easy to parse, and better yet, it can also be checked in Kibana!
Now let's walk through constructing a very simple PEG that would match this rule. We'll implement each production rule one at a time and explain what it does. Rule names are typically PascalCase, though numbers and underscores are also allowed. Each rule takes the form:
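RuleName <- Expression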
Note that there are actually several legal characters for the rule definition operator, but we like <- as it's easy to type and looks most like grammars as they're defined academically.
The first rule in the grammar is treated as the entrypoint:
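Input <- Term !.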
Our Input rule says "match the entire input string", with !. meaning "match the end of the input". It then passes the entire input string to a rule named Term, which looks like the following:
This passes control-flow to the Variable production rule:
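Variable <- FieldChars+ ':' Value   // a field name, a literal ':', then the quoted value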
Now we're getting a little more complex. First, we greedily match characters that satisfy the FieldChars rule. Note the + at the end: PEG syntax shares a lot in common with regular expressions, so this is saying "match the rule FieldChars one or more times". FieldChars itself looks like this:
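FieldChars <- [a-z]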
Again, if you're familiar with regex syntax this should be straightforward: simply match any single character between "a" and "z".
The next new rule, _, says "match any whitespace character". Note that _ is a completely valid identifier for a production rule, and less typing is usually a good thing.
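_ "whitespace" <- [ \n\t\r]*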
There are a couple of things to note here:
The string "whitespace" is the "friendly name" and exists for documentation purposes and your future sanity.
As with regular expression syntax, this will match any form of white space, newlines, tabs or carriage-returns.
Next, we match the string literal "AND". It doesn't get much easier than that.
Next, we have another whitespace rule, and then finally the Value rule.
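Value <- '"' ValueChars* '"'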
This rule matches the string literal '"', zero or more ValueChars rules, and lastly another quote.
ValueChars looks like the following:
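ValueChars <- [a-zA-Z0-9 !]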
ValueChars will match the lowercase and uppercase alphabet, any number, spaces, and the exclamation point.
Why did we define Value this way? Because the rule can strip the double quotes off for us so we don't have to, plus we're lazy. It's merely a convenience, though; we could have omitted it and inlined Value's definition directly in the Variable rule.
Finally, AndVar should look fairly straightforward at this point. Note that it refers back to Term and implements the aforementioned cycle:
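AndVar <- _ "AND" _ Term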
The complete definition looks like the following:
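Input <- Term !.

Term <- Variable AndVar*

Variable <- FieldChars+ ':' Value

AndVar <- _ "AND" _ Term

Value <- '"' ValueChars* '"'

FieldChars <- [a-z]

ValueChars <- [a-zA-Z0-9 !]

_ "whitespace" <- [ \n\t\r]*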
Cool, but what now?
The real value of defining your own grammar comes when you provide implementations for the actions of the rules. Let's look at the real PEG definition:
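The listing below is a representative sketch rather than a verbatim copy of our production grammar; the node and helper names (Node, VariableNode, AndNode, flatten) are illustrative.

{
package main

import "fmt"

// Node is anything that can be evaluated against an event.
type Node interface {
	Evaluate(event map[string]interface{}) (bool, error)
}

// VariableNode compares a single field in the event to a literal value.
type VariableNode struct {
	Field string
	Value string
}

func (n VariableNode) Evaluate(event map[string]interface{}) (bool, error) {
	value, ok := event[n.Field]
	if !ok {
		return false, fmt.Errorf("Field '%s' not present in the event", n.Field)
	}
	return value == n.Value, nil
}

// AndNode evaluates to true only if all of its terms evaluate to true.
type AndNode struct {
	Terms []Node
}

func (n AndNode) Evaluate(event map[string]interface{}) (bool, error) {
	for _, term := range n.Terms {
		ok, err := term.Evaluate(event)
		if err != nil || !ok {
			return false, err
		}
	}
	return true, nil
}

// flatten collapses the empty-interface values Pigeon hands back into a string.
func flatten(v interface{}) string {
	switch t := v.(type) {
	case []interface{}:
		out := ""
		for _, e := range t {
			out += flatten(e)
		}
		return out
	case []byte:
		return string(t)
	case string:
		return t
	}
	return ""
}
}

Input <- t:Term !. {
	return t, nil
}

Term <- v:Variable rest:AndVar* {
	terms := []Node{v.(Node)}
	others, _ := rest.([]interface{})
	for _, other := range others {
		terms = append(terms, other.(Node))
	}
	if len(terms) == 1 {
		return terms[0], nil
	}
	return AndNode{Terms: terms}, nil
}

AndVar <- _ "AND" _ t:Term {
	return t, nil
}

Variable <- field:FieldChars+ ':' value:Value {
	return VariableNode{Field: flatten(field), Value: value.(string)}, nil
}

Value <- '"' ValueChars* '"' {
	// Strip the surrounding double quotes so callers don't have to.
	text := string(c.text)
	return text[1 : len(text)-1], nil
}

FieldChars <- [a-z]

ValueChars <- [a-zA-Z0-9 !]

_ "whitespace" <- [ \n\t\r]*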
There's a lot to unpack here so let's step through it.
At the top, we have a section of code surrounded by curly braces. This is a Pigeon convention that takes the enclosed code verbatim and injects it into the final generated code output. Note that you can define types relevant to your grammar here for convenience, or you can put them in another file, since Pigeon will use the package name you declare here. Everything after that is the definition of the grammar itself.
At this point, you've probably noticed that the rules differ a bit from how we defined them above. Let's take a look at Term:
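Term <- v:Variable rest:AndVar* {
	terms := []Node{v.(Node)}
	others, _ := rest.([]interface{})
	for _, other := range others {
		terms = append(terms, other.(Node))
	}
	if len(terms) == 1 {
		return terms[0], nil
	}
	return AndNode{Terms: terms}, nil
}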
This is nearly our Term definition from before, but now we've made it useful. Prefixing a rule in the definition with name: assigns the return value of that rule to name, which is then passed to the function defined by your rule action when the code is generated. All of the code between the curly braces becomes a function associated with that rule and is called when the rule is matched.
Here we can see that Pigeon deals entirely in empty interface types. It's great for flexibility but less great for writing the implementation as we have to account for them.
As mentioned previously, we can have cycles in rules. Given this, we know that it's possible to receive any number of Terms mapped to the rest variable. Pigeon deals with this by handing us a slice of empty interfaces. We've created a Node interface that everything has to be type-asserted into before we can use it.
After walking the entire list of AND’d variables, we create an AndNode containing each of the Terms. The definition of the AndNode can be found in the code literal section defined at the top of the PEG. As an implementer of the Node interface, it defines an Evaluate method which says the expression evaluates to true only if all of the individual terms also evaluate to true.
At this point the rest of the definition should hopefully make sense, so let's skip to running the thing. The following code listing shows parsing our rule, using it, and a couple of extra examples to show failures and error handling:
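// main.go: a sketch that assumes the grammar above was generated into this package.
package main

import "fmt"

func main() {
	event := map[string]interface{}{
		"event":   "sent",
		"subject": "A special offer just for you!",
		"account": "12345",
	}

	rules := []string{
		`event:"sent" AND subject:"A special offer just for you!"`,
		`event:"found" AND subject:"A special offer just for you!"`,
		`event:"sent" AND badfield:"A special offer just for you!"`,
	}

	for _, rule := range rules {
		// Parse is generated for us by Pigeon.
		parsed, err := Parse("rule", []byte(rule))
		if err != nil {
			fmt.Printf("Rule '%s' failed to parse with error '%s'\n", rule, err)
			continue
		}
		matched, err := parsed.(Node).Evaluate(event)
		if err != nil {
			fmt.Printf("Rule '%s' failed with error '%s'\n", rule, err)
			continue
		}
		fmt.Printf("Rule '%s' matched: %v\n", rule, matched)
	}
}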
All you have to do is run:
~> go get github.com/mna/pigeon
~> pigeon rules.peg > rules.go
~> go run .
Rule 'event:"sent" AND subject:"A special offer just for you!"' matched: true
Rule 'event:"found" AND subject:"A special offer just for you!"' matched: false
Rule 'event:"sent" AND badfield:"A special offer just for you!"' failed with error 'Field 'badfield' not present in the event'
Congratulations! You've now written your own language of extremely limited utility!
Considerations
At this point you may be thinking:
Don't a lot of senders use subjects like that?
Is a rules engine that only matches string literals all that useful?
Couldn't this post have been half as long?
To which we would respond:
We would want to implement some kind of handling for the account as defined in the original event. Mailgun solves this problem by including business logic in the codebase to support rate-limiting based on a defined field. In other words, we can say "If this rule matches, save the account and current time and don't perform this action against this account for the next N minutes." Obviously, if we didn't have this handling and the sender in question sent one million emails, support would be notified one million times. That'd be a great way to drown your support team in noise and also get jumped in the parking lot after work. (A small sketch of this idea follows this list.)
One could implement rules to parse regular expressions and substrings, which would be far more useful than an explicit string match. In fact, we've done exactly that.
Sorry.
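To make the first point concrete, the cooldown bookkeeping can be as small as something like this (a hypothetical sketch, not our production code):

// cooldown.go: a hypothetical per-account rate limiter for rule actions.
package main

import (
	"sync"
	"time"
)

// Cooldown remembers when an action last fired for each account.
type Cooldown struct {
	mu   sync.Mutex
	wait time.Duration
	seen map[string]time.Time
}

func NewCooldown(wait time.Duration) *Cooldown {
	return &Cooldown{wait: wait, seen: make(map[string]time.Time)}
}

// Allow reports whether the action may fire for this account and, if so,
// records the current time so it won't fire again until the wait period passes.
func (c *Cooldown) Allow(account string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.seen[account]; ok && time.Since(last) < c.wait {
		return false
	}
	c.seen[account] = time.Now()
	return true
}

Gate the notification (or any other action) behind Allow(account) and support only hears about a given account once per window.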
Lastly, you likely noticed we didn't define what the rule would do here when it does match. One possible solution is to have the rule post to a Slack channel when it’s triggered. This is a common solution here at Mailgun, and looks like the following:
slack:#channel-name
Next steps
Hopefully, at this point, your mind is spinning with possibilities. Here are some thoughts we’ve had:
There’s much more of the Lucene syntax to emulate, like NOT and OR operators and parentheses.
Provide a separate grammar for actions.
Implement templatization to inject values from the event into the action.
Layer on syntactic sugar to make writing some filters easier.
Implement a system to count and react to a specific number of event occurrences over a time period.
Implement a cache, because parsing the rules is somewhat expensive.
There are loads more you can do, especially when it comes to fleshing out the action parser. You could call other services, notify customers, turn the lights on and off, and more. When it comes to inventing your own tools based on this, the sky’s the limit.
Pros and cons
So at this point, you might be asking yourself: "why didn't they use $SOME_OTHER_TOOL?"
The answer is because we don't yet need the power of any of those tools nor the complexity that frequently comes with operating them. Our current stream processing tool is a single, simple binary deployed in a container that Just Works™. There's very little to manage, and it's readily keeping up with our very high volume event feed.
Pros:
It's easy to develop
It's still using off-the-shelf, FOSS tools
It can be tailored to exactly our needs
The syntax for our DSL is tiny and easy to articulate and reason about, and it forces everyone to write rules roughly the same way. It also makes it harder to create unintended side-effects.
Cons:
Language features beyond what we've already implemented represent a substantial leap in complexity.
Updating the source PEG will generate new code, and it really mucks up your code diffs.
Lucene’s deliberate simplicity can make it difficult to write more complex rules.
We may eventually decide that $SOME_OTHER_TOOL is more suitable for our needs, but for the foreseeable future, the power our Mailgun DSL brings to the table is more than sufficient.
The end, finally
At this point, we hope we've demonstrated how writing your own DSL could be useful. Our implementation processes an enormous number of events per day while making the lives of many of our team members much easier (at least that’s what we tell them).
As for the features it supports, we're just getting started. If this has inspired you to look into implementing your own stream processing grammar, please let us know. We'd love to hear about it!