Revelry

Parse the Parcel

I recently went on a parsing journey, and it led to yet another wonderful Dashbit library: NimbleParsec. It all started when my colleague Daniel Andrews began building MCP clients with Elixir. This approach opens up enormous potential, and I decided to write a tool that allows for querying and retrieving data using the Delta Sharing protocol. This was new ground for me – I’m used to SQL and Excel, but Delta Sharing uses the column-based Apache Parquet format to store data.

As part of this implementation, I needed to parse SQL-like predicate strings into structured Elixir data that I could use to filter Parquet files. For example, I needed to transform:

  • "project_id = 123" → {:eq, "project_id", 123}
  • "status != 'closed'" → {:neq, "status", "closed"}
  • "created_at >= '2024-01-01'" → {:gte, "created_at", "2024-01-01"}

These tuples could then be passed to the Explorer library to actually filter the data in the parquet files (for example, DataFrame.filter(df, col("project_id") == 123)). So the question became: how best to parse these predicate strings?
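To sketch that last step: here is roughly how one of those tuples could be applied to a DataFrame. Note that apply_predicate/2 is a hypothetical helper of my own naming, and the Explorer calls (DataFrame.filter_with/2 and the Explorer.Series comparison functions) are my reading of the relevant API – treat this as an illustration, not the actual implementation from the project:

```elixir
# Hypothetical helper: turn a parsed predicate tuple into an Explorer filter.
defp apply_predicate(df, {op, column, value}) do
  # Map our operator atoms onto Explorer.Series comparison functions
  compare =
    case op do
      :eq -> &Explorer.Series.equal/2
      :neq -> &Explorer.Series.not_equal/2
      :gt -> &Explorer.Series.greater/2
      :gte -> &Explorer.Series.greater_equal/2
      :lt -> &Explorer.Series.less/2
      :lte -> &Explorer.Series.less_equal/2
    end

  # filter_with/2 gives us a lazy frame; we compare the named column to the value
  Explorer.DataFrame.filter_with(df, fn frame -> compare.(frame[column], value) end)
end
```

The nice thing about this shape is that adding a new operator later means adding one clause to the case, plus one parser entry – nothing else changes.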

At first, I reached for String.split/2, and my implementation looked a little like this:

defp parse_predicate(predicate) when is_binary(predicate) do
  cond do
    String.contains?(predicate, " >= ") ->
      case String.split(predicate, " >= ", parts: 2) do
        [col, val] -> {:gte, String.trim(col), parse_filter_value(String.trim(val))}
        _ -> nil
      end
    #... identical conditions for " <= ", " != ", " = ", " > ", " < "
  end
end

defp parse_filter_value(value) do
  # trim_value/1 strips surrounding quotes (definition omitted here)
  trimmed = trim_value(value)

  cond do
    trimmed in ["true", "TRUE"] -> true
    trimmed in ["false", "FALSE"] -> false
    trimmed in ["null", "NULL"] -> nil
    Regex.match?(~r/^-?\d+$/, trimmed) -> String.to_integer(trimmed)
    Regex.match?(~r/^-?\d+\.\d+$/, trimmed) -> String.to_float(trimmed)
    true -> trimmed
  end
end

And it worked! The code passed review, and got merged in.

You might already be able to spot some of the problems with the code above:

  • The order of the cond clauses in parse_predicate/1 is crucial: >= has to be checked before > (otherwise it breaks)
  • It’s inefficient: if the operator appears at position n in the cond, the string gets scanned from the beginning n times (once for each String.contains?/2 check until it finds a match)
  • Every branch of that cond is identical apart from the operator it splits on – pure duplication
  • It’s not extensible at all
  • Regex¹

I knew the code was a bit smelly and brittle, but I wasn’t really sure how to solve it. Luckily, one of the reviewers, who wrote about how this sort of thing can come back to bite you later, gently suggested that I should look into NimbleParsec.

A Hard Pass on Regex

Before we go further, I know what you’re thinking: “Why not regex?”

~r/(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+)/

And then, you can capture all the parts:

Regex.run(~r/(\w+)\s*(>=|<=|!=|=|>|<)\s*(.+)/, "price >= 99.99")
#=> ["price >= 99.99", "price", ">=", "99.99"]

Well – in my opinion, Regex is ugly, hard to read and begging to be broken. It might take up fewer lines of code, but it wouldn’t really solve my problem. My code is still brittle, and besides, I still need to parse the result – "99.99" is not a float. And if all goes wrong and it does break, ideally, I want a clear error message telling me exactly how my parsing failed, e.g.: “Error: expected operator after column name”.

Plus, any extension I’d like to make to expand the functionality of my parser would entail deciphering my Regex hieroglyphics, adding it in a way that doesn’t break, and then updating my documentation. It’s just not going to work how I want it to.

So – back to the suggestion: NimbleParsec

Practical Parsing

NimbleParsec is a parser combinator library by Dashbit. Essentially, it provides a way to write small parsers, and then combine them together to perform more complicated parsing operations – building up your parser, block by block.

Does that help you in any way to understand what it is? Maybe not – so let’s jump straight into an example:

defmodule SimpleParser do
  import NimbleParsec

  # A simple parser for "name = value", which we're calling :my_cool_parser, for obvious reasons
  defparsec :my_cool_parser,
    ascii_string([?a..?z], min: 1)
    |> ignore(string(" = "))
    |> integer(min: 1)
end

SimpleParser.my_cool_parser("age = 42")
#=> {:ok, ["age", 42], "", %{}, {1, 0}, 8}

Let me break this down for you. The parser definition defparsec defines how the parser :my_cool_parser is going to act as it tries to move through the input left to right.

First, we try to grab (technical term) one or more characters from the input, matching against the allowed characters (here, any character from a to z). The ? in front of the letter means that we’re actually matching the code points (97 for a, and 122 for z) of those characters.

Next, we ignore the equals sign and the spaces around it – and then finally, we grab an integer at least one digit long. And that’s it! The success output consists of an :ok tuple, which gives us:

  • a list of the things that were parsed successfully – here, ["age", 42]
  • the remaining unparsed input – we’ve parsed everything, but if we attempted to parse "age = 42 extra stuff" this would be " extra stuff".
  • A context map, which can be used to pass state
  • A positioning tuple, indicating the line where parsing ended and the byte offset at which that line begins
  • The byte offset – ie, the total bytes consumed.
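To make the “remaining unparsed input” part concrete, here’s the same parser run against input with trailing text – the parser happily consumes what it can and hands back the rest:

```elixir
SimpleParser.my_cool_parser("age = 42 extra stuff")
#=> {:ok, ["age", 42], " extra stuff", %{}, {1, 0}, 8}
```

The byte offset of 8 tells us exactly how far the parser got: the length of "age = 42".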

But what about inputs that are malformed – how does it fail?

SimpleParser.my_cool_parser("age = ")
#=> {:error, "expected ASCII character in the range '0' to '9'", "", %{}, {1, 0}, 6}

Now that’s the kind of error message you want from a parser – it provides the reason why it failed, and then basically the same information you get from a success: the remaining input "", the context map, and the byte information.

Let’s see how this is working under the hood. The docs mention that “combinators are composed programmatically and compiled into multiple clauses with binary matching” – and if you want to see what that means in practice, you can add a debug: true to the defparsec:

defparsec :my_cool_parser, 
  ascii_string([?a..?z], min: 1)
  |> ignore(string(" = "))
  |> integer(min: 1),
  debug: true

You’ll then see the generated clauses which NimbleParsec is actually using to binary match the input into the parsed result:

defp my_cool_parser__0(rest, acc, stack, context, line, offset) do
  my_cool_parser__1(rest, [], [acc | stack], context, line, offset)
end

defp my_cool_parser__1(<<x0, rest::binary>>, acc, stack, context, comb__line, comb__offset) when x0 >= 97 and x0 <= 122 do
  my_cool_parser__2(rest, [<<x0::integer>>] ++ acc, stack, context, comb__line, comb__offset + 1)
end

# ... many more horrifying clauses ...

{:module, SimpleParser,
 <<70, 79, 82, 49, 0, 0, 23, 0, 66, 69, 65, 77, 65, 116, 85, 56, 0, 0, 2, 78, 0,
   0, 0, 52, 19, 69, 108, 105, 120, 105, 114, 46, 83, 105, 109, 112, 108, 101,
   80, 97, 114, 115, 101, 114, 8, 95, 95, ...>>,
 [
   my_cool_parser__0: 6,
   my_cool_parser__1: 6,
   my_cool_parser__1: 6,
   my_cool_parser__2: 6,
   my_cool_parser__2: 6,
   my_cool_parser__4: 6,
   my_cool_parser__3: 6,
   my_cool_parser__5: 6,
   my_cool_parser__5: 6,
   my_cool_parser__6: 6,
   my_cool_parser__7: 6,
   my_cool_parser__7: 6,
   my_cool_parser__8: 6,
   my_cool_parser__8: 6,
   my_cool_parser__10: 6,
   my_cool_parser__9: 6,
   my_cool_parser__11: 6
 ]}

I know I complained about Regex hieroglyphics before, but the binary matching code that is actually being executed by NimbleParsec under the hood is absolutely insane.² But it’s under the hood – so we don’t really have to think about it when we’re writing and using the parser.

Parsing Predicates Properly

Okay – so now we know how that simple example works, I’m going to walk you through how I rebuilt my predicate parser using NimbleParsec. We’ll start small and build up. If you just want to see the completed module, there’s a link at the bottom of this article!

Parsing Column Names

First, we need to parse column names like project_id or user.email:

column_name =
  [?a..?z, ?A..?Z, ?0..?9, ?_, ?.]
  |> ascii_string(min: 1)
  |> unwrap_and_tag(:column)

Just like before, we want to first grab some ASCII characters, but this time we’ll accept letters, numbers, underscores, or dots (the allowed characters for a column name in the Delta Sharing protocol). We also want to label this output as a column, so we can use the unwrap_and_tag function to take this piece of the parsed string and wrap it in a tuple: i.e., {:column, "project_id"}. This will make it far easier to build our final result later.

Parsing Operators

Next, the operators:

operator =
  [
    ">=" |> string() |> replace(:gte),
    "<=" |> string() |> replace(:lte),
    "!=" |> string() |> replace(:neq),
    "=" |> string() |> replace(:eq),
    ">" |> string() |> replace(:gt),
    "<" |> string() |> replace(:lt)
  ]
  |> choice()
  |> unwrap_and_tag(:op)

This works through all the operators in order. “But Stu!”, I hear you cry (which is concerning as presumably you’re quite far away from me), “I thought the whole point of this re-write was to make the code less brittle and not reliant on order!”

Well, sorry. Order still matters. But now it’s explicit – it’s documented in the structure of the code itself, rather than just shoved into a cond with thoughts and prayers.

We pipe all of these operator possibilities into a choice/1 – which is a combinator that just means, “try each of these parsers in order, and use the first one that succeeds.”

When one of the operators match (let’s say, ">="), we replace/2 it with the atom :gte which is the atom we’re actually going to pass onto the Explorer library down the line to filter our parquet files (the whole point of this exercise).

Parsing Values

In my original code, the value parsing was scattered to the four winds across multiple functions and full of potential for edge cases to mess it up. With NimbleParsec on the other hand, it’s as straightforward as checking for strings, numbers and literals (and no, order doesn’t matter here).

Quoted Strings:

quoted_string =
  choice([
    "'"
    |> string()
    |> ignore()
    |> repeat(
      choice([
        "\\'" |> string() |> replace(?'),
        "\\\"" |> string() |> replace(?"),
        utf8_char(not: ?')
      ])
    )
    |> ignore(string("'"))
    |> reduce({List, :to_string, []}),
    "\""
    |> string()
    |> ignore()
    |> repeat(
      choice([
        "\\\"" |> string() |> replace(?"),
        "\\'" |> string() |> replace(?'),
        utf8_char(not: ?")
      ])
    )
    |> ignore(string("\""))
    |> reduce({List, :to_string, []})
  ])

Now I know that looks nasty, but bear with me. It’s designed to handle single and double quotes, escaped quotes inside strings (\' and \"), and UTF-8 characters – and the more it can do, the longer it’s going to get. I apologise.³

We start with a choice/1. Is it a double quoted string, or a single quoted string? Let’s use the following line as an example:

"He said my Regex was, quote, 'a piece of crap'"

We’d fail the single-quoted string check ("'") and match the double-quoted string check ("\""), so the rest of our parsing will take place in that branch of the choice/1. We then ignore that first ", and then for each remaining character in the string, repeat a different choice/1 – replacing any escaped quotes with the actual quote code point. Finally, we ignore the closing ", and then reduce/1 to convert our list of characters back to a string.

Numbers:

Now I feel like we’re getting it! Let’s keep going – how do we handle numbers?

number =
  [?-, ?+]
  |> ascii_char()
  |> optional()
  |> ascii_string([?0..?9], min: 1)
  |> optional(
    "."
    |> string()
    |> ascii_string([?0..?9], min: 1)
  )
  |> reduce(:parse_number)

First, we look for a sign by matching on the ASCII characters for plus or minus, which are written in Elixir as ?- and ?+. We mark this as optional/1 and remember the sign if it’s there; if the number is unsigned, no worries. Then we grab all the digits we can. There’s an important difference here between optional/1 and optional/2: optional/1 makes the combinator you pass into it optional (for example, the signs + and -), whereas optional/2 concatenates its first argument with the optional second argument.

So here, the optional where we pass in the "." will optionally match the decimal point, and anything after it.
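Here’s a tiny throwaway module (names invented purely for demonstration) that isolates the two forms:

```elixir
defmodule OptionalDemo do
  import NimbleParsec

  # optional/1 wraps a single combinator: the sign may or may not be present
  maybe_sign = optional(ascii_char([?-, ?+]))

  # optional/2 concatenates its first argument (the digits) with an optional tail
  defparsec :num,
    maybe_sign
    |> ascii_string([?0..?9], min: 1)
    |> optional(string(".") |> ascii_string([?0..?9], min: 1))
end
```

With this, OptionalDemo.num("-42") and OptionalDemo.num("3.14") both succeed – and "42" succeeds too, because both the sign and the fractional tail match nothing and move on. Note that the whole fractional tail is one unit: "3." parses the "3" and leaves "." unconsumed, since the dot without trailing digits fails the optional branch as a whole.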

After that, we still need to parse the string into a number. When in doubt: just use the standard library parsing functions (after turning the code point values for the - and + back into their string forms):

defp parse_number(parts) do
  number_string =
    Enum.map_join(parts, fn
      ?- -> "-"
      ?+ -> "+"
      part when is_binary(part) -> part
    end)

  case Integer.parse(number_string) do
    {value, ""} ->
      value
    {_value, _remainder} ->
      {value, ""} = Float.parse(number_string)
      value
  end
end

All done! Now for the truthies and falsies:

Literals:

literal =
  choice([
    "true" |> string() |> replace(true),
    "TRUE" |> string() |> replace(true),
    "false" |> string() |> replace(false),
    "FALSE" |> string() |> replace(false),
    "null" |> string() |> replace(nil),
    "NULL" |> string() |> replace(nil)
  ])

This one’s pretty straightforward! It’s simply a choice between the various boolean and null literals (even if the input is SHOUTING AT YOU).

Combining Our Value Parsers:

It’s as easy as:

value =
  [
    quoted_string,
    literal,
    number
  ]
  |> choice()
  |> unwrap_and_tag(:value)

Try quoted string first, then literal, then number. And then tag the result as :value.

Thank goodness, that wasn’t too bad, was it? Well, it’s time to put it all together: let’s combine the column_name, operator and value:

The Perfect Predicate Parser, Altogether!

Now we can compose our completed predicate parser:

# `whitespace` hasn't appeared yet – a simple version that skips optional runs
# of spaces and tabs, and drops them from the output:
whitespace = optional(ignore(ascii_string([?\s, ?\t], min: 1)))

predicate =
  whitespace
  |> concat(column_name)
  |> concat(whitespace)
  |> concat(operator)
  |> concat(whitespace)
  |> concat(value)
  |> concat(whitespace)
  |> eos()
  |> reduce(:build_predicate)

defparsec(:predicate, predicate)

It’s beautiful!

It’s so clean! It reads like exactly what it does – get the whitespace, name, whitespace, operator, whitespace, value, whitespace. That eos just means end-of-string.

Finally, we’ll reduce all the tagged parts into our final tuple:

defp build_predicate(parts) do
  column = Keyword.get(parts, :column)
  op = Keyword.get(parts, :op)
  value = Keyword.get(parts, :value)

  {op, column, value}
end
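With everything wired up (let’s call the module PredicateParser – the name is mine, for illustration), a successful parse looks like this:

```elixir
PredicateParser.predicate("created_at >= '2024-01-01'")
#=> {:ok, [{:gte, "created_at", "2024-01-01"}], "", %{}, {1, 0}, 26}
```

One parsed tuple, no remaining input – exactly the shape we need to hand to Explorer.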

The Proof is in the Parsing

Here’s a real example. With my old code:

parse_predicate("project_id !! 123")
#=> nil
# (and a log message somewhere)

With NimbleParsec:

parse_predicate("project_id !! 123")
#=> {:error, "parse error at line 1, column 12: expected string \">=\" or string \"<=\" or string \"!=\" or string \"=\" or string \">\" or string \"<\""}

I don’t want to jinx myself, but I’m not going to complain when it inevitably goes wrong, and I have the exact reasons it went wrong in my logs.

Proportional or Preposterous?

Was this massive overkill?

…maybe! But I don’t think so.

It’s definitely more code, which in my eyes, isn’t ideal. But on the other hand – we’ve traded verbosity for clarity, extensibility, and better error messages. It should also be bulletproof⁴ (or at least, 100× less brittle than my previous implementation).

The Domain Specific Language (DSL) that NimbleParsec provides means that everything is now nicely labeled and scoped. While this does mean there is a not insignificant learning curve, it also means that we can make small focused changes if and when we need to. It also means that we can test each parser in isolation:

test "parses floats" do
  assert {:ok, [value: 3.14], _, _, _, _} = 
    MyParser.value("3.14")
end

…which gives me even more confidence in my solution.

So, should you also give up your hand-rolled parsers and Regexes and give in to the power of NimbleParsec? I think if you just need simple validations, or just to match patterns within an input, you can probably get away with the built-in String library functions or Regex.

There are also some caveats: NimbleParsec aggressively inlines code, which means that each time you use a node (value, operator, etc.) it gets replaced with its whole definition and compiled individually. There are ways around this using defcombinatorp, but it’s still worth mentioning. There’s also a definite learning curve: when building my parser above, I got it wrong quite a few times before I got it right. And while I love the verbosity of the error messages, they can be a little overwhelming and dry.
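For the curious, that escape hatch looks roughly like this – defcombinatorp defines a named combinator that is compiled once and referenced via parsec/1, rather than being inlined at every use site (the module and parser names here are invented for the sketch):

```elixir
defmodule SharedParser do
  import NimbleParsec

  # Compiled once; referenced below by name instead of being inlined each time
  defcombinatorp :ws, optional(ignore(ascii_string([?\s], min: 1)))

  defparsec :kv,
    parsec(:ws)
    |> ascii_string([?a..?z], min: 1)
    |> parsec(:ws)
    |> ignore(string("="))
    |> parsec(:ws)
    |> integer(min: 1)
end
```

The trade-off is a function call per reference instead of flat binary matching, so it’s a compile-size versus runtime-speed decision.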

But overall, if you’re reaching for string splitting or regex for anything more complex than simple validation, give NimbleParsec a shot. Your future self will thank you.

As promised, here is the complete code!


¹You’ve solved a problem with Regex! Congratulations, you now have two problems!

²If you stare into the generated binary matching code, the generated binary matching code stares back into you.

³Actually I’m not sorry. Blame Dashbit.

⁴If you find a way to break this, please don’t tell me