Appendix G: Lexers and Regular Expressions

Working with text is a common, fundamental task in day-to-day programming.

Lux's approach to text processing is the use of composable, monadic lexers.

The idea is that a lexer is a function that takes some text input, performs some calculations which consume that input, and then returns some value, along with the remaining, unconsumed input.

Of course, the lexer may fail, in which case the user should receive a meaningful error message to help figure out what happened.

The lux/lexer library provides a type, and a host of combinators, for building and working with lexers.

Check it out:

(type: #export (Lexer a)
  (-> Text (Error [Text a])))
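
Since a Lexer is just a function, running one amounts to plain function application, and the result can be pattern-matched. A minimal sketch (some-lexer is a hypothetical lexer, and the #;Left/#;Right tags assume Error is the usual Either of an error message and a result; the exact tag names may differ across Lux versions):

## Running a lexer is just applying it to the input text.
(case (some-lexer "input text")
  ## Success: the value, plus whatever input was left unconsumed.
  (#;Right [remaining value])
  ...

  ## Failure: a meaningful error message.
  (#;Left error)
  ...)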

A good example of lexers being used is the lux/data/format/json module, which implements full JSON serialization, and which uses lexers when working with text inputs to the JSON parser.


However, programmers coming from other programming languages may be familiar with a different approach to text processing that has been popular for a number of years now: regular expressions.

Regular expressions offer a compact syntax for building lexers that is great for writing quick text-processing tools.

Lux also offers support for this style in its lux/regex module, which offers the regex macro.

The regex macro, in turn, compiles the given syntax into a lexer, which means you can combine both approaches, for maximum flexibility.
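
Because regex produces an ordinary lexer, it can be sequenced with hand-written lexers through the monadic interface. A rough sketch, assuming do notation from lux/control/monad, a Monad<Lexer> instance, and a hypothetical hand-written my-custom-lexer (all three names are assumptions, not confirmed API):

## Mixing regex-made lexers with a hand-written one.
(do Monad<Lexer>
  [digits (regex "\\d+")  ## a lexer built from a regular expression
   _ (regex "-")
   rest my-custom-lexer]  ## hypothetical hand-written lexer
  (wrap [digits rest]))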

Here are some sample regular expressions:

## Literals
(regex "a")

## Wildcards
(regex ".")

## Escaping
(regex "\\.")

## Character classes
(regex "\\d")

(regex "\\p{Lower}")

(regex "[abc]")

(regex "[a-z]")

(regex "[a-zA-Z]")

(regex "[a-z&&[def]]")

## Negation
(regex "[^abc]")

(regex "[^a-z]")

(regex "[^a-zA-Z]")

(regex "[a-z&&[^bc]]")

(regex "[a-z&&[^m-p]]")

## Combinations
(regex "aa")

(regex "a?")

(regex "a*")

(regex "a+")

## Specific amounts
(regex "a{2}")

## At least
(regex "a{1,}")

## At most
(regex "a{,1}")

## Between
(regex "a{1,2}")

## Groups
(regex "a(.)c")

(regex "a(b+)c")

(regex "(\\d{3})-(\\d{3})-(\\d{4})")

(regex "(\\d{3})-(?:\\d{3})-(\\d{4})")

(regex "(?<code>\\d{3})-\\k<code>-(\\d{4})")

(regex "(?<code>\\d{3})-\\k<code>-(\\d{4})-\\0")

(regex "(\\d{3})-((\\d{3})-(\\d{4}))")

## Alternation
(regex "a|b")

(regex "a(.)(.)|b(.)(.)")

Another awesome feature of the regex macro is that it will build fully type-safe code for you.

This is important because the groups and alternations that you use in your regular expression will affect its output.

For example:

## This returns a single piece of text
(regex "a{1,}")

## But this one returns a pair of texts
## The first is the whole match: aXc
## And the second is the thing that got matched: the X itself
(regex "a(.)c")

## These, then, are the types of those regular expressions:
(: (Lexer Text)
   (regex "a{1,}"))

(: (Lexer [Text Text])
   (regex "a(.)c"))
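
Following the same rule (the whole match first, then one Text per group), a regex with three groups yields a 4-tuple, which can be bound to a typed definition directly. A small sketch reusing the phone-number pattern from above (the parse-phone name is made up for illustration, and the def: syntax assumed here may vary across Lux versions):

## Three groups, plus the whole match, give four texts.
(def: #export parse-phone
  (Lexer [Text Text Text Text])
  (regex "(\\d{3})-(\\d{3})-(\\d{4})"))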

The benefits of lexers are that they are a bit easier to understand when read (due to their verbosity), and that they are very easy to combine (thanks to their monadic nature, and the combinator library).

The benefits of regular expressions are their familiarity to a lot of programmers, and how quick they are to write.

Ultimately, it makes the most sense to provide both mechanisms to Lux programmers, and let everyone choose whatever they find most useful.
