Clojure’s new regex syntax

Published November 19th, 2008

Last week, Rich Hickey announced a few notable changes to Clojure, including ahead-of-time compilation and a cleaner syntax for regular expressions. Both are improvements, but the syntax is especially interesting for a reason unrelated to its function. First, a quick overview.

1. What has changed

In a sentence, fewer backslashes. The notation is now more in line with that of scripting languages, where regular expressions are first-class literals, than that of general-purpose languages like C++ or Java, where regexes are just specialized strings.

Say we are given a stream including this text:

...
<img  src="images/11/apple1.gif"/>
<img   src="images/2/bulb2.jpeg"/>
<img src="images/354/citrus32_a.png"/>
...

We want to select IMG tags and capture the basename (without extension) of each source file. This can be done in many ways; here’s a pseudocode blueprint which is just barely good enough:

<img [whitespace]+
     src=" [word-char]+ / [digit]+ / ([word-char]+) ...

Converting this to Clojure’s old syntax gives us a somewhat unwieldy #"<img\\s+src=\"\\w+/\\d+/(\\w+)". A quick test:

(let [lines "...
             <img  src=\"images/11/apple1.gif\"/>
             <img   src=\"images/2/bulb2.jpeg\"/>
             <img src=\"images/354/citrus32_a.png\"/>
             ..."]
  ;; Return only the captures, not the full matches.
  (map second
       (re-seq #"<img\\s+src=\"\\w+/\\d+/(\\w+)" lines)))
 
=> ("apple1" "bulb2" "citrus32_a")

The updated reader lets us drop the double escaping of the regex specials in the literal; only the embedded double quote still needs its backslash:

;; Inside the same let as before; only the literal changes.
(map second
     (re-seq #"<img\s+src=\"\w+/\d+/(\w+)" lines))

2. Clojure vs foo

Since we’re on the topic, here’s how Clojure’s syntax compares to popular languages.

Ruby and Perl 5

# Regular usage
/<img\s+src="\w+\/\d+\/(\w+)/
 
# Choosing a different delimiter:
 m|<img\s+src="\w+/\d+/(\w+)|     # Perl
%r|<img\s+src="\w+/\d+/(\w+)|     # Ruby

The clearest of all extant languages (at least in this regard), Ruby and Perl can avoid escaping the forward slashes by changing the delimiter character from / to |.

Emacs Lisp

"<img\\s-+src=\"\\w+/[[:digit:]]+/\\(\\w+\\)"

Well, the expression is long and ugly. An upside is that because of the quote delimiters, forward-slashes need not be escaped.

Java

"<img\\s+src=\"\\w+/\\d+/(\\w+)"

This is the same as Clojure’s original syntax. For reference, Clojure and Java share a regex engine and are equivalent in power.
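
To see that literal in use, here is a minimal sketch that runs the same extraction through java.util.regex. The class name ImgBasenames and the method basenames are made up for illustration; only Pattern and Matcher come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for the example.
final class ImgBasenames {
    // Same pattern as the string literal above. The Java compiler strips one
    // level of backslashes before the regex engine ever sees the text.
    private static final Pattern IMG =
        Pattern.compile("<img\\s+src=\"\\w+/\\d+/(\\w+)");

    // Collect capture group 1 (the basename) from every match in the input.
    static List<String> basenames(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = IMG.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String lines = String.join("\n",
            "...",
            "<img  src=\"images/11/apple1.gif\"/>",
            "<img   src=\"images/2/bulb2.jpeg\"/>",
            "<img src=\"images/354/citrus32_a.png\"/>",
            "...");

        // What the engine receives: single backslashes only.
        System.out.println(IMG.pattern());     // prints <img\s+src="\w+/\d+/(\w+)
        System.out.println(basenames(lines));  // prints [apple1, bulb2, citrus32_a]
    }
}
```

The find/group loop plays roughly the role of re-seq plus (map second ...) in the Clojure test above.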

Common Lisp

Edi Weitz’s professional CL-PPCRE package is essentially the standard for dealing with regular expressions in CL.

"<img\\s+src=\"\\w+/\\d+/(\\w+)"

Also, Edi’s CL-INTERPOL provides a reader macro which simplifies regex literals to the level of Perl’s:

#?r|<img\s+src="\w+/\d+/(\w+)|

Finally, the reader macro mastery of Doug Hoyte’s Let Over Lambda gives a method of making clear, functional literals:

;; This is a callable lambda:
;; #~m|<img\s+src="\w+/\d+/(\w+)|
 
'#~m|<img\s+src="\w+/\d+/(\w+)|
 
=> (LAMBDA (#:STR236)
     (CL-PPCRE:SCAN "<img\\s+src=\"\\w+/\\d+/(\\w+)"
                    #:STR236))

3. The real reason this is neat

The modification was proposed by Chris Houser (with a simple patch) on October 7, politely debated until October 10, and committed to Clojure in r1070 on October 15.* The syntax was plainly better, so the discussion skipped whether it should be applied and went directly to how.

Turnaround time for a breaking change: about one week. You have to respect that velocity.

There is a feeling in the development community that Clojure has a good chance of becoming an important language. Now is the brief time when any interested programmer could contribute something significant, in an environment which recognizes intelligent contribution and lacks — for the moment — politics and tradition.

Want to help? The Clojure mailing list has a high signal-to-noise ratio, and subscribing is a good way to get acclimated. Also, communicating in real time with Rich Hickey and other Clojure experts is no more difficult than joining an IRC channel: #clojure on freenode.

Further discussion on the programming reddit


* There was also a similar conversation in March, but it didn’t include a patch.


11 Responses to “Clojure’s new regex syntax”

  1. Peter Seibel Says:

    Doug Hoyte isn’t the only one who knows how to write reader macros — Edi Weitz provides cl-interpol ( http://www.weitz.de/cl-interpol/#regular ) which provides, among other things, support for writing regexps with a reasonable number of backslashes.

    -Peter

  2. Matt Might Says:

    Certainly a big improvement, but I’ve always preferred Scheme Shell’s S-Expression-based regex syntax for flavors of Lisp.

  3. Stephen Bach Says:

    Peter, thanks! I was aware of CL-INTERPOL, but I’ve never used it and didn’t know of its “synergy” with CL-PPCRE.

    Matt, looks nifty, and I agree, more Lisp-like.

  4. Chaitanya Gupta Says:

    Wanted to make the same point as Peter — Edi’s cl-interpol+cl-ppcre is a great combination if you deal with regexes in CL a lot.

  5. doug Says:

    For the record, Python is pretty clear too: r'<img\s+src="\w+/\d+/(\w+)'

  6. Johan L Says:

    The most readable Perl regexes are written with the /x modifier so you can have whitespace in it. That also makes it possible to write the regex on multiple lines with proper indentation and use # comments.

    (the data doesn’t quite seem to match the regexes).

    m| <img \s+ src="\w+  /  \d+  /  (\w+) |x
    

    or even:

    m| 
        <img \s+ 
            src="\w+  # Initial word (something)
            / \d+     # Bogus number
            / (\w+)   # Capture basename
    |x
    

    Yeah, Perl is just so much line noise…

  7. Johan L Says:

    Great, the lack of indentation made the last example look like crap. Oh, well…

  8. Stephen Bach Says:

    Chaitanya, I am suitably corrected. I’ve updated the article — Edi certainly deserves his due. Thanks.

    doug, thanks!

    Johan: fixed it up for you. 🙂 And I agree, Perl is the big winner here.

  9. greg Says:

    In the Java RE engine, you also have other predefined classes, such as \p{Alpha} or \p{ASCII} and so on – see:
    http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

    These imo provide readability (although I hate the \p prefix, {Alone} would be nice and leave out confusion of the syntax proposed here with square brackets which look too much like “custom” classes) and Unicode support (\p{InArbitraryUnicodeBlock} for instance)

  10. greg Says:

    (unicode categories are also useable, i.e \p{P} for all unicode punctuation signs or \p{Lu} for all uppercase letters and so on)

  11. Clojure Roundup: Post-Thanksgiving vacation edition « Clojure Study Group DC Says:

    […] Clojure recently incorporated a new regex syntax. […]