Clojure’s new regex syntax

Published November 19th, 2008

Last week, Rich Hickey announced a few notable changes to Clojure, including ahead-of-time compilation and a cleaner syntax for regular expressions. Both are improvements, but the syntax is especially interesting for a reason unrelated to its function. First, a quick overview.

1. What has changed

In a sentence, fewer backslashes. The notation is now more in line with that of scripting languages, where regular expressions are first-class literals, than that of general-purpose languages like C++ or Java, where regexes are just specialized strings.

Say we are given a stream including this text:

<img  src="images/11/apple1.gif"/>
<img   src="images/2/bulb2.jpeg"/>
<img src="images/354/citrus32_a.png"/>

We want to select IMG tags and capture the basename (without extension) of each source file. This can be done in many ways; here’s a pseudocode blueprint which is just barely good enough:

<img [whitespace]+
     src=" [word-char]+ / [digit]+ / ([word-char]+) ...

Converting this to Clojure’s old syntax gives us a somewhat unwieldy #"<img\\s+src=\"\\w+/\\d+/(\\w+)". A quick test:

(let [lines "...
             <img  src=\"images/11/apple1.gif\"/>
             <img   src=\"images/2/bulb2.jpeg\"/>
             <img src=\"images/354/citrus32_a.png\"/>
             ..."]
  ;; Return only the captures, not the full matches.
  (map second
       (re-seq #"<img\\s+src=\"\\w+/\\d+/(\\w+)" lines)))
=> ("apple1" "bulb2" "citrus32_a")

The new update to the reader allows us to remove the double escaping of the regex specials in the literal:

(map second
     (re-seq #"<img\s+src=\"\w+/\d+/(\w+)" lines))
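
For readers who want to run the comparison themselves, here is a minimal self-contained sketch, assuming a Clojure release that includes the new reader. The names lines, string-pattern, literal-pattern and basenames are made up for illustration. A pattern built from an ordinary string with re-pattern still needs the doubled backslashes, much as the old literal did, while the #"..." literal does not; both should give the same captures.

```clojure
(def lines
  (str "<img  src=\"images/11/apple1.gif\"/>\n"
       "<img   src=\"images/2/bulb2.jpeg\"/>\n"
       "<img src=\"images/354/citrus32_a.png\"/>\n"))

;; Built from an ordinary string: every regex backslash is doubled.
(def string-pattern
  (re-pattern "<img\\s+src=\"\\w+/\\d+/(\\w+)"))

;; New-style literal: only the double quote still needs escaping.
(def literal-pattern
  #"<img\s+src=\"\w+/\d+/(\w+)")

(defn basenames
  "Returns the first capture group of every match of pattern in text."
  [pattern text]
  (map second (re-seq pattern text)))

(basenames string-pattern lines)
(basenames literal-pattern lines)
;; both => ("apple1" "bulb2" "citrus32_a")
```

The two patterns differ only in how they are spelled in source; the escaping burden moves from the programmer to the reader.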

2. Clojure vs foo

Since we’re on the topic, here’s how Clojure’s syntax compares to popular languages.

Ruby and Perl 5

# Regular usage
/<img\s+src="\w+\/\d+\/(\w+)/     # Perl and Ruby
# Choosing a different delimiter:
 m|<img\s+src="\w+/\d+/(\w+)|     # Perl
%r|<img\s+src="\w+/\d+/(\w+)|     # Ruby

The clearest of all extant languages (at least in this regard), Ruby and Perl can avoid some extra escaping by changing the delimiter character from / to |.

Emacs Lisp


Well, the expression is long and ugly. An upside is that because of the quote delimiters, forward-slashes need not be escaped.



Java

Java’s string-based notation is the same as Clojure’s original syntax. For reference, Clojure and Java share a regex engine and are equivalent in power.
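
A small sketch of what sharing an engine means in practice, assuming Clojure on the JVM: a regex literal is read as a java.util.regex.Pattern, and a pattern compiled through Java’s own Pattern/compile can be handed to Clojure’s re-find. The sample string and the names literal and compiled are made up for illustration.

```clojure
(import 'java.util.regex.Pattern)

;; A Clojure regex literal is an instance of Java's Pattern class.
(def literal #"(\w+)\.gif")
(instance? Pattern literal)   ; => true

;; Compiled the Java way, from a double-escaped string.
(def compiled (Pattern/compile "(\\w+)\\.gif"))

;; Clojure's regex functions accept either one.
(re-find literal  "apple1.gif")
(re-find compiled "apple1.gif")
;; both => ["apple1.gif" "apple1"]
```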

Common Lisp

Edi Weitz’s professional CL-PPCRE package is essentially the standard for dealing with regular expressions in CL.


Also, Edi’s CL-INTERPOL provides a reader macro which simplifies regex literals to the level of Perl’s:


Finally, the reader macro mastery of Doug Hoyte’s Let Over Lambda gives a method of making clear, functional literals:

;; This is a callable lambda:
;; #~m|<img\s+src="\w+/\d+/(\w+)|
=> (LAMBDA (#:STR236)
     (CL-PPCRE:SCAN "<img\\s+src=\"\\w+/\\d+/(\\w+)" #:STR236))

3. The real reason this is neat

The modification was proposed by Chris Houser (with a simple patch) on October 7, politely debated until October 10, and committed to Clojure in r1070 on October 15.* The new syntax was clearly better, so the discussion skipped past whether it should be applied and went directly to how.

Turnaround time for a breaking change: one week. You have to respect that velocity.

There is a feeling in the development community that Clojure has a good chance of becoming an important language. Now is the brief time when any interested programmer could contribute something significant, in an environment which recognizes intelligent contribution and lacks — for the moment — politics and tradition.

Want to help? The Clojure mailing list has a high signal-to-noise ratio, and subscribing is a good way to get acclimated. Also, communicating in real time with Rich Hickey and other Clojure experts is no more difficult than joining an IRC channel: #clojure on freenode.

Further discussion on the programming reddit

* There was also a similar conversation in March, but it didn’t include a patch.

11 Responses to “Clojure’s new regex syntax”

  1. Peter Seibel Says:

    Doug Hoyte isn’t the only one who knows how to write reader macros: Edi Weitz provides cl-interpol, which provides, among other things, support for writing regexps with a reasonable number of backslashes.


  2. Matt Might Says:

    Certainly a big improvement, but I’ve always preferred Scheme Shell’s S-Expression-based regex syntax for flavors of Lisp.

  3. Stephen Bach Says:

    Peter, thanks! I was aware of CL-INTERPOL, but I’ve never used it and didn’t know of its “synergy” with CL-PPCRE.

    Matt, looks nifty, and I agree, more Lisp-like.

  4. Chaitanya Gupta Says:

    Wanted to make the same point as Peter — Edi’s cl-interpol+cl-ppcre is a great combination if you deal with regexes in CL a lot.

  5. doug Says:

    For the record, Python is pretty clear too: r'<img\s+src="\w+/\d+/(\w+)'

  6. Johan L Says:

    The most readable Perl regexes are written with the /x modifier so you can have whitespace in it. That also makes it possible to write the regex on multiple lines with proper indentation and use # comments.

    (the data doesn’t quite seem to match the regexes).

    m| <img \s+ src="\w+  /  \d+  /  (\w+) |x

    or even:

        <img \s+ 
            src="\w+  # Initial word (something)
            / \d+     # Bogus number
            / (\w+)   # Capture basename

    Yeah, Perl is just so much line noise…

  7. Johan L Says:

    Great, the lack of indentation made the last example look like crap. Oh, well…

  8. Stephen Bach Says:

    Chaitanya, I am suitably corrected. I’ve updated the article — Edi certainly deserves his due. Thanks.

    doug, thanks!

    Johan: fixed it up for you. 🙂 And I agree, Perl is the big winner here.

  9. greg Says:

    In the Java RE engine, you also have other predefined classes, such as \p{Alpha} or \p{ASCII} and so on.

    These imo provide readability (although I hate the \p prefix, {Alone} would be nice and leave out confusion of the syntax proposed here with square brackets which look too much like “custom” classes) and Unicode support (\p{InArbitraryUnicodeBlock} for instance)

  10. greg Says:

    (unicode categories are also useable, i.e \p{P} for all unicode punctuation signs or \p{Lu} for all uppercase letters and so on)

  11. Clojure Roundup: Post-Thanksgiving vacation edition « Clojure Study Group DC Says:

    […] Clojure recently incorporated a new regex syntax. […]