Clojure’s new regex syntax
Published November 19th, 2008
Table of contents
- 1. What has changed
- 2. Clojure vs foo
- 3. The real reason this is neat
Last week, Rich Hickey announced a few notable changes to Clojure, including ahead-of-time compilation and a cleaner syntax for regular expressions. Both are improvements, but the syntax is especially interesting for a reason unrelated to its function. First, a quick overview.
1. What has changed
In a sentence, fewer backslashes. The notation is now more in line with that of scripting languages, where regular expressions are first-class literals, than that of general-purpose languages like C++ or Java, where regexes are just specialized strings.
Say we are given a stream including this text:
...
<img src="images/11/apple1.gif"/>
<img src="images/2/bulb2.jpeg"/>
<img src="images/354/citrus32_a.png"/>
...
We want to select IMG tags and capture the basename (without extension) of each source file. This can be done in many ways; here’s a pseudocode blueprint which is just barely good enough:
<img [whitespace]+ src=" [word-char]+ / [digit]+ / ([word-char]+) ...
Converting this to Clojure’s old syntax gives us a somewhat unwieldy #"<img\\s+src=\"\\w+/\\d+/(\\w+)". A quick test:
(let [lines "... <img src=\"images/11/apple1.gif\"/> <img src=\"images/2/bulb2.jpeg\"/> <img src=\"images/354/citrus32_a.png\"/> ..."]
  ;; Return only the captures, not the full matches.
  (map second (re-seq #"<img\\s+src=\"\\w+/\\d+/(\\w+)" lines)))
=> ("apple1" "bulb2" "citrus32_a")
The new update to the reader allows us to remove the double escaping of the regex specials in the literal:
(map second (re-seq #"<img\s+src=\"\w+/\d+/(\w+)" lines)))
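For a version that can be pasted into a REPL as-is, here is a minimal sketch assuming a Clojure release that includes the new reader syntax. The names `lines`, `img-re`, `img-re-from-string`, and `basenames` are made up for illustration. It also builds the same pattern from an ordinary string with `re-pattern`, which still needs the doubled backslashes, to show where the old escaping came from.

```clojure
;; Sample input from the article, assembled as one string.
(def lines
  (str "... <img src=\"images/11/apple1.gif\"/>"
       " <img src=\"images/2/bulb2.jpeg\"/>"
       " <img src=\"images/354/citrus32_a.png\"/> ..."))

;; New-style literal: one backslash per regex escape.
;; Only the double quote still needs escaping, because it delimits the literal.
(def img-re #"<img\s+src=\"\w+/\d+/(\w+)")

;; The same pattern compiled from a plain string (hypothetical name).
;; String literals consume one level of backslashes, hence the doubling.
(def img-re-from-string
  (re-pattern "<img\\s+src=\"\\w+/\\d+/(\\w+)"))

;; Illustrative helper: return only the first capture group of every match.
(defn basenames [re text]
  (map second (re-seq re text)))

(basenames img-re lines)
;; => ("apple1" "bulb2" "citrus32_a")

(= (basenames img-re lines)
   (basenames img-re-from-string lines))
;; => true
```

Both patterns match the same text; the literal is simply easier to read and to copy from a regex reference.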
2. Clojure vs foo
Since we’re on the topic, here’s how Clojure’s syntax compares to popular languages.
Ruby and Perl 5
# Regular usage
/<img\s+src="\w+\/\d+\/(\w+)/

# Choosing a different delimiter:
m|<img\s+src="\w+/\d+/(\w+)|    # Perl
%r|<img\s+src="\w+/\d+/(\w+)|   # Ruby
The clearest of all extant languages (at least in this regard), Ruby and Perl can avoid some extra escaping by changing the delimiter character from / to |.
Emacs Lisp
"<img\\s-+src=\"\\w+/[[:digit:]]+/\\(\\w+\\)"
Well, the expression is long and ugly. An upside is that because of the quote delimiters, forward-slashes need not be escaped.
Java
"<img\\s+src=\"\\w+/\\d+/(\\w+)"
This is the same as Clojure’s original syntax. For reference, Clojure and Java share a regex engine and are equivalent in power.
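The shared engine is easy to check from a REPL. The following is a small sketch, not taken from the announcement; the names `digits` and `first-match-via-java` are hypothetical. It shows that a regex literal is read as a compiled `java.util.regex.Pattern`, so Clojure's `re-find` and Java's own `Matcher` API give the same answer.

```clojure
(import 'java.util.regex.Pattern)

;; A regex literal is compiled to a java.util.regex.Pattern by the reader.
(def digits #"\d+")

(instance? Pattern digits)
;; => true

;; Clojure's wrapper around the matcher:
(re-find digits "images/354/citrus32_a.png")
;; => "354"

;; The same search through the Java API directly (illustrative helper).
(defn first-match-via-java [^Pattern re ^String s]
  (let [m (.matcher re s)]
    (when (.find m)
      (.group m))))

(first-match-via-java digits "images/354/citrus32_a.png")
;; => "354"
```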
Common Lisp
Edi Weitz’s professional CL-PPCRE package is essentially the standard for dealing with regular expressions in CL.
"<img\\s+src=\"\\w+/\\d+/(\\w+)"
Also, Edi’s CL-INTERPOL provides a reader macro which simplifies regex literals to the level of Perl’s:
#?r|<img\s+src="\w+/\d+/(\w+)|
Finally, the reader macro mastery of Doug Hoyte’s Let Over Lambda gives a method of making clear, functional literals:
;; This is a callable lambda:
;; #~m|<img\s+src="\w+/\d+/(\w+)|
'#~m|<img\s+src="\w+/\d+/(\w+)|
=> (LAMBDA (#:STR236)
     (CL-PPCRE:SCAN "<img\\s+src=\"\\w+/\\d+/(\\w+)" #:STR236))
3. The real reason this is neat
The modification was proposed by Chris Houser (with a simple patch) on October 7, politely debated until October 10, and committed to Clojure in r1070 on October 15.* The syntax was clearly better, so the discussion skipped past whether it should be applied and went directly to how.
Turnaround time for a breaking change: one week. You have to respect that velocity.
There is a feeling in the development community that Clojure has a good chance of becoming an important language. Now is the brief time when any interested programmer could contribute something significant, in an environment which recognizes intelligent contribution and lacks — for the moment — politics and tradition.
Want to help? The Clojure mailing list is high signal-to-noise, and subscribing is a good way to get acclimated. Also, communicating realtime with Rich Hickey and other Clojure experts is no more difficult than joining an IRC channel: #clojure on freenode.
Further discussion on the programming reddit
* There was also a similar conversation in March, but it didn’t include a patch.
Doug Hoyte isn’t the only one who knows how to write reader macros — Edi Weitz provides cl-interpol ( http://www.weitz.de/cl-interpol/#regular ) which provides, among other things, support for writing regexps with a reasonable number of backslashes.
-Peter
Certainly a big improvement, but I’ve always preferred Scheme Shell’s S-Expression-based regex syntax for flavors of Lisp.
Peter, thanks! I was aware of CL-INTERPOL, but I’ve never used it and didn’t know of its “synergy” with CL-PPCRE.
Matt, looks nifty, and I agree, more Lisp-like.
Wanted to make the same point as Peter — Edi’s cl-interpol+cl-ppcre is a great combination if you deal with regexes in CL a lot.
For the record, Python is pretty clear too:
r'<img\s+src="\w+/\d+/(\w+)'
The most readable Perl regexes are written with the /x modifier so you can have whitespace in it. That also makes it possible to write the regex on multiple lines with proper indentation and use # comments.
(the data doesn’t quite seem to match the regexes).
or even:
Yeah, Perl is just so much line noise…
Great, the lack of indentation made the last example look like crap. Oh, well…
Chaitanya, I am suitably corrected. I’ve updated the article — Edi certainly deserves his due. Thanks.
doug, thanks!
Johan: fixed it up for you. 🙂 And I agree, Perl is the big winner here.
In the Java RE engine, you also have other predefined classes, such as \p{Alpha} or \p{ASCII} and so on – see: http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html
These imo provide readability (although I hate the \p prefix, {Alone} would be nice and leave out confusion of the syntax proposed here with square brackets which look too much like “custom” classes) and Unicode support (\p{InArbitraryUnicodeBlock} for instance) (unicode categories are also useable, i.e \p{P} for all unicode punctuation signs or \p{Lu} for all uppercase letters and so on)
[…] Clojure recently incorporated a new regex syntax. […]