Clojure’s new regex syntax
Posted in Uncategorized on November 19th, 2008
Table of contents
- 1. What has changed
- 2. Clojure vs foo
- 3. The real reason this is neat
Last week, Rich Hickey announced a few notable changes to Clojure, including ahead-of-time compilation and a cleaner syntax for regular expressions. Both are improvements, but the syntax is especially interesting for a reason unrelated to its function. First, a quick overview.
1. What has changed
In a sentence, fewer backslashes. The notation is now more in line with that of scripting languages, where regular expressions are first-class literals, than that of general-purpose languages like C++ or Java, where regexes are just specialized strings.
Say we are given a stream including this text:
... <img src="images/11/apple1.gif"/> <img src="images/2/bulb2.jpeg"/> <img src="images/354/citrus32_a.png"/> ...
We want to select IMG tags and capture the basename (without extension) of each source file. This can be done in many ways; here’s a blueprint which is just barely good enough:
<img [whitespace]+
src=" [word-char]+ / [digit]+ / ([word-char]+) ...
Converting this to Clojure’s old syntax gives us a somewhat unwieldy #"<img\\s+src=\"\\w+/\\d+/(\\w+)". A quick test:
(let [lines "... <img src=\"images/11/apple1.gif\"/> <img src=\"images/2/bulb2.jpeg\"/> <img src=\"images/354/citrus32_a.png\"/> ..."]
  ;; Return only the captures, not the full matches.
  (map second (re-seq #"<img\\s+src=\"\\w+/\\d+/(\\w+)" lines)))
=> ("apple1" "bulb2" "citrus32_a")
The updated reader lets us drop the doubled backslashes in the literal; only the double quote still needs escaping:
(map second (re-seq #"<img\s+src=\"\w+/\d+/(\w+)" lines))
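To make the difference concrete, here is a minimal, self-contained sketch that runs the new-style literal next to the same pattern built from an ordinary string. It assumes a Clojure build that includes the reader change; the names img-literal, img-from-string and basenames are made up for illustration and do not come from Clojure itself.

```clojure
;; Sample input from the article, as an ordinary Clojure string.
(def lines
  (str "... <img src=\"images/11/apple1.gif\"/> "
       "<img src=\"images/2/bulb2.jpeg\"/> "
       "<img src=\"images/354/citrus32_a.png\"/> ..."))

;; New-style regex literal: backslashes are written once.
;; Only the double quote still needs escaping.
(def img-literal #"<img\s+src=\"\w+/\d+/(\w+)")

;; The same pattern built from a plain string with re-pattern.
;; String escaping still applies here, so every backslash is doubled.
(def img-from-string (re-pattern "<img\\s+src=\"\\w+/\\d+/(\\w+)"))

;; Hypothetical helper: return only the first capture group of each match.
(defn basenames [re s]
  (map second (re-seq re s)))

(println (basenames img-literal lines))
(println (basenames img-from-string lines))
```

Both calls should print the same three basenames as the earlier test. The string form is still useful when a pattern has to be assembled at runtime; the literal is the better fit when the pattern is fixed.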
2. Clojure vs foo
Since we’re on the topic, here’s how Clojure’s syntax compares to that of other popular languages.
Ruby and Perl 5
# Regular usage
/<img\s+src="\w+\/\d+\/(\w+)/
# Choosing a different delimiter:
m|<img\s+src="\w+/\d+/(\w+)| # Perl
%r|<img\s+src="\w+/\d+/(\w+)| # Ruby
The clearest of all extant languages (at least in this regard), Ruby and Perl can avoid some extra escaping by changing the delimiter character from / to |.
Emacs Lisp
"<img\\s-+src=\"\\w+/[[:digit:]]+/\\(\\w+\\)"
Well, the expression is long and ugly. One upside is that, because of the quote delimiters, forward slashes need not be escaped.
Java
"<img\\s+src=\"\\w+/\\d+/(\\w+)"
This is the same as Clojure’s original syntax, minus the leading #. For reference, Clojure’s regex literals are compiled to java.util.regex.Pattern objects, so the two languages share a regex engine and are equivalent in power.
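The shared engine is easy to check from the REPL. The following is a small sketch, not anything from the announcement: the names ext-re and split-name are invented for the example, while the Matcher methods are the standard java.util.regex ones called through Clojure’s Java interop.

```clojure
;; A regex literal evaluates to an ordinary Java Pattern object.
(def ext-re #"(\w+)\.(\w+)")

(println (instance? java.util.regex.Pattern ext-re))

;; Hypothetical helper: because the value is a Pattern, the usual
;; Matcher methods (.matcher, .matches, .group) work through interop.
(defn split-name [filename]
  (let [m (.matcher ext-re filename)]
    (when (.matches m)
      [(.group m 1) (.group m 2)])))

(println (split-name "apple1.gif"))
```

In everyday code re-seq, re-find and re-matches cover most needs; dropping down to the Matcher is only worthwhile when a Java-specific feature is required.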
Common Lisp
Edi Weitz’s professional CL-PPCRE package is essentially the standard for dealing with regular expressions in CL.
"<img\\s+src=\"\\w+/\\d+/(\\w+)"
Also, Edi’s CL-INTERPOL provides a reader macro which simplifies regex literals to the level of Perl’s:
#?r|<img\s+src="\w+/\d+/(\w+)|
Finally, the reader macro mastery of Doug Hoyte’s Let Over Lambda gives a method of making clear, functional literals:
;; This is a callable lambda:
;; #~m|<img\s+src="\w+/\d+/(\w+)|
'#~m|<img\s+src="\w+/\d+/(\w+)|
=> (LAMBDA (#:STR236)
(CL-PPCRE:SCAN "<img\\s+src=\"\\w+/\\d+/(\\w+)"
#:STR236))
3. The real reason this is neat
The modification was proposed by Chris Houser (with a simple patch) on October 7, politely debated until October 10, and committed to Clojure in r1070 on October 15.* The new syntax was clearly better, so the discussion skipped past whether it should be applied and went straight to how.
Turnaround time for a breaking change: one week. You have to respect that velocity.
There is a feeling in the development community that Clojure has a good chance of becoming an important language. Now is the brief time when any interested programmer could contribute something significant, in an environment which recognizes intelligent contribution and lacks — for the moment — politics and tradition.
Want to help? The Clojure mailing list has a high signal-to-noise ratio, and subscribing is a good way to get acclimated. Also, talking with Rich Hickey and other Clojure experts in real time is no more difficult than joining an IRC channel: #clojure on freenode.
* There was also a similar discussion in March, but it didn’t include a patch.