Last week, Rich Hickey announced a few notable changes to Clojure, including ahead-of-time compilation and a cleaner syntax for regular expressions. Both are improvements, but the syntax is especially interesting for a reason unrelated to its function. First, a quick overview.
1. What has changed
In a sentence, fewer backslashes. The notation is now more in line with that of scripting languages, where regular expressions are first-class literals, than that of general-purpose languages like C++ or Java, where regexes are just specialized strings.
Say we are given a stream including this text:
... <img src="images/11/apple1.gif"/> <img src="images/2/bulb2.jpeg"/> <img src="images/354/citrus32_a.png"/> ...
We want to select IMG tags and capture the basename (without extension) of each source file. This can be done in many ways; here’s a pseudocode blueprint which is just barely good enough:
<img [whitespace]+ src=" [word-char]+ / [digit]+ / ([word-char]+) ...
Converting this to Clojure’s old syntax gives us a somewhat unwieldy #"<img\\s+src=\"\\w+/\\d+/(\\w+)". A quick test:
(let [lines "... <img src=\"images/11/apple1.gif\"/> <img src=\"images/2/bulb2.jpeg\"/> <img src=\"images/354/citrus32_a.png\"/> ..."]
  ;; Return only the captures, not the full matches.
  (map second (re-seq #"<img\\s+src=\"\\w+/\\d+/(\\w+)" lines)))
=> ("apple1" "bulb2" "citrus32_a")
The new update to the reader allows us to remove the double escaping of the regex specials in the literal:
(map second (re-seq #"<img\s+src=\"\w+/\d+/(\w+)" lines))
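To check that only the notation changed and not the matching behaviour, the two spellings can be run side by side. This is a minimal sketch, assuming a Clojure build that includes the new reader syntax. The names `lines`, `doubled`, and `single` are illustrative. `re-pattern` stands in for the old literal here, since a pattern built from an ordinary string still needs the doubled backslashes.

```clojure
;; Sample input from the example above.
(def lines
  (str "... <img src=\"images/11/apple1.gif\"/> "
       "<img src=\"images/2/bulb2.jpeg\"/> "
       "<img src=\"images/354/citrus32_a.png\"/> ..."))

;; Pattern built from a plain string: every regex backslash is doubled,
;; just as the old literal syntax required.
(def doubled (re-pattern "<img\\s+src=\"\\w+/\\d+/(\\w+)"))

;; New reader syntax: regex escapes are written once.
;; Only the double quote still needs a backslash.
(def single #"<img\s+src=\"\w+/\d+/(\w+)")

;; Return only the captures, not the full matches.
(defn basenames [pattern text]
  (map second (re-seq pattern text)))

(basenames single lines)
;; => ("apple1" "bulb2" "citrus32_a")

;; Both spellings capture the same basenames.
(= (basenames doubled lines) (basenames single lines))
;; => true
```

The comparison is made on the match results rather than on the pattern text, because the two patterns are not guaranteed to print identically even when they match the same input.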
2. Clojure vs foo
Since we’re on the topic, here’s how Clojure’s syntax compares to popular languages.
Ruby and Perl 5
# Regular usage
/<img\s+src="\w+\/\d+\/(\w+)/
# Choosing a different delimiter:
m|<img\s+src="\w+/\d+/(\w+)| # Perl
%r|<img\s+src="\w+/\d+/(\w+)| # Ruby
The clearest of all extant languages (at least in this regard), Ruby and Perl can avoid some extra escaping by changing the delimiter character from the default / to another character, such as | in the examples above.
Well, the expression is long and ugly. An upside is that because of the quote delimiters, forward-slashes need not be escaped.
This is the same as Clojure’s original syntax. For reference, Clojure and Java share a regex engine and are equivalent in power.
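The shared engine can be seen from the Clojure side through ordinary Java interop. This is a minimal sketch: `java-style` and `first-basename` are illustrative names, and the pattern is built the way Java code would build it, from a string literal with doubled backslashes.

```clojure
(import 'java.util.regex.Pattern)

;; Java-style construction: the regex is an ordinary string, so each
;; regex backslash is doubled and each double quote is escaped.
(def java-style (Pattern/compile "<img\\s+src=\"\\w+/\\d+/(\\w+)"))

;; A Clojure regex literal is read as a java.util.regex.Pattern.
(instance? Pattern #"\w+")
;; => true

;; Drive the Java Matcher API by hand and return the first capture,
;; or nil when nothing matches.
(defn first-basename [text]
  (let [m (.matcher java-style text)]
    (when (.find m)
      (.group m 1))))

(first-basename "<img src=\"images/11/apple1.gif\"/>")
;; => "apple1"
```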
Edi Weitz’s professional CL-PPCRE package is essentially the standard for dealing with regular expressions in CL.
Also, Edi’s CL-INTERPOL provides a reader macro which simplifies regex literals to the level of Perl’s:
;; This is a callable lambda:
;; #~m|<img\s+src="\w+/\d+/(\w+)|
'#~m|<img\s+src="\w+/\d+/(\w+)|
=> (LAMBDA (#:STR236) (CL-PPCRE:SCAN "<img\\s+src=\"\\w+/\\d+/(\\w+)" #:STR236))
3. The real reason this is neat
The modification was proposed by Chris Houser (with a simple patch) on October 7, politely debated until October 10, and committed to Clojure in r1070 on October 15.* The new syntax was clearly better, so the discussion skipped past whether it should be applied and went directly to how.
Turnaround time for a breaking change: one week. You have to respect that velocity.
There is a feeling in the development community that Clojure has a good chance of becoming an important language. Now is the brief time when any interested programmer could contribute something significant, in an environment which recognizes intelligent contribution and lacks — for the moment — politics and tradition.
Want to help? The Clojure mailing list has a high signal-to-noise ratio, and subscribing is a good way to get acclimated. Also, communicating in real time with Rich Hickey and other Clojure experts is no more difficult than joining an IRC channel: #clojure on freenode.
* There was also a similar conversation in March, but it didn’t include a patch.