The Trouble With Regular Expressions

I have this problem with regular expressions. They’re too handy for their own good. You say you don’t know what a regular expression is? Well, let me just tell you, you don’t know what you’re missing out on. Think of it like a really super-duper complicated way of searching for something real specific when it’s slapped in the middle of a whole bunch of other crap you don’t want to even mess with. But it’s not even just searching: you can search and replace really complicated text that’s just too big to do by hand.

Regular Expressions Book


More than that, it’s not just about finding every reference to the name “Meg” on your computer and changing it to “Poop” … you can use regular expressions to transform Huge Text File (A) into just the little sub-sections of important stuff … or you can convert (A) from one textual format to another, like from HTML to Whiskey, or whatever the hell new-fangled nonsense is going on out there in web junky land.

There are plenty of programs out there that make use of regular expressions and you don’t even know about it.  I first started playing around with them way back in 1995 or so, not long after Rich Siegel started selling BBEdit for Macintosh, and I’ve been using them since.  Here’s an example … say you have a text file that looks like this:
Jimmy Jack Johnson sells 43 seashells to Yo Momma. She paid about three-fiddy.
You can do a regular expression find/replace on that text, which would look something like this:
's/^(J.*)\ J.*\ (J.*)\ (s)...s\ ([0-9]+).*\r/\1\ sucks\ \4\ \2\3\.\r/g'
Which would now make that first text look something like this:
Jimmy sucks 43 Johnsons.
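
If you want to try that at home, here is roughly the same find/replace as a Python sketch. Python is just a stand-in here, not what the expression above was written for: the escaped spaces become plain spaces, and the `\r` line ending is swapped for `$`, since the sample text is a single line.

```python
import re

text = "Jimmy Jack Johnson sells 43 seashells to Yo Momma. She paid about three-fiddy."

# Same capture groups as the expression above:
#   \1 = first J-word, \2 = third J-word,
#   \3 = the leading "s" of "sells", \4 = the number
pattern = r"^(J.*) J.* (J.*) (s)...s ([0-9]+).*$"
replacement = r"\1 sucks \4 \2\3."

result = re.sub(pattern, replacement, text)
print(result)  # Jimmy sucks 43 Johnsons.
```

Note that `\2\3` glues the captured “s” onto “Johnson”, which is where the plural comes from.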

This sort of thing would be great for pranking your frienemies, but I have yet to hear of such a thing catching on. Which brings us to my current dilemma. I’ve come to rely on regular expressions so much that I believe they can do anything. Unfortunately, I think I’m demanding too much, or at least my computer is unwilling to give me everything it’s got in order to accomplish this task. I have a text file, freshly spit out from Excel, that has 43 columns of textual data over 160 some-odd rows … not that big of a file, it’s only got about 43,000 characters. It’s tab-delimited, meaning that in the text file, there’s an unprinted tab character (\t) between each section of the data, to separate out each of the cells. What I wanted to do was churn this tabbed text file into an XML .plist, or Property List file. I’m sure there’s a more elegant way of doing this (without having to do it by hand, obviously), but I chose to create a regular expression … here’s what the “Search For” string looks like:

^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\r
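
A likely reason a pattern like this bogs down (an educated guess, not something measured on the actual file): `.` happily matches tabs too, so every `(.*)` is free to swallow its neighbors’ columns. When a row doesn’t line up exactly, the engine can end up trying an enormous number of ways to carve it up before giving up. Here is a sketch in Python of a stricter version where each field is “anything but a tab”. The sample rows are made up for illustration, and `$` stands in for the `\r` line ending.

```python
import re

NUM_COLUMNS = 43  # column count mentioned in the post

# Each field may contain anything except a tab or a line break, so a field
# can never swallow a neighboring column and a row can only split one way.
field = r"([^\t\r\n]*)"
pattern = re.compile("^" + r"\t".join([field] * NUM_COLUMNS) + "$")

# Made-up rows: one with all 43 cells, one that is a cell short.
good_row = "\t".join(f"cell{i}" for i in range(1, NUM_COLUMNS + 1))
short_row = "\t".join(f"cell{i}" for i in range(1, NUM_COLUMNS))

match = pattern.match(good_row)
print(match.group(1), match.group(43))  # cell1 cell43
print(pattern.match(short_row))         # None, and it fails quickly
```

With the tab excluded from each field, a malformed row fails after a single pass instead of sending the engine off to explore every possible split.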

And here’s what the “Replace With” string looked like:

legislatorID\r\1\rlegtype_name\r\2\rlegtype\r\3\rlastname\r\4\rfirstname\r\5\rmiddlename\r\6\rnickname\r\7\rsuffix\r\8\rparty_name\r\9\rparty_id\r\10\rdistrict\r\11\rtenure\r\12\rpartisan_index\r\13\rphoto_name\r\14\rbio_url\r\15\rnotes\r\16\rgallery_desk\r\17\rcap_office\r\18\rstaff\r\19\rcap_phone\r\20\rcap_fax\r\21\rcap_phone2_name\r\22\rcap_phone2\r\23\rdist1_street\r\24\rdist1_city\r\25\rdist1_zip\r\26\rdist1_phone\r\27\rdist1_fax\r\28\rdist2_street\r\29\rdist2_city\r\30\rdist2_zip\r\31\rdist2_phone\r\32\rdist2_fax\r\33\rdist3_street\r\34\rdist3_city\r\35\rdist3_zip\r\36\rdist3_phone1\r\37\rdist3_fax\r\38\rdist4_street\r\39\rdist4_city\r\40\rdist4_zip\r\41\rdist4_phone1\r\42\rdist4_fax\r\43\r\r

Needless to say, it did not go over well. BBEdit died after ten minutes of churning. Perl sucked up 2 Gigs of RAM before I had to kill it. Sure, I could process the file in little bits at a time, but wouldn’t that be taking the fun out of regular expressions?
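
For the record, the “more elegant way” probably skips the giant expression altogether: split each row on tabs and let a library write the plist. Below is a minimal sketch in Python using the standard `plistlib` module. Several things in it are assumptions: the field list is cut down to the first five names from the “Replace With” string, the sample rows are invented, `rows_to_plist` is a hypothetical helper name, and whether the real plist should be an array of dictionaries is a guess.

```python
import plistlib

# First five field names from the "Replace With" string; the real file has 43.
FIELD_NAMES = ["legislatorID", "legtype_name", "legtype", "lastname", "firstname"]

def rows_to_plist(tab_text: str) -> bytes:
    """Turn tab-delimited text into an XML plist with one dict per row."""
    records = []
    for line in tab_text.splitlines():  # copes with \r, \n, and \r\n endings
        if not line:
            continue  # skip blank lines
        cells = line.split("\t")
        if len(cells) != len(FIELD_NAMES):
            raise ValueError(f"expected {len(FIELD_NAMES)} cells, got {len(cells)}")
        records.append(dict(zip(FIELD_NAMES, cells)))
    return plistlib.dumps(records, sort_keys=False)

# Made-up sample data, using the classic Mac \r line ending.
sample = "1\tSenator\tS\tJohnson\tJimmy\r2\tRepresentative\tR\tMomma\tYo\r"
xml = rows_to_plist(sample)
print(xml.decode("utf-8"))
```

Splitting one line at a time does a fixed amount of work per row, so a 43,000-character file should be no trouble, though admittedly it takes some of the fun out of it.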

EDIT / Update: 

Yeah, about all that. With a tiny bit of touching up, a demo version of TextMate processed the whole file in about 6 seconds. Sounds like someone’s going to actually have to pay for a software registration when the demo expires. Nothing like knocking my socks off and shaming a few industry leaders to earn your keep. Nice job there, buddy.

Posted in Humor, Tech.


One Response


  1. shelley says

    I am sort of surprised there was no luck with perl. Were you sucking in the whole file (@file = `list filename`) or reading it in line by line for processing? (open file;while; run regular expression on $_;)
