Tag-Archive for » regexp «

Wednesday, August 06th, 2008

So I was writing this tool to create a bunch of SQL statements from a data dump. Simple enough, right. And as always when you generate SQL statements, you have to make sure that the data doesn’t interfere with the SQL syntax by escaping the single quotes (and generally any binary data, but I didn’t have that). Any database gem/module/library has that built-in, of course, but I didn’t want to use that. So I said [Note: this doesn't work. Read on for the solution.]

def quote (str)
  str.gsub('\\','\\\\').gsub('\'','\\\'')
end

I read this as “replace all backslashes with double backslashes, and then replace all single quotes with a backslash and a single quote”. I added a simple test for it (yay TestUnit!):

def setup
  @m = Migrate.new
end

def test_quote
  assert_equal("I\\'m home", @m.quote("I'm home"))
end

But imagine my surprise when I got

  1) Failure:
test_quote(TestMigrate) [migration/test_migrate.rb:29]:
< "I\\'m home"> expected but was
< "Im homem home">.

Ooookay. What’s wrong here? Have I misunderstood the rules for escaping the escape sequence? It’s supposed to be easier with single quotes, but maybe I got it wrong. So I tried with double quotes:

def quote (str)
  str.gsub("\\","\\\\").gsub("'","\\'")
end

Surely this would work? Nope, it gives the exact same error. Time to look up gsub in the manual:

str.gsub(pattern, replacement) => new_str
str.gsub(pattern) {|match| block } => new_str

[…] If a string is used as the replacement, special variables from the match (such as $& and $1) cannot be substituted into it, as substitution into the string occurs before the pattern match starts. However, the sequences \1, \2, and so on [my emphasis] may be used to interpolate successive groups in the match.

“And so on”? Oh, so obviously \' (escaped \\' in the string literal) is the replacement string equivalent of $', which means everything afther the match (as all regexp hackers know). So I need to escape the backslash for regexp engine too:

def quote (str)
  str.gsub("\\","\\\\").gsub("'","\\\\'")
end

OK, the tests pass. But the code looks wrong. Four backslashes can’t work for both cases, can they? Let’s add a test case:

def test_quote
  assert_equal("I\\'m home", @m.quote("I'm home"))
  assert_equal("S\\\\N", @m.quote("S\\N"))
end

Nope, that fails. So we need this:

def quote (str)
  str.gsub("\\","\\\\\\\\").gsub("'","\\\\'")
end

Eight backslashes. Yes, the test passes, but is it worth it? Is it understandable? I don’t want comments to explain my code. Comments are good to provide a raison d’être for something, but not to explain its looks. Let’s switch to the other form of gsub:

def quote (str)
  str.gsub(/\\|'/) { |c| "\\#{c}" }
end

“If you see a backslash or a single quote, replace it with a backslash and whatever you saw.” That’s what I wanted to say anyway.

Good. But I wrote this in Markdown, so now I have to generate the HTML and the go through it and make sure that I restore whatever backslashes Markdown ate. (It turns out it didn’t eat any. TextMate has a Markdown Preview function that ate a lot of backslashes, but when I said “Convert to HTML” it didn’t eat any at all. Go figure.)

Category: Ruby  | Tags: , , ,  | 9 Comments