.htaccess voodoo

Every once in a while I have to do some .htaccess rewriting and every time I end up deeply fascinated at the possibilities that it offers. This time round the situation was as follows: for a client we had done some advanced search functionality, which uses fairly detailed URLs to store the search (the search is parsed into an abstract syntax tree, then serialized to a URL). The problem now was that we had rewritten the serialized syntax ... and somehow a famous search engine had picked up on the URLs and was getting bad results from them. What to do?

The solution was URL rewriting using .htaccess. However, it wasn't straightforward URL rewriting - the serialized search was encoded in the query string, and that was the only part that needed a rewrite. How does the Apache Rewrite module handle this? Very well, it turns out.

Solution - part 1

The first thing to realise is that RewriteRules by themselves won't do any good - the pattern of a RewriteRule is only matched against the path part of the URL, never the query string. This means you have to turn to the second part of the rewrite: the RewriteCond. Now, this presented the first part of my eye-opening experience: RewriteCond can test server variables such as %{QUERY_STRING} against a regex. This means you can do the following:

RewriteCond %{QUERY_STRING} ^id=([0-9]+)

And you'll match on query strings that start with id= followed by one or more digits. On Apache 2.x the patterns are Perl-compatible regular expressions, so you can use pretty much any regex you desire ... which makes it very powerful!
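
Note that the pattern above only matches when id is the first parameter. Here is a rough sketch of a few variations, assuming the id parameter may show up elsewhere in the query string or in a different case (these variations are illustrative and not part of the original setup):

# Match id only when it is the first parameter (as above)
RewriteCond %{QUERY_STRING} ^id=([0-9]+)

# Match id anywhere in the query string, as a whole parameter
RewriteCond %{QUERY_STRING} (^|&)id=([0-9]+)(&|$)

# Match the parameter name case-insensitively using the NC flag
RewriteCond %{QUERY_STRING} ^id=([0-9]+) [NC]

Consecutive RewriteCond lines are ANDed together by default, so treat these as alternatives rather than a block to paste as-is. In the second variant the digits end up in the second capture group, because the first group matches either the start of the string or the & separator.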

Solution - part 2

As you can probably guess from the above code bit, you can also use capturing groups in the RewriteCond regexes. Not only that, though: you can reference these captured groups in a RewriteRule. It's done slightly differently from references to groups captured by the RewriteRule pattern itself (these use a $ prefix, as in $1) in that a reference to a captured group from a RewriteCond uses a % as prefix, as in %1. Hence, you can do:

RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^product\.php$ /product/%1? [R=301,L]

And you'll be redirecting product.php?id=123 to /product/123 using a 301. Notice the ? at the end of the rewritten URL - it's there to make sure mod_rewrite doesn't append the original query string.
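
On Apache 2.4 and later, the QSD (query string discard) flag does the same job as the trailing ? and is arguably easier to read. A sketch of the same redirect, assuming Apache 2.4+ and rules placed in the .htaccess file of the directory that holds product.php:

RewriteEngine On
# Capture the numeric id from the query string
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
# Redirect to the path form; QSD drops the original query string
RewriteRule ^product\.php$ /product/%1 [R=301,L,QSD]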

At this point, my woes were almost over - there was just one obstacle left.

Solution - part 3

When the Rewrite module does redirects, it normally also escapes characters in the URL that could otherwise turn out problematic. One such character is %. However, this escaping is itself very problematic upon redirects, because any URL-encoded URL (the output of PHP's urlencode(), say) will contain lots of % characters followed by a character code. When mod_rewrite is done with the URL, it'll have substituted all %2F with %252F, for instance ... not what you want.

There's a very simple solution to this, though: you can set a flag to stop the Rewrite module from doing any escaping. What you do is:

RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^product\.php$ /product/%1? [R=301,L,NE]

The NE (noescape) flag stops mod_rewrite from escaping the URL, leaving you with whatever was in the original.

Using the above bits and pieces you can rewrite a URL like

/search?blah=hum%20dinger%20and%20what%20not%3Aliteral

to

/search?q=hum%20dinger%20and%20what%20not%3Aliteral&rewritten=true
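
A minimal sketch of rules that would perform a rewrite like this one, assuming the old parameter is named blah, it is the only parameter in the query string, and the rules live in the .htaccess file at the document root (the actual rules used for the client are not shown here):

RewriteEngine On
# Capture the still-encoded value of the old blah parameter
RewriteCond %{QUERY_STRING} ^blah=([^&]*)$
# Redirect to the new parameter name; the new query string replaces the old one
# and NE keeps the %XX sequences from being escaped a second time
RewriteRule ^search$ /search?q=%1&rewritten=true [R=301,L,NE]

Because the redirected URL no longer has a query string starting with blah=, the condition does not match a second time and the redirect does not loop.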

It's voodoo for sure, but damned cool.
