Every once in a while I have to do some .htaccess rewriting, and every
time I end up deeply fascinated by the possibilities
that it offers. This time round the situation was as follows: for a
client we had done some advanced search functionality, which uses fairly
detailed URLs to store the search (the search is parsed into an abstract
syntax tree, then serialized to a URL). The problem now was that we had
rewritten the serialized syntax ... and somehow a famous search engine
had picked up on the URLs and was getting bad results from them. What to
do?
The solution was URL rewriting using .htaccess. However, it wasn't
straightforward URL rewriting - the serialized search was encoded in
the query string, and that was the only part that needed a rewrite. How
does the Apache Rewrite module handle this? Very well, it turns out.
Solution - part 1
The first thing to realise is that RewriteRules by themselves won't do
any good - they only match against the path part of the URL, ignoring
the query string. This means you have to turn to the second part of the
rewrite: the RewriteCond.
Now, this presented the first part of my eye-opening experience:
RewriteCond allows for using regexes. This means you can do the
following:
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
And you'll match on query strings that start with id= followed by one or
more digits. You can use pretty much any Perl-compatible regular
expression you desire ... which makes it very powerful!
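To give an idea of how far the regex can be pushed, here is a sketch (my own illustration, not taken from the project described above) that matches the id parameter wherever it appears in the query string rather than only at the start. It assumes mod_rewrite is enabled and that the directives live in an .htaccess file next to a product.php script:

```apache
RewriteEngine On

# Match id=<digits> as a complete parameter anywhere in the query string:
# it must sit at the very start or directly after an &, and must be
# followed by an & or the end of the string.
RewriteCond %{QUERY_STRING} (?:^|&)id=([0-9]+)(?:&|$)

# A condition only takes effect together with the rule that follows it.
# "-" means "no substitution", so this rule changes nothing by itself.
RewriteRule ^product\.php$ - [L]
```

With this pattern, both product.php?id=42 and product.php?lang=en&id=42 satisfy the condition, while product.php?pid=42 does not.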
Solution - part 2
As you can probably guess from the above code bit, you can also use
capturing groups in the RewriteCond regexes. Not only that, though: you
can reference these captured groups in a RewriteRule. It's done slightly
differently from backreferences to the RewriteRule pattern itself (those
use a $ prefix, as in $1): a reference to a captured group from a
RewriteCond uses a % as prefix. Hence, you can do:
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^product\.php$ /product/%1? [R=301,L]
And you'll be redirecting product.php?id=123 to /product/123 using a 301.
Notice the ? at the end of the rewritten URL - it's there to make sure
mod_rewrite doesn't append the original query string.
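The two kinds of backreference can be mixed in a single substitution. The sketch below is only an illustration: the category.php script is a hypothetical second page, and the setup assumes the rules sit in an .htaccess file in the document root. $1 is taken from the RewriteRule pattern and %1 from the RewriteCond pattern:

```apache
RewriteEngine On

# %1 = the digits captured from the query string by the RewriteCond.
RewriteCond %{QUERY_STRING} ^id=([0-9]+)$
# $1 = "product" or "category", captured from the path by the RewriteRule.
# The trailing ? discards the original query string.
RewriteRule ^(product|category)\.php$ /$1/%1? [R=301,L]

# On Apache 2.4 and later, the QSD (query string discard) flag can be used
# instead of the trailing ?:
# RewriteRule ^(product|category)\.php$ /$1/%1 [R=301,L,QSD]
```

Under these assumptions, category.php?id=7 would be redirected to /category/7.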
At this point, my woes were almost over - there was just one obstacle
left.
Solution - part 3
When the Rewrite module does redirects, it normally also escapes
characters in the URL that could otherwise turn out to be problematic.
One such character is %. However, this escaping is itself very
problematic upon redirects, because any URL-encoded value will contain
lots of % characters, each followed by a character code. When
mod_rewrite is done with the URL, it'll have turned every %2F into
%252F, for instance ... not what you want.
There's a very simple solution to this, though: you can set a flag to
stop the Rewrite module from doing any escaping. What you do is:
RewriteCond %{QUERY_STRING} ^id=([0-9]+)
RewriteRule ^product\.php$ /product/%1? [R=301,L,NE]
This stops mod_rewrite from escaping the URL, leaving you with whatever
was in the original.
Using the above bits and pieces you can rewrite a URL like
/search?blah=hum%20dinger%20and%20what%20not%3Aliteral
to
/search?q=hum%20dinger%20and%20what%20not%3Aliteral&rewritten=true
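A minimal sketch of what such a rule might look like, assuming the old parameter is called blah as in the example, that it is the only parameter in the query string, and that /search is served from the directory holding the .htaccess file. The real rewrite also had to translate the serialized syntax itself, which this sketch leaves out:

```apache
RewriteEngine On

# Capture everything after blah= up to the end of the query string.
# QUERY_STRING is not decoded, so %1 still contains %20, %3A and so on.
RewriteCond %{QUERY_STRING} ^blah=([^&]*)$

# The substitution carries its own query string, so the original one is
# replaced rather than appended. NE keeps the existing percent-encoding
# from being escaped a second time.
RewriteRule ^search$ /search?q=%1&rewritten=true [R=301,L,NE]
```

The redirected URL no longer starts with blah=, so the condition fails on the follow-up request and the rule doesn't loop.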
It's voodoo for sure, but damned cool.