.htaccess, 301 Redirects and the Subtleties of Syntax
By now who doesn’t know that duplicate content is a bad thing for SEO? And although the universe of people who have heard of the famous “www resolve” issue is a smaller group, it’s still undoubtedly the vast majority of professional SEO consultants.
Why a Single Web Page Can Look Like Duplicate Content to Google
If you know what I’m talking about you can skip down to the next subheading. If you need a refresher, read on.
This realization is crucial: Google does not deal in web pages, it deals in URLs. So if you have two URLs that are different, but they point to the same page, Google will see those URLs as different “pages” with the same content. In other words, Google may very well see that as duplicate content.
Enter the wild and wooly www. Although the “www” prefix added to website domain names is a relic of those good old days when people actually called it the “World Wide Web,” its legacy continues. Put 100 people in a room with a computer and a browser, tell them to go to the CNN website and probably half of them will enter www.cnn.com and the other half, the impatient half, the half who, like me, believes their lives will be shortened by the use of excess keystrokes, will simply enter cnn.com. (After all, counting the “dot,” eliminating the “www” saves me four precious keystrokes – hurray for me!).
So, in order to please the whole universe, web servers will typically accept both versions of the URL and you’ll end up looking at all the news that’s fit to, well, print (at least in CNN’s biased opinion). Additionally, web servers will often allow people to input the simple domain name and actually see a file called something like “index.html.” So, using the CNN example, you could potentially enter the following four addresses into the address bar of your browser…
- cnn.com
- www.cnn.com
- cnn.com/index.html
- www.cnn.com/index.html
…and each time you would arrive at the home page of CNN. Yet, Google thinks of 4 URLs as 4 pages, right? Uh oh, duplicate content. So what do we do about it?
Using Redirects to Make Sure Google Doesn’t Go Stupid on Us
If you’ve spent more than a few hours investigating the SEO implications of this, you know that the enlightened way to deal with the above scenario is the search-engine-friendly 301 redirect. When we want to direct web traffic (and search engine spiders) from an obsolete, removed, or simply mistaken URL to a valid URL, as web developers we have the choice of either a 302 (temporary) redirect or a 301 (permanent) redirect. A redirect basically says, “hey, careless, you typed cnn.com into your browser, and that’s not valid, but I’m going to do you a favor and send you over to www.cnn.com instead.”
At this point you could ask all sorts of questions about the relative advantages and disadvantages of 301 vs. 302 redirects, but why waste the time? Google blesses the 301 and damns the 302, so we know which type of redirect we want to use on our sites – All hail Google! And if you do already know this, you know that the best way of handling this, at least on Apache web servers, is through the .htaccess file, a small text file that has big implications for how your website behaves. (For more on what an .htaccess file is, and even how to configure it, check out the official page here.)
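In mod_rewrite terms, the difference between the two comes down to the status code given in the rule's R flag. The fragment below is only a sketch: the file names are hypothetical and have nothing to do with the site discussed later.

```apache
RewriteEngine On

# Hypothetical file names, for illustration only.

# [R] with no status code sends a 302 (temporary) redirect.
RewriteRule ^summer-sale\.html$ /sale.html [R,L]

# [R=301] sends a 301 (permanent) redirect, the kind we want here.
RewriteRule ^old-jerseys\.html$ /jerseys.html [R=301,L]
```

Leaving the "=301" off is an easy mistake to make, because the rule still appears to work in a browser; only the status code differs.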
If you put a few lines of code into an .htaccess file, and place that file in the public root of your website (that’s the primary, top-level folder that is accessible by website visitors), then your problem with this situation is solved.
But wait. What is that little bit of code anyway? Since I’m not a web server guru, I’m at the mercy of what I find online for this trick, and as I recently found out, what you find online is not always exactly what you need. Let’s take a look at an example site and see how different flavors of .htaccess code that you find online can affect it.
First, the example. Here I checked the site I was working on, namely bicycle jersey reseller ecyclingstore.com, using the handy little redirect tool found at RagePank for just such a purpose. Here’s what I got:
As you can see from the screen capture, even though I had placed what I thought was the necessary code in an .htaccess on their site, I was surprised at the results, since it’s showing 4 different 200 results (a 200 code returned by a website says, “hey, you bet we have that page, and here it is!”) on 4 different urls, only one of which is a real page.
Here’s the code that was in my .htaccess file. And this is code I find all over on the Internet:
Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_HOST} ^ecyclingstore.com$ [NC]
RewriteRule ^(.*)$ http://www.ecyclingstore.com/$1 [R=301,L]
I looked at the rewrite rule found on that redirect checker. They suggested something almost, but not quite, the same:
RewriteEngine on
RewriteCond %{HTTP_HOST} !^www.ecyclingstore.com
RewriteRule (.*) http://www.ecyclingstore.com/$1 [R=301,L]
When I modified the .htaccess rule and did a rerun on the redirect checker, I got this:
Much better, but I still have two 200 responses and I want to get that down to one and only one 200 response. RagePank recommended another bit of code, as follows:
Options +FollowSymLinks
RewriteCond %{THE_REQUEST} ^.*/index.php
RewriteRule ^(.*)index.php$ http://www.ecyclingstore.com/$1 [R=301,L]
Once I installed that code to .htaccess everything was hunky dory.
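For reference, here is roughly how the two pieces might fit together in one file. This is a sketch under a few assumptions, not the exact file from the site: the dots in the host name and in index.php are escaped so they match only literal dots, the THE_REQUEST pattern is tightened so it fires only when the visitor actually asked for index.php, and the index.php rule is placed first so that a request for ecyclingstore.com/index.php is fixed in one hop instead of two.

```apache
# Some hosts do not allow Options in .htaccess; if this line causes
# a 500 error, try removing it.
Options +FollowSymLinks
RewriteEngine On

# 1. A direct request for index.php goes to the bare directory URL.
#    THE_REQUEST holds the original request line (for example
#    "GET /index.php HTTP/1.1"), so the rule does not fire when the
#    server itself maps "/" to index.php internally.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/([^/?\s]+/)*index\.php[?\s]
RewriteRule ^(.*)index\.php$ http://www.ecyclingstore.com/$1 [R=301,L]

# 2. Any request whose host is not www.ecyclingstore.com goes to the
#    same path on the www host.
RewriteCond %{HTTP_HOST} !^www\.ecyclingstore\.com$ [NC]
RewriteRule ^(.*)$ http://www.ecyclingstore.com/$1 [R=301,L]
```

With something like this in place, the four home page variants should come back as one 200 and three 301s, but as with any code lifted from a web page, check it against your own site before trusting it.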
The Conclusion You Should Not Draw From This
I don’t want you to think that the only relevant way to do an .htaccess redirect is what I’ve shown here. For one thing, this code gets pretty hairy and I’m no expert at it.
The conclusion you should draw is that you need to test your .htaccess code once you get everything set up. I’ve left the wrong .htaccess code in place for years because I fell prey to the second classic blunder of all time (the first of which is, of course, “never get involved in a land war in Asia”), namely: never trust anything you pull off the Internet without verifying it.
PostScript: Don’t Even Trust the Redirect Checkers
As I was going to press with this blog post I took another look at RagePank’s redirect checker (mentioned above). Something in the results it was giving me got me suspicious. After a bit of investigation using a more classic, much more mundane-looking tool, namely Web Sniffer, I found that RagePank can return a 301 code on URLs that get a 404 on Web Sniffer. So I guess you even need to check the checkers.
Hi Ross, your Meatball SEO class is still having great payoffs months down the road. I just did my 301 redirects for this client and noticed that in Web Sniffer my changes showed up perfectly: one 200 and three 301s. Yet even 10 minutes later, RagePank was still showing all four as 200, and even getting a fresh copy of their redirect tool didn’t cause it to show updated results.
Also, it may be worth noting that GoDaddy (I know, I know… but I didn’t have any say in the matter) has a tool in Hosting > Settings > URL Redirects that edits the .htaccess file via a very simple UI. I checked Hostica and it looks as if they do not have a similar tool. Don’t know about other hosting companies.
— Franklin
BTW, here’s how GoDaddy’s tool rewrote my .htaccess file:
rewriteengine on
rewritecond %{HTTP_HOST} ^rimtours.com$
rewriterule ^index\.php$ "http\:\/\/rimtours\.com\/" [R=301,L] #521f9b0fc400d
rewritecond %{HTTP_HOST} ^www.rimtours.com$
rewriterule ^index\.php$ "http\:\/\/rimtours\.com\/" [R=301,L] #521f9aeace894
rewritecond %{HTTP_HOST} ^www.rimtours.com$
rewriterule ^$ "http\:\/\/rimtours\.com\/" [R=301,L] #521f9ab58032b