Managing Your Site's Accessibility for Search Engines

Your site is of little use if search engines can’t index it. If you want it to show up in the search results, you need to make sure that search engines can access it. Sometimes, however, you'll want to restrict access to certain parts of your site, for example to hide irrelevant pages or private documents. In this article you'll learn how to manage your site's accessibility for search engines via a robots.txt file or the robots meta tag.

Benefits of Robots Files and Tags

Before we dig into the details of how to create a robots.txt file or robots meta tag, we should take a look at their benefits. There are some scenarios where their implementation might come in handy, such as:

Preventing duplicate content from being indexed (e.g. printable versions of pages).
Keeping incomplete pages out of the search results.
Restricting search engines from indexing confidential pages or files.

Duplicate content dilutes your SEO efforts as search engines find it hard to decide which version is the most relevant for the users' search query. This problem can be prevented by blocking duplicate pages via a robots file or tag. There's another way to manage duplicate content, but we'll discuss that later.

If you have new but incomplete pages online, it's best to block them from crawlers to prevent them from being indexed. This might be useful for new product pages, for example: if you want to keep them secret until launch, add a robots file or tag.

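Some websites have confidential pages or files that aren't protected by a login form. An easy way to hide these from search engines is via the robots.txt file or meta tag. Keep in mind that robots.txt is itself publicly readable and only well-behaved crawlers obey it, so it keeps pages out of the search results rather than securing them.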

Now that we know why we should manage the accessibility of certain pages, it's time to learn how we can do this.

The robots.txt File

Crawlers are workaholics. They want to index as much as possible, unless you tell them otherwise.

When a crawler visits your website, it will look for the robots.txt file. This file tells it which pages it may crawl and which it should ignore. By creating a robots.txt file you can prevent crawlers from accessing certain parts of your website.

The robots.txt file must be placed in the top-level directory of your site, for example: www.domain.com/robots.txt. The filename is case sensitive, so use lowercase letters only.
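
Because the file always lives at the top level, its location can be derived from any page URL on the same site. The snippet below is a minimal sketch of that rule in Python; the helper name robots_txt_url is made up for this example and the domain is the placeholder used above.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url.

    Hypothetical helper for illustration, not part of any library.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.domain.com/blog/post.html?page=2"))
# prints http://www.domain.com/robots.txt
```

A robots.txt file stored in a subdirectory, such as /blog/robots.txt, is not at this location and crawlers will not look for it there.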

Warning: if you add a robots.txt file to your website, please double-check for errors. You don’t want to inadvertently block crawlers from indexing important pages.

Creating a robots.txt File

robots.txt is a simple text file with one or more records. Each record has two elements: user-agent and disallow.

The user-agent element specifies which crawlers the record applies to. The disallow element tells those crawlers which parts of the website they may not access.

A record will look something like this:

User-agent: *
Disallow:

The record above gives search engines access to all pages. We use the asterisk (*) to target all crawlers, and because the disallow field is empty, they can access every page.

However, by adding a forward slash to the disallow field, we can prevent all crawlers from indexing anything from our website:

User-agent: *
Disallow: /
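
To see the difference between an empty disallow and a forward slash, we can feed both records to Python's standard-library robots.txt parser. This is only a sketch for checking the rules locally: the crawler name "AnyBot" and the page URL are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


allow_all = build_parser("User-agent: *\nDisallow:\n")
block_all = build_parser("User-agent: *\nDisallow: /\n")

# "AnyBot" and the URL are placeholders, not real crawler or site names.
page = "http://www.domain.com/page.html"
print(allow_all.can_fetch("AnyBot", page))  # True
print(block_all.can_fetch("AnyBot", page))  # False
```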

We can also choose to target a single crawler. Take a look at the example below:

User-agent: Googlebot
Disallow: /private-directory/

This record tells Googlebot, the crawler Google uses for web search, to stay out of the private directory. For a complete list of all crawlers, visit the web robots database.

Writing a separate record for every disallow would be time-consuming. Fortunately, we can add multiple disallow lines to the same record:

User-agent: Bingbot
Disallow: /sample-directory/
Disallow: /an-uninteresting-page.html
Disallow: /pictures/logo.jpg

This will prevent Bing from indexing the sample directory, the uninteresting page and the logo.
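
The sketch below combines the Googlebot and Bingbot records into one file and checks a few URLs with Python's standard-library parser. The domain and the page names notes.html and photo.jpg are assumptions added for the example. It shows that each record only affects the crawler it names.

```python
from urllib.robotparser import RobotFileParser

# Records are separated by a blank line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private-directory/

User-agent: Bingbot
Disallow: /sample-directory/
Disallow: /an-uninteresting-page.html
Disallow: /pictures/logo.jpg
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.domain.com"  # placeholder domain
print(parser.can_fetch("Googlebot", base + "/private-directory/notes.html"))  # False
print(parser.can_fetch("Bingbot", base + "/private-directory/notes.html"))    # True
print(parser.can_fetch("Bingbot", base + "/pictures/logo.jpg"))               # False
print(parser.can_fetch("Bingbot", base + "/pictures/photo.jpg"))              # True
```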

Wildcards

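robots.txt doesn't support full regular expressions, but major crawlers such as Googlebot and Bingbot understand two pattern characters: the asterisk (*), which matches any sequence of characters, and the dollar sign ($), which marks the end of a URL.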

For example, a lot of people use WordPress as a CMS. Visitors can use the built-in search function to find posts about a certain topic, and the URL for a search query has the following structure: http://domain.com/?s=searchquery.

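If I want to block search results from being indexed, I can disallow the start of that URL. Because a disallow value matches every URL that begins with it, the rule works like a trailing wildcard and covers every search query. The robots.txt record will look like this: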

User-agent: *
Disallow: /?s=

You can also use wildcards to prevent filetypes from being indexed. The following code will block all .png images:

User-agent: *
Disallow: /*.png$

Don’t forget to add the dollar sign at the end. It marks the end of the URL, so the rule only matches URLs that actually end in .png.
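
To make the matching rules concrete, here is a rough Python model of how a disallow value with * and $ could be compared against a URL path. It is a hand-written approximation for illustration, not the code any search engine runs, and the function names are made up for this example.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a disallow value containing * and $ into a regex.

    Illustrative model only; real crawlers may differ in details.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal piece, then let every * match any characters.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    if not rule:
        return False  # an empty disallow blocks nothing
    # re.match starts at the beginning of the path (prefix matching).
    return rule_to_regex(rule).match(path) is not None


print(is_blocked("/*.png$", "/pictures/logo.png"))      # True
print(is_blocked("/*.png$", "/pictures/logo.png?v=2"))  # False
print(is_blocked("/?s=", "/?s=searchquery"))            # True
print(is_blocked("/?s=", "/about/?s=searchquery"))      # False
```

The second call shows what the dollar sign does: a URL that continues after .png is no longer matched.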

Testing Your robots.txt File

It’s always a good idea to test your robots.txt file to see if you’ve made any mistakes. You can use Google Webmaster Tools for this.

Under ‘health’ you’ll find the ‘blocked urls’ page. Here you’ll find all the information about your file. You can also test changes before uploading them.
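
As a quick local sanity check before uploading, you could also run a draft through Python's standard-library parser and list the important URLs it would block. This is a sketch under assumptions: the helper name blocked_urls, the draft rules and the URLs are invented for the example, and the standard-library parser may not interpret every rule exactly the way a particular search engine does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that user_agent may not fetch (illustrative helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Draft rules and URLs are placeholders for your own.
draft = """\
User-agent: *
Disallow: /sample-directory/
"""
important = [
    "http://www.domain.com/",
    "http://www.domain.com/products.html",
    "http://www.domain.com/sample-directory/pricing.html",
]
print(blocked_urls(draft, "Googlebot", important))
# prints ['http://www.domain.com/sample-directory/pricing.html']
```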

Robots Meta Tag

The robots meta tag is used to manage crawler access to a single page. It tells search engines whether the page may be indexed or archived, and whether the links on the page may be followed.

This is what the robots meta tag looks like:



<head>
	<meta name="robots" content="noindex" />
</head>

This meta tag prevents crawlers from indexing the web page. Besides “noindex” there are several other values that might be useful:

index: this page can be indexed.
noindex: this page can’t be displayed in the search results.
follow: the links on this page can be followed.
nofollow: the links on this page can’t be followed.
archive: a cached copy of this page is permitted.
noarchive: a cached copy of this page isn’t permitted.

Multiple attributes can be used in a single robots meta tag, for example:



<head>
	<meta name="robots" content="noindex, nofollow" />
</head>

This markup prevents crawlers from indexing the page and following its links.
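
If you want to check which values a page actually carries, the tag is easy to read with Python's built-in HTML parser. The class and function names below are made up for this sketch, and the page string is a stand-in for your own HTML.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every robots meta tag (illustrative class)."""

    def __init__(self):
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Values are comma separated; ignore case and extra spaces.
            self.directives += [
                value.strip().lower() for value in content.split(",") if value.strip()
            ]


def robots_directives(html: str) -> list[str]:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<head><meta name="robots" content="noindex, nofollow" /></head>'
print(robots_directives(page))  # ['noindex', 'nofollow']
```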

If you happen to use conflicting values, Google will apply the most restrictive one. Let’s say you use “index” and “noindex” in the same tag: the page will not be indexed, just to be safe.
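
The "most restrictive option wins" rule can be written down in a few lines of Python. This is a simplified model of the behaviour described above, limited to the six values listed in this article; the function name is an assumption, and it treats the permissive value as the default when neither value of a pair is present.

```python
# (permissive, restrictive) pairs taken from the list of values above.
PAIRS = [("index", "noindex"), ("follow", "nofollow"), ("archive", "noarchive")]


def effective_directives(content: str) -> list[str]:
    """Resolve a content attribute to one value per pair (illustrative)."""
    given = {value.strip().lower() for value in content.split(",")}
    # The restrictive value wins whenever it appears at all.
    return [
        restrictive if restrictive in given else permissive
        for permissive, restrictive in PAIRS
    ]


print(effective_directives("index, noindex"))
# prints ['noindex', 'follow', 'archive']
```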

Do I use robots.txt or Meta Tags?

As we've discussed, there are two ways to manage the accessibility of web pages: a robots.txt file and meta tags.

The robots.txt file is great for blocking complete directories or certain file types. With a single line of text you can do a lot of work (and potentially a lot of damage!). But if you want to block an individual page, it’s best to use the robots meta tag.

Sometimes URLs that are blocked via the robots.txt file can still appear in the search results. When there are a lot of links pointing to the page and Google believes it's the only relevant result for the search query, it will still show up. If you absolutely don’t want the page to be displayed, you should add the noindex meta tag and make sure the page isn't blocked in robots.txt, because a crawler has to fetch the page before it can see the tag. This may sound complicated, but Matt Cutts explains everything in detail in Uncrawled URLs in search results on YouTube.
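
One pitfall follows from this: a crawler can only read a noindex tag on a page it is allowed to fetch, so a page that is both disallowed in robots.txt and tagged noindex never gets its tag seen. The sketch below flags such pages. The helper name, the rules and the URLs are assumptions for illustration, and the meta values are supplied by hand rather than downloaded.

```python
from urllib.robotparser import RobotFileParser


def unseen_noindex_pages(
    robots_txt: str, user_agent: str, meta_by_url: dict[str, str]
) -> list[str]:
    """List pages whose noindex tag is hidden behind a disallow rule.

    meta_by_url maps each URL to the content value of its robots meta tag.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    unseen = []
    for url, content in meta_by_url.items():
        values = [value.strip().lower() for value in content.split(",")]
        if "noindex" in values and not parser.can_fetch(user_agent, url):
            unseen.append(url)
    return unseen


# Placeholder rules and pages.
robots_txt = "User-agent: *\nDisallow: /private-directory/\n"
pages = {
    "http://www.domain.com/private-directory/report.html": "noindex",
    "http://www.domain.com/drafts/new-product.html": "noindex, nofollow",
    "http://www.domain.com/index.html": "index, follow",
}
print(unseen_noindex_pages(robots_txt, "Googlebot", pages))
# prints ['http://www.domain.com/private-directory/report.html']
```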

Conclusion

With the robots.txt file and robots meta tags you can easily manage your site’s accessibility for search engines.

Don’t forget to check and double-check your meta tags and robots.txt file so that you don’t inadvertently block crawlers from indexing important pages.