Mostly you’ll want search engines to crawl and index everything on your site, or rather everything that a visitor should see. There are, though, plenty of reasons for preventing some content or pages from being crawled and indexed. In this post we’ll go through some techniques to prevent pages from being seen by search engines.
Technically, if something is crawlable, a search engine will find it, regardless of whatever “protection” you put in the way. Whether it reacts by keeping the page out of its index is another matter. These techniques tend to work, but don’t be surprised if something slips through. If it does, there are options for that too.
The types of page/content that people want to remain un-indexed are usually:
- PPC landing pages, e.g. keeping these out of the organic SERPs stops PPC data getting skewed
- Downloadable versions of site content, e.g. PDF data sheets or case studies
- Print friendly pages
…you will often see those options in a page’s footer.

There are other reasons as well. Here’s one that sometimes crops up when dealing with coding for accessibility and the DDA (the UK’s Disability Discrimination Act).
Sometimes accessibility considerations aren’t handled well. For example, if a site offers its pages in various font sizes but is coded so that each font size gets its own version of the page, it may produce URLs like this:
http://www.website.com/responsibility/style=styles-regular
http://www.website.com/responsibility/style=styles-medium
http://www.website.com/responsibility/style=styles-large
…so there could be 3 versions of every site page, or 4 if the base page and the “regular” version co-exist.
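To get a feel for how much duplication this creates, here is a rough Python sketch that groups a list of crawled URLs by their base page. It assumes the variants always end in a `/style=styles-<name>` segment like the examples above; the function name and the second sample page are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: every font-size variant ends in "/style=styles-<name>",
# as in the example URLs above.
STYLE_SUFFIX = re.compile(r"/style=styles-[a-z]+$")


def group_by_base_page(urls):
    """Map each base page URL to every URL that serves a version of it."""
    groups = defaultdict(list)
    for url in urls:
        base = STYLE_SUFFIX.sub("", url)  # strip the style segment, if present
        groups[base].append(url)
    return dict(groups)


crawled = [
    "http://www.website.com/responsibility",
    "http://www.website.com/responsibility/style=styles-regular",
    "http://www.website.com/responsibility/style=styles-medium",
    "http://www.website.com/responsibility/style=styles-large",
    "http://www.website.com/about",  # hypothetical page with no variants
]

groups = group_by_base_page(crawled)

if __name__ == "__main__":
    for base, versions in groups.items():
        print(base, "has", len(versions), "crawlable version(s)")
```

Any base page with more than one version listed against it is a candidate for exclusion.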
Fixing it
Preventing these pages being crawled can be tackled by using the SEO’s best mate, the robots.txt file. There are loads of good resources out there on how to define and write a functional robots.txt, so I won’t delve too deep. What is important is that you can identify the pages that you don’t want indexed, and then justify to yourself that their exclusion would be beneficial.
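As a quick illustration, here is a minimal sketch of a robots.txt and a way to sanity-check it with Python’s standard `urllib.robotparser` before it goes live. The directory names (`/landing/`, `/print/`, `/downloads/`) are assumptions for the sake of the example, not a recommendation for how your site should be laid out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the directory names are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /landing/
Disallow: /print/
Disallow: /downloads/
"""


def build_parser(robots_txt):
    """Load robots.txt rules from a string rather than fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)


def can_crawl(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "http://www.website.com/responsibility",
        "http://www.website.com/print/responsibility",
        "http://www.website.com/downloads/case-study.pdf",
    ):
        print(url, "->", "allowed" if can_crawl(url) else "blocked")
```

Note that this parser only does simple prefix matching on paths. Blocking the `style=` URLs from the earlier example would need a pattern rule, which the major search engines support as an extension but which, as far as I know, the standard-library parser does not evaluate.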
There is also the option to use meta tags on individual pages; again, this is well documented elsewhere.
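The tag in question is the robots meta tag, which sits in a page’s head and looks like `<meta name="robots" content="noindex, follow">`. Below is a small Python sketch, using only the standard library, for checking whether a page’s HTML carries a noindex directive. The class and function names are my own, and the sample markup is invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Only the robots meta tag matters here; ignore description, etc.
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def is_noindex(html):
    """Return True if the HTML asks search engines not to index the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical print-friendly page used as sample input.
SAMPLE_PAGE = """
<html>
  <head>
    <title>Responsibility (print version)</title>
    <meta name="robots" content="noindex, follow">
  </head>
  <body>Print-friendly content</body>
</html>
"""

if __name__ == "__main__":
    print("noindex present:", is_noindex(SAMPLE_PAGE))
```

Running a check like this over your print-friendly or landing page templates is a cheap way to confirm the tag actually made it into the published markup.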
Now, what if the pages have already been indexed? You’ve found pages you don’t want found and you’ve set up your preventative measures, but how can you get those pages un-indexed, delisted, removed from the SERPs?
Google Webmaster Tools offers help in the form of its URL removal request option. From the Site configuration options, go to Crawler access, and then to the Remove URL tab.

Before you start making removal requests, it is useful to understand that it is not a guaranteed process, and that some sites and people have problems, at least according to the forum.