
Search Engine Indexing For Deep Web Pages

May 25, 2007

Google, MSN and Yahoo! are the most used search engines, so being found by them is important for any web page. However, a growing number of pages cannot be indexed by search engines and remain invisible to searchers, even though they contain a lot of relevant content.

This “invisible” part of the Web actually accounts for a large percentage of the pages on the Internet. In fact, one could think of the World Wide Web as a gigantic iceberg of which only the tip above the surface can be seen.

The challenge for webmasters is to help search engines index all the pages in a website.

Why Some Content Does Not Get Indexed By Search Engines

There can be many reasons behind this:

  • Websites that are too large to be completely indexed: Some websites have millions of pages, but only a small percentage get indexed by search engines because of the depth of the site.
  • Pages are protected by the webmaster: A “robots.txt” file, or robots “noindex” or “noarchive” meta tags in the HTML code of a web page, can prevent search engines from accessing or indexing the content (a short example follows this list).
  • Pages with dynamic content: Pages that are the result of a submitted query and consequently do not have a static URL can be impossible for search engines to find, since the spiders cannot replicate the query submission carried out by human beings.
  • Password-protected pages: Many websites restrict access to some pages by requiring a login or password. Pages that can only be reached after entering a password cannot be crawled by search engine spiders.
  • Isolated / floating pages: Search engines generally start from a site’s index page and crawl to other pages by following links. A page that is not linked to from any other page is therefore much harder for spiders to find.
  • Pages with only JavaScript or Flash-based content cannot be easily indexed.
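For example, a single robots meta tag in a page’s head is enough to keep that page out of the indexes. The snippet below is a minimal, hypothetical illustration (the page and its content are invented): removing the tag, along with any robots.txt “Disallow” rule covering the URL, is often the quickest way to make such a page indexable again.

    <html>
      <head>
        <title>Product catalogue</title>
        <!-- This single line tells spiders not to index or archive the page -->
        <meta name="robots" content="noindex, noarchive">
      </head>
      <body>
        <p>Content that will never appear in search engine results.</p>
      </body>
    </html>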

Making Invisible Content Visible

A good site search facility is perhaps the best tool a webmaster can offer to help users find content on a site, but it does nothing for spiders. There are many ways to get all of a website’s content indexed by search engines; the most important is to put links to the invisible pages on pages that are already indexed.

It is also worth thinking about how database content is exposed. Many database-driven sections have a visible entry page, but the deeper pages can only be reached through a query. Giving users browsable access to the database content, typically in the form of an online catalogue, is a great idea: users can conduct “manual” searches for specific information or products, and spiders can follow the same static links (see the sketch below).
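As a rough sketch (all URLs here are invented for illustration), a browsable catalogue page gives spiders plain, static links to follow, whereas a search form on its own leaves the results pages unreachable, because spiders do not fill in and submit forms.

    <!-- Static category links that a spider can crawl -->
    <h2>Browse the catalogue</h2>
    <ul>
      <li><a href="/catalogue/books/">Books</a></li>
      <li><a href="/catalogue/books/history/">Books - History</a></li>
      <li><a href="/catalogue/music/">Music</a></li>
    </ul>

    <!-- A query-only interface: spiders cannot submit this form,
         so pages reachable only through it stay invisible -->
    <form action="/search" method="get">
      <input type="text" name="q">
      <input type="submit" value="Search the catalogue">
    </form>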

Archives often contain many pages that are not usually indexed by search engines.

Specialized search engines, directories and portals such as Google Scholar, Complete Planet, Google Book Search, Pipl and Infomine can be used to get content listed in search databases that the traditional search engines would otherwise ignore.

Other practical tips that can increase the number of pages indexed are:

  • Creating a comprehensive HTML sitemap with links to the main pages or sections, which will in turn link to other pages (a minimal example follows this list).
  • Coding the entire site in HTML and converting documents in formats such as PDF, Word and Excel into HTML pages. Where that is not possible (e.g. for video or audio), transcribing the information as supplementary text, so that search engines can tell what the content is and index and rank it where it deserves.
  • Obtaining deep links from other relevant websites to the invisible pages.
  • Bookmarking the deep / invisible pages on social bookmarking sites like del.icio.us, Google Bookmarks and Yahoo! MyWeb.
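To illustrate the first tip, here is a minimal sketch of an HTML sitemap page (the section names and URLs are placeholders). Linked from the home page, it puts every deep page within a few clicks of a page that is already indexed.

    <h1>Site map</h1>
    <ul>
      <li><a href="/">Home</a></li>
      <li><a href="/products/">Products</a>
        <ul>
          <li><a href="/products/category-a/">Category A</a></li>
          <li><a href="/products/category-b/">Category B</a></li>
        </ul>
      </li>
      <li><a href="/archive/2006/">Archive (2006)</a></li>
      <li><a href="/contact/">Contact</a></li>
    </ul>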