Google Crawling, Canonicalization and Indexing Explained


Have you ever noticed that your blog posts get indexed in Google Blog Search within a few minutes, but take much longer to show up in the main index? Why are some posts indexed faster than others? Are sitemaps really helpful for indexing? How can you get your posts indexed faster?

These questions hit us all the time. The answers are explained here.

In this post we will see:
1. The Main Difference between Indexing and Crawling
2. Main Use of Sitemaps and Pings
3. Which blog posts are indexed faster
4. Google Canonicalization of Crawled Content

1. Difference between Indexing and Crawling

When you publish new content on your blog, Google may crawl the post but not index it in the main index. In the case of blogs, Google Blog Search may crawl your new URLs within a few minutes, yet they may still not make it into the main index.

Even though Google crawls your post, that does not mean the post will be indexed. Google follows a certain standardization process according to its policies: the crawled content is first analyzed for quality and only then sent to the main index. You can read below how this standardization is done. So crawling and indexing in Google are different things.

2. Main Use of Sitemaps and Pings

You might think that by submitting your sitemap to Google, or by pinging Google, your posts will definitely be indexed. In reality, Google uses submitted sitemaps to crawl your content, discover new URLs, and pick out the quality content to be indexed. Google does not guarantee that submitting a sitemap will get your content indexed; once again, the canonicalization process decides what ends up in the index.
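For context, a sitemap is simply an XML file that lists the URLs you want Google to discover, in the standard sitemaps.org format. A minimal sketch (the example.com URL and date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/my-new-post/</loc>
        <lastmod>2010-03-15</lastmod>
      </url>
    </urlset>

A ping does nothing more than tell Google that this file has changed, typically by requesting a URL such as google.com/ping?sitemap=http://www.example.com/sitemap.xml; it invites a fresh crawl, not a guaranteed indexing.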

3. Which blog posts are indexed faster

This is really the big question. According to Google, posts from well-known (popular) blogs that have good-quality content are indexed faster. Popular blogs with good content are crawled and indexed frequently. So if you want your posts to be indexed faster, make your blog well known and publish good content.

Your next question will be: what is good content? You can find the answer below.

4. Google Canonicalization of Crawled Content

Google has its own set of rules and policies that are applied to crawled content to decide whether to index it or not. This is the Google canonicalization process, and it is done to produce relevant, quality results. Any blog post with good content will pass this standardization process and be indexed within a short time. As far as we have seen, good content has a unique style and uses keywords that are frequently searched on Google. But the canonicalization process changes as Google's policies change, so we cannot permanently define what a good post is as far as Google is concerned.

So you have learned that in order to get your content indexed faster you need to:

  • Popularize your blog
  • Produce good, quality content
  • Use relevant keywords
  • Have a unique style of your own (do not copy and paste)

You can also watch Google's video explaining the details of blog indexing.


Google News Launches "Recrawl" Feature To Crawl Publisher Sites More Frequently

Google has just announced a new "recrawl" feature in Google News that sends the Google newsbot to visit publisher sites more often than ever.

What does this mean?

This means that if a news publisher publishes a new article and later corrects it for mistakes, typos, incorrect references, or a wrong headline, or even updates the URL, Google News will make sure that it includes the most recent version of the article in its news section.

 


An excerpt from the official Google blog post:

From the moment we discover a new article, we’ll keep revisiting it looking for changes. Since we’ve noticed that most changes to articles occur just after they’re published, we revisit articles most frequently in the first day after we’ve found them. In some cases, we’ll even revisit articles we had trouble crawling the first time around. After that, we visit them less often.

Readers will no longer see broken links or outdated headlines in the news section. Publishers will benefit greatly from this move, as the Google newsbot keeps tracking the article almost live.

That's a very welcome change in Google News crawling; it would be even more useful for bloggers if Google implemented such a feature for blogs.

Webpage In Search Results Even After Disallowing It In Robots.txt? Read This

The Problem: I have disallowed one of my webpages in my robots.txt file, and it's still appearing in search results. What's the deal?

Have you blocked a webpage or sub-folder of your website using robots.txt, and still wonder why it shows up in search results? If so, read on.

This is one of the most common complaints from webmasters, and fortunately we have an answer from Google webmaster guru Matt Cutts.

Let me take an example to show you how this can happen. Take the robots.txt of Google.com itself.

It's located at google.com/robots.txt. You can find many entries in that file for URLs that are part of Google yet blocked by Google itself. Let's pick one of them at random; here I've picked:

google.com/m/trends
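In robots.txt terms, such a block is just a Disallow rule. The relevant entry would look something like this (an illustrative sketch; the live google.com/robots.txt may differ):

    User-agent: *
    Disallow: /m/trends

The Disallow line tells compliant crawlers not to fetch any URL whose path begins with /m/trends.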

For some reason, Google has blocked this URL from being crawled by search engines. So if you search for google.com/m/trends, you'll probably see something like this in the snippet of the search results:

(Screenshot: the search result snippet for the blocked google.com/m/trends URL.)

Sometimes you'll find similar results for your own sites too, and may not be sure why that URL is showing up in the results. In general, it looks something like this:

(Screenshot: a typical search snippet for a URL blocked by robots.txt.)

According to Matt Cutts, this URL is not actually indexed. Google is just showing a link that it thinks may be of some use to search users.

So even if you have disallowed a URL in robots.txt, if someone links to it, Google may still consider it of some value to users and show it in the results, even though it hasn't actually crawled it. Sometimes it may even take the description snippet from places like the Open Directory Project, if the URL is listed there.

"If you truly don't want that page to be in the search results, use the 'noindex' HTML meta tag on the page. Another option is to use the URL Removal Tool from Google Webmaster Tools."
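For reference, the noindex directive is the standard robots meta tag placed inside the page's <head> section; a minimal sketch:

    <head>
      <meta name="robots" content="noindex">
      <title>A page you want kept out of the search results</title>
    </head>

Note that a crawler can only see this tag if it is allowed to fetch the page, so don't combine it with a robots.txt Disallow rule for the same URL, or Google may never read the noindex.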

Matt Cutts explains this in more detail in a video.

So the next time you see a URL that is blocked in robots.txt showing up in search results, you need not worry; just try the alternatives described above.

Special thanks to Matt Cutts and Google for letting us know about this.