What are crawling, fetching, indexing, and inclusion?

"No, doesn't the index tag mean that the spider may fetch the page, so noindex means the page may not be fetched?! Then why do the last few points in the article say that 'the noindex tag does not save crawl budget. The search engine needs to know that the page has a noindex tag, so it has to fetch the page first, so no crawl budget is saved'?"

This comment shows that the reader did not really understand what fetching is, what indexing is, or what the index and noindex tags mean. The noindex tag does not forbid fetching the page; it forbids indexing the page. The two mean different things and serve different purposes.

Reading SEO blogs and forums, you get the sense that many SEOs do not understand what crawling, fetching, indexing, and inclusion mean or how they differ, nor what noindex, nofollow, and the robots file each do. Without a precise grasp of these concepts, it is hard to know what to do with a large site structure: which pages need to be fetched, which need to be indexed, and which should be blocked from fetching or from indexing. Judging by many of the comments on the crawl budget post, when I discussed how to handle these situations, some readers simply could not follow what I was saying.

I assumed I had already covered such basic, important, and easily confused SEO concepts on this blog. After reading that comment I went back through my old posts and found that I never had. They are covered in my SEO book, but not in a post here. Today I'm making up for that.

What is crawling?

Crawling is the process by which a search engine spider parses the links on a known page, extracts the URLs they point to, and thereby discovers new pages (that is, the URLs the links point to). The spider does not fetch a new URL the moment it finds it. Instead, it stores the discovered URL in an address library of URLs waiting to be fetched, and later pulls URLs out of that library in some order.
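
The discovery step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Frontier` class, the example.com URLs, and the first-in-first-out order are all made up here, and real search engines schedule URLs with far more elaborate priorities.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


class Frontier:
    """Hypothetical 'address library' of URLs waiting to be fetched."""

    def __init__(self):
        self.queue = deque()  # URLs to fetch, in discovery order (an assumption)
        self.seen = set()     # every URL discovered so far, to avoid duplicates

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def discover(page_url, html, frontier):
    """Crawling: resolve the links on a known page and queue the new URLs."""
    parser = LinkParser()
    parser.feed(html)
    for href in parser.hrefs:
        frontier.add(urljoin(page_url, href))


frontier = Frontier()
html = '<a href="/a.html">A</a> <a href="b.html">B</a> <a href="/a.html">A again</a>'
discover("https://example.com/dir/index.html", html, frontier)
print(list(frontier.queue))
```

Nothing is fetched here: `discover` only records URLs, and a separate step would later call `next_url()` to decide what to request.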

What is fetching?

Fetching is when the search engine spider takes a URL from the to-be-fetched address library, requests it, and stores the HTML it reads in a database. A spider fetch opens the page much as a browser does and, just like a visit from a user's browser, leaves a record in the server's raw access log.

What is the index?

Indexing means organizing the information about a URL and storing it in a database, namely the index library. When a user searches, the search engine retrieves URL information from the index library, ranks it, and displays it. The index library exists to serve searches, so an indexed URL can be found by users, while a URL that is not indexed is invisible to them in the search results.

Note that "the information about a URL" is not limited to the content the spider fetched from that URL. It also includes information from other sources, such as external links and their anchor text. Sometimes the index library holds no content fetched from the URL at all, yet the search engine knows the URL exists and has some other information about it.

Crawling and indexing are not the same thing.

What is inclusion?

Personally, I see no real difference between inclusion and indexing. Inclusion is simply the view from the searcher's side: if a URL can be found in search, it is included. From the search engine's side, a URL is included when its information exists in the index library. English has no separate word for inclusion; it uses the same word, index.

What is the role of noindex?

Putting a meta noindex tag in the page's head tells the search engine not to index the URL. Users then cannot find the URL when searching, and it is not returned in the search results list.

noindex does not tell the search engine not to fetch the URL. In fact, for noindex to work, the URL must be fetched first. Otherwise, how would the search engine see the noindex tag in the page's HTML?
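
To make the dependency concrete, here is a minimal sketch of how a spider might look for the tag. The class and function names are hypothetical, and it only handles the generic `<meta name="robots">` form. The point is that the check takes the page's HTML as input, so it cannot run until the page has been fetched.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def may_index(html):
    """Return False if the fetched HTML carries a noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(may_index(page))             # False
print(may_index("<html></html>"))  # True
```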

What is the role of the robots file?

The robots file tells the search engine not to fetch certain URLs. Note that it says not to fetch; it does not say not to index. This is exactly the opposite of noindex, which forbids indexing but not fetching.
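
Python's standard library includes a robots file parser, which makes it easy to see that the answer it gives is only about fetching. The robots file content and the example.com URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file that blocks one spider from the whole site.
robots_txt = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

# can_fetch answers one question only: may this spider request this URL?
print(rules.can_fetch("Baiduspider", "https://example.com/page.html"))     # False
print(rules.can_fetch("Googlebot", "https://example.com/page.html"))       # True
print(rules.can_fetch("Googlebot", "https://example.com/private/x.html"))  # False
```

There is no method here that says anything about indexing, which mirrors the point above: the robots file controls fetching and nothing else.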

What is the role of nofollow?

Adding the nofollow attribute to a link tells the search engine not to follow that link, and to treat it as if it did not exist. Note that nofollow only tells the spider not to follow this one link. It does not say the URL the link points to may not be fetched, nor that it may not be indexed. nofollow prohibits neither fetching nor indexing.
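
As a rough sketch of what "not following" means during link discovery, the hypothetical parser below sorts the links on one page into those a spider would follow and those marked nofollow. The HTML snippet is made up.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Splits the links on a page into followed and nofollow ones."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel may hold several space-separated values, e.g. "nofollow noopener".
        rel_values = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


html = (
    '<a href="/login" rel="nofollow">Log in</a>'
    '<a href="/products">Products</a>'
)
parser = FollowableLinks()
parser.feed(html)
print(parser.followed)  # ['/products']
print(parser.nofollow)  # ['/login']
```

Skipping `/login` here is a decision about this one link on this one page. If another page links to `/login` without nofollow, the URL is still discovered from there.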

With the concepts defined, here are a few points that SEOs often get wrong:

Pages that have not been fetched can still be indexed

In other words, even though the spider has not visited and fetched a page (for example, because the robots file blocks it), the index library can still hold information about that page, and users can see it when they search.

For example, Taobao's entire site uses the robots file to forbid Baidu's spider from fetching it, but does not use noindex to forbid indexing. (As explained above, once fetching is forbidden there is no way to forbid indexing, because a page that is never fetched never has its noindex tag seen.) So even though Baidu does not visit or fetch Taobao's pages, many of them are indexed by Baidu and users can find them in search.


Baidu knows that Taobao's homepage exists from the many links to it across the web. From the anchor text of those links it also knows that the page's title is probably something like "Taobao", and it naturally knows the number of reviews on its own word-of-mouth review service. So even though the Baidu spider has not fetched Taobao's homepage, users can still find it, displayed with the bits of information Baidu does have.

What would Taobao have to do to keep Baidu from returning its homepage at all? Lift the prohibition in the robots file, and put noindex on the page to forbid indexing.
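
The interaction between the robots file and noindex can be written down as a toy decision function. This is a deliberately simplified model with hypothetical names, not a description of how any search engine is actually implemented. Real engines say "can" and "may" where this sketch returns a flat yes or no.

```python
def outcome(blocked_by_robots, has_noindex, known_from_links):
    """Toy model of what happens to one URL.

    Returns a tuple (fetched, can_appear_in_results).
    """
    fetched = not blocked_by_robots
    if fetched:
        # The spider reads the page, so it can see and obey noindex.
        return True, not has_noindex
    # Not fetched: a noindex tag on the page is never seen.
    # The URL can still be indexed from outside signals such as links.
    return False, known_from_links


# Blocked by the robots file, widely linked: not fetched, yet findable.
print(outcome(blocked_by_robots=True, has_noindex=False, known_from_links=True))

# Adding noindex while still blocked changes nothing.
print(outcome(blocked_by_robots=True, has_noindex=True, known_from_links=True))

# Unblocked with noindex: fetched, and kept out of the results.
print(outcome(blocked_by_robots=False, has_noindex=True, known_from_links=True))
```

The second and third calls are the whole argument in miniature: noindex only takes effect once the page is allowed to be fetched.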

Pages that have been fetched may not be indexed

The most common case is the one above: the page's head uses noindex to forbid indexing. The page is fetched, the noindex is read, and the page is then not indexed and not returned in search results. When noindex is added to an old page that is already indexed, the index entry is not deleted immediately; it is kept for a while, but the page is no longer returned in search results.

Links on a noindex page can still be followed for a while. But if the noindex stays in place for a long time, the search engine may stop fetching and indexing the page altogether, and the links on it then stop counting as well.

Another possible cause is that the page's content is plagiarized, reprinted, or low quality. The search engine fetches the page, but detects this during indexing and discards the page without indexing it. So when a page is not included, the usual first step is to check the raw server log to see whether it was fetched. If it was, the problem may be content quality. If it was never fetched at all, first check whether something is wrong with the site's structure.
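
The log check can be done with a short script. The sketch below assumes the server writes the common "combined" access log format and identifies spiders only by a token in the user agent string, which is a simplification since user agents can be spoofed. The function name and the two sample log lines are invented for illustration.

```python
import re

# Matches the request path and user agent in a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def spider_fetches(log_lines, path, spider_tokens=("Baiduspider", "Googlebot")):
    """Return the log lines in which a known spider requested `path`."""
    hits = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match or match.group("path") != path:
            continue
        agent = match.group("agent").lower()
        if any(token.lower() in agent for token in spider_tokens):
            hits.append(line)
    return hits


# Made-up sample data: one spider visit and one ordinary browser visit.
sample_log = [
    '203.0.113.5 - - [01/Jan/2024:10:00:00 +0000] "GET /post.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"',
    '198.51.100.7 - - [01/Jan/2024:10:05:00 +0000] "GET /post.html HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(len(spider_fetches(sample_log, "/post.html")))   # 1
print(len(spider_fetches(sample_log, "/other.html")))  # 0
```

A non-empty result means the page was fetched, so a missing index entry points toward content quality. An empty result points toward a discovery or site structure problem.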

The target page of a nofollow link can still be fetched and indexed

As mentioned earlier, nofollow prohibits neither fetching nor indexing. Its role is to tell the spider not to follow this link and to treat it as if it did not exist. But nofollow applies only to the link that carries it, not to any other link. Adding nofollow to this link does not mean there are no normal links to the same URL elsewhere. As long as a link without nofollow exists somewhere else, the target URL will still be discovered, fetched (assuming the robots file does not forbid it), and indexed (assuming it carries no noindex).

These concepts and how they apply are very important in SEO. If they still don't make sense, I don't know how else to explain them; I can only suggest reading this a few more times.
