Today’s post summarizes the main points of Gary Illyes’s post and the follow-up discussion on many blogs and forums, plus some cases and understanding of my own.
Note up front that the concepts below apply equally to Baidu.
What is a search engine spider’s crawl share?
As the name implies, crawl share is the cap on the total time a search engine spider will spend fetching pages on a website. For a particular site, the total time spiders spend on it is relatively fixed; they will not crawl every page of the site without limit.
Google’s English term for this is crawl budget. A literal translation of “budget” never struck me as very clear, so I use “crawl share” to express the concept.
What determines crawl share? This involves two things: crawl demand and the crawl rate limit.
Crawl demand refers to how many pages of a particular website the search engine “wants” to crawl.
Two main factors determine crawl demand. One is page weight: how many pages on the site reach the basic weight threshold at which the search engine wants to crawl them. The other is whether pages already in the index have gone too long without a refresh. Ultimately this also comes down to page weight: high-weight pages will not be left un-refreshed for too long.
Page weight and site weight are closely related; raising the site’s weight makes search engines more willing to crawl its pages.
Crawl rate limit
Search engine spiders don’t want to drag down a site’s server in the course of crawling more pages, so a crawl rate limit is set for each website: the upper limit the server can bear. Within that limit, spider crawling will not slow the server or hurt user access.
If the server responds fast enough, the limit is raised a little and crawling speeds up; if server response slows down, the limit drops, crawling slows, and may even stop.
So the crawl rate limit is the number of pages the search engine “can” crawl.
What determines crawl share?
Crawl share is the result of weighing crawl demand against the crawl rate limit: the number of pages the search engine both “wants to” and “can” crawl.
A site with high weight, high-quality page content, enough pages, and a fast enough server gets a large crawl share.
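As a rough sketch (my own simplification, not any search engine’s official formula), crawl share can be thought of as the smaller of the two quantities:

```python
def crawl_share(pages_demanded, pages_rate_limit_allows):
    """Pages the engine both 'wants to' and 'can' crawl per day.

    A simplification of the idea above; real systems weigh many more signals.
    """
    return min(pages_demanded, pages_rate_limit_allows)

# A high-demand site on a slow server is capped by the rate limit:
print(crawl_share(50_000, 8_000))  # → 8000
```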
Small sites don’t need to worry about crawl share
A small site has few pages; even if its weight is low and its server slow, search engine spiders typically crawl at least a few hundred pages a day, so the whole site is crawled within about ten days. A site with a few thousand pages therefore doesn’t need to worry about crawl share at all, and sites with tens of thousands of pages are generally not a big deal either. If a few hundred spider visits a day can slow the server down, SEO is not the main thing to worry about.
Large and medium-sized websites may need to consider crawl share
Large and medium-sized websites with hundreds of thousands of pages or more may face the problem of insufficient crawl share.
When crawl share is insufficient, say the site has 10 million pages but the search engine can only crawl a few tens of thousands of pages a day, a full crawl of the site may take months or even a year. That can mean important pages never get crawled, and so never rank, or that important pages are not refreshed in time.
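The arithmetic behind that example is easy to check; the numbers below are the hypothetical ones from the paragraph above:

```python
def days_for_full_crawl(total_pages, pages_crawled_per_day):
    """How long a complete crawl takes at a fixed daily crawl share."""
    return total_pages / pages_crawled_per_day

# 10 million pages, a few tens of thousands (here 30,000) crawled per day
days = days_for_full_crawl(10_000_000, 30_000)
print(f"about {days:.0f} days, roughly {days / 30:.0f} months")
```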
For a site’s pages to be crawled promptly and fully, the first step is to make sure the server is fast enough and pages are small enough. If the site has plenty of high-quality data, crawl share is limited by crawl speed, and improving page speed directly raises the crawl rate limit, thereby increasing crawl share.
Both Baidu Webmaster Platform and Google Search Console show crawl data. The figure below shows the Baidu crawl frequency of one website:
The chart above is for a small site at this blog’s level. The page crawl frequency and the crawl time (which depends on server speed and page size) show little correlation, indicating that the crawl share has not been used up, so there is nothing to worry about.
Sometimes crawl frequency and crawl time do show a clear correspondence, as with this larger site:
As you can see, improvements in crawl time (from reducing page size, speeding up the server, and optimizing the database) clearly lead to a higher crawl frequency, so more pages are crawled and indexed and the site is traversed faster.
An example of a larger site in Google Search Console:
The top chart is the number of pages crawled per day, the middle one the amount of data downloaded; unless the server has problems, the two should track each other. At the bottom is page download time. As you can see, when pages download fast enough, crawling millions of pages a day is no problem.
Of course, as mentioned earlier, being able to crawl a million pages a day is one thing; whether the search engine wants to crawl them is another.
The other reason large websites often need to mind their crawl share is to avoid wasting a limited share on meaningless pages, which leaves important pages that should be crawled with no chance of being crawled.
Typical pages that waste crawl share are:
- Masses of filtered-navigation pages. This was discussed in detail in my post a few years ago on crawling and indexing of invalid URLs.
- Duplicate content within the site
- Low-quality and spam content
- Infinite spaces such as calendar pages
If these pages are crawled in large numbers, the crawl share may be used up while the pages that actually should be crawled get no chance.
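One practical way to spot this waste is to check your raw server logs for what the spiders actually fetch. A minimal sketch in Python, assuming the common Apache/Nginx combined log format and using “Googlebot” (or “Baiduspider”) as the user-agent token; the sample log lines are invented for illustration:

```python
import re
from collections import Counter

# Matches the request path in a combined-format access log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')

def spider_hits(log_lines, spider_token="Googlebot"):
    """Count requested paths on lines whose user agent mentions the spider."""
    counts = Counter()
    for line in log_lines:
        if spider_token in line:
            m = LOG_LINE.search(line)
            if m:
                # Strip the query string so filter variations group together.
                counts[m.group(1).split("?")[0]] += 1
    return counts

sample = [
    '1.2.3.4 - - [01/Jan/2017] "GET /calendar?year=2037 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '1.2.3.4 - - [01/Jan/2017] "GET /calendar?year=2038 HTTP/1.1" 200 512 "-" "Googlebot/2.1"',
    '5.6.7.8 - - [01/Jan/2017] "GET /product/42 HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]
print(spider_hits(sample).most_common())  # the infinite calendar dominates
```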
How to save crawl share?
First, of course: reduce page file size, improve server speed, optimize the database, and bring crawl time down.
Then, try to avoid the wasteful pages listed above. Some are content-quality problems, some are site-structure problems. For structural problems, the simplest fix is to disallow crawling in the robots file, though this wastes some page weight, because weight flows into those pages but cannot flow back out.
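For structural wasters with recognizable URL patterns, the robots file fix might look like this (the paths are purely illustrative; substitute your site’s actual filter and calendar URLs):

```
User-agent: *
# Block typical crawl-share wasters: infinite calendars and filter pages
Disallow: /calendar/
Disallow: /search
Disallow: /*?color=
```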
In some cases, the link nofollow attribute can save crawl share. For small sites, since crawl share can’t be used up anyway, adding nofollow is meaningless. For large sites, nofollow can control the flow and distribution of weight to some extent. Carefully designed nofollow lowers the weight of meaningless pages and raises the weight of important ones. Search engines crawl from a URL list sorted by page weight; when important pages gain weight they are crawled first, while meaningless pages may drop so low in weight that the search engine no longer wants to crawl them.
A few final notes:
- Adding nofollow to a link does not waste crawl share. But at Google it does waste weight.
- The noindex tag does not save crawl share. The search engine needs to see the noindex tag on the page, which means it must crawl the page first, so no crawl share is saved.
- The canonical tag can sometimes save a little crawl share. Like noindex, the search engine must crawl the page first to see the canonical tag, so it does not directly save crawl share. However, pages with canonical tags are often crawled less frequently over time, which saves some crawl share.
- Crawl speed and crawl share are not ranking factors. But pages that are never crawled cannot rank at all.
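For reference, the three tags discussed in these notes look like this in HTML (the URLs are hypothetical examples):

```html
<!-- nofollow on a link to a low-value filter page -->
<a href="/products?color=red" rel="nofollow">Red items</a>

<!-- noindex: the spider must still fetch the page to see this tag -->
<meta name="robots" content="noindex">

<!-- canonical: points duplicate URLs at the preferred version -->
<link rel="canonical" href="https://www.example.com/products/widget">
```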