How to reduce crawling and indexing of invalid URLs – no perfect solution

Simply put, the post points to a serious and very real SEO problem: on many websites, especially B2C sites, the product filtering system (filtering by brand, price, size, performance, specifications, and so on) generates a large number of invalid URLs. They are called invalid only from an SEO point of view: these URLs produce no SEO benefit and actually have a negative effect, so it is better if they are not indexed. The reasons include:

  • A large number of filter pages have duplicate or very similar content (a large amount of duplicate content lowers the overall quality of the site)
  • Many filter combinations match no products, so the page has no content (for example, “42-inch LED TV under 100 yuan”)
  • Most filter pages have no ability to rank (far less than category pages), yet they still waste some link weight
  • These filter pages are not a necessary path to product pages either (product pages should have other internal links that help them get crawled and indexed)
  • Crawling large numbers of filter pages wastes a great deal of spider crawl time and reduces the chance that useful pages get indexed (the number of filter combinations is enormous)
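
To get a feel for how quickly filter combinations pile up, here is a rough sketch. The facet names and option counts are made up for illustration, and it assumes each facet is either left unset or set to exactly one value:

```python
from math import prod

# Hypothetical facets for one category; the option counts are invented.
FACETS = {
    "brand": 20,
    "price_band": 8,
    "size": 10,
    "feature": 12,
}

def filter_url_count(facets: dict[str, int]) -> int:
    """Count filter URLs when each facet is unset or set to one value."""
    # Each facet contributes (options + 1) states: one per value, plus "not selected".
    total_states = prod(n + 1 for n in facets.values())
    # Drop the single state where nothing is selected (the plain category page).
    return total_states - 1

print(filter_url_count(FACETS))  # 27026
```

Four modest facets already give roughly 27,000 filter URLs for a single category, and most of those combinations will match few or no products.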

So how do you keep these URLs from being crawled and indexed as far as possible? A post from a few days ago, about how hiding content can also become an SEO problem, discussed a similar issue; filter pages are one of the kinds of content you would want to hide. Unfortunately, I can’t think of a perfect solution at the moment. Yunchen Watching proposed two methods, and I feel neither of them solves the problem perfectly.

The first is to keep the URLs you don’t want indexed as dynamic URLs, even deliberately making them as dynamic as possible, to discourage crawling and indexing. However, search engines can now crawl and index dynamic URLs, and technically this is less and less of a problem. Although a large number of parameters does hinder indexing to some degree, URLs with 4 or 5 parameters usually still get indexed. We can’t tell how many parameters it takes to prevent indexing, so this is not a reliable method. Moreover, these URLs still receive internal links and have no ability to rank, so some weight is wasted anyway.

The second method is to block them with the robots file. Again, these URLs receive internal links and therefore receive weight. Because the robots file forbids crawling them, the weight they receive cannot be passed on (the search engine cannot read the page, so it does not know what outbound links it has), and the page becomes a black hole that takes weight in and never passes it out.
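
As a rough illustration of the robots approach, the sketch below checks a robots.txt rule with Python’s standard `urllib.robotparser`. The `/filter/` path prefix and the `shop.example` domain are assumptions for the example, not something the original post specifies; a real site would need its filter URLs to share a prefix (or use a crawler-specific pattern syntax) for a rule like this to work.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: filter pages live under /filter/, category pages elsewhere.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A filter URL is blocked; the plain category page is still crawlable.
print(parser.can_fetch("Googlebot", "https://shop.example/filter/tv/brand-x/42-inch"))
print(parser.can_fetch("Googlebot", "https://shop.example/tv/"))
```

Note that this only shows which URLs a compliant crawler would skip. It does nothing about the weight black hole described above.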

Adding nofollow to the links pointing to these URLs is not perfect either, for reasons similar to the robots block. In Google, the effect of nofollow is that these URLs receive no weight, but that weight is not reassigned to the other links on the page, so it is still wasted. Baidu reportedly supports nofollow, but how it handles the weight is unknown.
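
The arithmetic behind that waste can be sketched with a toy model. It assumes a page’s weight is split evenly across all of its links and that the share belonging to nofollow links simply disappears; this is a simplification of the behaviour described above, not any search engine’s actual formula, and the function name is made up.

```python
def passed_weight(page_weight: float, followed: int, nofollowed: int) -> dict[str, float]:
    """Toy model: weight is divided over ALL links; nofollow shares are discarded."""
    total_links = followed + nofollowed
    if total_links == 0:
        return {"per_followed_link": 0.0, "wasted": 0.0}
    share = page_weight / total_links
    return {
        # Followed links get the same share they would get without nofollow.
        "per_followed_link": share if followed else 0.0,
        # The shares of nofollow links are not redistributed, so they are lost.
        "wasted": share * nofollowed,
    }

# 6 normal links and 4 nofollow filter links on a page with weight 1.0
print(passed_weight(1.0, followed=6, nofollowed=4))
```

Under this model each normal link still gets one tenth of the weight rather than one sixth, so adding nofollow to the filter links does not benefit the remaining links.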

Putting the links to these URLs in Flash or JS does not help either. Search engines can already crawl links in Flash and JS, and they will presumably only get better at it. One thing many SEOs overlook is that links in JS can not only be crawled but can also pass weight, just like normal links.

It is also possible to make the filter links AJAX-based, so that clicking one does not load a new URL: the user stays on the original URL, or a # fragment is appended to it, which is not treated as a different URL. As with JS, though, search engines are actively trying to crawl content loaded via AJAX, so this method is not safe either.
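
The reason # variants are not treated as different URLs is that the fragment is never sent to the server. A small sketch with the standard library shows several filter states collapsing to one URL; the `shop.example` URLs and the `crawl_key` helper are hypothetical.

```python
from urllib.parse import urldefrag

def crawl_key(url: str) -> str:
    """Return the URL without its fragment, which is what a server receives."""
    return urldefrag(url).url

filter_states = [
    "https://shop.example/tv/#brand=x",
    "https://shop.example/tv/#brand=x&size=42",
    "https://shop.example/tv/",
]

# All three filter states map to the same fetchable URL.
print({crawl_key(u) for u in filter_states})
```

This only describes plain HTTP behaviour. As noted above, a search engine that executes the AJAX may still discover the filtered content.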

Another method is to add a noindex+follow tag to the head of the page, meaning: do not index this page, but do follow the links on it. This solves the duplicate content problem and also the weight black hole problem (weight can flow on to other pages through the outbound links). What it cannot solve is the waste of spider crawl time: these pages still have to be crawled (only then can the spider see the noindex+follow tag in the page HTML). On some websites the number of filter pages is enormous, and if spiders crawl these pages they will not have enough time left to crawl the useful ones.
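
A minimal sketch of how a page template might choose the tag, assuming filter pages can be recognized by their query parameters. The parameter names in `FILTER_PARAMS` and the `robots_meta` helper are invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical query parameters that mark a URL as a filter page.
FILTER_PARAMS = {"brand", "price", "size"}

def robots_meta(url: str) -> str:
    """Return the robots meta tag to place in the <head> for this URL."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    if FILTER_PARAMS.intersection(query):
        # Filter page: keep it out of the index but let link weight flow on.
        return '<meta name="robots" content="noindex, follow">'
    return '<meta name="robots" content="index, follow">'

print(robots_meta("https://shop.example/tv/?brand=x&size=42"))
print(robots_meta("https://shop.example/tv/"))
```

The spider still has to download every filter page to read this tag, which is exactly the crawl-time cost described above.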

Yet another method to consider is cloaking: the server detects the visitor, and if it is a search engine spider, it returns the page with the filter links removed; if it is a user, it returns the normal page with the filters. This is a fairly ideal solution. The only problem is that it may be treated as cheating. The supreme principle search engines often give SEOs for judging whether something is cheating is: would you do this if search engines did not exist? In other words, is the technique used only because of search engines? Obviously, using cloaking to hide URLs you don’t want crawled is done for search engines, not for users. Although the purpose of cloaking in this case is benign and not malicious, the risk is real; the bold may want to try it.

Another method is to use the canonical tag. The biggest problems are that it is unknown whether Baidu supports it, and that the canonical tag is a suggestion to search engines, not a directive. That is, search engines may not honor it, in which case it is useless. In addition, the canonical tag is meant to specify the canonical URL among duplicates, and whether it is appropriate for filter pages is somewhat questionable; after all, the content on these pages is often different.
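
If you do try the canonical tag, one possible sketch is to point every filter URL at the same URL with the filter parameters stripped. The parameter names, the `shop.example` domain, and both helper functions are assumptions for illustration; which page should count as canonical is a judgment call for each site.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters that only narrow the product list.
FILTER_PARAMS = {"brand", "price", "size"}

def canonical_url(url: str) -> str:
    """Drop filter parameters and any fragment, keeping other parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FILTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> tag for the page at this URL."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'

print(canonical_link_tag("https://shop.example/tv/?brand=x&page=2&size=42"))
```

Since the tag is only a suggestion, this sketch shows what you would ask for, not what a search engine will actually do.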

One of the better methods at the moment is iframe+robots blocking. Putting the filter section of the code into an iframe amounts to loading the content of another file. To a search engine, that content does not belong to the current page, which means it is hidden. But not belonging to the current page does not mean it does not exist: search engines can discover the content and links inside the iframe and may still crawl those URLs, so a robots rule is added to forbid crawling them. The content in the iframe still causes some loss of weight, but because the links in the iframe draw weight not from the current page but only from the file being loaded, the loss is relatively small. Apart from headaches such as layout and browser compatibility, one potential problem with the iframe approach is the risk of being treated as cheating. Search engines currently do not generally regard iframes as cheating, and many ads are placed in iframes, but there is a subtle difference between hiding a pile of links and hiding ads. Going back to the general principle search engines use to judge cheating, it is hard to argue that this is not done specifically for search engines. I recall Matt Cutts saying that Google may change how it handles iframes in the future, because they still want to see on a page everything an ordinary user can see there.

In short, I don’t have a perfect answer to this real and serious problem. Of course, failing to solve it perfectly is not the end of the world. Different websites have different SEO priorities, so analyze each case on its own terms; one or more of the methods above should take care of the main problem.

The biggest problem, though, is none of the above: sometimes you actually want some of these filter pages to be crawled and indexed, and that is where the real trouble begins. I will discuss that later.
