I don’t know when search algorithms will be built on AI at large scale, nor how much AI technology search algorithms currently use. Because AI models are hard to explain, search engines will be very cautious about making AI the foundation of the algorithm; otherwise the algorithm becomes hard to debug.
However, some modules of the algorithm certainly use AI already. Baidu’s DNN model and Google’s RankBrain algorithm, both introduced in earlier posts, are applications of AI in search algorithms.
So what does an AI-based search algorithm look like? What are its working principle and process? Here is my understanding in brief.
The strengths of artificial intelligence, applied to search
The current mainstream way of implementing artificial intelligence is deep learning, a branch of machine learning. This post does not strictly distinguish between these terms.
Simply put, artificial intelligence means giving the system a large amount of training data and letting it find the patterns and regularities by itself. The data given to the AI system is labeled; in other words, the system is told the outcome. For example, in Go, the AI system is given a large amount of historical game data (later versions of AlphaGo no longer needed historical games; self-play data was enough), and the outcome of each game is the label. The AI system then learns the relationship between board positions and the outcome (winning or losing).
In search, the AI system has a large amount of page data, namely the search engine’s own index. It also needs labels: which pages are high quality, and which search results satisfy users for a given query term? The AI algorithm then learns the relationship between page features (that is, ranking factors) and rankings.
In a traditional search algorithm, search engineers select the ranking factors by hand, assign each factor a weight by hand, and compute rankings with a fixed formula. The drawback is that when the data volume is large and the ranking factors are numerous, tuning the weights becomes very difficult. The initial weights are probably based on common sense plus off-the-cuff guesses, so they are quite subjective and arbitrary. When hundreds of factors influence one another, adjusting their weights becomes chaotic and unpredictable.
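To make the traditional approach concrete, here is a minimal sketch of a hand-weighted ranking formula. The factor names, weights, and page values are all made up for illustration; no search engine publishes its real factors or weights.

```python
# Hypothetical ranking factors with hand-picked weights (illustrative values only).
MANUAL_WEIGHTS = {
    "keyword_density": 2.0,
    "content_length": 0.5,
    "inbound_links": 3.0,
    "load_time": -1.5,  # slower pages are penalized
}

def manual_score(features, weights=MANUAL_WEIGHTS):
    """Weighted sum of normalized factor values; missing factors count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def rank_pages(pages, weights=MANUAL_WEIGHTS):
    """Return page ids sorted from highest to lowest hand-weighted score."""
    return sorted(pages, key=lambda pid: manual_score(pages[pid], weights), reverse=True)

# Two made-up pages with factor values normalized to the range 0..1.
pages = {
    "page_a": {"keyword_density": 0.8, "content_length": 0.4,
               "inbound_links": 0.2, "load_time": 0.9},
    "page_b": {"keyword_density": 0.3, "content_length": 0.9,
               "inbound_links": 0.7, "load_time": 0.2},
}
print(rank_pages(pages))
```

With four factors the weights are easy to reason about. With hundreds of interacting factors, every entry in `MANUAL_WEIGHTS` would have to be tuned by hand, which is the difficulty described above.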
Finding patterns in massive data is exactly what AI excels at. AI can quickly find possible ranking factors, adjust their weights, iterate automatically, and fit a formula that relates ranking factors to the search results users are satisfied with.
The formula learned from the training data is the AI search algorithm, and it can then be applied to many more user searches.
Who does the labeling?
Since training an AI search algorithm requires labeled data, where does the labeled data come from? This is the role of the search engine’s quality evaluators.
The work of the quality evaluators was described in detail in the recent post on the Google Quality Assessment Guide. These evaluators are real users, not Google employees. After they study the guide, Google’s evaluation system presents them with real websites and real query data, and they carry out the relevant assessments. The most important are:
- Rate the page quality
- Rate search results for specific query terms
Google’s quality evaluators have existed for a long time. Their original purpose was presumably not to develop AI algorithms but to evaluate the quality of traditional ones. Still, their assessment data can be put to good use by an AI system.
In this way, the AI system knows which pages users are satisfied to see in the search results for a given query term, and in which order.
Now the AI system has a large amount of page feature data and knows which search results real users are satisfied with. The next step is to train the system to find the relationship between page features and search rankings.
Training the AI search algorithm
Search engines can divide the labeled search result data into two groups: one for training and one for verification.
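A minimal sketch of such a split follows. It assumes the labeled data is a dictionary mapping each query to its evaluator-approved page order; the function name, the 20% verification share, and the sample data are all hypothetical. Splitting by query, rather than by page, keeps every labeled result list intact in one group.

```python
import random

def split_labeled_queries(labeled, verification_fraction=0.2, seed=0):
    """Split {query: ranked page list} into a training and a verification group."""
    queries = sorted(labeled)             # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(queries)  # seeded shuffle, independent of global state
    n_verify = max(1, int(len(queries) * verification_fraction))
    verify = {q: labeled[q] for q in queries[:n_verify]}
    train = {q: labeled[q] for q in queries[n_verify:]}
    return train, verify

# Made-up labeled data: ten queries, each with three pages in the approved order.
labeled = {f"query {i}": [f"page_{i}_{j}" for j in range(3)] for i in range(10)}
train, verify = split_labeled_queries(labeled)
print(len(train), len(verify))
```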
The AI algorithm examines the features of the pages in the training group’s search results and works out what weights those features should be given, and what formula would reproduce the (labeled) search results users are satisfied with.
Unlike in a traditional algorithm, which features (ranking factors) are needed and how much weight each one gets are not decided by engineers; the AI system finds and evaluates them itself. Some of these factors may be ones engineers already expect and have long used, such as:
- Page keyword density
- Page content length
- Whether the page has ads
- How many external links are on the page
- How many internal links are on the page
- How many pages link here with the query term as anchor text
- How many external links the page’s domain has
- How fast the page loads
- And so on; there may be hundreds or thousands of them
Others may be factors the engineers never thought of, and some may look irrelevant or unreasonable, such as:
- The number of words used in the body of the page
- Whether the author’s name is three words long
- The day of the week on which the page was first crawled
- Whether the number of external links on the page is odd or even
These are only examples. The point is that AI looks for correlation, not causation. It is enough for the AI to see which features the highly ranked pages share. Whether the link between those features and rankings looks reasonable to a human is not the AI’s concern, and it does not need to be.
Of course, some factors may be negative, such as the length of the domain name, which is likely to be negatively correlated with high rankings.
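A small sketch can illustrate what looking for correlation means here. It computes the Pearson correlation between one feature and the evaluators’ ratings, with no regard for why the two might be related. The ratings and feature values are invented for illustration only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant feature carries no signal
    return cov / (spread_x * spread_y)

# Invented data for five pages: evaluator rating (5 = best) and two features.
ratings = [5, 4, 3, 2, 1]
domain_length = [6, 8, 11, 15, 20]   # characters in the domain name
first_crawl_weekday = [2, 5, 1, 4, 3]  # 1 = Monday ... 7 = Sunday

print(round(pearson(domain_length, ratings), 3))
print(round(pearson(first_crawl_weekday, ratings), 3))
```

In this toy data, domain length is strongly negatively correlated with the rating while the crawl weekday is close to uncorrelated. A learning system would give the first a large negative weight and the second almost none, without ever asking for a causal explanation.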
Training the AI system means finding these ranking factors (whether or not they look reasonable to humans), giving them weights, and fitting a formula that outputs exactly the search results users are satisfied with. This fitting process is presumably iterative: if one set of weights or one formula does not work, the system adjusts automatically and calculates again, until its output perfectly matches the search results labeled by the evaluators. This training process may take a few days or a few weeks, depending on the amount of data.
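The iterative fitting can be sketched with the simplest possible stand-in: a linear formula whose weights are adjusted by gradient descent until its scores match the labeled scores. This is an assumption made for illustration, not how any search engine actually trains; real systems would use far larger models and ranking-specific objectives. The features, the hidden `TRUE_WEIGHTS`, and the learning rate are all hypothetical.

```python
def predict(weights, x):
    """Score one page: weighted sum of its feature values."""
    return sum(w * xi for w, xi in zip(weights, x))

def fit_weights(rows, targets, lr=0.1, epochs=2000):
    """Fit linear weights by batch gradient descent on mean squared error."""
    n_features = len(rows[0])
    weights = [0.0] * n_features          # start with no opinion about any factor
    for _ in range(epochs):
        grads = [0.0] * n_features
        for x, y in zip(rows, targets):
            error = predict(weights, x) - y
            for j, xi in enumerate(x):
                grads[j] += 2 * error * xi / len(rows)
        # Move each weight a small step against its gradient, then try again.
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Made-up pages: [keyword relevance, link score, domain length], all scaled 0..1.
rows = [
    [1.0, 0.2, 0.1], [0.5, 0.9, 0.3], [0.1, 0.4, 0.8],
    [0.8, 0.7, 0.5], [0.3, 0.1, 0.9], [0.6, 0.5, 0.2],
]
# Pretend the evaluators' scores follow a hidden formula the system must recover.
TRUE_WEIGHTS = [2.0, 1.0, -0.5]
targets = [predict(TRUE_WEIGHTS, x) for x in rows]

learned = fit_weights(rows, targets)
print([round(w, 3) for w in learned])
```

The engineer never sets a weight here. The loop starts from zeros and keeps adjusting until the formula reproduces the labels, including the negative weight on domain length.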
AI search algorithm verification
The trained AI search algorithm can be applied to other query words that are not in the training data.
First, test the algorithm against the verification group mentioned above. If the search results produced by the newly trained algorithm match the verification group data (which the evaluators have also labeled), the algorithm is good and can go live. If the algorithm returns different pages from those in the verification group’s search results, or largely the same pages in a very different order, the AI system may need to be retrained.
Of course, it is unlikely that for every query term the AI algorithm will return exactly the results the evaluators labeled as most satisfactory. Presumably it is enough that the ordering differences among the top results, say the top 20, stay within a certain tolerance. The higher the position, the lower the tolerance: a wrong page at position one or two is a serious problem, whereas misordered pages beyond the third page of results matter little.
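One way to express this kind of position-dependent tolerance is sketched below. The metric is a hypothetical one chosen for illustration: each labeled page contributes its displacement divided by its labeled position, so mistakes near the top cost more. The tolerance value of 1.0 is arbitrary.

```python
def weighted_rank_error(predicted, labeled, top_k=20):
    """Sum of position-discounted displacements over the labeled top_k results."""
    predicted_pos = {page: i for i, page in enumerate(predicted)}
    error = 0.0
    for i, page in enumerate(labeled[:top_k]):
        # A page missing from the prediction is treated as ranked just past its end.
        displacement = abs(predicted_pos.get(page, len(predicted)) - i)
        error += displacement / (i + 1)   # mistakes near the top weigh more
    return error

def within_tolerance(predicted, labeled, tolerance=1.0, top_k=20):
    """True if the predicted order is close enough to the labeled order."""
    return weighted_rank_error(predicted, labeled, top_k) <= tolerance

labeled = ["a", "b", "c", "d", "e"]
swapped_top = ["b", "a", "c", "d", "e"]      # positions 1 and 2 exchanged
swapped_bottom = ["a", "b", "c", "e", "d"]   # positions 4 and 5 exchanged

print(weighted_rank_error(swapped_top, labeled), within_tolerance(swapped_top, labeled))
print(weighted_rank_error(swapped_bottom, labeled), within_tolerance(swapped_bottom, labeled))
```

The same one-place swap is rejected at the top of the list and accepted further down, which is the behavior described above.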
The verified algorithm can then go live and be tested by real users. This probably touches on a factor that SEOs generally believe is related to rankings but that search engines have always denied: is user experience data a ranking factor?
Many SEO ranking factor studies show that click-through rate, bounce rate, dwell time, and visit depth are highly correlated with rankings, yet Google has always explicitly denied that these data are ranking factors. For Baidu, of course, click-through rate is obviously a ranking factor.
The likely reason is that search engines need this user experience data to verify the quality of the search algorithm. If overall click-through rate falls and bounce rate rises, the newly launched algorithm has a problem and needs adjusting. So although search engines do not use user data directly for ranking, the goal of the algorithm is to improve that user data, which makes the two highly correlated.
If, after the new AI algorithm goes live, the user data monitored by the search engine shows that users are satisfied, the algorithm has succeeded, and it waits for the next round of optimization.
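A rough sketch of such a check, under the assumption that the search engine compares aggregate counts from before and after the launch. The field names, the counts, and the one-percentage-point tolerance are hypothetical.

```python
def rates(stats):
    """Turn raw counts into (click-through rate, bounce rate)."""
    ctr = stats["clicks"] / stats["impressions"] if stats["impressions"] else 0.0
    bounce_rate = stats["bounces"] / stats["clicks"] if stats["clicks"] else 0.0
    return ctr, bounce_rate

def needs_adjustment(old, new, tolerance=0.01):
    """Flag the new algorithm if click-through rate fell and bounce rate rose."""
    old_ctr, old_bounce = rates(old)
    new_ctr, new_bounce = rates(new)
    ctr_fell = (old_ctr - new_ctr) > tolerance
    bounce_rose = (new_bounce - old_bounce) > tolerance
    return ctr_fell and bounce_rose

# Made-up aggregate counts for the old algorithm and two candidate launches.
old_algo = {"impressions": 10000, "clicks": 4000, "bounces": 1200}
worse_algo = {"impressions": 10000, "clicks": 3500, "bounces": 1400}
better_algo = {"impressions": 10000, "clicks": 4200, "bounces": 1100}

print(needs_adjustment(old_algo, worse_algo))
print(needs_adjustment(old_algo, better_algo))
```

Note that the user data never enters the ranking formula in this sketch. It only decides whether the formula is kept, which would still produce the correlation that SEO studies observe.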
All of the above is pure speculation.