On yoast.com, we talk a lot about writing and readability. We consider it a very important part of SEO. Your text needs to be easy to follow and it needs to satisfy your users’ needs. This, in turn, will help your rankings. However, we rarely talk about how Google and other search engines read and understand these texts. In this post, we’ll explore what we know about how Google analyzes online text.
Are we sure Google understands text?
We know that Google understands text to some degree. Just think about it. One of the most important things Google has to do is match what someone types into the search bar to a suitable search result. User signals (like click-through and bounce rates) alone aren’t enough for Google to do this. Moreover, we know that it’s possible to rank for a phrase that you don’t use in your text (although it’s still good practice to identify and use one or more specific keyphrases). So clearly, Google does something to actually read and assess your text in one way or another.
How Google understands text
Back to our initial question: How does Google understand text? To be honest, we don’t know this in detail. Unfortunately, that information isn’t freely available. And, judging from the search results, we also know that there’s still a lot of work to be done. But there are some clues here and there that we can draw conclusions from. We know that Google has taken big steps when it comes to understanding context. We also know that the search engine tries to determine how words and concepts are related to each other. How do we know this? On the one hand, by keeping an eye on any news surrounding Google’s algorithm. On the other hand, by considering how the actual search results pages have changed.
One interesting technique Google has filed patents for and worked on is called word embedding. We’ll save the details for another post, but the goal is basically to find out which words are closely related to other words. This is what happens: a computer program is fed a certain amount of text. It then analyzes the words in that text and determines which words tend to appear together. Then, it translates every word into a series of numbers, also known as a vector. This allows each word to be represented as a point in space in a diagram, like a scatter plot. This diagram shows which words are related in which ways. More accurately, it shows the distance between words, sort of like a galaxy made up of words. So, for example, a word like “keywords” would be much closer to “copywriting” than it would be to, say, “kitchen utensils”.
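To make the idea a bit more concrete, here’s a minimal sketch in Python. It is not how Google does it, and it isn’t a real embedding model either: it simply counts which words share a sentence and uses those counts as each word’s “series of numbers”. The tiny corpus and the function names (`cooccurrence_vectors`, `cosine_similarity`) are made up for this illustration.

```python
import math
from collections import Counter, defaultdict

# A tiny, made-up corpus. A real system would be fed vastly more text.
CORPUS = [
    "keywords help copywriting for seo",
    "good copywriting uses keywords for seo",
    "kitchen utensils include spoons and whisks",
    "spoons and whisks are kitchen utensils",
]


def cooccurrence_vectors(sentences):
    """Map each word to counts of the other words that share a sentence with it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            for j, other in enumerate(words):
                if i != j:
                    vectors[word][other] += 1
    return vectors


def cosine_similarity(a, b):
    """Return a score from 0 to 1 for count vectors: higher means more shared context."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(CORPUS)
similarity_close = cosine_similarity(vectors["keywords"], vectors["copywriting"])
similarity_far = cosine_similarity(vectors["keywords"], vectors["utensils"])

print(f"keywords vs. copywriting: {similarity_close:.2f}")
print(f"keywords vs. utensils:    {similarity_far:.2f}")
```

In this toy corpus, “keywords” and “copywriting” appear alongside the same words, so they score as closely related, while “keywords” and “utensils” share no context at all. Actual word embedding models learn much denser vectors from far more text, but the intuition is the same.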
Interestingly, this can also be done for phrases, sentences and paragraphs. The bigger the dataset you feed the program, the better it will be able to categorize and understand words and work out how they’re used and what they mean. And, what do you know, Google has an index that covers a huge part of the web. With a dataset like that, it’s possible to create very reliable models that predict and assess the value of text and context.
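One simple way to picture how this extends from words to phrases is to average the vectors of the words in a phrase. The sketch below does exactly that. Keep in mind that this is just one basic approach, chosen here for illustration, and we don’t know whether Google does anything like it. The two-dimensional vectors are hand-picked and invented for this example (real embeddings are learned from text and have many more dimensions), and `phrase_vector` is a hypothetical helper name.

```python
import math

# Hand-picked 2-D vectors, invented for illustration only.
# Real embeddings are learned from text and have many more dimensions.
WORD_VECTORS = {
    "keywords": (0.9, 0.1),
    "copywriting": (0.8, 0.2),
    "seo": (0.85, 0.15),
    "kitchen": (0.1, 0.9),
    "utensils": (0.2, 0.8),
}


def phrase_vector(phrase):
    """Average the vectors of the known words in a phrase."""
    known = [WORD_VECTORS[word] for word in phrase.lower().split() if word in WORD_VECTORS]
    if not known:
        raise ValueError(f"no known words in phrase: {phrase!r}")
    dimensions = len(known[0])
    return tuple(sum(vec[d] for vec in known) / len(known) for d in range(dimensions))


writing_phrase = phrase_vector("copywriting keywords")
seo_phrase = phrase_vector("seo copywriting")
kitchen_phrase = phrase_vector("kitchen utensils")

# Smaller distance means the phrases sit closer together in the "galaxy".
distance_near = math.dist(writing_phrase, seo_phrase)
distance_far = math.dist(writing_phrase, kitchen_phrase)

print(f"'copywriting keywords' to 'seo copywriting':  {distance_near:.3f}")
print(f"'copywriting keywords' to 'kitchen utensils': {distance_far:.3f}")
```

The two writing-related phrases end up right next to each other, while the kitchen phrase lands far away. Averaging ignores word order, so it’s a rough approximation at best, but it shows how a point in space can stand for more than a single word.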
From word embeddings, it’s only a small step to the concept of related entities. Let’s take a look at the search results to illustrate what related entities are. If you type in “types of pasta”, this is what you’ll see right at the top of the SERP: a heading called “pasta varieties”, with a number of rich results that include a ton of different types of pasta. These pasta varieties are even subcategorized into “ribbon pasta”, “tubular pasta”, and other subtypes of pasta. And there are lots and lots of similar SERPs that reflect the way words and concepts are related to each other.