As I said in How Do Algorithms Think? Part 1, my thinking has undergone a dramatic change since I learned that Google is apparently using vectors to determine contextual meaning in words, phrases, sentences, etc. It was one of those “OF COURSE!” epiphanies for me. The more I thought about it, the more I realized how radically it revolutionized what the search engine could do, in terms of determining the meaning of a body of text by boiling it down to a composite vector representation.
How vectors work
For those of you who aren’t yet familiar with how word vectors are used, let me explain the basics:
A vector is defined as an object with magnitude and direction. In other words, it is an arrow from one point in space to another, in one direction, of a particular length. As such, it will have an angle. Vectors can be plotted in 2-dimensional or 3-dimensional space (the word vectors used in practice have many more dimensions than that), but for simplicity’s sake, we’ll just look at 2-dimensional vectors here.
Take a vector that begins at coordinates (1,1) and ends at coordinates (5,4). That will allow us to plot it, but little else. When we compare it with another vector, though, such as one that runs from (0,0) to (3,5), we see they have two different angles and magnitudes.
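For anyone who does want a peek at the math, here is a minimal Python sketch of that comparison. The helper names (`displacement`, `magnitude`, `angle_degrees`) are my own, chosen for illustration; only the coordinates come from the example above.

```python
import math

def displacement(start, end):
    """Turn a start point and an end point into a vector (dx, dy)."""
    return (end[0] - start[0], end[1] - start[1])

def magnitude(v):
    """Length of the vector."""
    return math.hypot(v[0], v[1])

def angle_degrees(v):
    """Angle of the vector, measured counterclockwise from the x-axis."""
    return math.degrees(math.atan2(v[1], v[0]))

first = displacement((1, 1), (5, 4))   # (4, 3)
second = displacement((0, 0), (3, 5))  # (3, 5)

for name, v in (("first", first), ("second", second)):
    print(f"{name}: magnitude {magnitude(v):.2f}, angle {angle_degrees(v):.1f} degrees")
```

The first vector works out to a magnitude of 5.00 at about 36.9 degrees, the second to about 5.83 at about 59.0 degrees, so the two really do differ on both counts.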
I won’t bore you with the math, but every vector can be combined with another vector (or several) to resolve to a composite representation. So after discounting various stop words (common words such as “the” and “of” that carry little meaning on their own), the relevant words can be vectorized and combined to arrive at a composite. With me?
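Here is a rough sketch of what "combine into a composite" could look like, using plain vector addition. Everything in it is an assumption for illustration: the stop-word list, the tiny 2-dimensional `TOY_VECTORS` table, and the `composite` function are invented, and I'm not claiming this is how Google actually does it.

```python
# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}

# Hypothetical 2-D word vectors, invented for this example.
TOY_VECTORS = {
    "search": (4.0, 3.0),
    "engine": (3.0, 2.0),
    "ranking": (5.0, 4.0),
}

def composite(text):
    """Add up the vectors of every known, non-stop word in the text."""
    total = [0.0, 0.0]
    for word in text.lower().split():
        if word in STOP_WORDS or word not in TOY_VECTORS:
            continue  # skip stop words and words we have no vector for
        x, y = TOY_VECTORS[word]
        total[0] += x
        total[1] += y
    return tuple(total)

print(composite("the ranking of a search engine"))  # (12.0, 9.0)
```

The stop words drop out, and the three remaining vectors add up to a single composite for the whole phrase.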
Let me give you an analogy
Imagine assigning a color to a word – say, blue. That tone of blue can be compared to a vector’s angle. And the intensity of the blue, from light to royal, can be likened to a vector’s magnitude.
Add another word, this one assigned red. The composite of the two would result in some shade of magenta or violet. When we finish combining all the words on the page, we’ll have a single color that represents that page.
Obviously, many different shades of blue, or blends that contain a good share of blue, would give the overall document a predominantly blue face, which would be a good indication of a tight contextual focus. But if the vectored words are all over the place, meaning-wise, the resultant composite color would approach a gray or white, rather than something closer to one of the three primary colors (red, blue & green). In other words, the closer all the words are to the same basic (primary) color, the better the contextual focus.
Going back to vectors, the closer the words in the body are to the same angle, the better the focus.
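One way to put a number on that "focus" idea is to compare the length of the composite with the combined length of the individual vectors. To be clear, this `focus` score is my own illustrative measure, not something taken from Google; the sample vectors are made up as well.

```python
import math

def focus(vectors):
    """Length of the composite divided by the total length of the parts.

    1.0 means every vector points the same way; values near 0 mean the
    vectors largely cancel each other out (the "gray" case).
    """
    sum_x = sum(v[0] for v in vectors)
    sum_y = sum(v[1] for v in vectors)
    total_length = sum(math.hypot(v[0], v[1]) for v in vectors)
    if total_length == 0:
        return 0.0
    return math.hypot(sum_x, sum_y) / total_length

tight = [(4, 3), (8, 6), (2, 1.5)]      # all at the same angle
scattered = [(4, 3), (-4, 3), (0, -6)]  # pointing in very different directions

print(focus(tight))
print(focus(scattered))
```

The tightly focused set should score about 1, while the scattered set, whose vectors cancel one another out, should score about 0. That is the vector version of a page that comes out pure blue versus one that washes out to gray.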
Now to stretch our imaginations a bit
If Google can use word vectors to assign a representative mathematical value to the context of a page, wouldn’t it make sense that they’d do the same thing to search queries? If there’s anything computers are fast at, it’s sorting and matching numbers, so reducing both documents and queries to a numeric representation should process a lot faster than dealing with raw text that has to be transformed every single time.
And because it’s more efficient, both in terms of speed and machine cycles, I think it’s the most logical approach for the search engine; the cataloging of indexed pages can take on an entirely new face.
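To make the query-matching idea concrete, here is a small sketch that ranks pages by how closely the angle of each page's composite vector matches the angle of the query's vector, using cosine similarity. This is speculation on my part about how such matching might work; the page names, the vectors in `PAGES`, and the `rank` function are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical composite vectors for three indexed pages.
PAGES = {
    "page-about-vectors": (12.0, 9.0),
    "page-about-cooking": (2.0, 10.0),
    "page-about-travel": (9.0, -3.0),
}

def rank(query_vector, pages):
    """Return page names ordered from best to worst match for the query."""
    return sorted(
        pages,
        key=lambda name: cosine_similarity(query_vector, pages[name]),
        reverse=True,
    )

print(rank((4.0, 3.0), PAGES))
```

With the query vector (4, 3), the page whose composite points in the same direction comes first, followed by the cooking page and then the travel page. Because the page vectors can be computed once and stored, answering a query comes down to comparing numbers.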
Added bonus: the larger the corpus of documents and queries, the more accurate the process should become. For a machine learning algorithm, it’s nearly ideal.
So that’s where my thinking is now. Whether you think I’m on the right track or have jumped the rails entirely, I’d love to hear your thoughts. After all, that’s why the comment section below is there.