The Evolution of Question-Answering Accuracy at Google

In these days when “Fake News” has become a catch-phrase, Google has come under fire for the accuracy of some of the answers they provide in featured snippets as they engage in question-answering, and there are signs that they are working to address that problem. We’ve seen such errors pinpointed in a number of articles.

Google does have a help page about featured snippets, titled Featured snippets in search, which tells us where the answers in those snippets come from:

The summary is a snippet extracted programmatically from what a visitor sees on your web page. What’s different with a featured snippet is that it is enhanced to draw user attention on the results page. When we recognize that a query asks a question, we programmatically detect pages that answer the user’s question, and display a top result as a featured snippet in the search results.

I found a patent that describes how Google checks up on the facts in question-answering with featured snippets, and I remembered writing about this before.

It had me wondering: what had changed?

This post is about an updated continuation patent that focuses upon providing more up-to-date facts in question-answering by the search engine. I wrote about the original version in the post How Google was Corroborating Facts for Direct Answers, which covered the earlier patent Corroborating facts in electronic documents, granted on February 10, 2015. (You can click through and read what I wrote about that one to get a better sense of how it was supposed to work; I will detail the differences here.)

The new version of the patent is at:

Corroborating facts in electronic documents
Inventors: Shubin Zhao and Krzysztof Czuba
Assignee: Google Inc.
US Patent: 9,785,686
Granted: October 10, 2017
Filed: February 6, 2015

Abstract

A query is defined that has an answer formed of terms from electronic documents. A repository having facts is examined to identify attributes corresponding to terms in the query. The electronic documents are examined to find other terms that commonly appear near the query terms. Hypothetical facts representing possible answers to the query are created based on the information identified in the fact repository and the commonly-appearing terms. These hypothetical facts are corroborated using the electronic documents to determine how many documents support each fact. Additionally, contextual clues in the documents are examined to determine whether the hypothetical facts can be expanded to include additional terms. A hypothetical fact that is supported by at least a certain number of documents, and is not contained within another fact with at least the same level of support, is presented as likely correct.

Note that the original version of this patent was filed in 2006, and it’s likely that Google was planning on doing question-answering back then. We know they were, because in 2005 they published a blog post, Just the Facts, Fast, telling us they intended to.

In continuation patents, normally most of the text of the patent is the same from the older description to the newer description, but the claims (what the patent examiners focus upon when deciding whether to grant a patent or not) get updated. It’s worth looking at how the claims have changed from one version to the next.

The first claim in the older version of the patent tells us that it focused upon:

1. A computer-implemented method for identifying facts described by electronic documents, comprising: defining a query, the query posing a question having an answer formed of terms from the electronic documents; creating one or more hypothetical facts in response to the query and the electronic documents, each hypothetical fact representing a possible answer to the query, wherein creating one or more hypothetical facts in response to the query comprises: parsing the query to filter out noise words and produce filtered terms; searching a repository of facts comprising attributes and values to identify attributes corresponding to the filtered terms; searching the electronic documents to identify terms that frequently appear near the filtered terms; and forming one or more hypothetical facts responsive to the attributes corresponding to the filtered terms and the terms that frequently appear near the filtered terms in the electronic documents; corroborating the one or more hypothetical facts using the electronic documents to identify a likely correct fact; and presenting the identified likely correct fact as the answer to the query.

I’m seeing some additional words in the newer version of the first claim, such as the word “threshold” for the number of documents that must corroborate a fact before it is presented as an answer:

1. A computer-implemented method for identifying facts described by electronic documents, comprising: defining a query, the query posing a question having an answer formed of terms from the electronic documents; creating one or more hypothetical facts based on at least one term used to define the query and at least one of the terms from the electronic documents, each hypothetical fact representing a possible answer to the query; corroborating the one or more hypothetical facts using the electronic documents to identify a likely correct fact, wherein corroborating a hypothetical fact using the electronic documents comprises: determining how many of the electronic documents support the hypothetical fact; and identifying the hypothetical fact as likely correct if an amount of support for the hypothetical fact surpasses a threshold, wherein the threshold is at least more than one electronic document of the electronic documents; and presenting the identified likely correct fact as the answer to the query.
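The corroboration step in this claim can be sketched as a simple counting procedure. This is a minimal illustration of the idea only, not Google’s implementation; the documents, facts, and threshold value here are all hypothetical:

```python
# Hypothetical sketch of the corroboration step in claim 1: count how
# many documents support each hypothetical fact, and identify a fact as
# likely correct only if its support surpasses a threshold of more than
# one document.

def corroborate(hypothetical_facts, documents, threshold=2):
    """Return (fact, support) pairs for facts supported by at least
    `threshold` documents, best-supported first."""
    likely_correct = []
    for fact in hypothetical_facts:
        # A document "supports" a fact here if it contains the fact's text
        support = sum(1 for doc in documents if fact in doc)
        if support >= threshold:  # the claim requires more than one document
            likely_correct.append((fact, support))
    return sorted(likely_correct, key=lambda pair: -pair[1])

docs = [
    "mount everest is 8848 meters tall",
    "everest stands 8848 meters above sea level",
    "some page claims everest is 9000 meters tall",
]
print(corroborate(["8848 meters", "9000 meters"], docs))
# → [('8848 meters', 2)] — the lone "9000 meters" claim fails the threshold
```

Substring matching stands in for whatever document-support test the patent actually contemplates; the point is the threshold, which filters out facts asserted by only a single page.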

So, the difference between the two appears to be a requirement for more support and more evidence for facts that might be presented as answers to questions.

This claim from the new patent shows that the process has also evolved to be somewhat more careful (the filtering of noise words from the original first claim now sits in a separate, following claim):

4. The method of claim 1, wherein creating one or more hypothetical facts in response to the query comprises: parsing the query to filter out noise words and produce filtered terms; searching a repository of facts comprising attributes and values to identify attributes corresponding to the filtered terms; searching the electronic documents to identify terms that frequently appear near the filtered terms; and forming one or more hypothetical facts responsive to the attributes corresponding to the filtered terms and the terms that frequently appear near the filtered terms in the electronic documents.
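The steps in claim 4 can also be sketched in a few lines. Again, this is a toy illustration under my own assumptions: the noise-word list, the repository of attribute names, and the two-word "nearby" window are all invented for the example:

```python
# Hypothetical sketch of claim 4: filter noise words from the query,
# find repository attributes matching the filtered terms, collect terms
# that appear near the filtered terms in documents, and pair them up
# as hypothetical facts.

NOISE_WORDS = {"what", "is", "the", "of", "how", "a"}  # illustrative stop list

def filtered_terms(query):
    """Parse the query and drop noise words."""
    return [t for t in query.lower().split() if t not in NOISE_WORDS]

def hypothetical_facts(query, repository, documents):
    terms = filtered_terms(query)
    # Attributes in the fact repository corresponding to the filtered terms
    attributes = [attr for attr in repository if any(t in attr for t in terms)]
    # Terms appearing within two words of a filtered term in the documents
    nearby = []
    for doc in documents:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if w in terms:
                nearby.extend(words[max(0, i - 2):i] + words[i + 1:i + 3])
    # Each (attribute, nearby term) pair is one hypothetical fact
    return [(attr, term) for attr in attributes for term in set(nearby)
            if term not in terms and term not in NOISE_WORDS]

repo = ["height", "population"]
docs = ["the height of everest is 8848"]
print(hypothetical_facts("what is the height of everest", repo, docs))
# → [('height', '8848')]
```

The hypothetical facts produced this way are candidates only; they would then feed into the corroboration step of claim 1 to be counted against the document set.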

It’s difficult to tell how effective this update might be at improving the accuracy of facts in featured snippets. That accuracy is a challenge Google faces, and whether or not they succeed in providing appropriate facts for question-asking queries will likely be something that lots of people keep an eye on.

I think it’s useful to keep in mind how far we’ve come, and how quickly. Not all that long ago, I kept a card in my wallet with the phone number of a nearby university library, which let me connect electronically to their card catalog so I could research and look up books that might help me answer questions. I used to call, find sources, walk down to the library to check out those books, and then read until I found answers (if any were to be had). Being able to get answers directly from Google can be much faster, and if those answers are accurate, very helpful. Being able to hunt down information like this from home shows us that these are still the early days of search; we’ve come really far already, and we will see how this evolves.

More Accurate Answers Come from More Trustworthy Resources?

Another effort that targets improved accuracy in search results involves Google’s Knowledge Vault, as described in a paper called Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources. If you haven’t read it yet, I highly recommend that you do. In it, Google tells us how they’ve been testing the trustworthiness of websites and assigning them Knowledge-Based Trust scores, based upon how well those sites state facts that Google already knows the answers to. So, rather than trusting facts based upon how many pages repeat them, this approach looks at the quality of the sources of those facts, based upon the thought that:

A source that has few false facts is considered to be trustworthy.
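That idea can be sketched as a simple ratio. The real paper uses a joint probabilistic model over fact extractions, not a raw fraction, and the known-facts repository and example sites below are invented for illustration:

```python
# Toy sketch of the Knowledge-Based Trust idea: score a source by the
# fraction of its extracted (subject, attribute, value) facts that agree
# with a repository of facts already believed correct. A source with few
# false facts scores high and is considered trustworthy.

def kbt_score(extracted_facts, known_facts):
    """Fraction of a source's facts matching the known-facts repository."""
    if not extracted_facts:
        return 0.0
    correct = sum(1 for subj, attr, value in extracted_facts
                  if known_facts.get((subj, attr)) == value)
    return correct / len(extracted_facts)

known = {("everest", "height"): "8848 m", ("paris", "country"): "france"}
site_a = [("everest", "height", "8848 m"), ("paris", "country", "france")]
site_b = [("everest", "height", "9000 m"), ("paris", "country", "france")]
print(kbt_score(site_a, known))  # → 1.0 — no false facts, trustworthy
print(kbt_score(site_b, known))  # → 0.5 — half its facts are wrong
```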

That seems to be an interesting evolution.

