Google announced the release of Hummingbird in September 2013, its first algorithm replacement in 15 years. This was not a tweak or a new signal, but the first complete overhaul of Google’s algorithm since BackRub. Hummingbird is billed as a response to the changing nature of search: the trend towards natural-language queries and the growth in the length of queries.
SEO pundits were quick to offer their take on what this change meant and how it would affect us all; mostly they have missed the mark, although some have given good advice in the process.
Google Jeopardy: Put all of your content in the form of a question, Eric Ward suggested, and you will suddenly rank at the top of the first page. It is true that Q&A sites like Quora and Stack Overflow have been all the rage lately. It is also true that a well-executed FAQ strategy can be the foundation of simple and effective content marketing. If someone actually asks you a question via email, live chat, forum, etc., then likely hundreds of other users are trying to answer the same question via search. Mining user questions and publishing the answers is a great strategy that could yield big gains in traffic — but it probably has little to do with Hummingbird.
Topics and Sets: AJ Kohn has a great piece about Google Now and Sets which got me thinking about the role of search history & personalization in Hummingbird. It is well worth a read. Basically AJ points out how Google maps words (queries) and sites you click on to build Google Now topics. In the process, they create a small data leak into the personalization/disambiguation engine Google is using to discover the true user intent.
My analysis of the Hummingbird Update focused largely on the ability to improve topic modeling through a combination of traditional text analysis and entity detection. Google Now Topics looks like a Hummingbird learning lab.
Watching how queries and click behavior turn into topics (there’s that word again) and what types of content are displayed for each topic is a window into Google’s evolving abilities and application of entities into search results. It may not be the full picture of what’s going on but there’s enough here to put a lot of paint on the canvas.
This insight is great, as far as it goes, but in my opinion entities and topics, while important, are not at the core of Hummingbird.
Hummingbird is about disambiguation of search intent based on the user’s search history. The missing clues to the true nature of Hummingbird are readily available.
The Big Brand Bailout: A few years ago big brands started magically dominating search results for highly competitive short-tail queries. Displaced site owners (many with lead-gen sites) screamed in protest. Google called this update Vince. @stuntdubl called it the Big Brand Bailout, and that is the name that stuck. Hundreds of theories suggested how/why this happened. Eventually, a Google engineer who was not trained in the @MattCutts school of answering-questions-without-saying-anything-meaningful slipped up and revealed that Google was relying on users’ subsequent query behavior to improve the SERP for the initial query, elevating sites (brands) that showed up later in the click stream. Brands that were included in as little as 1-2% of subsequent queries got pushed onto the first page of results. This was the first direct evidence we had of Google using user behavior to disambiguate intent and influence rankings on the original query.
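To make the mechanism concrete, here is a minimal sketch of that subsequent-query signal. Everything about it — the session shape, the substring matching, the exact threshold logic — is an illustrative assumption, not Google’s actual implementation; only the 1-2% figure comes from the account above.

```python
def brand_share(sessions, initial_query, brands):
    """Share of sessions starting with `initial_query` whose follow-up
    queries mention each candidate brand. Purely illustrative."""
    counts = {b: 0 for b in brands}
    total = 0
    for session in sessions:
        if not session or session[0] != initial_query:
            continue
        total += 1
        later = " ".join(session[1:]).lower()  # all follow-up queries
        for b in brands:
            if b.lower() in later:
                counts[b] += 1
    return {b: (counts[b] / total if total else 0.0) for b in brands}

# hypothetical query log: each session is an ordered list of queries
sessions = [
    ["car insurance", "geico car insurance quote"],
    ["car insurance", "cheap car insurance"],
    ["car insurance", "geico reviews"],
    ["weather"],
]
shares = brand_share(sessions, "car insurance", ["geico", "progressive"])
# brands clearing the 1-2% bar described above would get a ranking push
boosted = [b for b, s in shares.items() if s >= 0.01]
```

In this toy log, “geico” shows up in two of the three follow-up streams for the initial query, so it clears the bar; “progressive” never appears and does not.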
Panda: Many parts of the Panda update remain opaque, and the classifier has evolved significantly since its first release. Google characterized it as a machine learning algorithm and hence a black box which wouldn’t allow for manual intervention. We later learned that some sites were subsequently added to the training set as quality sites, thus causing them to recover and be locked in as “good sites.” This makes it especially hard to compare winners and losers to reverse engineer best practices. What most SEO practitioners agree upon is that user behavior & engagement play a large role in a site’s Panda score. If users quickly return to the search engine and click on the next result or refine their query, that can’t be a good signal for site quality.
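That quick-return behavior (often called pogo-sticking) can be sketched as a simple short-click rate. The 30-second cutoff and the input shape are assumptions for illustration; nobody outside Google knows the real thresholds or weighting.

```python
def short_click_rate(dwell_times, threshold_seconds=30):
    """Fraction of result clicks where the user bounced back to the SERP
    within `threshold_seconds` (a 'short click'). The 30-second cutoff
    is an assumed value, not a known Google parameter."""
    if not dwell_times:
        return 0.0
    short = sum(1 for dwell in dwell_times if dwell < threshold_seconds)
    return short / len(dwell_times)

# hypothetical dwell times (seconds) before users returned to the SERP
site_a = [5, 12, 240, 8, 600]   # mostly quick bounces back to results
site_b = [180, 240, 95, 600]    # users stayed and read
```

A site like `site_a`, where most visits end in a quick bounce back to the results page, would look much worse on this kind of signal than `site_b`.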
Personalization: If you are an SEO, you likely spend a lot of time in incognito/private browsing mode and turn off search history, location tracking and other features that allow Google to track your intent, your movements and your online behavior. If you are not an SEO, you likely surf logged into your Google+/Gmail/Gchat/YouTube/Borg account with your history enabled. If you browsed more like a normal person, you would have noticed how dramatically Google SERPs change based on previous queries. Notice how the search URL has evolved to include your query sequence.
Conversational Voice Search: Google demonstrated what they mean by “conversational search” at the I/O conference in May of 2013. Danny Sullivan provided his usual excellent coverage, which focused on Google’s ability to remember your previous queries and provide context for your question. This is a significant advance in the user experience and something that differentiates Google from Siri. More importantly, it is an unambiguous statement about what Google means by “conversational search.” Conversational search is the ability to use previous interactions to disambiguate the user intent for subsequent queries.
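The simplest form of that context carry-over is pronoun substitution: the follow-up query borrows the entity from the previous one. This naive sketch is my own illustration of the idea, nothing more; the real system is vastly more sophisticated.

```python
PRONOUNS = {"he", "she", "it", "him", "her", "they", "them"}

def contextualize(query, previous_entity=None):
    """Naive sketch of conversational context: substitute the entity
    from the previous query for a pronoun in the follow-up, turning a
    dependent query into a standalone one. Illustrative only."""
    words = [
        previous_entity if w.lower() in PRONOUNS and previous_entity else w
        for w in query.split()
    ]
    return " ".join(words)
```

So after a query about Barack Obama, a follow-up like “how tall is he” can be rewritten into a query the engine can answer on its own.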
Google’s announcements are often aspirational and seemingly lack nuance. They tell us that they have solved a problem or devalued a tactic and we all point to the exceptions before proclaiming the announcement as hype or FUD; years later we look around and that tactic is all but dead and the practitioners are toast (unless the company is big enough to earn Google immunity). These pronouncements feel false and misleading because they are made several iterations before the goal is accomplished. The key to understanding where Google is now is to look at what they told us they were doing a year ago.
In the case of Hummingbird, what they told us is that the search quality team has been renamed the knowledge team; they want to answer people’s search intent instead of always pushing users off to other (our) websites. Google proudly proclaims that they do over 500 algorithm updates per year and that they are constantly testing refinements, new layouts and features. They also allude to the progress they are making with machine learning and their advancing ability to make connections based on the enormous amount of data they accumulate every day. Hummingbird marks a sea change in our understanding of the algorithm. Instead of the Knowledge Team, they should have renamed it the Measurement Team, because Google is measuring everything we do and trying to mine that data to understand intent and provide users with the variation they are looking for.
What does this mean to site owners?
This is the $64 billion question, and one without a simple answer. Matt Cutts told us at SMX Advanced in 2013 that only 15% of queries are of interest to any webmaster/SEO anywhere; 85% of what Google worries about, we pay no attention to. An update that affects 1.5% of queries can affect 10% of the queries some SEO somewhere cares about, and 50% of the “money terms” on Google.
Simultaneously, Google tends to roll out changes and then iterate on them. The lack of screaming protests or volatile weather reports suggests that very few results actually changed when Hummingbird was released — at least not results you can view in a ranking scraper. Instead, Google rolled out the tools it needs to make the next leap in personalization, which will gradually pick winners and losers.
The good news is that Hummingbird provides a significant chance for onsite SEO to improve performance and generate strong ROI. Machine learning is data driven and by nature the product of objective, measurable user actions. Site owners who embrace user-focused optimization (not narrowly defined conversion goals) and build out robust topics — with segmentation driven by mapping related queries to content that honestly addresses user intent — can significantly improve engagement.
That is Hummingbird nectar.
AJ Kohn says
Interesting piece, but let me make sure I understand what your main thesis is here. Are you saying that Hummingbird is about disambiguating intent using aggregated user behavior metrics?
I think that’s been going on for quite a long time (as you point out here) and has simply gotten better under Hummingbird. But I don’t think that’s the core of Hummingbird.
User behavior has been a rising factor in how Google determines authority. Short clicks versus long clicks and the time to long click (http://www.blindfiveyearold.com/time-to-long-click) are of vital importance to Google. It’s why domain host crowding remains stubbornly on a lot of SERPs. User behavior shows this is what satisfies Joe Average.
Google has a tremendous amount of data locked up in click-tracking and the DoubleClick cookie and so they have a good idea about what people actually do versus what they say they want.
But Hummingbird sits on top of this to a large extent in my view. I believe they have new infrastructure that allows them to take in multiple types of data (i.e. – entities or social) and have deep learning algorithms that can make better sense of them.
I think Jeff Dean’s aspiration of understanding that one sentence means the same as another sentence is the goal and that Hummingbird gets them closer since it allows them to pluck entities from query strings and then use that to better understand and match those queries.
But no doubt history and user behavior both on a personal and aggregate level are very powerful and may become more so with this new hybrid algorithm engine.
I’m working on a ranking problem for information retrieval and I wonder what machine learning technology is used to rank the documents in Google’s Hummingbird algorithm (is it a boosting algorithm, an SVM technique, or something else)?
Thank you in advance
Jonah Stein says
To my knowledge, Google has not revealed anything that would let us answer that question, nor am I an information retrieval scientist. Given that Google’s data sets cover so much ground, I wouldn’t know how to identify a technique, but my personal theory/guess is that they are looking at two things: content sets (queries that have the same underlying intent) and user outcomes.
I am again guessing that they are letting the outcomes determine the original intent; if a group of queries results in the same user behavior, then they are the same. For example, no amount of semantic analysis will tell you that a search for “restaurants near Embarcadero BART” and “SF places to eat near Market & Spear” are the same topic, but my click behavior after the query should be similar enough for Google to decide they are the same thing. I am not sure if this is accurately described as machine learning or brute-force statistical analysis, but it seems like the most robust and reliable approach available.
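That grouping-by-outcome guess can be sketched with a simple overlap measure on clicked URLs. The data, the Jaccard measure, and the 0.5 threshold are all my own illustrative choices — a toy version of the idea, not a claim about how Google actually does it.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of clicked URLs."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def same_intent(clicks, q1, q2, threshold=0.5):
    """Treat two queries as the same topic when the URLs users click
    after them overlap enough. Threshold is an assumed value."""
    return jaccard(clicks[q1], clicks[q2]) >= threshold

# hypothetical post-query click logs
clicks = {
    "restaurants near Embarcadero BART":
        ["yelp.com/sf", "slanteddoor.com", "opentable.com/sf"],
    "SF places to eat near Market & Spear":
        ["slanteddoor.com", "yelp.com/sf", "ferrybuilding.com"],
    "BART schedule": ["bart.gov"],
}
```

On this toy data, the two restaurant queries share enough clicks to be grouped as one intent, while “BART schedule” does not — the semantics of the query strings never enter into it.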