
Google accidentally published internal Search documentation on GitHub

Getty Images | Alexander Koerner

Google apparently accidentally published a large trove of internal technical documents on GitHub, detailing in part how the search engine ranks web pages. For most of us, the question of search ranking is simply "are my web results good or bad?" But for the SEO community, documents like these offer a level of insight Google has never willingly given them in the past. Most of the commentary on the leak comes from SEO experts Rand Fishkin and Mike King.

Google confirmed the authenticity of the documents to The Verge, saying: "We caution against making inaccurate assumptions about Search based on out-of-context, outdated, or incomplete information. We've shared a lot of information about how Search works and the types of factors our systems take into account, while striving to protect the integrity of our results from manipulation."

The funny thing about the accidental release to the googleapi GitHub repository is that although these are sensitive internal documents, Google technically released them under an Apache 2.0 license. That means anyone who came across the materials was granted a "perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license," so copies are now freely available online.

One of the leaked documents.

The leak contains a ton of API documentation for Google's "ContentWarehouse," which looks a lot like the search index. As you might expect, even this incomplete overview of how Google ranks web pages is incredibly complex. King writes that there are "2,596 modules represented in the API documentation with 14,014 attributes (features)." These are documents written by programmers for programmers, and they rely on a lot of background information you would probably only have if you worked on the search team. The SEO community is still poring over the documents and using them to develop hypotheses about how Google Search works.
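To give a rough sense of what "modules with attributes" means here, below is a toy Python sketch assuming nothing beyond what the article states: one module bundling a few per-document attributes. The real documentation is API reference material, not Python, and every name in the sketch is hypothetical.

```python
from dataclasses import dataclass, field

# Purely hypothetical stand-in for one "module" carrying a few
# per-document "attributes." The leaked material is API documentation;
# none of these Python names are taken from it.
@dataclass
class PerDocModule:
    url: str
    site_authority: float = 0.0    # echoes the "SiteAuthority" attribute discussed below
    navboost_signal: float = 0.0   # echoes the "Navboost" click signal discussed below
    attributes: dict[str, float] = field(default_factory=dict)  # stand-in for the other ~14,000 features

doc = PerDocModule(url="https://example.com/page",
                   attributes={"someFeature": 0.42})
print(doc)
```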

Fishkin and King accuse Google of "lying" to SEO experts in the past. One of the revelations in the documents is that a search result's click-through rate affects its ranking, something Google has repeatedly denied. The click-tracking system is called "Navboost"; in other words, it boosts the websites that users actually go to. Naturally, a lot of this click data comes from Chrome, even when you're outside of Search. For example, some results can show a small, sitemap-style set of results below the main listing, and apparently part of what powers that content is the site's most popular subpages, as determined by Chrome click tracking.
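The leak names Navboost but not its math, so the following is a minimal sketch, assuming a simple multiplicative boost, of how a click-through-rate signal could nudge a relevance score upward. The function name, the weight, and the formula are all invented for illustration.

```python
# Hypothetical illustration of a "Navboost"-style adjustment: pages
# that users click more often get a modest bump. The real system's
# inputs and formula are not described in the leaked documents.
def navboost_adjust(base_score: float, clicks: int, impressions: int,
                    weight: float = 0.3) -> float:
    """Blend a click-through-rate signal into a base relevance score."""
    if impressions == 0:
        return base_score
    ctr = clicks / impressions
    return base_score * (1.0 + weight * ctr)

# A page clicked 40% of the time it is shown gets a 12% boost here.
print(navboost_adjust(base_score=1.0, clicks=400, impressions=1000))  # 1.12
```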

The documents also suggest that Google keeps whitelists that artificially boost certain websites on certain topics. The two flags mentioned are "isElectionAuthority" and "isCovidLocalAuthority".

Much of the documentation describes exactly what you would expect a search engine to do. Sites have a "SiteAuthority" value that ranks well-known sites higher than lesser-known ones. Authors have rankings of their own, too, but as with everything here, it's impossible to know how each signal interacts with all the others.
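To make that interaction concrete, here is a hedged sketch of how attributes like "SiteAuthority" and a whitelist flag such as "isCovidLocalAuthority" could feed into a final score. The documents name these attributes but not how they combine; the weights and the linear blend below are pure assumptions.

```python
# Hypothetical scoring sketch: the leak names "SiteAuthority" and
# whitelist flags, but the weights and combination rule here are
# invented for illustration.
def final_score(relevance: float, site_authority: float,
                is_covid_local_authority: bool = False) -> float:
    score = 0.7 * relevance + 0.3 * site_authority
    if is_covid_local_authority:
        score *= 1.5  # assumed whitelist boost
    return score

print(final_score(relevance=0.8, site_authority=0.6))  # ~0.74
print(final_score(relevance=0.8, site_authority=0.6,
                  is_covid_local_authority=True))      # ~1.11
```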

Both write-ups from our SEO experts read as though they're offended that Google misled them, but doesn't the company need to have at least a somewhat adversarial relationship with the people who try to manipulate its search results? A recent study found that "search engines appear to be losing the cat-and-mouse game of SEO spam" and observed "an inverse relationship between a page's level of optimization and its perceived expertise, indicating that SEO can harm at least the subjective quality of the page." None of this extra documentation is likely to benefit users or the quality of Google's results. For example, now that people know that click-through rate affects search rankings, couldn't you boost a website's ranking with a click farm?

News Source : arstechnica.com