In The Google Florida Update, we discussed how we
believe that Google has deployed the Hilltop algo in its ‘Florida’ algo
update. As usual, Google has been silent about the algo update, so our
analysis is based on research and experiments.

Why is a new algo needed?

While the PR algo did its job well all these years, there is
a basic flaw in the PR system and Google knew about this. The PageRank
(PR) system allocates an absolute ‘value of importance’ to a web page
based on the number and quality of sites that link to it. However, ‘PR
value’ is not specific to search terms and therefore a high-PR web page
that even contained a passing reference to an off-topic keyword phrase,
often got a high ranking for that phrase. Krishna Bharat from California
realized the flaw in this PR-based ranking system and came up with an
algorithm he called ‘Hilltop’ in the year 1999-2000. He filed for the
Hilltop patent in Jan 2001 with Google as an assignee. Needless to say,
Google realized the advantage this new algo would offer to their ranking
system if combined with their own PR system. Hilltop could perfectly
bridge the gap. The Hilltop algo may have gone through several
refinements/iterations from its original form, before this deployment.

What is the Hilltop algo?

For the geeks who wish to go
into great depths, there is detailed info available here: Hilltop
Paper & Hilltop Patent: http://www.cs.toronto.edu/~georgem/hilltop/

For the rest of us, here is a simple explanation. In a nutshell, PR
determines ‘authority’ of a web page in general. Hilltop
(LocalScore) determines the ‘authority’ of a web page related to the query
or search term. Bharat reasoned that instead of using just the
‘PR value’ to find the ‘authoritative’ web pages, it would be more useful
if the ‘value’ had topical relevance. As such, counting links from ‘topic
relevant’ documents to a web page would be more useful. He called these
‘topic relevant’ documents ‘expert documents’, and the links from these
expert documents to the target documents determine the targets’
‘authority score’.

The Hilltop algo calculates a ‘score of authority’ of web pages
(over-simplified) as follows:

1. Run a normal search on the keyphrase to locate a ‘corpus’ of expert
   documents. The qualifying rules for ‘expert documents’ are stringent, so
   the ‘corpus’ is a manageable number of web pages.
2. Filter affiliate* sites and duplicate sites from the experts list.
3. Pages are assigned a LocalScore of ‘authority’ based on the number and
   quality of votes they get from these expert documents.
4. Pages are then ranked based on their LocalScore.

How does Hilltop define affiliate sites?

*Affiliate sites are defined as follows:

- Pages that originate from the same domain (www.ibm.com,
  www.ibm.com/us/, products.ibm.com, solutions.ibm.com etc.)
- Pages that originate from the same domain name but with different
  top-level and second-level suffixes (like www.ibm.com, www.ibm.co.uk,
  www.ibm.co.jp etc.)
- Pages that originate from neighborhood IPs (the first 3 octets of the
  IP number are common, as in 66.165.238.xxx)
- Pages that originate from affiliates of affiliates (if www.abc.com is
  hosted on the same IP block as www.ibm.com, then www.abc.com is an
  affiliate of www.ibm.co.uk even if they are on different IP series)

It is
worth noting that the Hilltop algo bases its calculations only on ‘expert
documents’. It requires finding at least two expert documents voting
for a page. If the algo does not find a minimum of two expert documents,
the score returned is zero. This essentially means that the Hilltop
algo fails to pass on any values to the rest of the ranking algo and
therefore becomes ineffective for the search term in question.
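
The affiliate rules and the two-expert minimum described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm from the Hilltop paper or patent: the helper names (`domain_label`, `ip_block`, `affiliate_groups`, `local_scores`), the suffix heuristic and the vote weights are all hypothetical, and the sketch treats each affiliate group as a single voter rather than reproducing the paper's exact filtering.

```python
from collections import defaultdict

def domain_label(host):
    # Hypothetical heuristic: drop "www" and common suffix parts so that
    # www.ibm.com, www.ibm.co.uk and products.ibm.com all map to "ibm".
    # A real system would need a proper public-suffix list.
    generic = {"www", "com", "co", "uk", "jp", "org", "net"}
    parts = [p for p in host.lower().split(".") if p not in generic]
    return parts[-1] if parts else host

def ip_block(ip):
    # "Neighborhood" IPs share their first three octets.
    return ".".join(ip.split(".")[:3])

def affiliate_groups(sites):
    """sites: dict host -> IP. Returns dict host -> group representative.
    Affiliation is transitive (affiliates of affiliates share a group)."""
    parent = {h: h for h in sites}

    def find(h):
        while parent[h] != h:
            parent[h] = parent[parent[h]]  # path compression
            h = parent[h]
        return h

    first_seen = {}
    for host, ip in sites.items():
        for key in (("name", domain_label(host)), ("ip", ip_block(ip))):
            if key in first_seen:
                parent[find(host)] = find(first_seen[key])
            else:
                first_seen[key] = host
    return {h: find(h) for h in sites}

def local_scores(expert_links, groups, min_experts=2):
    """expert_links: dict expert host -> list of (target, weight) votes.
    Each affiliate group votes at most once per target; a target backed by
    fewer than min_experts independent groups scores zero."""
    votes = defaultdict(dict)  # target -> {expert group: best weight}
    for expert, links in expert_links.items():
        group = groups.get(expert, expert)
        for target, weight in links:
            if groups.get(target, target) == group:
                continue  # an expert cannot vote for its own affiliates
            votes[target][group] = max(votes[target].get(group, 0), weight)
    return {t: (sum(v.values()) if len(v) >= min_experts else 0)
            for t, v in votes.items()}
```

With the article's example hosts, www.abc.com (sharing an IP block with www.ibm.com) ends up in the same group as www.ibm.co.uk, so votes from the two count as one expert vote, and a page backed by only one independent expert receives a LocalScore of zero.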
This is a very important aspect of the Hilltop algo: it is
ineffective if sufficient expert documents are not located.

This feature of the Hilltop algo, which has a high chance of returning a
‘zero’ score for highly specific query terms, has led the majority of the
SEO community to believe that Google is using a ‘money words’ filter list.
Actually, the ‘old Google’ results got displayed for specific search terms
where Hilltop failed to produce an effect. The remaining terms, where the
results did change, are what the SEO community collected and called the
‘Money Words List’.

This effect also comes across as strong evidence indicating the
deployment of Hilltop by Google. When Google introduced this new algo on
November 15th, 2003, an analyst figured out that if you search for a query
term added with some ‘exclusion’ trash characters, Google displayed the
original (pre-algo-change) results, bypassing the so-called ‘money words’
filter list. For example if you search for “real estate
–hgfhjfgjhgjg –kjhkhkjhkjhk” then Google would attempt to show you the
pages on “real estate” but excluding pages that contained the terms
“hgfhjfgjhgjg” and “kjhkhkjhkjhk”. Since there would hardly be any page
containing the words “hgfhjfgjhgjg” and
“kjhkhkjhkjhk”, Google should return the same results as one would
get for the term ‘real estate’ alone. However, that did not happen. Google
showed results that seemed to be identical to the pre-algo-change ranking.
In fact, an anti-Google group set up a site (www.scroogle.org) to capture
the differences in rankings to extract a so-called ‘money words’ filter
list.

What’s the real story behind the so-called ‘money keywords list’ filter?

We believe that the ‘money words’ filter
list effect was just a spin-off symptom of the Hilltop algo. Each time
someone ran a search term like “real estate –hgfhjfgjhgjg
–kjhkhkjhkjhk”, Google passed the entire search term on to Hilltop. Since
Hilltop was unable to locate sufficient ‘expert’ documents containing this
‘funny looking’ search term, it produced a zero result (read: zero effect).
This essentially means that Hilltop was simply ‘bypassed’ by the
exclusion search term. The rest of the Google algo was then left to
extract and display results, which obviously looked identical to the
pre-algo-update results.

The growing popularity of
www.scroogle.org led Google to detect this bug. Google fixed it by
running Hilltop in a 2-step process. The exclusion terms are withheld
while passing the query to Hilltop; Hilltop does its work, extracts
results and passes them to the Google algo, and Google excludes the terms
just before displaying results. Simple. Exclusion terms are no longer
passed on to Hilltop, so Hilltop now works fine. As you can see on the
Google site, the above exclusion method no longer shows ‘old Google’
results.
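
The bypass and the 2-step fix described above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the idea of a precomputed `hilltop_index` keyed by phrase, and the way LocalScore boosts the base score are hypothetical and are not taken from Google or from the Hilltop paper.

```python
def split_query(query):
    """Separate ordinary terms from exclusion terms (those prefixed with a
    hyphen or an en dash, as in the article's example)."""
    include, exclude = [], []
    for token in query.split():
        if token[0] in "-\u2013" and len(token) > 1:
            exclude.append(token[1:].lower())
        else:
            include.append(token.lower())
    return " ".join(include), exclude

def search(query, hilltop_index, base_rank, page_text, two_step=True):
    """hilltop_index: phrase -> {page: LocalScore} (hypothetical store).
    base_rank: page -> score from the rest of the algo.
    page_text: page -> text, used only to apply the exclusion terms."""
    phrase, excluded = split_query(query)
    # Before the fix, the whole query (exclusions included) went to Hilltop,
    # which found no experts for it and therefore had no effect.
    hilltop_key = phrase if two_step else query.lower()
    local = hilltop_index.get(hilltop_key, {})
    ranked = sorted(base_rank,
                    key=lambda p: base_rank[p] * (1 + local.get(p, 0)),
                    reverse=True)
    # Exclusion terms are applied only at the end, just before display.
    return [p for p in ranked
            if not any(term in page_text[p] for term in excluded)]
```

Calling `search` with `two_step=False` on a query padded with nonsense exclusion terms returns the plain base ranking, which is the ‘old Google’ effect; with `two_step=True` the same query is ranked exactly like the bare phrase.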

What does the new Google algo look like? What are the implications?

The combination of the Hilltop algo, Google PR and
on-page relevance factors seems to be a highly potent combination, very
difficult to beat. Not impossible, but very difficult. This new
combination has far-reaching implications on how link-popularity/PageRank
and links from Expert Documents (LocalScore) would affect your site
ranking. The exact Google algo will only be known to Google. It is
a closely guarded secret. I’m not good at maths (I wish I were), but here
is an attempt to simplify the new Google algorithm for the purpose of
understanding how the variables take effect:

Old Google Ranking Formula = {(1-d)+a (RS)} * {(1-e)+b (PR * fb)}

New Google Ranking Formula = {(1-d)+a (RS)} * {(1-e)+b (PR * fb)} * {(1-f)+c (LS)}

Where:

RS = RelevanceScore: (Score based on keywords appearing in Title, Meta
tags, Headlines, Body text, URL, Alt text, Title attribute, anchor text
etc. of your site)

PR = PageRank: (Score based on number and PR value of pages linking to
your site. Original formula is
PR (A) = (1-d) + d (PR (t1)/C (t1) + ... + PR (tn)/C (tn)),
where PR of page ‘A’ is the sum of the PR of each page linking to it
divided by the number of outgoing links on each of those pages. Here ‘d’
is a damping factor, usually set to 0.85, which makes 1-d equal to 0.15.
It is not the same as the dampener ‘d’ in the ranking formulas above.)

LS = LocalScore: (Score computed from expert documents. Has variables and
different values for search term appearing in title (16), headline (6),
anchor text (1), search term density etc. Figures in parentheses are the
original values, which may have been changed by Google)

a, b, c = Tweak Weight Controls: (available to Google for fine-tuning the
results)

d, e, f = Dampener Controls: (available to Google for fine-tuning the
results. We believe that the value of ‘f’ is currently set at zero.)

fb = FactorBase: (The PageRank scale of 1 to 10 on the Google bar is not
linear but an exponential/logarithmic one. As per our internal analysis,
we believe that it is a base ‘close to’ 8. This means that PR5 is 8 times
more in value than PR4. As such, a PR8 website has a value roughly 4,000
times (8 x 8 x 8 x 8 = 4,096) that of a PR4 website. This factor somehow
needs to be built into the algo formula. We have therefore taken a fb
value to accommodate this factor)
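
To see how the variables interact, the two simplified formulas above can be written out as a short sketch. It transcribes the formulas literally, including the ‘PR * fb’ term as written; the default values for a, b, c, d and e are placeholders chosen for illustration, not values known to be used by Google, and `toolbar_ratio` is a hypothetical helper for the base-8 arithmetic.

```python
def old_rank(rs, pr, a=1.0, b=1.0, d=0.0, e=0.0, fb=8.0):
    # Old formula: {(1-d) + a*RS} * {(1-e) + b*(PR*fb)}
    return ((1 - d) + a * rs) * ((1 - e) + b * (pr * fb))

def new_rank(rs, pr, ls, a=1.0, b=1.0, c=1.0, d=0.0, e=0.0, f=0.0, fb=8.0):
    # New formula: the old formula multiplied by {(1-f) + c*LS}
    return old_rank(rs, pr, a, b, d, e, fb) * ((1 - f) + c * ls)

def toolbar_ratio(pr_high, pr_low, base=8.0):
    # If each toolbar step is worth `base` times the previous one, the value
    # ratio between two toolbar PR levels is base ** (difference in steps).
    return base ** (pr_high - pr_low)

# With f = 0, a page that receives LocalScore 0 gets a Hilltop factor of 1,
# so its new score equals its old score.
print(new_rank(rs=0.5, pr=4, ls=0) == old_rank(rs=0.5, pr=4))
print(toolbar_ratio(8, 4))
```

One consequence worth noticing: with ‘f’ set at zero, a zero LocalScore leaves the old ranking untouched, which fits the observation that queries where Hilltop finds no experts still show ‘old Google’ results.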

Merits of the new Google algo

Search engines have
always been a little wary of how far their
ranking algos should rely on ‘on-page’ factors. Most search engines stopped
valuing factors prone to extreme abuse, such as the keyword Meta tag, long
ago. On-page factors give too much control (for abuse) to the webmasters.
Visible parts of the web page have been less prone to spam because they
need to make sense to human visitors. However, for quite some time,
even these on-page factors have been subject to abuse by way of presenting
sub-standard, over-optimized or even ‘cloaked’ content to the search
engines.

What is the new ‘ranking’ weight distribution?

As you can see in the new formula above, Google has taken
significant weight off the on-page factors. The only on-page variable in
the formula is now the ‘RelevanceScore’. Our analysis of the above
formula and Google behavior indicates that the total weight distributed to
the 3 components (RS group, PR group and LS group) is as follows:

RelevanceScore = 20%, PageRank = 40%, LocalScore = 40%

Where:

- RS is the translation of all SEO efforts
- PR is the translation of link-building efforts
- LS is the translation of links from the expert documents

With this implementation, Google
has shifted significant weight to the off-page factors, taking away
ranking control from webmasters. As you can see, there is a fairly low
score level available to gain just from your SEO efforts. If an average
SEO expert is able to leverage 10% of this weight and a super expert SEO
can leverage 18% of this weight, the total difference in ranking between
an average SEO and a great SEO is just about 8%. News: The SEO and ranking
rules have just been changed!!!

Is Hilltop running in real-time?

Google is primarily running its service through 10,000
Pentium servers distributed across the web. That’s how they have built
their server architecture. If we study the Hilltop algo, it is difficult
to believe that such Pentium servers would have the processing power to
locate ‘expert documents’ from thousands of topical documents, evaluate
LocalScore of target pages from all these documents and pass the value to
other components of Google algo, which then further process the results,
on the fly, all in just about 0.07 seconds – the speed Google is famous
for.

So how and when does Hilltop kick in?

We believe
that Google runs batch processing on popular search terms
(the so-called ‘money keywords list’) and stores the results ready to serve.
Google has a vast database of popular search terms,
collected from actual searches as well as keyword phrases used in the
AdWords program. Google has perhaps set a threshold on the number of
searches a search term needs to have before it qualifies to get into the
Hilltop pool for batch processing. Hilltop runs on the total pool of
popular search terms, maybe once a month. Incremental, smaller batch
processing may be done more frequently on search terms that gain
popularity and qualify to get into the Hilltop pool. Results for the major
pool may be synchronized with the 10,000 servers once a month and the
smaller batches updated more frequently.

Search terms that do not
qualify to kick in the Hilltop algo continue to show you the old Google
ranking. Many SEOs are happy and claim that their listings have not gone
down for several client sites. They are perhaps checking with highly
specific search terms that have not qualified to be on the Hilltop radar yet.
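
The batch-processing hypothesis above can be sketched as a precomputed pool plus a fallback. This illustrates our speculation only, not Google's actual architecture: the names `build_hilltop_pool` and `serve`, the threshold value, and the way LocalScore is blended into the old score are all assumptions.

```python
def build_hilltop_pool(search_counts, threshold, compute_local_scores):
    """Offline batch step: run the expensive Hilltop computation only for
    terms searched at least `threshold` times, and store the results."""
    return {term: compute_local_scores(term)
            for term, count in search_counts.items()
            if count >= threshold}

def serve(term, pool, old_scores):
    """Query-time step: a cheap lookup. Terms outside the pool get an empty
    LocalScore table, so the old ranking order is returned unchanged."""
    local = pool.get(term, {})
    return sorted(old_scores,
                  key=lambda page: old_scores[page] * (1 + local.get(page, 0)),
                  reverse=True)
```

Under this design the costly expert-document work never happens at query time, and a highly specific term that falls below the threshold is simply served from the old ranking.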

Google acquired the patent in February 2003. Why did it take so long to deploy?

Tests, tests, tests, compatibility issues, more
tests, result evaluations, fine-tuning and further tests. This was perhaps
going to be one hell of a change for Google to deploy. The algo needed to
work perfectly with the existing PR and RelevanceScore components of the
Google algo. I guess all this takes time.

Are there any downsides / flaws with the new Google algo?

As we do our further
analysis, we expect several bugs and shortcomings to manifest over time.
Here are a few that we feel could hurt Google and its users:

- Hilltop is based on the assumption that each ‘expert document’ it
  locates would be unbiased, spam-free and manipulation-free. We feel that
  this may not be the case. If even a small percentage of expert documents
  are contaminated, the scores would magnify the error, leading to a
  significant number of ‘false positives’ in the top ranks.
- Hilltop attempts to arrive at a selection of pages voted to be
  ‘authoritative’. There is no guarantee that these pages would also be
  ‘quality’ pages.
- We believe that since a lot of processing power is
  required to run Hilltop, it (probably) runs as a monthly batch process
  for popular search terms. Coupled with the fact that
  significant weight is assigned to the ‘Hilltop’ part of the Google algo,
  we may expect to see sites continuing to rank without much fluctuation
  until the next processing. Since the voting patterns of the ‘expert’
  pages are unlikely to fluctuate much, we can expect to see ‘stale’
  rankings over sustained periods. This may work against the fabric of
  search engines, which are expected to also include ‘new, good’ content
  in their search results. ‘Authoritative’ pages apart, people also want
  to see fresh content, which will now be visible only on less competitive
  or unique search terms where Hilltop fails to kick in.
- New sites will find it increasingly
  difficult to rank for popular search terms. Google seems to have
  created a bigger barrier for new sites or new content to rank for
  extremely popular search terms.
- Since most commercial sites find
  it easy to link to directories, trade associations, government sites of
  trade authorities, educational institutions, non-profit organizations
  (read: non-competitive sites), such sites will increasingly populate the
  top-10 rankings on the result pages.

Who will suffer in the near to medium term?

- Affiliate sites / domain clusters / MLM programs running on the same
  servers.
- Sites relying heavily on ‘on-page’ site optimization factors.
- Sites that rely on highly competitive search terms to get traffic.

Recommendations for site owners

- They need to think out of the box and seriously consider improving
  PageRank and links from ‘Expert Documents’ as distinct promotional
  campaigns. The rules of ranking have changed significantly.
- Get listed in as many major directories (DMOZ, Yahoo, About, LookSmart
  etc.), trade directories, yellow pages, associations, resource pages,
  highly classified sections pages etc. as possible.
- Avoid domain clusters / affiliate
  programs or change the nature of affiliate programs.
- Avoid reciprocal links from suspect FFA sites and link farms.

Popular Myths:

- Over-optimization is now being penalized:
  Over-optimization (spam) has always been either discounted or penalized.
  The current rank loss is due to the shift of weight from
  on-page to off-page factors. Good site optimization will continue to
  support rankings to the extent of its weight in the algo.
- Link building is no longer important: Link-popularity building is as
  important as before, perhaps even more important now. The PR algo
  continues to gain importance.
- Google is using a ‘money words’ filter list: As you
  can see from the above arguments, Google is not using any filter list to
  penalize commercial sites. The results just ‘seem’ to indicate such
  symptoms. Nor has Google implemented this algo for the sake of pushing
  their AdWords or building their bottom line for the forthcoming IPO.
- Listing in DMOZ, Google directory, commercial directories gives
  Google the clue that your site is commercial and therefore penalizes it:
  On the contrary, since most of these directories ‘qualify’ to be
  ‘expert documents’, links from these sites are of great value.

About the Author:

Atul Gupta is founder and CEO of
SEORank. With over 8 years of experience in the Internet industry,
Atul Gupta has helped several companies formulate and roll out their
online marketing strategies targeted towards search engine positioning.
His knowledge and experience lend credibility to the company and fuel
his team of professionals.

Related Reading:

What’s the new Google ‘Florida’ Algo buzz? http://www.seorank.com/google-florida-update.htm
Google PageRank Algorithm Explained http://www.seorank.com/google-pagerank.htm

Search engine optimization (SEO) expert Dr. Andy
Williams has written a 21 page report called "Sitemaps: The
Missing Link of Search Engine Optimization". In it, he identifies
the key features you'll need in your website's site map if you want to
get the maximum mileage out of it.

FIRST... WHO IS ANDY WILLIAMS?

He's the creator of the highly regarded Sitemap Creator,
which automatically generates themed site maps. For the linking
text, his software uses the title of each web page, and it creates a
description for each link using your meta description tags (or your
visible text if there's no meta description). Then, before you upload
the newly created pages, you can change the order of the listings
so as to create themed sitemap pages if you wish. Fascinating
software indeed! But you can also create effective sitemaps BY
HAND -- if you know what you're doing. So let's get into it.
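
For readers curious about the title-plus-description approach mentioned above, here is a minimal sketch of how one sitemap entry could be built from a page's title and meta description. It is not Andy's software and makes no claim about how Sitemap Creator works internally; the class and function names are made up for illustration, and the fallback to visible text when there is no meta description is left out.

```python
from html import escape
from html.parser import HTMLParser

class PageSummary(HTMLParser):
    """Collects the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def sitemap_entry(url, html):
    """Return one HTML list item: the page title as the link text, followed
    by the meta description when one exists."""
    parser = PageSummary()
    parser.feed(html)
    title = parser.title.strip() or url  # fall back to the URL if untitled
    line = f'<li><a href="{escape(url)}">{escape(title)}</a>'
    if parser.description:
        line += f" - {escape(parser.description.strip())}"
    return line + "</li>"
```

Running `sitemap_entry` over each page in a themed cluster and wrapping the results in a list gives one themed sitemap page; the order of the entries can then be adjusted by hand.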

YOUR TWO VISITOR TYPES

You have TWO types of visitors
to your site: search engine spiders and human visitors. And BOTH are
important. The spiders must visit your new web pages before they
can be listed in their indexes. And the people? -- aha,
they're the ones with the credit cards who will buy your products or
services. So your site map needs to appeal to BOTH groups.

After reading through Andy's report, here is my "profile of
the ideal sitemap".

- How does your site measure up?
- And what changes can you make to get better results?

ANATOMY OF A SEARCH-ENGINE FRIENDLY SITEMAP

* It contains the links. No surprise here, of course! For example:
  http://eProfitNews.com/2tier-affiliate-programs.html

* Your new page's keywords are hyperlinked. Include the most important
  keywords in clickable text links. My example above shows how I did
  this. Yes, text is better than graphics for link purposes.

* You can add value for your human visitors by adding a description of
  the linked page. Andy's report illustrates how best to do that on page
  11.

* All the links on the sitemap page are theme-related. So
  you may need several sitemaps, one for each cluster of themed pages.
  You connect all of these themed sitemaps together by having one
  overall sitemap for your site.

* Link to your overall sitemap page from your home page, so the search
  engine spiders will be able to identify all your new pages quickly.
  Note: it's better NOT to submit your new web pages directly to the
  search engines. Let Mr. Spider find 'em on his own!

* By adding a little more information to your overall sitemap, you've
  created a page that the search engines will just love. On page 13 of
  his report, Andy explains what else to add.

FREE COPY OF THE SITEMAP REPORT

If you would like to know more, you can download a copy of Dr.
Andy Williams' 21-page report called "Sitemaps: The Missing Link of
Search Engine Optimization". It's in PDF format and is available by
subscribing to eProfitNews. No charge, of course.
http://eProfitNews.com/m

The report explains in
non-technical language why sitemaps are important to your success, and
how to do them properly.

GET THE EDGE

When you look at
many sites, you'll probably agree with Andy that the humble sitemap is
one of the "most under-rated SEO weapons". And that's great news for
you, because now you know how to transform your sitemap into the
powerful weapon it can be.

ABOUT THE AUTHOR

Gary Harvey is the driving force behind the amazing "HOTTEST eProfit
Strategies". This incredible asset outlines PROVEN techniques that
deliver MORE TRAFFIC and MORE MONEY. You'll be amazed at what you
don't know. http://eProfitNews.com/HOTTEST

Copyright © 2003. Gary Harvey.