## Page Rank Unmasked

If you are patient enough to read all this through, the results are pretty explosive if you are into SEO.

I have a way to predict the PageRank® offered by Google’s Toolbar with a correlation of around 0.8 (R^{2}=0.65)*. At the moment my working theory is that I may be able to predict PageRank® correctly nearly 7 times out of 10, and should be able to predict within one PageRank® point over 95 times out of 100. I don’t think anyone has published this data before, so consider this post the unofficial "Google PageRank® is blown" post – with the caveat that other SEOs will need to verify and maybe improve on this research.

Firstly, a bit of background. I cannot use the actual PageRank® algorithm in any research, even though – with all the link data of MajesticSEO at my disposal – I might be able to recreate it. In the UK you cannot patent a mathematical formula, but if we took the formula and used it in a commercial product available in the USA, I might face legal issues from Google. So this research does not use the PageRank® algorithm in any way. What it does do is correlate numbers generated in my test with the Google PageRank® reported by Google’s toolbar for the same URL, to see whether my numbers – which I am calling UV, for "URL Value" (or "Ultra Violet", if you prefer, because it lays the formula bare) – bear any resemblance whatsoever to Google’s scale.

**My (current) definition of "UValue" (UV)**

"UValue is the number of referring domains that link to the home page URL (not the whole domain) which have, themselves, got links from more than one referring domain."

Now this definition is pretty convenient, because if you look at the way MajesticSEO defines ACRank, it means every link that has an ACRank of 3 or more is a candidate – except that you also need to count only one link per referring domain. Fortunately, MajesticSEO lets me find this in the standard reports. Here’s how:

1. Choose the URL you want to see the UValue for and put it into MajesticSEO.com

2. Buy the standard report for this url.

3. From the report’s domain overview, select URL > Backlinks, checking that the filter is set to return the best backlink only per domain.

4. Scroll through the list (or download CSV) to find out how many in this list are ACRank 3+ and use this number as the UValue. (Screenshot)

Please note that UV is NOT ACRank. It is a very different number, running from zero to many thousands depending on the URL.
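Once the CSV from step 4 is downloaded, the count itself is easy to script. Here is a minimal sketch in Python, assuming hypothetical column headers `ReferringDomain` and `ACRank` (the real Majestic export almost certainly uses different header names, so adjust to match your file):

```python
import csv
from io import StringIO

def uvalue(csv_text):
    """Count referring domains whose best backlink has ACRank >= 3.

    Assumes the export already contains one row per referring domain
    (the "best backlink per domain" filter) with an ACRank column.
    """
    reader = csv.DictReader(StringIO(csv_text))
    domains = set()
    for row in reader:
        if int(row["ACRank"]) >= 3:
            domains.add(row["ReferringDomain"])
    return len(domains)

# Tiny made-up export for illustration.
sample = """ReferringDomain,ACRank
example.com,5
example.org,2
example.net,3
"""
print(uvalue(sample))  # 2
```

In practice you would pass the downloaded file's contents rather than an inline string.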

**My Hypothesis:**

In a vague attempt to reclaim some of the scientific credibility required for these things – I did my Maths degree over 20 years ago – I have a hypothesis. Luckily I also have a business partner who did his degree even longer ago, but went on to get a doctorate and lectured at Iowa State and later at Cranfield University. Between us we hope not to over-claim.

My hypothesis is that UV has a correlation with PageRank® (and therefore might be used as a predictor of PageRank®) when looking at home pages. I could expand the theory to inner pages, but for now I decided to concentrate on home page research.

Now this is not too difficult to test. It is simply two sets of paired numbers. So it turns out that you really only need 25-30 pairs to test this hypothesis. We could repeat with hundreds or thousands of pairs, but it is unlikely our correlation would increase or decrease, just the degree of certainty that there is (or isn’t) a correlation. I took all the home page listings from two directory categories in Dmoz as a sample – one from a "mobile phone" category and one from an "underwater photography" category. Using genuinely random domains might be a way to improve on the test for anyone wishing to try to emulate the research.

After eliminating sites which did not generate a 200 response code, this gave me a list of 42 sites ranging from Page Rank 1 to Page Rank 6. The UV runs from 46 to 1819.

Using a linear regression, we see some good evidence of a decent fit – with an R^{2} of 0.62 (a perfect fit would be R^{2} = 1):

But having a Doctor helping me with the stats means we can do better. We looked at the same data again and used Excel’s wizardry to fit a binomial trend line to the data. Excel was able to do this with an R^{2} value (correlation) of 0.68. The graph below shows the data points.
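For anyone wanting to reproduce this kind of trend-line comparison outside Excel, here is a sketch using numpy, treating the "binomial" trend line as a second-order polynomial (Excel's order-2 polynomial trendline) and using made-up (UV, PR) pairs rather than the study's real data:

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Illustrative (UV, PR) pairs -- NOT the post's actual 42-site data set.
uv = np.array([46, 60, 85, 150, 300, 700, 1100, 1819], dtype=float)
pr = np.array([1, 2, 3, 4, 4, 5, 6, 6], dtype=float)

for degree in (1, 2):  # linear fit vs. second-order polynomial fit
    coeffs = np.polyfit(uv, pr, degree)
    pred = np.polyval(coeffs, uv)
    print(f"degree {degree}: R^2 = {r_squared(pr, pred):.3f}")
```

Because the quadratic model nests the linear one, its R^{2} can never be lower on the same data, which mirrors the 0.62 vs 0.68 pattern above.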

However – even though this mathematically fits – I do not like the feel of the chart, to be honest. Looking at the results, I think I can make a more intuitively logical prediction, mapping the UValue range onto a prediction of PageRank® as follows:

| If UValue is greater than | Then PR is predicted to be |
| --- | --- |
| 1200 | 6+ |
| 900 | 5 |
| 200 | 4 |
| 80 | 3 |
| 50 | 2 |
| 30 | 1 |
| Below 30 | Not known |

The table above predicted the correct PageRank® in my test 69% of the time (29 times out of 42) and predicted the answer to within one PageRank® point over 95% of the time (40 out of 42).
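The table above is easy to turn into a lookup function; a minimal sketch:

```python
def predict_pr(uvalue):
    """Map a UValue onto a predicted toolbar PageRank using the
    thresholds from the table above."""
    thresholds = [(1200, "6+"), (900, "5"), (200, "4"),
                  (80, "3"), (50, "2"), (30, "1")]
    for cutoff, pr in thresholds:
        if uvalue > cutoff:
            return pr
    return "Not known"

print(predict_pr(1819))  # 6+
print(predict_pr(100))   # 3
print(predict_pr(25))    # Not known
```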

I would love others to try to replicate this research and maybe make some modifications to improve it.

*PageRank is a concept owned in the US by Stanford University and Google and opinions expressed are my own.

Great research Dixon – one thing I couldn’t help thinking from looking at the graph/data is that there should be a formula that gives a greater correlation. Specifically, one whose curve passes through the origin.

What correlation do you get with a formula of the basic graph shape 2log(x+1)^0.5 (if the x axis is scaled to fit the data)? This graph has a basic shape which appears to fit the data better.

It also doesn’t slope downwards at higher values but flattens its rise.

Edit: I think both axes would need to be scaled, if you send me some sample data Dixon I could have a look and play around with it.
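Reading the suggested shape as y = a·√(log(bx + 1)), with a and b playing the role of the two axis scalings, the fit can be sketched with a simple grid search over b plus a closed-form least-squares solve for a. The (UV, PR) pairs below are illustrative, not the post's data:

```python
import math

# Illustrative (UV, PR) pairs -- not the study's data set.
uv = [46, 60, 85, 150, 300, 700, 1100, 1819]
pr = [1, 2, 3, 4, 4, 5, 6, 6]

def fit_scale(b):
    """For fixed b, the least-squares scale a in y ~ a*sqrt(log(b*x + 1)),
    plus the resulting sum of squared errors."""
    f = [math.sqrt(math.log(b * x + 1)) for x in uv]
    a = sum(fi * yi for fi, yi in zip(f, pr)) / sum(fi * fi for fi in f)
    sse = sum((a * fi - yi) ** 2 for fi, yi in zip(f, pr))
    return sse, a

# Grid search b over 0.01 .. 5.00; tuples compare by SSE first.
best_sse, best_a, best_b = min(fit_scale(i / 100) + (i / 100,)
                               for i in range(1, 501))
print(f"a={best_a:.2f}, b={best_b}, SSE={best_sse:.2f}")
```

This is only one reading of the commenter's `2log(x+1)^0.5`; the constant 2 is absorbed into the fitted scale a.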

I conducted a similar study on PageRank in June. I used a large, relatively diverse sample, and I think you may gain some insight from my results.

I suggest that you use Spearman’s Coefficient rather than Pearson’s. PageRank is an ordinal variable, and Pearson’s coefficient is meant to be used with two interval-level variables. It will linearize your regression, and keep you from having to use a non-linear model.
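For anyone wanting to compare the two coefficients on their own data, here is a self-contained sketch (Spearman's coefficient is just Pearson's computed on ranks, with ties given their average rank); the pairs are made up for illustration:

```python
def rank(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(rank(x), rank(y))

# Made-up (UV, PR) pairs: monotone but non-linear relationship.
uv = [46, 60, 85, 150, 300, 700, 1100, 1819]
pr = [1, 2, 3, 4, 4, 5, 6, 6]
print(f"Pearson:  {pearson(uv, pr):.3f}")
print(f"Spearman: {spearman(uv, pr):.3f}")
```

On monotone-but-curved data like this, Spearman comes out higher than Pearson, which is exactly the commenter's point about ordinal variables. In practice `scipy.stats.spearmanr` and `scipy.stats.pearsonr` do the same job.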

Interesting research Dixon, perhaps we should try a larger sample size?

One further thing that should be done in studies of this type is to separate training and test data.

In your experiment you use training data to decide the parameters of the model and then you are evaluating the accuracy of the model on this same data set; this makes the accuracy seem higher than it actually is.

It is better to evaluate how useful the model is on a separate data set; otherwise you risk overfitting to the test data.

I’m sure I’m right about the above, not so much about my next point.

I think correlation is the wrong thing to optimise your model for; I think mean squared error (what Excel uses to draw the line of best fit) would be a better choice.
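The train/test discipline the commenter describes can be sketched in a few lines: fit on one subset, score on the held-out subset. The pairs and the deliberately crude threshold model below are made up for illustration:

```python
import random

# Illustrative (UV, PR) pairs -- not the study's data set.
pairs = [(46, 1), (60, 2), (85, 3), (150, 4), (300, 4), (700, 5),
         (1100, 6), (1819, 6), (55, 2), (220, 4), (90, 3), (35, 1),
         (950, 5), (1300, 6)]

random.seed(0)
random.shuffle(pairs)
split = int(0.7 * len(pairs))  # 70/30 train/test split
train, test = pairs[:split], pairs[split:]

# Crude threshold model fitted on the TRAINING set only:
# record the smallest UV seen for each PR band.
cutoffs = {}
for uv, pr in train:
    cutoffs[pr] = min(uv, cutoffs.get(pr, uv))

def predict(uv):
    prediction = min(cutoffs)  # fall back to the lowest PR band seen
    for pr in sorted(cutoffs):
        if uv >= cutoffs[pr]:
            prediction = pr
    return prediction

# Evaluate on the held-out TEST set only.
mse = sum((predict(uv) - pr) ** 2 for uv, pr in test) / len(test)
print(f"train={len(train)}, test={len(test)}, test MSE={mse:.2f}")
```

Scoring on the held-out set gives the honest accuracy figure; scoring on the training set, as the commenter notes, flatters the model.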

Hi Richard,

Yes – this is R squared and is 0.65.

It’s not my first (or only) experiment, but you are correct that it needs replicating on other data sets. That’s one of the reasons for publishing this – because there is only so much time in a week.

I’m also sure that a better fit can be found. Miles, one of the PhD guys I asked to look over the data before publishing, also suggested a log scale, and frankly I already knew that from some research done years ago – which was purely theoretical, because there was no dataset like Majestic to verify or refute the claim. If only I could find the research! Anyway – I’ll send off the spreadsheets, Miles.

Hey Dixon,

Firstly, thanks for publishing this (and staying relatively modest about it!).

I’m still trying to get my head around the maths of this correlation stuff (TBH I have a mistrust of theories proved by correlations, but I have to leave it to my more learned, stats-trained colleagues to stress-test them).

One thing that immediately strikes me about this (and ACRank, which this post made me read more into – btw your “how ACRank is defined” link is broken):

Both UValue and ACRank seem to focus entirely on the VOLUME of linking domains. Whereas from the theory of PageRank and experience of what gets rankings I know the power of a single link from a powerful domain can outweigh that of a hundred links from diverse domains, especially when you get to the thick end of the SERPs.

I know that you are excluding links with ACRank < 3 here to try and filter out weak links – but then ACRank itself is determined by the number of linking domains...

So I guess the test here would be to look at homepages that have a large number of low-level linking domains (AC3/AC4 links) vs homepages that have a small number of links from high-PR pages (you know, the ones which inherit a PR5 off the back of 1 or 2 killer links) – and test the correlation there, at the edges of the theory. In my mind/experience these opposites are where the theory may fall apart.

Not fully sure where I'm going with that, to be honest. At the end of the day, if this can become a test we can contribute data to and determine a statistically solid pattern, this way of 'predicting' homepage PR could become a really powerful link valuation tool, particularly in fallow patches between TBPR updates. The question of whether it is actually useful in SEO/link-building practice comes down to whether it fills gaps in knowledge and helps get results.

Look forward to you trying to make sense of my comment Dixon – end of the day, appreciate you publishing this post.

Hi Dixon,

Thanks for sharing your thoughts. Is your test (i.e. 95% accuracy) on a separate sample from the training set? I can’t see two sets of data easily in the post…

Assuming it is, I would be worried about a few things before digging into the calculations:

– you are right that a small sample-size is fine if we are selecting randomly from a random distribution, but we don’t know a lot about the distribution you are selecting from and the selection methodology for those sites is definitely not random

– you have a very small number of sites (reading from the chart, perhaps as few as 4) with PR above 4 – and they happen to fall in monotonic order so your algorithm predicts those perfectly. I find it hard to believe that this will continue with larger sample sizes

– if your algorithm actually predicts PR 1-4 reasonably well (i.e. with a 95% accuracy to within one “PR point”), off-by-one actually covers half the range!

Finally, given what we know about the underlying pagerank algorithm (and assuming that toolbar PR is at least partly determined from that) we could seek pathological examples such as sites with large numbers of relatively weak links and small numbers of very powerful links to test your theory out further.

Interested to hear more…

Thanks for the feedback Will and Jaamit. Sorry the comments took a while to go live, as I was at Pubcon so on the wrong time zone.

It’s a hard post, because it’s the start of a journey and I need others to look further into improving the correlation. For example, I expect you are correct, Jaamit, that if you weighted UValue based on the spread of ACRank values of 3 and above, you would get a better approximation. That would be saying that good-quality links potentially count much more than average ones. The problem is that to do that, I (not being an API whizz) would need to create 42 quite complicated spreadsheets before seeing if there is a match.

Further improvements may also be possible by making different assumptions about nofollow or redirect links, depending on each individual link’s strength, but that would start to be the work of a PhD project and I have a day job, so I am hoping that someone will use an API to make the task a bit quicker.

Distance from a trusted domain source would also be great – because in this sense my list was not at all random: I chose two ODP categories, so every site is ALREADY vetted by a human. So my research has an underlying assumption that the sites are real businesses of value and not setting out to deceive the system. This might well mean, Jaamit, that your test of sites with low-quality links will indeed break the theory – in which case the spread of ACRanks becomes a more obvious route to improve the correlation. Then again, how far do we really want to go to recreate an approximation of a figure that is only shown to the public to whet our appetites? The real correlation would be between a hypothesis and rankings – but we need to start somewhere.

Thanks for pointing out the ACRank link – now fixed.

Thanks also for your comments, Will. Firstly, the 95% was ONLY within a one-PageRank-point margin of error and was only valid for this sample set. I value your input and accept that the data set on higher PR sites is small.

The trick would be to do more sample sets and try to validate the data. Until that is done, I also would not get TOO excited about the results, but if the extra data is used to improve the model then we could be onto something. Newton’s theories took a while to develop because someone had made a rough, incorrect assumption as to the size of the earth – but the assumption had been good enough to colonize the Americas.

So, two steps to improve the prediction:

1: Weight the UValue by the strength of the ACRanks spread

2: I have to assume that the site is not gaming links – otherwise sites on the fringe with unnatural link profiles will for sure break the model. But then again, as long as the number of sites doing that is relatively small, this would be OK. So I guess this will work better outside the pills, finance and gaming industries.