Something we’ve come across recently is a search query to check for URLs which are listed in Google’s main index. It appears to exclude any webpages which may be listed in Google’s supplemental/secondary index.
The query to find this is: site:domain.com/*
At the moment I’m unsure how accurate this is. The only reference I can find is a comment from Halfdeck on Jim Boykin’s How to Find if a Page is in Google’s Secret Supplemental Results from two and a half years ago. Since then, despite pleas to bring them back, Google have removed the queries which helped us to find supplemental pages.
But looking at the results, I think there is some truth to them, even if they are not entirely accurate.
Example supplemental query – BBC.co.uk
For example, BBC.co.uk has 47.9 million indexed pages – but the supplemental query lists only 1.43 million results:

Believable? It seems a bit low to me, for a high quality site like the BBC to only have 2.98% of their pages listed in Google’s main index – but then again 1.43 million pages is still a lot of content. Amazon.co.uk is a similar story, with 3.22 million in the main index as opposed to 145 million in the full index.
So what about for a smaller site?
The most obvious site for me to review is SEOptimise. We have 3,300 pages indexed in Google for a regular site: query, yet only 647 pages indexed for the non-supplemental query:
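As a quick sanity check on the ratios quoted so far, here is a minimal Python sketch. It uses only the rounded figures from this post; the function name `main_index_share` and the dictionary layout are my own, and it assumes the /* count really is a subset of the regular site: count, which is unconfirmed.

```python
# Rounded figures quoted in the post:
# (regular site: count, site:domain.com/* count)
counts = {
    "BBC.co.uk": (47_900_000, 1_430_000),
    "Amazon.co.uk": (145_000_000, 3_220_000),
    "SEOptimise": (3_300, 647),
}


def main_index_share(full_count, star_count):
    """Return the /* count as a percentage of the regular site: count."""
    return 100.0 * star_count / full_count


for site, (full_count, star_count) in counts.items():
    print(f"{site}: {main_index_share(full_count, star_count):.2f}%")
```

With these rounded inputs the BBC comes out at roughly 3%, Amazon.co.uk at a little over 2% and SEOptimise at just under 20%, so the smaller site has a much larger share of its pages showing up for the /* query.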

So is this one accurate? My honest answer is I’m unsure. I posted earlier this week that we now have 877 posts on the blog; in addition to this there is the main site and other blog pages such as tags and categories. So I’d like to think we have a large percentage of high-quality content which is well valued by Google. But if the query only counts pages which have built up strong trust in Google, either via external links or internal linking/navigation/site structure, then this figure may be very close. The one thing I didn’t understand was that a large number of blog tag pages were listed, often for tags used only once or twice, so the number of links pointing to them is likely to be low, as is the quality of the content on the page (which is duplicated from each main post).
And what are the factors which affect this?
If these results are accurate, I’d expect the following factors to have an impact here:
- Link reputation/PageRank to be a large factor for indexation
- Age of site
- History in the search engines
- Inbound links (internal and external) to individual webpages
- Duplicate content
- Volume of content / amount of unique content on-page
So what do you think, have you used this query before and do the results seem accurate? I definitely don’t think they are 100% accurate, but it may be a useful indicator to keep an eye on, especially if you are having problems generating traffic for pages which are indexed for a regular site: query and the fix isn’t an obvious one.














There is currently, or at least seems to be, a difference in reported index size between US and ROW (rest of world).
For instance, depending on the data center, this might return around 8M indexed pages for the BBC:
http://www.google.com/search?q=site:bbc.co.uk&pws=0&gl=US
/* is one method; there are other, similar queries which can be more or less consistent.
/* returns the same for both US and ROW currently
At least one of the alternatives returns similar though lower numbers with some variance based on geographical factors.
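To make the US/ROW comparison repeatable, here is a small Python sketch that builds both query URLs for a domain. The `pws` and `gl` parameters are simply copied from the URLs in this comment; the function name `site_query_url` and its arguments are made up for illustration, and the code only builds URLs, it does not fetch anything.

```python
from urllib.parse import urlencode


def site_query_url(domain, with_star=False, country="US", host="www.google.com"):
    """Build a site: query URL, optionally with the trailing /* pattern."""
    query = f"site:{domain}/*" if with_star else f"site:{domain}"
    # pws and gl are taken from the example URLs in the comment above.
    params = {"q": query, "pws": "0", "gl": country}
    return f"http://{host}/search?{urlencode(params)}"


for country in ("US", "UK"):
    print(site_query_url("bbc.co.uk", country=country))
    print(site_query_url("bbc.co.uk", with_star=True, country=country))
```

Opening the four URLs side by side gives the regular and /* counts for each country setting, which is the comparison described above.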
In many ways it is a chicken/egg situation with indexation.
PageRank helps get pages in the index, but you need authority/trust to get pages in the primary index.
More pages in the primary index is a good indication of higher trust, but does not directly imply significantly more search traffic.
It is like the difference between 1 in 10 and 11 in 100 or 101 in 1000 – more pages in the index = more PageRank, but not necessarily more trust/authority.
Thanks for the comment Andy, interesting stuff – strange that there’s a difference between US and ROW.
And what you said about PageRank makes sense, obviously direct external links to content will help significantly here too (as opposed to using PageRank strength filtered down to get as many pages as possible indexed).
It looks like site:domain.com/* displays the first (important) results of site:domain.com/. So I don’t see big similarities with the supplemental index.
Something else: when I search for “site:example.com/*” I see the “Similar” link on pages that don’t show a “Similar” link when I search for “site:example.com/”. I’m talking about the exact same URLs, and I tested it multiple times with two domains. It might be that those two searches are served by two different DCs, but why? There is definitely something going on with the *-parameter in site:-search.
We’re fairly sure that the Santa Clara data centre is now also running Caffeine. The index difference people are seeing seems to match the server that Google earlier confirmed as Caffeine.
Interesting find, I’ve not noticed that before. I’ve checked against a small (lowish authority) site I have that ranks fairly well for some core terms, it’s a site I use quite often for testing. The site only has 80 pages on the site:domain.com search, but has just 14 for site:domain.com/*.
Checking the stats I can see that in the last week, Google has sent visits to 58 landing pages (all organic traffic). Don’t forget that that’s just a one-off example, but it implies that Google will still rank and be willing to send traffic to pages even if they don’t appear in your site:domain.com/* search.
In summary and conclusion though, I still have no idea what the /* is showing…
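For what it’s worth, the arithmetic behind that observation can be written out as a short Python sketch. The three numbers are the ones quoted in this comment; the variable names are mine, and it assumes the 58 landing pages are distinct URLs on the same site.

```python
site_count = 80      # pages reported for site:domain.com
star_count = 14      # pages reported for site:domain.com/*
landing_pages = 58   # organic landing pages in the last week

# Even if every /* page received a visit, at least this many landing
# pages must sit outside the /* results.
min_outside = max(0, landing_pages - star_count)

# Share of the regular site: results that also appear for /*.
star_share = star_count / site_count

print(f"At least {min_outside} landing pages are not in the /* results")
print(f"/* covers {star_share:.1%} of the regular site: count")
```

So at least 44 of the 58 pages receiving organic visits cannot be among the 14 pages returned for /*.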
I’m afraid this syntax is just a buggy regular expression and does NOT show the non-supplemental results.
Compare it with this one here: “/*.”
It shows all pages on a site after the slash that have one word file name strings before that first dot.
It hides subdomains, directories, even homepages lacking a filename like “index.php” and even complex filenames consisting of more than one word combined with a minus.
So you’ll see
example.com/filename.php
but won’t
subdomain.example.com/filename.php
example.com
example.com/directory/
example.com/file-name.php
etc.
When you take away the dot you’ll see more pages as the regular expression is less strict but still it’s just a regular expression.
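As a rough illustration, the behaviour described above for “/*.” can be modelled with an ordinary regular expression in Python. This is only a sketch of the description in this comment, not of how Google actually evaluates the query; the pattern and the function name `matches_described_pattern` are assumptions.

```python
import re


def matches_described_pattern(url, domain="example.com"):
    """True if url is the bare domain, a slash, one single-word file name and an extension."""
    # [A-Za-z0-9]+ stands in for a "one word" file name: no hyphens, no further slashes.
    pattern = re.escape(domain) + r"/[A-Za-z0-9]+\.[A-Za-z0-9]+"
    return re.fullmatch(pattern, url) is not None


examples = [
    "example.com/filename.php",
    "subdomain.example.com/filename.php",
    "example.com",
    "example.com/directory/",
    "example.com/file-name.php",
]

for url in examples:
    print(url, matches_described_pattern(url))
```

Only the first URL matches, which lines up with the list of shown and hidden pages above.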
Also compare this query: site:*.bbc.co.uk/ -www
where you only see the BBC subdomains.
I hope that helps.
I don’t see any major changes after using both queries as per your blog explanation. I feel that Google’s algo uses a lot of factors, and not just some calculation like “we can only give this query this many spots”.
This would also show only subdomains
site:bbc.co.uk/ -www
Numbers are different though, as there is no pattern matching
http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=71826
The numbers using * are slightly different to another method I am aware of, but the ratios are typically very similar.
Conventional wisdom is that site:domain.com and site:domain.com/* should return the same number, but they do not.
Do you notice a similarity to this other method?
http://search.aol.co.uk/aol/search?s_it=topsearchbox.search&query=site%3Aandybeard.eu&rp=
http://www.google.co.uk/search?q=site:andybeard.eu/*&pws=0&gl=UK
For bigger sites the disparity is often much greater (60%), but with the BBC it isn’t actually that far off
http://www.google.co.uk/search?q=site%3Abbc.co.uk%2F*&pws=0&gl=UK
http://search.aol.co.uk/aol/search?query=site%3Abbc.co.uk
As Andy said, /* and regular site: query should theoretically return the same number but they don’t. Sebastian uncovered another query using date filters a while ago – might wanna look that one up too. John Mu once said regarding the normal site: query that it was only an approximation and suggested Webmaster Tools as a better alternative for monitoring indexation, so I’d also interpret /* numbers as no more than ballpark figures.
Another way to check index penetration is cache date of course, but those numbers are also influenced by things like number of page crawls per day so you have inaccuracies there too.
Average pages crawled per day is also a reflection of domain strength since crawling depth is mainly controlled by PageRank.
Would this explain why sites rank very well in ROW while the exact same site is nowhere to be seen in the US? I have a client with several sites ranking very well in the UK, and indeed anywhere else in the world that I check, but not in the US.
Hey, I found some fast PageRank tools; maybe SEO friends would like them.
seo41.com/pagerank-checker.php
seo41.com/internal-pagerank.php
One is a bulk PageRank checker that can check the PR of 1000 pages with a single click, and the other is an internal PR checker to check the PageRank of all internal pages on a website.
There are a lot more helpful things for search engine optimization.