Index Bloat: What is It & How to Fix it (Carefully)

Maintaining a refined index in search engines, especially Google, is critical to surviving (and ideally benefiting from) Google’s regular quality-related updates. It’s not the only factor, but it’s important to ensure that Google deems your site to be “high quality” (assuming the rest of your content is quality; not duplicated with external websites, not having obvious spelling and grammatical issues, not lacking depth or completeness, not outdated, and not having poor engagement). Furthermore, cleaning up your index bloat is often easier than the page-by-page approach that is needed to improve other low-quality pages that you may want to keep indexed in Google and other search engines. For this reason, it’s a good starting point at the onset of optimizing a site.

When I do content audits and tech audits, I always look for index bloat (in Google, especially). Fixing it lays the foundation for future content improvements that you will make and new content that you will add. Think of how a garden needs nutrient-rich soil and occasional pruning for future vegetation to prosper. You need to improve the quality of your foundation before new content and existing content improvements can have maximum organic search impact (and weather the storm of algorithm updates).

What is Index Bloat?

Index bloat is when a search engine has notably more of your URLs indexed than are included in your sitemap, assuming your sitemap is an accurate list of the quality, indexable URLs on your site. In other words, it’s when your site has too many URLs indexed that shouldn’t be.

Common Types of Index Bloat

Filter/Facet URLs – This URL type is most commonly found on eCommerce sites. These URLs simply reorganize or filter the content shown on the page (sorting by price or filtering by color, for example). In eCommerce SEO, it’s important that these URLs be kept out of search engine indexes as they do not provide unique content value.
Tracking URLs – This URL type typically involves a parameter appended to the core page URL and is only used for tracking purposes. Canonicalization is the best approach to remove these URLs from search engine indexes, as it still allows the URLs to be crawled while applying their link equity to the core URL.
Paginated URLs – This URL type provides a path to a series of content pages that are divided up into multiple pages. On eCommerce categories and content archive pages, paginated URLs offer little value to search engine indexes. Previously, Google recommended that they should use proper rel=prev/next coding along with self-referencing canonicals in order to consolidate the paginated series for search engines. Ideally, we don’t want them indexed, but John Mueller from Google announced in 2017 that Google could treat the “follow” in a “noindex,follow” meta robots tag as “nofollow” after some time. We don’t want to limit crawling or PageRank (link equity) from passing to our indexable pages on paginated URLs. Also, some paginated URLs offer value to search engine indexes, such as paginated forum threads. These types of paginated URLs should typically be indexed. However, they should not be in your XML sitemap.
Tag Pages – This URL type creates a dynamically-generated page (typically found on WordPress sites) that lists blog post snippets of all posts tagged with the same post tag. Since they are dynamically-generated, they offer little unique content value to search engines and should typically be set to “noindex,follow” via meta robots tags. Alternatively, important tag pages (those that might already be getting a good amount of organic search traffic) can be optimized similarly to category pages by adding unique content atop the page.
Internal Search Results – This URL type should neither be crawled by search engines, nor indexed, so it’s usually recommended to disallow crawl in the robots.txt file and also apply “noindex,follow” meta robots tags to them.
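To make these categories concrete, here is a minimal Python sketch that sorts a URL into one of the bloat types above. The parameter names and path patterns (utm_source, color, /tag/, /page/2/, and so on) are assumptions for illustration, not rules from any particular platform, so adjust them to the URL structure of the site you are auditing.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical patterns: replace with the parameters and paths your site uses.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid"}
FACET_PARAMS = {"color", "size", "sort", "price"}
SEARCH_PARAMS = {"s", "q"}


def classify_url(url: str) -> str:
    """Return a rough bloat category for a URL, or "other" if none matches."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    path = parts.path.lower()

    if path.startswith("/search") or params & SEARCH_PARAMS:
        return "internal search"
    if "/tag/" in path:
        return "tag page"
    if re.search(r"/page/\d+/?$", path) or "page" in params:
        return "paginated"
    if params & TRACKING_PARAMS:
        return "tracking"
    if params & FACET_PARAMS:
        return "filter/facet"
    return "other"


for example in (
    "https://example.com/shoes?color=red",
    "https://example.com/blog/page/2/",
    "https://example.com/about?utm_source=newsletter",
):
    print(example, "->", classify_url(example))
```

Running a crawl export through a function like this gives a quick count of how many URLs fall into each category before you decide how to handle them.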

Why Does Index Bloat Matter?

The problem of index bloat is especially concerning with Google since this “bloat” of extra URLs is usually low-quality content and often involves dynamically-generated pages. These are not the types of pages that should be indexed in search engines, nor included in an XML sitemap (if they are, remove them), and they can drag down your overall site quality in the eyes of Google, specifically.

The screenshot below shows organic search traffic for a client’s site that received a nice bump upward when Google released a quality-related algorithm update. We had pruned a fair amount of low-quality index bloat URLs beforehand.

How to Identify Index Bloat

There are a few ways to identify index bloat. The methods explained below should ideally be performed in tandem with each other in order to gain a complete understanding of the problem.

Compare Index Status to Sitemaps in Search Console

Google Search Console offers a Coverage report which can be used to identify index bloat. Previously, the older Search Console offered an Index Status and Sitemap Indexation report. Here is an example of a site that had nearly 500 more URLs indexed in Google than are listed in its sitemap.

Index Status

Note: You can additionally use the “site:” advanced operator in Google; however, it’s less accurate than the Coverage report from Search Console.

Sitemap Indexation

This amounts to a 40% index bloat, which is quite high. If you only have a small percentage more URLs indexed than are in your sitemap, say 1-10%, it’s less likely that negative impact will be experienced at the time of a Google quality-related algorithm update (although it should still be addressed). It’s also possible that the indexation of these additional URLs will decrease over time as Google continues to refine its index (if you have the proper non-indexation signals in place: canonical, “noindex” meta robots tags, robots.txt commands, or redirects).
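As a rough sketch of the arithmetic, the percentage can be computed as the extra indexed URLs divided by the sitemap count. That denominator is an assumption about how the figure is expressed, and the counts below (1,750 indexed, 1,250 in the sitemap) are hypothetical numbers chosen only because they produce a 500-URL gap.

```python
def index_bloat_percent(indexed_urls: int, sitemap_urls: int) -> float:
    """Extra indexed URLs as a percentage of the sitemap URL count."""
    if sitemap_urls <= 0:
        raise ValueError("sitemap_urls must be a positive number")
    return (indexed_urls - sitemap_urls) * 100 / sitemap_urls


# Hypothetical counts: 500 more URLs indexed than listed in the sitemap.
print(index_bloat_percent(indexed_urls=1750, sitemap_urls=1250))  # 40.0
```

A negative result would mean fewer URLs are indexed than are in the sitemap, which is a different problem (under-indexation) than the one discussed here.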

Crawl the Site, Yourself

Crawling your own website with a tool like Screaming Frog or Deep Crawl can help to identify specific URLs that might be indexed in search engines (if they are indexable). The goal is to create a list of all indexable URLs and then compare them to your XML sitemap in an Excel spreadsheet or Google Sheet.

Here’s how I identify these URLs using Screaming Frog, when necessary. Update: Screaming Frog now offers a “Crawl Analysis” feature which can check for URLs missing from the sitemap. The method below can still be used, however.

Crawl the site (excluding external sites)
Export the Internal HTML report into a CSV file, and then upload it to Google Drive as a Google Sheet.
Filter the header row in order to identify and remove any URLs that are not indexable.
Crawl your XML sitemap in list mode (without following redirects), export the Internal HTML report into a CSV file, and upload it to a new tab of the same Google Sheet.
1. Note #1: Ensure that you delete any rows for URLs that are non-indexable.
2. Note #2: Review the remaining URLs and ensure that there are no obvious low-quality page types that should be removed from the sitemap (tag any that you find).
Write VLOOKUP formulas (in the main crawl tab) to determine which of the crawled URLs are listed in the Sitemap tab.
1. Note: The reverse of this approach, writing VLOOKUP formulas in the Sitemap tab, can help identify orphaned URLs not discovered during your crawl.
Filter to show only those URLs not listed in the Sitemap tab. Review them, and search for some of them in Google to see if they are indexed. Note: Use advanced operators such as “site:” and “inurl:” to get a sense of how many URLs of a particular URL structure are indexed.
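If you prefer a script to spreadsheet formulas, the same comparison can be sketched in Python with set differences. This is an alternative to the VLOOKUP method, not part of the Screaming Frog workflow itself. The function names are hypothetical, and the sketch assumes you already have the list of indexable crawled URLs and the raw XML of a standard sitemap (not a sitemap index file).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> set[str]:
    """Extract every <loc> value from a standard XML sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    }


def compare_to_sitemap(crawled_indexable: set[str], in_sitemap: set[str]) -> dict:
    """Mirror both VLOOKUP directions with set differences."""
    return {
        # Indexable URLs found by the crawl but missing from the sitemap.
        "possible_bloat": sorted(crawled_indexable - in_sitemap),
        # Sitemap URLs the crawl never reached (possible orphans).
        "possible_orphans": sorted(in_sitemap - crawled_indexable),
    }


sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/services/</loc></url>
</urlset>
"""
crawled = {"https://example.com/", "https://example.com/tag/seo/"}
print(compare_to_sitemap(crawled, sitemap_urls(sitemap_xml)))
```

The URLs under "possible_bloat" are only candidates: you still need to confirm in Google whether they are actually indexed.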

Google Queries

Expanding on the last step above, here’s how you can use advanced operators in Google to identify index bloat URLs.

The “site:” operator – If I wanted to get a general sense of my indexation in Google, a query such as site:https://www.kernmedia.com will give me search results dedicated to my own site’s indexation.
The “inurl:” operator – If I wanted to see what pages in my /services/ subfolder are indexed, I can enter a query such as site:https://www.kernmedia.com inurl:/services/.
The extended “site:” operator – If the URL structure of a particular page type of interest is close to the root, you can append it to the end of the “site:” operator as such: site:https://www.kernmedia.com/services/.
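When you have several URL patterns to check, it can help to generate the queries in bulk and paste them into Google one at a time. The helper below is a small sketch that only builds the query strings described above. It does not query Google, and the function name and the list of patterns are hypothetical.

```python
from typing import Optional


def site_query(site: str, inurl: Optional[str] = None, path: str = "") -> str:
    """Build a "site:" query, optionally extended with a path or "inurl:"."""
    query = f"site:{site}{path}"
    if inurl:
        query += f" inurl:{inurl}"
    return query


site = "https://www.kernmedia.com"
print(site_query(site))                      # general indexation
print(site_query(site, path="/services/"))   # extended "site:" operator
for pattern in ("/services/", "/tag/"):      # hypothetical patterns to check
    print(site_query(site, inurl=pattern))
```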

For WordPress sites, it’s common for tag pages (dynamically generated, with no unique content) to be indexed in Google. A query such as site:https://www.kernmedia.com inurl:/tag/ shows that my site does not have any tag pages indexed.

However, if you look at a site like TechCrunch, you’ll notice that they have roughly 86,000 /tag/ pages indexed at the time of this writing.

Tag pages do not appear to be in their XML sitemap (which I admit that I only spot-checked since it’s huge). They are not set to “noindex” and they’re also not disallowed in their robots.txt file. That’s an enormous amount of index bloat, and I suspect that they might drive a fair amount of traffic due to the high indexation volume. Note: keep in mind that this number from Google might not be fully accurate, but still useful.

How to Fix Indexation Bloat

Once you’ve identified the bloat URLs on your site that are indexed, don’t rush off to remove all of them from search engine indexes at once. If these URLs are driving notable organic search traffic and/or conversions, removing them from search engine indexes abruptly can cause a potentially hazardous decline in organic search traffic and revenue. You may need to create new pages targeting the keywords that the traffic-driving bloat URLs (tag pages, for example) are currently ranking for in search engines.

Review Organic Metrics in Google Analytics

It’s imperative to review the organic search traffic and conversions to these index bloat URLs before pruning them from search engine indexes. Here’s an example of a client’s site that had WordPress tag pages driving a notable amount of organic search traffic. We decided that the tag pages weren’t the best pages to rank, so we replaced those that drove the most traffic with blog posts and forfeited the organic search traffic from the rest (the long tail), which was minimal and which we were okay with. Notice how the site didn’t lose too much organic search traffic in the short term (compare the red boxes), and how the change allowed us to set a foundation for future growth. Note: I did not work with this client from May-December 2016, so the organic search traffic growth would have happened sooner had we not taken a break until January 2017.

(Planned) Organic Traffic Decline to Tag Pages

All Organic Traffic

Determine Partial Removals (if needed)

The example above shows that index bloat URLs can drive organic search traffic, which you need to be aware of. They were targeting low competition keyword phrases so I suspect that Google was ranking them out of necessity in order to provide relevant search results. If you run into a similar situation, you can choose to either improve the content of these low-quality pages, prune them (remove from search engine indexes), partially prune them (only those driving low organic search traffic), or replace them (in full, or partially) with higher quality pages.
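One way to sketch the partial-pruning decision is to split the bloat URLs by their organic metrics. The thresholds below (50 sessions, 1 conversion) and the PageMetrics structure are hypothetical; pick cutoffs that reflect what your own site can afford to lose, using an export from your analytics tool.

```python
from dataclasses import dataclass


@dataclass
class PageMetrics:
    url: str
    organic_sessions: int
    conversions: int


def split_for_partial_pruning(pages, min_sessions=50, min_conversions=1):
    """Separate URLs worth improving or replacing from those safe to prune."""
    improve_or_replace, prune = [], []
    for page in pages:
        if (page.organic_sessions >= min_sessions
                or page.conversions >= min_conversions):
            improve_or_replace.append(page.url)
        else:
            prune.append(page.url)
    return improve_or_replace, prune


# Hypothetical metrics for three tag pages.
pages = [
    PageMetrics("https://example.com/tag/seo/", 420, 3),
    PageMetrics("https://example.com/tag/misc/", 4, 0),
    PageMetrics("https://example.com/tag/news/", 12, 1),
]
keep, prune = split_for_partial_pruning(pages)
print("Improve or replace:", keep)
print("Prune:", prune)
```

The split is only a starting list. Each URL in the first group still needs a decision between improving it and replacing it with a higher-quality page.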

In these situations, ensure that you review the keywords that are driving traffic to the pages (in Google Search Console’s “Search Analytics” report) and use them in the new or existing pages being redirected to (that will replace the bloat URLs). This will help to ensure that Google still sees relevancy for the keyword, and will help to maintain or improve the ranking.

However, sometimes the problem is much too vast to take this “scalpel” approach, and a “hatchet” approach must be taken after due diligence has been performed to determine that there are appropriate replacement pages that can rank in place of the pages being removed.

Below is a graph from an eCommerce client (whom I helped Inflow with) whose bloat URLs drove over 2M sessions and over $3M in revenue over an 18-month period. Specifically, they were parameter URLs. There were 14 different parameter types that drove rather consistent organic search traffic and revenue through the entire time period, so de-indexation approaches needed to be implemented very carefully.

Organic Search Traffic

Organic Search Revenue

As you can see, attempting to remove the index bloat (parameter URLs) for this particular site too quickly could cause an unwanted decline in organic search traffic and revenue. In cases like this, I recommend determining if there are higher-quality pages that can rank in their place, and removing the parameter types (from Google’s index) that drive less organic search traffic and revenue than the others as a sample test to measure impact. In this particular case, that would include removing the parameter URLs from the XML sitemap and canonicalizing them to their core URLs.
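To choose which parameter types to test first, you can total the organic sessions and revenue per parameter and start with the lowest. The sketch below assumes a hypothetical analytics export of (URL, sessions, revenue) rows; the parameter names and figures are made up for illustration. It also shows one simple way to derive the core URL that a parameter URL would be canonicalized to, assuming the core URL is the same path with the whole query string removed.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlsplit, urlunsplit


def core_url(url: str) -> str:
    """Strip the query string and fragment to get the canonical target."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def rank_parameters(rows):
    """Total sessions and revenue per parameter name, lowest revenue first."""
    totals = defaultdict(lambda: {"sessions": 0, "revenue": 0.0})
    for url, sessions, revenue in rows:
        for name in parse_qs(urlsplit(url).query, keep_blank_values=True):
            totals[name]["sessions"] += sessions
            totals[name]["revenue"] += revenue
    return sorted(
        totals.items(),
        key=lambda item: (item[1]["revenue"], item[1]["sessions"]),
    )


# Hypothetical export rows: (landing page URL, organic sessions, revenue).
rows = [
    ("https://example.com/shoes?color=red", 900, 1200.0),
    ("https://example.com/shoes?sort=price", 40, 0.0),
    ("https://example.com/boots?color=blue", 300, 500.0),
]
for name, stats in rank_parameters(rows):
    print(name, stats)
print(core_url(rows[0][0]))  # https://example.com/shoes
```

The parameters at the top of the ranked list are the lowest-risk candidates for the first sample test.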

Determine the Best Approach to Remove from Google & Other Search Engines

There are a number of ways that you can remove index bloat URLs from Google’s index and other search engines.

Meta Robots Tags – You can set pages to “noindex,follow” or “noindex,nofollow” with a meta robots tag in the <head>, depending on the situation. Typically, the “noindex,follow” meta robots tag should be applied to page types such as internal search results, tag pages, and other page types where you want to prevent search engines from indexing the page, but still flow internal link equity to the links on the page should they find and crawl it (although likely not permanently, at least with Google). The “noindex,nofollow” tag can be reserved for all pages on a development site, sponsored pages (advertorials) that you’re receiving compensation for, and other page types where you want to limit equity to the links on the page, and prevent the page from being indexed.
Robots.txt Disallow Rules – You can completely disallow search engines from crawling specific pages or page types (by using wildcards) with “disallow” rules in your robots.txt file. Keep in mind, however, that currently indexed URLs will remain indexed if you disallow them from crawling in the robots.txt file. This will essentially suppress them in Google’s search results, for example, but they’ll still remain indexed. It’s typically best to implement robots.txt disallow rules after pages have been fully removed from search engine indexes in order to prevent them from being crawled in the future. This can help improve the efficiency of a search bot’s crawl budget for your site.
Canonicalization – You can canonicalize tracking URLs and other URL types (that are duplicative in nature, such as filter/facet URLs) to the core URL in order to provide a “hint” to Google and other search engines that the URL should not be indexed.
Delete & Redirect – If a page is not valuable (low-quality content, low organic search metrics), then you can choose to delete and redirect it to a topically-relevant URL.
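Whichever approach you choose, it is worth verifying that the signal is actually present in the page source. The sketch below uses Python's built-in HTML parser to read the meta robots tag and the canonical link from a page's HTML. The class and function names are hypothetical, the sample HTML is made up, and the check does not cover signals sent another way (such as an X-Robots-Tag HTTP header).

```python
from html.parser import HTMLParser


class IndexSignalParser(HTMLParser):
    """Collect the meta robots content and canonical href from a page."""

    def __init__(self):
        super().__init__()
        self.meta_robots = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.meta_robots = content.lower().replace(" ", "")
        elif tag == "link" and "canonical" in (attrs.get("rel") or "").lower().split():
            self.canonical = attrs.get("href")


def index_signals(html: str, page_url: str) -> dict:
    """Summarize the non-indexation signals found in a page's HTML."""
    parser = IndexSignalParser()
    parser.feed(html)
    directives = (parser.meta_robots or "").split(",")
    return {
        "noindex": "noindex" in directives,
        "canonical": parser.canonical,
        "canonicalized_elsewhere": parser.canonical not in (None, page_url),
    }


# Hypothetical tracking URL that canonicalizes to its core URL.
sample_html = """
<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/shoes">
</head><body></body></html>
"""
print(index_signals(sample_html, "https://example.com/shoes?utm_source=news"))
```

Keep in mind that a crawler can only see these tags if the URL is not disallowed in robots.txt, which is why the disallow rule is best added after the pages have dropped out of the index.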

Where Applicable, Put Together a Timeline & Do Sample Tests

If you have any types of bloat URLs that are driving notable organic search traffic and/or conversions, you’ll want to ensure that you carefully measure potential impact as described above. Look for other URLs on your site that should be ranking in their place (that are not currently ranking, and whether they are likely to) and determine if their content quality and internal link signals are likely to influence search engines to rank them in place of the lower-quality URLs that you wish to remove.

If you’re concerned about the impact on organic search metrics, put together a series of low-impact tests and also a timeline for implementation and measurement. Sometimes you have to take a risk; however, you can choose to wait for signs of declining organic metrics for these bloat URLs before taking such a risk. Every situation is different and it’s typically a judgment call.

If I can help with any indexation bloat problems that you’re having, please contact me or leave a comment below.
