If Googlebot wastes time on filters, thin variants, and JS payloads, new SKUs sit unindexed.
This guide shows how to measure crawl waste from logs, fix it with clear rules, and keep bots focused on pages that sell.
What crawl budget means for large stores
Crawl rate is how often Googlebot requests URLs from your host. Crawl demand is how much Google wants to recrawl based on importance and staleness.
You win when rate × demand is spent on pages that you want indexed or refreshed.
Common warning signs
- Category pages change but stay stale in cache.
- Logs show high bot activity on filters and sort orders.
- New product detail pages (PDPs) take days to appear.
- 5xx bursts reduce Googlebot hits for hours.
- Orphaned categories or PDPs see near-zero bot visits.
Measure first - a log-driven crawl audit
You can’t optimise what you don’t measure. Start with 30–90 days of raw logs.
Data you need
- User-agent: separate Googlebot, Googlebot-Image, Bingbot, others.
- URL path, query string, status, response time, bytes, timestamp.
- Response type: HTML vs asset (JS/CSS/image) via path or content-type.
- Template tagging: PDP, category/PLP, brand, blog, search, filters, misc.
Key KPIs to compute
- % of bot hits on indexable templates (PLP/PDP) vs parameters, search, and assets.
- Time-to-first-index (TTFI) for new SKUs: days from first crawl to first appearance in the index.
- JS/HTML request ratio per template.
- Error share per bot: 5xx bursts, 4xx, and redirect chains.
- Top URL patterns ranked by wasted bot hits.
Seven-step process (repeat monthly)
- Collect logs from CDN + origin and normalise into a single table.
- Bot segmentation via UA patterns and reverse DNS for Googlebot if needed.
- Template mapping with deterministic rules or regex (e.g., /p/ for PDPs, /c/ for categories); see the sketch after this list.
- Parameter bucketing for ?color=, ?size=, ?sort=, ?page= and session IDs.
- Waste map: group hits by pattern; flag top 10 URL patterns by wasted bot time.
- Freshness map: track how long new PDPs take to reach first index; compare to crawl hits.
- Opportunity delta: estimate gained crawl if top three waste patterns are fixed.
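A minimal sketch of template-mapping and parameter-bucketing rules, assuming the /p/ and /c/ path conventions above; the patterns and tags are illustrative, so adapt them to your own URL scheme:

# Hypothetical regex → template tag for the normalised log table
^/p/[a-z0-9-]+               → PDP
^/c/[a-z0-9-]+               → category/PLP
^/brand/                     → brand
[?&](sort|view|sessionid)=   → parameter noise
[?&](color|size)=            → facet
(anything else)              → misc — review weekly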
Kill waste by URL patterns (filters, sort, pagination)
Faceted navigation creates the fastest crawl leak. Treat each facet type with a policy.
A simple decision tree
- Does the facet change product set in a commercially meaningful way?
- Yes, high value (e.g., “women’s running shoes” → “trail running”): give a clean, indexable path and link to it from PLPs.
- Maybe, mid value (e.g., “size 8”): keep as a parameter; noindex it and block crawling (example tag after this list).
- No, low value (e.g., ephemeral sort orders): keep as parameter, disallow.
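For the mid-value bucket, one head tag keeps the URL out of the index while users keep the filter; a minimal example (the Shopify section later shows how to emit it conditionally):

<meta name="robots" content="noindex,follow">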
Canonical + robots combos
robots.txt (keep it tight - don’t block canonical pages):
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?view=
# Facet parameters kept crawl-closed but still usable by users:
Disallow: /*?color=
Disallow: /*?size=
PDP canonical example (variant consolidation):
<link rel="canonical" href="https://www.example.com/p/ultra-trail-shoe" />PLP with allowed facet as a static path:
<link rel="canonical" href="https://www.example.com/womens/running/trail/" />Parameter handling rules (examples)
Pagination notes
- Use consistent URLs like /category/?page=2.
- Keep self-canonicals on page 2+ unless you consolidate to page 1 with a strong reason.
- Provide crawlable “next” links in HTML; don’t rely on JS-only infinite scroll.
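A minimal sketch of a crawlable pager, server-rendered so bots see the links without executing JS (the URLs assume the /category/?page=N pattern above):

<!-- classic pager rendered in raw HTML -->
<nav aria-label="pagination">
  <a href="/category/?page=1">1</a>
  <a href="/category/?page=2">2</a>
  <a href="/category/?page=3">3</a>
  <a href="/category/?page=2" rel="next">Next</a>
</nav>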
XML sitemaps that steer crawlers
Sitemaps won’t force crawling, but they help bots prioritise.
Structure that works for retail
- Split by type: /sitemap-categories.xml, /sitemap-products.xml, /sitemap-content.xml.
- Keep files small: 50k URLs is the hard limit, but a few thousand per file makes change tracking far easier.
- Rotate deltas: publish /sitemap-products-new.xml for new or back-in-stock items.
Example entries with freshness hints
<url>
<loc>https://www.example.com/p/ultra-trail-shoe</loc>
<lastmod>2025-10-29</lastmod>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
Ops tip: sync lastmod with real content changes, not deployment timestamps.
If you use ETag/Last-Modified headers on HTML, keep them consistent with sitemap lastmod.
JS rendering without burning bot time (CSR vs SSR vs prerender)
JavaScript can stall discovery if a bot has to render before it sees links or content.
Pick per template, then measure
- PLPs: SSR, so category and product links sit in the raw HTML.
- PDPs: CSR can work if the critical content and links are server-rendered first.
- Promo and editorial pages: SSR or prerender; watch CPU under load (see the caution below).
What to track after a switch
- HTML discovery rate: % of links visible in raw HTML.
- Render-time buckets: TTFB + server render; track <200ms, 200–500ms, 500ms–1s, >1s.
- Index latency: days from first crawl to first index for new PLPs/PDPs.
Caution: partial prerender with heavy hydration can spike CPU and 5xx under load.
Gate rollouts and keep a rollback route in your CDN or router.
Speed and stability levers that lift crawl
Googlebot adapts to server health. Fast, consistent responses invite more crawling.
Good cache validators
Cache-Control: max-age=0, must-revalidate
ETag: "pdp-ultra-trail-shoe-v3"
Last-Modified: Wed, 29 Oct 2025 16:04:00 GMT
Useful patterns
- Keep TTFB stable on HTML; aim for a big share <500ms.
- Reduce 5xx by isolating search and cart APIs from catalog HTML.
- Serve lightweight HTML for PLP/PDP with meaningful text and links before hydration.
- Avoid long 301 chains; fix old paths in one hop.
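To see the validators above at work: a revalidation round trip lets a bot confirm a page is unchanged for the cost of headers rather than a full body. A sketch of the exchange:

GET /p/ultra-trail-shoe HTTP/1.1
Host: www.example.com
If-None-Match: "pdp-ultra-trail-shoe-v3"

HTTP/1.1 304 Not Modified
ETag: "pdp-ultra-trail-shoe-v3"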
Incident mini-playbook
- Detect 5xx spike from Googlebot.
- Temporarily lower crawl rate via server hints (brief 503/429 responses; see the sketch after this list) only if capacity is at risk.
- Freeze non-essential deployments.
- Move bot traffic for HTML through a less congested cluster if possible.
- Post-mortem: tie errors to time of day, build, or promotions; add guardrails.
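The “server hints” above are HTTP status codes, not a dashboard setting: short-lived 503 or 429 responses tell Googlebot to back off. A sketch — keep it brief, since a 503 that persists for days can get URLs dropped:

HTTP/1.1 503 Service Unavailable
Retry-After: 3600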
Internal linking that concentrates bot time
Bots follow links. Give them obvious, stable paths to money pages.
- Keep link depth shallow: home → category → PDP in ≤3 clicks where possible.
- Add “Trending” or “Seasonal” modules with static HTML links on the homepage.
- Ensure breadcrumbs reflect canonical paths and render in HTML.
BreadcrumbList JSON-LD
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"name": "Women",
"item": "https://www.example.com/women/"
},
{
"@type": "ListItem",
"position": 2,
"name": "Running",
"item": "https://www.example.com/women/running/"
},
{
"@type": "ListItem",
"position": 3,
"name": "Trail",
"item": "https://www.example.com/women/running/trail/"
}
]
}
Orphan checks
- Export all canonical URLs and compare with internal link graph.
- Pages with zero inlinks should not be canonical unless intentional.
Peak-season runbook (Black Friday, major drops, sales)
Plan for traffic spikes and fast-changing assortments.
Timeline
- Two weeks out: publish and internally link promo pages; keep URLs stable.
- One week out: freeze new facets, templates, and experimental apps.
- Live day: watch crawl stats and 5xx hourly; tighten robots rules if bots pile onto search or sort URLs.
- Post-event: retire promo URLs with one-hop 301s and restore normal rules.
Bot-shaping ideas
- Prioritise HTML on category and promo pages at the CDN; keep asset caching aggressive.
- Throttle endpoints like /search and /tracking for bots if needed.
- Freshness signals: update lastmod for deltas only; don’t bump everything.
From raw logs to a repeatable audit - worked example
Imagine a retailer with 35% of Googlebot hits landing on filter parameters and 12% on ?sort= pages.
- Disallow ?sort= and ?sessionid= in robots.txt.
- Keep ?color= and ?size= as noindex via meta on PLP templates; add self-canonical.
- Convert “Trail Running” and “Wide Fit” to curated, static PLPs with clean paths and internal links.
- Split sitemaps; push a products-new sitemap for the latest 5k SKUs.
- Move PLPs to SSR; keep PDPs on CSR with server-rendered critical content.
After 28 days in this scenario, the KPIs move: % of crawl on indexable templates jumps from 41% → 68%, time-to-first-index (TTFI) drops from 7 days → 3 days, and the JS/HTML ratio on PLPs falls by half.
Implementation quick reference
robots.txt starter
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?view=
Disallow: /*?color=
Disallow: /*?size=
Canonical on PLP
<link rel="canonical" href="https://www.example.com/womens/running/" />PDP cache validators
Cache-Control: max-age=0, must-revalidate
ETag: "pdp-<sku>-v4"
Last-Modified: Wed, 29 Oct 2025 16:04:00 GMT
Sitemap split (example)
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-categories.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-products.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-products-new.xml</loc>
</sitemap>
</sitemapindex>
Post-publish monitoring (30/60/90 days)
- 30 days: validate log KPIs and confirm a higher share of Googlebot on PLPs/PDPs.
- 60 days: compare time-to-first-index for new SKUs; adjust rendering or sitemap deltas.
- 90 days: review error rates by bot, expand static facet paths that drove revenue, and close any newly noisy parameters.
If the KPIs stall, revisit the waste map and try a different lever: a stricter facet policy, a faster SSR path for PLPs, or stronger internal links to seasonal hubs.
Shopify Crawl Budget Optimisation
Shopify has its own shapes for URLs, sitemaps, and rendering.
These sections give you platform-specific ways to cut crawl waste and speed up discovery.
Shopify URL patterns to watch
Collection pages live at /collections/{handle} and often carry tags, filter params, ?page=, and ?sort_by=.
Keep high-value facets as curated collections, and keep the rest as parameters with noindex + crawl blocks.
Focus areas
- /collections/{handle}?sort_by= and /collections/{handle}?page=
- Filter params from Search & Discovery (e.g., filter.v.option.size=)
- Alternate views like ?view=grid or ?view=ajax
- Search results: /search and /search?type=product&q=
robots.txt.liquid on Shopify
Shopify lets you customise robots at templates/robots.txt.liquid.
Close noisy params and system paths while keeping canonical pages open.
# templates/robots.txt.liquid
User-agent: *
# crawl noise
Disallow: /*?sort_by=
Disallow: /*?view=
Disallow: /*?q=
Disallow: /search
# keep filter params crawl-closed if you don't want them discovered
Disallow: /*?*filter.
# tracking and session-like junk
Disallow: /*?*utm_
Disallow: /*?*gclid=
Sitemap: https://{{ shop.primary_domain }}/sitemap.xml
Keep collection and product base URLs crawlable.
Test changes in a staging theme before publishing.
Canonicals in Liquid (collection and product templates)
Make sure canonicals ignore query strings and point to the clean URL.
Place this in theme.liquid or each template near <head>.
{%- capture canonical -%}{{ canonical_url | split:'?' | first }}{%- endcapture -%}
<link rel="canonical" href="{{ canonical }}" />For collection pages with tags or filter params, keep the canonical on the base collection unless you’ve created a curated, static collection for that facet.
Avoid canonicals that bounce between param states.
Conditional meta robots for filter states
If you keep parameterised filters for UX, prevent them entering the index.
Use a simple guard on query strings.
{%- assign q = request.query_string | downcase -%}
{%- if q contains 'filter.' or q contains 'sort_by=' or q contains 'view=' -%}
<meta name="robots" content="noindex,follow">
{%- endif -%}
This still lets bots crawl links on the page while keeping the URL out of the index.
Pair with the robots.txt blocks to cut repeated crawling.
Shopify sitemaps: what you can and can’t do
Shopify auto-generates sitemap.xml and child sitemaps for products, collections, pages, and posts.
You can’t hand-edit them or ship true “delta” sitemaps.
Workable tactics
- Build a “New arrivals” or “Back in stock” collection and link it in the main nav so new SKUs get internal links fast.
- Submit important collection and promo URLs in Search Console for quicker discovery.
- Keep product publishing tidy: push only truly live, stocked items to avoid churn in sitemaps.
Internal linking on Online Store 2.0
Add static, crawlable links to priority collections from the homepage and top categories.
Use Collection list and Featured collection sections instead of JS-only carousels.
Quick wins
- Make a Seasonal block on the homepage that links to 6–12 top collections.
- Add related collections at the bottom of collection templates using lists, not dropdowns.
- Use metafields on collections to surface two or three strategic internal links in HTML.
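A minimal Liquid sketch of that metafield idea, assuming a hypothetical custom.related_links metafield of type “list of collection references” on the collection:

{%- comment -%} custom.related_links is an assumed metafield; define it under Settings → Custom data {%- endcomment -%}
{% for related in collection.metafields.custom.related_links.value %}
  <a href="{{ related.url }}">{{ related.title }}</a>
{% endfor %}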
Pagination and infinite scroll
Shopify uses ?page= for collection pagination.
Expose numbered links in HTML and keep self-canonical on page 2+ unless you have a solid consolidation plan.
Avoid infinite scroll that hides links behind JS.
If you need it, also render a classic pager so bots can traverse pages.
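A sketch of a classic pager in a collection template using Liquid’s paginate tag (the page size of 24 is arbitrary):

{% paginate collection.products by 24 %}
  {%- comment -%} product grid renders here {%- endcomment -%}
  {% if paginate.pages > 1 %}
    <nav aria-label="pagination">
      {% for part in paginate.parts %}
        {% if part.is_link %}
          <a href="{{ part.url }}">{{ part.title }}</a>
        {% else %}
          <span>{{ part.title }}</span>
        {% endif %}
      {% endfor %}
    </nav>
  {% endif %}
{% endpaginate %}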
Search results and filter archives
Disallow /search in robots and keep it noindex.
Never link to raw search queries from navigation.
For filter archives, either:
- Create a curated, static collection when revenue warrants it, or
- Keep it as a param with noindex + crawl block and rely on the base collection.
App hygiene (JS budget + crawl)
Most apps ship scripts, DOM changes, or extra endpoints.
Audit quarterly and remove anything that doesn’t move revenue or discovery.
Tidy-up list
- Turn off unused App embeds in the theme editor.
- Defer or load analytics and widgets after critical HTML.
- Prefer apps that render server-side blocks over client-injected markup.
Theme performance that influences crawl
Liquid renders server-side, then storefront JS hydrates sections.
Large sections, heavy loops, and big images slow TTFB and increase crawl cost.
Keep it lean
- Flatten nested loops and paginate large product lists.
- Compress images and serve modern formats with Shopify’s CDN sizing params (_800x etc.; see the sketch after this list).
- Inline only the CSS that’s truly critical; ship the rest as a small, cacheable file.
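A sketch of CDN-sized images in Liquid using the image_url filter; the widths are illustrative:

<img
  src="{{ product.featured_image | image_url: width: 800 }}"
  srcset="{{ product.featured_image | image_url: width: 400 }} 400w,
          {{ product.featured_image | image_url: width: 800 }} 800w"
  sizes="(max-width: 600px) 400px, 800px"
  loading="lazy"
  alt="{{ product.featured_image.alt | escape }}">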
Breadcrumbs and structured data
Output a BreadcrumbList per collection/product so bots understand the path.
Keep names and URLs aligned with canonicals.
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{ "@type": "ListItem", "position": 1, "name": "Women", "item": "https://{{ shop.primary_domain }}/collections/women" },
{ "@type": "ListItem", "position": 2, "name": "Running", "item": "https://{{ shop.primary_domain }}/collections/women-running" },
{ "@type": "ListItem", "position": 3, "name": "Trail", "item": "https://{{ shop.primary_domain }}/collections/women-running-trail" }
]
}
</script>
This pairs with your internal link plan and keeps paths consistent.
Avoid schema that references parameterised URLs.
Shopify Search & Discovery filters
The official app powers faceted filtering and builds param URLs.
Decide which facets deserve curated collections and which stay as params.
Practical split
- Indexable as static collections: core category + a small set of high-demand facets.
- Param-only: size, colour, price sliders, rating, availability.
Document the policy so merch and dev teams don’t create accidental index traps.
Keep any param-only facet links out of navigation.
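A sketch of what that documented policy can look like; the facet names are examples, so map them to your own catalogue:

# Facet policy — reviewed quarterly by merch + dev
activity (running → trail, road):  static collection, indexable, linked in nav
brand:                             static collection, indexable, linked in nav
size / colour / price / rating:    param-only, noindex, robots-blocked, never in nav
availability:                      param-only, noindex, robots-blocked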
Logs and measurement on Shopify
You don’t get raw server logs on the platform.
Use Search Console → Crawl Stats, URL Inspection API, and your analytics to infer crawl patterns.
What to track
- % of Googlebot hits landing on collections/products vs everything else (from Crawl Stats categories).
- Discovery time for new collections and SKUs.
- Parameter creep in top crawled URLs (sample with log-like reports from Crawl Stats and on-site link exports).
If you front the store with a proxy like Cloudflare on Plus, you can sample bot traffic there.
Keep origin visibility enabled so Shopify still caches and serves cleanly.
Shopify rendering choices (native vs headless)
Native themes are server-rendered with Liquid and usually good for discovery.
Headless (Hydrogen + Oxygen) brings SSR control and edge caching but shifts sitemap and linking duties to you.
If you go headless
- Generate and submit your own sitemaps.
- Pre-render category and PDP routes, and make sure internal links exist in HTML.
- Keep filter states as params with noindex unless you’re deliberately creating static, indexable routes.
Shopify-specific peak plan
- Create and link promo collections two weeks out; keep URLs stable.
- Freeze new filters and experimental apps a week out.
- On live day, check Crawl Stats hourly and watch for spikes to /search and ?sort_by=, then tighten rules if needed.
Shopify quick checklist
- robots.txt.liquid closes /search, ?sort_by=, ?view=, and the ?filter. params you don’t want crawled
- Canonical strips params site-wide; curated facet pages have clean paths
- Collection pagination exposes numbered links in HTML
- Homepage links to 6–12 revenue collections in static HTML
- Search & Discovery facets mapped: which are static collections vs param-only
- App embeds audited; unused scripts disabled
- Breadcrumb JSON-LD on collections and products
- GSC Crawl Stats monitored; new SKUs show faster index times week over week
Key Takeaways
- Measure first: segment server logs by bot, template, parameters, status, and HTML vs JS requests.
- Raise the % of Googlebot hits on indexable templates and cut waste by URL pattern.
- Shape crawl with clean XML sitemaps, cache validators (ETag, Last-Modified), and stable response times.
- Pick the right rendering model per template; validate with index latency and render-time buckets.
- Ship a peak-season runbook so promotions get discovered fast without melting servers.
Frequently Asked Questions
Do canonicals alone fix filter duplication?
No. Canonicals suggest a preferred URL, but Google may still crawl the duplicates if they’re open; close low-value patterns with robots rules and avoid linking to them.
Are noindex pages a waste of crawl?
They still get crawled. Use noindex for short-term control, but pair it with blocked crawling for noisy parameters that don’t need discovery.
Does prerender always reduce crawl cost?
Not always. If prerender inflates CPU under load and causes 5xx, crawl rate can drop; validate with render-time buckets and index latency before committing.