If Googlebot wastes time on filters, thin variants, and JS payloads, new SKUs sit unindexed.
This guide shows how to measure crawl waste from logs, fix it with clear rules, and keep bots focused on pages that sell.
What crawl budget means for large stores
Crawl rate is how often Googlebot requests URLs from your host. Crawl demand is how much Google wants to recrawl based on importance and staleness.
You win when rate × demand is spent on pages that you want indexed or refreshed.
Common warning signs
- Category pages change but stay stale in cache.
- Logs show high bot activity on filters and sort orders.
- New product detail pages (PDPs) take days to appear.
- 5xx bursts reduce Googlebot hits for hours.
- Orphaned categories or PDPs see near-zero bot visits.
Measure first - a log-driven crawl audit
You can’t optimise what you don’t measure. Start with 30–90 days of raw logs.
Data you need
- User-agent: separate Googlebot, Googlebot-Image, Bingbot, others.
- URL path, query string, status, response time, bytes, timestamp.
- Response type: HTML vs asset (JS/CSS/image) via path or content-type.
- Template tagging: PDP, category/PLP, brand, blog, search, filters, misc.
Key KPIs to compute
- % of bot hits on indexable templates (PLP/PDP) vs parameters, search, and assets.
- Time-to-first-index (TTFI) for new SKUs: days from first crawl to first appearance in the index.
- JS/HTML request ratio per template.
- Error share per bot: 5xx bursts, 4xx, and redirect chains.
- Top URL patterns ranked by wasted bot hits.
Seven-step process (repeat monthly)
- Collect logs from CDN + origin and normalise into a single table.
- Bot segmentation via UA patterns and reverse DNS for Googlebot if needed.
- Template mapping with deterministic rules or regex (e.g., /p/ for PDPs, /c/ for categories); see the sketch after this list.
- Parameter bucketing for ?color=, ?size=, ?sort=, ?page= and session IDs.
- Waste map: group hits by pattern; flag top 10 URL patterns by wasted bot time.
- Freshness map: track how long new PDPs take to reach first index; compare to crawl hits.
- Opportunity delta: estimate gained crawl if top three waste patterns are fixed.
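A minimal sketch of template-mapping and parameter-bucketing rules, assuming the /p/ and /c/ path conventions above; the patterns and tags are illustrative, so adapt them to your own URL scheme:

# Hypothetical regex → template tag for the normalised log table
^/p/[a-z0-9-]+               → PDP
^/c/[a-z0-9-]+               → category/PLP
^/brand/                     → brand
[?&](sort|view|sessionid)=   → parameter noise
[?&](color|size)=            → facet
(anything else)              → misc — review weekly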
Kill waste by URL patterns (filters, sort, pagination)
Faceted navigation creates the fastest crawl leak. Treat each facet type with a policy.
A simple decision tree
- Does the facet change product set in a commercially meaningful way?
- Yes, high value (e.g., “women’s running shoes” → “trail running”): give a clean, indexable path and link to it from PLPs.
- Maybe, mid value (e.g., “size 8”): keep as a parameter; noindex it and block crawling (example tag after this list).
- No, low value (e.g., ephemeral sort orders): keep as parameter, disallow.
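For the mid-value bucket, one head tag keeps the URL out of the index while users keep the filter; a minimal example (the Shopify section later shows how to emit it conditionally):

<meta name="robots" content="noindex,follow">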
Canonical + robots combos
robots.txt (keep it tight - don’t block canonical pages):
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?view=
# Facet parameters kept crawl-closed but still usable by users:
Disallow: /*?color=
Disallow: /*?size=
PDP canonical example (variant consolidation):
<link rel="canonical" href="https://www.example.com/p/ultra-trail-shoe" />PLP with allowed facet as a static path:
<link rel="canonical" href="https://www.example.com/womens/running/trail/" />Parameter handling rules (examples)
Pagination notes
- Use consistent URLs like /category/?page=2.
- Keep self-canonicals on page 2+ unless you consolidate to page 1 with a strong reason.
- Provide crawlable “next” links in HTML; don’t rely on JS-only infinite scroll.
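A minimal sketch of a crawlable pager, server-rendered so bots see the links without executing JS (the URLs assume the /category/?page=N pattern above):

<!-- classic pager rendered in raw HTML -->
<nav aria-label="pagination">
  <a href="/category/?page=1">1</a>
  <a href="/category/?page=2">2</a>
  <a href="/category/?page=3">3</a>
  <a href="/category/?page=2" rel="next">Next</a>
</nav>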
XML sitemaps that steer crawlers
Sitemaps won’t force crawling, but they help bots prioritise.
Structure that works for retail
- Split by type: /sitemap-categories.xml, /sitemap-products.xml, /sitemap-content.xml.
- Keep files small: 50k URLs is the hard limit, but a few thousand per file makes change tracking far easier.
- Rotate deltas: publish /sitemap-products-new.xml for new or back-in-stock items.
Example entries with freshness hints
<url>
<loc>https://www.example.com/p/ultra-trail-shoe</loc>
<lastmod>2025-10-29</lastmod>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
Ops tip: sync lastmod with real content changes, not deployment timestamps.
If you use ETag/Last-Modified headers on HTML, keep them consistent with sitemap lastmod.
JS rendering without burning bot time (CSR vs SSR vs prerender)
JavaScript can stall discovery if a bot has to render before it sees links or content.
Pick per template, then measure
- PLPs: SSR, so category and product links sit in the raw HTML.
- PDPs: CSR can work if the critical content and links are server-rendered first.
- Promo and editorial pages: SSR or prerender; watch CPU under load (see the caution below).
What to track after a switch
- HTML discovery rate: % of links visible in raw HTML.
- Render-time buckets: TTFB + server render; track <200ms, 200–500ms, 500ms–1s, >1s.
- Index latency: days from first crawl to first index for new PLPs/PDPs.
Caution: partial prerender with heavy hydration can spike CPU and 5xx under load.
Gate rollouts and keep a rollback route in your CDN or router.
Speed and stability levers that lift crawl
Googlebot adapts to server health. Fast, consistent responses invite more crawling.
Good cache validators
Cache-Control: max-age=0, must-revalidate
ETag: "pdp-ultra-trail-shoe-v3"
Last-Modified: Wed, 29 Oct 2025 16:04:00 GMT
Useful patterns
- Keep TTFB stable on HTML; aim for a big share <500ms.
- Reduce 5xx by isolating search and cart APIs from catalog HTML.
- Serve lightweight HTML for PLP/PDP with meaningful text and links before hydration.
- Avoid long 301 chains; fix old paths in one hop.
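To see the validators above at work: a revalidation round trip lets a bot confirm a page is unchanged for the cost of headers rather than a full body. A sketch of the exchange:

GET /p/ultra-trail-shoe HTTP/1.1
Host: www.example.com
If-None-Match: "pdp-ultra-trail-shoe-v3"

HTTP/1.1 304 Not Modified
ETag: "pdp-ultra-trail-shoe-v3"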
Incident mini-playbook
- Detect 5xx spike from Googlebot.
- Temporarily lower crawl rate via server hints (brief 503/429 responses; see the sketch after this list) only if capacity is at risk.
- Freeze non-essential deployments.
- Move bot traffic for HTML through a less congested cluster if possible.
- Post-mortem: tie errors to time of day, build, or promotions; add guardrails.
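The “server hints” above are HTTP status codes, not a dashboard setting: short-lived 503 or 429 responses tell Googlebot to back off. A sketch — keep it brief, since a 503 that persists for days can get URLs dropped:

HTTP/1.1 503 Service Unavailable
Retry-After: 3600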
Internal linking that concentrates bot time
Bots follow links. Give them obvious, stable paths to money pages.
- Keep link depth shallow: home → category → PDP in ≤3 clicks where possible.
- Add “Trending” or “Seasonal” modules with static HTML links on the homepage.
- Ensure breadcrumbs reflect canonical paths and render in HTML.
BreadcrumbList JSON-LD
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{
"@type": "ListItem",
"position": 1,
"name": "Women",
"item": "https://www.example.com/women/"
},
{
"@type": "ListItem",
"position": 2,
"name": "Running",
"item": "https://www.example.com/women/running/"
},
{
"@type": "ListItem",
"position": 3,
"name": "Trail",
"item": "https://www.example.com/women/running/trail/"
}
]
}
Orphan checks
- Export all canonical URLs and compare with internal link graph.
- Pages with zero inlinks should not be canonical unless intentional.
Peak-season runbook (Black Friday, major drops, sales)
Plan for traffic spikes and fast-changing assortments.
Timeline
- Two weeks out: publish and internally link promo pages; keep URLs stable.
- One week out: freeze new facets, templates, and experimental apps.
- Live day: watch crawl stats and 5xx hourly; tighten robots rules if bots pile onto search or sort URLs.
- Post-event: retire promo URLs with one-hop 301s and restore normal rules.
Bot-shaping ideas
- Prioritise HTML on category and promo pages at the CDN; keep asset caching aggressive.
- Throttle endpoints like /search and /tracking for bots if needed.
- Freshness signals: update lastmod for deltas only; don’t bump everything.
From raw logs to a repeatable audit - worked example
Imagine a retailer with 35% of Googlebot hits landing on filter parameters and 12% on ?sort= pages.
- Disallow ?sort= and ?sessionid= in robots.txt.
- Keep ?color= and ?size= as noindex via meta on PLP templates; add self-canonical.
- Convert “Trail Running” and “Wide Fit” to curated, static PLPs with clean paths and internal links.
- Split sitemaps; push a products-new sitemap for the latest 5k SKUs.
- Move PLPs to SSR; keep PDPs on CSR with server-rendered critical content.
After 28 days in this scenario, the KPIs move: % of crawl on indexable templates jumps from 41% → 68%, time-to-first-index (TTFI) drops from 7 days → 3 days, and the JS/HTML ratio on PLPs falls by half.
Implementation quick reference
robots.txt starter
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Disallow: /*?view=
Disallow: /*?color=
Disallow: /*?size=
Canonical on PLP
<link rel="canonical" href="https://www.example.com/womens/running/" />PDP cache validators
Cache-Control: max-age=0, must-revalidate
ETag: "pdp-<sku>-v4"
Last-Modified: Wed, 29 Oct 2025 16:04:00 GMT
Sitemap split (example)
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-categories.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-products.xml</loc>
</sitemap>
<sitemap>
<loc>https://www.example.com/sitemaps/sitemap-products-new.xml</loc>
</sitemap>
</sitemapindex>
Post-publish monitoring (30/60/90 days)
- 30 days: validate log KPIs and confirm a higher share of Googlebot on PLPs/PDPs.
- 60 days: compare time-to-first-index for new SKUs; adjust rendering or sitemap deltas.
- 90 days: review error rates by bot, expand static facet paths that drove revenue, and close any newly noisy parameters.
If the KPIs stall, revisit the waste map and try a different lever: a stricter facet policy, a faster SSR path for PLPs, or stronger internal links to seasonal hubs.
Shopify Crawl Budget Optimisation
Shopify has its own shapes for URLs, sitemaps, and rendering.
These sections give you platform-specific ways to cut crawl waste and speed up discovery.
Shopify URL patterns to watch
Collection pages live at /collections/{handle} and often carry tags, filter params, ?page=, and ?sort_by=.
Keep high-value facets as curated collections, and keep the rest as parameters with noindex + crawl blocks.
Focus areas
- /collections/{handle}?sort_by= and /collections/{handle}?page=
- Filter params from Search & Discovery (e.g., filter.v.option.size=)
- Alternate views like ?view=grid or ?view=ajax
- Search results: /search and /search?type=product&q=
robots.txt.liquid on Shopify
Shopify lets you customise robots at templates/robots.txt.liquid.
Close noisy params and system paths while keeping canonical pages open.
# templates/robots.txt.liquid
User-agent: *
# crawl noise
Disallow: /*?sort_by=
Disallow: /*?view=
Disallow: /*?q=
Disallow: /search
# keep filter params crawl-closed if you don't want them discovered
Disallow: /*?*filter.
# tracking and session-like junk
Disallow: /*?*utm_
Disallow: /*?*gclid=
Sitemap: https://{{ shop.primary_domain }}/sitemap.xml
Keep collection and product base URLs crawlable.
Test changes in a staging theme before publishing.
Canonicals in Liquid (collection and product templates)
Make sure canonicals ignore query strings and point to the clean URL.
Place this in theme.liquid or each template near <head>.
{%- capture canonical -%}{{ canonical_url | split:'?' | first }}{%- endcapture -%}
<link rel="canonical" href="{{ canonical }}" />For collection pages with tags or filter params, keep the canonical on the base collection unless you’ve created a curated, static collection for that facet.
Avoid canonicals that bounce between param states.
Conditional meta robots for filter states
If you keep parameterised filters for UX, prevent them entering the index.
Use a simple guard on query strings.
{%- assign q = request.query_string | downcase -%}
{%- if q contains 'filter.' or q contains 'sort_by=' or q contains 'view=' -%}
<meta name="robots" content="noindex,follow">
{%- endif -%}
This still lets bots crawl links on the page while keeping the URL out of the index.
Pair with the robots.txt blocks to cut repeated crawling.
Shopify sitemaps: what you can and can’t do
Shopify auto-generates sitemap.xml and child sitemaps for products, collections, pages, and posts.
You can’t hand-edit them or ship true “delta” sitemaps.
Workable tactics
- Build a “New arrivals” or “Back in stock” collection and link it in the main nav so new SKUs get internal links fast.
- Submit important collection and promo URLs in Search Console for quicker discovery.
- Keep product publishing tidy: push only truly live, stocked items to avoid churn in sitemaps.
Internal linking on Online Store 2.0
Add static, crawlable links to priority collections from the homepage and top categories.
Use Collection list and Featured collection sections instead of JS-only carousels.
Quick wins
- Make a Seasonal block on the homepage that links to 6–12 top collections.
- Add related collections at the bottom of collection templates using lists, not dropdowns.
- Use metafields on collections to surface two or three strategic internal links in HTML.
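A minimal Liquid sketch of that metafield idea, assuming a hypothetical custom.related_links metafield of type “list of collection references” on the collection:

{%- comment -%} custom.related_links is an assumed metafield; define it under Settings → Custom data {%- endcomment -%}
{% for related in collection.metafields.custom.related_links.value %}
  <a href="{{ related.url }}">{{ related.title }}</a>
{% endfor %}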
Pagination and infinite scroll
Shopify uses ?page= for collection pagination.
Expose numbered links in HTML and keep self-canonical on page 2+ unless you have a solid consolidation plan.
Avoid infinite scroll that hides links behind JS.
If you need it, also render a classic pager so bots can traverse pages.
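A sketch of a classic pager in a collection template using Liquid’s paginate tag (the page size of 24 is arbitrary):

{% paginate collection.products by 24 %}
  {%- comment -%} product grid renders here {%- endcomment -%}
  {% if paginate.pages > 1 %}
    <nav aria-label="pagination">
      {% for part in paginate.parts %}
        {% if part.is_link %}
          <a href="{{ part.url }}">{{ part.title }}</a>
        {% else %}
          <span>{{ part.title }}</span>
        {% endif %}
      {% endfor %}
    </nav>
  {% endif %}
{% endpaginate %}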
Search results and filter archives
Disallow /search in robots and keep it noindex.
Never link to raw search queries from navigation.
For filter archives, either:
- Create a curated, static collection when revenue warrants it, or
- Keep it as a param with noindex + crawl block and rely on the base collection.
App hygiene (JS budget + crawl)
Most apps ship scripts, DOM changes, or extra endpoints.
Audit quarterly and remove anything that doesn’t move revenue or discovery.
Tidy-up list
- Turn off unused App embeds in the theme editor.
- Defer or load analytics and widgets after critical HTML.
- Prefer apps that render server-side blocks over client-injected markup.
Theme performance that influences crawl
Liquid renders server-side, then storefront JS hydrates sections.
Large sections, heavy loops, and big images slow TTFB and increase crawl cost.
Keep it lean
- Flatten nested loops and paginate large product lists.
- Compress images and serve modern formats with Shopify’s CDN sizing params (_800x etc.; see the sketch after this list).
- Inline only the CSS that’s truly critical; ship the rest as a small, cacheable file.
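A sketch of CDN-sized images in Liquid using the image_url filter; the widths are illustrative:

<img
  src="{{ product.featured_image | image_url: width: 800 }}"
  srcset="{{ product.featured_image | image_url: width: 400 }} 400w,
          {{ product.featured_image | image_url: width: 800 }} 800w"
  sizes="(max-width: 600px) 400px, 800px"
  loading="lazy"
  alt="{{ product.featured_image.alt | escape }}">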
Breadcrumbs and structured data
Output a BreadcrumbList per collection/product so bots understand the path.
Keep names and URLs aligned with canonicals.
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "BreadcrumbList",
"itemListElement": [
{ "@type": "ListItem", "position": 1, "name": "Women", "item": "https://{{ shop.primary_domain }}/collections/women" },
{ "@type": "ListItem", "position": 2, "name": "Running", "item": "https://{{ shop.primary_domain }}/collections/women-running" },
{ "@type": "ListItem", "position": 3, "name": "Trail", "item": "https://{{ shop.primary_domain }}/collections/women-running-trail" }
]
}
</script>
This pairs with your internal link plan and keeps paths consistent.
Avoid schema that references parameterised URLs.
Shopify Search & Discovery filters
The official app powers faceted filtering and builds param URLs.
Decide which facets deserve curated collections and which stay as params.
Practical split
- Indexable as static collections: core category + a small set of high-demand facets.
- Param-only: size, colour, price sliders, rating, availability.
Document the policy so merch and dev teams don’t create accidental index traps.
Keep any param-only facet links out of navigation.
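A sketch of what that documented policy can look like; the facet names are examples, so map them to your own catalogue:

# Facet policy — reviewed quarterly by merch + dev
activity (running → trail, road):  static collection, indexable, linked in nav
brand:                             static collection, indexable, linked in nav
size / colour / price / rating:    param-only, noindex, robots-blocked, never in nav
availability:                      param-only, noindex, robots-blocked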
Logs and measurement on Shopify
You don’t get raw server logs on the platform.
Use Search Console → Crawl Stats, URL Inspection API, and your analytics to infer crawl patterns.
What to track
- % of Googlebot hits landing on collections/products vs everything else (from Crawl Stats categories).
- Discovery time for new collections and SKUs.
- Parameter creep in top crawled URLs (sample with log-like reports from Crawl Stats and on-site link exports).
If you front the store with a proxy like Cloudflare on Plus, you can sample bot traffic there.
Keep origin visibility enabled so Shopify still caches and serves cleanly.
Shopify rendering choices (native vs headless)
Native themes are server-rendered with Liquid and usually good for discovery.
Headless (Hydrogen + Oxygen) brings SSR control and edge caching but shifts sitemap and linking duties to you.
If you go headless
- Generate and submit your own sitemaps.
- Pre-render category and PDP routes, and make sure internal links exist in HTML.
- Keep filter states as params with noindex unless you’re deliberately creating static, indexable routes.
Shopify-specific peak plan
- Create and link promo collections two weeks out; keep URLs stable.
- Freeze new filters and experimental apps a week out.
- On live day, check Crawl Stats hourly and watch for spikes to /search and ?sort_by=, then tighten rules if needed.
Shopify quick checklist
- robots.txt.liquid closes /search, ?sort_by=, ?view=, and the ?filter. params you don’t want crawled
- Canonical strips params site-wide; curated facet pages have clean paths
- Collection pagination exposes numbered links in HTML
- Homepage links to 6–12 revenue collections in static HTML
- Search & Discovery facets mapped: which are static collections vs param-only
- App embeds audited; unused scripts disabled
- Breadcrumb JSON-LD on collections and products
- GSC Crawl Stats monitored; new SKUs show faster index times week over week
Key Takeaways
- Measure first: segment server logs by bot, template, parameters, status, and HTML vs JS requests.
- Raise the % of Googlebot hits on indexable templates and cut waste by URL pattern.
- Shape crawl with clean XML sitemaps, cache validators (ETag, Last-Modified), and stable response times.
- Pick the right rendering model per template; validate with index latency and render-time buckets.
- Ship a peak-season runbook so promotions get discovered fast without melting servers.
Frequently Asked Questions
Do canonicals alone fix filter duplication?
No. Canonicals suggest a preferred URL, but Google may still crawl the duplicates if they’re open; close low-value patterns with robots rules and avoid linking to them.
Are noindex pages a waste of crawl?
They still get crawled. Use noindex for short-term control, but pair it with blocked crawling for noisy parameters that don’t need discovery.
Does prerender always reduce crawl cost?
Not always. If prerender inflates CPU under load and causes 5xx, crawl rate can drop; validate with render-time buckets and index latency before committing.