Technical SEO Guidelines for XML Sitemaps

An XML sitemap is a structured file that helps search engines discover important URLs on a website. It does not guarantee indexing, and it does not replace good internal linking, but it gives crawlers a clear, machine-readable view of the pages a site wants search engines to know about.

For technical SEO, a sitemap is best treated as a discovery and maintenance signal. It should be accurate, current, canonical, and limited to URLs that are intended to be indexed.

What an XML Sitemap Does

An XML sitemap lists URLs in a standardized format that search engines can read. Its main purpose is to support crawl discovery, especially for pages that may not be easy to find through navigation or internal links alone.

A sitemap can help search engines understand:

Which URLs are available for crawling
Which URLs are considered important enough to submit
When a page was last meaningfully updated, if the lastmod value is accurate
How the sections of a large website are organized, when sitemap indexes are used well

A sitemap is not a command. Search engines may crawl URLs not included in the sitemap, and they may choose not to index URLs that are included. Indexing still depends on crawlability, content quality, canonical signals, robots directives, internal links, duplication, page status, and broader site context.

For a broader definition, see URLMD’s guide to sitemaps.

What to Include in an XML Sitemap

An XML sitemap should include the canonical, indexable URLs that represent the public content you want search engines to discover and consider for indexing.

Common examples include:

Homepage
Main service or product pages
Category and subcategory pages that provide unique value
Location pages
Blog posts and articles
Evergreen resource pages
Important documentation or support pages
Public job listings, when applicable
Publicly available media pages, when they have indexable value

The sitemap should reflect the pages that matter in the site’s information architecture. It should not become a storage bin for every URL the website can technically generate.

What to Exclude from an XML Sitemap

A clean sitemap is often more useful than a large sitemap. URLs that are blocked, duplicated, thin, expired, private, or not intended for indexing should generally be left out.

Avoid including:

URLs blocked by robots.txt
Pages with noindex directives
Redirecting URLs
404, soft 404, or server error URLs
Duplicate URL variants
Filtered or faceted URLs that do not provide distinct indexable value
Internal search result pages
Login, cart, checkout, account, or administrative pages
Staging, development, or test URLs
Expired pages that should no longer be indexed

As a general rule, if a URL is not meant to be a search result, it usually does not belong in the XML sitemap.
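The inclusion and exclusion rules above can be sketched as a simple filter. This is a minimal illustration, not a definitive implementation: the PageRecord type, its field names, and the sample data are hypothetical, and it assumes the status, noindex, robots, and canonical values have already been collected by a crawl or exported from a CMS.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    """Hypothetical crawl record; field names are assumptions for this sketch."""
    url: str
    status: int               # final HTTP status code observed for the URL
    noindex: bool             # True if a meta robots or X-Robots-Tag noindex is present
    blocked_by_robots: bool   # True if robots.txt disallows crawling the URL
    canonical: str            # canonical URL declared by the page


def belongs_in_sitemap(page: PageRecord) -> bool:
    """Return True only for 200-status, indexable, self-canonical URLs."""
    if page.status != 200:                        # redirects, 404s, server errors
        return False
    if page.noindex or page.blocked_by_robots:    # conflicting directives
        return False
    return page.canonical == page.url             # duplicate variants point elsewhere


pages = [
    PageRecord("https://example.com/services/", 200, False, False,
               "https://example.com/services/"),
    PageRecord("https://example.com/old-page/", 301, False, False,
               "https://example.com/old-page/"),
    PageRecord("https://example.com/cart/", 200, True, False,
               "https://example.com/cart/"),
    PageRecord("https://example.com/shoes/?color=red", 200, False, False,
               "https://example.com/shoes/"),
]

sitemap_urls = [p.url for p in pages if belongs_in_sitemap(p)]
print(sitemap_urls)
```

A filter like this does not detect soft 404s or thin content. Those checks still need review of the page itself.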

Sitemap Structure and Size Limits

Each XML sitemap should follow the standard sitemap protocol. The protocol sets these limits:

Maximum of 50,000 URLs per sitemap file
Maximum uncompressed file size of 50 MB (52,428,800 bytes) per sitemap file

Large websites should use multiple sitemap files and organize them with a sitemap index. A sitemap index is a file that points search engines to the individual sitemap files.

For example, a site might use separate sitemaps for:

/post-sitemap.xml
/page-sitemap.xml
/product-sitemap.xml
/location-sitemap.xml
/image-sitemap.xml

Then a sitemap index, often located at /sitemap.xml, can list those individual files.

This structure is useful because it helps keep sitemap files manageable and makes troubleshooting easier. If one sitemap has errors, the issue can often be isolated to a specific content type or section of the website.
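As a rough sketch of how this splitting might be automated, the following uses Python's standard library to break a URL list into files of at most 50,000 entries and to build a sitemap index that points to them. The function names, the /product-sitemap-N.xml naming pattern, and the generated sample URLs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000
MAX_BYTES_PER_FILE = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def _build(root_tag, child_tag, locs):
    """Serialize a <urlset> or <sitemapindex> document as UTF-8 bytes."""
    root = ET.Element(root_tag, xmlns=SITEMAP_NS)
    for loc in locs:
        child = ET.SubElement(root, child_tag)
        ET.SubElement(child, "loc").text = loc  # special characters are escaped on output
    data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    if len(data) > MAX_BYTES_PER_FILE:
        raise ValueError("sitemap file exceeds the 50 MB uncompressed limit")
    return data


def build_urlset(urls):
    return _build("urlset", "url", urls)


def build_index(sitemap_urls):
    return _build("sitemapindex", "sitemap", sitemap_urls)


# Hypothetical example: 120,000 product URLs split across three files.
product_urls = [f"https://example.com/products/{i}/" for i in range(120_000)]
files = {
    f"/product-sitemap-{n}.xml": build_urlset(part)
    for n, part in enumerate(chunk_urls(product_urls), start=1)
}
index_xml = build_index([f"https://example.com{path}" for path in files])
print(sorted(files))
```

Splitting by content type first and by count second keeps each file tied to one section of the site, which preserves the troubleshooting benefit described above.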

Use Canonical URLs

The URLs in an XML sitemap should be canonical URLs. A canonical URL is the preferred version of a page when multiple URL versions may exist.

For example, these may all point to similar or identical content:

https://example.com/page/
https://www.example.com/page/
http://example.com/page/
https://example.com/page/?ref=campaign

The sitemap should include only the preferred indexable version. This helps reduce conflicting signals and supports cleaner crawl interpretation.
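One way to keep variants like these out of a sitemap is to normalize each candidate URL before it is written. The sketch below assumes the preferred version is https://example.com without the www prefix, and it treats a small, hypothetical set of query parameters as tracking-only. Both choices are site-specific assumptions, not rules from the sitemap protocol.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PREFERRED_HOST = "example.com"  # assumption: the non-www host is canonical
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}  # hypothetical list


def preferred_url(url: str) -> str:
    """Map scheme, host, and tracking-parameter variants to one preferred URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == "www." + PREFERRED_HOST:
        host = PREFERRED_HOST
    # Keep only query parameters that change the content of the page.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    # Force https and drop any fragment; explicit ports are also dropped in this sketch.
    return urlunsplit(("https", host, parts.path or "/", urlencode(kept), ""))


variants = [
    "https://example.com/page/",
    "https://www.example.com/page/",
    "http://example.com/page/",
    "https://example.com/page/?ref=campaign",
]
print({preferred_url(v) for v in variants})
```

The normalized URL should still match the canonical tag the page itself declares. Normalizing in the sitemap alone does not resolve a conflict between the two.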

For more detail, see Technical SEO Guidelines for Canonical URLs.

Lastmod, Changefreq, and Priority

The sitemap protocol supports optional fields such as lastmod, changefreq, and priority. These fields should be handled carefully.

`lastmod`

The lastmod field tells search engines when the content at a URL was last modified. This can be useful when it is accurate. The value should use W3C Datetime format, such as 2024-05-01 or 2024-05-01T12:00:00+00:00.

Use lastmod when your system can reliably reflect meaningful content changes, such as:

Updated article content
Changed product information
Revised location details
New or changed structured page data
Substantial edits to the visible page

Avoid updating lastmod automatically for minor template changes, tracking changes, sidebar updates, or unrelated sitewide elements. If every page appears freshly modified every day, the signal becomes less useful.
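One possible way to tie lastmod to meaningful changes is to fingerprint only the main content of a page and move the date when that fingerprint changes. The sketch below is an assumption-laden illustration: the function names are hypothetical, and it presumes the system can extract the main content separately from templates, sidebars, and tracking markup.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(main_content: str) -> str:
    """Hash the main content after collapsing whitespace-only differences."""
    normalized = " ".join(main_content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def updated_lastmod(stored_hash, stored_lastmod, main_content, now=None):
    """Return (hash, lastmod); lastmod moves only when the content hash changes."""
    new_hash = content_fingerprint(main_content)
    if new_hash == stored_hash:
        return stored_hash, stored_lastmod  # nothing meaningful changed
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).isoformat(timespec="seconds")
    return new_hash, stamp


# First publication sets the date.
h1, lastmod1 = updated_lastmod(
    None, None, "Opening hours: 9 to 5",
    now=datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc),
)
# A whitespace-only change later on leaves lastmod untouched.
h2, lastmod2 = updated_lastmod(
    h1, lastmod1, "Opening  hours: 9 to 5\n",
    now=datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
)
# A real content change moves it.
h3, lastmod3 = updated_lastmod(
    h2, lastmod2, "Opening hours: 10 to 6",
    now=datetime(2024, 7, 1, 12, 0, tzinfo=timezone.utc),
)
print(lastmod1, lastmod2, lastmod3)
```

Because the template and sitewide elements never enter the hash, a footer or navigation change does not make every page look freshly modified.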

`changefreq`

The changefreq field suggests how often a page is expected to change. It is a hint rather than a directive, and Google's sitemap documentation states that Google ignores it. It can still be included if your system sets it accurately, but it should not be treated as a crawling control.

`priority`

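The priority field suggests the importance of a URL relative to other URLs on the same site, on a scale from 0.0 to 1.0 with a default of 0.5. Search engines may ignore or discount it, and Google's sitemap documentation states that Google ignores it. If used, it should reflect real site hierarchy rather than assigning every page a high value.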

For many websites, an accurate URL list and reliable lastmod data matter more than aggressive use of changefreq or priority.

Image, Video, and News Sitemaps

Some websites benefit from specialized sitemaps for media or time-sensitive content.

Image Sitemaps

Image sitemap data can help search engines discover important images, especially when images are loaded through JavaScript or are not easy to find in the HTML.

This may be useful for:

Photography websites
Ecommerce product imagery
Visual portfolios
Recipe sites
Real estate listings
Educational diagrams or visual resources

Image discovery should also be supported through accessible image markup, useful filenames, appropriate file formats, and descriptive alt text where alt text is needed.
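As an illustration of what image sitemap data looks like, the sketch below builds a small urlset that attaches image locations to their host pages using the image namespace from Google's image sitemap extension. The function name, the page URL, and the image URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Register prefixes so the output uses xmlns="..." and xmlns:image="...".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)


def image_urlset(pages: dict) -> bytes:
    """Build a urlset where each page lists the images that appear on it."""
    root = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url_el = ET.SubElement(root, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image_el = ET.SubElement(url_el, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image_el, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


image_xml = image_urlset({
    "https://example.com/listings/oak-street/": [
        "https://example.com/media/oak-street-front.jpg",
        "https://example.com/media/oak-street-kitchen.jpg",
    ],
})
print(image_xml.decode("utf-8"))
```

Each page entry still uses the canonical page URL as its loc value. The image entries are additional data attached to that page, not separate sitemap URLs.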

Video Sitemaps

Video sitemaps can help search engines understand video content, including titles, descriptions, thumbnails, durations, and content locations.

They may be useful for sites with original video libraries, tutorials, product demonstrations, educational content, or media archives.

News Sitemaps

News sitemaps are specialized and should only be used when a website qualifies for news visibility and publishes timely news content. They are not necessary for ordinary blog posts or evergreen articles.

Submitting and Monitoring XML Sitemaps

Once a sitemap is created, it should be discoverable and monitored.

Common submission and discovery methods include:

Listing the sitemap in robots.txt
Submitting the sitemap in Google Search Console
Submitting the sitemap in Bing Webmaster Tools
Placing the sitemap index at a predictable location such as /sitemap.xml

A typical robots.txt sitemap reference looks like this:

Sitemap: https://example.com/sitemap.xml
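To confirm that a robots.txt file actually exposes the sitemap, and that it does not block URLs the sitemap submits, the file can be parsed with Python's standard urllib.robotparser module. The robots.txt content and the URLs below are made up for illustration, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would fetch the live file.
robots_txt = """\
User-agent: *
Disallow: /cart/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns None when no Sitemap lines are present.
declared_sitemaps = parser.site_maps() or []
print(declared_sitemaps)

# Flag candidate sitemap URLs that robots.txt would block for all crawlers.
candidates = ["https://example.com/services/", "https://example.com/cart/"]
blocked = [url for url in candidates if not parser.can_fetch("*", url)]
print(blocked)
```

Any URL that ends up in the blocked list is a candidate for removal from the sitemap, or a sign that the robots.txt rule is broader than intended.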

After submission, monitor the sitemap for:

Discovery status
Fetch errors
Submitted URLs that are not indexed
Redirects or broken URLs
URLs blocked by robots directives
Mismatch between submitted canonical URLs and selected canonical URLs

Sitemap reporting is not a full SEO audit, but it is a useful maintenance surface. It can reveal crawling and indexing issues that deserve closer review.

Common XML Sitemap Issues

Many sitemap problems come from stale automation, conflicting directives, or unclear canonical signals. The following checks are useful during technical SEO reviews.

1. The Sitemap Includes Non-Indexable URLs

If a URL is marked noindex, blocked by robots.txt, redirected, or returning an error, it should usually not be submitted in the sitemap.

2. The Sitemap Uses Non-Canonical URLs

If the sitemap lists one URL but the page canonicalizes to another, search engines receive mixed signals. The sitemap should usually list the canonical version.

3. The Sitemap Is Not Updated After Content Changes

When new pages are published, old pages are removed, or URLs change, the sitemap should reflect those changes. For dynamic websites, automation is preferred. For smaller static websites, scheduled review may be enough.

4. The Sitemap Contains Staging or Development URLs

Staging URLs should not be submitted to search engines, but they can slip into a sitemap after migrations, redesigns, or plugin misconfiguration.

5. The Sitemap Is Too Large or Not Split Logically

Large websites should split sitemaps by content type or site section. This improves maintainability and makes reporting easier to interpret.

6. The Sitemap Cannot Be Fetched

Search engines need to be able to access the sitemap. Server errors, authentication requirements, firewall rules, malformed XML, or incorrect redirects can prevent sitemap fetching.

7. The Sitemap Lists URLs with Poor Internal Link Support

A sitemap can help with discovery, but it should not be the only path to important content. Important pages should also be supported by appropriate internal links. Internal linking remains a core part of website structure and crawl discovery.

Practical XML Sitemap Checklist

Include only canonical, indexable, public URLs.
Exclude redirects, errors, blocked URLs, and noindex pages.
Keep each sitemap within 50,000 URLs and 50 MB uncompressed.
Use a sitemap index when multiple sitemap files are needed.
Keep lastmod accurate and tied to meaningful content changes.
Do not rely on priority or changefreq as strong crawl controls.
Submit the sitemap in Google Search Console and Bing Webmaster Tools when appropriate.
Reference the sitemap in robots.txt.
Validate the XML format after major changes.
Review sitemap reports periodically for errors and indexing patterns.
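Several of these checks can be automated. The sketch below is one hedged example: it parses a sitemap file and reports malformed XML, size-limit violations, duplicate entries, and URLs on an unexpected scheme or host, which is one way stray staging URLs show up. The function name, the problem messages, and the sample XML are assumptions for illustration. The sketch does not fetch URLs, so status codes and noindex directives still need a separate crawl.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024


def check_sitemap(xml_bytes: bytes, expected_host: str) -> list:
    """Return a list of human-readable problems found in one sitemap file."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append("file exceeds 50 MB uncompressed")
    try:
        root = ET.fromstring(xml_bytes)
    except ET.ParseError as exc:
        return problems + [f"malformed XML: {exc}"]

    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    if len(locs) > MAX_URLS:
        problems.append(f"{len(locs)} URLs exceeds the 50,000 limit")

    seen = set()
    for loc in locs:
        parts = urlsplit(loc)
        if parts.scheme != "https" or parts.hostname != expected_host:
            problems.append(f"unexpected scheme or host: {loc}")
        if loc in seen:
            problems.append(f"duplicate entry: {loc}")
        seen.add(loc)
    return problems


sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://staging.example.com/about/</loc></url>
  <url><loc>https://example.com/</loc></url>
</urlset>"""

for problem in check_sitemap(sample, expected_host="example.com"):
    print(problem)
```

Running a check like this after migrations or plugin changes catches the structural problems early, before they appear in search engine sitemap reports.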

XML Sitemap FAQ

Does an XML sitemap guarantee indexing?

No. An XML sitemap helps search engines discover URLs, but it does not guarantee that those URLs will be indexed. Search engines still evaluate crawlability, canonical signals, content quality, duplication, internal links, and other factors.

Should every page be in the XML sitemap?

No. The sitemap should include important public URLs that are intended to be indexed. Private pages, duplicate pages, filtered URLs, redirects, error pages, and noindex pages should generally be excluded.

How often should an XML sitemap be updated?

Ideally, the sitemap should update automatically when indexable content is added, removed, or changed. If automation is not available, review the sitemap on a regular schedule. For many smaller websites, monthly or quarterly review is reasonable, depending on publishing frequency.

Where should the XML sitemap be located?

Many websites place the sitemap or sitemap index at https://example.com/sitemap.xml. The exact location can vary, but it should be accessible to search engines and referenced in robots.txt or submitted through webmaster tools.

Can a website have more than one XML sitemap?

Yes. Larger websites often use multiple sitemaps organized under a sitemap index. This is common for sites with many posts, products, locations, images, or videos.

Summary

An XML sitemap is strongest when it reflects the real structure and intent of the website. Keep it clean, canonical, accessible, and current. It should support discovery, not compensate for broken architecture.

When the sitemap, internal links, canonical tags, robots directives, and page content all point in the same direction, search engines have a clearer path through the site.