How to generate nicer sitemaps with the sitemap_generator gem

Kane Jamison · 7 min read

If you've ever set up a sitemap in Rails, you might have used the sitemap_generator gem (docs: github.com/kjvarga/sitemap_generator) to automatically generate an XML sitemap.
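
If you haven't used it before, setup is quick: add the gem and generate the starter config.

# Gemfile
gem 'sitemap_generator'

# Then, from your shell:
#   bundle install
#   rake sitemap:install   # creates a starter config/sitemap.rb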

On a recent client project we were generating a single sitemap for a variety of content types, including static pages, blog posts, and 2-3 custom content directories.

You can imagine a basic config/sitemap.rb setup like this from the sitemap_generator Readme:

SitemapGenerator::Sitemap.default_host = 'http://example.com'

SitemapGenerator::Sitemap.create do
  add '/about', :changefreq => 'daily', :priority => 0.9
  add '/contact_us', :changefreq => 'weekly'

  Article.published.each do |article|
    add article_path(article), priority: 0.64, changefreq: 'weekly'
  end
end

SitemapGenerator::Sitemap.ping_search_engines # Not needed if you use the rake tasks

I want to walk through a few changes I made to improve this sitemap and why.

Break Into Sitemap Index + Multiple Sitemaps

I prefer to generate multiple individual sitemaps for different content directories.

That's because when you submit those sitemaps to Google Search Console, it's much easier to analyze URL performance by content type:

An example of a Sitemap Index file with 3 additional submitted sitemaps for Posts, Pages, and Tags (which are important index pages for this domain)

Now, after submitting each of those sitemaps in Search Console, I can get a more granular breakdown of which content is being indexed and which isn't. That helps me make decisions around internal linking, identify why pages aren't ranking, and so on.

An example of the Page Indexing report specifically for my /sitemap-posts.xml file, which allows me to dive deeper into pages that are "Crawled - currently not indexed" and other vague Google index statuses.

Break the sitemap into separate files using group(filename: ...)

The sitemap_generator gem has a nice built-in group feature to handle this:

# Set the root domain
SitemapGenerator::Sitemap.default_host = "https://www.example.com"

SitemapGenerator::Sitemap.create do
  group(filename: :sitemap_pages) do
    add '/about', :changefreq => 'daily', :priority => 0.9
    add '/contact_us', :changefreq => 'weekly'
  end
  
  group(filename: :sitemap_articles) do
    Article.published.each do |article|
      add article_path(article), priority: 0.64, changefreq: 'weekly'
    end
  end
end
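
With that config in place, the gem's usual rake tasks regenerate every group at once:

rake sitemap:refresh          # writes sitemap_pages.xml.gz and sitemap_articles.xml.gz to public/
rake sitemap:refresh:no_ping  # same output, but skips pinging search engines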

Add the root URL back to the pages group using include_root: true

When you use the group feature, sitemap_generator stops adding your homepage to a sitemap. You can add it back using include_root: true in the pages group we established above:

group(filename: :sitemap_pages, include_root: true) do
  add '/about', :changefreq => 'daily', :priority => 0.9
  add '/contact_us', :changefreq => 'weekly'
end

Set create_index = true to generate a sitemap_index file

Our next step is to tell sitemap_generator to build a sitemap_index file. A sitemap_index file is simply an index of all of your grouped sitemaps.

Here's the Ghost-generated current sitemap.xml index file for RailsGrowth, broken apart by pages, posts, authors, and tags.

It is a nice feature because we can add a single sitemap reference to robots.txt and bots will find all of the sitemaps we might generate in the future:

# robots.txt

Sitemap: https://www.example.com/sitemaps/sitemap_index.xml.gz

No matter how many new groups we add to our config/sitemap.rb, they'll automatically be referenced by this robots.txt entry. For big sites that's huge, because it will automatically include sitemaps that get paginated or broken up after exceeding the 50,000-URL maximum per sitemap file.

We can generate that using the following before our create command:

# Create a sitemap index which points to all sitemap files
SitemapGenerator::Sitemap.create_index = true

Many CMSs and SEO plugins name this file sitemap_index.xml by default. sitemap_generator doesn't do that; it just names the file sitemap.xml. So if we'd like, we can override the index filename in the create command to make it more explicit.

# Set the root domain
SitemapGenerator::Sitemap.default_host = "https://www.example.com"

# Create a sitemap index which points to all sitemap files
SitemapGenerator::Sitemap.create_index = true

SitemapGenerator::Sitemap.create(filename: 'sitemap_index') do
  group(filename: :sitemap_pages, include_root: true) do
    add '/about', :changefreq => 'daily', :priority => 0.9
    add '/contact_us', :changefreq => 'weekly'
  end
  
  group(filename: :sitemap_articles) do
    Article.published.each do |article|
      add article_path(article), priority: 0.64, changefreq: 'weekly'
    end
  end
end

Our current config, which outputs a renamed sitemap_index.xml.gz file alongside sitemap_pages.xml.gz and sitemap_articles.xml.gz

Generate both .xml and .xml.gz sitemap versions

By default, sitemap_generator will generate compressed sitemaps, meaning a gzipped .xml.gz file.

Those are smaller in file size and better both for serving to bots and for submitting to Google Search Console.

But they don't render in the browser; they download when you load them. And I don't know about you, but Microsoft Word is really excited to be the default app on my Mac when I unzip one and try to open the .xml file locally.

So it's nice to also generate an uncompressed .xml file that can be viewed in the browser. That makes it much easier for developers to load the .xml version in local, staging, and production and see how URLs are being rendered in each sitemap file.

Thankfully, we can do this pretty easily by toggling the compress option and running SitemapGenerator::Sitemap.create twice:

# Set the root domain
SitemapGenerator::Sitemap.default_host = "https://www.example.com"

# Create a sitemap index which points to all sitemap files
SitemapGenerator::Sitemap.create_index = true

# Generate both compressed and uncompressed versions
[true, false].each do |compress_value|
  SitemapGenerator::Sitemap.compress = compress_value

  SitemapGenerator::Sitemap.create(filename: 'sitemap_index') do
    group(filename: :sitemap_pages, include_root: true) do
      add '/about', :changefreq => 'daily', :priority => 0.9
      add '/contact_us', :changefreq => 'weekly'
    end
    
    group(filename: :sitemap_articles) do
      Article.published.each do |article|
        add article_path(article), priority: 0.64, changefreq: 'weekly'
      end
    end
  end
end

This change allows us to generate both .xml and .xml.gz sitemaps.

In this example we're iterating over the [true, false] array, using each value to set SitemapGenerator::Sitemap.compress before running .create, which ends up generating .xml and .xml.gz copies of every sitemap file.
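
As an aside, compress also accepts a documented :all_but_first value, which gzips everything except the first file in each group. That wasn't what I wanted here, since I want uncompressed copies of every file, but it's worth knowing about:

SitemapGenerator::Sitemap.compress = :all_but_first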

Here's an example of what our output looks like for a site with 4 groups, a sitemap index, and both .xml and .xml.gz output:

In '/Users/kanejamison/Github/ltbweb/public/':
+ sitemap_pages.xml.gz                   26 links / 591 Bytes
+ sitemap_blog.xml.gz                    22 links / 911 Bytes
+ sitemap_resources.xml.gz             2713 links /   40.2 KB
+ sitemap_services.xml.gz                19 links / 453 Bytes
+ sitemap_sitemap_index.xml.gz         4 sitemaps / 263 Bytes

Sitemap stats: 2,780 links / 4 sitemaps / 0m00s

In '/Users/kanejamison/Github/ltbweb/public/':
+ sitemap_pages.xml                      26 links /   4.71 KB
+ sitemap_blog.xml                       22 links /   5.14 KB
+ sitemap_resources.xml                2713 links /    510 KB
+ sitemap_services.xml                   19 links /   3.97 KB
+ sitemap_sitemap_index.xml            4 sitemaps / 776 Bytes

Sitemap stats: 2,780 links / 4 sitemaps / 0m00s

Take note of the file size difference on the gzipped versions. It's certainly nice to point bots to those versions, especially with how much AI crawler activity is happening nowadays.

If you're on Heroku, you're not done.

This part gets a little confusing.

If you're on a host that lets you write static XML files to the server's filesystem, you might be good to go. Set up a cron job for rake sitemap:refresh, and you'll have a fresh sitemap as often as you like.
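
As one sketch, if you manage cron with the whenever gem, a nightly refresh might look like this (the schedule and the use of whenever are my assumptions, not something specific to sitemap_generator):

# config/schedule.rb (whenever gem; schedule is illustrative)
every 1.day, at: '4:30 am' do
  rake 'sitemap:refresh'
end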

But if you're on Heroku, you can't build static assets on the server and keep them there. If I understand correctly, they'll let you build the files in /tmp/, but you can't just drop them in /public/; you'll need to set up a storage adapter, which is heavily documented in sitemap_generator's docs:

Sometimes it is desirable to host your sitemap files on a remote server, and point robots and search engines to the remote files. For example, if you are using a host like Heroku, which doesn't allow writing to the local filesystem. You still require some write access, because the sitemap files need to be written out before uploading. So generally a host will give you write access to a temporary directory. On Heroku this is tmp/ within your application directory.
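
For reference, here's a rough sketch of what that looks like with the gem's aws-sdk adapter. The bucket name, paths, and env vars are placeholders, and constructor arguments can vary by gem version, so check the storage adapter docs before copying this:

# config/sitemap.rb, sketched for S3 via SitemapGenerator::AwsSdkAdapter
# (requires the aws-sdk-s3 gem; bucket and credentials are placeholders)
SitemapGenerator::Sitemap.default_host  = 'https://www.example.com'
SitemapGenerator::Sitemap.sitemaps_host = 'https://my-bucket.s3.amazonaws.com/'
SitemapGenerator::Sitemap.public_path   = 'tmp/'       # Heroku allows writes here
SitemapGenerator::Sitemap.sitemaps_path = 'sitemaps/'  # subdirectory within the bucket
SitemapGenerator::Sitemap.adapter = SitemapGenerator::AwsSdkAdapter.new(
  'my-bucket',
  aws_access_key_id:     ENV['AWS_ACCESS_KEY_ID'],
  aws_secret_access_key: ENV['AWS_SECRET_ACCESS_KEY'],
  aws_region:            ENV['AWS_REGION']
)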

When you switch to a storage adapter, a lot of our setup above becomes a problem.

As sitemap_generator explains in their section on storage adapters:

Note that SitemapGenerator will automatically turn off include_index in this case because the sitemaps_host does not match the default_host. The link to the sitemap index file that would otherwise be included would point to a different host than the rest of the links in the sitemap, something that the sitemap rules forbid.

So, that's a problem - if we can't store our sitemap on the server, and we can't generate a sitemap index, then we're left with a couple of options.

One is that we can hardcode our sitemap file references in robots.txt. On a smaller site that might be feasible, but if we're creating multiple sitemaps of a certain type (e.g. sitemap_posts.xml, sitemap_posts2.xml), it would require us to manually keep track of how many sitemaps are getting created and update those robots.txt entries by hand.
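
For completeness, the hardcoded version is just one Sitemap line per file in public/robots.txt. The filenames and bucket URL below are illustrative; per the sitemaps protocol, robots.txt Sitemap entries are allowed to point at a different host, which is what makes this workable with remote storage:

# public/robots.txt
Sitemap: https://my-bucket.s3.amazonaws.com/sitemaps/sitemap_pages.xml.gz
Sitemap: https://my-bucket.s3.amazonaws.com/sitemaps/sitemap_posts.xml.gz
Sitemap: https://my-bucket.s3.amazonaws.com/sitemaps/sitemap_posts2.xml.gz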

[post in progress until this note is gone, still finishing this end section]

STILL TO COVER:

  • REVERSE PROXY / NGINX to serve files from our domain URLs.
  • Or, whether redirects will work.
  • Whether we can override the sitemap_index disabled feature if we're willing to handle the URL controls in one of those ways.