Bare-minimum Google indexing for Jekyll
Searching for this blog on Google right now does not return any results. Let’s fix that.
If you think about it, that makes sense! If there are no links pointing to this blog, Google cannot follow them to find it and index it[^1]. Luckily, we can welcome Google’s crawlers here by:
- Adding a new “property” for this blog to the “Google Search Console”.
- Verifying we own the “property”.
- Ensuring that Google crawlers will find all articles by populating the `sitemap.xml` and `robots.txt` files.
# Adding and verifying a new property
This Google support page shows how to add the property. We are using a Cloudflare Pages domain, so we don’t control DNS and will need to create a URL-prefix property.
To verify the URL-prefix property we need the following snippet within the `<head>` of our index:

```html
<meta name="google-site-verification" content="<verification-token-provided-by-Google>" />
```
I first considered doing this by hand, but then found out that the `jekyll-seo-tag` plugin[^2] supports this. Adding a `google_site_verification` entry to the site configuration and deploying the change was enough to complete the verification.
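For reference, the configuration entry lives in `_config.yml`; the token value is whatever Google’s verification flow hands you (the placeholder below is not a real token):

```yaml
# _config.yml
# jekyll-seo-tag renders this entry as the
# google-site-verification <meta> tag in <head>.
google_site_verification: <verification-token-provided-by-Google>
```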
# Indexing content
Next, we want web crawlers to index all articles. In theory (based on “how-Google-works-101”), crawlers will follow links as they find them, and having all articles listed in the blog index should be enough. But who knows what really goes on behind the scenes?
A sitemap provides a more robust solution, listing all entries in an XML document (similar to an RSS feed). The jekyll-sitemap plugin takes care of populating it and updating it with new articles.
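As a quick sanity check of what the plugin generates, a short sketch can list every URL the sitemap exposes. This assumes the standard sitemaps.org schema that `jekyll-sitemap` emits; the sample URL below is illustrative:

```python
"""Sketch: list the URLs a generated sitemap exposes (sitemaps.org schema)."""
import xml.etree.ElementTree as ET

# Default namespace used by sitemap documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

# Example against a minimal sitemap document:
sample = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://aldur.pages.dev/example-post/</loc></url>
</urlset>"""
print(sitemap_urls(sample))  # ['https://aldur.pages.dev/example-post/']
```

Fetching the live `sitemap.xml` and feeding it to `sitemap_urls` shows at a glance whether every article made it in.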
The behaviour of the sitemap plugin depends on the index at which we add it to Jekyll’s `plugins` array in the site configuration.
In my case, I added it last – it “knows” not to add my RSS feed to the sitemap and I want it to index the rest of the content.
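The resulting `plugins` array might look like this (the other plugin names are illustrative; the point is that `jekyll-sitemap` comes last, after the plugin generating the RSS feed):

```yaml
# _config.yml -- illustrative ordering.
# jekyll-sitemap is listed last so it runs after the
# feed plugin and leaves the RSS feed out of the sitemap.
plugins:
  - jekyll-seo-tag
  - jekyll-feed
  - jekyll-sitemap
```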
By adding a `robots.txt` file we inform bots about the sitemap:

```
User-agent: *
Sitemap: https://aldur.pages.dev/sitemap.xml
```
Jekyll exposes a `url` configuration entry, which defaults to `localhost` in development. When deploying on Cloudflare Pages, I override it in a separate configuration file. We can use Liquid to inject the correct URL, so the sitemap line renders as https://aldur.blog/sitemap.xml (credits).
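A minimal `robots.txt` written as a Liquid template could look as follows; the empty front matter tells Jekyll to process the file, and the built-in `absolute_url` filter prepends the configured `url`:

```liquid
---
# Empty front matter, so Jekyll runs the Liquid below.
---
User-agent: *
Sitemap: {{ "/sitemap.xml" | absolute_url }}
```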
# Results
I prepared and merged the changes. Then, I refreshed Google’s `robots.txt` cache (instructions here).
The Search Console tells me it will take a couple of days to index everything. I will see how that works out and update this article if I find out there’s anything else I need to do.
If you ended up here through a Google search: it worked!
# Post scriptum
It turns out that despite all of the above, these days Google will not index[^3] your website unless one of its crawlers finds a reference to it. In my case, this tweet triggered some traffic and an email from Google, informing me that I can monitor incoming traffic from the console. Alas, as of February 2024, the console shows little traffic but has not picked up any page in the index yet.
# Footnotes
[^1]: Unless you use Google Analytics. In that case, I expect that the analytics scripts take care of indexing, so you won’t need the steps described here.

[^2]: The plugin also makes social information and title/excerpts of posts available to search crawlers. The `minima` theme I am using suggests adding it in its default configuration.

[^3]: In my case, trying to manually request the indexing from the Google Search Console got me:

    > Sorry – we couldn’t process this request because you’ve exceeded your daily quota. Please try submitting this again tomorrow.

    I got this on my first request of the day. Is the quota zero?