Bare-minimum Google indexing for Jekyll
Searching for this blog on Google right now does not return any results. Let’s fix that.
If you think about it, that makes sense! If there are no links pointing to this blog, Google cannot follow them to find it and index it[^1]. Luckily, we can welcome Google’s crawlers here by:
- Adding a new “property” for this blog to the “Google Search Console”.
- Verifying we own the “property”.
- Ensuring that Google crawlers will find all articles by populating the `sitemap.xml` and `robots.txt` files.
# Adding and verifying a new property
This Google support page shows how to add the property. We are using a Cloudflare Pages domain, so we don’t control DNS and will need to create a URL-prefix property.
To verify the URL-prefix property we need the following snippet within the `<head>` of our index:

```html
<meta name="google-site-verification" content="<verification-token-provided-by-Google>" />
```
I first considered doing this by hand, but then found out that the `jekyll-seo-tag` plugin[^2] supports this. Adding a `google_site_verification` entry to the site configuration and deploying the change was enough to complete the verification.
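For reference, the configuration entry lives in `_config.yml`; the token value is whatever Google’s verification flow hands you (the placeholder below is not a real token):

```yaml
# _config.yml
# jekyll-seo-tag renders this entry as the
# google-site-verification <meta> tag in <head>.
google_site_verification: <verification-token-provided-by-Google>
```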
# Indexing content
Next, we want web crawlers to index all articles. In theory (based on “how-Google-works-101”), crawlers will follow links as they find them, and having all articles listed in the blog index should be enough. But who knows what really goes on behind the scenes?
A sitemap provides a more robust solution, listing all entries in an XML document (similar to an RSS feed). The jekyll-sitemap plugin takes care of populating it and updating it with new articles.
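As a quick sanity check of what the plugin generates, a short sketch can list every URL the sitemap exposes. This assumes the standard sitemaps.org schema that `jekyll-sitemap` emits; the sample URL below is illustrative:

```python
"""Sketch: list the URLs a generated sitemap exposes (sitemaps.org schema)."""
import xml.etree.ElementTree as ET

# Default namespace used by sitemap documents.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> entry from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

# Example against a minimal sitemap document:
sample = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://aldur.pages.dev/example-post/</loc></url>
</urlset>"""
print(sitemap_urls(sample))  # ['https://aldur.pages.dev/example-post/']
```

Fetching the live `sitemap.xml` and feeding it to `sitemap_urls` shows at a glance whether every article made it in.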
The behaviour of the sitemap plugin depends on the index at which we add it to Jekyll’s `plugins` array in the site configuration.
In my case, I added it last – it “knows” not to add my RSS feed to the sitemap and I want it to index the rest of the content.
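The resulting `plugins` array might look like this (the other plugin names are illustrative; the point is that `jekyll-sitemap` comes last, after the plugin generating the RSS feed):

```yaml
# _config.yml -- illustrative ordering.
# jekyll-sitemap is listed last so it runs after the
# feed plugin and leaves the RSS feed out of the sitemap.
plugins:
  - jekyll-seo-tag
  - jekyll-feed
  - jekyll-sitemap
```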
By adding a `robots.txt` file we inform bots about the sitemap:

```
User-agent: *
Sitemap: https://aldur.pages.dev/sitemap.xml
```
Jekyll exposes a `url` configuration entry, which defaults to `localhost` in development. When deploying on Cloudflare Pages, I override it in a separate configuration file. We can use Liquid to inject the correct URL, so the sitemap line renders as https://aldur.blog/sitemap.xml (credits).
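A minimal `robots.txt` written as a Liquid template could look as follows; the empty front matter tells Jekyll to process the file, and the built-in `absolute_url` filter prepends the configured `url`:

```liquid
---
# Empty front matter, so Jekyll runs the Liquid below.
---
User-agent: *
Sitemap: {{ "/sitemap.xml" | absolute_url }}
```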
# Results
I prepared and merged the changes. Then, I refreshed Google’s `robots.txt` cache (instructions here).
The Search Console tells me it will take a couple of days to index everything. I will see how that works out and update this article if I find out there’s anything else I need to do.
If you ended up here through a Google search: it worked!
# Post scriptum
It turns out that despite all of the above, these days Google will not index[^3] your website unless one of its crawlers finds a reference to it. In my case, this tweet triggered some traffic and an email from Google, informing me that I can monitor incoming traffic from the console. Alas, as of February 2024, the console shows little traffic but has not picked up any page in the index yet.
# Footnotes
[^1]: Unless you use Google Analytics. In that case, I expect that the analytics scripts take care of indexing, so you won’t need the steps described here.

[^2]: The plugin also makes social information and title/excerpts of posts available to search crawlers. The `minima` theme I am using suggests adding it in its default configuration.

[^3]: In my case, trying to manually request the indexing from the Google Search Console got me:

    > Sorry – we couldn’t process this request because you’ve exceeded your daily quota. Please try submitting this again tomorrow.

    I got this on my first request of the day. Is the quota zero?