How to set up a robots.txt file to disallow OpenAI's GPTBot from crawling your website for LLM training
OpenAI has launched GPTBot, a web crawler used to improve future artificial intelligence models like GPT-4. The crawler identifies itself with the user-agent token GPTBot.
How GPTBot works
Web pages crawled with the GPTBot user agent may be used to improve future models. According to OpenAI, crawled sources are filtered to remove pages that require paywall access, are known to gather personally identifiable information (PII), or contain text that violates OpenAI's policies. OpenAI states that allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.
As an author or owner of a Ghost blog, you may want to prevent GPTBot from accessing your Ghost site.
Disallowing GPTBot on your Ghost Website
To disallow GPTBot on your Ghost website, set up a robots.txt file that blocks the crawler. For Ghost websites, a custom robots.txt lives in the root of your theme directory.
There are two ways to create the robots.txt file: from the command line, or by downloading your theme files, adding the robots.txt file, and uploading the theme again via the Ghost dashboard.
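Whichever route you take, the end result is the same file in the theme root. As a minimal sketch of the scripted route (the theme folder name my-theme is a hypothetical stand-in for your downloaded theme), the file can be created like this:

```python
from pathlib import Path

# Hypothetical local copy of your downloaded theme folder
theme_dir = Path("my-theme")
theme_dir.mkdir(exist_ok=True)

# Rules that block only GPTBot from the whole site
rules = "User-agent: GPTBot\nDisallow: /\n"
(theme_dir / "robots.txt").write_text(rules)

print((theme_dir / "robots.txt").read_text())
```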
Download a theme in Ghost Admin to create the robots.txt file
From your dashboard, go to Settings, then go to the Design page.
Click Change theme.
Choose the Advanced toggle.
Click the overflow menu (...) and then Download. A copy of the theme will be downloaded to your computer.
Open the theme folder in your code editor.
Create a robots.txt file in the root of the theme folder. To block GPTBot from your entire site while leaving other crawlers unaffected, add:

User-agent: GPTBot
Disallow: /

Sitemap: https://yourdomain.com/sitemap.xml

Replace yourdomain.com with your own domain; Ghost serves its sitemap at /sitemap.xml. Be careful not to place the Disallow: / rule under a User-agent: * group, as that would block all crawlers, not just GPTBot.
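You can sanity-check the rules before deploying them. A minimal sketch using Python's standard urllib.robotparser (the example URL is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]

rp = RobotFileParser()
rp.parse(rules)

# GPTBot is blocked everywhere; other crawlers are unaffected
print(rp.can_fetch("GPTBot", "https://example.com/my-post/"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/my-post/"))  # True
```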
To allow GPTBot to access only parts of your site, add directory-specific rules for the GPTBot user agent to your site's robots.txt, like this:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
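The same approach can verify the partial-access rules (directory names as in the snippet above, example URLs hypothetical):

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
]

rp = RobotFileParser()
rp.parse(rules)

# GPTBot may crawl directory-1 but not directory-2
print(rp.can_fetch("GPTBot", "https://example.com/directory-1/post/"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/post/"))  # False
```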
Zip up your theme files. Then, back on the Change theme page in Ghost Admin, click Upload theme and select the zip.
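The zip step can also be scripted. A sketch with Python's zipfile module, assuming a hypothetical theme folder named my-theme whose contents go at the root of the archive:

```python
import zipfile
from pathlib import Path

# Hypothetical downloaded theme folder containing the new robots.txt
theme_dir = Path("my-theme")
theme_dir.mkdir(exist_ok=True)
(theme_dir / "robots.txt").write_text("User-agent: GPTBot\nDisallow: /\n")

# Archive the theme contents for upload in Ghost Admin
with zipfile.ZipFile("my-theme.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for path in theme_dir.rglob("*"):
        zf.write(path, path.relative_to(theme_dir))

print(zipfile.ZipFile("my-theme.zip").namelist())
```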