robots.txt

What robots.txt is

robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

robots.txt - Wikipedia
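
The format itself is a handful of plain-text directives. A small illustrative example (the agent and the path are stand-ins I picked, not recommendations):

```
# Keep one specific crawler out entirely
User-agent: GPTBot
Disallow: /

# Everyone else may crawl anything except /drafts/
User-agent: *
Disallow: /drafts/
```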

AI bot crawlers: how to handle them

There are a couple of projects that aim to help against AI companies' bots that crawl and scrape our content. Amongst them: ai-robots-txt/ai.robots.txt, which maintains a robots.txt blocking known AI crawlers.

I'm going to set up a recipe in my Justfile that updates the robots.txt file in my static folder from their latest release.

The compiled file is published as a release asset:

https://github.com/ai-robots-txt/ai.robots.txt/releases/latest/download/robots.txt

curl -L https://github.com/ai-robots-txt/ai.robots.txt/releases/latest/download/robots.txt -o static/robots.txt

(The -L matters: GitHub answers the latest-release URL with a redirect, and without it curl saves the redirect response instead of the file.)
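
A minimal sketch of that Justfile recipe (the recipe name update-robots is just my label, nothing standardized):

```just
# Refresh static/robots.txt from the latest ai.robots.txt release.
# -f fails on HTTP errors instead of saving an error page,
# -sS silences the progress bar but keeps error messages,
# -L follows GitHub's redirect to the versioned release asset.
update-robots:
    curl -fsSL https://github.com/ai-robots-txt/ai.robots.txt/releases/latest/download/robots.txt -o static/robots.txt
```

Running `just update-robots` before a deploy keeps the block list current.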

Some reading on robots.txt and on blocking crawlers from indexing content:

https://searchengineland.com/robots-txt-new-meta-tag-llm-ai-429510

https://www.robotstxt.org/robotstxt.html

https://developers.google.com/search/docs/crawling-indexing/robots/intro (the highlighted caveat: "The instructions in robots.txt files cannot enforce crawler behavior to your site; it's up to the crawler to obey them.")

https://developers.google.com/search/docs/crawling-indexing/block-indexing
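
One caveat that last link makes: robots.txt controls crawling, not indexing, and a blocked page can still end up indexed if other sites link to it. To actually keep a page out of search results, Google's docs use a noindex rule instead, either as a meta tag in the page:

```html
<!-- In the page's <head>: ask compliant search engines not to index it -->
<meta name="robots" content="noindex">
```

or, for non-HTML resources, as the equivalent HTTP response header:

```
X-Robots-Tag: noindex
```

Note that the page must stay crawlable for the noindex to be seen, so don't also block it in robots.txt.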
