SMF Bot Hygiene

Overview:

This repository holds our sample robots.txt file and the portion of our .htaccess that we use to restrict bot access.

Everybody's site is a bit different; we are offering these as starter packs for folks running SMF sites.

robots.txt Notes & Criteria:

robots.txt is voluntary; only "good" crawlers adhere to it. It is still immensely helpful, though, because there are lots of links crawlers shouldn't attempt to load. robots.txt can thus drastically reduce unnecessary site hits from the "good" bots.
Our site does not allow guests to see attachments, so attachment links are restricted here.
We restrict msg-level queries here. This is actually kinda important, as a msg link loads the whole page of the topic the msg is on. So, if a page of a topic has 30 msgs on it, the crawler would otherwise load that same page 30 times as it follows each link on the page, which is extremely wasteful. This restriction has proven to drastically reduce Google site requests.
End-user functions are restricted, e.g., profile & notification activity, posts, modifications, likes, etc. There is no reason for crawlers to attempt any of these, so robots.txt tells them not to.
Admin functions are restricted.
Search functions are restricted. Crawlers should get that info off of the topics themselves.
The file also sets a crawl-delay. Yes, I know Google does not honor crawl-delay, but I can hope.
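
To make these criteria concrete, here is a minimal Python sketch (not part of this repository) for sanity-checking candidate Disallow patterns against sample URLs before deploying them. The patterns and URLs are assumptions based on stock SMF URL formats (`index.php?action=...` and `topic=N.msgM`); they are not copied from our robots.txt. The matcher only approximates the `*` and `$` wildcard handling that the major crawlers document, and the names `DISALLOW_PATTERNS`, `pattern_to_regex` and `is_disallowed` are hypothetical.

```python
import re

# Hypothetical Disallow patterns, for illustration only; not our actual robots.txt.
DISALLOW_PATTERNS = [
    "/*.msg",             # msg-level links that reload the whole topic page
    "/*action=dlattach",  # attachments (guests cannot see them)
    "/*action=profile",   # end-user functions
    "/*action=post",
    "/*action=admin",     # admin functions
    "/*action=search",    # search functions
]


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start.

    '*' matches any run of characters and a trailing '$' anchors the end.
    Everything else is matched literally, as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


RULES = [pattern_to_regex(p) for p in DISALLOW_PATTERNS]


def is_disallowed(path_and_query: str) -> bool:
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(rule.match(path_and_query) for rule in RULES)


if __name__ == "__main__":
    samples = [
        "/index.php?topic=123.0",        # topic page: should stay crawlable
        "/index.php?topic=123.msg4567",  # msg link: same page, so skip it
        "/index.php?action=search",      # search function
    ]
    for url in samples:
        print(url, "->", "blocked" if is_disallowed(url) else "allowed")
```

Running a handful of real URLs from your own access log through a check like this is a cheap way to confirm a new pattern blocks what you intend and nothing more.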

.htaccess Notes & Criteria:

Our site attempts to allow genuine search engine crawlers that honor robots.txt, including international crawlers, as we support folks from all over the world.
Crawlers that are not for search engines are blocked, e.g., AI crawlers and social media crawlers (Facebook, TikTok, etc.).
Crawlers that try to remain anonymous, e.g., by disguising the user agent, are blocked.
Crawlers that do not honor robots.txt are blocked.
We also now block requests whose user agent reports an old browser version (Chrome < 80.0, Firefox < 100.0 & Opera < 10.0), as this appears to be 100% bot activity.
To construct this list, we started with the user agent list from this site: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/blob/master/_generator_lists/bad-user-agents.list
We removed some valid international crawlers from that list, and have added many new ones, primarily social media sites & AI bots.
We also attempt to restrict IP ranges known for malicious activity or aggressive crawling.
We ensure none of our users are in those IP ranges first...
Since we have a fair number of international users, we attempt to avoid blocking by country.
We do, however, now block China via a geo-block capability from our host, so that block is not visible in this list. If you need to block whole countries via a geo-block, consult your host.
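
The old-browser criterion can be illustrated with a short Python sketch. This is not how the blocking is done on our site (that happens in .htaccess); it is only a way to try the version thresholds against user agent strings, e.g. ones pulled from your logs. The thresholds come from the notes above; the function name `is_outdated_browser` and the simple `Name/major` token match are assumptions, and the check ignores any other version tokens in the string.

```python
import re

# Minimum major versions from the criteria above; anything older is treated as a bot.
MIN_MAJOR_VERSION = {"Chrome": 80, "Firefox": 100, "Opera": 10}

# Matches tokens such as "Chrome/79.0.3945.88" or "Firefox/99.0" and captures
# the browser name and its major version number.
UA_TOKEN = re.compile(r"(Chrome|Firefox|Opera)/(\d+)")


def is_outdated_browser(user_agent: str) -> bool:
    """Return True if the user agent advertises a browser below the thresholds."""
    for name, major in UA_TOKEN.findall(user_agent):
        if int(major) < MIN_MAJOR_VERSION[name]:
            return True
    return False


if __name__ == "__main__":
    old_chrome = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    )
    new_firefox = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
        "Gecko/20100101 Firefox/121.0"
    )
    print(is_outdated_browser(old_chrome))   # True: Chrome 79 < 80
    print(is_outdated_browser(new_firefox))  # False: Firefox 121 >= 100
```

Before adopting thresholds like these, it is worth running them over a sample of your own traffic to make sure no real users fall below the cutoffs.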

htaccess_asn_list.txt:

This file holds the list of ASNs that are blocked in the .htaccess file.
These ASNs have been problematic, one way or another. Attacks come from them: either they have been exploited by botnets, or they simply do not monitor such activity coming from their networks.

cidr_list_cleaner.php:

The CIDR list you get from an ASN lookup is EXTREMELY inefficient. Most IP ranges are duplicated or overlap. Also, many CIDRs are consecutive, and the list can often be simplified by combining them.
This utility removes all duplication & overlap and combines adjacent CIDRs where possible, typically resulting in a 98-99% reduction in list size.
It can be run from either the command line or a browser. It accepts a user-specified flat file with a list of valid CIDRs, one per line; output is written to a new file. Entries that don't match a CIDR format are dropped with an error message.
There is no limit to the number of records in the flat file.
It works for both IPv4 & IPv6.
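
For readers who want to see the idea without reading the PHP, here is a minimal Python sketch of the same kind of cleanup using the standard library's `ipaddress.collapse_addresses`. It is not a port of cidr_list_cleaner.php and may differ from it in details; the function name `clean_cidrs` is hypothetical, and accepting entries with host bits set (`strict=False`) or bare addresses is an assumption of this sketch.

```python
import ipaddress
from typing import Iterable, List, Tuple


def clean_cidrs(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (cleaned CIDRs, rejected entries) for a list of input lines."""
    v4, v6, rejected = [], [], []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines
        try:
            # Assumption: host bits are tolerated (10.0.0.1/24 -> 10.0.0.0/24)
            # and a bare address is treated as a /32 or /128.
            net = ipaddress.ip_network(entry, strict=False)
        except ValueError:
            rejected.append(entry)  # not a valid CIDR: drop and report
            continue
        (v4 if net.version == 4 else v6).append(net)

    # collapse_addresses removes duplicates and contained ranges and merges
    # adjacent networks; it must be called separately per IP version.
    cleaned = [
        str(net)
        for group in (v4, v6)
        for net in ipaddress.collapse_addresses(group)
    ]
    return cleaned, rejected


if __name__ == "__main__":
    sample = [
        "192.168.0.0/24",
        "192.168.1.0/24",      # adjacent to the first: merges into a /23
        "192.168.0.0/25",      # contained in the first
        "192.168.0.0/24",      # exact duplicate
        "not-a-cidr",          # rejected
        "2001:db8::/33",
        "2001:db8:8000::/33",  # adjacent: merges into 2001:db8::/32
    ]
    cleaned, rejected = clean_cidrs(sample)
    print(cleaned)   # ['192.168.0.0/23', '2001:db8::/32']
    print(rejected)  # ['not-a-cidr']
```

In this toy input, seven lines reduce to two CIDRs; the 98-99% reductions mentioned above come from real ASN lookups, where duplication and overlap are far heavier.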

