Yes, bots being >50% of all traffic is bonkers, but at the same time I'm increasingly convinced that bot defenses are (largely? equally? also?) harmful to the overall ecosystem.
Everything bot detection relies on is basically ossification: TLS fingerprinting, protocol offerings and preferences, HTTP header presence and ordering, ...
Assuming / enforcing those is overall bad for the web, and all browsers and clients should really grease all of that.
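To make concrete how brittle that kind of signal is, here's a toy sketch; it's not any real detector's algorithm, and the headers, hashing, and "greasing" here are made up purely for illustration:

```python
# Toy illustration only: a naive "fingerprint" over HTTP header order,
# the kind of ossified signal bot detection leans on, and how a client
# that greases its header order breaks it on every request.
import hashlib
import random

def header_order_fingerprint(headers: list[tuple[str, str]]) -> str:
    """Hash only the sequence of header names, ignoring their values."""
    names = ",".join(name.lower() for name, _ in headers)
    return hashlib.sha256(names.encode()).hexdigest()[:16]

browser_like = [
    ("Host", "example.org"),
    ("User-Agent", "Mozilla/5.0"),
    ("Accept", "text/html"),
    ("Accept-Language", "en"),
]
print(header_order_fingerprint(browser_like))  # a stable "browser-shaped" value

# A greasing client shuffles headers whose order carries no semantics,
# so the fingerprint stops identifying anything.
greased = browser_like[:1] + random.sample(browser_like[1:], len(browser_like) - 1)
print(header_order_fingerprint(greased))
```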
It's not >50%, it's >98% and still getting worse.
What @sandro said: either remove every open interface that causes even a bit of load, or be down for as long as the scrapers keep scraping. E.g. no normal site visitor would ever try to diff every version of a file in some web-accessible repo against every other version; they'd diff once or twice, not in short succession, and would be done.
I've given up in a few cases and made things non-public so at least plain html can stay available.
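The behavioral gap really is that wide. In made-up numbers it's trivial to express; a minimal sketch, with the endpoint, window, and threshold invented for illustration:

```python
# Sketch of the behavioral distinction: a human hits a handful of
# expensive diff pages, a scraper enumerating every-version-against-
# every-version hits hundreds within minutes. All thresholds are
# illustrative, not tuned values.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 300      # look at the last five minutes
MAX_EXPENSIVE_HITS = 10   # more than this per window looks like a crawl

_recent: dict[str, deque] = defaultdict(deque)

def looks_like_scraper(client_ip: str, now: float | None = None) -> bool:
    """Record one expensive request (e.g. a repo diff) and judge the client."""
    now = time.time() if now is None else now
    hits = _recent[client_ip]
    hits.append(now)
    # Drop hits that have fallen out of the sliding window.
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_EXPENSIVE_HITS
```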
@spz @sandro I sympathize, and it's certainly up to each admin to choose what works for them. (I do think behavioral defenses and reputation scores have a better chance there than, say, static client fingerprinting.) I'm just pondering whether it's a losing battle that negatively impacts the open web.
(There are analogies to how spam protections and the rules set by the handful of big providers nowadays make running a mail server so much harder.)
What would be practical is not filtering visitors much (beyond IP address range bans, which are comparatively dirt cheap), but filtering on load: when load gets too high, visitors only get a 418 (or, if you're less peeved, a 503) telling them that AI scrapers suck and to come back when the stampede has moved on to other victims.
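The check itself is tiny; a minimal sketch of the idea as WSGI middleware rather than an Apache module, with the load threshold, Retry-After value, and wording all made up:

```python
# Shed visitors with a 503 whenever the machine's load average is too
# high; everything below the threshold passes through untouched.
import os

LOAD_THRESHOLD = 8.0  # 1-minute load average above which we shed visitors

class ShedWhenOverloaded:
    def __init__(self, app, threshold: float = LOAD_THRESHOLD):
        self.app = app
        self.threshold = threshold

    def __call__(self, environ, start_response):
        one_minute_load, _, _ = os.getloadavg()
        if one_minute_load > self.threshold:
            body = (b"503: AI scrapers are stampeding this site. "
                    b"Please come back once they have moved on.\n")
            start_response("503 Service Unavailable", [
                ("Content-Type", "text/plain"),
                ("Retry-After", "3600"),  # hint for well-behaved clients to back off
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return self.app(environ, start_response)
```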
Do you (or anyone else) know if there's an Apache module to do that?