AI crawlers eating your bandwidth? A practical guide to blocking crawler IPs with Nginx

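Over the past six months this site has been hammered by AI crawlers several times, and each time the monthly traffic quota was blown. My IP ban list has already grown to several hundred rules: any stray spider that burns a lot of traffic gets blacklisted outright. Below is a practical approach to blocking AI crawler IPs with Nginx, covering everything from identification to blocking.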

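AI crawlers differ from traditional search-engine spiders: many of them do not reliably honor robots.txt, some disguise their UA as a normal browser, their request rate is very high, and they scrape content specifically to train large models. Common AI crawlers include GPTBot, ChatGPT-User, Claude-Web, PerplexityBot, and Bytespider. Their IP ranges are relatively stable, so they can be blocked by combining IP lists with UA matching.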

Identifying AI crawlers

Crawler	User-Agent	Typical IP ranges (ASN)	Impact
GPTBot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)	AS8075 (Microsoft), AS14061 (DigitalOcean)	High
ChatGPT-User	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ChatGPT-User/1.0; +https://openai.com/chatgpt-user)	AS8075, AS16509 (AWS)	High
Claude-Web	Anthropic-ai	AS16509 (AWS), AS396982 (Google Cloud)	High
PerplexityBot	Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0)	AS16509, AS14061	Medium
Bytespider	Bytespider	AS15169 (Google), AS396982	Medium
Amazonbot	Mozilla/5.0 (compatible; Amazonbot/...)	AS16509	Medium
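
Before blocking anything, it helps to see which User-Agents actually dominate your own logs. The following is a minimal sketch, assuming Nginx writes its default "combined" log format; the function name top_user_agents is hypothetical and not part of any tool.

```shell
# Hypothetical helper: rank User-Agent strings by request count.
# Assumes the "combined" log format, where the User-Agent is the
# sixth field when each line is split on double quotes.
top_user_agents() {
    log_file="$1"
    limit="${2:-20}"
    awk -F'"' '{ print $6 }' "$log_file" | sort | uniq -c | sort -rn | head -n "$limit"
}

# Example: top_user_agents /var/log/nginx/access.log 10
```

Each output line is a request count followed by the User-Agent string, busiest first.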

Blocking with Nginx

Option 1: UA blocking (simplest)

Add the following to your Nginx configuration:

# Block AI crawler UAs
if ($http_user_agent ~* (GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|Bytespider|Amazonbot|anthropic|OpenAI)) {
    return 403;
}
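
To sanity-check the pattern before reloading Nginx, the same alternation can be tried against sample User-Agent strings from the shell. This is a sketch only: the variable AI_BOT_PATTERN and the function is_ai_bot_ua are hypothetical names, and grep -iE is used as a stand-in for Nginx's case-insensitive ~* operator.

```shell
# Same alternation as the Nginx rule; grep -i mirrors the case-insensitive match.
AI_BOT_PATTERN='GPTBot|ChatGPT-User|Claude-Web|PerplexityBot|Bytespider|Amazonbot|anthropic|OpenAI'

# Hypothetical helper: exit status 0 if the UA would match the rule, 1 otherwise.
is_ai_bot_ua() {
    printf '%s\n' "$1" | grep -qiE "$AI_BOT_PATTERN"
}

# Example:
#   is_ai_bot_ua 'Bytespider' && echo "would be blocked"
# After reloading Nginx, a live check could look like this (the URL is a placeholder):
#   curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot/1.0' https://your-site.example/
```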

Option 2: Blocking IP ranges (most effective)

Create /etc/nginx/ai-bot-deny.conf:

deny 20.191.0.0/16;    # OpenAI / Microsoft
deny 52.230.0.0/16;    # Microsoft Azure
deny 13.64.0.0/11;     # Microsoft Azure
deny 52.224.0.0/16;    # Microsoft Azure
deny 40.64.0.0/10;     # Microsoft Azure
deny 3.128.0.0/9;      # AWS
deny 54.144.0.0/12;    # AWS
deny 34.192.0.0/10;    # AWS
deny 35.152.0.0/13;    # Google Cloud
deny 34.64.0.0/10;     # Google Cloud
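# Cloudflare ranges: leave these two lines out if your own site is proxied through Cloudflare,
# otherwise every visitor arriving via Cloudflare is denied as well
deny 162.158.0.0/15;   # Cloudflare (some crawlers)
deny 104.16.0.0/12;    # Cloudflare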

Include it in nginx.conf:

http {
    include /etc/nginx/ai-bot-deny.conf;
    ...
}
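
Several of these ranges are very broad, so it can be worth checking whether a specific address from the access log really falls inside a given range. The following is a rough IPv4-only sketch in plain shell arithmetic; ip_to_int and ip_in_cidr are hypothetical helper names.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Exit status 0 if IPv4 address $1 lies inside CIDR block $2 (e.g. 20.191.0.0/16).
ip_in_cidr() {
    ip_int=$(ip_to_int "$1")
    net_int=$(ip_to_int "${2%/*}")
    prefix=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - $prefix)) & 0xFFFFFFFF ))
    [ $(( ip_int & mask )) -eq $(( net_int & mask )) ]
}

# Example: ip_in_cidr 20.191.45.7 20.191.0.0/16 && echo "covered by the deny rule"
```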

Option 3: Rate limiting (against aggressive scraping)

# Limit the request rate per IP (limit_req_zone goes in the http block)
limit_req_zone $binary_remote_addr zone=ai_bot_limit:10m rate=10r/s;

server {
    location / {
        limit_req zone=ai_bot_limit burst=20 nodelay;
        ...
    }
}
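
To pick a sensible rate value, it can help to look at how many requests individual clients actually send per second. The sketch below assumes the "combined" log format and only approximates what limit_req does, since Nginx tracks rates at a finer granularity than whole seconds; burst_offenders is a hypothetical name.

```shell
# Hypothetical helper: print "IP timestamp count" for every client that sent
# more than $2 requests within a single second (default 10, matching rate=10r/s).
# Assumes the combined log format: $1 is the client IP, $4 is "[dd/Mon/yyyy:HH:MM:SS".
burst_offenders() {
    log_file="$1"
    threshold="${2:-10}"
    awk -v max="$threshold" '
        { hits[$1 " " substr($4, 2)]++ }
        END {
            for (key in hits)
                if (hits[key] > max) print key, hits[key]
        }
    ' "$log_file" | sort -k3,3nr
}

# Example: burst_offenders /var/log/nginx/access.log 10
```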

Option 4: Cloudflare WAF rules (if you use CF)

Add the following to your Cloudflare firewall rules:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "Claude-Web") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Bytespider") or
(http.user_agent contains "anthropic") or
(http.user_agent contains "OpenAI")

Action: Block or JS Challenge

Recommended open-source lists

ai-robots.txt: https://github.com/ai-robots-txt/ai.robots.txt - community-maintained list of AI crawlers
Crawler-IP-Blocklist: https://github.com/mitchellkrogza/Crawler-IP-Blocklist - automatically updated crawler IP list
nginx-bad-bot-blocker: https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker - the "ultimate" bad bot blocker for Nginx

One-step update script

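#!/bin/bash
# /usr/local/bin/update-ai-bot-blocklist.sh
set -eu

BLOCKLIST_URL="https://raw.githubusercontent.com/mitchellkrogza/Crawler-IP-Blocklist/master/Crawler-IP-Blocklist.txt"
NGINX_CONF="/etc/nginx/ai-bot-deny.conf"

# Download the latest IP list; -f makes curl fail on HTTP errors instead of saving an error page
curl -fsS "$BLOCKLIST_URL" | grep -E '^[0-9]' | sed 's/^/deny /; s/$/;/' > "$NGINX_CONF.tmp"

# Refuse to install an empty list (failed download or changed upstream format)
[ -s "$NGINX_CONF.tmp" ] || { echo "Downloaded blocklist is empty, keeping old config" >&2; exit 1; }

# Back up the old config, then swap in the new one
[ -f "$NGINX_CONF" ] && cp "$NGINX_CONF" "$NGINX_CONF.bak"
mv "$NGINX_CONF.tmp" "$NGINX_CONF"

# Test the new config; reload on success, restore the backup on failure
if nginx -t; then
    systemctl reload nginx
    echo "AI Bot blocklist updated at $(date)"
else
    [ -f "$NGINX_CONF.bak" ] && cp "$NGINX_CONF.bak" "$NGINX_CONF"
    echo "nginx -t failed, previous blocklist restored" >&2
    exit 1
fi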

Add it to crontab to update weekly (Mondays at 03:00):

0 3 * * 1 /usr/local/bin/update-ai-bot-blocklist.sh >> /var/log/ai-bot-update.log 2>&1

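Verifying the results

Check the number of 403 status codes in the Nginx access.log
Monitor whether bandwidth usage drops
Check the User-Agent of the blocked IPs

Commands:

# Count blocked AI crawler requests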
grep ' 403 ' /var/log/nginx/access.log | grep -iE '(gptbot|claude|perplexity|bytespider)' | wc -l

# List the top 20 blocked IPs
grep ' 403 ' /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20
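
For the bandwidth check, a rough per-crawler byte count can be pulled from the same log. This is a sketch assuming the "combined" log format, where the tenth whitespace-separated field is the response body size; bot_bandwidth is a hypothetical name. Comparing its output before and after enabling the rules gives a quick view of whether traffic actually dropped.

```shell
# Hypothetical helper: total response bytes served to each crawler name given
# after the log file, assuming the combined log format ($10 is $body_bytes_sent).
bot_bandwidth() {
    log_file="$1"; shift
    for bot in "$@"; do
        grep -i -- "$bot" "$log_file" | awk -v name="$bot" '{ sum += $10 } END { print name, sum + 0 }'
    done
}

# Example: bot_bandwidth /var/log/nginx/access.log GPTBot Bytespider PerplexityBot
```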

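Caveats

Before blocking, confirm that an IP range really belongs to an AI crawler, to avoid hitting legitimate users
Review the effect of the blocks regularly and adjust the rules
Keep access open for search-engine spiders (Googlebot, Bingbot)
Consider robots.txt plus meta tags as a supplementary measure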
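
As a supplementary measure, matching robots.txt rules can be generated for the crawler names used in this article. This is a minimal sketch: print_ai_robots_txt is a hypothetical name, the output path in the example is an assumption, and only crawlers that honor robots.txt will respect these rules.

```shell
# Print robots.txt rules that disallow the crawler names listed in this article.
print_ai_robots_txt() {
    for bot in GPTBot ChatGPT-User Claude-Web PerplexityBot Bytespider Amazonbot; do
        printf 'User-agent: %s\nDisallow: /\n\n' "$bot"
    done
}

# Example (the web root path is an assumption):
#   print_ai_robots_txt >> /var/www/html/robots.txt
```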
