Using Webshare Proxies in Python Web Scrapers
Why do scrapers need proxy IPs?
Most websites have anti-bot mechanisms, and frequent requests from one address will get that IP banned. Using Webshare residential proxies lets you:
- Present the IP of a real residential device, helping you pass anti-bot checks
- Rotate IPs so no single address trips rate limits
- Access geo-restricted content
Using a Webshare proxy with the requests library
Basic usage

```python
import requests

PROXY_IP = "your_proxy_ip"
PORT = "80"
USERNAME = "your_username"
PASSWORD = "your_password"

# Embed the credentials in the proxy URL (http://user:pass@host:port)
proxy_url = f"http://{USERNAME}:{PASSWORD}@{PROXY_IP}:{PORT}"

proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

# httpbin.org/ip echoes the caller's IP, so you can verify
# that traffic really goes through the proxy
response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(response.json())
```
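When you make many requests through the same proxy, a `requests.Session` (a standard requests feature, not Webshare-specific) saves you from repeating the `proxies` argument on every call and reuses the underlying connection:

```python
import requests

# Placeholder credentials, as in the example above
PROXY_URL = "http://your_username:your_password@your_proxy_ip:80"

# A Session applies the same proxy settings to every request made through it
session = requests.Session()
session.proxies.update({"http": PROXY_URL, "https": PROXY_URL})

# Every call on the session is now routed through the proxy, e.g.:
# session.get("https://httpbin.org/ip", timeout=10)
```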
Fetching a proxy list dynamically via the Webshare API

```python
import random

import requests

API_KEY = "your_webshare_api_key"

def get_proxy_list():
    # "direct" mode returns proxies you connect to directly by IP:port
    resp = requests.get(
        "https://proxy.webshare.io/api/v2/proxy/list/",
        headers={"Authorization": f"Token {API_KEY}"},
        params={"mode": "direct", "page": 1, "page_size": 25},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def make_proxy_dict(p):
    proxy_url = f"http://{p['username']}:{p['password']}@{p['proxy_address']}:{p['port']}"
    return {"http": proxy_url, "https": proxy_url}

# Fetch the list once, then pick a random IP for every request
proxy_list = get_proxy_list()
for url in target_urls:  # target_urls: your list of pages to scrape
    proxy = make_proxy_dict(random.choice(proxy_list))
    response = requests.get(url, proxies=proxy, timeout=15)
    # handle the response...
```
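A loop like this assumes every request succeeds, but a proxy can time out or refuse the connection. One way to handle that is a small retry wrapper that switches to another random proxy on failure; this is a sketch, and `fetch_with_rotation`, the injectable `get` parameter, and the retry count are illustrative choices, not part of the Webshare API:

```python
import random

import requests

def fetch_with_rotation(url, proxy_pool, get=requests.get, max_retries=3, timeout=15):
    """Try a request through random proxies, switching IPs on failure.

    proxy_pool is a list of proxy dicts like {"http": ..., "https": ...}.
    The `get` callable is injectable so the logic is easy to test.
    """
    last_error = None
    for _ in range(max_retries):
        proxy = random.choice(proxy_pool)
        try:
            resp = get(url, proxies=proxy, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.RequestException as e:
            last_error = e  # bad proxy or bad response: try another IP
    raise last_error
```

Because any `requests.RequestException` (timeouts, proxy errors, HTTP errors after `raise_for_status`) triggers a retry with a freshly chosen proxy, one dead IP in the pool does not stop the crawl.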
Integrating Webshare proxies with Scrapy
Method 1: configure a single proxy in settings.py

```python
# settings.py
import os

PROXY_IP = "your_proxy_ip"
PROXY_PORT = "80"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

# Enable the built-in proxy downloader middleware
DOWNLOADER_MIDDLEWARES = {
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 110,
}

# HttpProxyMiddleware picks up the standard proxy environment variables,
# so export them here; alternatively, set request.meta["proxy"] per request
os.environ["http_proxy"] = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_IP}:{PROXY_PORT}"
os.environ["https_proxy"] = os.environ["http_proxy"]
```
Method 2: a custom middleware for rotation

```python
# middlewares.py
import random

import requests

class WebshareProxyMiddleware:
    API_KEY = "your_api_key"

    def __init__(self):
        # Fetch the proxy list once when the crawler starts
        self.proxies = self._fetch_proxies()

    def _fetch_proxies(self):
        resp = requests.get(
            "https://proxy.webshare.io/api/v2/proxy/list/",
            headers={"Authorization": f"Token {self.API_KEY}"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["results"]

    def process_request(self, request, spider):
        # Attach a random proxy to every outgoing request
        p = random.choice(self.proxies)
        request.meta["proxy"] = f"http://{p['username']}:{p['password']}@{p['proxy_address']}:{p['port']}"
```
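A custom middleware still has to be registered in settings.py before Scrapy will call it. A minimal sketch (the project module name `myproject` is an assumption; use your own project's package name):

```python
# settings.py: register the rotation middleware
# ("myproject" is a placeholder for your Scrapy project package)
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.WebshareProxyMiddleware": 350,
    # The built-in HttpProxyMiddleware (enabled by default at 750) must run
    # AFTER our middleware so the meta["proxy"] we set is actually applied;
    # it is listed here only to make the ordering explicit
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
}
```

Lower priority numbers run first for `process_request`, so 350 ensures the proxy is chosen before the built-in middleware applies it.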
Using a Webshare proxy with Playwright

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Proxy credentials are passed separately from the server URL
    browser = p.chromium.launch(
        proxy={
            "server": "http://your_proxy_ip:80",
            "username": "your_username",
            "password": "your_password",
        }
    )
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://httpbin.org/ip")
    print(page.inner_text("body"))
    browser.close()
```
Best practices

| Scenario | Recommended proxy type | Suggested configuration |
|---|---|---|
| Large-scale scraping | Rotating residential proxies | New IP per request |
| Maintaining a logged-in session | Sticky residential proxies | 30-minute sticky sessions |
| High-speed bulk requests | Datacenter proxies | Multithreaded concurrency |
| Price monitoring | Residential proxies | Rotate on a schedule |
Caveats
- Residential proxies are billed by traffic; avoid downloading large files (images/video) that burn through your bandwidth quota
- Space requests 1-3 seconds apart to mimic human behavior
- Set a reasonable timeout so a slow proxy cannot hang your program
- Catch proxy errors and switch automatically to a backup IP
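The 1-3 second pacing recommended above can be wrapped in a small helper; this is a sketch, and the function name and default bounds are illustrative:

```python
import random
import time

def polite_sleep(min_s=1.0, max_s=3.0):
    """Sleep for a random interval between requests to mimic human pacing."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

# Call between requests, e.g.:
# for url in target_urls:
#     response = requests.get(url, proxies=proxy, timeout=15)
#     polite_sleep()
```

A randomized delay is harder for anti-bot systems to fingerprint than a fixed one, since a constant interval between requests is itself a bot signature.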