Bypassing the Firewall for Python Web Scraping
Subaru Lai, 02 May 2018
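I have recently been working on a Kaggle competition whose training data can only be downloaded from outside the firewall. This post records how to set up a global proxy with shadowsocks (ss) and polipo, and how to configure the proxy for a Python scraper.
1. Configure the shadowsocks proxy
- Install shadowsocks (installing a client application also works):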
$ sudo pip install shadowsocks
- Configure shadowsocks (create a shadowsocks.json file with the following content):
{
  "server": "{your-server}",
  "server_port": 40002,
  "local_port": 1080,
  "password": "{your-password}",
  "timeout": 600,
  "method": "aes-256-cfb"
}
- Start the shadowsocks service (it must be started again after every reboot):
$ sudo sslocal -c shadowsocks.json -d start
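The configuration step above could also be scripted. The following is a minimal sketch, assuming the same keys and default values as the shadowsocks.json shown in this post; `build_ss_config`, `check_ss_config`, and `write_ss_config` are hypothetical helper names, not part of shadowsocks itself.

```python
import json
from pathlib import Path

# Keys taken from the shadowsocks.json shown in this post.
REQUIRED_KEYS = ("server", "server_port", "local_port",
                 "password", "timeout", "method")


def build_ss_config(server, password, server_port=40002, local_port=1080,
                    timeout=600, method="aes-256-cfb"):
    """Return the config as a dict; defaults mirror the values in this post."""
    return {
        "server": server,
        "server_port": server_port,
        "local_port": local_port,
        "password": password,
        "timeout": timeout,
        "method": method,
    }


def check_ss_config(config):
    """Raise ValueError if a key is missing or a port is out of range."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    for key in ("server_port", "local_port"):
        port = config[key]
        if not isinstance(port, int) or not 0 < port < 65536:
            raise ValueError(f"{key} must be an integer in 1..65535, got {port!r}")


def write_ss_config(path, config):
    """Validate the config, then write it as indented JSON."""
    check_ss_config(config)
    Path(path).write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
```

Calling `write_ss_config("shadowsocks.json", build_ss_config("{your-server}", "{your-password}"))` would produce a file that can be passed to `sslocal -c`.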
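2. Configure the global proxy
shadowsocks provides a SOCKS5 proxy, so an HTTP/HTTPS proxy has to be set up on top of it. polipo can do this by forwarding requests to the SOCKS5 port.
- Install polipo: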
$ sudo apt-get install polipo
- Edit the polipo configuration file /etc/polipo/config:
logSyslog = true
logFile = /var/log/polipo/polipo.log
proxyAddress = "0.0.0.0"
socksParentProxy = "127.0.0.1:1080"
socksProxyType = socks5
chunkHighMark = 50331648
objectHighMark = 16384
serverMaxSlots = 64
serverSlots = 16
serverSlots1 = 32
- Restart the polipo service:
$ sudo /etc/init.d/polipo restart
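- Set the HTTP proxy environment variables (they apply to the current shell only, so they must be set again in each new session or after a reboot; https_proxy is the variable that curl and similar tools read for HTTPS URLs):
$ export http_proxy="http://127.0.0.1:8123/"
$ export https_proxy="http://127.0.0.1:8123/"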
- Test:
$ curl www.google.com
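If the curl test fails, it can help to check each link of the chain separately. The following is a rough sketch, assuming sslocal listens on 127.0.0.1:1080 and polipo on 127.0.0.1:8123 as in this post; `port_is_open` and `proxy_status` are hypothetical helper names.

```python
import socket
import urllib.request


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def proxy_status(socks_port=1080, http_port=8123, host="127.0.0.1"):
    """Report whether both local proxies accept connections and which
    HTTP proxy Python picks up from the environment."""
    return {
        "sslocal_listening": port_is_open(host, socks_port),
        "polipo_listening": port_is_open(host, http_port),
        # getproxies() reads the *_proxy environment variables
        # (on macOS and Windows it may fall back to system settings).
        "env_http_proxy": urllib.request.getproxies().get("http"),
    }
```

Running `print(proxy_status())` in the same shell shows at a glance whether the problem is shadowsocks, polipo, or a missing `http_proxy` export. An open port only means something is listening there, not that the upstream server is reachable.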
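3. Configure the proxy in a Python scraper
Two approaches are shown: one uses the urllib library, the other uses the requests library. The headers dict sends a browser-like User-Agent; whether it is actually needed here still has to be investigated: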
import urllib.request as request
import requests

# polipo is a plain HTTP proxy, so both entries use an http:// proxy URL;
# HTTPS requests are tunnelled through it with CONNECT.
proxies = {
    'https': 'http://127.0.0.1:8123',
    'http': 'http://127.0.0.1:8123'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}
google_url = 'https://www.google.com'

print('--------------Using urllib--------------')
# install_opener makes every later urlopen() call go through the proxy.
opener = request.build_opener(request.ProxyHandler(proxies))
request.install_opener(opener)
req = request.Request(google_url, headers=headers)
response = request.urlopen(req)
print(response.read().decode())

print('--------------Using requests--------------')
response = requests.get(google_url, proxies=proxies)
print(response.text)
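Since the original goal was downloading training data, a file download through the same proxy could look roughly like the sketch below. It uses only the standard library and assumes polipo listens on 127.0.0.1:8123 as configured earlier; `make_opener` and `download` are hypothetical helper names. Unlike `install_opener`, the opener here is not installed globally, so other urllib calls in the same process are unaffected.

```python
import shutil
import urllib.request

# Assumption: polipo listens on 127.0.0.1:8123, as configured in this post.
PROXY_URL = "http://127.0.0.1:8123"


def make_opener(proxy_url=PROXY_URL, user_agent="Mozilla/5.0"):
    """Build an opener that routes http and https requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    # addheaders applies to every request made through this opener.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def download(url, dest, opener=None, timeout=30):
    """Stream url into the file dest and return dest."""
    opener = opener or make_opener()
    with opener.open(url, timeout=timeout) as response, open(dest, "wb") as out:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(response, out)
    return dest
```

A call such as `download(data_url, "train.zip")` would then fetch the file through polipo, where `data_url` stands for whatever download link the competition provides.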