Getting a Python Crawler Past the Firewall

Subaru Lai, 02 May 2018

I've recently been working on a kaggle competition whose training data has to be downloaded from behind the firewall. This post records how to set up a global proxy with shadowsocks (ss) and polipo, and how to configure the proxy for a Python crawler.

1. Configure the shadowsocks proxy

  • Install shadowsocks (a desktop client also works):
    $ sudo pip install shadowsocks
    
  • Configure shadowsocks (create a shadowsocks.json file with the following content):
    {
        "server": "{your-server}",
        "server_port": 40002,
        "local_port": 1080,
        "password": "{your-password}",
        "timeout": 600,
        "method": "aes-256-cfb"
    }
    
  • Start the shadowsocks service (must be re-run after a reboot):
    $ sudo sslocal -c shadowsocks.json -d start
    
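Once sslocal is running, a quick way to confirm it is up is to probe the local port (1080 here, matching the local_port value in shadowsocks.json). A minimal sketch:

```python
import socket

# Probe the local SOCKS5 port that sslocal should be listening on.
# 1080 matches the "local_port" value in shadowsocks.json above.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.settimeout(2)
    result = s.connect_ex(('127.0.0.1', 1080))

print('sslocal is listening' if result == 0 else 'nothing on port 1080')
```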

2. Configure a global proxy

shadowsocks provides a SOCKS5 proxy, but many tools expect an HTTP/HTTPS proxy. polipo can bridge the two by forwarding HTTP traffic to the local SOCKS5 port.

  • Install polipo:
    $ sudo apt-get install polipo
    
  • Edit polipo's configuration file /etc/polipo/config:
    logSyslog = true
    logFile = /var/log/polipo/polipo.log
    proxyAddress = "0.0.0.0"
    socksParentProxy = "127.0.0.1:1080"
    socksProxyType = socks5
    chunkHighMark = 50331648
    objectHighMark = 16384
    serverMaxSlots = 64
    serverSlots = 16
    serverSlots1 = 32
    
  • Restart the polipo service:
    $ sudo /etc/init.d/polipo restart
    
  • Set the HTTP proxy environment variables (polipo listens on port 8123 by default; must be re-run after a reboot):
    export http_proxy="http://127.0.0.1:8123/"
    export https_proxy="http://127.0.0.1:8123/"
    
  • Test that the proxy works:
    curl www.google.com
    
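With http_proxy exported, Python's urllib picks the proxy up from the environment automatically. A quick sketch to confirm what urllib sees:

```python
import os
import urllib.request

# urllib reads proxy settings from the environment via getproxies().
os.environ['http_proxy'] = 'http://127.0.0.1:8123/'
proxies_seen = urllib.request.getproxies()
print(proxies_seen)  # should include an 'http' entry pointing at polipo
```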

3. Setting the proxy in a Python crawler

Two approaches follow: one uses the urllib standard library, the other the requests library. The headers dict sets a browser User-Agent; exactly how necessary it is here still needs further investigation:

import urllib.request as request
import requests

proxies = {
    'https': 'http://127.0.0.1:8123',  # polipo is an HTTP proxy, so the scheme is http even for https traffic
    'http': 'http://127.0.0.1:8123'
}
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}

print('-------------- using urllib --------------')
google_url = 'https://www.google.com'
opener = request.build_opener(request.ProxyHandler(proxies))
request.install_opener(opener)

req = request.Request(google_url, headers=headers)
response = request.urlopen(req)

print(response.read().decode())

print('-------------- using requests --------------')
response = requests.get(google_url, proxies=proxies, headers=headers)
print(response.text)
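As an aside, requests can also talk to the SOCKS5 port directly and skip polipo entirely, provided the PySocks extra is installed (pip install requests[socks]). A sketch of that setup; the socks5h scheme (as opposed to socks5) routes DNS lookups through the proxy as well:

```python
import requests

# Point requests straight at the shadowsocks local port (no polipo needed).
# Requires: pip install requests[socks]
socks_proxies = {
    'http': 'socks5h://127.0.0.1:1080',
    'https': 'socks5h://127.0.0.1:1080',
}

# Uncomment once sslocal is running:
# response = requests.get('https://www.google.com', proxies=socks_proxies)
# print(response.status_code)
```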