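The last two projects I worked on both required analyzing third-party data, which meant collecting it first. pyspider is a handy tool for small and medium-sized crawling projects.
Official documentation: http://docs.pyspider.org/en/latest/
First, use virtualenv to create an isolated environment, and use Python 3 to avoid a few pitfalls: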
mkdir -p /data/virtualenv/pyspider
cd /data/virtualenv/pyspider
virtualenv -p /usr/local/bin/python3 py3env
source ./py3env/bin/activate
pip install pyspider
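Before running pip install, it can be worth confirming that the active interpreter really belongs to the isolated environment and not to the system Python. The following is a minimal sketch; the helper name in_virtualenv is hypothetical, and it relies on sys.real_prefix (set by older virtualenv releases) or on sys.base_prefix differing from sys.prefix (venv and newer virtualenv).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv make sys.prefix differ from sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run it with the python found on PATH after source ./py3env/bin/activate; the reported prefix should then point inside /data/virtualenv/pyspider/py3env.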
If Python is upgraded afterwards (brew upgrade), starting pyspider fails with the following error:
(py3env) hulupiao:pyspider$ pyspider
dyld: Library not loaded: @executable_path/../.Python
Referenced from: /data/virtualenv/pyspider/py3env/bin/python3.6
Reason: image not found
Abort trap: 6
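The fix is to recreate the virtual environment:
deactivate # leave the virtual environment
find py3env -type l -delete # delete the stale symlinks
virtualenv -p /usr/local/bin/python3 py3env # recreate the virtual environment
source ./py3env/bin/activate # activate the virtual environment
pyspider # start pyspider again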
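The find command above deletes every symlink under py3env, which is fine because virtualenv recreates them. To see first which links the upgrade actually broke, a small script along these lines could be used. It is only a sketch; the function name find_broken_symlinks is made up for illustration.

```python
import os


def find_broken_symlinks(root):
    """Return paths under root that are symlinks whose target no longer exists."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in find_broken_symlinks("py3env"):
        print(path)
```

If the list includes entries such as py3env/.Python, the environment is pointing at an interpreter that the upgrade removed, which matches the dyld error shown earlier.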
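The collected data also needs to be stored in MySQL, so a driver has to be installed.
Official MySQL Connector/Python documentation: https://dev.mysql.com/doc/connector-python/en/
It did not list a pip package, but a suggestion I found in the comments section worked when I tried it: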
pip install mysql-connector-python-rf
To write the results to the database, override the on_result function:
from pyspider.libs.base_handler import *
# Load the MySQL driver
import mysql.connector


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        # Note: on_result below expects the keys name, type, city, ...;
        # this sample returns only url and title, so adapt one to the other.
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    # Override the function that stores results
    def on_result(self, result):
        if not result:
            return
        cnx = mysql.connector.connect(user='root', database='test')
        cursor = cnx.cursor()
        pre_sql = ("REPLACE INTO mall "
                   "(`name`, `type`,`city`, `start_time`, `area`, `floors`, `is_chain`) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        data = (result['name'], result['type'], result['city'], result['start_time'],
                result['area'], result['floors'], result['is_chain'])
        cursor.execute(pre_sql, data)
        # print(cursor.lastrowid)
        cnx.commit()
        cursor.close()
        cnx.close()
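Because on_result indexes the result dict directly, a page that is missing one of the fields raises a KeyError. One possible way to guard against that, and to keep the column list in a single place, is sketched below. The helper names build_replace_sql and row_from_result and the COLUMNS constant are hypothetical and not part of pyspider or the MySQL driver; only the table and column names come from the handler above.

```python
# Columns of the `mall` table, in the order used by the handler above.
COLUMNS = ("name", "type", "city", "start_time", "area", "floors", "is_chain")


def build_replace_sql(table, columns):
    """Build a parameterized REPLACE statement for the given table and columns."""
    column_list = ", ".join("`%s`" % c for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return "REPLACE INTO %s (%s) VALUES (%s)" % (table, column_list, placeholders)


def row_from_result(result, columns):
    """Return the values for columns as a tuple, or None if the result is empty or a key is missing."""
    if not result or any(c not in result for c in columns):
        return None
    return tuple(result[c] for c in columns)


if __name__ == "__main__":
    sample = dict.fromkeys(COLUMNS, "x")
    print(build_replace_sql("mall", COLUMNS))
    print(row_from_result(sample, COLUMNS))
```

Inside on_result, the row would be computed first and the insert skipped when it is None; otherwise the pair can be passed to cursor.execute(build_replace_sql("mall", COLUMNS), row). Values still travel as query parameters, so the driver handles escaping.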
A problem I ran into in earlier experiments:
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-Qtbagy/pycurl/
# pip install pyspider
Collecting pyspider
Using cached pyspider-0.3.9.tar.gz
Collecting Flask>=0.10 (from pyspider)
Using cached Flask-0.12.1-py2.py3-none-any.whl
Collecting Jinja2>=2.7 (from pyspider)
Using cached Jinja2-2.9.6-py2.py3-none-any.whl
Requirement already satisfied: chardet>=2.2 in /usr/lib/python2.7/dist-packages (from pyspider)
Collecting cssselect>=0.9 (from pyspider)
Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting lxml (from pyspider)
Using cached lxml-3.7.3-cp27-cp27mu-manylinux1_x86_64.whl
Collecting pycurl (from pyspider)
Using cached pycurl-7.43.0.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 823, in <module>
ext = get_extension(sys.argv, split_extension_source=split_extension_source)
File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 497, in get_extension
ext_config = ExtensionConfiguration(argv)
File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 71, in __init__
self.configure()
File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 107, in configure_unix
raise ConfigurationError(msg)
__main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-Qtbagy/pycurl/
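Solution: install the distribution's prebuilt pycurl package, which avoids building pycurl from source (the build failed because curl-config could not be run):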
apt-get install python-pycurl
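The traceback shows that pycurl's setup.py shells out to curl-config, so checking whether that tool is on PATH tells you in advance whether a source build can get past this step. Below is a minimal sketch, assuming Python 3 (shutil.which is not available in Python 2.7); the function name tool_on_path is made up for illustration.

```python
import shutil


def tool_on_path(name):
    """Return True if an executable called name can be found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if tool_on_path("curl-config"):
        print("curl-config found; a source build of pycurl can query it")
    else:
        print("curl-config not found; install the libcurl development "
              "package or a prebuilt pycurl package first")
```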