Installing and using pyspider

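The last two projects I worked on both involved analyzing third-party data, so I needed a way to collect it. pyspider is a handy crawling tool for small and medium-sized projects.
Official documentation: http://docs.pyspider.org/en/latest/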

First, isolate the installation in a virtualenv, and use Python 3 to avoid a number of pitfalls.

mkdir -p /data/virtualenv/pyspider
cd /data/virtualenv/pyspider
virtualenv -p /usr/local/bin/python3 py3env
source ./py3env/bin/activate
pip install pyspider
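
Before running pip, it can help to confirm that the active interpreter really is the one inside py3env. The snippet below is a minimal sketch of such a check; the helper name in_virtualenv is my own and is not part of pyspider or virtualenv.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.base_prefix differ from sys.prefix instead.
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside virtualenv:", in_virtualenv())
```

Run it with the python from the activated environment; the printed interpreter path should point into py3env.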

If Python is upgraded afterwards (for example with brew upgrade), pyspider fails to start with the following error:

(py3env) hulupiao:pyspider$ pyspider
dyld: Library not loaded: @executable_path/../.Python
  Referenced from: /data/virtualenv/pyspider/py3env/bin/python3.6
  Reason: image not found
Abort trap: 6

The fix is to rebuild the virtual environment:

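deactivate  # leave the virtual environment
find py3env -type l -delete  # delete the stale symlinks
virtualenv -p /usr/local/bin/python3 py3env  # recreate the virtual environment
source ./py3env/bin/activate  # activate the virtual environment
pyspider  # start pyspider again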
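
To see which links an upgrade actually broke before deleting anything, a small script can list the dangling ones. This is only a sketch: the function name find_broken_symlinks is hypothetical, it assumes a POSIX filesystem, and it is narrower than the find command above, which deletes every symlink under py3env whether or not it is broken.

```python
import os


def find_broken_symlinks(root):
    """Return paths under root that are symlinks whose target no longer exists."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() inspects the link itself; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in find_broken_symlinks("py3env"):
        print(path)
```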

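I also needed to store the data in MySQL, which requires a driver.
Official MySQL Connector/Python documentation: https://dev.mysql.com/doc/connector-python/en/
The documentation did not give a pip package, but a comment there suggested the following, and it worked when I tried it: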

pip install mysql-connector-python-rf

To write results to the database, override the on_result function:

from pyspider.libs.base_handler import *
# load the MySQL driver
import mysql.connector

class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

    # Override on_result to write each result to MySQL.
    # Note: the keys read here must match the dict returned by detail_page;
    # the sample detail_page above only returns "url" and "title".
    def on_result(self, result):
        if not result:
            return
        cnx = mysql.connector.connect(user='root', database='test')
        cursor = cnx.cursor()
        pre_sql = ("REPLACE INTO mall "
                   "(`name`, `type`, `city`, `start_time`, `area`, `floors`, `is_chain`) "
                   "VALUES (%s, %s, %s, %s, %s, %s, %s)")
        data = (result['name'], result['type'], result['city'], result['start_time'],
                result['area'], result['floors'], result['is_chain'])
        cursor.execute(pre_sql, data)
        # print(cursor.lastrowid)
        cnx.commit()
        cursor.close()
        cnx.close()
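
Writing out the column list by hand gets error-prone once the table has many fields. One possible way to keep the statement and the values in sync is to generate both from the result dict, as in the sketch below. The helper build_replace is hypothetical and not part of pyspider or mysql.connector. Only the values are passed as parameters; the table and column names are interpolated into the SQL text, so they must come from your own code and never from scraped data.

```python
def build_replace(table, row, columns):
    """Build a parameterized REPLACE statement and its values from a result dict."""
    # Fail early with a readable message when the crawler result lacks a field.
    missing = [c for c in columns if c not in row]
    if missing:
        raise KeyError("result is missing fields: " + ", ".join(missing))
    column_list = ", ".join("`{}`".format(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "REPLACE INTO `{}` ({}) VALUES ({})".format(table, column_list, placeholders)
    # Values stay separate from the SQL text so the driver can escape them.
    return sql, tuple(row[c] for c in columns)
```

Inside on_result the returned pair could then be passed straight to cursor.execute(sql, params).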

A problem I ran into in an earlier attempt:
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-Qtbagy/pycurl/

# pip install pyspider
Collecting pyspider
  Using cached pyspider-0.3.9.tar.gz
Collecting Flask>=0.10 (from pyspider)
  Using cached Flask-0.12.1-py2.py3-none-any.whl
Collecting Jinja2>=2.7 (from pyspider)
  Using cached Jinja2-2.9.6-py2.py3-none-any.whl
Requirement already satisfied: chardet>=2.2 in /usr/lib/python2.7/dist-packages (from pyspider)
Collecting cssselect>=0.9 (from pyspider)
  Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting lxml (from pyspider)
  Using cached lxml-3.7.3-cp27-cp27mu-manylinux1_x86_64.whl
Collecting pycurl (from pyspider)
  Using cached pycurl-7.43.0.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 823, in <module>
        ext = get_extension(sys.argv, split_extension_source=split_extension_source)
      File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 497, in get_extension
        ext_config = ExtensionConfiguration(argv)
      File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 71, in __init__
        self.configure()
      File "/tmp/pip-build-Qtbagy/pycurl/setup.py", line 107, in configure_unix
        raise ConfigurationError(msg)
    __main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-Qtbagy/pycurl/

Solution:

apt-get install python-pycurl
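
The traceback shows that the pycurl build stops because it cannot run curl-config, a helper that normally ships with the libcurl development files. A quick way to check whether it is available before retrying pip is sketched below; the function name curl_config_path is my own.

```python
import shutil


def curl_config_path():
    """Return the full path of curl-config if it is on PATH, otherwise None."""
    return shutil.which("curl-config")


if __name__ == "__main__":
    path = curl_config_path()
    if path:
        print("curl-config found at", path)
    else:
        print("curl-config not found; pycurl cannot be built from source")
```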
