Crawling with Scrapy

1. Install wheel
    pip install wheel
2. Install lxml
    https://pypi.python.org/pypi/lxml/4.1.0
3. Install pyOpenSSL
    https://pypi.python.org/pypi/pyOpenSSL/17.5.0
4. Install Twisted
    https://www.lfd.uci.edu/~gohlke/pythonlibs/
5. Install pywin32
    https://sourceforge.net/projects/pywin32/files/
6. Install scrapy
    pip install scrapy
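
After the steps above, a quick importability check can catch a missing dependency before the first crawl fails. This is a minimal sketch using only the standard library. The helper name missing_packages is made up for illustration, and the list uses the names these packages are normally imported under (OpenSSL for pyOpenSSL, win32api for pywin32, which exists on Windows only):

```python
import importlib.util

# Import names of the packages installed above (win32api is Windows-only).
REQUIRED = ["wheel", "lxml", "OpenSSL", "twisted", "win32api", "scrapy"]


def missing_packages(names):
    """Return the names from `names` that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Any name it reports points back to the matching install step.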

Create the project

scrapy startproject Spider

Create the spider

cd Spider
scrapy genspider meiju meijutt.com

Run the spider

scrapy crawl meiju --nolog

POST requests

Comment out start_urls and override the start_requests method. Point the request's callback at parse, so the response can then be parsed for the content we want as usual.

def start_requests(self):
    yield scrapy.FormRequest(
        url='https://www.chainfor.com/home/list/news/data.do?',
        formdata={
            'pageNo': '1',
            'device_type': '0'
        },
        callback=self.parse
    )
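
For reference, FormRequest sends the formdata fields as a URL-encoded body using the POST method. The sketch below reproduces that encoding with the standard library only, so it runs without Scrapy. build_form_post is a hypothetical helper, and the request is only built, never sent:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, formdata):
    """Build (but do not send) a POST request carrying URL-encoded form fields."""
    body = urlencode(formdata).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")


req = build_form_post(
    "https://www.chainfor.com/home/list/news/data.do?",
    {"pageNo": "1", "device_type": "0"},
)
print(req.get_method(), req.data)
```

Seeing the raw body this way helps when comparing against the request shown in the browser's network panel.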

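News sites: fetch the list links first, then the content

On news sites the first page is usually a list, and the article content has to be fetched by following the links in that list. Inside parse, issue a Request for each extracted link and, as before, point its callback at a custom callback function.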

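def parse(self, response):
    links = []  # placeholder: fill with the news list links extracted from response
    for url in links:
        # The Request has to be yielded, otherwise Scrapy never schedules it
        yield scrapy.Request(url, callback=self.parse_dir_contents)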
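
One thing to watch in this step: links scraped from a list page are often relative, while scrapy.Request expects an absolute URL. Inside a spider, response.urljoin takes care of this. The sketch below shows the same resolution with the standard library's urljoin, so it runs on its own. The page URL and hrefs are invented sample values, and absolute_links is a hypothetical helper:

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve each href found on page_url into an absolute URL."""
    return [urljoin(page_url, href) for href in hrefs]


links = absolute_links(
    "https://example.com/news/list.html",
    ["/news/1001.html", "1002.html", "https://example.com/news/1003.html"],
)
for url in links:
    print(url)
```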

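# Requires at the top of the file: from lxml import etree

def parse_dir_contents(self, response):
    item = Item()
    item["name"] = response.xpath("//div[@class='m-infor rt']/div[@class='m-i-article']/h1/text()").extract()[0].strip()
    # idx is this link's position in the list. It has to be passed into the
    # callback as an extra argument (covered at the end of this section).
    item["time"] = self.validTime[idx]
    item["img"] = self.validImgs[idx]
    content = response.xpath("//div[@class='m-infor rt']/div[@class='m-i-article']/div[@class='m-i-bd']/*")

    newsContent = ''
    for p in content:
        # p is a Selector; p.root is the underlying lxml element that etree can serialize
        temp = etree.tostring(p.root, encoding='utf-8').decode('utf-8')
        newsContent += temp
    item["content"] = newsContent
    yield item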

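To select every tag inside the content tag, end the xpath expression with '/*'.
When a tag object is converted to a string, the result is bytes and has to be decoded.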
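
Both notes can be tried outside a spider. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so it runs without extra installs; its path syntax is only a subset of XPath). The sample markup and the inner_html helper are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Invented sample markup, shaped like the article page in the spider above.
SAMPLE = (
    "<div class='m-i-article'>"
    "<h1>Title</h1>"
    "<div class='m-i-bd'><p>First</p><p>Second <b>café</b></p></div>"
    "</div>"
)


def inner_html(root, path):
    """Serialize every element matched by path and join the pieces into one str."""
    parts = []
    for child in root.findall(path):
        raw = ET.tostring(child, encoding="utf-8")  # bytes, not str
        parts.append(raw.decode("utf-8"))           # decode before concatenating
    return "".join(parts)


# The trailing /* selects all direct child tags of the content div.
html = inner_html(ET.fromstring(SAMPLE), "./div[@class='m-i-bd']/*")
print(html)
```

Skipping the decode step would fail when the bytes are appended to a str, which is why the spider decodes each piece.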

According to what I found online, extra arguments can be passed to a custom callback like this:

scrapy.Request(url, callback=lambda response, idx=i: self.parse_dir_contents(response, idx))

def parse_dir_contents(self, response, idx):
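
The idx=i default argument is what makes this work: it freezes the current value of i when the lambda is created. A small standalone sketch of the difference, with a stand-in function in place of the real spider callback (the URLs and the "resp" value are placeholders, and no Scrapy is involved):

```python
from functools import partial


def parse_dir_contents(response, idx):
    """Stand-in for the spider callback: just report what it received."""
    return (response, idx)


urls = ["a.html", "b.html", "c.html"]

# idx=i is evaluated when each lambda is created, so every callback keeps its own index.
bound = [lambda response, idx=i: parse_dir_contents(response, idx) for i in range(len(urls))]

# Without the default argument, i is looked up when the lambda runs: all see the last value.
late = [lambda response: parse_dir_contents(response, i) for i in range(len(urls))]

# functools.partial also binds the value immediately.
partials = [partial(parse_dir_contents, idx=i) for i in range(len(urls))]

print([cb("resp")[1] for cb in bound])
print([cb("resp")[1] for cb in late])
print([cb("resp")[1] for cb in partials])
```

Scrapy 1.7 and later also accept a cb_kwargs dictionary on Request for the same purpose, which avoids the lambda altogether.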