Scrapy case study: crawling list pages and detail pages of dushu.com (读书网)
Case name: crawling the list pages and detail pages of dushu.com
Requirements:
1. From the "Chinese contemporary fiction" list pages on dushu.com, scrape each book's title, author, cover image, and summary.
2. From each book's detail page, scrape the price, publisher, and publication date.
3. Save the scraped data to a database.
Analysis:
The key is the rules attribute of the CrawlSpider:
1. rules is a tuple or list containing Rule objects.
2. A Rule expresses one crawling rule and takes parameters such as a LinkExtractor, callback, and follow.
3. LinkExtractor: a link extractor that matches URLs via regular expressions or XPath.
4. callback: the callback invoked on the response of each URL the link extractor finds; it may be omitted, in which case the response is not processed by any callback.
5. follow: whether the responses of the extracted URLs are themselves fed back through the rules for further link extraction; True means yes, False means no.
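The matching that LinkExtractor's `allow` parameter performs can be illustrated with the standard library: `allow` is a regular expression tested against candidate URLs found on the page. A minimal sketch, assuming (hypothetically; not taken from the site) that the list pages paginate with URLs like `/book/1188_2.html` while detail pages look like `/book/13104.html`:

```python
import re

# Hypothetical pagination pattern for the list pages; the category id
# 1188 and the URL shapes below are assumptions for illustration only.
PAGE_PATTERN = re.compile(r"/book/1188_\d+\.html")

urls = [
    "https://www.dushu.com/book/1188_2.html",   # list page, page 2
    "https://www.dushu.com/book/1188_13.html",  # list page, page 13
    "https://www.dushu.com/book/13104.html",    # a detail page
]

# This mirrors what LinkExtractor(allow=r"/book/1188_\d+\.html") would keep.
matched = [u for u in urls if PAGE_PATTERN.search(u)]
print(matched)
```

With `follow=True` on such a rule, every matched list page is itself searched again with the same rules, which is how the spider walks through all pagination links.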
1. Get the request URL
2. Preparation (settings.py)
# settings.py
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
# ROBOTSTXT_OBEY = True  # left commented out so the crawl is not blocked by robots.txt
ITEM_PIPELINES = {
    'scrapy_readbook.pipelines.ScrapyReadbookPipeline': 300,
}
3. Parse the list-page data
# Inside the list-page callback: each <li> under the book list is one book entry
img_list = response.xpath('//div[@class="bookslist"]//li')
for img in img_list:
    src = img.xpath('.//img/@data-original').extract_first()  # cover image (lazy-loaded)
    name = img.xpath('.//img/@alt').extract_first()           # book title
    author = img.xpath('.//p[1]/text()').extract_first()
    detail = img.xpath('.//p[2]/text()').extract_first()
    herf = img.xpath('.//a/@href').extract_first()            # relative detail-page link
    print('Title:', name)
    print('Author:', author)
    print('Summary:', detail)
    print('Image:', src)
    print('Detail link:', herf)
4. Parse the corresponding detail page
    item = ScrapyReadbookItem()
    item['name'] = name
    item['author'] = author
    item['detail'] = detail
    item['src'] = src
    # Only build the absolute URL when a link was actually extracted,
    # otherwise the string concatenation would raise a TypeError on None
    if herf:
        item['herf'] = 'https://www.dushu.com' + herf
        # Request the detail page, handing the half-filled item over via meta
        yield scrapy.Request(item['herf'], callback=self.parse_detail, meta={"item": item})
def parse_detail(self, response):
    item = response.meta["item"]  # the item handed over from the list-page callback
    print('+++++++++++ detail page +++++++++++')
    title = response.xpath("//div[@class='book-title']/h1/text()").extract_first()
    price = response.xpath("//div[@class='book-details']/div/p[@class='price']/span/text()").extract_first()
    cbs = response.xpath("//div[@class='book-details']/div/table//tr[2]/td[2]/text()").extract_first()  # publisher
    cb_time = response.xpath("//div[@class='book-details']/table//tr[1]/td[4]/text()").extract_first()  # publication date
    print('Title:', title)
    print('Price:', price)
    print('Publisher:', cbs)
    print('Publication date:', cb_time)
    print('=======================================')
    item['price'] = price
    item['cbs'] = cbs
    item['cb_time'] = cb_time
    yield item
5.items.py
class ScrapyReadbookItem(scrapy.Item):
    name = scrapy.Field()
    src = scrapy.Field()
    author = scrapy.Field()
    detail = scrapy.Field()
    herf = scrapy.Field()
    price = scrapy.Field()
    cbs = scrapy.Field()
    cb_time = scrapy.Field()
6. Save to the database (pipelines.py)
import pymysql

class ScrapyReadbookPipeline:
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='wx990826',
            db='dushu',
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized insert; pymysql does the value escaping
        sqli = ("insert into zgddxs(name, author, src, detail, herf, price, cbs, cb_time) "
                "values(%s, %s, %s, %s, %s, %s, %s, %s)")
        self.cur.execute(sqli, (item['name'], item['author'], item['src'], item['detail'],
                                item['herf'], item['price'], item['cbs'], item['cb_time']))
        self.conn.commit()
        print('Saved')
        return item

    def close_spider(self, spider):
        # Release the cursor and connection when the spider finishes
        self.cur.close()
        self.conn.close()
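Before the pipeline can insert rows, the zgddxs table must already exist in the dushu database. A possible schema matching the columns of the insert statement above (the column types and lengths are assumptions, not taken from the source):

```sql
-- Hypothetical schema; adjust types/lengths to your needs
CREATE TABLE IF NOT EXISTS zgddxs (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(255),
    author  VARCHAR(255),
    src     VARCHAR(512),
    detail  TEXT,
    herf    VARCHAR(512),
    price   VARCHAR(64),
    cbs     VARCHAR(255),
    cb_time VARCHAR(64)
);
```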
Run result: