
Scrapy Case Study: Crawling the List and Detail Pages of Dushu.com (读书网)

news 2025/7/13 3:38:32

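Case name: Crawling the list and detail pages of Dushu.com

Requirements:

1. From the list pages of the Chinese contemporary fiction category on Dushu.com, scrape each book's title, author, cover image, and description.

2. From the detail page linked from each list entry, scrape the price, publisher, and publication date.

3. Save the scraped data to a database.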

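Analysis:

The key part is rules:

1. rules is a tuple or list of Rule objects.

2. A Rule describes one crawling rule; its parameters include a LinkExtractor, callback, and follow.

3. LinkExtractor: the link extractor. It matches URLs by regular expression (allow / deny) and can be limited to regions of the page with XPath (restrict_xpaths).

4. callback: the callback that handles the response for each URL the link extractor extracts. It is optional; without one, the response is not passed to any callback.

5. follow: whether the responses for the extracted URLs are themselves run through rules again to extract more links. True means they are, False means they are not.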

1. Get the request URL

2. Prepare the project settings (settings.py)

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"

# ROBOTSTXT_OBEY = True

ITEM_PIPELINES = {
    'scrapy_readbook.pipelines.ScrapyReadbookPipeline': 300,
}

3. Parse the list-page data

# Inside the spider's list-page callback
img_list = response.xpath('//div[@class="bookslist"]')
for img in img_list:
    src = img.xpath('.//img/@data-original').extract_first()
    name = img.xpath('.//img/@alt').extract_first()
    author = img.xpath('.//p[1]/text()').extract_first()
    detail = img.xpath('.//p[2]/text()').extract_first()
    herf = img.xpath('.//a/@href').extract_first()
    print('Title:', name)
    print('Author:', author)
    print('Description:', detail)
    print('Image:', src)
    print('Detail page link:', herf)

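4. Parse the corresponding detail pages

    # Still inside the for loop of the list-page callback
    item = ScrapyReadbookItem()
    item['name'] = name
    item['author'] = author
    item['detail'] = detail
    item['src'] = src
    # item = ScrapyReadbookItem(name=name, src=src, author=author, detail=detail, herf=herf)
    # if type(item["herf"]) == str:
    if herf:
        # Build the absolute URL only when a link was found;
        # concatenating a str with None would raise TypeError.
        url = 'https://www.dushu.com' + herf
        item['herf'] = url
        # Request the detail page and pass the partly filled item along
        yield scrapy.Request(url, callback=self.parse_detail, meta={"item": item})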
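Prefixing the site root works when every href is root-relative. A slightly more general option, sketched here with the standard library, is urljoin, which also handles hrefs that are already absolute. Inside a Scrapy callback, response.urljoin(herf) does the same job using the response's own URL as the base. The page URL and book paths below are hypothetical.

```python
from urllib.parse import urljoin

# Assumed URL of the list page the href was found on.
PAGE_URL = "https://www.dushu.com/book/1188_1.html"


def absolute_url(page_url, href):
    """Return an absolute URL for href, or None when no link was extracted."""
    if not href:
        return None
    return urljoin(page_url, href)


detail_url = absolute_url(PAGE_URL, "/book/100001/")  # hypothetical book path
missing_url = absolute_url(PAGE_URL, None)
```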

    def parse_detail(self, response):
        # Retrieve the item that the list-page callback passed via meta
        item = response.meta["item"]
        # Get the detail-page fields
        # item["price"] = response.xpath("//div[@class='book-details']//p[@class='price']/text()").extract()
        # item["cbs"] = response.xpath("//div[@class='book-details']/div/table//tr[2]/td[2]/text()").extract()
        # item["cb_time"] = response.xpath("//div[@class='book-details']/table//tr[1]/td[4]/text()").extract()
        print('+++++++++++ Detail page +++++++++++')
        title = response.xpath("//div[@class='book-title']/h1/text()").extract_first()
        price = response.xpath("//div[@class='book-details']/div/p[@class='price']/span/text()").extract_first()
        cbs = response.xpath("//div[@class='book-details']/div/table//tr[2]/td[2]/text()").extract_first()
        cb_time = response.xpath("//div[@class='book-details']/table//tr[1]/td[4]/text()").extract_first()
        print('Title:', title)
        print('Price:', price)
        print('Publisher:', cbs)
        print('Publication date:', cb_time)
        print('=======================================')
        item['price'] = price    # price
        item['cbs'] = cbs        # publisher
        item['cb_time'] = cb_time  # publication date
        yield item
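extract_first() returns None when an XPath matches nothing, and text nodes often carry surrounding whitespace, so it can help to normalize values before they reach the database. The helpers below are a minimal sketch and not part of the original project. The function names are hypothetical, and the price format (a currency sign followed by a decimal number, such as "¥39.80") is an assumption.

```python
import re
from decimal import Decimal


def clean_text(value):
    """Strip whitespace; map None and empty strings to None."""
    if value is None:
        return None
    text = value.strip()
    return text or None


def parse_price(value):
    """Pull the first decimal number out of a price string, or return None."""
    text = clean_text(value)
    if text is None:
        return None
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group())
```

In parse_detail these could be applied just before the assignments, for example item['cbs'] = clean_text(cbs).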

5.items.py

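import scrapy


class ScrapyReadbookItem(scrapy.Item):
    name = scrapy.Field()     # book title
    src = scrapy.Field()      # cover image URL
    author = scrapy.Field()   # author
    detail = scrapy.Field()   # description
    herf = scrapy.Field()     # detail-page URL
    price = scrapy.Field()    # price
    cbs = scrapy.Field()      # publisher
    cb_time = scrapy.Field()  # publication date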
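Scrapy also accepts dataclass objects as items, which gives the same field list with type hints and default values. The sketch below is an alternative to the scrapy.Item definition, not what this project uses, and the class name BookItem is hypothetical. With a dataclass, fields are read as attributes (item.name) or through itemadapter.ItemAdapter rather than item['name'], so the pipeline would need the matching change.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BookItem:  # hypothetical name
    name: Optional[str] = None     # book title
    src: Optional[str] = None      # cover image URL
    author: Optional[str] = None   # author
    detail: Optional[str] = None   # description
    herf: Optional[str] = None     # detail-page URL
    price: Optional[str] = None    # price
    cbs: Optional[str] = None      # publisher
    cb_time: Optional[str] = None  # publication date


example = BookItem(name="Book A", author="Author A")
example_row = asdict(example)  # plain dict, in field order
```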

6. Save to the database

import pymysql


class ScrapyReadbookPipeline:
    def __init__(self):
        # Connect to the database
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            passwd='wx990826',
            db='dushu',
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized insert: the driver escapes the values
        sqli = "insert into zgddxs(name,author,src,detail,herf,price,cbs,cb_time) values(%s,%s,%s,%s,%s,%s,%s,%s)"
        self.cur.execute(sqli, (item['name'], item['author'], item['src'], item['detail'],
                                item['herf'], item['price'], item['cbs'], item['cb_time']))
        self.conn.commit()
        print('Saved')
        return item
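The same pipeline shape can be tried without a MySQL server. The sketch below swaps in the standard library's sqlite3 for pymysql, which is a substitution for local testing and not what the article uses, and it opens and closes the connection in Scrapy's open_spider and close_spider hooks so the connection is released when the crawl ends. The class name, the all-text column types, and the sample values are assumptions. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

FIELDS = ("name", "author", "src", "detail", "herf", "price", "cbs", "cb_time")


class SqliteBookPipeline:  # hypothetical name
    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        # Assumed schema: every column stored as text.
        columns = ", ".join(f"{field} text" for field in FIELDS)
        self.conn.execute(f"create table if not exists zgddxs ({columns})")

    def process_item(self, item, spider):
        placeholders = ",".join("?" for _ in FIELDS)
        sql = f"insert into zgddxs({','.join(FIELDS)}) values({placeholders})"
        # item.get works for plain dicts and for scrapy.Item objects.
        self.conn.execute(sql, tuple(item.get(field) for field in FIELDS))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        if self.conn is not None:
            self.conn.close()
            self.conn = None


# Usage outside Scrapy, with a made-up item.
pipeline = SqliteBookPipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"name": "Book A", "price": "39.80"}, spider=None)
rows = pipeline.conn.execute("select name, price, cbs from zgddxs").fetchall()
pipeline.close_spider(spider=None)
```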

Output: