Advanced Python 01: Parsing HTML with XPath

news 2025/7/6 12:34:15

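I. Overview

Making a little progress every day, we have now reached the Advanced Python series. This article takes a detailed look at XPath, which is mainly used to extract specific, well-located pieces of information from web pages. It is intended for technical exchange and learning only.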

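II. Purpose

XPath is used to locate nodes (elements, attributes, and so on) in an XML or HTML document and to extract the information we need from them.

Two things to keep in mind:

1. XPath parses tree-structured documents such as XML or HTML.

2. Do not assume that everything visible on a web page can be captured with XPath in one pass. What you see may have been filled in by JavaScript that calls an API after the page loads, so it is not in the HTML you download. In that case, analyze how the page gets its data and pick a tool suited to the situation.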
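To make the second point concrete, here is a minimal sketch, assuming a made-up page whose list is filled in by JavaScript after load. The markup, the `items` id, and the `/api/items` endpoint are all hypothetical. lxml only sees the static HTML and never runs the script, so the query comes back empty.

```python
from lxml import etree

# Hypothetical page: the <ul> is empty in the HTML the server returns,
# and a script would fill it in the browser (the endpoint is made up).
static_html = """
<html><body>
<ul id="items"></ul>
<script>fetch('/api/items').then(function (r) { /* render list items */ });</script>
</body></html>
"""

root = etree.HTML(static_html)

# lxml parses the markup but never executes JavaScript,
# so the list has no <li> children to select.
items = root.xpath('//ul[@id="items"]/li')
print(items)  # []
```

When a query that looks right returns an empty list like this, it is worth checking the raw response body before changing the XPath expression.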

III. Installation

pip install lxml==4.3.0 -i http://pypi.doubanio.com/simple --trusted-host pypi.doubanio.com

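IV. Basic XPath syntax

XPath lets you select elements in a document through path expressions.

/: selects along a path that starts at the root node.
//: selects all nodes that match the path, wherever they are in the document.
.: the current node.
..: the parent of the current node.
@: selects an attribute.
[]: a predicate (filter condition).
text(): the text content of a node.
@class: the value of a node's class attribute.
@href: the value of a node's href attribute.
contains(): selects elements that contain a given string.
and: logical AND.
or: logical OR.
not(): excludes nodes that match a condition.
starts-with(): selects elements whose value starts with a given string.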
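As a quick illustration of how several of these pieces combine, the sketch below runs a relative query from an element that was already selected. It reuses the sample HTML of this article; the variable names are only illustrative.

```python
from lxml import etree

html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body></html>
"""
root = etree.HTML(html_content)

# // searches the whole document; [] filters on the id attribute
content_div = root.xpath('//div[@id="content"]')[0]

# . makes the path relative to content_div; / steps to a direct child
title = content_div.xpath('./h1/text()')

# .. steps up to the parent of the current node
parent_tag = content_div.xpath('..')[0].tag

print(title)       # ['Welcome to XPath Tutorial']
print(parent_tag)  # body
```

Starting a path with `.` matters when you loop over a list of elements: a path that starts with `//` would search the whole document again, not just the current element.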

V. Sample HTML

<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body>
</html>

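Note: in a real project the HTML comes from an HTTP request. A fixed sample document is used here only to make XPath easier to follow.

Also, with a real request the response body should be decoded explicitly (as UTF-8 here, assuming the page is UTF-8 encoded); otherwise Chinese text may come out garbled.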

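Example:

    import requests

    # `params` is assumed to be defined elsewhere as an object with a .dict() method
    response = requests.get('https://www.example.com', params=params.dict())
    html = response.content.decode('utf-8', 'ignore')
    # html = response.text  # Note: without explicit decoding, Chinese text may come out garbled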
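The effect of the decoding step can be reproduced without a network call. The sketch below uses hard-coded bytes as a stand-in for `response.content`; the markup is hypothetical. It assumes the page really is UTF-8, and it uses ISO-8859-1 as the wrong charset because, as far as I know, that is what requests falls back to for text responses whose headers declare no charset.

```python
from lxml import etree

# Stand-in for response.content: the UTF-8 bytes of a small page
original = '<html><body><p>你好，XPath</p></body></html>'
raw_bytes = original.encode('utf-8')

# Decoding with the wrong charset does not raise, it silently garbles the text
wrong = raw_bytes.decode('iso-8859-1')
# Decoding explicitly as UTF-8 restores the original string
right = raw_bytes.decode('utf-8', 'ignore')

print(wrong == original)  # False
print(right == original)  # True

text = etree.HTML(right).xpath('//p/text()')
print(text)  # ['你好，XPath']
```

With requests, setting `response.encoding = 'utf-8'` before reading `response.text` should have the same effect as decoding `response.content` by hand.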

VI. XPath examples

from lxml import etree

# Sample HTML string
html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body>
</html>
"""
# Parse the string into an element tree; the queries below all run against root
root = etree.HTML(html_content)

Select the text of the h1 tag with XPath

    # Select the text of the h1 tag
    h1_text = root.xpath('//h1/text()')
    print(h1_text)  # Output: ['Welcome to XPath Tutorial']

Select the class attribute of the p tag with XPath

    intro_class = root.xpath('//p/@class')
    print(intro_class)  # Output: ['intro']

Select the href attribute of the link with XPath

    link_href = root.xpath('//a/@href')
    print(link_href)  # Output: ['http://example.com']

Select all child elements of the div element

    divs = root.xpath('//div/*')
    print(divs)  # Output (memory addresses will differ): [<Element h1 at 0x2c53cb81248>, <Element p at 0x2c53cb81308>, <Element a at 0x2c53cb81388>]
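Queries that select elements, not text or attributes, return lxml element objects, so the printed result is not very informative by itself. A minimal sketch of reading their contents, using the same sample HTML (the `summary` name is only illustrative):

```python
from lxml import etree

html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body></html>
"""
root = etree.HTML(html_content)

children = root.xpath('//div/*')

# .tag is the tag name, .text is the text directly inside the element,
# and .get() reads an attribute (None when the attribute is absent)
summary = [(el.tag, el.text, el.get('href')) for el in children]
for row in summary:
    print(row)
# ('h1', 'Welcome to XPath Tutorial', None)
# ('p', 'This is an introduction to XPath.', None)
# ('a', 'Example Link', 'http://example.com')
```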

Select the a element whose href attribute has a given value

    link_href1 = root.xpath('//a[@href="http://example.com"]')
    print(link_href1)  # Output: [<Element a at 0x1f8df0503c8>]

Select the parent element of a node

    parent_element = root.xpath('//h1/..')  # Get the parent element of h1
    print(parent_element)  # Output: [<Element div at 0x1cf1afa1508>]
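For comparison, lxml elements also expose the parent through the `getparent()` method. The sketch below, on the same sample HTML, checks that both routes lead to the same div. It is an illustration, not a recommendation of one over the other.

```python
from lxml import etree

html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body></html>
"""
root = etree.HTML(html_content)

# Route 1: step up with .. inside the XPath expression
via_xpath = root.xpath('//h1/..')[0]

# Route 2: select the h1 first, then ask the element for its parent
h1 = root.xpath('//h1')[0]
via_api = h1.getparent()

print(via_xpath.tag, via_xpath.get('id'))  # div content
print(via_api.tag, via_api.get('id'))      # div content
```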

Select an element with specific text

    specific_text = root.xpath('//p[text()="This is an introduction to XPath."]')
    print(specific_text)  # Output: [<Element p at 0x1fc6d2603c8>]
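An exact text comparison is sensitive to surrounding whitespace, which real pages often contain. The sketch below uses a hypothetical variant of the sample paragraph with extra spaces and a line break, and compares the exact match with the standard XPath function `normalize-space()`, which is not covered elsewhere in this article.

```python
from lxml import etree

# Hypothetical variant of the sample: the text is padded with whitespace
html_content = """
<html><body><p class="intro">   This is an introduction to XPath.
</p></body></html>
"""
root = etree.HTML(html_content)

# The exact comparison fails because the text node includes the padding
exact = root.xpath('//p[text()="This is an introduction to XPath."]')

# normalize-space(.) trims the ends and collapses runs of whitespace first
trimmed = root.xpath('//p[normalize-space(.)="This is an introduction to XPath."]')

print(len(exact))    # 0
print(len(trimmed))  # 1
```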

Use contains() to select elements containing a given string

    contains_text = root.xpath('//p[contains(text(), "introduction")]')
    print(contains_text)  # Output: [<Element p at 0x1ef35baf408>]

Use starts-with() to select elements whose text starts with a given string

    starts_with = root.xpath('//p[starts-with(text(), "This")]')
    print('starts_with:', starts_with)  # Output: starts_with: [<Element p at 0x1d5b985d548>]
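Both functions accept any string value, so they work on attributes as well as on text. A short sketch on the same sample HTML, filtering the link by its href (variable names are illustrative):

```python
from lxml import etree

html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body></html>
"""
root = etree.HTML(html_content)

# contains() on an attribute: links whose href mentions "example"
by_contains = root.xpath('//a[contains(@href, "example")]/text()')

# starts-with() on an attribute: links that use plain http
by_prefix = root.xpath('//a[starts-with(@href, "http://")]/@href')

# No link in the sample starts with https://, so nothing matches
no_match = root.xpath('//a[starts-with(@href, "https://")]')

print(by_contains)  # ['Example Link']
print(by_prefix)    # ['http://example.com']
print(no_match)     # []
```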

Logical operators: and, or, not

    elements = root.xpath('//p[@class="intro" or @href]')
    print(elements)  # Output: [<Element p at 0x2501604f408>]

    non_intro_paragraphs = root.xpath('//p[not(@class="intro")]/text()')
    print(non_intro_paragraphs)  # Output: []
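The `and` operator follows the same pattern: every condition inside the predicate has to hold. A minimal sketch on the same sample HTML, with illustrative variable names:

```python
from lxml import etree

html_content = """
<html><body><div id="content"><h1>Welcome to XPath Tutorial</h1><p class="intro">This is an introduction to XPath.</p><a href="http://example.com">Example Link</a></div></body></html>
"""
root = etree.HTML(html_content)

# Both conditions hold for the paragraph, so it matches
both = root.xpath('//p[@class="intro" and contains(text(), "XPath")]/text()')

# The paragraph has no href attribute, so the combined condition fails
neither = root.xpath('//p[@class="intro" and @href]')

# not() combined with a wildcard: children of the div that have no class attribute
no_class = [el.tag for el in root.xpath('//div/*[not(@class)]')]

print(both)      # ['This is an introduction to XPath.']
print(neither)   # []
print(no_class)  # ['h1', 'a']
```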

Select by position: use [1] for the first element (XPath indexes start at 1)

    # First element: use [1] (XPath indexes start at 1)
    # Last element: use last()
    # n-th element from the end: use [last()-n+1], i.e. [position() = last()-n+1]

    # Select the first p element (the first p child of its parent)
    first_paragraph = root.xpath('//p[1]/text()')
    print(first_paragraph)  # Output: ['This is an introduction to XPath.']

    # Select the last p element
    last_paragraph = root.xpath('//p[last()]/text()')
    print(last_paragraph)  # Output: ['This is an introduction to XPath.']

    # Select the second-to-last p element (the sample has only one p, so nothing matches)
    second_last_paragraph = root.xpath('//p[last()-1]/text()')
    print(second_last_paragraph)  # Output: []

    # Select the last three p elements
    last_three_paragraphs = root.xpath('//p[position() > last()-3]/text()')
    print(last_three_paragraphs)  # Output: ['This is an introduction to XPath.']
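Because the sample page has a single paragraph, the position functions all return the same item or nothing. The sketch below uses a hypothetical document with two lists to make the differences visible. It also shows that `//li[1]` means "the first li within each parent", while `(//li)[1]` means "the first li in the whole document".

```python
from lxml import etree

# Hypothetical markup with two lists, used only for this illustration
html_content = """
<html><body>
<ul id="a"><li>one</li><li>two</li><li>three</li><li>four</li></ul>
<ul id="b"><li>five</li></ul>
</body></html>
"""
root = etree.HTML(html_content)

# The position is counted per parent, so each list contributes its first item
first_in_each = root.xpath('//li[1]/text()')

# Parentheses apply the index to the whole result set instead
first_overall = root.xpath('(//li)[1]/text()')

last_in_a = root.xpath('//ul[@id="a"]/li[last()]/text()')
second_last = root.xpath('//ul[@id="a"]/li[last()-1]/text()')
last_three = root.xpath('//ul[@id="a"]/li[position() > last()-3]/text()')

print(first_in_each)  # ['one', 'five']
print(first_overall)  # ['one']
print(last_in_a)      # ['four']
print(second_last)    # ['three']
print(last_three)     # ['two', 'three', 'four']
```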