
Learning Python: Day 22

news 2025/10/28 16:04:16

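The urllib module

Official docs: urllib ships with Python's standard library, so nothing needs to be installed.

urllib.request opens and reads URLs
urllib.error contains the exceptions raised by urllib.request
urllib.parse parses URLs
urllib.robotparser parses robots.txt files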

`urllib.request`

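Official definition

The urllib.request module defines functions and classes that help open URLs (mostly HTTP) in complex situations such as basic authentication, digest authentication, redirections, cookies and more.

Usage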

urlopen

response = urllib.request.urlopen(url, data=None, [timeout, ]*, context=None)

Attributes and methods of the response

import urllib.request

# help(urllib.request.urlopen) shows:
# urlopen(url, data=None, timeout=<object object at 0x000001A2C7D428B0>, *, context=None)
response = urllib.request.urlopen('http://www.baidu.com')
if response.status == 200:
    # .reason (str): status description, such as OK or Not Found
    print(f'.reason, status description: {response.reason}')
    # .status (int): status code, such as 200 (success) or 404 (not found)
    print(f'.status, status code: {response.status}')
    # .url (str): URL actually retrieved (differs from the original after a redirect)
    print(f'.url, URL actually retrieved: {response.url}')
    # .headers (http.client.HTTPMessage): all response headers, with dict-style access
    content_type = response.headers.get('Content-Type')
    print(f'.headers, Content-Type header: {content_type}')
    # .msg (str): for responses returned by urlopen this is the reason phrase
    print(f'.msg, reason phrase: {response.msg}')
    # .version (int): HTTP version, 11 means HTTP/1.1
    print(f'.version, HTTP version: {response.version}')
    # .closed (bool): whether the response object has been closed
    print(f'.closed, is the response closed: {response.closed}')
    # The body is a stream and can be consumed only once, so the calls
    # below each continue where the previous one stopped.
    # .readline() (bytes): read one line (terminated by \n)
    first_line = response.readline()
    # .read(size) (bytes): read(5) reads five bytes, read() reads everything left
    chunk = response.read(5)
    # .readlines() (list[bytes]): read all remaining lines into a list
    lines = response.readlines()
    # for line in lines:
    #     print('.readlines(), one line: {}'.format(line.decode('utf-8')))
    # .getheader(name) (str or None): one response header, None if it is absent
    header = response.getheader('Content-Type')
    print('.getheader(name), Content-Type: {}'.format(header))
    # .getheaders() (list[tuple[str, str]]): all headers as (name, value) tuples
    headers = response.getheaders()
    print('.getheaders(), all response headers: %s' % headers)
    # .close() returns None; it closes the response and releases resources
    response.close()
    print('.close() called, closed is now: %s' % response.closed)
else:
    print('Request failed')

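Kind	Name	Return type	Description
Attribute	`.status`	`int`	HTTP status code of the response (for example `200` for success, `404` for not found).
Attribute	`.reason`	`str`	HTTP status description of the response (for example `OK` or `Not Found`).
Attribute	`.url`	`str`	URL actually retrieved (may differ from the original URL, for example after a redirect).
Attribute	`.headers`	`http.client.HTTPMessage`	Object holding all response headers; supports dict-style access.
Attribute	`.msg`	`str`	For responses returned by `urlopen`, the same reason phrase as `.reason` (not the headers, as it is on a plain `http.client.HTTPResponse`).
Attribute	`.version`	`int`	HTTP version (for example `11` for HTTP/1.1).
Attribute	`.closed`	`bool`	Whether the response object has been closed.
Method	`.read(size=None)`	`bytes`	Reads the whole remaining body, or at most `size` bytes.
Method	`.readline()`	`bytes`	Reads one line (terminated by `\n`).
Method	`.readlines()`	`List[bytes]`	Reads all remaining lines and returns them as a list.
Method	`.getheader(name)`	`str` or `None`	Returns the named HTTP response header, or `None` if it is absent.
Method	`.getheaders()`	`List[Tuple[str, str]]`	Returns all HTTP response headers as a list of `(header_name, header_value)` tuples.
Method	`.close()`	`None`	Closes the response object and releases its resources.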
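Because the body is a stream, the read methods in the table consume it as they go. The sketch below illustrates this with a `data:` URL, which urllib can open without any network access; the URL content is made up for the demonstration and stands in for a real HTTP response.

```python
from urllib.request import urlopen

# A data: URL carries its own body, so this runs offline.
# %20 is an encoded space and %0A an encoded newline.
url = "data:text/plain;charset=utf-8,first%20line%0Asecond%20line%0A"

with urlopen(url) as response:      # the with block closes the response
    content_type = response.headers.get("Content-Type")
    first = response.readline()     # consumes up to the first \n
    rest = response.read()          # everything that is left
    again = response.read()         # stream exhausted: empty bytes

print(content_type)
print(first, rest, again)
```

Using `with` instead of calling `.close()` by hand means the response is released even if reading raises an exception.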

urlretrieve

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

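Downloads web pages, images, video, audio and so on. It is a legacy interface that may be deprecated in the future, so use it with care.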

# urlretrieve downloads a resource, but the official docs warn that it may be deprecated.
# Official wording: the following functions and classes are ported from the Python 2
# module urllib (as opposed to urllib2). They might become deprecated at some point in the future.
from urllib.request import urlretrieve

# urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
# Download a web page
# urlretrieve('http://www.baidu.com', filename='baidu.html')

# Download an image: find one on Baidu, right-click and copy the image address (usually .jpg)
url_image = 'https://img2.baidu.com/it/u=3097253230,865203483&fm=253&fmt=auto&app=138&f=JPEG?w=800&h=1449'
# urlretrieve(url_image, 'beautiful_girl.jpg')

# Download a video: press F12 and look up the src address of the video (usually .mp4)
url_video = 'https://vdept3.bdstatic.com/mda-qfvaqpnzzj3qf31w/sc/cae_h264/1719738331325131837/mda-qfvaqpnzzj3qf31w.mp4?v_from_s=hkapp-haokan-hbf&auth_key=1742536084-0-0-b0bc588642a1e5555556be80fc4fdcab&bcevod_channel=searchbox_feed&pd=1&cr=2&cd=0&pt=3&logid=2884307607&vid=16648779981624158842&klogid=2884307607&abtest=132219_1'
urlretrieve(url_video, 'video.mp4')
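Since `urlretrieve` is a legacy interface, one possible alternative is to stream the response from `urlopen` into a file. This is only a sketch: the `download` helper, its `timeout` default and the temporary file name are assumptions made for illustration, and a `data:` URL stands in for a real image or video address so the example runs offline.

```python
import os
import shutil
import tempfile
from urllib.request import urlopen


def download(url, filename, timeout=10):
    """Stream the body of url into filename and return the number of bytes saved."""
    with urlopen(url, timeout=timeout) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)   # copies in chunks, not all at once
    return os.path.getsize(filename)


# Hypothetical target path inside a fresh temporary directory.
target = os.path.join(tempfile.mkdtemp(), "hello.txt")
size = download("data:text/plain;charset=utf-8,hello", target)
print(size)
```

Copying in chunks keeps memory use flat, which matters for large files such as videos.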

Request

Its main purpose is to customize the request headers.

from urllib.request import Request, urlopen

# help(Request)
url = 'http://www.baidu.com'
# Mainly used to customize things such as the request headers
headers = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=headers)
response = urlopen(req)
print(response.read().decode('utf-8'))
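A `Request` object can be inspected before it is sent, which is a handy way to check what will go over the wire. The short sketch below builds requests without opening them; the `Accept-Language` header and the `HEAD` method are example values chosen here, not something the article uses.

```python
from urllib.request import Request

url = "http://www.baidu.com"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
req.add_header("Accept-Language", "en")     # headers can also be added later

print(req.full_url)
print(req.get_method())                     # GET, because no data was given
# Request stores header names capitalized: "User-Agent" becomes "User-agent".
print(req.header_items())

# The method argument overrides the default choice of GET or POST.
head_req = Request(url, method="HEAD")
print(head_req.get_method())
```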

`build_opener`

Creates a custom OpenerDirector object for handling requests. It extends what urllib.request can do, for example by adding support for proxies, cookies and authentication.

build_opener(handlers)

from urllib.request import build_opener, HTTPHandler

# Create a custom opener
opener = build_opener(HTTPHandler())
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

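Handlers that can be passed to build_opener

Handler	Description
`HTTPHandler`	Handles HTTP requests.
`HTTPSHandler`	Handles HTTPS requests.
`ProxyHandler`	Sends requests through a proxy.
`HTTPCookieProcessor`	Handles HTTP cookies.
`HTTPBasicAuthHandler`	Handles HTTP basic authentication.
`HTTPDigestAuthHandler`	Handles HTTP digest authentication.
`FTPHandler`	Handles FTP requests.
`FileHandler`	Handles requests for local files.
`DataHandler`	Handles data URLs (URLs that start with `data:`).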

Commonly used functions compared

Function	Description
`urllib.request.urlopen(url, data=None, [timeout, ]*, context=None)`	Sends a request and returns a response object (an `http.client.HTTPResponse` for HTTP and HTTPS URLs). Supports GET, POST and other methods.
`urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)`	Creates a request object that carries a custom URL, data, headers and method (such as GET or POST).
`urllib.request.build_opener([handler, ...])`	Creates a custom OpenerDirector object for handling requests.
`urllib.request.install_opener(opener)`	Installs an OpenerDirector object as the global default; later `urlopen` calls use it.
`urllib.request.pathname2url(path)`	Converts a local file path to its URL form.
`urllib.request.url2pathname(url)`	Converts a path in URL form to a local file path.
`urllib.request.getproxies()`	Returns the proxy settings configured on the system.
`urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)`	Downloads the resource at a URL and saves it to a local file.

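`urllib.error`

Official definition

The urllib.error module defines the exception classes for exceptions raised by urllib.request. (In other words, it provides the classes you catch when a request fails.)

Usage

from urllib.request import urlopen
from urllib.error import URLError, HTTPError

url = 'http://www.test.com'
# Test URLError. HTTPError is a subclass of URLError, so it must be caught first.
try:
    response = urlopen(url)
except HTTPError as e:
    print(f"HTTP error: {e.code} - {e.reason}")
    print("Headers:", e.headers)
except URLError as e:
    print(f"URL error: {e.reason}")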
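The order of the `except` clauses matters because `HTTPError` inherits from `URLError`. The sketch below wraps the same pattern in a helper; the `fetch` name, the 5 second timeout and the made-up `nosuchscheme` URL are illustrative assumptions. An unknown scheme is used because urllib rejects it with a `URLError` before touching the network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the body as bytes, or a short error description as str."""
    try:
        with urlopen(url, timeout=5) as response:
            return response.read()
    except HTTPError as e:      # the server answered with an error status
        return f"HTTP error: {e.code} {e.reason}"
    except URLError as e:       # no usable answer: bad scheme, DNS failure, refused connection
        return f"URL error: {e.reason}"


print(issubclass(HTTPError, URLError))
result = fetch("nosuchscheme://example")
print(result)
```

If the two clauses were swapped, the `URLError` branch would also swallow every `HTTPError` and the status code would never be reported.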

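`urllib.parse`

Official definition

This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (scheme, network location, path and so on), to combine the components back into a URL string, and to convert a "relative URL" to an absolute URL given a "base URL". The module is designed to match the internet RFC on relative Uniform Resource Locators. It supports the following URL schemes: file, ftp, gopher, hdl, http, https, imap, itms-services, mailto, mms, news, nntp, prospero, rsync, rtsp, rtsps, rtspu, sftp, shttp, sip, sips, snews, svn, svn+ssh, telnet, wais, ws, wss.

Why encoding and decoding are needed: a Baidu search shows the problem.

from urllib.parse import quote, unquote

# Search Baidu for 周杰伦 and look at the resulting link: the browser shows the first form,
# but the request actually uses the second, percent-encoded form.
# quote() and unquote() convert between the two.
encoded = "https://www.baidu.com/s?wd=周杰伦"
print(f"quote() output: {quote(encoded.split('=')[1])}")
encoded = "https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6"
print(f"unquote() output: {unquote(encoded.split('=')[1])}")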
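The splitting and joining described above can be sketched with `urlsplit`, `parse_qs`, `urlunsplit` and `urljoin`. The Baidu search URL is the one used elsewhere in this article; the `/robots.txt` relative path is just an example value.

```python
from urllib.parse import parse_qs, urljoin, urlsplit, urlunsplit

url = "https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6"

parts = urlsplit(url)                   # break the URL into its components
print(parts.scheme, parts.netloc, parts.path, parts.query)

params = parse_qs(parts.query)          # decode the query string into a dict of lists
print(params)

rebuilt = urlunsplit(parts)             # combine the components back into a URL
absolute = urljoin(url, "/robots.txt")  # resolve a relative URL against a base URL
print(rebuilt == url, absolute)
```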

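Usage

`quote`

Converts the special characters in a string to URL encoding (also called percent-encoding).

Parameters

string: the string to encode.
safe: characters that should not be encoded; defaults to '/'.
encoding: encoding applied to non-ASCII characters; defaults to None, which means 'utf-8'.
errors: how encoding errors are handled; defaults to None, which means 'strict'.

unquote

Decodes a URL-encoded string back to the original string.

Parameters

string: the URL-encoded string to decode.
encoding: encoding used to decode the percent-encoded bytes; defaults to 'utf-8'.
errors: how decoding errors are handled; defaults to 'replace'.

quote_plus

Converts the special characters in a string to URL encoding and replaces spaces with +.

Parameters

string: the string to encode.
safe: characters that should not be encoded; defaults to the empty string ''.
encoding: encoding applied to non-ASCII characters; defaults to None, which means 'utf-8'.
errors: how encoding errors are handled; defaults to None, which means 'strict'.

unquote_plus

Decodes a URL-encoded string back to the original string and replaces + with spaces.

Parameters

string: the URL-encoded string to decode.
encoding: encoding used to decode the percent-encoded bytes; defaults to 'utf-8'.
errors: how decoding errors are handled; defaults to 'replace'.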

from urllib.parse import quote, unquote, quote_plus, unquote_plus

# quote(): percent-encodes special characters; a space becomes %20. Meant for the path part of a URL.
# URL-encode a string
encoded = quote('Hello World!')
print(f"quote() output: {encoded}")  # output: Hello%20World%21
# Name characters that must not be encoded
encoded = quote('Hello World!', safe='!')
print(f"quote() output with safe: {encoded}")  # output: Hello%20World!
# unquote(): decodes a URL-encoded string; %20 becomes a space. Meant for the path part of a URL.
encoded = unquote(encoded)
print(f"unquote() output: {encoded}")  # output: Hello World!
# quote_plus(): like quote(), but a space becomes +. Meant for query parameters.
encoded = quote_plus('Hello World!')
print(f"quote_plus() output: {encoded}")  # output: Hello+World%21
# unquote_plus(): like unquote(), but + becomes a space. Meant for query parameters.
encoded = unquote_plus(encoded)
print(f"unquote_plus() output: {encoded}")  # output: Hello World!

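Encoding and decoding functions compared

Function	What it does	Space handling	Typical use
`quote()`	Converts the special characters in a string to URL encoding.	Space is encoded as `%20`	Encoding the path part of a URL.
`unquote()`	Decodes a URL-encoded string back to the original string.	`%20` is decoded to a space	Decoding the path part of a URL.
`quote_plus()`	Converts the special characters in a string to URL encoding and replaces spaces with `+`.	Space is replaced with `+`	Encoding URL query parameters.
`unquote_plus()`	Decodes a URL-encoded string back to the original string and replaces `+` with spaces.	`+` is replaced with a space	Decoding URL query parameters.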
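When a query string has several parameters, `urllib.parse.urlencode` applies the encoding to every key and value in one step. The sketch below shows its default `quote_plus` behaviour and the `quote_via` switch; the `pn` and `q` parameters are example names chosen for illustration.

```python
from urllib.parse import quote, urlencode

# Values may be str or numbers; each key and value is encoded separately.
query = urlencode({"wd": "周杰伦", "pn": 10})
print(query)

# urlencode uses quote_plus by default, so a space becomes +.
with_plus = urlencode({"q": "Hello World!"})
# quote_via=quote switches to %20 for spaces.
with_percent = urlencode({"q": "Hello World!"}, quote_via=quote)
print(with_plus, with_percent)

search_url = "https://www.baidu.com/s?" + query
print(search_url)
```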

urllib.robotparser

Official definition

This module provides a single class, RobotFileParser, which answers questions about whether a particular user agent may fetch a given URL on a website that publishes a robots.txt file.

Usage

from urllib.robotparser import RobotFileParser

# Create a RobotFileParser object
rp = RobotFileParser()
# Set the URL of the robots.txt file
rp.set_url('https://www.baidu.com/robots.txt')
# Fetch and parse the robots.txt file
rp.read()
# Check access permissions
user_agent = 'MyBot'
urls_to_check = [
    'https://www.baidu.com/',
    'https://www.baidu.com/private/',
    'https://www.baidu.com/public/',
]
for url in urls_to_check:
    if rp.can_fetch(user_agent, url):
        print(f"Allowed: {url}")
    else:
        print(f"Disallowed: {url}")

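Common methods

Method	Description
`set_url(url)`	Sets the URL of the `robots.txt` file.
`read()`	Fetches and parses the `robots.txt` file.
`parse(lines)`	Parses the content of a `robots.txt` file (passed in as a list of strings).
`can_fetch(useragent, url)`	Checks whether the given user agent (crawler) is allowed to fetch the given URL.
`mtime()`	Returns the time the `robots.txt` file was last fetched (Unix timestamp).
`modified()`	Sets the time the `robots.txt` file was last fetched to the current time.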
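`parse(lines)` makes it possible to try out rules without fetching anything. The sketch below feeds a small hand-written rule set to the parser; the rules and the `example.com` URLs are invented for the demonstration and are not Baidu's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)                     # parse in-memory lines instead of calling read()

user_agent = "MyBot"
private_ok = rp.can_fetch(user_agent, "https://example.com/private/page.html")
public_ok = rp.can_fetch(user_agent, "https://example.com/public/page.html")
print(private_ok, public_ok)
```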

GET requests

Getting the request headers and parameters

import urllib.request
import urllib.parse

url = 'https://www.baidu.com/s?wd='
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}
wd = '林俊杰'
# GET parameter: quote() percent-encodes the string
name = urllib.parse.quote(wd)
# The GET parameter is appended to the URL
request = urllib.request.Request(url=url + name, headers=headers)
response = urllib.request.urlopen(request)
# Read the response data
print(response.status, response.read().decode('utf-8'))

POST requests

Getting the POST request headers and parameters:


import urllib.request
import urllib.parse

url = 'https://fanyi.baidu.com/sug'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36'
}
data = {
    'kw': 'spider'
}
# POST parameters must be URL-encoded and then converted to bytes with encode("utf-8")
data = urllib.parse.urlencode(data).encode('utf-8')
# Passing data makes this a POST request
request = urllib.request.Request(url=url, data=data, headers=headers)
response = urllib.request.urlopen(request)
# Read the response data
print(response.status, response.read().decode('utf-8'))
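To experiment with POST requests without depending on an external site, one option is a throwaway local server. In the sketch below `EchoHandler` is a hypothetical stand-in that returns the posted form fields as JSON; it does not reproduce what the Baidu endpoint returns. The `/sug` path and the `kw` field are reused from the example above only for familiarity.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlencode
from urllib.request import ProxyHandler, Request, build_opener


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: answers a POST with the form fields as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        body = json.dumps(form).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)   # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/sug"
data = urlencode({"kw": "spider"}).encode("utf-8")    # the POST body must be bytes
request = Request(url, data=data, headers={"User-Agent": "Mozilla/5.0"})

opener = build_opener(ProxyHandler({}))               # ignore any system proxy
with opener.open(request, timeout=5) as response:
    status = response.status
    result = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(request.get_method(), status, result)
```

The same client code should work against a real endpoint by swapping the URL, provided the server accepts form-encoded bodies.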

Code: pythonPractice: Python learning exercises (code)
