用通义大模型写爬虫程序,汇总各科成绩
需求:根据各科网址,输入学号、姓名查询成绩。
中间反反复复很多次,本文只记下重点的几次和大模型的沟通历史。
输入界面
查询界面
round0(最初的问题)
请在windows下,使用python的selenium库,输入考号和姓名,点击查询按钮, 并输出总分、班名、校名,请用提供代码
谷歌驱动地址:C:/chrome_drv/chromedriver.exe
查询网址:https://w2sl19od.yichafen.com/qz/122DxDre
xpath信息:
考号:/html/body/div[1]/form/div[2]/div[1]/div[2]/input
姓名:/html/body/div[1]/form/div[2]/div[2]/div[2]/input 查询按钮:/html/body/div[1]/div[2]/button
总分:/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]
班名:/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]
校名:/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]
考生信息: 考号:250152001 姓名:赵志立
round1(想查询多个学生的多门成绩)
请在windows下,使用python的selenium库,打开课程查询地址,输入考号、姓名,点击查询按钮, 获取课程的分数、班名、校名。
其中课程查询地址、考号、姓名都发在c:\查分系统.xlsx中。
其中sheet1,存放着课程名称、地址;sheet2,存放着考号、姓名。
请找到每个学生,遍历每一门课程,获取课程名称、分数、班名、校名,存在sheet3中
同时把学生的姓名、各科成绩放到sheet4中的每行中。
注意课程名称是从sheet1中遍历获取,可能是多门课程。
谷歌驱动地址:C:/chrome_drv/chromedriver.exe
xpath信息:
考号:/html/body/div[1]/form/div[2]/div[1]/div[2]/input
姓名:/html/body/div[1]/form/div[2]/div[2]/div[2]/input
查询按钮:/html/body/div[1]/div[2]/button
分数:/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]
班名:/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]
校名:/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]
round2
不对,sheet4中存放的是多门课程的分数。具体有哪些课程,需要从sheet1中获取
round3
再修改一下,第一次循环是,清空sheet3和sheet4
round4
还要修改,sheet3中增加一列:姓名,放到第一列中
round5
在sheet4中再增加一个总分数,是各科成绩之和
round6
再修改一下,每查完一个学生的全部成绩,要写入excel
round7(代码有问题)
再优化一下,1、每个学生成绩查完后,从程序输出。2、如果查询出错,自动跳入查询下一个人成绩
round9(因round7代码有问题,转到grok)
请修改如下代码(就是通义在round6中提供的),优化一下:注意事项 1、因为有可能程序运行中报错,为保证数据,每次获取一个学生成绩后,都要写入excel 2、每次获取一个学生成绩后,需要通过日志输出,这样能看到程序的执行过程
round10
把这个代码再优化一下:
1、每完成一个学生的成绩,就写入excel
2、查询每个学生的成绩时,如果报错,就转到另外一个学生,并把没取到设置为0。
3、程序中增加一个变量:是否清除之前的记录,如果是1,写入sheet3、sheet4时先清除,否则,直接追加记录。这个变量直接在源代码中设置就行,无需运行时读取输入。
4、使用显式等待:替代 time.sleep,以提高代码的稳定性和响应速。
代码部分
round0代码(查询单人、单科成绩)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time# 配置Chrome驱动
chrome_driver_path = "C:/chrome_drv/chromedriver134.exe"
service = Service(executable_path=chrome_driver_path)
options = webdriver.ChromeOptions()
options.add_argument("--start-maximized") # 最大化窗口# 初始化浏览器
driver = webdriver.Chrome(service=service, options=options)# 访问查询网址
driver.get("https://w2sl19od.yichafen.com/qz/******")try:# 显式等待页面加载wait = WebDriverWait(driver, 10)# 输入考号exam_no = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/form/div[2]/div[1]/div[2]/input")))exam_no.send_keys("**考号")# 输入姓名name = wait.until(EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/form/div[2]/div[2]/div[2]/input")))name.send_keys("**姓名")# 点击查询按钮query_btn = wait.until(EC.element_to_be_clickable((By.XPATH, "/html/body/div[1]/div[2]/button")))query_btn.click()# 等待结果加载time.sleep(3) # 根据实际网络情况调整等待时间# 获取成绩信息total_score = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]").textclass_rank = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]").textschool_rank = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]").text# 输出结果print(f"总分:{total_score}")print(f"班名:{class_rank}")print(f"校名:{school_rank}")except Exception as e:print("查询过程中出现错误:", str(e))finally:# 关闭浏览器driver.quit()
round6代码
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from openpyxl import load_workbook# 设置Chrome驱动路径
chrome_driver_path = "C:/chrome_drv/chromedriver.exe"
service = Service(chrome_driver_path)# 加载Excel文件
excel_path = "c:/查分系统.xlsx"
df_courses = pd.read_excel(excel_path, sheet_name='sheet1')
df_students = pd.read_excel(excel_path, sheet_name='sheet2')# 初始化webdriver
driver = webdriver.Chrome(service=service)def write_to_excel(all_results, student_results):# 清理并写入Excelwb = load_workbook(excel_path)if 'sheet3' in wb.sheetnames:wb.remove(wb['sheet3'])if 'sheet4' in wb.sheetnames:wb.remove(wb['sheet4'])wb.save(excel_path)wb.close()with pd.ExcelWriter(excel_path, engine='openpyxl', mode='a') as writer:# 写入sheet3,包含姓名、课程名称、分数、班名、校名pd.DataFrame(all_results, columns=['姓名', '课程名称', '分数', '班名', '校名']).to_excel(writer,sheet_name='sheet3',index=False)# 使用pd.json_normalize将字典转换为DataFrame,并确保‘总分数’列被正确处理student_df = pd.json_normalize(student_results)# 调整列顺序以确保‘总分数’列的位置合适,这里假设放在最后一列cols = list(student_df.columns)if '总分数' in cols:cols.remove('总分数')student_df = student_df[cols + ['总分数']]student_df.to_excel(writer, sheet_name='sheet4', index=False)all_results = []
student_results = []for index, student in df_students.iterrows():student_name = student['姓名']student_exam_id = student['考号'] # 修正此处的赋值语句student_scores = {'姓名': student_name} # 存储单个学生的姓名和各科成绩total_score = 0 # 初始化总分数for course_index, course in df_courses.iterrows():course_url = course['地址']course_name = course['课程名称']# 打开课程查询页面driver.get(course_url)# 输入考号和姓名driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[1]/div[2]/input").send_keys(student_exam_id)driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[2]/div[2]/input").send_keys(student_name)# 点击查询按钮driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/button").click()time.sleep(2) # 给予页面加载时间# 获取成绩等信息score = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]").textclass_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]").textschool_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]").text# 将结果添加到列表中,并包含学生姓名all_results.append([student_name, course_name, score, class_name, school_name])student_scores[course_name] = float(score) if score.replace('.', '', 1).isdigit() else 0 # 添加该生的这门课成绩,并转换为浮点数total_score += float(score) if score.replace('.', '', 1).isdigit() else 0 # 计算总分student_scores['总分数'] = total_score # 添加总分数到学生的成绩字典中student_results.append(student_scores)# 每个学生完成后,写入Excelwrite_to_excel(all_results, student_results)# 关闭浏览器
driver.quit()
round8代码(grok提供)
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from openpyxl import load_workbook
import logging# 设置日志
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')# 设置 Chrome 驱动路径
chrome_driver_path = "C:/chrome_drv/chromedriver.exe"
service = Service(chrome_driver_path)# 加载 Excel 文件
excel_path = "c:/查分系统.xlsx"
df_courses = pd.read_excel(excel_path, sheet_name='sheet1')
df_students = pd.read_excel(excel_path, sheet_name='sheet2')# 初始化 webdriver(使用 headless 模式)
options = webdriver.ChromeOptions()
options.add_argument('--headless') # 无界面模式,提高性能
driver = webdriver.Chrome(service=service, options=options)def write_to_excel(all_results, student_results):"""将查询结果写入 Excel 文件的 sheet3 和 sheet4"""# 清理已有 sheetwb = load_workbook(excel_path)if 'sheet3' in wb.sheetnames:wb.remove(wb['sheet3'])if 'sheet4' in wb.sheetnames:wb.remove(wb['sheet4'])wb.save(excel_path)wb.close()# 写入数据with pd.ExcelWriter(excel_path, engine='openpyxl', mode='a') as writer:# 写入 sheet3:详细成绩pd.DataFrame(all_results, columns=['姓名', '课程名称', '分数', '班名', '校名']).to_excel(writer, sheet_name='sheet3', index=False)# 写入 sheet4:学生总览student_df = pd.json_normalize(student_results)cols = list(student_df.columns)if '总分数' in cols:cols.remove('总分数')student_df = student_df[cols + ['总分数']] # 确保“总分数”列在最后student_df.to_excel(writer, sheet_name='sheet4', index=False)# 初始化结果存储列表
all_results = [] # 存储所有学生的每门课程详细信息
student_results = [] # 存储每个学生的成绩字典# 遍历学生
for index, student in df_students.iterrows():student_name = student['姓名']student_exam_id = student['考号']student_scores = {'姓名': student_name} # 存储单个学生的成绩total_score = 0 # 初始化总分logging.info(f"开始查询学生: {student_name}")# 遍历课程for course_index, course in df_courses.iterrows():course_url = course['地址']course_name = course['课程名称']try:# 打开查询页面driver.get(course_url)logging.info(f"访问课程页面: {course_url}")# 输入考号和姓名driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[1]/div[2]/input").send_keys(student_exam_id)driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[2]/div[2]/input").send_keys(student_name)# 点击查询按钮driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/button").click()time.sleep(2) # 等待页面加载# 获取成绩和班级信息score = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]").textclass_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]").textschool_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]").text# 存储结果all_results.append([student_name, course_name, score, class_name, school_name])student_scores[course_name] = float(score) if score.replace('.', '', 1).isdigit() else 0total_score += float(score) if score.replace('.', '', 1).isdigit() else 0logging.info(f"成功获取 {student_name} 的 {course_name} 成绩: {score}")except (NoSuchElementException, TimeoutException) as e:# 错误处理:记录错误并填入默认值logging.error(f"查询 {student_name} 的 {course_name} 时出错: {e}")all_results.append([student_name, course_name, 'N/A', 'N/A', 'N/A'])student_scores[course_name] = 0# 添加总分student_scores['总分数'] = total_scorestudent_results.append(student_scores)# 每个学生完成后,写入 Excelwrite_to_excel(all_results, student_results)logging.info(f"已将 {student_name} 的成绩写入 Excel")# 关闭浏览器
driver.quit()
logging.info("所有学生成绩查询完成")
最终:
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from openpyxl import load_workbook
import logging# 设置日志
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')# 设置 Chrome 驱动路径
chrome_driver_path = "C:/chrome_drv/chromedriver.exe"
service = Service(chrome_driver_path)# 加载 Excel 文件
excel_path = "c:/查分系统.xlsx"
df_courses = pd.read_excel(excel_path, sheet_name='sheet1')
df_students = pd.read_excel(excel_path, sheet_name='sheet2')# 初始化 webdriver(使用 headless 模式)
options = webdriver.ChromeOptions()
options.add_argument('--headless') # 无界面模式
driver = webdriver.Chrome(service=service, options=options)def append_to_excel(all_results, student_results, excel_path="c:/查分系统.xlsx"):"""将查询结果追加到 Excel 文件的 sheet3 和 sheet4"""try:# 加载现有 Excel 文件wb = load_workbook(excel_path)# 处理 sheet3if 'sheet3' in wb.sheetnames:# 读取现有 sheet3 数据sheet3_df = pd.read_excel(excel_path, sheet_name='sheet3')# 将新数据转换为 DataFramenew_sheet3_df = pd.DataFrame(all_results, columns=['姓名', '课程名称', '分数', '班名', '校名'])# 追加新数据sheet3_df = pd.concat([sheet3_df, new_sheet3_df], ignore_index=True)else:# 如果 sheet3 不存在,直接使用新数据sheet3_df = pd.DataFrame(all_results, columns=['姓名', '课程名称', '分数', '班名', '校名'])# 处理 sheet4if 'sheet4' in wb.sheetnames:# 读取现有 sheet4 数据sheet4_df = pd.read_excel(excel_path, sheet_name='sheet4')# 将新数据转换为 DataFramenew_sheet4_df = pd.json_normalize(student_results)# 追加新数据sheet4_df = pd.concat([sheet4_df, new_sheet4_df], ignore_index=True)else:# 如果 sheet4 不存在,直接使用新数据sheet4_df = pd.json_normalize(student_results)# 调整 sheet4 的列顺序,确保 '总分数' 在最后if '总分数' in sheet4_df.columns:cols = [col for col in sheet4_df.columns if col != '总分数']sheet4_df = sheet4_df[cols + ['总分数']]# 保存到 Excelwith pd.ExcelWriter(excel_path, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:sheet3_df.to_excel(writer, sheet_name='sheet3', index=False)sheet4_df.to_excel(writer, sheet_name='sheet4', index=False)logging.info("成功追加数据到 Excel 文件")except Exception as e:logging.error(f"追加数据到 Excel 文件时出错: {e}")raise# 初始化结果存储
all_results = [] # 存储所有学生的每门课程详细信息
student_results = [] # 存储每个学生的成绩字典# 遍历学生
for index, student in df_students.iterrows():student_name = student['姓名']student_exam_id = student['考号']student_scores = {'姓名': student_name} # 存储单个学生的成绩total_score = 0 # 初始化总分logging.info(f"开始查询学生: {student_name}")# 遍历课程for course_index, course in df_courses.iterrows():course_url = course['地址']course_name = course['课程名称']try:# 打开查询页面driver.get(course_url)logging.info(f"访问课程页面: {course_url}")# 输入考号和姓名driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[1]/div[2]/input").send_keys(student_exam_id)driver.find_element(By.XPATH, "/html/body/div[1]/form/div[2]/div[2]/div[2]/input").send_keys(student_name)# 点击查询按钮driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/button").click()time.sleep(2) # 等待页面加载# 获取成绩和班级信息score = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[5]/td[2]").textclass_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[8]/td[2]").textschool_name = driver.find_element(By.XPATH, "/html/body/div/div[2]/div[1]/table/tbody/tr[9]/td[2]").text# 存储结果all_results.append([student_name, course_name, score, class_name, school_name])student_scores[course_name] = float(score) if score.replace('.', '', 1).isdigit() else 0total_score += float(score) if score.replace('.', '', 1).isdigit() else 0logging.info(f"成功获取 {student_name} 的 {course_name} 成绩: {score}")except (NoSuchElementException, TimeoutException) as e:logging.error(f"查询 {student_name} 的 {course_name} 时出错: {e}")all_results.append([student_name, course_name, 'N/A', 'N/A', 'N/A'])student_scores[course_name] = 0# 添加总分student_scores['总分数'] = total_scorestudent_results.append(student_scores)# 所有查询完成后,追加数据到 Excel
try:append_to_excel(all_results, student_results)logging.info("所有结果成功追加到 Excel")
except Exception as e:logging.error(f"追加数据到 Excel 时出错: {e}")# 关闭浏览器
driver.quit()
logging.info("所有学生成绩查询完成")