
Traffic Classification Experiment

news 2026/1/7 23:54:18

Source code: [Free] source code for the traffic classification experiment of the network measurement lab (CSDN Wenku)

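Experiment Background

A decision tree is shaped like a tree: starting from the root, each internal node is a test condition and each branch represents one outcome of that test. Following a path down the tree to a leaf node yields the class that is the final decision.

The decision tree algorithm proceeds as follows:

1. Choose a test condition: first choose a test condition for the root node. The condition should split the data set into subsets that are as pure as possible, that is, with samples of the same class grouped together. Common selection criteria include information gain, information gain ratio, and the Gini index.

2. Branch: split the data set into subsets according to the chosen condition. Each subset follows one branch of the tree and forms a new node.

3. Recurse: for each subset, repeat steps 1 and 2, choosing a new test condition and branching again, until a stopping condition is met.

4. Stopping conditions: recursion stops when any of the following holds: all samples in the current node belong to the same class, so no further split is needed; the current node contains too few samples to split further; the preset maximum tree depth has been reached.

5. Leaf nodes: when recursion stops, a leaf node is formed. The leaf node gives the final class decision.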
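The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lab's reference framework: the names entropy, best_split, build_tree and predict, the dictionary node layout, and the default values of max_depth and min_samples are all assumptions made for the example. It uses information gain with binary threshold splits on numeric features.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) of the best binary split, or None."""
    parent = entropy(labels)
    best = None
    for f in range(len(rows[0])):
        values = sorted(set(row[f] for row in rows))
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            theta = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= theta]
            right = [y for row, y in zip(rows, labels) if row[f] > theta]
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, theta)
    return best


def build_tree(rows, labels, depth=0, max_depth=5, min_samples=2):
    """Recursively build a tree; leaves are class labels, inner nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    # Step 4: stop on a pure node, too few samples, or the depth limit.
    if len(set(labels)) == 1 or len(labels) < min_samples or depth >= max_depth:
        return majority  # Step 5: leaf node
    split = best_split(rows, labels)  # Step 1: choose the test condition
    if split is None or split[0] <= 0:
        return majority
    _, f, theta = split
    # Step 2: branch into two subsets.
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= theta]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > theta]
    # Step 3: recurse on each subset.
    return {
        "feature": f,
        "theta": theta,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth, min_samples),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth, min_samples),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["theta"] else tree["right"]
    return tree
```

For instance, build_tree([[60], [70], [1400], [1500]], ["chat", "chat", "video", "video"]) produces a single split on feature 0 at threshold 735.0 with one pure leaf on each side.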

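Write Python code that implements traffic classification; the reference code framework may be used as a starting point. The specific requirements are:

Task 1. Process the provided traffic data set: download the raw packets, assemble each pcap file into flows on the basis of Experiment 2, complete the def extractFlow(flow_list) function to extract each five-tuple and its packet-length sequence, and write the output to csv files by class;

Task 2. Compute information gain: complete the def cal_best_theta_value(self, ke, attri_list) function of the ID3 decision tree algorithm, which computes the best threshold for a given feature by maximizing information gain;

Task 3. Classify traffic: use the ID3 decision tree algorithm together with 11 flow statistical features to classify the traffic, obtain the classification accuracy, and understand the complete workflow of classifying flows from statistical features of packet-length sequences;

Task 4 (optional). Optimize: consider improving the classification accuracy by changing the features, tuning the model, or similar strategies.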
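Task 3 relies on statistical features derived from each flow's packet-length sequence. The 11 features used by the lab are not listed in this write-up, so the sketch below uses an assumed, smaller set of common statistics purely to illustrate how a variable-length sequence becomes a fixed-length feature vector. The function name length_features is hypothetical.

```python
import statistics


def length_features(pkt_sizes):
    """Summarise one flow's packet-length sequence as a fixed-length feature vector.

    The statistics chosen here are illustrative assumptions, not the lab's 11 features.
    """
    if not pkt_sizes:
        raise ValueError("a flow needs at least one packet")
    return [
        len(pkt_sizes),                 # number of packets
        sum(pkt_sizes),                 # total bytes
        min(pkt_sizes),                 # shortest packet
        max(pkt_sizes),                 # longest packet
        statistics.mean(pkt_sizes),     # mean length
        statistics.median(pkt_sizes),   # median length
        statistics.pstdev(pkt_sizes),   # population standard deviation
    ]
```

Each flow then maps to one row of numbers, which is the form a threshold-based decision tree expects as input.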

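Experiment Content

2.1 Usage documentation and sample run

2.1.1 Project structure

The project is structured as follows:

2.1.2 Usage

The program requires the following libraries:

dpkt_fix==1.7 (installation kept failing, so dpkt was installed instead)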
numpy==1.23.3
pandas==2.0.1
scapy==2.5.0
scikit_learn==1.2.2
scipy==1.9.1

To run the project, execute flow_combine.py, label.py, MultiClassDTree.py and net_steams_classification.py in that order.

2.2 UML diagram and main data structures

2.3 Description of the main algorithms

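    # Extract the five-tuple and the corresponding packet-length sequence
    def extractFlow(flow_list):
        session_list = []
        for idx in range(len(flow_list)):
            '''
            Complete the code here.
            Extract the five-tuple and the corresponding packet-length sequence of each flow.
            Each row of session_list is one flow: the first column holds the five-tuple,
            the second column holds the list of packet lengths.
            '''
            five_list = (flow_list[idx].src_ip, flow_list[idx].dst_ip, flow_list[idx].src_port,
                         flow_list[idx].dst_port, flow_list[idx].trans_layer_proto)
            session_list.append([five_list, flow_list[idx].pktsizeseq])
        return session_list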

This part only needs to extract the five-tuple and store it together with the packet-length sequence in the required layout.
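Task 1 also asks for the flows to be written to csv files by class. The exact column layout expected by the lab framework is not shown here, so the following is only a sketch under an assumed layout: one row per flow, the five fields of the five-tuple first, then the packet lengths joined by spaces in a single cell. The helper name write_sessions is hypothetical.

```python
import csv


def write_sessions(session_list, csv_path):
    """Write one flow per row: the five-tuple fields, then the packet lengths.

    The column layout is an assumption made for illustration.
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for five_tuple, pkt_sizes in session_list:
            # Join the lengths with spaces so that each flow occupies a single cell.
            lengths = " ".join(str(size) for size in pkt_sizes)
            writer.writerow(list(five_tuple) + [lengths])
```

Calling write_sessions(extractFlow(flow_list), "Chat.csv") for the flows of one class would then produce that class's file.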

    def cal_best_theta_value(self, ke, attri_list):
        data = []
        class_values = []
        # Separate feature values from class labels
        for i in attri_list:
            data.append(i[0])
            class_values.append(i[1])
        entropy_of_par_attr = entropy(class_values)
        max_info_gain = 0
        theta = 0
        best_index_left_list = []
        best_index_right_list = []
        class_labels_list_after_split = []
        # Sort the values in order to find the best threshold
        data.sort()
        # Iterate over the feature values
        for i in range(len(data) - 1):
            cur_theta = float(data[i] + data[i + 1]) / 2
            index_less_than_theta_list = []
            values_less_than_theta_list = []
            index_greater_than_theta_list = []
            values_greater_than_theta_list = []
            for zcy, zx in enumerate(attri_list):
                if zx[0] <= cur_theta:
                    values_less_than_theta_list.append(zx[1])
                    index_less_than_theta_list.append(zcy)
                else:
                    values_greater_than_theta_list.append(zx[1])
                    index_greater_than_theta_list.append(zcy)
            entropy_of_less_attribute = entropy(values_less_than_theta_list)
            entropy_of_greater_attribute = entropy(values_greater_than_theta_list)
            wlcl = entropy_of_par_attr - (entropy_of_less_attribute * (len(index_less_than_theta_list) / float(len(attri_list)))) \
                - (entropy_of_greater_attribute * (len(index_greater_than_theta_list) / float(len(attri_list))))
            if wlcl > max_info_gain:
                max_info_gain = wlcl
                theta = cur_theta
                best_index_left_list = index_less_than_theta_list
                best_index_right_list = index_greater_than_theta_list
                class_labels_list_after_split = values_less_than_theta_list + values_greater_than_theta_list
            '''
            Complete this part of the code to implement the following logic:
            1. Split the data at the current threshold
            2. Compute the entropy of each partition
            3. Compute the information gain of the current threshold and update the best threshold if needed
            '''
        return max_info_gain, theta, best_index_left_list, best_index_right_list, class_labels_list_after_split

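The cal_best_theta_value method computes the best threshold for a given feature by maximizing information gain. It takes two parameters: ke, the feature index, and attri_list, a list of tuples each holding a feature value and its class label. It returns the maximum information gain, the best threshold, the indices of the samples on each side of the threshold, and the class labels after the split.

The main idea of the completed code is as follows:

Iterate over the sorted values in data and take the mean of each pair of adjacent values as the candidate split threshold cur_theta. Each sample is assigned to the left or right subtree according to whether its attribute value is at most cur_theta or greater than it, which gives the class-label lists values_less_than_theta_list and values_greater_than_theta_list together with the corresponding sample-index lists index_less_than_theta_list and index_greater_than_theta_list. Then compute the entropies of the two subtrees, entropy_of_less_attribute and entropy_of_greater_attribute, and the information gain of the split, wlcl. If wlcl exceeds the current maximum max_info_gain, update max_info_gain, theta, best_index_left_list and best_index_right_list, and concatenate the left and right class-label lists to obtain class_labels_list_after_split.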
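A small worked example may make the threshold search concrete. The entropy helper called by cal_best_theta_value is not shown in this write-up, so the version below is an assumed Shannon-entropy implementation in bits; information_gain and the four sample pairs are likewise made up for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for an empty list."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(pairs, theta):
    """Gain of splitting (value, label) pairs into value <= theta and value > theta."""
    left = [label for value, label in pairs if value <= theta]
    right = [label for value, label in pairs if value > theta]
    parent = entropy([label for _, label in pairs])
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
    return parent - weighted


# Hypothetical feature values (e.g. a mean packet length) with their class labels.
pairs = [(60, "chat"), (80, "chat"), (900, "web"), (1400, "video")]
values = sorted(value for value, _ in pairs)

# Candidate thresholds are the midpoints of adjacent sorted values: 70.0, 490.0, 1150.0.
gains = {(lo + hi) / 2: information_gain(pairs, (lo + hi) / 2)
         for lo, hi in zip(values, values[1:])}
best_theta = max(gains, key=gains.get)
```

The parent entropy is 1.5 bits. Splitting at 490.0 isolates both chat samples and leaves an evenly mixed pair, for a gain of 1.0 bit, whereas the thresholds 70.0 and 1150.0 yield roughly 0.311 and 0.811 bits, so 490.0 is selected.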

2.4 Results

2.4.1 Task 1: processing the provided traffic data set

Chat.csv:

Chat-label.csv:

Video.csv:

Video.label.csv:

Web.csv:

Web-label.csv:

Only excerpts are shown above; see the corresponding files for full details.

2.4.2 Task 2: computing information gain

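Algorithm: the completed cal_best_theta_value function; see the algorithm description in Section 2.3 for the detailed explanation.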


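2.4.3 Task 3: traffic classification

The average accuracy is 0.93333 (3 rounds).

2.4.4 Task 4 (optional): optimization

During the experiment, the average accuracy was found to vary randomly from run to run. Over multiple runs, the highest average accuracy obtained was 0.93333 (3 rounds) and the lowest was 0.875:

Keeping the best of several runs could arguably be counted as a form of optimization.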
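Because the accuracy varies between runs, it can help to fix the random seed and report the spread over several rounds rather than a single figure. How the lab framework splits the data is not shown here, so the sketch below assumes a simple random hold-out split. repeated_holdout and its train_and_score callback are hypothetical names; the callback stands in for training the tree on train and returning its accuracy on test.

```python
import random
import statistics


def repeated_holdout(samples, train_and_score, rounds=3, test_ratio=0.3, seed=0):
    """Run several random train/test splits and summarise the accuracies.

    train_and_score(train, test) is a hypothetical callback returning an accuracy in [0, 1].
    """
    rng = random.Random(seed)  # a fixed seed makes every run produce the same splits
    scores = []
    for _ in range(rounds):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_ratio))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(train_and_score(train, test))
    return {"mean": statistics.mean(scores), "min": min(scores), "max": max(scores)}
```

With the seed fixed, two runs of the same code give identical splits, so a change in the reported mean reflects a change in the features or the model rather than a lucky split.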

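Experiment Summary

This experiment gave me an initial acquaintance with, and then a deeper understanding of, the ID3 decision tree algorithm, and reinforced material from Experiment 2 such as five-tuple extraction. At the start, installing the library dpkt_fix==1.7 kept failing; dpkt was installed instead, which did not affect the experimental process or results.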
