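Installing MinerU on CPU for Data Cleaning
1. Install Conda
This installation was done on CentOS 7. Data cleaning on a CPU is very slow, so a GPU is recommended for real workloads.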
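# Download Miniconda (Linux 64-bit, Python 3.10 build)
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
# Make the installer executable
chmod +x Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
# Install as root (pay attention to the installation path you choose)
./Miniconda3-py310_23.11.0-2-Linux-x86_64.sh
# Configure the environment variable: open ~/.bashrc and add the export line below
vim ~/.bashrc
export PATH="/root/miniconda3/bin:$PATH" # replace with your actual install path; this is the default location when installing as root
source ~/.bashrc
# Check the version
conda --version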
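If `conda --version` fails, the usual cause is that the PATH change in ~/.bashrc did not take effect. The following is a minimal sketch of the same check done from Python. The helper names `find_tool` and `tool_version` are made up for this illustration and are not part of Conda or MinerU.

```python
import shutil
import subprocess
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the absolute path of an executable found on PATH, or None."""
    return shutil.which(name)


def tool_version(name: str) -> Optional[str]:
    """Run `<name> --version` and return its output, or None if the tool is missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr instead of stdout.
    return (result.stdout or result.stderr).strip()


# Report where conda was found, or hint at the likely cause when it is missing.
print(tool_version("conda") or "conda not found on PATH; re-check ~/.bashrc")
```

The same helper can be pointed at `magic-pdf` later, although that tool prints its version with `-v` rather than `--version`, so only `find_tool` applies there.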
2. Install magic-pdf
conda create -n MinerU python=3.10
conda activate MinerU
pip install --upgrade pip setuptools wheel
pip install --prefer-binary simsimd
# Install on CentOS 7
pip install -U "magic-pdf[full,old_linux]" --extra-index-url https://wheels.myhloli.com
# Check the magic-pdf version
magic-pdf -v
# Other operating systems (optional)
pip install -U "magic-pdf[full]" --extra-index-url https://wheels.myhloli.com
3. Download the model weight files
pip install modelscope
wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
python download_models.py
# To process doc, docx, ppt, and pptx files (optional)
yum install libreoffice
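4. Edit the configuration file for additional settings
# Edit the file: vim /root/magic-pdf.json
# If the JSON lacks any of the items below, add them manually and delete the comments (standard JSON does not support comments). Items that already exist need no change.
{
    // other config
    "layout-config": {
        "model": "doclayout_yolo" // change to "layoutlmv3" to use layoutlmv3
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": true // formula recognition is on by default; set this to false to turn it off
    },
    "table-config": {
        "model": "rapid_table", // the default is "rapid_table"; "tablemaster" and "struct_eqtable" are also available
        "sub_model": "slanet_plus", // when model is "rapid_table" you can choose the sub_model: "slanet_plus" or "unitable"
        "enable": true, // table recognition is on by default; set this to false to turn it off
        "max_time": 400
    }
}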
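Adding the missing items by hand is error-prone, so the step can also be scripted. The sketch below fills in only the keys that are absent and never overwrites a value already in the file. It assumes the file is already valid JSON, meaning it has no comments. `DEFAULTS` is copied from the fragment above, and the function names are hypothetical helpers written for this illustration, not part of magic-pdf.

```python
import copy
import json
from pathlib import Path

# Default values taken from the configuration fragment in this section.
DEFAULTS = {
    "layout-config": {"model": "doclayout_yolo"},
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": True,
    },
    "table-config": {
        "model": "rapid_table",
        "sub_model": "slanet_plus",
        "enable": True,
        "max_time": 400,
    },
}


def merge_missing(config: dict, defaults: dict) -> dict:
    """Return a copy of config with absent keys filled in from defaults.

    Values already present in config are kept unchanged.
    """
    merged = dict(config)
    for key, value in defaults.items():
        if key not in merged:
            merged[key] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(merged[key], dict):
            merged[key] = merge_missing(merged[key], value)
    return merged


def update_config_file(path: Path) -> None:
    """Read a JSON config file, add missing defaults, and write it back."""
    config = json.loads(path.read_text(encoding="utf-8"))
    merged = merge_missing(config, DEFAULTS)
    path.write_text(
        json.dumps(merged, indent=4, ensure_ascii=False), encoding="utf-8"
    )
```

Calling `update_config_file(Path("/root/magic-pdf.json"))` would apply it to the file used in this article. Back up the file first, because `json.loads` raises an error if comments are still present, and the rewrite normalizes the file's indentation.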
5. Usage
5.1 Using the command line
# Check the version
magic-pdf -v
# Convert a PDF
magic-pdf -p 非线性成长.pdf -o /home/big/MinerU -m auto
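The command above handles one file at a time. To clean a whole folder of PDFs, the same call can be wrapped in a loop. This is a minimal sketch that uses only the `-p`, `-o`, and `-m` flags shown above. `build_command` and `convert_all` are hypothetical helper names, and the sketch assumes that `magic-pdf` is on PATH and exits with a non-zero code on failure.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(pdf: Path, out_dir: Path, method: str = "auto") -> List[str]:
    """Build the magic-pdf command line for a single PDF."""
    return ["magic-pdf", "-p", str(pdf), "-o", str(out_dir), "-m", method]


def convert_all(src_dir: Path, out_dir: Path) -> List[Path]:
    """Run magic-pdf on every *.pdf in src_dir and return the files that failed."""
    failed = []
    for pdf in sorted(src_dir.glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir), check=False)
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

For example, `convert_all(Path("pdfs"), Path("/home/big/MinerU"))` would process every PDF in a local `pdfs` directory (an assumed name) one after another. Sequential processing is a deliberate choice here, since CPU inference is already slow and parallel runs would compete for the same cores.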
5.2 Calling MinerU from Python code
Convert a PDF:
import os

from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
from magic_pdf.data.dataset import PymuDocDataset
from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
from magic_pdf.config.enums import SupportedPdfParseMethod

# args
pdf_file_name = "abc.pdf"  # replace with the real pdf path
# strip the directory and only the final extension, so "./a.b.pdf" becomes "a.b"
name_without_suff = os.path.splitext(os.path.basename(pdf_file_name))[0]

# prepare env
local_image_dir, local_md_dir = "output/images", "output"
image_dir = str(os.path.basename(local_image_dir))
os.makedirs(local_image_dir, exist_ok=True)
image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(local_md_dir)

# read bytes
reader1 = FileBasedDataReader("")
pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content

# proc
## Create Dataset Instance
ds = PymuDocDataset(pdf_bytes)

## inference
if ds.classify() == SupportedPdfParseMethod.OCR:
    infer_result = ds.apply(doc_analyze, ocr=True)
    ## pipeline
    pipe_result = infer_result.pipe_ocr_mode(image_writer)
else:
    infer_result = ds.apply(doc_analyze, ocr=False)
    ## pipeline
    pipe_result = infer_result.pipe_txt_mode(image_writer)

### draw model result on each page
infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))

### get model inference result
model_inference_result = infer_result.get_infer_res()

### draw layout result on each page
pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))

### draw spans result on each page
pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))

### get markdown content
md_content = pipe_result.get_markdown(image_dir)

### dump markdown
pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)

### get content list content
content_list_content = pipe_result.get_content_list(image_dir)

### dump content list
pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)

### get middle json
middle_json_content = pipe_result.get_middle_json()

### dump middle json
pipe_result.dump_middle_json(md_writer, f"{name_without_suff}_middle.json")
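The script writes six files per PDF into the output directory, all named after the input file. The sketch below lists them for a given path, which helps when checking that a run completed. `output_names` is a hypothetical helper, not part of magic-pdf, and it derives the base name with `os.path.splitext` so that directory parts and extra dots in the file name are handled.

```python
import os
from typing import Dict


def output_names(pdf_path: str) -> Dict[str, str]:
    """Return the file names the conversion script writes for one PDF."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    return {
        "model_pdf": f"{stem}_model.pdf",        # model result drawn on each page
        "layout_pdf": f"{stem}_layout.pdf",      # layout result drawn on each page
        "spans_pdf": f"{stem}_spans.pdf",        # spans result drawn on each page
        "markdown": f"{stem}.md",                # markdown content
        "content_list": f"{stem}_content_list.json",
        "middle_json": f"{stem}_middle.json",
    }


def missing_outputs(pdf_path: str, out_dir: str = "output") -> list:
    """Return the expected output files that do not exist in out_dir."""
    return [
        name
        for name in output_names(pdf_path).values()
        if not os.path.exists(os.path.join(out_dir, name))
    ]
```

An empty list from `missing_outputs("abc.pdf")` would indicate that every expected file is present in `output`.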
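6. Problems encountered
Compiling simsimd fails on CentOS 7. Install it with the following commands instead, so that pip prefers a prebuilt binary package over building from source:
pip install --upgrade pip setuptools wheel
pip install --prefer-binary simsimd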