如何在Python中从PDF中提取图像（代码实现示例） Python如何从PDF中提取图像|Pyt

本文带你了解如何使用 PyMuPDF 和 Pillow 库在 Python 中从 PDF 文件中提取和保存图像。
【如何在Python中从PDF中提取图像（代码实现示例）】Python如何从PDF中提取图像？在本教程中，我们将编写一个 Python 代码，使用PyMuPDF和Pillow库从 PDF 文件中提取图像并将它们保存在本地磁盘上。
Python从PDF中提取图像代码示例：使用 PyMuPDF，你可以访问 PDF、XPS、OpenXPS、epub 和许多其他扩展。它应该可以在所有平台上运行，包括 Windows、Mac OSX 和 Linux。
让我们和 Pillow 一起安装它：

pip3 install PyMuPDF Pillow

打开一个新的 Python 文件，让我们开始吧。首先，让我们导入库：

import fitz # PyMuPDF import io from PIL import Image

我将用这个 PDF 文件来测试这个，但是你可以随意携带 PDF 文件并将它放在你当前的工作目录中，让我们将它加载到库中：

# file path you want to extract images from file = "1710.05006.pdf" # open the file pdf_file = fitz.open(file)

由于我们想从所有页面中提取图像，我们需要遍历所有可用页面并获取每个页面上的所有图像对象，以下Python从PDF中提取图像代码示例执行此操作：

# iterate over PDF pages for page_index in range(len(pdf_file)): # get the page itself page = pdf_file[ page_index] image_list = page.getImageList() # printing number of images found in this page if image_list: print(f"[ +] Found a total of {len(image_list)} images in page {page_index}") else: print("[ !] No images found on page", page_index) for image_index, img in enumerate(page.getImageList(), start=1): # get the XREF of the image xref = img[ 0] # extract the image bytes base_image = pdf_file.extractImage(xref) image_bytes = base_image[ "image"] # get the image extension image_ext = base_image[ "ext"] # load it to PIL image = Image.open(io.BytesIO(image_bytes)) # save it to local disk image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))

相关：如何在 Python 中将 PDF 转换为图像。
Python如何从PDF中提取图像？我们使用getImageList()方法将所有可用的图像对象作为该特定页面上的元组列表列出。要获取图像对象索引，我们只需获取返回的元组的第一个元素。
之后，我们使用extractImage()以字节为单位返回图像以及图像扩展名等附加信息的方法。
最后，我们将图像字节转换为 PIL 图像实例，并使用save()接受文件指针作为参数的方法将其保存到本地磁盘，我们只是使用相应的页面和图像索引命名图像。
运行脚本后，我得到以下输出：

[ !] No images found on page 0 [ +] Found a total of 3 images in page 1 [ +] Found a total of 3 images in page 2 [ !] No images found on page 3 [ !] No images found on page 4

图像也保存在当前目录中：