How to Extract All PDF Links in Python (with Code Examples)

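This article shows how to extract links and URLs from PDF files in Python using the pikepdf and PyMuPDF libraries.
Do you want to extract the URLs in a particular PDF file? If so, you are in the right place. In this tutorial, we will use the pikepdf and PyMuPDF libraries in Python to extract all links from PDF files.
We will use two methods to get links from a PDF file. The first extracts annotations, that is, the markups, notes and comments that you can click in a regular PDF reader and that redirect to your browser. The second extracts all the raw text and uses a regular expression to parse URLs out of it.
First, let's install the libraries: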

pip3 install pikepdf PyMuPDF

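Method 1: Extracting URLs Using Annotations
In this technique, we use the pikepdf library to open a PDF file, iterate over the annotations on each page, and check whether each one carries a URL: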

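import pikepdf  # pip3 install pikepdf

file = "1810.04805.pdf"
# file = "1710.05006.pdf"
pdf_file = pikepdf.Pdf.open(file)
urls = []
# iterate over PDF pages
for page in pdf_file.pages:
    annots = page.get("/Annots")
    if annots is None:
        # this page has no annotations at all
        continue
    for annot in annots:
        # a link annotation stores its target in the /A (action) dictionary
        action = annot.get("/A")
        uri = action.get("/URI") if action is not None else None
        if uri is not None:
            print("[+] URL Found:", uri)
            urls.append(uri)
print("[*] Total URLs extracted:", len(urls))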

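I am testing with the PDF file named in the code (1810.04805.pdf), but feel free to use any PDF file of your choice; just make sure it has some clickable links.
After running the code, I get the following output: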

[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://github.com/google-research/bert
[+] URL Found: https://gluebenchmark.com/faq
[+] URL Found: https://gluebenchmark.com/leaderboard
...<SNIPPED>...
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 30

Great, we have successfully extracted 30 URLs from that PDF paper.
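The output above lists some URLs more than once, because the same link can be attached to several places in the document. If you only need each address once, a minimal sketch like the following may help. The helper name unique_urls and the sample list are hypothetical, and the sketch assumes that calling str() on each extracted item yields the URL text:

```python
def unique_urls(urls):
    """Return the URLs as plain strings, without duplicates, in first-seen order."""
    # dict.fromkeys keeps only the first occurrence of each key and preserves order
    return list(dict.fromkeys(str(u) for u in urls))


# hypothetical sample standing in for the `urls` list built above
sample = [
    "https://github.com/google-research/bert",
    "https://github.com/google-research/bert",
    "https://gluebenchmark.com/faq",
]
print(unique_urls(sample))
```

A dict is used instead of a set so that the URLs stay in the order in which they appear in the PDF.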
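Method 2: Extracting URLs Using Regular Expressions
In this section, we extract all the raw text from the PDF file and then use a regular expression to parse URLs. First, let's get the text version of the PDF: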

import re

import fitz  # pip3 install PyMuPDF

# a regular expression of URLs
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
# extract raw text from pdf
file = "1810.04805.pdf"
# file = "1710.05006.pdf"
# open the PDF file
with fitz.open(file) as pdf:
    text = ""
    for page in pdf:
        # extract text of each PDF page
        # (get_text() is the current name of the older getText())
        text += page.get_text()

Now text is the target string we want to parse URLs from, so let's use the re module to find them:

urls = []
# extract all urls using the regular expression
for match in re.finditer(url_regex, text):
    url = match.group()
    print("[+] URL Found:", url)
    urls.append(url)
print("[*] Total URLs extracted:", len(urls))

Output:

[+] URL Found: https://github.com/
[+] URL Found: https://github.com/tensor
[+] URL Found: http://nlp.seas.harvard.edu/2018/04/03/attention.html
[+] URL Found: https://gluebenchmark.com/faq.
[+] URL Found: https://gluebenchmark.com/leaderboard).
[+] URL Found: https://gluebenchmark.com/leaderboard
[+] URL Found: https://cloudplatform.googleblog.com/2018/06/Cloud-
[+] URL Found: https://gluebenchmark.com/
[+] URL Found: https://gluebenchmark.com/faq
[*] Total URLs extracted: 9
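The loop above uses re.finditer with match.group() rather than re.findall, and there is a practical reason for that: the pattern contains capturing groups, so re.findall would return tuples of the groups instead of the full URLs. The sketch below illustrates the difference with a copy of the tutorial's url_regex (written without spaces inside the character classes) and a made-up sample_text:

```python
import re

# copy of the tutorial's pattern; note the two capturing groups, (www\.)? and the trailing path group
url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=\n]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"

# made-up sample text for illustration
sample_text = (
    "Code at https://github.com/google-research/bert "
    "and FAQ at https://gluebenchmark.com/faq"
)

# findall returns one tuple per match, holding only the captured groups
groups_only = re.findall(url_regex, sample_text)
print(groups_only)

# finditer yields match objects; group() with no argument is the whole match
full_urls = [m.group() for m in re.finditer(url_regex, sample_text)]
print(full_urls)
```

Rewriting the groups as non-capturing ones, (?:...), would be another way to make re.findall return whole URLs.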

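Conclusion
This time we extracted only 9 URLs from the same PDF file. That does not mean the second method is inaccurate: it only picks up URLs that are written out in the text, not links whose target is stored only in a clickable annotation.
However, this method has a limitation: a URL can wrap across lines in the extracted text, as the truncated https://cloudplatform.googleblog.com/2018/06/Cloud- in the output shows. The url_regex expression allows a newline (\n) only in the host part, so you may want to allow it in the path part as well.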
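One possible way to cope with wrapped URLs, and with the trailing "." and ")." visible in the output, is to tidy the text before and after matching. The following is only a sketch under stated assumptions: URL_RE is a simplified pattern rather than the tutorial's url_regex, extract_wrapped_urls is a hypothetical helper, the sample string is made up, and the line-joining rule is a heuristic that will not suit every PDF:

```python
import re

# simplified URL pattern for this sketch (an assumption, not the tutorial's url_regex)
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_wrapped_urls(text):
    """Find URLs in text, re-joining ones that wrap after a '-' or '/'."""
    # heuristic: a line ending in "-" or "/" is assumed to continue on the next line
    joined = re.sub(r"(?<=[-/])\n", "", text)
    urls = []
    for match in URL_RE.finditer(joined):
        # drop trailing punctuation that most likely belongs to the sentence
        urls.append(match.group().rstrip(".,;:)"))
    return urls


# made-up sample: one URL wrapped after a hyphen, both followed by punctuation
sample = (
    "See https://example.com/2018/06/Cloud-\n"
    "TPU-now.html). More at https://example.com/faq."
)
print(extract_wrapped_urls(sample))
```

Both steps trade accuracy for simplicity: joining lines can glue a URL to unrelated text on the next line when the URL happens to end at a "-" or "/", and stripping ")" would damage a URL that legitimately ends with a closing parenthesis.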
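To sum up, of the two methods above, the first is the better choice if you want the clickable URLs in a PDF. If you want the URLs that appear as plain text, the second can help you get them.
If you want to extract tables or images from PDFs, there are related tutorials: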