非结构化文件加载器
本教程介绍了如何使用Unstructured来加载多种类型的文件。目前,Unstructured支持加载文本文件、幻灯片、html、pdf、图像等。
# # Install package
!pip install "unstructured[local-inference]"
!pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
!pip install layoutparser[layoutmodels,tesseract]
# # Install other dependencies
# # https://github.com/Unstructured-IO/unstructured/blob/main/docs/source/installing.rst
# !brew install libmagic
# !brew install poppler
# !brew install tesseract
# # If parsing xml / html documents:
# !brew install libxml2
# !brew install libxslt
# import nltk
# nltk.download('punkt')
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")
docs = loader.load()
docs[0].page_content[:400]
'Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.\n\nLast year COVID-19 kept us apart. This year we are finally together again.\n\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.\n\nWith a duty to one another to the American people to the Constit'
保留元素 Retain Elements#
在内部,Unstructured创建不同的“元素”以适配不同的文本块。默认情况下,我们将它们组合在一起,但您可以通过指定 mode =“elements”
轻松保持这种分离。
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt", mode="elements")
docs = loader.load()
docs[:5]
[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='Last year COVID-19 kept us apart. This year we are finally together again.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='With a duty to one another to the American people to the Constitution.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0),
Document(page_content='And with an unwavering resolve that freedom will always triumph over tyranny.', lookup_str='', metadata={'source': '../../state_of_the_union.txt'}, lookup_index=0)]
定义分区策略 #
非结构化文档加载器允许用户传入一个 strategy
参数,让 unstructured
知道如何对文档进行分区。当前支持的策略是 "hi_res" (默认)和 "fast" 。高分辨率分区策略更准确,但处理时间更长。
快速策略可以更快地对文档进行分区,但会牺牲准确性。并非所有文档类型都有单独的高分辨率和快速分区策略。对于那些文档类型, strategy kwarg 被忽略。
在某些情况下,如果缺少依赖项(即文档分区模型),高分辨率策略将回退到快速。
您可以在下面看到如何将策略应用于 UnstructuredFileLoader
。
from langchain.document_loaders import UnstructuredFileLoader
loader = UnstructuredFileLoader("layout-parser-paper-fast.pdf", strategy="fast", mode="elements")
docs = loader.load()
docs[:5]
[Document(page_content='1', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='0', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='2', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'UncategorizedText'}, lookup_index=0),
Document(page_content='n', lookup_str='', metadata={'source': 'layout-parser-paper-fast.pdf', 'filename': 'layout-parser-paper-fast.pdf', 'page_number': 1, 'category': 'Title'}, lookup_index=0)]
PDF Example#
处理 PDF 文档的方式完全相同。
Unstructured 检测文件类型并提取相同类型的 elements
。
!wget https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/example-docs/layout-parser-paper.pdf -P "../../"
loader = UnstructuredFileLoader("./example_data/layout-parser-paper.pdf", mode="elements")
docs = loader.load()
docs[:5]
[Document(page_content='LayoutParser : A Unified Toolkit for Deep Learning Based Document Image Analysis', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Zejiang Shen 1 ( (ea)\n ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and Weining Li 5', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Allen Institute for AI shannons@allenai.org', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Brown University ruochen zhang@brown.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0),
Document(page_content='Harvard University { melissadell,jacob carlson } @fas.harvard.edu', lookup_str='', metadata={'source': '../../layout-parser-paper.pdf'}, lookup_index=0)]
Unstructured API
如果您想以较少的设置启动并运行,只需运行 pip install unstructured
并使用 UnstructuredAPIFileLoader
或 UnstructuredAPIFileIOLoader
。
这将使用托管的非结构化 API 处理您的文档。请注意,目前(截至 2023 年 5 月 11 日)非结构化 API 是开放的,但很快将需要一个 API。
非结构化文档页面将提供有关如何生成 API 密钥的说明。
如果您想自行托管非结构化 API 或在本地运行,请查看此处 (opens in a new tab)的说明。
from langchain.document_loaders import UnstructuredAPIFileLoader
filenames = ["example_data/fake.docx", "example_data/fake-email.eml"]
loader = UnstructuredAPIFileLoader(
file_path=filenames[0],
api_key="FAKE_API_KEY",
)
docs = loader.load()
docs[0]
Document(page_content='Lorem ipsum dolor sit amet.', metadata={'source': 'example_data/fake.docx'})
您还可以使用 UnstructuredAPIFileLoader
在单个 API 中通过非结构化 API 批处理多个文件。
loader = UnstructuredAPIFileLoader(
file_path=filenames,
api_key="FAKE_API_KEY",
)
docs = loader.load()
docs[0]
Document(page_content='Lorem ipsum dolor sit amet.\n\nThis is a test email to use for unit tests.\n\nImportant points:\n\nRoses are red\n\nViolets are blue', metadata={'source': ['example_data/fake.docx', 'example_data/fake-email.eml']})