Tika MS Office File Extraction Example

Article overview

  • Tika OOXMLParser constructors
  • OOXMLParser example
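To extract Microsoft Office files such as Excel spreadsheets, Tika provides the OOXMLParser class. The class extracts content and metadata from Office Open XML documents (for example .docx, .xlsx, and .pptx). It lives in the org.apache.tika.parser.microsoft.ooxml package and provides the constructor and methods listed in the tables below.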
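Tika OOXMLParser constructors

Constructor                Description
public OOXMLParser()       Instantiates the parser.

Method                     Description
public Set<MediaType> getSupportedTypes(ParseContext context)
                           Returns the set of media types supported by this parser.
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException
                           Parses the document stream into a sequence of XHTML SAX events.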
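
To see what getSupportedTypes returns in practice, the following is a minimal sketch that prints the media types reported by an OOXMLParser instance. The class name SupportedTypesSketch and the helper supportedTypeNames are made up for this illustration, and the sketch assumes the Tika parser libraries are on the classpath. The exact list depends on the Tika version in use.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;

// Hypothetical helper class, not part of Tika.
final class SupportedTypesSketch {

    // Returns the supported media types as sorted strings so the output is stable.
    static List<String> supportedTypeNames() {
        OOXMLParser parser = new OOXMLParser();
        Set<MediaType> types = parser.getSupportedTypes(new ParseContext());
        return types.stream()
                .map(MediaType::toString)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Print one media type per line.
        for (String name : supportedTypeNames()) {
            System.out.println(name);
        }
    }
}
```

If a file's detected type is not in this list, OOXMLParser is probably not the right parser for it.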
OOXMLParser example
package tikaexample;

import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MSOfficeExample {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        // Collects the text of the document body.
        BodyContentHandler handler = new BodyContentHandler();
        OOXMLParser parser = new OOXMLParser();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        // Load the spreadsheet from the classpath, relative to this class's package.
        try (InputStream stream = MSOfficeExample.class.getResourceAsStream("srcmini.xls")) {
            parser.parse(stream, handler, metadata, pcontext);
            System.out.println("Document Content:" + handler.toString());
            System.out.println("Document Metadata:");
            // Print every metadata field as name:value.
            String[] metadatas = metadata.names();
            for (String data : metadatas) {
                System.out.println(data + ":" + metadata.get(data));
            }
        } catch (Exception e) {
            System.out.println("Exception message: " + e.getMessage());
        }
    }
}

Our file contains the following content.

(Image: contents of the input spreadsheet)
Output
Document Content:Sheet1
Employee Manual Punch In Time Out Time Device Total Minute Total Time Working Minutes
01-Nov-17 8:27:00 AM 01-Nov-17 6:30:00 PM 1 603 540 -63
02-Nov-17 8:09:00 AM 02-Nov-17 6:30:00 PM 1 621 540 -81
03-Nov-17 8:25:00 AM 03-Nov-17 6:30:00 PM 1 605 540 -65
Document Metadata:
date:2018-05-06T11:20:06Z
cp:revision:1
custom:DocSecurity:0
dc:creator:Reception
dcterms:created:2017-12-03T08:38:57Z
language:en-IN
Last-Modified:2018-05-06T11:20:06Z
dcterms:modified:2018-05-06T11:20:06Z
Last-Save-Date:2018-05-06T11:20:06Z
Template:
protected:false
meta:save-date:2018-05-06T11:20:06Z
Application-Name:LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
modified:2018-05-06T11:20:06Z
custom:LinksUpToDate:false
Content-Type:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
creator:Reception
dc:language:en-IN
meta:author:Reception
meta:creation-date:2017-12-03T08:38:57Z
extended-properties:Application:LibreOffice/5.1.6.2$Linux_X86_64 LibreOffice_project/10m0$Build-2
custom:ShareDoc:false
custom:ScaleCrop:false
Creation-Date:2017-12-03T08:38:57Z
custom:HyperlinksChanged:false
Revision-Number:1
extended-properties:Template:
custom:AppVersion:12.0000
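
The Content-Type in the output is an Office Open XML spreadsheet type, even though the input file is named srcmini.xls. Office Open XML files are ZIP packages, while legacy binary Office files use the OLE2 compound file container. The following sketch, which uses only the Java standard library, checks the first bytes of a file to tell the two containers apart. The class name OfficeContainerSniffer and its methods are hypothetical, and the check only identifies the container, not the specific document type. It requires Java 9 or later for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper class, not part of Tika.
final class OfficeContainerSniffer {

    // ZIP local file header signature ("PK\003\004"), used by OOXML packages.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // OLE2 compound file signature, used by legacy binary Office files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Classifies the leading bytes of a file as "zip", "ole2", or "unknown".
    static String sniff(byte[] header) {
        if (startsWith(header, ZIP_MAGIC)) {
            return "zip";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "ole2";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // The default file name is taken from the example above.
        Path path = Paths.get(args.length > 0 ? args[0] : "srcmini.xls");
        byte[] header = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(path)) {
            read = in.readNBytes(header, 0, header.length);
        }
        System.out.println(path + ": " + sniff(Arrays.copyOf(header, read)));
    }
}
```

A result of "zip" is consistent with a file that OOXMLParser can handle. A result of "ole2" suggests a legacy binary file that needs a different parser.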
