How to Parse and Read Complex Word, Excel, and PPT Files on Android

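A while ago I was trying to build a universal player for Android, an app that can open files in all sorts of formats, and that naturally included the most common Office formats. After some research I found that the traditional way to parse Word and Excel files directly on Android relies on third-party Java libraries such as tm-extractors-0.4.jar and jxl.jar. The code and screenshots follow.
Reading Word uses the tm-extractors-0.4.jar package. The code is as follows: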

package com.example.readword;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;

import org.textmining.text.extraction.WordExtractor;

import android.app.Activity;
import android.os.Bundle;
import android.os.Environment;
import android.widget.TextView;

public class MainActivity extends Activity {
    private TextView text;

    /** Called when the activity is first created. */
    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        text = (TextView) findViewById(R.id.text);
        String str = readWord("/storage/emulated/0/ArcGIS/localtilelayer/11.doc");
        // readWord returns null when extraction fails, so guard before trimming
        if (str != null) {
            text.setText(str.trim());
        }
    }

    public String readWord(String file) {
        String text = null;
        // Open an input stream on the doc file; it is closed automatically
        try (FileInputStream in = new FileInputStream(new File(file))) {
            // Create the WordExtractor
            WordExtractor extractor = new WordExtractor();
            // Extract the text of the doc file
            text = extractor.extractText(in);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return text;
    }
}


The result looks like this:

[Figure: screenshot of the result]




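This is just a document downloaded at random from the web. As we can see, the text can be read, but the formatting comes out poorly; only doc is supported, not docx, and images inside the doc cannot be read. In addition, once the doc has been opened with WPS, it can no longer be read (a disaster for me, since WPS is the only office suite I have installed).
Next is the code that reads Excel with jxl. It is not very complete: it only does the parsing, extracting every row and column of the sheet, which you can then rearrange yourself. The code is as follows: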

package com.readexl;

import java.io.FileInputStream;
import java.io.InputStream;

import android.os.Bundle;
import android.os.Environment;
import android.app.Activity;
import android.text.method.ScrollingMovementMethod;
import android.view.Menu;
import android.widget.TextView;

import jxl.*;

public class MainActivity extends Activity {
    TextView txt = null;
    public String filePath_xls = Environment.getExternalStorageDirectory() + "/case.xls";

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        txt = (TextView) findViewById(R.id.txt_show);
        txt.setMovementMethod(ScrollingMovementMethod.getInstance());
        readExcel();
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(R.menu.main, menu);
        return true;
    }

    public void readExcel() {
        // To consider later: reading images and other data types stored in the Excel file
        try (InputStream is = new FileInputStream(filePath_xls)) {
            // Workbook book = Workbook.getWorkbook(new File("mnt/sdcard/test.xls"));
            Workbook book = Workbook.getWorkbook(is);
            int num = book.getNumberOfSheets();
            txt.setText("the num of sheets is " + num + "\n");
            // Get the first sheet
            Sheet sheet = book.getSheet(0);
            int rows = sheet.getRows();
            int cols = sheet.getColumns();
            txt.append("the name of sheet is " + sheet.getName() + "\n");
            txt.append("total rows is " + rows + "\n");
            txt.append("total cols is " + cols + "\n");
            for (int i = 0; i < cols; ++i) {
                for (int j = 0; j < rows; ++j) {
                    // getCell(col, row) returns the cell; getContents() gives its value as text
                    txt.append("contents:" + sheet.getCell(i, j).getContents() + "\n");
                }
            }
            book.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}


The result looks like this:

[Figure: screenshot of the result]

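Admittedly this is only half-finished, but the approach definitely works.
All of the above boils down to this: I am not really satisfied with either approach. Before moving on, let me explain the difference between doc and docx (xls vs. xlsx and ppt vs. pptx differ in the same way).
As is well known, doc is the format saved by Word 2003 and earlier, and docx is the format saved by Word 2007 and later. Put simply, doc still uses Microsoft's binary storage format, whereas docx is XML-based: a docx file is actually a ZIP archive. Unpacking a doc yields only file fragments without extensions, while unpacking a docx yields an XML file and several folders containing the document's information. The upshot is that docx files are smaller and their images are easier to read. (Reference: http://www.zhihu.com/question/21547795)
Back to the point. How can we parse Word, Excel, and similar files while preserving the original formatting and also extracting the images, tables, and attachments inside them? The answer is HTML. HTML is undeniably powerful at rendering pages and tables, and such results are hard to produce with native views. I went through a lot of material and other people's code online, then modified it on that basis, and the result is fairly good.
The library used is POI (a powerful set of packages that can parse nearly all Office formats; doc, docx, xls, and xlsx are used as examples here).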
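The binary-versus-ZIP distinction described above can be checked from the first few bytes of a file. The following is a minimal sketch, not part of the original code: the class name `OfficeContainerSniffer` and its enum are hypothetical. It relies only on the well-known OLE2 compound-file signature used by legacy doc/xls/ppt files and the ZIP local-file-header signature. Note that a ZIP signature only says the file is some ZIP archive; it does not prove the file is a valid docx.

```java
/**
 * Hypothetical helper (not from the original article): guesses the container
 * type of an Office file from its leading bytes.
 */
final class OfficeContainerSniffer {

    enum Container { OLE2_BINARY, ZIP_PACKAGE, UNKNOWN }

    // Signature of an OLE2 compound file (legacy .doc/.xls/.ppt).
    private static final int[] OLE2_MAGIC =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    // Local file header signature of a ZIP archive: "PK" 0x03 0x04.
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a file given its first bytes (at least 8 are needed for OLE2). */
    static Container sniff(byte[] head) {
        if (head == null) {
            return Container.UNKNOWN;
        }
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2_BINARY;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP_PACKAGE;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            // Mask to compare as unsigned byte values.
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In practice the caller would read the first 8 bytes of the file from a `FileInputStream` and pass them to `sniff`; this catches files whose extension does not match their real format.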


After the file is read, it is handled according to its type.

public void read() {
    if (!myFile.exists()) {
        if (this.nameStr.endsWith(".doc")) {
            this.getRange();
            this.makeFile();
            this.readDOC();
        }
        if (this.nameStr.endsWith(".docx")) {
            this.makeFile();
            this.readDOCX();
        }
        if (this.nameStr.endsWith(".xls")) {
            try {
                this.makeFile();
                this.readXLS();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        if (this.nameStr.endsWith(".xlsx")) {
            try {
                this.makeFile();
                this.readXLSX();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
    returnPath = "file:///" + myFile;
    // this.view.loadUrl("file:///" + this.htmlPath);
    System.out.println("htmlPath" + this.htmlPath);
}
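The `endsWith` checks in `read()` are case-sensitive, so a file named `REPORT.DOCX` would match none of the branches. One possible way to make the dispatch more robust is sketched below. This is an assumption-laden illustration, not the author's code: the class `OfficeFileType` and its `Kind` enum are hypothetical names introduced here.

```java
import java.util.Locale;

/** Hypothetical helper: maps a file name to the kind of reader it needs. */
final class OfficeFileType {

    enum Kind { DOC, DOCX, XLS, XLSX, UNSUPPORTED }

    /** Returns the kind implied by the extension, ignoring letter case. */
    static Kind fromName(String fileName) {
        if (fileName == null) {
            return Kind.UNSUPPORTED;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Kind.UNSUPPORTED; // no extension at all
        }
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc":
                return Kind.DOC;
            case "docx":
                return Kind.DOCX;
            case "xls":
                return Kind.XLS;
            case "xlsx":
                return Kind.XLSX;
            default:
                return Kind.UNSUPPORTED;
        }
    }
}
```

A `switch` on the returned `Kind` could then call `readDOC()`, `readDOCX()`, `readXLS()`, or `readXLSX()` in exactly one branch, instead of testing four independent `if` conditions.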

First, the shared method, which mainly sets where the generated HTML file is saved:

public void makeFile() {
    // Query the state of external storage
    String sdStateString = android.os.Environment.getExternalStorageState();
    // MEDIA_MOUNTED means the SD card (external storage) is present and mounted
    if (sdStateString.equals(android.os.Environment.MEDIA_MOUNTED)) {
        try {
            // Root directory of external storage
            File sdFile = android.os.Environment.getExternalStorageDirectory();
            // Absolute path of the "library" folder on external storage
            String path = sdFile.getAbsolutePath() + File.separator + "library";
            File dirFile = new File(path);
            if (!dirFile.exists()) { // create the folder if it does not exist
                dirFile.mkdir();
            }
            // Path of the generated HTML file
            File myFile = new File(path + File.separator + filename + ".html");
            if (!myFile.exists()) { // create the file if it does not exist
                myFile.createNewFile();
            }
            this.htmlPath = myFile.getAbsolutePath(); // remember the path
        } catch (Exception e) {
            e.printStackTrace(); // do not swallow the failure silently
        }
    }
}
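`makeFile()` ignores the return value of `mkdir()` and would fail if a parent folder were missing. As a rough sketch of a stricter variant, written in plain Java so it can be tried outside Android, the hypothetical helper `HtmlOutputFiles.prepare` below takes the base directory as a parameter (on a device this would be the external storage directory) and reports failures through an `IOException`. The names are illustrative assumptions, not part of the original project.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical helper: prepares the target HTML file under baseDir/folderName. */
final class HtmlOutputFiles {

    /**
     * Ensures the folder exists and returns the (possibly newly created) file
     * baseDir/folderName/baseName.html.
     */
    static File prepare(File baseDir, String folderName, String baseName)
            throws IOException {
        File dir = new File(baseDir, folderName);
        // mkdirs() also creates missing parents; it returns false if the
        // directory already exists, hence the second check.
        if (!dir.mkdirs() && !dir.isDirectory()) {
            throw new IOException("Cannot create directory: " + dir);
        }
        File html = new File(dir, baseName + ".html");
        // createNewFile() returns false when the file already exists,
        // which is fine here; it throws IOException on real errors.
        html.createNewFile();
        return html;
    }
}
```

Calling `prepare` twice with the same arguments returns the same path and leaves an existing file untouched, which matches the create-if-missing behaviour of `makeFile()`.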

Then reading doc:

private void getRange() {
    try (FileInputStream in = new FileInputStream(nameStr)) {
        POIFSFileSystem pfs = new POIFSFileSystem(in);
        hwpf = new HWPFDocument(pfs);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        return; // nothing to read; avoid a NullPointerException below
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }
    range = hwpf.getRange();
    pictures = hwpf.getPicturesTable().getAllPictures();
    tableIterator = new TableIterator(range);
}

public void readDOC() {
    try {
        myFile = new File(htmlPath);
        output = new FileOutputStream(myFile);
        presentPicture = 0;
        String head = "<html><meta charset=\"utf-8\"><body>";
        String tagBegin = "<p>";
        String tagEnd = "</p>";
        output.write(head.getBytes());
        int numParagraphs = range.numParagraphs(); // total number of paragraphs in the document
        for (int i = 0; i < numParagraphs; i++) { // iterate over the paragraphs
            Paragraph p = range.getParagraph(i); // current paragraph
            if (p.isInTable()) {
                int temp = i;
                if (tableIterator.hasNext()) {
                    String tableBegin = "<table style=\"border-collapse:collapse\" border=1 bordercolor=\"black\">";
                    String tableEnd = "</table>";
                    String rowBegin = "<tr>";
                    String rowEnd = "</tr>";
                    String colBegin = "<td>";
                    String colEnd = "</td>";
                    Table table = tableIterator.next();
                    output.write(tableBegin.getBytes());
                    int rows = table.numRows();
                    for (int r = 0; r < rows; r++) {
                        output.write(rowBegin.getBytes());
                        TableRow row = table.getRow(r);
                        int cols = row.numCells();
                        int rowNumParagraphs = row.numParagraphs();
                        int colsNumParagraphs = 0;
                        for (int c = 0; c < cols; c++) {
                            output.write(colBegin.getBytes());
                            TableCell cell = row.getCell(c);
                            int max = temp + cell.numParagraphs();
                            colsNumParagraphs = colsNumParagraphs + cell.numParagraphs();
                            for (int cp = temp; cp < max; cp++) {
                                Paragraph p1 = range.getParagraph(cp);
                                output.write(tagBegin.getBytes());
                                writeParagraphContent(p1);
                                output.write(tagEnd.getBytes());
                                temp++;
                            }
                            output.write(colEnd.getBytes());
                        }
                        int max1 = temp + rowNumParagraphs;
                        for (int m = temp + colsNumParagraphs; m < max1; m++) {
                            temp++;
                        }
                        output.write(rowEnd.getBytes());
                    }
                    output.write(tableEnd.getBytes());
                }
                i = temp;
            } else {
                output.write(tagBegin.getBytes());
                writeParagraphContent(p);
                output.write(tagEnd.getBytes());
            }
        }
        String end = "</body></html>";
        output.write(end.getBytes());
        output.close();
    } catch (Exception e) {
        System.out.println("readAndWrite Exception:" + e.getMessage());
        e.printStackTrace();
    }
}
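Two details of this HTML generation are worth a note. The page declares `charset="utf-8"`, but `getBytes()` without an argument uses the platform default charset (UTF-8 on Android, not necessarily elsewhere). Also, document text containing `<`, `>`, or `&` would be interpreted as markup if written unescaped; whether `writeParagraphContent` (not shown here) escapes text is unknown. The following is a minimal sketch of both points, using a hypothetical helper class `HtmlText` that is not part of the original code.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for writing document text into generated HTML. */
final class HtmlText {

    /** Escapes the characters that have special meaning in HTML. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    sb.append("&amp;");
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Encodes HTML as UTF-8 so the bytes match the declared meta charset. */
    static byte[] toUtf8(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }
}
```

With such a helper, a paragraph could be written as `output.write(HtmlText.toUtf8("<p>" + HtmlText.escape(paragraphText) + "</p>"))`, where `paragraphText` stands for whatever text the paragraph yields.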

Reading docx:
public void readDOCX() {
    String river = "";
    try {
        this.myFile = new File(this.htmlPath); // the target HTML file
        this.output = new FileOutputStream(this.myFile); // stream that writes to the HTML file
        presentPicture = 0;
        // Header; utf-8 is declared here, otherwise the text shows up garbled
        String head = "<!DOCTYPE html><html><meta charset=\"utf-8\"><body>";
        String end = "</body></html>";
        String tagBegin = "<p>"; // paragraph start
        String tagEnd = "</p>"; // paragraph end
        String tableBegin = "<table style=\"border-collapse:collapse\" border=1 bordercolor=\"black\">";
        String tableEnd = "</table>";
        String rowBegin = "<tr>";
        String rowEnd = "</tr>";
        String colBegin = "<td>";
        String colEnd = "</td>";
        String style = "style=\"";
        this.output.write(head.getBytes()); // write the header
        // A docx is a ZIP archive; the main document body is word/document.xml
        ZipFile xlsxFile = new ZipFile(new File(this.nameStr));
        ZipEntry sharedStringXML = xlsxFile.getEntry("word/document.xml");
        InputStream inputStream = xlsxFile.getInputStream(sharedStringXML);
        XmlPullParser xmlParser = Xml.newPullParser();
        xmlParser.setInput(inputStream, "utf-8");
        int evtType = xmlParser.getEventType();
        boolean isTable = false; // inside a table; used when counting rows and columns
        boolean isSize = false; // font size set
        boolean isColor = false; // color set
        boolean isCenter = false; // centered
        boolean isRight = false; // right-aligned
        boolean isItalic = false; // italic
        boolean isUnderline = false; // underlined
        boolean isBold = false; // bold
        boolean isR = false; // inside an r (run) element
        boolean isStyle = false;
        int pictureIndex = 1; // images in the docx archive are named image1, image2, ..., so start at 1
        while (evtType != XmlPullParser.END_DOCUMENT) {
            switch (evtType) {
            // Start tag
            case XmlPullParser.START_TAG:
                String tag = xmlParser.getName();
                if (tag.equalsIgnoreCase("r")) {
                    isR = true;
                }
                if (tag.equalsIgnoreCase("u")) { // underline
                    isUnderline = true;
                }
                if (tag.equalsIgnoreCase("jc")) { // alignment
                    String align = xmlParser.getAttributeValue(0);
                    if (align.equals("center")) {
                        this.output.write("<center>".getBytes());
                        isCenter = true;
                    }
                    if (align.equals("right")) {
                        this.output.write("<div align=\"right\">".getBytes());
                        isRight = true;
                    }
                }
                if (tag.equalsIgnoreCase("color")) { // color
                    String color = xmlParser.getAttributeValue(0);
                    this.output.write(("<span style=\"color:" + color + ";\">").getBytes());
                    isColor = true;
                }
                if (tag.equalsIgnoreCase("sz")) { // font size
                    if (isR) {
                        int size = decideSize(Integer.valueOf(xmlParser.getAttributeValue(0)));
                        this.output.write(("<font size=" + size + ">").getBytes());
                        isSize = true;
                    }
                }
                // Table handling
                if (tag.equalsIgnoreCase("tbl")) { // tbl: table start
                    this.output.write(tableBegin.getBytes());
                    isTable = true;
                }
                if (tag.equalsIgnoreCase("tr")) { // row
                    this.output.write(rowBegin.getBytes());
                }
                if (tag.equalsIgnoreCase("tc")) { // cell
                    this.output.write(colBegin.getBytes());
                }
                if (tag.equalsIgnoreCase("pic")) { // pic: an image
                    String entryName_jpeg = "word/media/image" + pictureIndex + ".jpeg";
                    String entryName_png = "word/media/image" + pictureIndex + ".png";
                    String entryName_gif = "word/media/image" + pictureIndex + ".gif";
                    String entryName_wmf = "word/media/image" + pictureIndex + ".wmf";
                    ZipEntry sharePicture = null;
                    InputStream pictIS = null;
                    // Look up the image in the docx archive, trying each extension in turn
                    sharePicture = xlsxFile.getEntry(entryName_jpeg);
                    if (sharePicture == null) {
                        sharePicture = xlsxFile.getEntry(entryName_png);
                    }
                    if (sharePicture == null) {
                        sharePicture = xlsxFile.getEntry(entryName_gif);
                    }
                    if (sharePicture == null) {
                        sharePicture = xlsxFile.getEntry(entryName_wmf);
                    }
                    if (sharePicture != null) {
                        // Read the image into a byte array
                        pictIS = xlsxFile.getInputStream(sharePicture);
                        ByteArrayOutputStream pOut = new ByteArrayOutputStream();
                        byte[] bt = null;
                        byte[] b = new byte[1000];
                        int len = 0;
                        while ((len = pictIS.read(b)) != -1) {
                            pOut.write(b, 0, len);
                        }
                        pictIS.close();
                        pOut.close();
                        bt = pOut.toByteArray();
                        Log.i("byteArray", "" + bt);
                        writeDOCXPicture(bt);
                    }
                    pictureIndex++; // one image converted, move to the next index
                }
                if (tag.equalsIgnoreCase("b")) { // bold
                    isBold = true;
                }
                if (tag.equalsIgnoreCase("p")) { // p: paragraph
                    if (!isTable) { // ignored when inside a table
                        this.output.write(tagBegin.getBytes());
                    }
                }
                if (tag.equalsIgnoreCase("i")) { // italic
                    isItalic = true;
                }
                // t: the text value
                if (tag.equalsIgnoreCase("t")) {
                    if (isBold) { // bold: open <b>
                        this.output.write("<b>".getBytes());
                    }
                    if (isUnderline) { // underline: open <u>
                        this.output.write("<u>".getBytes());
                    }
                    if (isItalic) { // italic: open <i>
                        output.write("<i>".getBytes());
                    }
                    river = xmlParser.nextText();
                    this.output.write(river.getBytes()); // write the text
                    if (isItalic) { // close </i> after the text and reset the italic flag
                        this.output.write("</i>".getBytes());
                        isItalic = false;
                    }
                    if (isUnderline) { // close </u> after the text and reset the underline flag
                        this.output.write("</u>".getBytes());
                        isUnderline = false;
                    }
                    if (isBold) { // close </b>
                        this.output.write("</b>".getBytes());
                        isBold = false;
                    }
                    if (isSize) { // font size was set: close the tag
                        this.output.write("</font>".getBytes());
                        isSize = false;
                    }
                    if (isColor) { // color was set: close the tag
                        this.output.write("</span>".getBytes());
                        isColor = false;
                    }
                    if (isCenter) { // centered: close the tag
                        this.output.write("</center>".getBytes());
                        isCenter = false;
                    }
                    if (isRight) { // there is no <right> tag; a div may have issues, but it will do for now
                        this.output.write("</div>".getBytes());
                        isRight = false;
                    }
                }
                break;
            // End tag
            case XmlPullParser.END_TAG:
                String tag2 = xmlParser.getName();
                if (tag2.equalsIgnoreCase("tbl")) { // table end: reset the table flag
                    this.output.write(tableEnd.getBytes());
                    isTable = false;
                }
                if (tag2.equalsIgnoreCase("tr")) { // row end
