Apache POI is a free, open-source, cross-platform API written in Java. It gives Java programs the ability to read and write Microsoft Office file formats. POI is an acronym for "Poor Obfuscation Implementation".
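A previous project needed Solr to tokenize and index document contents, and for the parsing step I ended up reading the document contents mainly through POI. The sections below describe how each file format is parsed.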
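Parsing txt content (plain Java I/O; POI is not needed for this format):
Imports used:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

/**
 * Reads the content of a txt file.
 *
 * @param file the file to read
 * @return the file content
 */
public static String txt2String(File file) {
    StringBuilder result = new StringBuilder();
    // try-with-resources closes the reader even if reading fails.
    // The file is assumed to be GBK-encoded; change the charset if it is not.
    try (BufferedReader br = new BufferedReader(
            new InputStreamReader(new FileInputStream(file), "GBK"))) {
        String s;
        while ((s = br.readLine()) != null) { // read one line at a time
            result.append("\n").append(s);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result.toString();
}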
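The hard-coded "GBK" above is the part most likely to need changing from project to project. As a minimal sketch (the class name TxtReader and the methods decode and readText are hypothetical, not part of the original code), the charset can be passed in and the file read in one call with java.nio:

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: reads a text file with a caller-supplied charset.
final class TxtReader {

    /** Decodes raw bytes using the named charset, e.g. "GBK" or "UTF-8". */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    /** Reads the whole file into memory and decodes it. */
    static String readText(Path path, String charsetName) throws IOException {
        return decode(Files.readAllBytes(path), charsetName);
    }
}
```

This loads the whole file into memory at once, so it is only a reasonable choice for files of moderate size; the line-by-line reader is the safer option for very large files.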
Parsing doc content with POI:
Note the two imports:
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Range;

/**
 * Reads the content of a doc file.
 *
 * @param file the file to read
 * @return the file content
 */
public static String doc2String(File file) {
    String result = "";
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file)) {
        HWPFDocument doc = new HWPFDocument(fis);
        Range range = doc.getRange(); // range covering the whole document
        result = range.text();
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}
Parsing docx content with POI:
Note the two imports:
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

/**
 * Reads the content of a docx file.
 *
 * @param file the file to read
 * @return the file content
 */
public static String docx2String(File file) {
    String str = "";
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file)) {
        XWPFDocument xdoc = new XWPFDocument(fis);
        XWPFWordExtractor extractor = new XWPFWordExtractor(xdoc);
        str = extractor.getText(); // plain text of the whole document
    } catch (Exception e) {
        e.printStackTrace();
    }
    return str;
}
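The doc and docx parsers use different POI classes because the two formats use different containers: legacy Office files (doc, xls, ppt) are OLE2 compound files, while docx, xlsx and pptx are ZIP packages. A file with the wrong extension will therefore fail in the wrong parser. The following is a rough sketch of checking the leading bytes before choosing a parser; the class FormatSniffer and its method names and return values are hypothetical, and the check only identifies the container, not the exact Office format inside it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: guesses the container type from the first bytes of a file.
final class FormatSniffer {
    // OLE2 compound file signature (legacy .doc/.xls/.ppt)
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header (.docx/.xlsx/.pptx are ZIP packages)
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};
    // PDF header "%PDF"
    private static final byte[] PDF = {'%', 'P', 'D', 'F'};

    /** Returns "ole2", "zip", "pdf" or "unknown" for the given leading bytes. */
    static String sniffBytes(byte[] head) {
        if (startsWith(head, OLE2)) return "ole2";
        if (startsWith(head, ZIP)) return "zip";
        if (startsWith(head, PDF)) return "pdf";
        return "unknown";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String sniffFile(Path path) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = new byte[8];
            int n = 0;
            while (n < head.length) { // read() may return fewer bytes than asked
                int r = in.read(head, n, head.length - n);
                if (r < 0) break;
                n += r;
            }
            return sniffBytes(Arrays.copyOf(head, n));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

A "zip" result would go to the docx/xlsx parsers and an "ole2" result to the doc/xls/ppt parsers, with the file extension used only to pick between formats that share a container.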
Parsing xls content (this example uses the jxl library, JExcelApi, rather than POI):
Note the three imports:
import jxl.Cell;
import jxl.Sheet;
import jxl.Workbook;

/**
 * Reads the content of an xls file.
 *
 * @param file the file to read
 * @return the file content
 */
public static String xls2String(File file) {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file)) {
        Workbook rwb = Workbook.getWorkbook(fis);
        // walk every sheet, every row, every cell
        for (Sheet rs : rwb.getSheets()) {
            for (int j = 0; j < rs.getRows(); j++) {
                Cell[] cells = rs.getRow(j);
                for (Cell cell : cells) {
                    sb.append(cell.getContents());
                }
            }
        }
        rwb.close(); // release the memory held by the workbook
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}
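Parsing xlsx content with POI:
Note the three imports:
import org.apache.poi.xssf.usermodel.XSSFRow;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

/**
 * Reads the content of an xlsx file (first sheet only).
 *
 * @param file the file to read
 * @return the file content
 */
public static String xlsx2String(File file) {
    StringBuilder sb = new StringBuilder();
    // try-with-resources closes the stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file)) {
        XSSFWorkbook xwb = new XSSFWorkbook(fis);
        XSSFSheet sheet = xwb.getSheetAt(0);
        // Start one row after the first row (for example, to skip a header).
        // getLastRowNum() is the bound: getPhysicalNumberOfRows() only counts
        // rows that exist, so it stops too early when the sheet has gaps.
        for (int i = sheet.getFirstRowNum() + 1; i <= sheet.getLastRowNum(); i++) {
            XSSFRow row = sheet.getRow(i);
            if (row == null) {
                continue; // empty rows are not stored
            }
            // getLastCellNum() is the last cell index plus one
            for (int j = row.getFirstCellNum(); j < row.getLastCellNum(); j++) {
                if (row.getCell(j) != null) { // missing cells come back as null
                    sb.append(row.getCell(j).toString());
                }
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return sb.toString();
}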
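One thing to keep in mind for the Solr use case: both spreadsheet readers append cell values back to back, so adjacent cells "Tom" and "Cat" end up as "TomCat" and the tokenizer can no longer see the boundary. Below is a small sketch of joining values with separators instead. It works on plain lists of strings rather than POI objects, and the class name CellText and method join are hypothetical; the separators (space between cells, newline between rows) are an assumption and can be anything the tokenizer splits on.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: joins already-extracted cell values with separators.
final class CellText {

    /** Joins cells with a space and rows with a newline, skipping empty cells. */
    static String join(List<List<String>> rows) {
        StringJoiner lines = new StringJoiner("\n");
        for (List<String> row : rows) {
            StringJoiner cells = new StringJoiner(" ");
            for (String cell : row) {
                if (cell != null && !cell.isEmpty()) {
                    cells.add(cell);
                }
            }
            lines.add(cells.toString());
        }
        return lines.toString();
    }
}
```

In the readers above, the same effect can be had by appending a space after each cell value and a newline after each row.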
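Parsing ppt content with POI:
Note the one import:
import org.apache.poi.hslf.extractor.PowerPointExtractor;

/**
 * Reads the content of a ppt file.
 *
 * @param file the file to read
 * @return the text of the presentation
 * @throws IOException if the file cannot be read
 */
public static String ppt2String(File file) throws IOException {
    // try-with-resources closes the stream once the text has been extracted
    try (FileInputStream fi = new FileInputStream(file)) {
        PowerPointExtractor ppExtractor = new PowerPointExtractor(fi);
        return ppExtractor.getText();
    }
}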
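Parsing pdf content (using Apache PDFBox, not POI):
Note the three imports:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.text.PDFTextStripper;

/**
 * Reads the content of a pdf file.
 *
 * @param file the file to read
 * @return the extracted text
 * @throws InvalidPasswordException if the pdf is password-protected
 * @throws IOException if the file cannot be read
 */
public static String pdf2String(File file) throws IOException {
    // PDDocument.load(File) is the PDFBox 2.x way of opening a document.
    // try-with-resources closes the document even if extraction fails.
    try (PDDocument document = PDDocument.load(file)) {
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setSortByPosition(false); // keep the order stored in the file
        return stripper.getText(document);
    }
}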
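With one method per format, the indexing code still has to decide which one to call for a given file. A simple way to do that, sketched here under the assumption that the file extension can be trusted, is a lookup table from extension to parser. The class ParserRegistry and its methods are hypothetical and not part of POI.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: picks a parser function by file extension.
final class ParserRegistry {
    private final Map<String, Function<File, String>> parsers = new HashMap<>();

    /** Registers a parser for an extension such as "docx" (case-insensitive). */
    void register(String extension, Function<File, String> parser) {
        parsers.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the lower-cased extension, or "" if the name has no dot. */
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Runs the parser registered for the file's extension. */
    String parse(File file) {
        Function<File, String> parser = parsers.get(extensionOf(file.getName()));
        if (parser == null) {
            throw new IllegalArgumentException("Unsupported file type: " + file.getName());
        }
        return parser.apply(file);
    }
}
```

Usage would look like registry.register("doc", f -> doc2String(f)) for each format. Note that ppt2String and pdf2String declare IOException, so they would need a small wrapper that catches or rethrows it before they fit a Function.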
For more details, see the official site: /overview.html