
SpringBoot + Apache Tika: Easily Parse the Content of All Kinds of Documents

2024-06-29 · Coders

Author: 不可食用鹽
Link: https://juejin.cn/post/7252159509848899640

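Apache Tika is an open-source document parsing toolkit from Apache. It can parse and extract the content and format of more than a thousand different file types (such as PPT, XLS, and PDF). Tika can be used in several ways: through a graphical interface (tika-app), as a standalone service (tika-server) called over its HTTP interface, or as a library embedded in your own project.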
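To make the tika-server option concrete, here is a minimal sketch of calling a standalone server from Java with the JDK's built-in HTTP client. It assumes a tika-server instance is already running locally on its default port 9998 and that Java 11 or later is available. The class name `TikaServerClient` and its methods are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TikaServerClient {

    // Assumption: tika-server runs on this machine on its default port.
    private static final URI TIKA_ENDPOINT = URI.create("http://localhost:9998/tika");

    /** Builds a PUT request that uploads the file and asks for plain text back. */
    static HttpRequest buildRequest(Path file) throws IOException {
        return HttpRequest.newBuilder(TIKA_ENDPOINT)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    /** Sends the file to tika-server and returns the extracted text. */
    static String extractText(Path file) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(file), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: TikaServerClient <file>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

This keeps the parsing work out of the application's own process, at the cost of running and monitoring a separate service.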

This article demonstrates the third approach: parsing documents by embedding Tika in a Spring Boot application.

# Add the dependencies

Add the following dependencies to the Spring Boot project:

    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.tika</groupId>
                <artifactId>tika-bom</artifactId>
                <version>2.8.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

    <!-- The two entries below belong inside the project's <dependencies> element -->
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers-standard-package</artifactId>
    </dependency>

# Create the configuration

  1. Place a tika-config.xml file in the resources directory. Its content is as follows:

    <?xml version="1.0" encoding="UTF-8"?>
    <properties>
        <encodingDetectors>
            <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
                <params>
                    <param name="markLimit" type="int">64000</param>
                </params>
            </encodingDetector>
            <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
                <params>
                    <param name="markLimit" type="int">64001</param>
                </params>
            </encodingDetector>
            <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
                <params>
                    <param name="markLimit" type="int">64002</param>
                </params>
            </encodingDetector>
        </encodingDetectors>
    </properties>

  2. Create the configuration class MyTikaConfig:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.tika.Tika;
    import org.apache.tika.config.TikaConfig;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.Parser;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.core.io.Resource;
    import org.springframework.core.io.ResourceLoader;
    import org.xml.sax.SAXException;

    /**
     * Tika configuration class.
     */
    @Configuration
    public class MyTikaConfig {

        @Autowired
        private ResourceLoader resourceLoader;

        @Bean
        public Tika tika() throws TikaException, IOException, SAXException {
            // Load tika-config.xml from the classpath (the resources directory)
            Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
            // Close the stream once the configuration has been read
            try (InputStream inputStream = resource.getInputStream()) {
                TikaConfig config = new TikaConfig(inputStream);
                Detector detector = config.getDetector();
                Parser autoDetectParser = new AutoDetectParser(config);
                return new Tika(detector, autoDetectParser);
            }
        }
    }

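The Tika class provides detect, translate, and parse functions for documents. Once the Tika bean is injected into the project, these functions are ready to use.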
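As a rough illustration of the detect and parse functions, the sketch below calls the `Tika` facade directly. To stay self-contained it uses `new Tika()` with default settings instead of the injected bean, and it assumes the dependencies listed earlier are on the classpath. The class name `TikaFacadeDemo` and the sample input are made up for this example. It leaves out translate, which, as far as I know, depends on an external translation service being configured.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaFacadeDemo {

    public static void main(String[] args) throws IOException, TikaException {
        // Default facade for a standalone demo; in the Spring Boot project
        // the injected Tika bean would be used instead.
        Tika tika = new Tika();

        // Detect a media type from the file name alone (no content is read).
        String typeByName = tika.detect("report.pdf");
        System.out.println("By name:    " + typeByName);

        // Detect a media type from the leading bytes of the content.
        byte[] sample = "Hello, Tika!".getBytes(StandardCharsets.UTF_8);
        String typeByContent = tika.detect(sample);
        System.out.println("By content: " + typeByContent);

        // Parse a stream and return its text content as a String.
        try (InputStream in = new ByteArrayInputStream(sample)) {
            String text = tika.parseToString(in);
            System.out.println("Text:       " + text.trim());
        }
    }
}
```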

# Use it in the project

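Once the configuration is complete, documents can be parsed anywhere in the project simply by injecting Tika, as shown in the figure below: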
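A minimal sketch of what that injection might look like follows. It is not the author's original code: `DocumentTextService` and `extractText` are hypothetical names, and the sketch assumes the `Tika` bean defined in MyTikaConfig is the only one in the application context.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

@Service
public class DocumentTextService {

    private final Tika tika;

    // Spring passes in the Tika bean declared in MyTikaConfig (constructor injection).
    public DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    /** Extracts the text content of a document; the caller owns and closes the stream. */
    public String extractText(InputStream document) throws IOException, TikaException {
        return tika.parseToString(document);
    }
}
```

A controller or another service can then depend on `DocumentTextService` and pass it, for example, the input stream of an uploaded file. Note that `parseToString` limits the length of the returned string (100,000 characters by default, to my knowledge); calling `setMaxStringLength` on the bean adjusts that limit.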
