Author: 不可食用鹽
Link: https://juejin.cn/post/7252159509848899640
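Apache Tika is an open-source document parsing toolkit from Apache. It can parse and extract the content and format of more than a thousand file types (such as PPT, XLS, and PDF). Tika can be used in several ways: through a graphical interface (tika-app), as a standalone deployment (tika-server) called over its API, or as a library embedded in a project.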
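For the standalone option, the following is a minimal sketch of calling a tika-server instance over HTTP with the JDK's built-in client. It assumes a server is already running on localhost:9998 (the tika-server default port) and uses its `PUT /tika` endpoint. The class name `TikaServerClient` and its methods are hypothetical and not part of the original post.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TikaServerClient {

    // Assumed base URL: tika-server listens on port 9998 by default.
    public static final String DEFAULT_BASE_URL = "http://localhost:9998";

    // Builds a PUT /tika request that asks the server for plain-text output.
    public static HttpRequest buildExtractRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    // Sends the document and returns the extracted text from the response body.
    public static String extractText(String baseUrl, byte[] document) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                buildExtractRequest(baseUrl, document),
                HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException(
                    "tika-server returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A caller would read a file into a byte array and pass it to `TikaServerClient.extractText(TikaServerClient.DEFAULT_BASE_URL, bytes)`.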
This article demonstrates the third approach: parsing documents by embedding Tika in a Spring Boot project. The steps are as follows.
# Add the dependencies
Add the following dependencies to the Spring Boot project:
<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.apache.tika</groupId>
            <artifactId>tika-bom</artifactId>
            <version>2.8.0</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
</dependency>
# Create the configuration
Place a tika-config.xml file in the resources directory. Its contents are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <encodingDetectors>
        <encodingDetector class="org.apache.tika.parser.html.HtmlEncodingDetector">
            <params>
                <param name="markLimit" type="int">64000</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.UniversalEncodingDetector">
            <params>
                <param name="markLimit" type="int">64001</param>
            </params>
        </encodingDetector>
        <encodingDetector class="org.apache.tika.parser.txt.Icu4jEncodingDetector">
            <params>
                <param name="markLimit" type="int">64002</param>
            </params>
        </encodingDetector>
    </encodingDetectors>
</properties>
Then create the configuration class MyTikaConfig:
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;
import org.springframework.core.io.ResourceLoader;
import org.xml.sax.SAXException;
/**
 * Tika configuration class.
 */
@Configuration
public class MyTikaConfig {

    @Autowired
    private ResourceLoader resourceLoader;

    @Bean
    public Tika tika() throws TikaException, IOException, SAXException {
        // Load tika-config.xml from the classpath (src/main/resources).
        Resource resource = resourceLoader.getResource("classpath:tika-config.xml");
        // Close the stream once the configuration has been read.
        try (InputStream inputStream = resource.getInputStream()) {
            TikaConfig config = new TikaConfig(inputStream);
            Detector detector = config.getDetector();
            Parser autoDetectParser = new AutoDetectParser(config);
            return new Tika(detector, autoDetectParser);
        }
    }
}
The Tika class provides detect, translate, and parse functions for documents. Once the Tika bean is injected, they can be used anywhere in the project.
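As a rough illustration of the detect and parse functions, here is a small standalone sketch that exercises the `Tika` facade without Spring. It assumes the default `new Tika()` configuration rather than the tika-config.xml above, and the class name `TikaFacadeDemo` and its helper methods are hypothetical. The translate function is left out because it depends on a configured translation backend.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import org.apache.tika.Tika;

public class TikaFacadeDemo {

    // Detects the media type from the leading bytes of the content.
    public static String detectFromBytes(Tika tika, byte[] content) {
        return tika.detect(content);
    }

    // Detects the media type from the file name alone (no content is read).
    public static String detectFromName(Tika tika, String fileName) {
        return tika.detect(fileName);
    }

    // Parses the content and returns the extracted plain text.
    public static String extractText(Tika tika, byte[] content) throws Exception {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return tika.parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        Tika tika = new Tika(); // default configuration, for illustration only
        byte[] sample = "hello tika".getBytes(StandardCharsets.UTF_8);
        System.out.println(detectFromBytes(tika, sample));
        System.out.println(detectFromName(tika, "report.pdf"));
        System.out.println(extractText(tika, sample));
    }
}
```

Text extraction needs the parser modules (tika-parsers-standard-package) on the classpath. Detection by name or by leading bytes works with tika-core alone.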
# Use it in the project
Once the configuration is in place, parsing a document in the project only requires injecting the Tika bean.
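One way the injection might look is sketched below. The service name `DocumentTextService` and its two methods are hypothetical. The sketch assumes the `Tika` bean defined in MyTikaConfig is the only `Tika` bean in the application context, so Spring can resolve it by constructor injection.

```java
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.springframework.stereotype.Service;

// Hypothetical service: the name and methods are illustrative only.
@Service
public class DocumentTextService {

    private final Tika tika;

    // Spring supplies the Tika bean declared in MyTikaConfig.
    public DocumentTextService(Tika tika) {
        this.tika = tika;
    }

    // Detects the media type from the stream content, using the file name as a hint.
    public String detectType(InputStream content, String fileName) throws IOException {
        return tika.detect(content, fileName);
    }

    // Parses the stream and returns the extracted plain text.
    public String extractText(InputStream content) throws IOException, TikaException {
        return tika.parseToString(content);
    }
}
```

A controller or job can then call `extractText` with, for example, the input stream of an uploaded file.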