TIKA

Tika下载

  1. server.jar
    http://tika.apache.org/download.html
    java -jar tika-server-1.17.jar
    

    下载server版,需要java运行环境。注:JAVA9默认缺少server运行所需要的xml.bind包,需要另行解决,JAVA8无问题。

  2. docker
    docker pull logicalspark/docker-tikaserver # only on initial download/update
    docker run --rm -p 9998:9998 logicalspark/docker-tikaserver
    
  3. app.jar
    app也有server模式,但他并非HTPP协议,所以无法使用curl调试。
    
  4. maven

Docker Server

  1. 测试服务器
    curl -X GET http://localhost:9998/tika 
    
  2. 获取meta
    curl -T test.pdf http://localhost:9998/meta --header "Accept: application/json"
    
  3. 获取文档内容
    curl -X PUT --data-binary @test.pdf http://localhost:9998/tika --header "Content-Type: text/pdf"
    curl -T test.pdf http://localhost:9998/tika --header "Accept: text/html" # 返回html,带标签,可不带header
    

Go Implement

var tikaServerUrl = "http://localhost:9998/"
func putRequest(url, filename string) (string, error) {
	file, err := os.Open(filename)
	if err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodPut, url, file)
	client := &http.Client{}
	response, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer response.Body.Close()
	b, _ := ioutil.ReadAll(response.Body)
	return string(b), nil
}

文档

https://wiki.apache.org/tika/TikaJAXRS