This article shows how to integrate Elasticsearch (ES) into a Spring Boot application to provide full-text search over files on disk. I found the approach practical, so I am sharing it here as a reference.
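Overall architecture
The files to search are spread across disks on different machines, so the system is built around disk-scanning agents. The scan service is deployed as an agent on each server that hosts a target disk and runs there as a scheduled task, while all indexes are created centrally in ES. ES itself is deployed as a distributed, highly available cluster. The search service is deployed together with the scan agent, which simplifies the architecture while keeping the system distributed.
(Figure: architecture for fast disk-file search)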
Deploying ES
ES (Elasticsearch) is the only third-party software this project depends on. ES can be deployed with Docker; the steps are as follows:
docker pull docker.elastic.co/elasticsearch/elasticsearch:6.3.2
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 --name es01 docker.elastic.co/elasticsearch/elasticsearch:6.3.2
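Once deployment finishes, open http://localhost:9200 in a browser. If the page loads and shows the ES node information, ES has been deployed successfully.
(Figure: ES response page)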
Project structure
(Figure: project structure)
Dependencies
Besides the basic Spring Boot starters, the project needs the ES-related packages, plus jmimemagic for content-based file type detection:
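<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<dependency>
    <groupId>io.searchbox</groupId>
    <artifactId>jest</artifactId>
    <version>5.3.3</version>
</dependency>
<dependency>
    <groupId>net.sf.jmimemagic</groupId>
    <artifactId>jmimemagic</artifactId>
    <version>0.1.4</version>
</dependency>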
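Configuration file
The ES address goes into application.yml. To keep the program simple, the root directory of the disk to scan (index-root) is configured there as well; the scan task recursively traverses every indexable file under that directory.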
server:
  port: @elasticsearch.port@
spring:
  application:
    name: @project.artifactId@
  profiles:
    active: dev
  elasticsearch:
    jest:
      uris: http://127.0.0.1:9200
index-root: /Users/crazyicelee/mywokerspace
Index data structure
Files must be searchable by directory, file name, and body text, so all of these are defined as index fields. The id field is annotated with @JestId, as the Jest ES client requires.
package com.crazyice.lee.accumulation.search.data;

import io.searchbox.annotations.JestId;
import lombok.Data;

@Data
public class Article {
    @JestId
    private Integer id;
    private String author;
    private String title;
    private String path;
    private String content;
    private String fileFingerprint;
}
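Scanning the disk and creating the index
Every file under the configured directory has to be scanned, so the directory is traversed recursively, and files that have already been processed are recognized and skipped to improve efficiency. For file type detection there are two options: a more precise check based on file content (Magic), or a rough check based on the file extension. This is the core component of the whole system.
Here is a small trick:
Compute the MD5 of each target file's content and store it in an ES index field as the file fingerprint. Each time the index is rebuilt, check whether that MD5 already exists in the index; if it does, the file is not indexed again. This prevents duplicate index entries and also avoids re-indexing every file after a system restart.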
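The listing that follows calls Md5CaculateUtil.getMD5(file), but that helper is not shown in this article. As a rough sketch of what such a fingerprint function might look like, the class below streams a file through the JDK's MessageDigest and returns the hex digest. The class name FileFingerprint and its method names are hypothetical and are not the author's utility.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the article's Md5CaculateUtil, which is not listed.
final class FileFingerprint {

    private FileFingerprint() {
    }

    // Streams the file through MD5 in 8 KB chunks, so large files are never held fully in memory.
    static String md5Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md5 = newMd5();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md5.update(buffer, 0, read);
            }
            return toHex(md5.digest());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience overload for content that is already in memory.
    static String md5Hex(byte[] data) {
        return toHex(newMd5().digest(data));
    }

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // Renders each byte as two lowercase hex digits.
    private static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }
}
```

Because the fingerprint depends only on content, two copies of the same file in different directories produce the same value, so under this scheme only the first copy encountered would be indexed.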
package com.crazyice.lee.accumulation.search.service;

import com.alibaba.fastjson.JSONObject;
import com.crazyice.lee.accumulation.search.data.Article;
import com.crazyice.lee.accumulation.search.utils.Md5CaculateUtil;
import io.searchbox.client.JestClient;
import io.searchbox.core.DocumentResult;
import io.searchbox.core.Index;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import lombok.extern.slf4j.Slf4j;
import net.sf.jmimemagic.*;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

@Component
@Slf4j
public class DirectoryRecurse {

    @Autowired
    private JestClient jestClient;

    // Read the file content and convert it to a string
    private String readToString(File file, String fileType) {
        StringBuilder result = new StringBuilder();
        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
                try {
                    // readAllBytes reads the whole file; a single InputStream.read call may return early
                    byte[] fileContent = Files.readAllBytes(file.toPath());
                    result.append(new String(fileContent, StandardCharsets.UTF_8));
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "doc":
                // Use the WordExtractor class from HWPF to extract text from a Word document
                try (FileInputStream in = new FileInputStream(file)) {
                    WordExtractor extractor = new WordExtractor(in);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
            case "docx":
                try (FileInputStream in = new FileInputStream(file);
                     XWPFDocument doc = new XWPFDocument(in)) {
                    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                    result.append(extractor.getText());
                } catch (Exception e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
        return result.toString();
    }

    // Check whether the file has already been indexed
    private JSONObject isIndex(File file) {
        JSONObject result = new JSONObject();
        // Use MD5 as the file fingerprint and search the index for it
        String fileFingerprint = Md5CaculateUtil.getMD5(file);
        result.put("fileFingerprint", fileFingerprint);
        // Default to "not indexed" so callers never read a missing key if the search fails
        result.put("isIndex", false);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.termQuery("fileFingerprint", fileFingerprint));
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute the search; it fails (for example) while the index does not exist yet
            SearchResult searchResult = jestClient.execute(search);
            if (searchResult.isSucceeded()
                    && searchResult.getTotal() != null
                    && searchResult.getTotal() > 0) {
                result.put("isIndex", true);
            }
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return result;
    }

    // Create an index entry for the file's directory, name, and content
    private void createIndex(File file, String method) {
        // Skip temporary files, whose names start with ~$
        if (file.getName().startsWith("~$")) return;

        String fileType = null;
        switch (method) {
            case "magic":
                Magic parser = new Magic();
                try {
                    MagicMatch match = parser.getMagicMatch(file, false);
                    fileType = match.getMimeType();
                } catch (MagicParseException e) {
                    //log.error("{}",e.getLocalizedMessage());
                } catch (MagicMatchNotFoundException e) {
                    //log.error("{}",e.getLocalizedMessage());
                } catch (MagicException e) {
                    //log.error("{}",e.getLocalizedMessage());
                }
                break;
            case "ext":
                String filename = file.getName();
                String[] strArray = filename.split("\\.");
                int suffixIndex = strArray.length - 1;
                fileType = strArray[suffixIndex];
                break;
        }
        // Type detection failed or the method is unknown; switching on null would throw
        if (fileType == null) return;

        switch (fileType) {
            case "text/plain":
            case "java":
            case "c":
            case "cpp":
            case "txt":
            case "doc":
            case "docx":
                JSONObject isIndexResult = isIndex(file);
                log.info("file: {}, type: {}, MD5: {}, already indexed: {}", file.getPath(), fileType,
                        isIndexResult.getString("fileFingerprint"), isIndexResult.getBoolean("isIndex"));
                if (isIndexResult.getBoolean("isIndex")) break;
                // 1. Build the document to index (save) in ES
                Article article = new Article();
                article.setTitle(file.getName());
                // The author field stores the parent directory
                article.setAuthor(file.getParent());
                article.setPath(file.getPath());
                article.setContent(readToString(file, fileType));
                article.setFileFingerprint(isIndexResult.getString("fileFingerprint"));
                // 2. Build the index request
                Index index = new Index.Builder(article).index("diskfile").type("files").build();
                try {
                    // 3. Execute it
                    DocumentResult indexResult = jestClient.execute(index);
                    if (indexResult.isSucceeded()) {
                        log.info("Index entry created");
                    } else {
                        log.error("Indexing failed: {}", indexResult.getErrorMessage());
                    }
                } catch (IOException e) {
                    log.error("{}", e.getLocalizedMessage());
                }
                break;
        }
    }

    public void find(String pathName) throws IOException {
        // Get the File object for pathName
        File dirFile = new File(pathName);
        // If the file or directory does not exist, log it and stop
        if (!dirFile.exists()) {
            log.info("{} does not exist", pathName);
            return;
        }
        // If it is not a directory but a regular file, index it directly
        if (!dirFile.isDirectory()) {
            if (dirFile.isFile()) {
                createIndex(dirFile, "ext");
            }
            return;
        }
        // Get the names of all files and directories under this directory
        String[] fileList = dirFile.list();
        // list() returns null if the directory cannot be read
        if (fileList == null) {
            log.warn("Cannot list {}", pathName);
            return;
        }
        for (String name : fileList) {
            File file = new File(dirFile.getPath(), name);
            if (file.isDirectory()) {
                // Recurse into subdirectories
                find(file.getCanonicalPath());
            } else {
                createIndex(file, "ext");
            }
        }
    }
}
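The extension-based type check and the recursive walk can also be written with java.nio.file. The sketch below is an alternative to the File.list() recursion in the listing, not the author's code. The class DiskScan and its methods are hypothetical names, and the set of extensions simply mirrors the cases handled in the listing. Unlike the listing, this sketch lowercases the extension, so Report.DOCX is treated like report.docx.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper; the names here are illustrative only.
final class DiskScan {

    // Extensions that the listing in this article knows how to read.
    private static final Set<String> INDEXABLE =
            new HashSet<>(Arrays.asList("txt", "java", "c", "cpp", "doc", "docx"));

    private DiskScan() {
    }

    // Returns the lowercased text after the last dot, or "" if the name has no dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Skips Office temporary files (~$...) and anything with an unsupported extension.
    static boolean isIndexable(String fileName) {
        return !fileName.startsWith("~$") && INDEXABLE.contains(extensionOf(fileName));
    }

    // Walks the tree under root and returns every regular file worth indexing.
    static List<Path> indexableFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> isIndexable(p.getFileName().toString()))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that Files.walk stops with an UncheckedIOException when it reaches a directory it cannot read, so a production scanner would probably prefer Files.walkFileTree with a visitor that logs and skips such directories.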
Scan task
A scheduled task scans the configured directory periodically, so the index is built incrementally as files are added.
package com.crazyice.lee.accumulation.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.io.IOException;

// Scheduling must be switched on with @EnableScheduling on a configuration class
@Component
@Slf4j
public class CreateIndexTask {

    @Autowired
    private DirectoryRecurse directoryRecurse;

    @Value("${index-root}")
    private String indexRoot;

    // Fields: second minute hour day-of-month month day-of-week.
    // "0 0/5 * * * ?" fires once every five minutes; the "* 0/5 * * * ?" form
    // would fire every second during each matching minute.
    @Scheduled(cron = "0 0/5 * * * ?")
    private void addIndex() {
        try {
            directoryRecurse.find(indexRoot);
            // writeIndexStatus() is not part of the DirectoryRecurse listing in this article;
            // remove this call if your class does not define it
            directoryRecurse.writeIndexStatus();
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
    }
}
Search service
The search service is exposed as a RESTful endpoint. Matched keywords are returned with highlighting, and the browser can render the returned JSON as it sees fit.
package com.crazyice.lee.accumulation.search.web;

import com.crazyice.lee.accumulation.search.data.Article;
import io.searchbox.client.JestClient;
import io.searchbox.core.Search;
import io.searchbox.core.SearchResult;
import io.swagger.annotations.ApiImplicitParam;
import io.swagger.annotations.ApiImplicitParams;
import io.swagger.annotations.ApiOperation;
import lombok.extern.slf4j.Slf4j;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.IOException;
import java.util.Collections;
import java.util.List;

@RestController
@Slf4j
public class Controller {

    @Autowired
    private JestClient jestClient;

    @RequestMapping(value = "/search/{keyword}", method = RequestMethod.GET)
    @ApiOperation(value = "Search all fields for a keyword", notes = "ES verification")
    @ApiImplicitParams(
            @ApiImplicitParam(name = "keyword", value = "Full-text search keyword", required = true, paramType = "path", dataType = "String")
    )
    public List<SearchResult.Hit<Article, Void>> search(@PathVariable String keyword) {
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.query(QueryBuilders.queryStringQuery(keyword));

        HighlightBuilder highlightBuilder = new HighlightBuilder();
        // Highlight the path field
        HighlightBuilder.Field highlightPath = new HighlightBuilder.Field("path");
        highlightPath.highlighterType("unified");
        highlightBuilder.field(highlightPath);
        // Highlight the title field
        HighlightBuilder.Field highlightTitle = new HighlightBuilder.Field("title");
        highlightTitle.highlighterType("unified");
        highlightBuilder.field(highlightTitle);
        // Highlight the content field
        HighlightBuilder.Field highlightContent = new HighlightBuilder.Field("content");
        highlightContent.highlighterType("unified");
        highlightBuilder.field(highlightContent);
        // Apply the highlight configuration
        searchSourceBuilder.highlighter(highlightBuilder);
        log.info("Search request: {}", searchSourceBuilder.toString());

        // Build the search against the same index and type that the scan task writes to
        Search search = new Search.Builder(searchSourceBuilder.toString())
                .addIndex("diskfile")
                .addType("files")
                .build();
        try {
            // Execute
            SearchResult result = jestClient.execute(search);
            return result.getHits(Article.class);
        } catch (IOException e) {
            log.error("{}", e.getLocalizedMessage());
        }
        return Collections.emptyList();
    }
}
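To make the request easier to picture, the sketch below assembles by hand a JSON body that is roughly equivalent to what the builders in the controller produce: a query_string query plus unified highlighting on path, title, and content. The real SearchSourceBuilder output also carries default parameters that are omitted here. The class SearchBody and its methods are hypothetical and exist only for illustration.

```java
// Hypothetical helper that mirrors the controller's query as a raw JSON string.
final class SearchBody {

    private SearchBody() {
    }

    // Escapes the characters that must not appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':
                    sb.append('\\').append('"');
                    break;
                case '\\':
                    sb.append('\\').append('\\');
                    break;
                case '\n':
                    sb.append('\\').append('n');
                    break;
                case '\r':
                    sb.append('\\').append('r');
                    break;
                case '\t':
                    sb.append('\\').append('t');
                    break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the four-digit hex escape form.
                        sb.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Builds the search body: query_string over all fields, highlights on three fields.
    static String build(String keyword) {
        return "{\"query\":{\"query_string\":{\"query\":\"" + escape(keyword) + "\"}},"
                + "\"highlight\":{\"fields\":{"
                + "\"path\":{\"type\":\"unified\"},"
                + "\"title\":{\"type\":\"unified\"},"
                + "\"content\":{\"type\":\"unified\"}}}}";
    }
}
```

A body like this could be posted to the _search endpoint of the diskfile index with any HTTP client, which is a convenient way to check the query outside the application.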
Testing the RESTful search results
The API is tested through Swagger. The keyword parameter is the term to search for in the full-text index.
(Figure: search results)
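Generating the UI with Thymeleaf
The Thymeleaf template engine is integrated to render the search results directly as web pages. The templates consist of a main search page and a results page, implemented with the @Controller annotation and the Model object.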
Thanks for reading! That wraps up this article on implementing full-text search over disk files by integrating ES into Spring Boot. I hope it proves helpful.