
Lucene search process


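Background

The goal is to understand Lucene's search process:

  • Tokenize the query
  • Compute a weight for each token, sort the hits, and take the top k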
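The two steps above can be sketched with a toy in-memory search. This is not Lucene's implementation: the class and method names are made up for illustration, the tokenizer is a plain regex split, and the scoring is a simplified BM25-style formula with the usual defaults k1 = 1.2 and b = 0.75 assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Toy search (hypothetical names): tokenize, score each doc per term, keep the k best. */
final class TopKSearchSketch {

  /** Lowercase and split on non-letters/digits (a stand-in for a real analyzer). */
  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
      if (!t.isEmpty()) tokens.add(t);
    }
    return tokens;
  }

  /** Simplified BM25-style score of one term in one doc (assumed k1 = 1.2, b = 0.75). */
  static double score(int freq, int docLen, double avgLen, int docFreq, int numDocs) {
    double k1 = 1.2, b = 0.75;
    double idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    double norm = k1 * (1 - b + b * docLen / avgLen);
    return idf * freq / (freq + norm);
  }

  /** Returns the docIDs of the k highest scoring docs, best first. */
  static List<Integer> search(List<String> docs, String query, int k) {
    List<List<String>> tokenized = new ArrayList<>();
    Map<String, Integer> docFreq = new HashMap<>();
    long totalLen = 0;
    for (String d : docs) {
      List<String> toks = tokenize(d);
      tokenized.add(toks);
      totalLen += toks.size();
      // Count each term once per document.
      for (String t : new HashSet<>(toks)) docFreq.merge(t, 1, Integer::sum);
    }
    double avgLen = (double) totalLen / docs.size();

    // Min-heap on score: the weakest of the current top k sits at the head.
    PriorityQueue<double[]> heap =
        new PriorityQueue<double[]>(Comparator.comparingDouble((double[] e) -> e[1]));
    List<String> queryTerms = tokenize(query);
    for (int docId = 0; docId < tokenized.size(); docId++) {
      List<String> toks = tokenized.get(docId);
      double total = 0;
      for (String term : queryTerms) {
        int freq = Collections.frequency(toks, term);
        if (freq > 0) {
          total += score(freq, toks.size(), avgLen, docFreq.get(term), docs.size());
        }
      }
      if (total > 0) {
        heap.add(new double[] {docId, total});
        if (heap.size() > k) heap.poll(); // evict the current weakest hit
      }
    }
    // Drain the heap (weakest first) and prepend, which yields best-first order.
    List<Integer> result = new ArrayList<>();
    while (!heap.isEmpty()) result.add(0, (int) heap.poll()[0]);
    return result;
  }
}
```

The bounded min-heap keeps memory at k entries no matter how many docs match, which is the same idea a top-k collector relies on.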

Call stacks

  • Write path:
add:473, FSTCompiler (org.apache.lucene.util.fst)
compileIndex:504, Lucene90BlockTreeTermsWriter$PendingBlock (org.apache.lucene.codecs.lucene90.blocktree)
writeBlocks:725, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
finish:1105, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:370, Lucene90BlockTreeTermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:172, PerFieldPostingsFormat$FieldsWriter (org.apache.lucene.codecs.perfield)
flush:135, FreqProxTermsWriter (org.apache.lucene.index)
flush:310, IndexingChain (org.apache.lucene.index)
flush:392, DocumentsWriterPerThread (org.apache.lucene.index)
doFlush:492, DocumentsWriter (org.apache.lucene.index)
flushAllThreads:671, DocumentsWriter (org.apache.lucene.index)
doFlush:4194, IndexWriter (org.apache.lucene.index)
flush:4168, IndexWriter (org.apache.lucene.index)
shutdown:1322, IndexWriter (org.apache.lucene.index)
close:1362, IndexWriter (org.apache.lucene.index)
doTestSearch:133, FstTest (com.dinosaur.lucene.demo)
  • Read path:
findTargetArc:1418, FST (org.apache.lucene.util.fst)
seekExact:511, SegmentTermsEnum (org.apache.lucene.codecs.lucene90.blocktree)
loadTermsEnum:111, TermStates (org.apache.lucene.index)
build:96, TermStates (org.apache.lucene.index)
createWeight:227, TermQuery (org.apache.lucene.search)
createWeight:904, IndexSearcher (org.apache.lucene.search)
search:687, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:158, SearchFiles (com.dinosaur.lucene.demo)
testSearch:128, SearchFiles (com.dinosaur.lucene.demo)

Example

cfs file

$ hexdump  app/index/_3.cfs
000000 3f d7 6c 17 14 4c 75 63 65 6e 65 39 30 43 6f 6d
000010 70 6f 75 6e 64 44 61 74 61 00 00 00 00 7a fc 30
000020 52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00 00 00
000030 3f d7 6c 17 11 4c 75 63 65 6e 65 39 30 4e 6f 72
000040 6d 73 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51
000050 d2 54 be 49 7f 21 78 69 fe c4 00 04 03 c0 28 93
000060 e8 00 00 00 00 00 00 00 00 f0 6a f4 62 00 00 00
000070 3f d7 6c 17 16 4c 75 63 65 6e 65 39 30 46 69 65
000080 6c 64 73 49 6e 64 65 78 49 64 78 00 00 00 00 7a
000090 fc 30 52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00
0000a0 c0 28 93 e8 00 00 00 00 00 00 00 00 92 7f 21 bb
0000b0 3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 69
0000c0 6e 74 73 46 6f 72 6d 61 74 49 6e 64 65 78 00 00
0000d0 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
0000e0 fe c4 00 32 c0 28 93 e8 00 00 00 00 00 00 00 00
0000f0 f7 61 6e 2f 00 00 00 00 3f d7 6c 17 13 42 6c 6f
000100 63 6b 54 72 65 65 54 65 72 6d 73 49 6e 64 65 78
000110 00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21
000120 78 69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 00
000130 00 c0 28 93 e8 00 00 00 00 00 00 00 00 07 1a 7b
000140 47 00 00 00 00 00 00 00 3f d7 6c 17 18 4c 75 63
000150 65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000160 74 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000170 54 be 49 7f 21 78 69 fe c4 00 02 fe 00 08 80 00
000180 01 88 d2 0f 28 0d ff c0 28 93 e8 00 00 00 00 00
000190 00 00 00 6d 43 fa 6e 00 3f d7 6c 17 19 4c 75 63
0001a0 65 6e 65 39 30 50 6f 73 74 69 6e 67 73 57 72 69
0001b0 74 65 72 44 6f 63 00 00 00 00 7a fc 30 52 e0 51
0001c0 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
0001d0 65 39 30 5f 30 01 03 01 03 c0 28 93 e8 00 00 00 <--- the 01 03 on the right encode the two docIDs of the term "you"
0001e0 00 00 00 00 00 26 f5 75 88 00 00 00 00 00 00 00
0001f0 3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 73
000200 74 69 6e 67 73 57 72 69 74 65 72 50 6f 73 00 00
000210 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
000220 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 02 00 00
000230 01 02 03 01 c0 28 93 e8 00 00 00 00 00 00 00 00
000240 c5 ac 32 b6 00 00 00 00 3f d7 6c 17 15 4c 75 63
000250 65 6e 65 39 30 4e 6f 72 6d 73 4d 65 74 61 64 61
000260 74 61 00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49
000270 7f 21 78 69 fe c4 00 02 00 00 00 ff ff ff ff ff
000280 ff ff ff 00 00 00 00 00 00 00 00 ff ff ff 02 00
000290 00 00 01 2b 00 00 00 00 00 00 00 ff ff ff ff c0
0002a0 28 93 e8 00 00 00 00 00 00 00 00 1c 85 f4 99 00
0002b0 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f
0002c0 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74
0002d0 61 00 00 00 01 7a fc 30 52 e0 51 d2 54 be 49 7f
0002e0 21 78 69 fe c4 00 00 0a 00 01 08 12 13 01 04 02
0002f0 05 05 05 05 05 05 05 05 05 10 00 40 10 2e 2e 5c
000300 40 64 6f 63 73 40 5c 64 65 6d 40 6f 2e 74 78 40
000310 74 00 11 2e 40 2e 5c 64 6f 40 63 73 5c 64 40 65
000320 6d 6f 32 40 2e 74 78 74 c0 28 93 e8 00 00 00 00
000330 00 00 00 00 81 b0 7e 09 3f d7 6c 17 18 4c 75 63
000340 65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000350 74 4d 65 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000360 54 be 49 7f 21 78 69 fe c4 00 01 00 00 00 3f d7
000370 6c 17 03 42 4b 44 00 00 00 09 01 01 80 04 08 01
000380 80 00 01 88 d2 0f 28 0d 80 00 01 88 d2 0f 28 0d
000390 02 02 01 32 00 00 00 00 00 00 00 33 00 00 00 00
0003a0 00 00 00 ff ff ff ff 44 00 00 00 00 00 00 00 4f
0003b0 00 00 00 00 00 00 00 c0 28 93 e8 00 00 00 00 00
0003c0 00 00 00 02 3e 97 d6 00 3f d7 6c 17 17 4c 75 63
0003d0 65 6e 65 39 30 46 69 65 6c 64 73 49 6e 64 65 78
0003e0 4d 65 74 61 00 00 00 01 7a fc 30 52 e0 51 d2 54
0003f0 be 49 7f 21 78 69 fe c4 00 80 80 05 02 00 00 00
000400 0a 00 00 00 02 00 00 00 30 00 00 00 00 00 00 00
000410 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00 00
000420 00 00 00 00 00 30 00 00 00 00 00 00 00 36 00 00
000430 00 00 00 00 00 00 00 84 42 00 00 00 00 00 00 00
000440 00 00 30 00 00 00 00 00 00 00 78 00 00 00 00 00
000450 00 00 01 01 02 c0 28 93 e8 00 00 00 00 00 00 00
000460 00 c3 23 d0 d6 00 00 00 3f d7 6c 17 12 42 6c 6f <------- 3f: the tim file starts here (offset 0x468 = 1128)
000470 63 6b 54 72 65 65 54 65 72 6d 73 44 69 63 74 00
000480 00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78
000490 69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 0b 9c <--------
0004a0 01 61 72 65 68 6f 77 6f 6c 64 73 74 75 64 65 6e
0004b0 74 79 6f 75 0a 03 03 03 07 03 05 04 00 05 04 00 <------ 05 04 00 05 04 are positions
0004c0 0b 7a 3d 04 00 02 01 01 05 01 00 01 05 8c 02 2e <------- 7a 3d 04 holds much of the position information
0004d0 2e 5c 64 6f 63 73 5c 64 65 6d 6f 2e 74 78 74 2e
0004e0 2e 5c 64 6f 63 73 5c 64 65 6d 6f 32 2e 74 78 74
0004f0 04 10 11 01 03 04 82 01 00 05 c0 28 93 e8 00 00
000500 00 00 00 00 00 00 1a 7f dc 45 00 00 00 00 00 00
000510 3f d7 6c 17 12 42 6c 6f 63 6b 54 72 65 65 54 65
000520 72 6d 73 4d 65 74 61 00 00 00 00 7a fc 30 52 e0
000530 51 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65
000540 6e 65 39 30 5f 30 3f d7 6c 17 1b 4c 75 63 65 6e
000550 65 39 30 50 6f 73 74 69 6e 67 73 57 72 69 74 65
000560 72 54 65 72 6d 73 00 00 00 00 7a fc 30 52 e0 51
000570 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
000580 65 39 30 5f 30 80 01 02 02 05 02 da 01 07 07 02
000590 03 61 72 65 03 79 6f 75 37 3f d7 6c 17 03 46 53
0005a0 54 00 00 00 08 01 03 01 da 02 00 00 01 00 02 02
0005b0 92 03 02 02 10 2e 2e 5c 64 6f 63 73 5c 64 65 6d
0005c0 6f 2e 74 78 74 11 2e 2e 5c 64 6f 63 73 5c 64 65
0005d0 6d 6f 32 2e 74 78 74 38 3f d7 6c 17 03 46 53 54
0005e0 00 00 00 08 01 03 03 92 02 00 00 01 49 00 00 00
0005f0 00 00 00 00 a2 00 00 00 00 00 00 00 c0 28 93 e8
000600 00 00 00 00 00 00 00 00 c9 44 df a8 00 00 00 00
000610 3f d7 6c 17 12 4c 75 63 65 6e 65 39 34 46 69 65
000620 6c 64 49 6e 66 6f 73 00 00 00 00 7a fc 30 52 e0
000630 51 d2 54 be 49 7f 21 78 69 fe c4 00 03 04 70 61
000640 74 68 00 02 01 00 ff ff ff ff ff ff ff ff 02 1d
000650 50 65 72 46 69 65 6c 64 50 6f 73 74 69 6e 67 73
000660 46 6f 72 6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75
000670 63 65 6e 65 39 30 1d 50 65 72 46 69 65 6c 64 50
000680 6f 73 74 69 6e 67 73 46 6f 72 6d 61 74 2e 73 75
000690 66 66 69 78 01 30 00 00 01 00 08 6d 6f 64 69 66
0006a0 69 65 64 01 00 00 00 ff ff ff ff ff ff ff ff 00
0006b0 01 01 08 00 01 00 08 63 6f 6e 74 65 6e 74 73 02
0006c0 00 03 00 ff ff ff ff ff ff ff ff 02 1d 50 65 72
0006d0 46 69 65 6c 64 50 6f 73 74 69 6e 67 73 46 6f 72
0006e0 6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75 63 65 6e
0006f0 65 39 30 1d 50 65 72 46 69 65 6c 64 50 6f 73 74
000700 69 6e 67 73 46 6f 72 6d 61 74 2e 73 75 66 66 69
000710 78 01 30 00 00 01 00 c0 28 93 e8 00 00 00 00 00
000720 00 00 00 36 55 24 d2 c0 28 93 e8 00 00 00 00 00
000730 00 00 00 41 6a 49 d4
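To locate the sub-files inside a dump like this without reading it by eye, a small scanner helps. The sketch below assumes that the recurring bytes 3f d7 6c 17 are the codec header magic and that the next byte is a one-byte length followed by the codec name, which is what the dump suggests (for example 14 followed by the 20 characters of Lucene90CompoundData). The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Finds codec headers inside raw bytes of a .cfs file (illustrative sketch). */
final class CodecHeaderScanSketch {
  // Assumption: the bytes 3f d7 6c 17 seen throughout the dump mark a codec header.
  static final int CODEC_MAGIC = 0x3fd76c17;

  /** Returns "offset:codecName" for every header found. */
  static List<String> scan(byte[] data) {
    List<String> found = new ArrayList<>();
    ByteBuffer buf = ByteBuffer.wrap(data); // big-endian by default
    for (int i = 0; i + 5 <= data.length; i++) {
      if (buf.getInt(i) != CODEC_MAGIC) continue;
      int nameLen = data[i + 4] & 0xff; // assumes the length fits in one byte (< 128)
      if (i + 5 + nameLen > data.length) continue; // truncated header, skip it
      String name = new String(data, i + 5, nameLen, StandardCharsets.UTF_8);
      found.add(i + ":" + name);
    }
    return found;
  }

  /** Parses a whitespace-separated hex string such as "3f d7 6c 17". */
  static byte[] fromHex(String hex) {
    String[] parts = hex.trim().split("\\s+");
    byte[] out = new byte[parts.length];
    for (int i = 0; i < parts.length; i++) {
      out[i] = (byte) Integer.parseInt(parts[i], 16);
    }
    return out;
  }
}
```

Fed the bytes of the whole file (for instance via `java.nio.file.Files.readAllBytes`), the scanner should list one entry per embedded file, with the BlockTreeTermsDict header at the offset marked in the dump.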

The tim file starts at offset=1128 (0x468) in the .cfs file.

tim file

score:250, BM25Similarity$BM25Scorer (org.apache.lucene.search.similarities)
score:60, LeafSimScorer (org.apache.lucene.search)
score:75, TermScorer (org.apache.lucene.search)
collect:73, TopScoreDocCollector$SimpleTopScoreDocCollector$1 (org.apache.lucene.search)
scoreAll:305, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)

readField:248, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)
document:642, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)
document:253, SegmentReader (org.apache.lucene.index)
document:171, BaseCompositeReader (org.apache.lucene.index)
document:411, IndexReader (org.apache.lucene.index)
doc:390, IndexSearcher (org.apache.lucene.search)
doPagingSearch:195, SearchFiles (com.dinosaur.lucene.skiptest)

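Relationship between tim/tip/doc

tip is the pointer that describes a term; tim holds the term's statistics; doc holds the docIds that correspond to the term.

In other words: tip -> tim -> doc

  • Use tip to check whether the term exists
  • Then follow tip to tim to read the term's statistics
  • Then use doc to get the array of docIds that contain the term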
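The three steps can be mimicked with plain collections. This is only a mental model, not the on-disk format: here the tip stand-in maps the first character of a term to a block, whereas the real index is an FST over term prefixes, and all names (`TermLookupSketch`, `TermStats`, `add`, `lookup`) are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the tip -> tim -> doc chain (hypothetical names and layout). */
final class TermLookupSketch {

  /** What the tim stand-in stores per term. */
  static final class TermStats {
    final int docFreq;    // number of docs containing the term
    final int docStartFP; // where the term's docIds start in the doc stand-in
    TermStats(int docFreq, int docStartFP) {
      this.docFreq = docFreq;
      this.docStartFP = docStartFP;
    }
  }

  // tip stand-in: first character -> tim block (term -> stats).
  private final Map<Character, Map<String, TermStats>> tip = new HashMap<>();
  // doc stand-in: all docIds laid out back to back.
  private final List<Integer> doc = new ArrayList<>();

  /** Indexes a non-empty term with its docIds. */
  void add(String term, int... docIds) {
    Map<String, TermStats> block = tip.computeIfAbsent(term.charAt(0), c -> new HashMap<>());
    block.put(term, new TermStats(docIds.length, doc.size()));
    for (int id : docIds) doc.add(id);
  }

  /** Returns the docIds for a non-empty term, or an empty list if it is unknown. */
  List<Integer> lookup(String term) {
    // Step 1: tip tells us whether a block could hold the term at all.
    Map<String, TermStats> block = tip.get(term.charAt(0));
    if (block == null) return Collections.emptyList();
    // Step 2: tim yields the statistics and the pointer into doc.
    TermStats stats = block.get(term);
    if (stats == null) return Collections.emptyList();
    // Step 3: doc yields docFreq entries starting at docStartFP.
    return new ArrayList<>(doc.subList(stats.docStartFP, stats.docStartFP + stats.docFreq));
  }
}
```

For example, after `add("you", 0, 1)` a call to `lookup("you")` walks all three steps, while `lookup("zebra")` stops at step 1 and `lookup("yes")` stops at step 2.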

doc file

  • doc file open:
<init>:74, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
fieldsProducer:424, Lucene90PostingsFormat (org.apache.lucene.codecs.lucene90)
<init>:330, PerFieldPostingsFormat$FieldsReader (org.apache.lucene.codecs.perfield)
fieldsProducer:392, PerFieldPostingsFormat (org.apache.lucene.codecs.perfield)
<init>:118, SegmentCoreReaders (org.apache.lucene.index)
<init>:92, SegmentReader (org.apache.lucene.index)
doBody:94, StandardDirectoryReader$1 (org.apache.lucene.index)
doBody:77, StandardDirectoryReader$1 (org.apache.lucene.index)
run:816, SegmentInfos$FindSegmentsFile (org.apache.lucene.index)
open:109, StandardDirectoryReader (org.apache.lucene.index)
open:67, StandardDirectoryReader (org.apache.lucene.index)
open:60, DirectoryReader (org.apache.lucene.index)
doSearchDemo:25, SimpleSearchTest (com.dinosaur.lucene.demo)

how to find the docId list

org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java

final class BlockDocsEnum extends PostingsEnum {

  ...

  public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOException {
    // Statistics and file pointers come from the term's entry in the term dictionary.
    docFreq = termState.docFreq;
    totalTermFreq = indexHasFreq ? termState.totalTermFreq : docFreq;
    docTermStartFP = termState.docStartFP;
    skipOffset = termState.skipOffset;
    singletonDocID = termState.singletonDocID;
    // With docFreq == 1 the only docID is carried in singletonDocID,
    // so the doc file is not touched.
    if (docFreq > 1) {
      if (docIn == null) {
        // lazy init
        docIn = startDocIn.clone();
      }
      // Position the input at the start of this term's postings in the doc file.
      docIn.seek(docTermStartFP);
    }

    doc = -1;
    this.needsFreq = PostingsEnum.featureRequested(flags, PostingsEnum.FREQS);
    this.isFreqsRead = true;
    if (indexHasFreq == false || needsFreq == false) {
      // Frequencies are absent or not requested: report 1 for every doc.
      for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
        freqBuffer[i] = 1;
      }
    }
    accum = 0;
    blockUpto = 0;
    nextSkipDoc = BLOCK_SIZE - 1; // we won't skip if target is found in first block
    docBufferUpto = BLOCK_SIZE;
    skipped = false;
    return this;
  }
}
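Once `docIn` is positioned at `docStartFP`, the docIDs still have to be decoded. The sketch below shows one way this can work for a term with fewer docs than a full block, assuming the layout as I understand it: each doc is a VInt whose upper bits are the delta to the previous docID and whose lowest bit says "freq is 1", otherwise the freq follows as another VInt. The class name and its methods are hypothetical, and the packed encoding of full blocks is not covered.

```java
import java.util.ArrayList;
import java.util.List;

/** Decodes VInt-encoded postings from a byte array (illustrative sketch). */
final class VIntPostingsSketch {
  private final byte[] in;
  private int pos;

  VIntPostingsSketch(byte[] in) {
    this.in = in;
  }

  /** Reads one variable-length int: 7 data bits per byte, high bit means "more follows". */
  int readVInt() {
    int value = 0;
    for (int shift = 0; ; shift += 7) {
      byte b = in[pos++];
      value |= (b & 0x7f) << shift;
      if ((b & 0x80) == 0) return value;
    }
  }

  /** Decodes docFreq postings and returns {docID, freq} pairs. */
  List<int[]> readDocs(int docFreq) {
    List<int[]> out = new ArrayList<>();
    int accum = 0; // running docID, starts at 0 like accum in reset()
    for (int i = 0; i < docFreq; i++) {
      int code = readVInt();
      accum += code >>> 1; // upper bits: delta to the previous docID
      // Lowest bit set means freq == 1; otherwise the freq is the next VInt.
      int freq = (code & 1) != 0 ? 1 : readVInt();
      out.add(new int[] {accum, freq});
    }
    return out;
  }
}
```

Under these assumptions the bytes 01 03 from the hexdump decode to docIDs 0 and 1, each with freq 1: `new VIntPostingsSketch(new byte[] {0x01, 0x03}).readDocs(2)`.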

Related reading