
12 posts tagged with "lucene"


Lucene tokenization

· 2 min read

Background

Understand the tokenization process.

Overview

Lucene's query parsing process:

(String query , String field ) -> Query

The whole process splits the string "how old" into individual TermQuery objects.

Finally a syntax tree is constructed:

should:[how,old]
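
For illustration, a minimal sketch of that parse with the classic QueryParser; the class name ParseDemo, the field name contents, and StandardAnalyzer are assumptions made for this example:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class ParseDemo {
  public static void main(String[] args) throws Exception {
    // parse "how old" against an assumed "contents" field
    Query q = new QueryParser("contents", new StandardAnalyzer()).parse("how old");
    System.out.println(q); // contents:how contents:old  -- two SHOULD clauses
  }
}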


Background

Tokenization in Lucene is a fundamental topic; it mainly relies on the abstract method incrementToken and on extending the AttributeSource class:

public abstract class TokenStream extends AttributeSource implements Closeable {
  public abstract boolean incrementToken() throws IOException;
}
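
A minimal usage sketch of this API: the usual reset()/incrementToken()/end() loop reading the CharTermAttribute. The demo class name, the field name, and the input text are assumptions for illustration:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
  public static void main(String[] args) throws Exception {
    try (Analyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("contents", "how old are you")) {
      CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                    // required before the first incrementToken()
      while (ts.incrementToken()) {  // each call advances to the next term
        System.out.println(termAtt.toString());
      }
      ts.end();
    }
  }
}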

lucene boolean clause

Related reading

Lucene has four kinds of boolean clauses (a minimal usage sketch follows the list):

  • MUST
  • FILTER
  • SHOULD
  • MUST_NOT
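
A minimal sketch combining all four clause types on an assumed contents field; the class name and the terms are made up for the example:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class BooleanClauseDemo {
  public static void main(String[] args) {
    Query q = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("contents", "how")), BooleanClause.Occur.MUST)      // must match, contributes to the score
        .add(new TermQuery(new Term("contents", "old")), BooleanClause.Occur.SHOULD)    // optional, contributes to the score
        .add(new TermQuery(new Term("contents", "year")), BooleanClause.Occur.FILTER)   // must match, does not score
        .add(new TermQuery(new Term("contents", "new")), BooleanClause.Occur.MUST_NOT)  // must not match
        .build();
    System.out.println(q);
  }
}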

Stack trace

<init>:202, TermQuery (org.apache.lucene.search)
newTermQuery:640, QueryBuilder (org.apache.lucene.util)
add:408, QueryBuilder (org.apache.lucene.util)
analyzeMultiBoolean:427, QueryBuilder (org.apache.lucene.util)
createFieldQuery:364, QueryBuilder (org.apache.lucene.util)
createFieldQuery:257, QueryBuilder (org.apache.lucene.util)
newFieldQuery:468, QueryParserBase (org.apache.lucene.queryparser.classic)
getFieldQuery:457, QueryParserBase (org.apache.lucene.queryparser.classic)
MultiTerm:680, QueryParser (org.apache.lucene.queryparser.classic)
Query:233, QueryParser (org.apache.lucene.queryparser.classic)
TopLevelQuery:223, QueryParser (org.apache.lucene.queryparser.classic)
parse:136, QueryParserBase (org.apache.lucene.queryparser.classic)
testParse:20, ParseTest (com.dinosaur.lucene.demo)

Ranking and scoring

BlockMaxMaxscoreScorer's matches() evaluates all the terms and then computes the total score:

score:250, BM25Similarity$BM25Scorer (org.apache.lucene.search.similarities)
score:60, LeafSimScorer (org.apache.lucene.search)
score:75, TermScorer (org.apache.lucene.search)
matches:240, BlockMaxMaxscoreScorer$2 (org.apache.lucene.search)
doNext:85, TwoPhaseIterator$TwoPhaseIteratorAsDocIdSetIterator (org.apache.lucene.search)
advance:78, TwoPhaseIterator$TwoPhaseIteratorAsDocIdSetIterator (org.apache.lucene.search)
score:232, BooleanWeight$2 (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)
testSearch:131, SearchFiles (com.dinosaur.lucene.skiptest)

Related reading

Lucene search process

· 13 min read

Background

Understand Lucene's search process:

  • Tokenization
  • Compute each term's weight, then rank and take the top k

Code stack traces

  • Write path:
add:473, FSTCompiler (org.apache.lucene.util.fst)
compileIndex:504, Lucene90BlockTreeTermsWriter$PendingBlock (org.apache.lucene.codecs.lucene90.blocktree)
writeBlocks:725, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
finish:1105, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:370, Lucene90BlockTreeTermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:172, PerFieldPostingsFormat$FieldsWriter (org.apache.lucene.codecs.perfield)
flush:135, FreqProxTermsWriter (org.apache.lucene.index)
flush:310, IndexingChain (org.apache.lucene.index)
flush:392, DocumentsWriterPerThread (org.apache.lucene.index)
doFlush:492, DocumentsWriter (org.apache.lucene.index)
flushAllThreads:671, DocumentsWriter (org.apache.lucene.index)
doFlush:4194, IndexWriter (org.apache.lucene.index)
flush:4168, IndexWriter (org.apache.lucene.index)
shutdown:1322, IndexWriter (org.apache.lucene.index)
close:1362, IndexWriter (org.apache.lucene.index)
doTestSearch:133, FstTest (com.dinosaur.lucene.demo)
  • Read path:
findTargetArc:1418, FST (org.apache.lucene.util.fst)
seekExact:511, SegmentTermsEnum (org.apache.lucene.codecs.lucene90.blocktree)
loadTermsEnum:111, TermStates (org.apache.lucene.index)
build:96, TermStates (org.apache.lucene.index)
createWeight:227, TermQuery (org.apache.lucene.search)
createWeight:904, IndexSearcher (org.apache.lucene.search)
search:687, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:158, SearchFiles (com.dinosaur.lucene.demo)
testSearch:128, SearchFiles (com.dinosaur.lucene.demo)

Example

cfs file

$ hexdump  app/index/_3.cfs
000000 3f d7 6c 17 14 4c 75 63 65 6e 65 39 30 43 6f 6d
000010 70 6f 75 6e 64 44 61 74 61 00 00 00 00 7a fc 30
000020 52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00 00 00
000030 3f d7 6c 17 11 4c 75 63 65 6e 65 39 30 4e 6f 72
000040 6d 73 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51
000050 d2 54 be 49 7f 21 78 69 fe c4 00 04 03 c0 28 93
000060 e8 00 00 00 00 00 00 00 00 f0 6a f4 62 00 00 00
000070 3f d7 6c 17 16 4c 75 63 65 6e 65 39 30 46 69 65
000080 6c 64 73 49 6e 64 65 78 49 64 78 00 00 00 00 7a
000090 fc 30 52 e0 51 d2 54 be 49 7f 21 78 69 fe c4 00
0000a0 c0 28 93 e8 00 00 00 00 00 00 00 00 92 7f 21 bb
0000b0 3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 69
0000c0 6e 74 73 46 6f 72 6d 61 74 49 6e 64 65 78 00 00
0000d0 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
0000e0 fe c4 00 32 c0 28 93 e8 00 00 00 00 00 00 00 00
0000f0 f7 61 6e 2f 00 00 00 00 3f d7 6c 17 13 42 6c 6f
000100 63 6b 54 72 65 65 54 65 72 6d 73 49 6e 64 65 78
000110 00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21
000120 78 69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 00
000130 00 c0 28 93 e8 00 00 00 00 00 00 00 00 07 1a 7b
000140 47 00 00 00 00 00 00 00 3f d7 6c 17 18 4c 75 63
000150 65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000160 74 44 61 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000170 54 be 49 7f 21 78 69 fe c4 00 02 fe 00 08 80 00
000180 01 88 d2 0f 28 0d ff c0 28 93 e8 00 00 00 00 00
000190 00 00 00 6d 43 fa 6e 00 3f d7 6c 17 19 4c 75 63
0001a0 65 6e 65 39 30 50 6f 73 74 69 6e 67 73 57 72 69
0001b0 74 65 72 44 6f 63 00 00 00 00 7a fc 30 52 e0 51
0001c0 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
0001d0 65 39 30 5f 30 01 03 01 03 c0 28 93 e8 00 00 00 <--- the 01 03 on the right are the two docIds for "you"
0001e0 00 00 00 00 00 26 f5 75 88 00 00 00 00 00 00 00
0001f0 3f d7 6c 17 19 4c 75 63 65 6e 65 39 30 50 6f 73
000200 74 69 6e 67 73 57 72 69 74 65 72 50 6f 73 00 00
000210 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78 69
000220 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 02 00 00
000230 01 02 03 01 c0 28 93 e8 00 00 00 00 00 00 00 00
000240 c5 ac 32 b6 00 00 00 00 3f d7 6c 17 15 4c 75 63
000250 65 6e 65 39 30 4e 6f 72 6d 73 4d 65 74 61 64 61
000260 74 61 00 00 00 00 7a fc 30 52 e0 51 d2 54 be 49
000270 7f 21 78 69 fe c4 00 02 00 00 00 ff ff ff ff ff
000280 ff ff ff 00 00 00 00 00 00 00 00 ff ff ff 02 00
000290 00 00 01 2b 00 00 00 00 00 00 00 ff ff ff ff c0
0002a0 28 93 e8 00 00 00 00 00 00 00 00 1c 85 f4 99 00
0002b0 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f
0002c0 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74
0002d0 61 00 00 00 01 7a fc 30 52 e0 51 d2 54 be 49 7f
0002e0 21 78 69 fe c4 00 00 0a 00 01 08 12 13 01 04 02
0002f0 05 05 05 05 05 05 05 05 05 10 00 40 10 2e 2e 5c
000300 40 64 6f 63 73 40 5c 64 65 6d 40 6f 2e 74 78 40
000310 74 00 11 2e 40 2e 5c 64 6f 40 63 73 5c 64 40 65
000320 6d 6f 32 40 2e 74 78 74 c0 28 93 e8 00 00 00 00
000330 00 00 00 00 81 b0 7e 09 3f d7 6c 17 18 4c 75 63
000340 65 6e 65 39 30 50 6f 69 6e 74 73 46 6f 72 6d 61
000350 74 4d 65 74 61 00 00 00 00 7a fc 30 52 e0 51 d2
000360 54 be 49 7f 21 78 69 fe c4 00 01 00 00 00 3f d7
000370 6c 17 03 42 4b 44 00 00 00 09 01 01 80 04 08 01
000380 80 00 01 88 d2 0f 28 0d 80 00 01 88 d2 0f 28 0d
000390 02 02 01 32 00 00 00 00 00 00 00 33 00 00 00 00
0003a0 00 00 00 ff ff ff ff 44 00 00 00 00 00 00 00 4f
0003b0 00 00 00 00 00 00 00 c0 28 93 e8 00 00 00 00 00
0003c0 00 00 00 02 3e 97 d6 00 3f d7 6c 17 17 4c 75 63
0003d0 65 6e 65 39 30 46 69 65 6c 64 73 49 6e 64 65 78
0003e0 4d 65 74 61 00 00 00 01 7a fc 30 52 e0 51 d2 54
0003f0 be 49 7f 21 78 69 fe c4 00 80 80 05 02 00 00 00
000400 0a 00 00 00 02 00 00 00 30 00 00 00 00 00 00 00
000410 00 00 00 00 00 00 00 00 00 00 00 40 00 00 00 00
000420 00 00 00 00 00 30 00 00 00 00 00 00 00 36 00 00
000430 00 00 00 00 00 00 00 84 42 00 00 00 00 00 00 00
000440 00 00 30 00 00 00 00 00 00 00 78 00 00 00 00 00
000450 00 00 01 01 02 c0 28 93 e8 00 00 00 00 00 00 00
000460 00 c3 23 d0 d6 00 00 00 3f d7 6c 17 12 42 6c 6f <------- 3f
000470 63 6b 54 72 65 65 54 65 72 6d 73 44 69 63 74 00
000480 00 00 00 7a fc 30 52 e0 51 d2 54 be 49 7f 21 78
000490 69 fe c4 0a 4c 75 63 65 6e 65 39 30 5f 30 0b 9c <--------
0004a0 01 61 72 65 68 6f 77 6f 6c 64 73 74 75 64 65 6e
0004b0 74 79 6f 75 0a 03 03 03 07 03 05 04 00 05 04 00 <------ 05 04 00 05 04 are position data
0004c0 0b 7a 3d 04 00 02 01 01 05 01 00 01 05 8c 02 2e <------- 7a 3d 04 is a block of position information
0004d0 2e 5c 64 6f 63 73 5c 64 65 6d 6f 2e 74 78 74 2e
0004e0 2e 5c 64 6f 63 73 5c 64 65 6d 6f 32 2e 74 78 74
0004f0 04 10 11 01 03 04 82 01 00 05 c0 28 93 e8 00 00
000500 00 00 00 00 00 00 1a 7f dc 45 00 00 00 00 00 00
000510 3f d7 6c 17 12 42 6c 6f 63 6b 54 72 65 65 54 65
000520 72 6d 73 4d 65 74 61 00 00 00 00 7a fc 30 52 e0
000530 51 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65
000540 6e 65 39 30 5f 30 3f d7 6c 17 1b 4c 75 63 65 6e
000550 65 39 30 50 6f 73 74 69 6e 67 73 57 72 69 74 65
000560 72 54 65 72 6d 73 00 00 00 00 7a fc 30 52 e0 51
000570 d2 54 be 49 7f 21 78 69 fe c4 0a 4c 75 63 65 6e
000580 65 39 30 5f 30 80 01 02 02 05 02 da 01 07 07 02
000590 03 61 72 65 03 79 6f 75 37 3f d7 6c 17 03 46 53
0005a0 54 00 00 00 08 01 03 01 da 02 00 00 01 00 02 02
0005b0 92 03 02 02 10 2e 2e 5c 64 6f 63 73 5c 64 65 6d
0005c0 6f 2e 74 78 74 11 2e 2e 5c 64 6f 63 73 5c 64 65
0005d0 6d 6f 32 2e 74 78 74 38 3f d7 6c 17 03 46 53 54
0005e0 00 00 00 08 01 03 03 92 02 00 00 01 49 00 00 00
0005f0 00 00 00 00 a2 00 00 00 00 00 00 00 c0 28 93 e8
000600 00 00 00 00 00 00 00 00 c9 44 df a8 00 00 00 00
000610 3f d7 6c 17 12 4c 75 63 65 6e 65 39 34 46 69 65
000620 6c 64 49 6e 66 6f 73 00 00 00 00 7a fc 30 52 e0
000630 51 d2 54 be 49 7f 21 78 69 fe c4 00 03 04 70 61
000640 74 68 00 02 01 00 ff ff ff ff ff ff ff ff 02 1d
000650 50 65 72 46 69 65 6c 64 50 6f 73 74 69 6e 67 73
000660 46 6f 72 6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75
000670 63 65 6e 65 39 30 1d 50 65 72 46 69 65 6c 64 50
000680 6f 73 74 69 6e 67 73 46 6f 72 6d 61 74 2e 73 75
000690 66 66 69 78 01 30 00 00 01 00 08 6d 6f 64 69 66
0006a0 69 65 64 01 00 00 00 ff ff ff ff ff ff ff ff 00
0006b0 01 01 08 00 01 00 08 63 6f 6e 74 65 6e 74 73 02
0006c0 00 03 00 ff ff ff ff ff ff ff ff 02 1d 50 65 72
0006d0 46 69 65 6c 64 50 6f 73 74 69 6e 67 73 46 6f 72
0006e0 6d 61 74 2e 66 6f 72 6d 61 74 08 4c 75 63 65 6e
0006f0 65 39 30 1d 50 65 72 46 69 65 6c 64 50 6f 73 74
000700 69 6e 67 73 46 6f 72 6d 61 74 2e 73 75 66 66 69
000710 78 01 30 00 00 01 00 c0 28 93 e8 00 00 00 00 00
000720 00 00 00 36 55 24 d2 c0 28 93 e8 00 00 00 00 00
000730 00 00 00 41 6a 49 d4

The tim data starts at offset 1128 (0x468) inside this compound file.

score:250, BM25Similarity$BM25Scorer (org.apache.lucene.search.similarities)
score:60, LeafSimScorer (org.apache.lucene.search)
score:75, TermScorer (org.apache.lucene.search)
collect:73, TopScoreDocCollector$SimpleTopScoreDocCollector$1 (org.apache.lucene.search)
scoreAll:305, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)

readField:248, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)
document:642, Lucene90CompressingStoredFieldsReader (org.apache.lucene.codecs.lucene90.compressing)
document:253, SegmentReader (org.apache.lucene.index)
document:171, BaseCompositeReader (org.apache.lucene.index)
document:411, IndexReader (org.apache.lucene.index)
doc:390, IndexSearcher (org.apache.lucene.search)
doPagingSearch:195, SearchFiles (com.dinosaur.lucene.skiptest)

Relationship between tim/tip/doc

tip holds pointers into the terms (the terms index), tim contains the per-term statistics, and doc records the docIds for each term.

In other words: tip -> tim -> doc (a lookup sketch via the public API follows the list below):

  • Use tip to check whether the term exists
  • Then follow tip into tim to read the term's statistics
  • Then read the doc file to get the array of docIds containing the term
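
The sketch: the codec walks tip/tim/doc underneath this public API; the field contents and the term you are assumptions taken from the example index above, and the class name is made up:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public class TermLookupDemo {
  static void printPostings(IndexReader reader) throws IOException {
    for (LeafReaderContext ctx : reader.leaves()) {
      Terms terms = ctx.reader().terms("contents");             // assumed field name
      if (terms == null) continue;
      TermsEnum te = terms.iterator();
      if (te.seekExact(new BytesRef("you"))) {                  // term lookup: tip -> tim
        System.out.println("docFreq=" + te.docFreq() + " totalTermFreq=" + te.totalTermFreq()); // stats from tim
        PostingsEnum pe = te.postings(null, PostingsEnum.FREQS);                                // postings from doc
        for (int d = pe.nextDoc(); d != DocIdSetIterator.NO_MORE_DOCS; d = pe.nextDoc()) {
          System.out.println("doc=" + d + " freq=" + pe.freq());
        }
      }
    }
  }
}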

doc file

  • doc file open:
<init>:74, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
fieldsProducer:424, Lucene90PostingsFormat (org.apache.lucene.codecs.lucene90)
<init>:330, PerFieldPostingsFormat$FieldsReader (org.apache.lucene.codecs.perfield)
fieldsProducer:392, PerFieldPostingsFormat (org.apache.lucene.codecs.perfield)
<init>:118, SegmentCoreReaders (org.apache.lucene.index)
<init>:92, SegmentReader (org.apache.lucene.index)
doBody:94, StandardDirectoryReader$1 (org.apache.lucene.index)
doBody:77, StandardDirectoryReader$1 (org.apache.lucene.index)
run:816, SegmentInfos$FindSegmentsFile (org.apache.lucene.index)
open:109, StandardDirectoryReader (org.apache.lucene.index)
open:67, StandardDirectoryReader (org.apache.lucene.index)
open:60, DirectoryReader (org.apache.lucene.index)
doSearchDemo:25, SimpleSearchTest (com.dinosaur.lucene.demo)

how to find the docId list

org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java

  final class BlockDocsEnum extends PostingsEnum {

...

public PostingsEnum reset(IntBlockTermState termState, int flags) throws IOException {
docFreq = termState.docFreq;
totalTermFreq = indexHasFreq ? termState.totalTermFreq : docFreq;
docTermStartFP = termState.docStartFP;
skipOffset = termState.skipOffset;
singletonDocID = termState.singletonDocID;
if (docFreq > 1) {
if (docIn == null) {
// lazy init
docIn = startDocIn.clone();
}
docIn.seek(docTermStartFP);
}

doc = -1;
this.needsFreq = PostingsEnum.featureRequested(flags, PostingsEnum.FREQS);
this.isFreqsRead = true;
if (indexHasFreq == false || needsFreq == false) {
for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
freqBuffer[i] = 1;
}
}
accum = 0;
blockUpto = 0;
nextSkipDoc = BLOCK_SIZE - 1; // we won't skip if target is found in first block
docBufferUpto = BLOCK_SIZE;
skipped = false;
return this;
}
}

Related reading

WFST, Lucene, and FST

· One min read

In a WFST (Weighted Finite State Transducer), the "All Pairs Shortest Path" (APSP) algorithm computes the shortest path between any two states. Every transition between states carries a weight, and the goal of APSP is to find, for each pair of states, the connecting path with the smallest total weight.

Related reading

FST structure

· 6 min read

Background

Understand Lucene's FST structure.

Core functions

freezeTail -> compileNode

  private void freezeTail(int prefixLenPlus1) throws IOException {  // the argument is an offset: common-prefix length + 1
final int downTo = Math.max(1, prefixLenPlus1);
for (int idx = lastInput.length(); idx >= downTo; idx--) {

boolean doPrune = false;
boolean doCompile = false;


if (doCompile) {
...
parent.replaceLast(
lastInput.intAt(idx - 1),
compileNode(node, 1 + lastInput.length() - idx),
nextFinalOutput,
isFinal);
...
}
}
}
}
  private CompiledNode compileNode(UnCompiledNode<T> nodeIn, int tailLength) throws IOException {
final long node;
long bytesPosStart = bytes.getPosition();
if (dedupHash != null
&& (doShareNonSingletonNodes || nodeIn.numArcs <= 1)
&& tailLength <= shareMaxTailLength) {
if (nodeIn.numArcs == 0) {
node = fst.addNode(this, nodeIn);
lastFrozenNode = node;
} else {
node = dedupHash.add(this, nodeIn);
}
} else {
node = fst.addNode(this, nodeIn);
}
assert node != -2;

long bytesPosEnd = bytes.getPosition();
if (bytesPosEnd != bytesPosStart) {
// The FST added a new node:
assert bytesPosEnd > bytesPosStart;
lastFrozenNode = node;
}

nodeIn.clear();

final CompiledNode fn = new CompiledNode();
fn.node = node;
return fn;
}
  // serializes new node by appending its bytes to the end
// of the current byte[]
long addNode(FSTCompiler<T> fstCompiler, FSTCompiler.UnCompiledNode<T> nodeIn)
throws IOException {
T NO_OUTPUT = outputs.getNoOutput();

// System.out.println("FST.addNode pos=" + bytes.getPosition() + " numArcs=" + nodeIn.numArcs);
if (nodeIn.numArcs == 0) {
if (nodeIn.isFinal) {
return FINAL_END_NODE;
} else {
return NON_FINAL_END_NODE;
}
}
final long startAddress = fstCompiler.bytes.getPosition();
// System.out.println(" startAddr=" + startAddress);

final boolean doFixedLengthArcs = shouldExpandNodeWithFixedLengthArcs(fstCompiler, nodeIn);
if (doFixedLengthArcs) {
// System.out.println(" fixed length arcs");
if (fstCompiler.numBytesPerArc.length < nodeIn.numArcs) {
fstCompiler.numBytesPerArc = new int[ArrayUtil.oversize(nodeIn.numArcs, Integer.BYTES)];
fstCompiler.numLabelBytesPerArc = new int[fstCompiler.numBytesPerArc.length];
}
}

fstCompiler.arcCount += nodeIn.numArcs;

final int lastArc = nodeIn.numArcs - 1;

long lastArcStart = fstCompiler.bytes.getPosition();
int maxBytesPerArc = 0;
int maxBytesPerArcWithoutLabel = 0;
for (int arcIdx = 0; arcIdx < nodeIn.numArcs; arcIdx++) {
final FSTCompiler.Arc<T> arc = nodeIn.arcs[arcIdx];
final FSTCompiler.CompiledNode target = (FSTCompiler.CompiledNode) arc.target;
int flags = 0;
// System.out.println(" arc " + arcIdx + " label=" + arc.label + " -> target=" +
// target.node);

if (arcIdx == lastArc) {
flags += BIT_LAST_ARC;
}

if (fstCompiler.lastFrozenNode == target.node && !doFixedLengthArcs) {
// TODO: for better perf (but more RAM used) we
// could avoid this except when arc is "near" the
// last arc:
flags += BIT_TARGET_NEXT;
}

if (arc.isFinal) {
flags += BIT_FINAL_ARC;
if (arc.nextFinalOutput != NO_OUTPUT) {
flags += BIT_ARC_HAS_FINAL_OUTPUT;
}
} else {
assert arc.nextFinalOutput == NO_OUTPUT;
}

boolean targetHasArcs = target.node > 0;

if (!targetHasArcs) {
flags += BIT_STOP_NODE;
}

if (arc.output != NO_OUTPUT) {
flags += BIT_ARC_HAS_OUTPUT;
}

fstCompiler.bytes.writeByte((byte) flags);
long labelStart = fstCompiler.bytes.getPosition();
writeLabel(fstCompiler.bytes, arc.label);
int numLabelBytes = (int) (fstCompiler.bytes.getPosition() - labelStart);

// System.out.println(" write arc: label=" + (char) arc.label + " flags=" + flags + "
// target=" + target.node + " pos=" + bytes.getPosition() + " output=" +
// outputs.outputToString(arc.output));

if (arc.output != NO_OUTPUT) {
outputs.write(arc.output, fstCompiler.bytes);
// System.out.println(" write output");
}

if (arc.nextFinalOutput != NO_OUTPUT) {
// System.out.println(" write final output");
outputs.writeFinalOutput(arc.nextFinalOutput, fstCompiler.bytes);
}

if (targetHasArcs && (flags & BIT_TARGET_NEXT) == 0) {
assert target.node > 0;
// System.out.println(" write target");
fstCompiler.bytes.writeVLong(target.node);
}

// just write the arcs "like normal" on first pass, but record how many bytes each one took
// and max byte size:
if (doFixedLengthArcs) {
int numArcBytes = (int) (fstCompiler.bytes.getPosition() - lastArcStart);
fstCompiler.numBytesPerArc[arcIdx] = numArcBytes;
fstCompiler.numLabelBytesPerArc[arcIdx] = numLabelBytes;
lastArcStart = fstCompiler.bytes.getPosition();
maxBytesPerArc = Math.max(maxBytesPerArc, numArcBytes);
maxBytesPerArcWithoutLabel =
Math.max(maxBytesPerArcWithoutLabel, numArcBytes - numLabelBytes);
// System.out.println(" arcBytes=" + numArcBytes + " labelBytes=" + numLabelBytes);
}
}

// TODO: try to avoid wasteful cases: disable doFixedLengthArcs in that case
/*
*
* LUCENE-4682: what is a fair heuristic here?
* It could involve some of these:
* 1. how "busy" the node is: nodeIn.inputCount relative to frontier[0].inputCount?
* 2. how much binSearch saves over scan: nodeIn.numArcs
* 3. waste: numBytes vs numBytesExpanded
*
* the one below just looks at #3
if (doFixedLengthArcs) {
// rough heuristic: make this 1.25 "waste factor" a parameter to the phd ctor????
int numBytes = lastArcStart - startAddress;
int numBytesExpanded = maxBytesPerArc * nodeIn.numArcs;
if (numBytesExpanded > numBytes*1.25) {
doFixedLengthArcs = false;
}
}
*/

if (doFixedLengthArcs) {
assert maxBytesPerArc > 0;
// 2nd pass just "expands" all arcs to take up a fixed byte size

int labelRange = nodeIn.arcs[nodeIn.numArcs - 1].label - nodeIn.arcs[0].label + 1;
assert labelRange > 0;
if (shouldExpandNodeWithDirectAddressing(
fstCompiler, nodeIn, maxBytesPerArc, maxBytesPerArcWithoutLabel, labelRange)) {
writeNodeForDirectAddressing(
fstCompiler, nodeIn, startAddress, maxBytesPerArcWithoutLabel, labelRange);
fstCompiler.directAddressingNodeCount++;
} else {
writeNodeForBinarySearch(fstCompiler, nodeIn, startAddress, maxBytesPerArc);
fstCompiler.binarySearchNodeCount++;
}
}

final long thisNodeAddress = fstCompiler.bytes.getPosition() - 1;
fstCompiler.bytes.reverse(startAddress, thisNodeAddress);
fstCompiler.nodeCount++;
return thisNodeAddress;
}

Arc<T> describes an arc:

//package org.apache.lucene.util.fst;
// org\apache\lucene\util\fst\FSTCompiler.java
/** Expert: holds a pending (seen but not yet serialized) arc. */
static class Arc<T> {
  int label; // really an "unsigned" byte // the arc's label
  Node target; // e.g. for a -> b, target is b
  boolean isFinal;
  T output;
  T nextFinalOutput;
}

Example

cat has weight 5
dog has weight 7
dogs has weight 13

   String inputValues[] = {"cat", "dog", "dogs"};
long[] outputValues = {5, 7, 13};

Below is an example for cat. cat consists of the three characters c, a, t, whose ASCII codes are:

  • c:99
  • a:97
  • t:116

Defining an arc: we write [a:label1] -> [b:label2] for an arc from node a to node b, where node a has value label1 and node b has value label2.

Below is a screenshot from IntelliJ IDEA: [2747:c] -> [2856:a] -> [2860:t]

(figure: FST linked list)

Below is an example of fst.bytes:

(figure: fst.bytes)

It is finally serialized into the following form:

[0, 116, 15, 97, 6, 6, 115, 31, 103, 7, 111, 6, 7, 100, 22, 4, 5, 99, 16]
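
For reference, a hedged sketch of building the same FST with the public FSTCompiler API and then looking up dog. The constructor and method names follow the 9.x API and may differ slightly in other Lucene versions; the class name is made up:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.FSTCompiler;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDemo {
  public static void main(String[] args) throws Exception {
    String[] inputValues = {"cat", "dog", "dogs"};   // inputs must be added in sorted order
    long[] outputValues = {5, 7, 13};

    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    FSTCompiler<Long> compiler = new FSTCompiler<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();
    for (int i = 0; i < inputValues.length; i++) {
      compiler.add(Util.toIntsRef(new BytesRef(inputValues[i]), scratch), outputValues[i]);
    }
    FST<Long> fst = compiler.compile();
    System.out.println(Util.get(fst, new BytesRef("dog")));   // prints 7
  }
}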

Related reading

priority queue

· 2 min read

After Lucene computes a score for each hit via the scorer, it still needs to take the top k results, so a top-k algorithm is required; this is usually implemented with a priority queue.
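
A minimal top-k sketch built on java.util.PriorityQueue used as a min-heap (keep k elements, evict the smallest), which is the same idea behind Lucene's HitQueue; the class and method names are made up:

import java.util.PriorityQueue;

public class TopKDemo {
  // keep the k highest scores seen so far
  static float[] topK(float[] scores, int k) {
    PriorityQueue<Float> pq = new PriorityQueue<>();   // min-heap: the smallest kept score is on top
    for (float s : scores) {
      if (pq.size() < k) {
        pq.add(s);
      } else if (s > pq.peek()) {                      // better than the current worst -> replace it
        pq.poll();
        pq.add(s);
      }
    }
    float[] out = new float[pq.size()];
    for (int i = out.length - 1; i >= 0; i--) {
      out[i] = pq.poll();                              // pops in ascending order, so fill from the back
    }
    return out;
  }
}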

Introduction

The following describes a max-priority queue.

There are two kinds of priority queues, min-priority and max-priority; they differ only in direction.

First, its properties:

Composition: a priority queue is a set S of items, where each item has two parts: an element and a key.

Operations

  • insert(S, item)
  • maximum(S)
  • extract_max(S)
  • increase_key(S, element, key): raise the key of an element already in the queue

Proof

For a non-empty complete binary tree whose first node is numbered index = 1, every node index satisfies:

  • its left child is left = index * 2
  • its right child is right = index * 2 + 1

Proof by induction. Base case: when index = 1, left = 2, which satisfies left = index * 2; and right = 3, which satisfies right = index * 2 + 1.

Inductive step: for element n + 1, if it is a left child, its preceding node satisfies n = pre_parent * 2 + 1; then n + 1 = (pre_parent * 2 + 1) + 1 = (pre_parent + 1) * 2, which matches the recurrence.

The right-child case follows by the same argument.

This completes the proof.

References

Introduction to Algorithms (CLRS)

Lucene tim file format

· 5 min read

Background

The tim file is where Lucene stores per-term statistics; the related tip file indexes it.

Format and example

File format:

The corresponding documentation is linked in the related reading at the bottom.

TermsDict (.tim) --> Header, PostingsHeader, NodeBlock^NumBlocks, Footer
NodeBlock --> (OuterNode | InnerNode)
OuterNode --> EntryCount, SuffixLength, Byte^SuffixLength, StatsLength, < TermStats >^EntryCount, MetaLength, <TermMetadata>^EntryCount
InnerNode --> EntryCount, SuffixLength[,Sub?], Byte^SuffixLength, StatsLength, < TermStats ? >^EntryCount, MetaLength, <TermMetadata ? >^EntryCount
TermStats --> DocFreq, TotalTermFreq
Header --> CodecHeader
EntryCount,SuffixLength,StatsLength,DocFreq,MetaLength --> VInt
TotalTermFreq --> VLong
Footer --> CodecFooter

Example

hexdump -C  _j_Lucene90_0.tim 

00000000 3f d7 6c 17 12 42 6c 6f 63 6b 54 72 65 65 54 65 |?.l..BlockTreeTe|
00000010 72 6d 73 44 69 63 74 00 00 00 00 fe ea 80 e6 45 |rmsDict........E|
00000020 20 d8 56 64 1b 1b 1b 89 70 fe 67 0a 4c 75 63 65 | .Vd....p.g.Luce|
00000030 6e 65 39 30 5f 30 25 bc 03 61 6d 61 6e 64 62 75 |ne90_0%..amandbu|
00000040 74 63 61 6e 64 6f 68 65 6c 6c 6f 68 69 69 69 73 |tcandohellohiiis|
00000050 69 74 6b 6e 6f 77 6d 61 79 6d 6f 6e 67 6f 6e 6f |itknowmaymongono|
00000060 74 74 72 79 77 68 61 74 77 6f 72 6c 64 79 6f 75 |ttrywhatworldyou|
00000070 24 02 03 03 03 02 05 02 01 02 02 04 03 05 03 03 |$...............|
00000080 04 05 03 10 04 00 09 02 01 04 00 03 02 01 01 02 |................|
00000090 01 07 02 02 26 7a 3d 04 01 02 03 01 01 01 01 01 |....&z=.........| <--- the sixth byte in this row, i.e. starting at 7a
000000a0 05 01 01 01 00 02 04 00 02 01 01 01 01 01 02 01 |................|
000000b0 01 01 02 01 01 01 01 05 01 03 01 05 a4 03 2f 68 |............../h|
000000c0 6f 6d 65 2f 75 62 75 6e 74 75 2f 64 6f 63 2f 68 |ome/ubuntu/doc/h|
000000d0 65 6c 6c 6f 2e 74 78 74 2f 68 6f 6d 65 2f 75 62 |ello.txt/home/ub|
000000e0 75 6e 74 75 2f 64 6f 63 2f 6d 6f 6e 67 6f 2e 74 |untu/doc/mongo.t|
000000f0 78 74 05 1a 01 03 04 82 01 01 03 c0 28 93 e8 00 |xt..........(...|
00000100 00 00 00 00 00 00 00 da 02 a3 a3 |...........|

Here ste.in is the input over the tim file's data:

main[2] list
472 }
473 }
474
475 // metadata
476 => ste.fr.parent.postingsReader.decodeTerm(bytesReader, ste.fr.fieldInfo, state, absolute);
477
478 metaDataUpto++;
479 absolute = false;
480 }
481 state.termBlockOrd = metaDataUpto;
main[2] print ste.in
ste.in = "MMapIndexInput(path="/home/ubuntu/index/_j_Lucene90_0.tim")"

The corresponding bytes are:

main[2] dump bytesReader.bytes
bytesReader.bytes = {
122, 61, 4, 1, 2, 3, 1, 1, 1, 1, 1, 5, 1, 1, 1, 0, 2, 4, 0, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 5, 1, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}


Related reading

Lucene 10 source code analysis

· 15 min read

Background

The Lucene checkout on my home machine is version 10.

Creating and saving the index

### start the debuggee with a JDWP agent
java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp /home/dai/lucene/lucene/demo/build/libs/lucene-demo-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.demo.IndexFiles -docs /home/dai/docs
### attach jdb
jdb -attach 8000 -sourcepath /home/dai/lucene/lucene/demo/src/java/:/home/dai/lucene/lucene/core/src/java/

Tokenization

Both the inverted index and tokenization happen in this code path:

main[1] where
[1] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,140)
[2] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[3] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[4] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[5] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[6] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[7] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[8] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[9] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[10] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[12] java.nio.file.Files.walkFileTree (Files.java:2,811)
[13] java.nio.file.Files.walkFileTree (Files.java:2,882)
[14] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[15] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Step completed: "thread=main", org.apache.lucene.index.TermsHashPerField.add(), line=193 bci=22
193 int termID = bytesHash.add(termBytes);

main[1] print termBytes
termBytes = "[2f 68 6f 6d 65 2f 64 61 69 2f 64 6f 63 73 2f 62 62 62 2e 74 78 74]"

invert

The core of the inverted index is building a term => doc mapping. The central class is lucene/core/src/java/org/apache/lucene/index/FreqProxTermsWriterPerField.java:

  @Override
void addTerm(final int termID, final int docID) {
final FreqProxPostingsArray postings = freqProxPostingsArray;
assert !hasFreq || postings.termFreqs[termID] > 0;

if (!hasFreq) {
assert postings.termFreqs == null;
if (termFreqAtt.getTermFrequency() != 1) {
throw new IllegalStateException(
"field \""
+ getFieldName()
+ "\": must index term freq while using custom TermFrequencyAttribute");
}
if (docID != postings.lastDocIDs[termID]) {
// New document; now encode docCode for previous doc:
assert docID > postings.lastDocIDs[termID];
writeVInt(0, postings.lastDocCodes[termID]);
postings.lastDocCodes[termID] = docID - postings.lastDocIDs[termID];
postings.lastDocIDs[termID] = docID;
fieldState.uniqueTermCount++;
}
} else if (docID != postings.lastDocIDs[termID]) {
assert docID > postings.lastDocIDs[termID]
: "id: " + docID + " postings ID: " + postings.lastDocIDs[termID] + " termID: " + termID;
// Term not yet seen in the current doc but previously
// seen in other doc(s) since the last flush

// Now that we know doc freq for previous doc,
// write it & lastDocCode
if (1 == postings.termFreqs[termID]) {
writeVInt(0, postings.lastDocCodes[termID] | 1);
} else {
writeVInt(0, postings.lastDocCodes[termID]);
writeVInt(0, postings.termFreqs[termID]);
}

// Init freq for the current document
postings.termFreqs[termID] = getTermFreq();
fieldState.maxTermFrequency =
Math.max(postings.termFreqs[termID], fieldState.maxTermFrequency);
postings.lastDocCodes[termID] = (docID - postings.lastDocIDs[termID]) << 1;
postings.lastDocIDs[termID] = docID;
if (hasProx) {
writeProx(termID, fieldState.position);
if (hasOffsets) {
postings.lastOffsets[termID] = 0;
writeOffsets(termID, fieldState.offset);
}
} else {
assert !hasOffsets;
}
fieldState.uniqueTermCount++;
} else {
postings.termFreqs[termID] = Math.addExact(postings.termFreqs[termID], getTermFreq());
fieldState.maxTermFrequency =
Math.max(fieldState.maxTermFrequency, postings.termFreqs[termID]);
if (hasProx) {
writeProx(termID, fieldState.position - postings.lastPositions[termID]);
if (hasOffsets) {
writeOffsets(termID, fieldState.offset);
}
}
}
}

Generating the termID

Stack trace

main[1] where
[1] org.apache.lucene.index.TermsHashPerField.initStreamSlices (TermsHashPerField.java:150)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:198)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[11] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[13] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[14] java.nio.file.Files.walkFileTree (Files.java:2,811)
[15] java.nio.file.Files.walkFileTree (Files.java:2,882)
[16] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[17] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

      IntBlockPool intPool,
ByteBlockPool bytePool,
ByteBlockPool termBytePool,

First, the intPool variable: it maintains a two-dimensional array int[][] buffers plus three offsets that record positions into bytePool.

public final class IntBlockPool {
  ...

  // core structure: this 2-D array stores offsets into bytePool;
  // the initial capacity is 10 and it grows automatically later
  public int[][] buffers = new int[10][];

  // offset into the 2-D array, used together with buffers, e.g. buffers[bufferUpto + offset]
  private int bufferUpto = -1;
  // the 1-D array inside the 2-D array that is currently being written,
  // e.g. buffer = buffers[1];
  public int[] buffer;
  // intUpto is the offset relative to the current 1-D array
  public int intUpto = INT_BLOCK_SIZE;
  // absolute offset relative to the whole 2-D array, a bit like relative vs. absolute jumps in a CPU
  public int intOffset = -INT_BLOCK_SIZE;
}

Like intPool, bytePool and termBytePool are also described by a few offsets plus a two-dimensional array:

public final class ByteBlockPool implements Accountable {
  ...
  // core structure: a 2-D array
  public byte[][] buffers = new byte[10][];

  /** index into the buffers array pointing to the current buffer used as the head */
  private int bufferUpto = -1; // Which buffer we are upto
  /** Where we are in head buffer */
  public int byteUpto = BYTE_BLOCK_SIZE;

  /** Current head buffer */
  public byte[] buffer;
  /** Current head offset */
  public int byteOffset = -BYTE_BLOCK_SIZE;
}

Query and search

Breakpoints

## start the debuggee with a JDWP agent
java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp /home/dai/lucene/lucene/demo/build/libs/lucene-demo-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/core/build/libs/lucene-core-10.0.0-SNAPSHOT.jar:/home/dai/lucene/lucene/queryparser/build/libs/lucene-queryparser-10.0.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles

## attach jdb
jdb -attach 8000 -sourcepath /home/dai/lucene/lucene/demo/src/java/:/home/dai/lucene/lucene/core/src/java/

termState describes the term's statistics:

main[1] print termState
termState = "TermStates
state=docFreq=1 totalTermFreq=1 termBlockOrd=2 blockFP=0 docStartFP=63 posStartFP=63 payStartFP=0 lastPosBlockOffset=-1 singletonDocID=6
"
main[1] print term
term = "contents:am"
main[1] where
[1] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:233)
[2] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:894)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[4] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[6] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[7] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Ranking

The default ranking similarity is BM25Similarity:

main[1] where
[1] org.apache.lucene.search.similarities.BM25Similarity.scorer (BM25Similarity.java:200)
[2] org.apache.lucene.search.TermQuery$TermWeight.<init> (TermQuery.java:75)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:233)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:894)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
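
For reference, the similarity can also be configured explicitly on the searcher. A minimal sketch; k1 = 1.2 and b = 0.75 are Lucene's defaults, and the class name is made up:

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;

public class SimilarityConfig {
  static IndexSearcher newSearcher(DirectoryReader reader) {
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new BM25Similarity(1.2f, 0.75f));  // explicit k1 and b (same as the defaults)
    return searcher;
  }
}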

Core search parameters

main[1] list
763 // there is no doc of interest in this reader context
764 // continue with the following leaf
765 continue;
766 }
767 => BulkScorer scorer = weight.bulkScorer(ctx);
768 if (scorer != null) {
769 try {
770 scorer.score(leafCollector, ctx.reader().getLiveDocs());
771 } catch (
772 @SuppressWarnings("unused")
main[1] where
[1] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[4] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[6] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[7] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Getting the reader

Step completed: "thread=main", org.apache.lucene.index.LeafReaderContext.reader(), line=67 bci=0
67 return reader;

main[1] print reader
reader = "_0(10.0.0):c7:[diagnostics={source=flush, os.arch=amd64, java.runtime.version=17.0.3+7-Ubuntu-0ubuntu0.22.04.1, os.version=5.15.0-33-generic, java.vendor=Private Build, os=Linux, timestamp=1656601918836, java.version=17.0.3, java.vm.version=17.0.3+7-Ubuntu-0ubuntu0.22.04.1, lucene.version=10.0.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=c276i3vlaza4c6uumuxapfnvf"
main[1] where
[1] org.apache.lucene.index.LeafReaderContext.reader (LeafReaderContext.java:67)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[5] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The reader object here:

main[1] dump reader
reader = {
si: instance of org.apache.lucene.index.SegmentCommitInfo(id=1531)
originalSi: instance of org.apache.lucene.index.SegmentCommitInfo(id=1532)
metaData: instance of org.apache.lucene.index.LeafMetaData(id=1533)
liveDocs: null
hardLiveDocs: null
numDocs: 7
core: instance of org.apache.lucene.index.SegmentCoreReaders(id=1534)
segDocValues: instance of org.apache.lucene.index.SegmentDocValues(id=1535)
isNRT: false
docValuesProducer: null
fieldInfos: instance of org.apache.lucene.index.FieldInfos(id=1536)
readerClosedListeners: instance of java.util.concurrent.CopyOnWriteArraySet(id=1537)
readerCacheHelper: instance of org.apache.lucene.index.SegmentReader$1(id=1538)
coreCacheHelper: instance of org.apache.lucene.index.SegmentReader$2(id=1539)
$assertionsDisabled: true
org.apache.lucene.index.LeafReader.readerContext: instance of org.apache.lucene.index.LeafReaderContext(id=1540)
org.apache.lucene.index.LeafReader.$assertionsDisabled: true
org.apache.lucene.index.IndexReader.closed: false
org.apache.lucene.index.IndexReader.closedByChild: false
org.apache.lucene.index.IndexReader.refCount: instance of java.util.concurrent.atomic.AtomicInteger(id=1541)
org.apache.lucene.index.IndexReader.parentReaders: instance of java.util.Collections$SynchronizedSet(id=1542)
}

Ranking:

main[1] list
222
223 @Override
224 public int score(LeafCollector collector, Bits acceptDocs, int min, int max)
225 throws IOException {
226 => collector.setScorer(scorer);
227 DocIdSetIterator scorerIterator = twoPhase == null ? iterator : twoPhase.approximation();
228 DocIdSetIterator competitiveIterator = collector.competitiveIterator();
229 DocIdSetIterator filteredIterator;
230 if (competitiveIterator == null) {
231 filteredIterator = scorerIterator;
main[1] where
[1] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:226)
[2] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Ranking

  private static class SimpleTopScoreDocCollector extends TopScoreDocCollector {

...

@Override
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
...
return new ScorerLeafCollector() {
...
@Override
public void collect(int doc) throws IOException {
float score = scorer.score(); <---- no docId needs to be passed here to get the score, because the docId is available from the enclosing TopScoreDocCollector

// This collector relies on the fact that scorers produce positive values:
assert score >= 0; // NOTE: false for NaN

totalHits++;
hitsThresholdChecker.incrementHitCount();

if (minScoreAcc != null && (totalHits & minScoreAcc.modInterval) == 0) {
updateGlobalMinCompetitiveScore(scorer);
}

if (score <= pqTop.score) {
if (totalHitsRelation == TotalHits.Relation.EQUAL_TO) {
// we just reached totalHitsThreshold, we can start setting the min
// competitive score now
updateMinCompetitiveScore(scorer);
}
// Since docs are returned in-order (i.e., increasing doc Id), a document
// with equal score to pqTop.score cannot compete since HitQueue favors
// documents with lower doc Ids. Therefore reject those docs too.
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
updateMinCompetitiveScore(scorer);
}
};
}
main[1] print scorer
scorer = "scorer(weight(contents:am))[org.apache.lucene.search.TermScorer@290dbf45]"
main[1] dump scorer
scorer = {
postingsEnum: instance of org.apache.lucene.index.SlowImpactsEnum(id=1546)
impactsEnum: instance of org.apache.lucene.index.SlowImpactsEnum(id=1546)
iterator: instance of org.apache.lucene.search.ImpactsDISI(id=1547)
docScorer: instance of org.apache.lucene.search.LeafSimScorer(id=1548)
impactsDisi: instance of org.apache.lucene.search.ImpactsDISI(id=1547)
$assertionsDisabled: true
org.apache.lucene.search.Scorer.weight: instance of org.apache.lucene.search.TermQuery$TermWeight(id=1549)
}
main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:76) <--- no doc_id is passed into the scorer here because a callback provides it; there is also a pq (priority queue) holding the docs in sorted order
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[3] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[4] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

Core scoring function

Ranking and scoring

main[1] list
246 // float. And then monotonicity is preserved through composition via
247 // x -> 1 + x and x -> 1 - 1/x.
248 // Finally we expand weight * (1 - 1 / (1 + freq * 1/norm)) to
249 // weight - weight / (1 + freq * 1/norm), which runs slightly faster.
250 => float normInverse = cache[((byte) encodedNorm) & 0xFF];
251 return weight - weight / (1f + freq * normInverse);
252 }
253
254 @Override
255 public Explanation explain(Explanation freq, long encodedNorm) {
main[1] where
[1] org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer.score (BM25Similarity.java:250)
[2] org.apache.lucene.search.LeafSimScorer.score (LeafSimScorer.java:60)
[3] org.apache.lucene.search.TermScorer.score (TermScorer.java:75)
[4] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:73)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
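
A small sketch restating the simplified leaf-scoring expression shown above; weight folds in IDF and the k1/b constants, normInverse is the cached 1/norm value, and the class and method names are made up:

public class Bm25LeafScore {
  // score = weight * (1 - 1 / (1 + freq * (1/norm)))
  //       = weight - weight / (1 + freq * (1/norm))
  static float score(float weight, float freq, float normInverse) {
    return weight - weight / (1f + freq * normInverse);
  }
}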

The reduce step

main[1] list
60 * Populates the results array with the ScoreDoc instances. This can be overridden in case a
61 * different ScoreDoc type should be returned.
62 */
63 protected void populateResults(ScoreDoc[] results, int howMany) {
64 => for (int i = howMany - 1; i >= 0; i--) {
65 results[i] = pq.pop();
66 }
67 }
68
69 /**
main[1] where
[1] org.apache.lucene.search.TopDocsCollector.populateResults (TopDocsCollector.java:64)
[2] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:166)
[3] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:98)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:526)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Helper function that extracts the top-k results

Stack trace:

main[1] where
[1] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:312)
[2] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[3] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

  /**
* Auxiliary method used by the {@link #merge} impls. A sort value of null is used to indicate
* that docs should be sorted by score.
*/
private static TopDocs mergeAux(
Sort sort, int start, int size, TopDocs[] shardHits, Comparator<ScoreDoc> tieBreaker) {

final PriorityQueue<ShardRef> queue;
if (sort == null) {
queue = new ScoreMergeSortQueue(shardHits, tieBreaker);
} else {
queue = new MergeSortQueue(sort, shardHits, tieBreaker);
}

long totalHitCount = 0;
TotalHits.Relation totalHitsRelation = TotalHits.Relation.EQUAL_TO;
int availHitCount = 0;
for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {
final TopDocs shard = shardHits[shardIDX];
// totalHits can be non-zero even if no hits were
// collected, when searchAfter was used:
totalHitCount += shard.totalHits.value;
// If any hit count is a lower bound then the merged
// total hit count is a lower bound as well
if (shard.totalHits.relation == TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO) {
totalHitsRelation = TotalHits.Relation.GREATER_THAN_OR_EQUAL_TO;
}
if (shard.scoreDocs != null && shard.scoreDocs.length > 0) {
availHitCount += shard.scoreDocs.length;
queue.add(new ShardRef(shardIDX));
}
}

final ScoreDoc[] hits;
boolean unsetShardIndex = false;
if (availHitCount <= start) {
hits = new ScoreDoc[0];
} else {
hits = new ScoreDoc[Math.min(size, availHitCount - start)];
int requestedResultWindow = start + size;
int numIterOnHits = Math.min(availHitCount, requestedResultWindow);
int hitUpto = 0;
while (hitUpto < numIterOnHits) {
assert queue.size() > 0;
ShardRef ref = queue.top();
final ScoreDoc hit = shardHits[ref.shardIndex].scoreDocs[ref.hitIndex++];

// Irrespective of whether we use shard indices for tie breaking or not, we check for
// consistent
// order in shard indices to defend against potential bugs
if (hitUpto > 0) {
if (unsetShardIndex != (hit.shardIndex == -1)) {
throw new IllegalArgumentException("Inconsistent order of shard indices");
}
}

unsetShardIndex |= hit.shardIndex == -1;

if (hitUpto >= start) {
hits[hitUpto - start] = hit;
}

hitUpto++;

if (ref.hitIndex < shardHits[ref.shardIndex].scoreDocs.length) {
// Not done with this these TopDocs yet:
queue.updateTop();
} else {
queue.pop();
}
}
}

TotalHits totalHits = new TotalHits(totalHitCount, totalHitsRelation);
if (sort == null) {
return new TopDocs(totalHits, hits);
} else {
return new TopFieldDocs(totalHits, hits, sort.getSort());
}
}

Retrieving the corresponding document by docId

        fieldsStream.seek(startPointer);
decompressor.decompress(fieldsStream, totalLength, offset, length, bytes);
assert bytes.length == length;
documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length);

Stack trace:

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:576)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:594)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
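
At the user level this is just the stored-fields lookup for each hit. A minimal sketch; the class name is made up, and path is the stored field used by the demo:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class FetchDocsDemo {
  static void printPaths(IndexSearcher searcher, Query query) throws IOException {
    TopDocs hits = searcher.search(query, 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      Document doc = searcher.doc(sd.doc);   // reads the stored fields for this docId
      System.out.println(doc.get("path"));   // "path" is the demo's stored field
    }
  }
}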


mmap loads the file into memory:

Breakpoint hit: "thread=main", org.apache.lucene.store.ByteBufferIndexInput.setCurBuf(), line=86 bci=0
86 this.curBuf = curBuf;

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput.setCurBuf (ByteBufferIndexInput.java:86)
[2] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.<init> (ByteBufferIndexInput.java:556)
[3] org.apache.lucene.store.ByteBufferIndexInput.newInstance (ByteBufferIndexInput.java:63)
[4] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:238)
[5] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[6] org.apache.lucene.index.SegmentInfos.readCommit (SegmentInfos.java:290)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:88)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:798)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Clearly, opening files is implemented in org.apache.lucene.store.MMapDirectory.openInput.
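
A minimal sketch of opening an index this way; the class name and the index path are just examples:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.MMapDirectory;

public class OpenIndexDemo {
  public static void main(String[] args) throws Exception {
    try (MMapDirectory dir = new MMapDirectory(Paths.get("/home/dai/index"));
         DirectoryReader reader = DirectoryReader.open(dir)) {  // triggers MMapDirectory.openInput for segments_N etc.
      System.out.println("maxDoc=" + reader.maxDoc());
    }
  }
}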

The file segments_1 is opened first:

main[1] print name
name = "segments_1"
main[1] list
228
229 /** Creates an IndexInput for the file with the given name. */
230 @Override
231 public IndexInput openInput(String name, IOContext context) throws IOException {
232 => ensureOpen();
233 ensureCanRead(name);
234 Path path = directory.resolve(name);
235 try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
236 final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
237 final boolean useUnmap = getUseUnmap();
main[1]

Example: reading a string field:

  private static void readField(DataInput in, StoredFieldVisitor visitor, FieldInfo info, int bits)
throws IOException {
switch (bits & TYPE_MASK) {
case BYTE_ARR:
int length = in.readVInt();
byte[] data = new byte[length];
in.readBytes(data, 0, length);
visitor.binaryField(info, data);
break;
case STRING:
visitor.stringField(info, in.readString());
break;
case NUMERIC_INT:
visitor.intField(info, in.readZInt());
break;
case NUMERIC_FLOAT:
visitor.floatField(info, readZFloat(in));
break;
case NUMERIC_LONG:
visitor.longField(info, readTLong(in));
break;
case NUMERIC_DOUBLE:
visitor.doubleField(info, readZDouble(in));
break;
default:
throw new AssertionError("Unknown type flag: " + Integer.toHexString(bits));
}
}
main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[3] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[4] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[5] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[6] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

main[1] list
66 }
67
68 @Override
69 public void stringField(FieldInfo fieldInfo, String value) throws IOException {
70 => final FieldType ft = new FieldType(TextField.TYPE_STORED);
71 ft.setStoreTermVectors(fieldInfo.hasVectors());
72 ft.setOmitNorms(fieldInfo.omitsNorms());
73 ft.setIndexOptions(fieldInfo.getIndexOptions());
74 doc.add(
75 new StoredField(
main[1] print value
value = "/home/dai/docs/aaa.txt"
main[1] where
[1] org.apache.lucene.document.DocumentStoredFieldVisitor.stringField (DocumentStoredFieldVisitor.java:70)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[4] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[5] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[6] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[7] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The string that was read is then loaded into the doc object.

This is the core function: the file is read via mmap, seek moves to the computed offset and length, and the bytes are read out and turned into an object.


/**
* Get the serialized representation of the given docID. This docID has to be contained in the
* current block.
*/
SerializedDocument document(int docID) throws IOException {
if (contains(docID) == false) {
throw new IllegalArgumentException();
}

final int index = docID - docBase;
final int offset = Math.toIntExact(offsets[index]);
final int length = Math.toIntExact(offsets[index + 1]) - offset;
final int totalLength = Math.toIntExact(offsets[chunkDocs]);
final int numStoredFields = Math.toIntExact(this.numStoredFields[index]);

final BytesRef bytes;
if (merging) {
bytes = this.bytes;
} else {
bytes = new BytesRef();
}
...
fieldsStream.seek(startPointer); // seek to the computed offset
decompressor.decompress(fieldsStream, totalLength, offset, length, bytes); // decompress the content
assert bytes.length == length;
documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length); // wrap the content in a DataInput
}

return new SerializedDocument(documentInput, length, numStoredFields);
}
}

Fetching the doc

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance(), line=498 bci=0
498 if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {

main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance (Lucene90PostingsReader.java:498)
[2] org.apache.lucene.index.SlowImpactsEnum.advance (SlowImpactsEnum.java:77)
[3] org.apache.lucene.search.ImpactsDISI.advance (ImpactsDISI.java:128)
[4] org.apache.lucene.search.ImpactsDISI.nextDoc (ImpactsDISI.java:133)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:301)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

TermQuery and iteration

Note that ImpactsEnum acts as a docId iterator (it ultimately extends DocIdSetIterator).

1,138      }
1,139
1,140 @Override
1,141 public ImpactsEnum impacts(int flags) throws IOException {
1,142 => assert !eof;
1,143 // if (DEBUG) {
1,144 // System.out.println("BTTR.docs seg=" + segment);
1,145 // }
1,146 currentFrame.decodeMetaData();
1,147 // if (DEBUG) {
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.impacts (SegmentTermsEnum.java:1,142)
[2] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:114)
[3] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Note that PostingsEnum is also a DocIdSetIterator.
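
Both of them expose the DocIdSetIterator contract, which is exactly what the scoring loop in the stacks above drives. A minimal hedged sketch of how such an iterator is consumed (my own illustration of the pattern, not code lifted from Lucene):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public class DisiLoop {
  // Walks any DocIdSetIterator (PostingsEnum, ImpactsEnum, ...) the way DefaultBulkScorer.scoreAll does.
  static void consume(DocIdSetIterator it) throws IOException {
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      // a collector would be called here; advance(target) can skip ahead instead of nextDoc()
      System.out.println("doc=" + doc);
    }
  }
}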

Scoring and top-k


main[1] where
[1] org.apache.lucene.util.PriorityQueue.upHeap (PriorityQueue.java:276)
[2] org.apache.lucene.util.PriorityQueue.add (PriorityQueue.java:161)
[3] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:303)
[4] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[6] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[9] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[11] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


@Override
public boolean lessThan(ShardRef first, ShardRef second) {
assert first != second;
ScoreDoc firstScoreDoc = shardHits[first.shardIndex][first.hitIndex];
ScoreDoc secondScoreDoc = shardHits[second.shardIndex][second.hitIndex];
if (firstScoreDoc.score < secondScoreDoc.score) {
return false;
} else if (firstScoreDoc.score > secondScoreDoc.score) {
return true;
} else {
return tieBreakLessThan(first, firstScoreDoc, second, secondScoreDoc, tieBreakerComparator);
}
}
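
For intuition, here is a minimal plain-Java sketch of the same idea (not Lucene's HitQueue, just an illustration): a size-bounded min-heap keeps the k highest-scoring hits, and the comparator decides which entry sits at the top and gets evicted first.

import java.util.PriorityQueue;

record Hit(int doc, float score) {}

final class TopK {
  // Min-heap ordered by score: the weakest still-competitive hit stays at the top.
  private final PriorityQueue<Hit> pq =
      new PriorityQueue<>((a, b) -> Float.compare(a.score(), b.score()));
  private final int k;

  TopK(int k) { this.k = k; }

  void collect(Hit hit) {
    if (pq.size() < k) {
      pq.add(hit);
    } else if (hit.score() > pq.peek().score()) {
      pq.poll();     // evict the weakest hit
      pq.add(hit);
    }
  }
}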


Compiling and installing Lucene

· 5 min read

The goal is to compile Lucene and get familiar with its code.

Compilation

Because Lucene pins the JDK version, you need to switch to JDK 17; my machine had JDK 18.

Cloning the code

## clone the code
git clone https://github.com/apache/lucene.git

### change into the directory
cd lucene

### build
./gradlew

## if you are behind a firewall, going through a proxy is faster
## specify the proxy host and port
./gradlew -DsocksProxyHost=192.168.1.102 -DsocksProxyPort=1081

Running and testing

### build the demo jar
./gradlew lucene:demo:jar

### run the demo
java -cp /home/ubuntu/lucene-9.1.0/lucene/demo/build/classes/java/main:/home/ubuntu/lucene-9.1.0/lucene/core/build/classes/java/main/ org.apache.lucene.demo.IndexFiles -

On Ubuntu, the commands to switch to JDK 17 are:

### install JDK 17
sudo apt install openjdk-17-jdk
# Configure Java: switch the java binary
sudo update-alternatives --config java

# Configure Java Compiler: switch the javac binary
sudo update-alternatives --config javac


### check after switching: java is now 17
java --version
openjdk 17.0.3 2022-04-19
OpenJDK Runtime Environment (build 17.0.3+7-Ubuntu-0ubuntu0.22.04.1)
OpenJDK 64-Bit Server VM (build 17.0.3+7-Ubuntu-0ubuntu0.22.04.1, mixed mode, sharing)

Errors encountered

gradle-wrapper.jar would not download; skip certificate verification:

wget --no-check-certificate  https://raw.githubusercontent.com/gradle/gradle/v7.3.3/gradle/wrapper/gradle-wrapper.jar

Then put it under {$luceneGitDir}/gradle/wrapper/, where luceneGitDir is the directory you cloned Lucene into.

Related code

      IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setUseCompoundFile(false); // write separate index files instead of a compound file
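
For context, here is a minimal indexing sketch this config would plug into (a hedged example; the paths and field names are made up, not taken from the post):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class MiniIndex {
  public static void main(String[] args) throws Exception {
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/index"))) {
      IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
      iwc.setUseCompoundFile(false);               // keep separate .fdt/.tim/... files, easier to inspect
      try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        Document doc = new Document();
        doc.add(new StringField("path", "/home/ubuntu/doc/hello.txt", Field.Store.YES));
        doc.add(new TextField("contents", "hello lucene", Field.Store.NO));
        writer.addDocument(doc);                   // goes through IndexingChain.processDocument
      }
    }
  }
}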

Writing the header

The corresponding jdb session

main[1] stop in  org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136
Deferring breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136.
It will be set after the class is loaded.
main[1] cont
> Set deferred breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter:136

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.<init>(), line=136 bci=180
136 CodecUtil.writeIndexHeader(

main[1] list
132
133 fieldsStream =
134 directory.createOutput(
135 IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION), context);
136 => CodecUtil.writeIndexHeader(
137 fieldsStream, formatName, VERSION_CURRENT, si.getId(), segmentSuffix);
138 assert CodecUtil.indexHeaderLength(formatName, segmentSuffix)
139 == fieldsStream.getFilePointer();
140
141 indexWriter =
main[1] print formatName
formatName = "Lucene90StoredFieldsFastData"

The corresponding stack

  [1] org.apache.lucene.store.OutputStreamIndexOutput.writeByte (OutputStreamIndexOutput.java:54)
[2] org.apache.lucene.codecs.CodecUtil.writeBEInt (CodecUtil.java:653)
[3] org.apache.lucene.codecs.CodecUtil.writeHeader (CodecUtil.java:82)
[4] org.apache.lucene.codecs.CodecUtil.writeIndexHeader (CodecUtil.java:125)
[5] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.<init> (Lucene90CompressingStoredFieldsWriter.java:128)
[6] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsWriter (Lucene90CompressingStoredFieldsFormat.java:140)
[7] org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsWriter (Lucene90StoredFieldsFormat.java:154)
[8] org.apache.lucene.index.StoredFieldsConsumer.initStoredFieldsWriter (StoredFieldsConsumer.java:49)
[9] org.apache.lucene.index.StoredFieldsConsumer.startDocument (StoredFieldsConsumer.java:56)
[10] org.apache.lucene.index.IndexingChain.startStoredFields (IndexingChain.java:556)
[11] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:587)
[12] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[13] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[14] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[15] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[16] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[17] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[18] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[19] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[20] java.nio.file.Files.walkFileTree (Files.java:2,725)
[21] java.nio.file.Files.walkFileTree (Files.java:2,797)
[22] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[23] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Inverted index

main[1] where
[1] org.apache.lucene.index.TermsHashPerField.initStreamSlices (TermsHashPerField.java:150)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:198)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Writing the content

main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField (Lucene90CompressingStoredFieldsWriter.java:276)
[2] org.apache.lucene.index.StoredFieldsConsumer.writeField (StoredFieldsConsumer.java:65)
[3] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:749)
[4] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[5] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[6] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[7] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[8] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[9] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

Inspecting the .fdt file

hexdump -C _0.fdt
00000000 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f |?.l..Lucene90Sto|
00000010 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74 |redFieldsFastDat|
00000020 61 00 00 00 01 85 88 12 2b 0c 73 6b 95 30 38 76 |a.......+.sk.08v|
00000030 c9 0a 2a 52 29 00 00 0a 00 01 00 1c 02 06 03 07 |..*R)...........|
00000040 07 07 07 07 07 07 07 07 20 00 1a 60 2f 68 6f 6d |........ ..`/hom|
00000050 65 2f 60 75 62 75 6e 74 75 60 2f 64 6f 63 2f 6d |e/`ubuntu`/doc/m|
00000060 60 6f 6e 67 6f 2e 74 60 78 74 00 1a 2f 68 60 6f |`ongo.t`xt../h`o|
00000070 6d 65 2f 75 62 60 75 6e 74 75 2f 64 60 6f 63 2f |me/ub`untu/d`oc/|
00000080 68 65 6c 60 6c 6f 2e 74 78 74 c0 28 93 e8 00 00 |hel`lo.txt.(....|
00000090 00 00 00 00 00 00 c8 75 0a 41 |.......u.A|
0000009a

The .fdt layout

Now let's analyze the .fdt format, where [1-4] means bytes 1 through 4:

  • [1-4]: the first four bytes are the big-endian magic number CODEC_MAGIC = 0x3fd76c17
  • [5-33]: the fifth byte gives the length of the codec-name string and [6-33] is the string itself; hex 1c is decimal 28, matching the 28-character string Lucene90StoredFieldsFastData
  • [34-37]: after the string comes the hard-coded version, written as a big-endian 1
  • [38-53]: 16 bytes holding the unique id of this file
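
A quick way to verify this layout yourself is to read the header directly (a throwaway sketch; the file name is assumed, and the codec-name length is actually a VInt, which for short names fits in the single byte read here):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;

public class FdtHeaderDump {
  public static void main(String[] args) throws Exception {
    try (DataInputStream in = new DataInputStream(new FileInputStream("_0.fdt"))) {
      int magic = in.readInt();                      // bytes [1-4], big-endian
      int nameLen = in.readUnsignedByte();           // byte [5], length of the codec name
      byte[] name = new byte[nameLen];
      in.readFully(name);                            // bytes [6-33], the codec name
      int version = in.readInt();                    // bytes [34-37], big-endian
      byte[] id = new byte[16];
      in.readFully(id);                              // bytes [38-53], segment id
      System.out.printf("magic=0x%x name=%s version=%d%n",
          magic, new String(name, StandardCharsets.UTF_8), version);
    }
  }
}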

Buffer pools

TermsHashPerField holds three buffer pools: intPool, bytePool and termBytePool.

  TermsHashPerField(
int streamCount,
IntBlockPool intPool,
ByteBlockPool bytePool,
ByteBlockPool termBytePool,
Counter bytesUsed,
TermsHashPerField nextPerField,
String fieldName,
IndexOptions indexOptions) {
this.intPool = intPool;
this.bytePool = bytePool;
this.streamCount = streamCount;
this.fieldName = fieldName;
this.nextPerField = nextPerField;
assert indexOptions != IndexOptions.NONE;
this.indexOptions = indexOptions;
PostingsBytesStartArray byteStarts = new PostingsBytesStartArray(this, bytesUsed);
bytesHash = new BytesRefHash(termBytePool, HASH_INIT_SIZE, byteStarts);
}

Generating terms

main[1] where
[1] org.apache.lucene.util.BytesRefHash.add (BytesRefHash.java:247)
[2] org.apache.lucene.index.TermsHashPerField.add (TermsHashPerField.java:193)
[3] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,224)
[4] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[5] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[6] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[7] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[8] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[9] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

FST arc lookup


main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:476)
[2] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[3] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[4] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[5] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


Lucene source code analysis

· 44 min read

Lucene breaks down into two parts:

  • Writing
    Writing means persisting the index to the file system.

  • Querying
    Querying goes through tokenization, scoring/sorting and top-k extraction to obtain the matching doc ids, then looks up the stored content by doc id.

VInt

A VInt is a variable-length integer encoding, written little-endian in 7-bit groups: the high bit of each byte is set to 1 when more bytes follow (so the high bit of the last byte is 0).
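
A minimal sketch of the encoding (mirroring what DataOutput.writeVInt / DataInput.readVInt do; my own illustration, not the Lucene source):

import java.io.ByteArrayOutputStream;

public final class VIntDemo {
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {        // more than 7 significant bits left
      out.write((value & 0x7F) | 0x80);   // emit low 7 bits, set the continuation bit
      value >>>= 7;
    }
    out.write(value);                     // last byte: high bit is 0
  }

  static int readVInt(byte[] bytes) {
    int value = 0;
    for (int i = 0, shift = 0; ; i++, shift += 7) {
      byte b = bytes[i];
      value |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {              // continuation bit clear: done
        return value;
      }
    }
  }

  public static void main(String[] args) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeVInt(out, 300);                  // 300 -> 0xAC 0x02
    System.out.println(readVInt(out.toByteArray()));
  }
}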

Related code

      IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
iwc.setUseCompoundFile(false); // write separate index files instead of a compound file

Starting the debug session

### run the Java code with the debug agent

java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp ./lucene-demo-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/core/build/libs/lucene-core-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/queryparser/build/libs/lucene-queryparser-9.1.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles

### attach jdb to the JVM
jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/

Inspecting the .fdt file

hexdump -C _0.fdt
00000000 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f |?.l..Lucene90Sto|
00000010 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74 |redFieldsFastDat|
00000020 61 00 00 00 01 85 88 12 2b 0c 73 6b 95 30 38 76 |a.......+.sk.08v|
00000030 c9 0a 2a 52 29 00 00 0a 00 01 00 1c 02 06 03 07 |..*R)...........|
00000040 07 07 07 07 07 07 07 07 20 00 1a 60 2f 68 6f 6d |........ ..`/hom|
00000050 65 2f 60 75 62 75 6e 74 75 60 2f 64 6f 63 2f 6d |e/`ubuntu`/doc/m|
00000060 60 6f 6e 67 6f 2e 74 60 78 74 00 1a 2f 68 60 6f |`ongo.t`xt../h`o|
00000070 6d 65 2f 75 62 60 75 6e 74 75 2f 64 60 6f 63 2f |me/ub`untu/d`oc/|
00000080 68 65 6c 60 6c 6f 2e 74 78 74 c0 28 93 e8 00 00 |hel`lo.txt.(....|
00000090 00 00 00 00 00 00 c8 75 0a 41 |.......u.A|
0000009a

writeField

ubuntu@VM-0-3-ubuntu:~$ jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/:/home/ubuntu/lucene-9.1.0/lucene/core/src/java/ 
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
Initializing jdb ...
>
VM Started: No frames on the current call stack

main[1] stop in org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField
Deferring breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField.
It will be set after the class is loaded.
main[1] cont
> Set deferred breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField(), line=276 bci=0
276 ++numStoredFieldsInDoc;

main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField (Lucene90CompressingStoredFieldsWriter.java:276)
[2] org.apache.lucene.index.StoredFieldsConsumer.writeField (StoredFieldsConsumer.java:65)
[3] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:749)
[4] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[5] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[6] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[7] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[8] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[9] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)
main[1] list
272
273 @Override
274 public void writeField(FieldInfo info, IndexableField field) throws IOException {
275
276 => ++numStoredFieldsInDoc;
277
278 int bits = 0;
279 final BytesRef bytes;
280 final String string;
281
main[1] print field
field = "stored,indexed,omitNorms,indexOptions=DOCS<path:/home/ubuntu/doc/mongo.txt>"
main[1] print info
info = "org.apache.lucene.index.FieldInfo@32464a14"

Tokenization and the inverted index

main[1] where
[1] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,138)
[2] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[3] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[4] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[5] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[6] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[7] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[8] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[9] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[10] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[11] java.nio.file.Files.walkFileTree (Files.java:2,725)
[12] java.nio.file.Files.walkFileTree (Files.java:2,797)
[13] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[14] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

How a term is represented

      IntBlockPool intPool,
ByteBlockPool bytePool,
ByteBlockPool termBytePool,

In memory, an inverted-index term is described by the structures below. intPool contains these fields:

  • the two-dimensional array buffers[][]
  • int bufferUpto: the first-level index into buffers[][]; it is typically used as int[] buff = buffers[bufferUpto + offset]
  • int intUpto: the write position within the current head buffer
  • int intOffset: the global offset of the current head buffer (an absolute position is intOffset + intUpto)

So what are the values stored in buffers[xxx][yyy]? This two-dimensional array also stores offsets. Offsets into what?

intPool stores offsets into bytePool and termBytePool.
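
The pools are easier to picture with a simplified model (my own illustration, not Lucene's actual IntBlockPool/ByteBlockPool code): a pool is a growable two-dimensional array, and a single global offset addresses a slot; the ints that the pool hands out are exactly such global offsets.

// Simplified model of a block pool, addressed by one global offset.
final class SimpleBytePool {
  static final int BLOCK_SIZE = 1 << 15;          // Lucene uses 32 KB byte blocks
  private byte[][] buffers = new byte[4][];
  private int bufferUpto = -1;                    // index of the current head buffer
  private int byteUpto = BLOCK_SIZE;              // write position inside the head buffer
  private int byteOffset = -BLOCK_SIZE;           // global offset of the head buffer's first slot

  int append(byte b) {
    if (byteUpto == BLOCK_SIZE) {                 // head buffer full: start a new one
      bufferUpto++;
      if (bufferUpto == buffers.length) {
        buffers = java.util.Arrays.copyOf(buffers, buffers.length * 2);
      }
      buffers[bufferUpto] = new byte[BLOCK_SIZE];
      byteUpto = 0;
      byteOffset += BLOCK_SIZE;
    }
    buffers[bufferUpto][byteUpto] = b;
    return byteOffset + byteUpto++;               // global offset, the kind of value intPool stores
  }

  byte get(int globalOffset) {
    return buffers[globalOffset / BLOCK_SIZE][globalOffset % BLOCK_SIZE];
  }
}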

Writing terms to the .tim file

The terms are written out one by one.

main[1] where 
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlock (Lucene90BlockTreeTermsWriter.java:963)
[2] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks (Lucene90BlockTreeTermsWriter.java:709)
[3] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish (Lucene90BlockTreeTermsWriter.java:1,105)
[4] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write (Lucene90BlockTreeTermsWriter.java:370)
[5] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write (PerFieldPostingsFormat.java:171)
[6] org.apache.lucene.index.FreqProxTermsWriter.flush (FreqProxTermsWriter.java:131)
[7] org.apache.lucene.index.IndexingChain.flush (IndexingChain.java:300)
[8] org.apache.lucene.index.DocumentsWriterPerThread.flush (DocumentsWriterPerThread.java:391)
[9] org.apache.lucene.index.DocumentsWriter.doFlush (DocumentsWriter.java:493)
[10] org.apache.lucene.index.DocumentsWriter.flushAllThreads (DocumentsWriter.java:672)
[11] org.apache.lucene.index.IndexWriter.doFlush (IndexWriter.java:4,014)
[12] org.apache.lucene.index.IndexWriter.flush (IndexWriter.java:3,988)
[13] org.apache.lucene.index.IndexWriter.shutdown (IndexWriter.java:1,321)
[14] org.apache.lucene.index.IndexWriter.close (IndexWriter.java:1,361)
[15] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:166)

Querying

main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector.getLeafCollector (TopScoreDocCollector.java:57)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:759)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[5] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Fetching the term

Read the term from the terms reader.

main[1] print fieldMap.get(field)
fieldMap.get(field) = "BlockTreeTerms(seg=_j terms=18,postings=20,positions=25,docs=2)"
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.terms (Lucene90BlockTreeTermsReader.java:294)
[2] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms (PerFieldPostingsFormat.java:353)
[3] org.apache.lucene.index.CodecReader.terms (CodecReader.java:114)
[4] org.apache.lucene.index.Terms.getTerms (Terms.java:41)
[5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:115)
[6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


Getting the corresponding output via the FST arc

Breakpoint hit: "thread=main", org.apache.lucene.util.fst.FST.findTargetArc(), line=1,412 bci=0
1,412 if (labelToMatch == END_LABEL) {

main[1] where
[1] org.apache.lucene.util.fst.FST.findTargetArc (FST.java:1,412)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
[3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Opening the .tim file

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init> (Lucene90BlockTreeTermsReader.java:135)
[2] org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.fieldsProducer (Lucene90PostingsFormat.java:427)
[3] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init> (PerFieldPostingsFormat.java:329)
[4] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer (PerFieldPostingsFormat.java:391)
[5] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:118)
[6] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

The core function for collecting the top-k results is mergeAux, a helper that merges the top-k hits.

Step completed: "thread=main", org.apache.lucene.search.TopDocs.mergeAux(), line=291 bci=43
291 for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {

main[1] where
[1] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:291)
[2] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[3] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Fetching the stored content for a doc id

Fetch the document by doc id.

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:594)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The seek method positions to the document by offset; inside seek, curBuf is a java.nio.DirectByteBufferR.

525    
526 @Override
527 public void seek(long pos) throws IOException {
528 try {
529 => curBuf.position((int) pos);
530 } catch (IllegalArgumentException e) {
531 if (pos < 0) {
532 throw new IllegalArgumentException("Seeking to negative position: " + this, e);
533 } else {
534 throw new EOFException("seek past EOF: " + this);
main[1] print curBuf
curBuf = "java.nio.DirectByteBufferR[pos=60 lim=154 cap=154]"

main[1] list
168
169 // NOTE: AIOOBE not EOF if you read too much
170 @Override
171 public void readBytes(byte[] b, int offset, int len) {
172 => System.arraycopy(bytes, pos, b, offset, len);
173 pos += len;
174 }
175 }
main[1] where
[1] org.apache.lucene.store.ByteArrayDataInput.readBytes (ByteArrayDataInput.java:172)
[2] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Loading file data through off-heap memory

Breakpoint hit: "thread=main", org.apache.lucene.store.ByteBufferIndexInput.setCurBuf(), line=83 bci=0
83 this.curBuf = curBuf;

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput.setCurBuf (ByteBufferIndexInput.java:83)
[2] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.<init> (ByteBufferIndexInput.java:520)
[3] org.apache.lucene.store.ByteBufferIndexInput.newInstance (ByteBufferIndexInput.java:60)
[4] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:238)
[5] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[6] org.apache.lucene.index.SegmentInfos.readCommit (SegmentInfos.java:297)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:88)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

FileChannel's map

The corresponding Java class

src\java.base\share\classes\sun\nio\ch\FileChannelImpl.java
// Creates a new mapping
private native long map0(int prot, long position, long length, boolean isSync)
throws IOException;

The corresponding native C implementation

src\java.base\unix\native\libnio\ch\FileChannelImpl.c
JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
jint prot, jlong off, jlong len, jboolean map_sync)
{
void *mapAddress = 0;
jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
jint fd = fdval(env, fdo);
int protections = 0;
int flags = 0;

// should never be called with map_sync and prot == PRIVATE
assert((prot != sun_nio_ch_FileChannelImpl_MAP_PV) || !map_sync);

if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
protections = PROT_READ;
flags = MAP_SHARED;
} else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
protections = PROT_WRITE | PROT_READ;
flags = MAP_SHARED;
} else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
protections = PROT_WRITE | PROT_READ;
flags = MAP_PRIVATE;
}

// if MAP_SYNC and MAP_SHARED_VALIDATE are not defined then it is
// best to define them here. This ensures the code compiles on old
// OS releases which do not provide the relevant headers. If run
// on the same machine then it will work if the kernel contains
// the necessary support otherwise mmap should fail with an
// invalid argument error

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

if (map_sync) {
// ensure
// 1) this is Linux on AArch64, x86_64, or PPC64 LE
// 2) the mmap APIs are available at compile time
#if !defined(LINUX) || ! (defined(aarch64) || (defined(amd64) && defined(_LP64)) || defined(ppc64le))
// TODO - implement for solaris/AIX/BSD/WINDOWS and for 32 bit
JNU_ThrowInternalError(env, "should never call map on platform where MAP_SYNC is unimplemented");
return IOS_THROWN;
#else
flags |= MAP_SYNC | MAP_SHARED_VALIDATE;
#endif
}

mapAddress = mmap64(
0, /* Let OS decide location */
len, /* Number of bytes to map */
protections, /* File permissions */
flags, /* Changes are shared */
fd, /* File descriptor of mapped file */
off); /* Offset into file */

if (mapAddress == MAP_FAILED) {
if (map_sync && errno == ENOTSUP) {
JNU_ThrowIOExceptionWithLastError(env, "map with mode MAP_SYNC unsupported");
return IOS_THROWN;
}

if (errno == ENOMEM) {
JNU_ThrowOutOfMemoryError(env, "Map failed");
return IOS_THROWN;
}
return handle(env, -1, "Map failed");
}

return ((jlong) (unsigned long) mapAddress);
}

mmap maps the file so its content can be read from disk

The mapping done under FileChannel.open / FileChannel.map is ultimately a native method (map0); on a Linux system it calls mmap64.
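
A minimal sketch of what MMapDirectory does conceptually (the file path is assumed; the real code wraps the mapped buffer in ByteBufferIndexInput and a ByteBufferGuard):

import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MmapRead {
  public static void main(String[] args) throws Exception {
    try (FileChannel ch = FileChannel.open(Paths.get("/home/ubuntu/index/_j.fdt"), StandardOpenOption.READ)) {
      // map0 -> mmap64 on Linux; the returned buffer reads straight from the page cache
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      buf.position(0);                 // equivalent to ByteBufferIndexInput.seek(0)
      byte[] header = new byte[4];
      buf.get(header);                 // first 4 bytes: the codec magic number
      System.out.printf("0x%02x%02x%02x%02x%n",
          header[0] & 0xFF, header[1] & 0xFF, header[2] & 0xFF, header[3] & 0xFF);
    }
  }
}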

main[1] list
228
229 /** Creates an IndexInput for the file with the given name. */
230 @Override
231 public IndexInput openInput(String name, IOContext context) throws IOException {
232 => ensureOpen();
233 ensureCanRead(name);
234 Path path = directory.resolve(name);
235 try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
236 final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
237 final boolean useUnmap = getUseUnmap();
main[1] print name
name = "_j.fnm"
main[1] where
[1] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:232)
[2] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[3] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:124)
[4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
[5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Reading the mmapped data

Where does the mmapped buffer actually get used?
Much the same way as ordinary file I/O: seek to a position, then read the bytes.

lucene\core\src\java\org\apache\lucene\store\ByteBufferIndexInput.java
@Override
public final void readBytes(byte[] b, int offset, int len) throws IOException {
try {
guard.getBytes(curBuf, b, offset, len);
} catch (
@SuppressWarnings("unused")
BufferUnderflowException e) {
int curAvail = curBuf.remaining();
while (len > curAvail) {
guard.getBytes(curBuf, b, offset, curAvail);
len -= curAvail;
offset += curAvail;
curBufIndex++;
if (curBufIndex >= buffers.length) {
throw new EOFException("read past EOF: " + this);
}
setCurBuf(buffers[curBufIndex]);
curBuf.position(0);
curAvail = curBuf.remaining();
}
guard.getBytes(curBuf, b, offset, len);
} catch (
@SuppressWarnings("unused")
NullPointerException npe) {
throw new AlreadyClosedException("Already closed: " + this);
}
}

Reading data after mmap

main[1] where
[1] jdk.internal.misc.Unsafe.copyMemory (Unsafe.java:782)
[2] java.nio.DirectByteBuffer.get (DirectByteBuffer.java:308)
[3] org.apache.lucene.store.ByteBufferGuard.getBytes (ByteBufferGuard.java:93)
[4] org.apache.lucene.store.ByteBufferIndexInput.readBytes (ByteBufferIndexInput.java:114)
[5] org.apache.lucene.store.BufferedChecksumIndexInput.readBytes (BufferedChecksumIndexInput.java:46)
[6] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
[7] org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic (CodecUtil.java:202)
[8] org.apache.lucene.codecs.CodecUtil.checkHeader (CodecUtil.java:193)
[9] org.apache.lucene.codecs.CodecUtil.checkIndexHeader (CodecUtil.java:253)
[10] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:128)
[11] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
[12] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[13] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[14] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[15] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[16] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[17] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[18] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[19] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

File format overview

The .fnm file

Format reference

An .fnm file is made up of these parts:

  • Header
  • FieldsCount: the number of fields
  • an array of length FieldsCount, where each element contains the fields [FieldName: the field name, FieldNumber: the field's number, FieldBits, DocValuesBits, DocValuesGen, DimensionCount, DimensionNumBytes]
  • Footer

The .fnm file describes the basic information about each field; you can think of it as field metadata.



Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> Header,FieldsCount, <FieldName,FieldNumber, FieldBits,DocValuesBits,DocValuesGen,Attributes,DimensionCount,DimensionNumBytes> ,Footer

Data types:

Header --> IndexHeader
FieldsCount --> VInt
FieldName --> String
FieldBits, IndexOptions, DocValuesBits --> Byte
FieldNumber, DimensionCount, DimensionNumBytes --> VInt
Attributes --> Map<String,String>
DocValuesGen --> Int64
Footer --> CodecFooter
Field Descriptions:
FieldsCount: the number of fields in this file.
FieldName: name of the field as a UTF-8 String.
FieldNumber: the field's number. Note that unlike previous versions of Lucene, the fields are not numbered implicitly by their order in the file, instead explicitly.
FieldBits: a byte containing field options.
The low order bit (0x1) is one for fields that have term vectors stored, and zero for fields without term vectors.
If the second lowest order-bit is set (0x2), norms are omitted for the indexed field.
If the third lowest-order bit is set (0x4), payloads are stored for the indexed field.
IndexOptions: a byte containing index options.
0: not indexed
1: indexed as DOCS_ONLY
2: indexed as DOCS_AND_FREQS
3: indexed as DOCS_AND_FREQS_AND_POSITIONS
4: indexed as DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
DocValuesBits: a byte containing per-document value types. The type recorded as two four-bit integers, with the high-order bits representing norms options, and the low-order bits representing DocValues options. Each four-bit integer can be decoded as such:
0: no DocValues for this field.
1: NumericDocValues. (DocValuesType.NUMERIC)
2: BinaryDocValues. (DocValuesType#BINARY)
3: SortedDocValues. (DocValuesType#SORTED)
DocValuesGen is the generation count of the field's DocValues. If this is -1, there are no DocValues updates to that field. Anything above zero means there are updates stored by DocValuesFormat.
Attributes: a key-value map of codec-private attributes.
PointDimensionCount, PointNumBytes: these are non-zero only if the field is indexed as points, e.g. using LongPoint
VectorDimension: it is non-zero if the field is indexed as vectors.
VectorSimilarityFunction: a byte containing distance function used for similarity calculation.
0: EUCLIDEAN distance. (VectorSimilarityFunction.EUCLIDEAN)
1: DOT_PRODUCT similarity. (VectorSimilarityFunction.DOT_PRODUCT)
2: COSINE similarity. (VectorSimilarityFunction.COSINE)
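
Those bits and bytes come from the options set on each field at index time. A hedged sketch of how the options are expressed through the API (the field name is made up):

import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class FieldOptionsExample {
  public static Field makePathField(String value) {
    FieldType type = new FieldType();
    type.setStored(true);                          // value kept in .fdt, readable via searcher.doc()
    type.setTokenized(false);                      // indexed as a single token
    type.setOmitNorms(true);                       // sets the "norms omitted" bit in .fnm
    type.setIndexOptions(IndexOptions.DOCS);       // recorded as the IndexOptions byte 1 (DOCS_ONLY)
    type.freeze();
    return new Field("path", value, type);
  }
}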

.fdt

File path: lucene\backward-codecs\src\java\org\apache\lucene\backward_codecs\lucene50\Lucene50CompoundFormat.java

I could not find a description of the Lucene 9.0 .fdt format, only the 2.9.4 one, so that older .fdt description will have to do.

main[1] print fieldsStreamFN
fieldsStreamFN = "_j.fdt"
main[1] list
124 numDocs = si.maxDoc();
125
126 final String fieldsStreamFN =
127 IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION);
128 => ChecksumIndexInput metaIn = null;
129 try {
130 // Open the data file
131 fieldsStream = d.openInput(fieldsStreamFN, context);
132 version =
133 CodecUtil.checkIndexHeader(
main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init> (Lucene90CompressingStoredFieldsReader.java:128)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsReader (Lucene90CompressingStoredFieldsFormat.java:133)
[3] org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsReader (Lucene90StoredFieldsFormat.java:136)
[4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:138)
[5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Loading the doc content into a Document object

The overall flow fetches the document's content by doc id.


@Override
public void visitDocument(int docID, StoredFieldVisitor visitor) throws IOException {

final SerializedDocument doc = document(docID); // fetch the serialized doc by docID

for (int fieldIDX = 0; fieldIDX < doc.numStoredFields; fieldIDX++) {
final long infoAndBits = doc.in.readVLong();
final int fieldNumber = (int) (infoAndBits >>> TYPE_BITS);
final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);

final int bits = (int) (infoAndBits & TYPE_MASK);
assert bits <= NUMERIC_DOUBLE : "bits=" + Integer.toHexString(bits);

switch (visitor.needsField(fieldInfo)) {
case YES:
readField(doc.in, visitor, fieldInfo, bits); // read through the input (the fd bound to it), i.e. the mmap64-mapped file; this is where the .fdt file gets read
break;
...
}
}
}
main[1] where
[1] org.apache.lucene.document.Document.add (Document.java:60)
[2] org.apache.lucene.document.DocumentStoredFieldVisitor.stringField (DocumentStoredFieldVisitor.java:74)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
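
At the user level, this whole path is triggered when stored fields are fetched for a hit, roughly like this (a minimal sketch; the field name is assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class FetchStoredFields {
  static void printHits(IndexSearcher searcher, Query query) throws Exception {
    TopDocs results = searcher.search(query, 10);           // scoring + top-k
    for (ScoreDoc sd : results.scoreDocs) {
      Document doc = searcher.doc(sd.doc);                  // triggers the visitDocument/.fdt read shown above
      System.out.println(doc.get("path") + "  score=" + sd.score);
    }
  }
}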

Building a SerializedDocument from a doc id

The entry point is here:

org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document

The document method of Lucene90CompressingStoredFieldsReader:

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:606)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[4] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[5] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[6] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[7] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
  SerializedDocument document(int docID) throws IOException {
if (state.contains(docID) == false) {
fieldsStream.seek(indexReader.getStartPointer(docID)); // seek to the mmap64 offset
state.reset(docID);
}
assert state.contains(docID);
return state.document(docID); // see the concrete implementation below; state is an instance of a static inner class
}

Now look at the static inner class implementation:

    /**
* Get the serialized representation of the given docID. This docID has to be contained in the
* current block.
*/
SerializedDocument document(int docID) throws IOException {
if (contains(docID) == false) {
throw new IllegalArgumentException();
}

final int index = docID - docBase;
final int offset = Math.toIntExact(offsets[index]);
final int length = Math.toIntExact(offsets[index + 1]) - offset;
final int totalLength = Math.toIntExact(offsets[chunkDocs]);
final int numStoredFields = Math.toIntExact(this.numStoredFields[index]);

final BytesRef bytes;
if (merging) {
bytes = this.bytes;
} else {
bytes = new BytesRef();
}

final DataInput documentInput;
if (length == 0) {
...
} else {
fieldsStream.seek(startPointer); // seek to the mmap64 offset to locate the chunk
decompressor.decompress(fieldsStream, totalLength, offset, length, bytes); // decompress the corresponding data
assert bytes.length == length;
documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length); // wrap the data in bytes as a DataInput
}

return new SerializedDocument(documentInput, length, numStoredFields); // build the SerializedDocument
}
}

The following shows the content-loading process in detail:

 pos = 4
main[1] dump bytes
bytes = {
120, 116, 0, 26, 47, 104, 111, 109, 101, 47, 117, 98, 117, 110, 116, 117, 47, 100, 111, 99, 47, 104, 101, 108, 108, 111, 46, 116, 120, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
main[1] print in
in = "MMapIndexInput(path="/home/ubuntu/index/_j.fdt")"
main[1] where
[1] org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress (LZ4WithPresetDictCompressionMode.java:88)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:595)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Loading and processing the term files

 public SegmentTermsEnum(FieldReader fr) throws IOException {
this.fr = fr;

// if (DEBUG) {
// System.out.println("BTTR.init seg=" + fr.parent.segment);
// }
stack = new SegmentTermsEnumFrame[0];

// Used to hold seek by TermState, or cached seek
staticFrame = new SegmentTermsEnumFrame(this, -1);

if (fr.index == null) {
fstReader = null;
} else {
fstReader = fr.index.getBytesReader();
}

// Init w/ root block; don't use index since it may
// not (and need not) have been loaded
for (int arcIdx = 0; arcIdx < arcs.length; arcIdx++) {
arcs[arcIdx] = new FST.Arc<>();
}

currentFrame = staticFrame;
final FST.Arc<BytesRef> arc;
if (fr.index != null) {
arc = fr.index.getFirstArc(arcs[0]);
// Empty string prefix must have an output in the index!
assert arc.isFinal();
} else {
arc = null;
}
// currentFrame = pushFrame(arc, rootCode, 0);
// currentFrame.loadBlock();
validIndexPrefix = 0;
// if (DEBUG) {
// System.out.println("init frame state " + currentFrame.ord);
// printSeekState();
// }

// System.out.println();
// computeBlockStats().print(System.out);
}

Resolving arcs with getArc

  private FST.Arc<BytesRef> getArc(int ord) {
if (ord >= arcs.length) {
@SuppressWarnings({"rawtypes", "unchecked"})
final FST.Arc<BytesRef>[] next =
new FST.Arc[ArrayUtil.oversize(1 + ord, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
System.arraycopy(arcs, 0, next, 0, arcs.length);
for (int arcOrd = arcs.length; arcOrd < next.length; arcOrd++) {
next[arcOrd] = new FST.Arc<>();
}
arcs = next;
}
return arcs[ord];
}
Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc(), line=222 bci=0
222 if (ord >= arcs.length) {

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc (SegmentTermsEnum.java:222)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
[3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Collecting all matching documents

main[1] where
[1] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:300)
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[3] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] list
296 DocIdSetIterator iterator,
297 TwoPhaseIterator twoPhase,
298 Bits acceptDocs)
299 throws IOException {
300 => if (twoPhase == null) {
301 for (int doc = iterator.nextDoc();
302 doc != DocIdSetIterator.NO_MORE_DOCS;
303 doc = iterator.nextDoc()) {
304 if (acceptDocs == null || acceptDocs.get(doc)) {
305 collector.collect(doc);
main[1] print iterator
iterator = "org.apache.lucene.search.ImpactsDISI@6279cee3"

main[1] list
494 @Override
495 public int advance(int target) throws IOException {
496 // current skip docID < docIDs generated from current buffer <= next skip docID
497 // we don't need to skip if target is buffered already
498 => if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
499
500 if (skipper == null) {
501 // Lazy init: first time this enum has ever been used for skipping
502 skipper =
503 new Lucene90SkipReader(
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance (Lucene90PostingsReader.java:498)
[2] org.apache.lucene.index.SlowImpactsEnum.advance (SlowImpactsEnum.java:77)
[3] org.apache.lucene.search.ImpactsDISI.advance (ImpactsDISI.java:135)
[4] org.apache.lucene.search.ImpactsDISI.nextDoc (ImpactsDISI.java:140)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:301)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The class involved in producing the iterator is SegmentTermsEnum.

main[1] where
[1] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
[2] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[3] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] print termsEnum
termsEnum = "org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum@1a84f40f"

The getTermsEnum method gives us the term's statistics and file offsets; SegmentTermsEnum itself does not contain a doc iterator.

main[1] where
[1] org.apache.lucene.index.Term.bytes (Term.java:128)
[2] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
[3] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[4] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


144 final TermsEnum termsEnum = context.reader().terms(term.field()).iterator();
145 => termsEnum.seekExact(term.bytes(), state);
146 return termsEnum;
147 }

Here term.bytes() is our search value, so this is where the term's postings data starts to be read (I haven't traced it all the way yet, so let's take this as a working assumption).

After the postings are read, scoring begins. The scorer has an iterator that can traverse all the doc ids.
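
A hedged sketch of that read path at the API level, for a single segment (the field and term here are example values, and it assumes the field exists and was indexed with frequencies):

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

public class PostingsWalk {
  static void walk(LeafReader reader) throws Exception {
    TermsEnum te = reader.terms("contents").iterator();        // SegmentTermsEnum underneath
    if (te.seekExact(new BytesRef("lucene"))) {                // FST / term-dictionary lookup on term.bytes()
      PostingsEnum pe = te.postings(null);                     // doc-id iterator over the postings
      for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
        System.out.println("doc=" + doc + " freq=" + pe.freq());
      }
    }
  }
}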

main[1] list
348 // (needsFreq=false)
349 private boolean isFreqsRead;
350 private int singletonDocID; // docid when there is a single pulsed posting, otherwise -1
351
352 => public BlockDocsEnum(FieldInfo fieldInfo) throws IOException {
353 this.startDocIn = Lucene90PostingsReader.this.docIn;
354 this.docIn = null;
355 indexHasFreq = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
356 indexHasPos =
357 fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.<init> (Lucene90PostingsReader.java:352)
[2] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.postings (Lucene90PostingsReader.java:258)
[3] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.impacts (Lucene90PostingsReader.java:280)
[4] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.impacts (SegmentTermsEnum.java:1,150)
[5] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:114)
[6] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The top-k collector stack

Breakpoint hit: "thread=main", org.apache.lucene.search.TopDocsCollector.populateResults(), line=64 bci=0
64 for (int i = howMany - 1; i >= 0; i--) {

main[1] where
[1] org.apache.lucene.search.TopDocsCollector.populateResults (TopDocsCollector.java:64)
[2] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:166)
[3] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:98)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:526)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

The search process

main[1] dump collector
collector = {
org.apache.lucene.search.TopScoreDocCollector.docBase: 0
org.apache.lucene.search.TopScoreDocCollector.pqTop: instance of org.apache.lucene.search.ScoreDoc(id=1529)
org.apache.lucene.search.TopScoreDocCollector.hitsThresholdChecker: instance of org.apache.lucene.search.HitsThresholdChecker$LocalHitsThresholdChecker(id=1530)
org.apache.lucene.search.TopScoreDocCollector.minScoreAcc: null
org.apache.lucene.search.TopScoreDocCollector.minCompetitiveScore: 0.0
org.apache.lucene.search.TopScoreDocCollector.$assertionsDisabled: true
org.apache.lucene.search.TopDocsCollector.EMPTY_TOPDOCS: instance of org.apache.lucene.search.TopDocs(id=1531)
org.apache.lucene.search.TopDocsCollector.pq: instance of org.apache.lucene.search.HitQueue(id=1532)
org.apache.lucene.search.TopDocsCollector.totalHits: 0
org.apache.lucene.search.TopDocsCollector.totalHitsRelation: instance of org.apache.lucene.search.TotalHits$Relation(id=1533)
}
main[1] print collector
collector = "org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector@62bd765"

How the hit count is obtained

690      private <C extends Collector, T> T search(
691 Weight weight, CollectorManager<C, T> collectorManager, C firstCollector) throws IOException {
692 if (executor == null || leafSlices.length <= 1) {
693 search(leafContexts, weight, firstCollector);
694 => return collectorManager.reduce(Collections.singletonList(firstCollector));
695 } else {
696 final List<C> collectors = new ArrayList<>(leafSlices.length);
697 collectors.add(firstCollector);
698 final ScoreMode scoreMode = firstCollector.scoreMode();
699 for (int i = 1; i < leafSlices.length; ++i) {
main[1] where
[1] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[3] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[5] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[6] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Tracing upward from org.apache.lucene.search.TopScoreDocCollector.create, we find that the collector already exists by the time org.apache.lucene.search.IndexSearcher.searchAfter runs. So where does the hit count get initialized?

Clearly, search fills in firstCollector's data; where exactly is that assignment made?

 protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
throws IOException {

// TODO: should we make this
// threaded...? the Collector could be sync'd?
// always use single thread:
for (LeafReaderContext ctx : leaves) { // search each subreader
final LeafCollector leafCollector;
try {
leafCollector = collector.getLeafCollector(ctx);
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
// there is no doc of interest in this reader context
// continue with the following leaf
continue;
}
BulkScorer scorer = weight.bulkScorer(ctx); // the scorer obtained here is what ends up driving the totalHits count
if (scorer != null) {
try {
scorer.score(leafCollector, ctx.reader().getLiveDocs());
} catch (
@SuppressWarnings("unused")
CollectionTerminatedException e) {
// collection was terminated prematurely
// continue with the following leaf
}
}
}
}

From the final stack we can confirm that totalHits is updated here: every call increments it by one. It is clearly a counter, and what it counts are the matching hits; so where do those hits come from?

We have to keep tracing up the call chain.

main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:73)
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[3] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[4] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

@Override
public void collect(int doc) throws IOException {
float score = scorer.score(); // compute the score; this is the pluggable Scorer callback

// This collector relies on the fact that scorers produce positive values:
assert score >= 0; // NOTE: false for NaN

totalHits++; // the hit count is incremented here
hitsThresholdChecker.incrementHitCount();

if (minScoreAcc != null && (totalHits & minScoreAcc.modInterval) == 0) {
updateGlobalMinCompetitiveScore(scorer);
}

if (score <= pqTop.score) {
if (totalHitsRelation == TotalHits.Relation.EQUAL_TO) {
// we just reached totalHitsThreshold, we can start setting the min
// competitive score now
updateMinCompetitiveScore(scorer);
}
// Since docs are returned in-order (i.e., increasing doc Id), a document
// with equal score to pqTop.score cannot compete since HitQueue favors
// documents with lower doc Ids. Therefore reject those docs too.
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
updateMinCompetitiveScore(scorer);
}
};
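
The scorer.score() call above ultimately delegates to the Similarity, which is BM25 by default. A rough sketch of the classic BM25 formula with Lucene's default parameters k1 = 1.2 and b = 0.75 (recent Lucene versions drop the constant (k1 + 1) factor, since it does not change the ranking):

// classic BM25; every argument is a per-term / per-document statistic
static float bm25(float freq, float docLen, float avgDocLen, long docCount, long docFreq) {
  float k1 = 1.2f, b = 0.75f;
  float idf = (float) Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  float lengthNorm = k1 * (1 - b + b * docLen / avgDocLen);
  return idf * (freq * (k1 + 1)) / (freq + lengthNorm);
}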

Pushing further up the stack, we find where the scorer is created from the LeafReaderContext.

  /**
* Optional method, to return a {@link BulkScorer} to score the query and send hits to a {@link
* Collector}. Only queries that have a different top-level approach need to override this; the
* default implementation pulls a normal {@link Scorer} and iterates and collects the resulting
* hits which are not marked as deleted.
*
* @param context the {@link org.apache.lucene.index.LeafReaderContext} for which to return the
* {@link Scorer}.
* @return a {@link BulkScorer} which scores documents and passes them to a collector.
* @throws IOException if there is a low-level I/O error
*/
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

Scorer scorer = scorer(context);
if (scorer == null) {
// No docs match
return null;
}

// This impl always scores docs in order, so we can
// ignore scoreDocsInOrder:
return new DefaultBulkScorer(scorer);
}

One level further up: bulkScorer calls back into the abstract scorer method, whose implementation here is org.apache.lucene.search.TermQuery$TermWeight.scorer.

This scorer method uses the incoming context together with the enclosing TermQuery's term to compute the matching hits.

main[1] list
103 assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context))
104 : "The top-reader used to create Weight is not the same as the current reader's top-reader ("
105 + ReaderUtil.getTopLevelContext(context);
106 ;
107 => final TermsEnum termsEnum = getTermsEnum(context);
108 if (termsEnum == null) {
109 return null;
110 }
111 LeafSimScorer scorer =
112 new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores()); // term here is the enclosing TermQuery's term, i.e. this$0.term
main[1] where
[1] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[2] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

After that, advance is called, and the implementation that ends up running is the advance method below. It relies on docTermStartFP, so where does this variable get initialized?

It is actually obtained from termStates; the initialization happens at docTermStartFP = termState.docStartFP;



lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java
@Override
public int advance(int target) throws IOException {
// current skip docID < docIDs generated from current buffer <= next skip docID
// we don't need to skip if target is buffered already
if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {

if (skipper == null) {
// Lazy init: first time this enum has ever been used for skipping
skipper =
new Lucene90SkipReader(
docIn.clone(), MAX_SKIP_LEVELS, indexHasPos, indexHasOffsets, indexHasPayloads);
}

if (!skipped) {
assert skipOffset != -1;
// This is the first time this enum has skipped
// since reset() was called; load the skip data:
skipper.init(docTermStartFP + skipOffset, docTermStartFP, 0, 0, docFreq);
skipped = true;
}

// always plus one to fix the result, since skip position in Lucene90SkipReader
// is a little different from MultiLevelSkipListReader
final int newDocUpto = skipper.skipTo(target) + 1;

if (newDocUpto >= blockUpto) {
// Skipper moved
assert newDocUpto % BLOCK_SIZE == 0 : "got " + newDocUpto;
blockUpto = newDocUpto;

// Force to read next block
docBufferUpto = BLOCK_SIZE;
accum = skipper.getDoc(); // actually, this is just lastSkipEntry
docIn.seek(skipper.getDocPointer()); // now point to the block we want to search
// even if freqBuffer were not read from the previous block, we will mark them as read,
// as we don't need to skip the previous block freqBuffer in refillDocs,
// as we have already positioned docIn where in needs to be.
isFreqsRead = true;
}
// next time we call advance, this is used to
// foresee whether skipper is necessary.
nextSkipDoc = skipper.getNextSkipDoc();
}
if (docBufferUpto == BLOCK_SIZE) {
refillDocs();
}

// Now scan... this is an inlined/pared down version
// of nextDoc():
long doc;
while (true) {
doc = docBuffer[docBufferUpto];

if (doc >= target) {
break;
}
++docBufferUpto;
}

docBufferUpto++;
return this.doc = (int) doc;
}

@Override
public long cost() {
return docFreq;
}
}
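
Before chasing down docTermStartFP, here is a stripped-down picture of what advance(target) does (a toy sketch, not Lucene's code): doc IDs are stored in fixed-size blocks, and a skip entry per block records its last doc ID, so whole blocks can be jumped over before scanning linearly inside the right one.

// toy model: blocks[i] is a sorted block of doc IDs, lastDocOfBlock[i] its last entry
final class SimplePostings {
  final int[][] blocks;
  final int[] lastDocOfBlock;
  int blockIdx = 0, inBlock = 0;

  SimplePostings(int[][] blocks, int[] lastDocOfBlock) {
    this.blocks = blocks;
    this.lastDocOfBlock = lastDocOfBlock;
  }

  int advance(int target) {
    // "skip": jump to the first block whose last doc ID is >= target
    // (Lucene does this with skipper.skipTo(target), docIn.seek(...) and refillDocs())
    while (blockIdx < blocks.length && lastDocOfBlock[blockIdx] < target) {
      blockIdx++;
      inBlock = 0;
    }
    if (blockIdx == blocks.length) return Integer.MAX_VALUE;   // NO_MORE_DOCS
    // linear scan inside the block, like the while(true) loop in the real advance()
    while (blocks[blockIdx][inBlock] < target) {
      inBlock++;
    }
    return blocks[blockIdx][inBlock];
  }
}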

So next, how is termStates initialized? My first guess is that the term is a member field of termStates.

With a breakpoint in place, we eventually land here:

main[1] list
178 }
179
180 @Override
181 public BlockTermState newTermState() {
182 => return new IntBlockTermState();
183 }
184
185 @Override
186 public void close() throws IOException {
187 IOUtils.close(docIn, posIn, payIn);
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.newTermState (Lucene90PostingsReader.java:182)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.<init> (SegmentTermsEnumFrame.java:101)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.<init> (SegmentTermsEnum.java:76)
[4] org.apache.lucene.codecs.lucene90.blocktree.FieldReader.iterator (FieldReader.java:153)
[5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:116)
[6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

This should finally be the core flow for looking up the term. I hope so.

main[1] list
113
114 private static TermsEnum loadTermsEnum(LeafReaderContext ctx, Term term) throws IOException {
115 final Terms terms = Terms.getTerms(ctx.reader(), term.field());
116 final TermsEnum termsEnum = terms.iterator();
117 => if (termsEnum.seekExact(term.bytes())) {
118 return termsEnum;
119 }
120 return null;
121 }
122
main[1] print term.bytes()
term.bytes() = "[61 6d]"
main[1] where
[1] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[2] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
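
For comparison, the same lookup can be driven through the public API directly. A minimal sketch (the index path and the field name "contents" are assumptions; the term bytes are the same [61 6d], i.e. "am", seen in the debugger):

import java.nio.file.Paths;
import org.apache.lucene.index.*;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/home/dai/index")))) {
  Terms terms = MultiTerms.getTerms(reader, "contents");
  TermsEnum te = terms.iterator();
  if (te.seekExact(new BytesRef("am"))) {
    System.out.println("docFreq=" + te.docFreq());          // statistics decoded from the .tim block
    PostingsEnum pe = te.postings(null, PostingsEnum.FREQS);
    for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
      System.out.println("doc=" + doc + " freq=" + pe.freq());
    }
  }
}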

In the very end the call should land here: this is where the term's counts are obtained. The exact spot still needs confirming, but this should be the code path.

  // Target's prefix matches this block's prefix; we
// scan the entries check if the suffix matches.
public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOException {

// if (DEBUG) System.out.println(" scanToTermLeaf: block fp=" + fp + " prefix=" + prefix + "
// nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
// brToString(term));

assert nextEnt != -1;

ste.termExists = true;
subCode = 0;

if (nextEnt == entCount) {
if (exactOnly) {
fillTerm();
}
return SeekStatus.END;
}

assert prefixMatches(target);

// TODO: binary search when all terms have the same length, which is common for ID fields,
// which are also the most sensitive to lookup performance?
// Loop over each entry (term or sub-block) in this block:
do {
nextEnt++;

suffix = suffixLengthsReader.readVInt();

// if (DEBUG) {
// BytesRef suffixBytesRef = new BytesRef();
// suffixBytesRef.bytes = suffixBytes;
// suffixBytesRef.offset = suffixesReader.getPosition();
// suffixBytesRef.length = suffix;
// System.out.println(" cycle: term " + (nextEnt-1) + " (of " + entCount + ") suffix="
// + brToString(suffixBytesRef));
// }

startBytePos = suffixesReader.getPosition();
suffixesReader.skipBytes(suffix);

// Loop over bytes in the suffix, comparing to the target
final int cmp =
Arrays.compareUnsigned(
suffixBytes,
startBytePos,
startBytePos + suffix,
target.bytes,
target.offset + prefix,
target.offset + target.length);

if (cmp < 0) {
// Current entry is still before the target;
// keep scanning
} else if (cmp > 0) {
// Done! Current entry is after target --
// return NOT_FOUND:
fillTerm();

// if (DEBUG) System.out.println(" not found");
return SeekStatus.NOT_FOUND;
} else {
// Exact match!

// This cannot be a sub-block because we
// would have followed the index to this
// sub-block from the start:

assert ste.termExists;
fillTerm();
// if (DEBUG) System.out.println(" found!");
return SeekStatus.FOUND;
}
} while (nextEnt < entCount);

// It is possible (and OK) that terms index pointed us
// at this block, but, we scanned the entire block and
// did not find the term to position to. This happens
// when the target is after the last term in the block
// (but, before the next term in the index). EG
// target could be foozzz, and terms index pointed us
// to the foo* block, but the last term in this block
// was fooz (and, eg, first term in the next block will
// bee fop).
// if (DEBUG) System.out.println(" block end");
if (exactOnly) {
fillTerm();
}

// TODO: not consistent that in the
// not-exact case we don't next() into the next
// frame here
return SeekStatus.END;
}

// Target's prefix matches this block's prefix; we
// scan the entries check if the suffix matches.
public SeekStatus scanToTermNonLeaf(BytesRef target, boolean exactOnly) throws IOException {

// if (DEBUG) System.out.println(" scanToTermNonLeaf: block fp=" + fp + " prefix=" + prefix +
// " nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
// brToString(target));

assert nextEnt != -1;

if (nextEnt == entCount) {
if (exactOnly) {
fillTerm();
ste.termExists = subCode == 0;
}
return SeekStatus.END;
}

assert prefixMatches(target);

// Loop over each entry (term or sub-block) in this block:
while (nextEnt < entCount) {

nextEnt++;

final int code = suffixLengthsReader.readVInt();
suffix = code >>> 1;

// if (DEBUG) {
// BytesRef suffixBytesRef = new BytesRef();
// suffixBytesRef.bytes = suffixBytes;
// suffixBytesRef.offset = suffixesReader.getPosition();
// suffixBytesRef.length = suffix;
// System.out.println(" cycle: " + ((code&1)==1 ? "sub-block" : "term") + " " +
// (nextEnt-1) + " (of " + entCount + ") suffix=" + brToString(suffixBytesRef));
// }

final int termLen = prefix + suffix;
startBytePos = suffixesReader.getPosition();
suffixesReader.skipBytes(suffix);
ste.termExists = (code & 1) == 0;
if (ste.termExists) {
state.termBlockOrd++;
subCode = 0;
} else {
subCode = suffixLengthsReader.readVLong();
lastSubFP = fp - subCode;
}

final int cmp =
Arrays.compareUnsigned(
suffixBytes,
startBytePos,
startBytePos + suffix,
target.bytes,
target.offset + prefix,
target.offset + target.length);

if (cmp < 0) {
// Current entry is still before the target;
// keep scanning
} else if (cmp > 0) {
// Done! Current entry is after target --
// return NOT_FOUND:
fillTerm();

// if (DEBUG) System.out.println(" maybe done exactOnly=" + exactOnly + "
// ste.termExists=" + ste.termExists);

if (!exactOnly && !ste.termExists) {
// System.out.println(" now pushFrame");
// TODO this
// We are on a sub-block, and caller wants
// us to position to the next term after
// the target, so we must recurse into the
// sub-frame(s):
ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, termLen);
ste.currentFrame.loadBlock();
while (ste.currentFrame.next()) {
ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, ste.term.length());
ste.currentFrame.loadBlock(); // <-- the block data is loaded from the input stream here
}
}

// if (DEBUG) System.out.println(" not found");
return SeekStatus.NOT_FOUND;
} else {
// Exact match!

// This cannot be a sub-block because we
// would have followed the index to this
// sub-block from the start:

assert ste.termExists;
fillTerm();
// if (DEBUG) System.out.println(" found!");
return SeekStatus.FOUND;
}
}

// It is possible (and OK) that terms index pointed us
// at this block, but, we scanned the entire block and
// did not find the term to position to. This happens
// when the target is after the last term in the block
// (but, before the next term in the index). EG
// target could be foozzz, and terms index pointed us
// to the foo* block, but the last term in this block
// was fooz (and, eg, first term in the next block will
// bee fop).
// if (DEBUG) System.out.println(" block end");
if (exactOnly) {
fillTerm();
}

// TODO: not consistent that in the
// not-exact case we don't next() into the next
// frame here
return SeekStatus.END;
}
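
Both scan methods above boil down to the same idea: every entry in the block shares prefix leading bytes with the target, so only the per-entry suffix bytes are compared, in order, until an entry compares greater than or equal to the target. A stripped-down sketch of that idea (not Lucene's code):

import java.util.Arrays;

// suffixes are the block's entries with the shared prefix removed, in sorted order
static int scanBlock(byte[][] suffixes, byte[] target, int prefix) {
  byte[] targetSuffix = Arrays.copyOfRange(target, prefix, target.length);
  for (int i = 0; i < suffixes.length; i++) {
    int cmp = Arrays.compareUnsigned(suffixes[i], targetSuffix);
    if (cmp == 0) return i;    // FOUND: exact match
    if (cmp > 0) return -1;    // NOT_FOUND: the block is sorted, so the target cannot appear later
  }
  return -2;                   // END: the target sorts after the last entry of this block
}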

How does termState get deserialized?

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm(), line=194 bci=0
194 final IntBlockTermState termState = (IntBlockTermState) _termState;

main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm (Lucene90PostingsReader.java:194)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


@Override
public void decodeTerm(
DataInput in, FieldInfo fieldInfo, BlockTermState _termState, boolean absolute)
throws IOException {
final IntBlockTermState termState = (IntBlockTermState) _termState;
final boolean fieldHasPositions =
fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
final boolean fieldHasOffsets =
fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
>= 0;
final boolean fieldHasPayloads = fieldInfo.hasPayloads();

if (absolute) {
termState.docStartFP = 0;
termState.posStartFP = 0;
termState.payStartFP = 0;
}

final long l = in.readVLong();
if ((l & 0x01) == 0) {
termState.docStartFP += l >>> 1;
if (termState.docFreq == 1) {
termState.singletonDocID = in.readVInt();
} else {
termState.singletonDocID = -1;
}
} else {
assert absolute == false;
assert termState.singletonDocID != -1;
termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);
}

if (fieldHasPositions) {
termState.posStartFP += in.readVLong();
if (fieldHasOffsets || fieldHasPayloads) {
termState.payStartFP += in.readVLong();
}
if (termState.totalTermFreq > BLOCK_SIZE) {
termState.lastPosBlockOffset = in.readVLong();
} else {
termState.lastPosBlockOffset = -1;
}
}

if (termState.docFreq > BLOCK_SIZE) {
termState.skipOffset = in.readVLong();
} else {
termState.skipOffset = -1;
}
}

In fact, ste holds a reference to the term:

main[2] dump ste.term.ref.bytes
ste.term.ref.bytes = {
97, 109, 0, 0, 0, 0, 0, 0
}
main[2] where
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

ste.in describes the file being read:

 ste.in = {
$assertionsDisabled: true
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_FLOATBUFFER: instance of java.nio.HeapFloatBuffer(id=1473)
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_LONGBUFFER: instance of java.nio.HeapLongBuffer(id=1474)
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_INTBUFFER: instance of java.nio.HeapIntBuffer(id=1475)
org.apache.lucene.store.ByteBufferIndexInput.length: 1993
org.apache.lucene.store.ByteBufferIndexInput.chunkSizeMask: 1073741823
org.apache.lucene.store.ByteBufferIndexInput.chunkSizePower: 30
org.apache.lucene.store.ByteBufferIndexInput.guard: instance of org.apache.lucene.store.ByteBufferGuard(id=1476)
org.apache.lucene.store.ByteBufferIndexInput.buffers: instance of java.nio.ByteBuffer[1] (id=1477)
org.apache.lucene.store.ByteBufferIndexInput.curBufIndex: 0
org.apache.lucene.store.ByteBufferIndexInput.curBuf: instance of java.nio.DirectByteBufferR(id=1479)
org.apache.lucene.store.ByteBufferIndexInput.curLongBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.curIntBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.curFloatBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.isClone: true
org.apache.lucene.store.ByteBufferIndexInput.$assertionsDisabled: true
org.apache.lucene.store.IndexInput.resourceDescription: "MMapIndexInput(path="/home/dai/index/_7.cfs") [slice=_7_Lucene90_0.tim]"
}
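
So the input is an mmap'd slice of the segment's compound file: FSDirectory.open normally picks MMapDirectory on 64-bit platforms, and with the compound file format the logical .tim file lives as a slice inside _7.cfs. A minimal sketch of opening the same index (the path is an assumption):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory dir = FSDirectory.open(Paths.get("/home/dai/index"));   // usually an MMapDirectory
try (DirectoryReader reader = DirectoryReader.open(dir)) {
  System.out.println(dir);                                        // shows the concrete Directory implementation
  System.out.println("leaves: " + reader.leaves().size());
}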

Related reading

  public void nextLeaf() {
// if (DEBUG) System.out.println(" frame.next ord=" + ord + " nextEnt=" + nextEnt + "
// entCount=" + entCount);
assert nextEnt != -1 && nextEnt < entCount
: "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
nextEnt++;
suffix = suffixLengthsReader.readVInt();
startBytePos = suffixesReader.getPosition();
ste.term.setLength(prefix + suffix);
ste.term.grow(ste.term.length());
suffixesReader.readBytes(ste.term.bytes(), prefix, suffix);
ste.termExists = true;
}

public boolean nextNonLeaf() throws IOException {
// if (DEBUG) System.out.println(" stef.next ord=" + ord + " nextEnt=" + nextEnt + " entCount="
// + entCount + " fp=" + suffixesReader.getPosition());
while (true) {
if (nextEnt == entCount) {
assert arc == null || (isFloor && isLastInFloor == false)
: "isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;
loadNextFloorBlock();
if (isLeafBlock) {
nextLeaf();
return false;
} else {
continue;
}
}

assert nextEnt != -1 && nextEnt < entCount
: "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
nextEnt++;
final int code = suffixLengthsReader.readVInt();
suffix = code >>> 1;
startBytePos = suffixesReader.getPosition();
ste.term.setLength(prefix + suffix);
ste.term.grow(ste.term.length());
suffixesReader.readBytes(ste.term.bytes(), prefix, suffix); // is this the key step?
if ((code & 1) == 0) {
// A normal term
ste.termExists = true;
subCode = 0;
state.termBlockOrd++;
return false;
} else {
// A sub-block; make sub-FP absolute:
ste.termExists = false;
subCode = suffixLengthsReader.readVLong();
lastSubFP = fp - subCode;
// if (DEBUG) {
// System.out.println(" lastSubFP=" + lastSubFP);
// }
return true;
}
}
}

It looks like this is where the term's location information within the file is read:

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTermLeaf (SegmentTermsEnumFrame.java:593)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTerm (SegmentTermsEnumFrame.java:530)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:538)
[4] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[5] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[6] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[7] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[9] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[11] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] dump suffixBytes
suffixBytes = {
97, 109, 97, 110, 100, 98, 117, 116, 99, 97, 110, 100, 111, 104, 101, 108, 108, 111, 104, 105, 105, 105, 115, 105, 116, 107, 110, 111, 119, 109, 97, 121, 109, 111, 110, 103, 111, 110, 111, 116, 116, 114, 121, 119, 104, 97, 116, 119, 111, 114, 108, 100, 121, 111, 117, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}

The suffixBytes dumped above are simply all the term suffixes of this block concatenated: am, and, but, can, do, hello, hi, i, is, it, know, may, mongo, not, try, what, world, you (the per-term lengths live in a separate array).

Where the docFreq statistics for the term are read:

main[1] list
451 // postings
452
453 // TODO: if docFreq were bulk decoded we could
454 // just skipN here:
455 => if (statsSingletonRunLength > 0) {
456 state.docFreq = 1;
457 state.totalTermFreq = 1;
458 statsSingletonRunLength--;
459 } else {
460 int token = statsReader.readVInt();
main[1] print statsSingletonRunLength
statsSingletonRunLength = 0
main[1] next
>
Step completed: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData(), line=460 bci=80
460 int token = statsReader.readVInt();

main[1] list
456 state.docFreq = 1;
457 state.totalTermFreq = 1;
458 statsSingletonRunLength--;
459 } else {
460 => int token = statsReader.readVInt();
461 if ((token & 1) == 1) {
462 state.docFreq = 1;
463 state.totalTermFreq = 1;
464 statsSingletonRunLength = token >>> 1;
465 } else {
main[1] print statsReader
statsReader = "org.apache.lucene.store.ByteArrayDataInput@6b67034"
main[1] dump statsReader
statsReader = {
bytes: instance of byte[64] (id=1520)
pos: 0
limit: 16
}
main[1] dump statsReader.bytes
statsReader.bytes = {
4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}

The searched term "am" sits in the following dump of the terms dictionary (.tim) data:

00000000  3f d7 6c 17 12 42 6c 6f  63 6b 54 72 65 65 54 65  |?.l..BlockTreeTe|
00000010 72 6d 73 44 69 63 74 00 00 00 00 fe ea 80 e6 45 |rmsDict........E|
00000020 20 d8 56 64 1b 1b 1b 89 70 fe 67 0a 4c 75 63 65 | .Vd....p.g.Luce|
00000030 6e 65 39 30 5f 30 25 bc 03 61 6d 61 6e 64 62 75 |ne90_0%..amandbu|
00000040 74 63 61 6e 64 6f 68 65 6c 6c 6f 68 69 69 69 73 |tcandohellohiiis|
00000050 69 74 6b 6e 6f 77 6d 61 79 6d 6f 6e 67 6f 6e 6f |itknowmaymongono|
00000060 74 74 72 79 77 68 61 74 77 6f 72 6c 64 79 6f 75 |ttrywhatworldyou|
00000070 24 02 03 03 03 02 05 02 01 02 02 04 03 05 03 03 |$...............|
00000080 04 05 03 10 04 00 09 02 01 04 00 03 02 01 01 02 |................| <---- the stats byte sequence (04 00 09 02 ...) starts here, right after the 0x10 length byte
00000090 01 07 02 02 26 7a 3d 04 01 02 03 01 01 01 01 01 |....&z=.........|
000000a0 05 01 01 01 00 02 04 00 02 01 01 01 01 01 02 01 |................|
000000b0 01 01 02 01 01 01 01 05 01 03 01 05 a4 03 2f 68 |............../h|
000000c0 6f 6d 65 2f 75 62 75 6e 74 75 2f 64 6f 63 2f 68 |ome/ubuntu/doc/h|
000000d0 65 6c 6c 6f 2e 74 78 74 2f 68 6f 6d 65 2f 75 62 |ello.txt/home/ub|
000000e0 75 6e 74 75 2f 64 6f 63 2f 6d 6f 6e 67 6f 2e 74 |untu/doc/mongo.t|
000000f0 78 74 05 1a 01 03 04 82 01 01 03 c0 28 93 e8 00 |xt..........(...|
00000100 00 00 00 00 00 00 00 da 02 a3 a3 |...........|

So where does docFreq get assigned?

 currentFrame.state.docFreq = 2
main[1] list
1,113 assert !eof;
1,114 // if (DEBUG) System.out.println("BTR.docFreq");
1,115 currentFrame.decodeMetaData();
1,116 // if (DEBUG) System.out.println(" return " + currentFrame.state.docFreq);
1,117 => return currentFrame.state.docFreq;
1,118 }
1,119
1,120 @Override
1,121 public long totalTermFreq() throws IOException {
1,122 assert !eof;
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.docFreq (SegmentTermsEnum.java:1,117)
[2] org.apache.lucene.index.TermStates.build (TermStates.java:107)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
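
Putting the pieces together: the stats block starts with the byte 0x04, readVInt returns 4, and its low bit is 0, so in the non-singleton branch of decodeMetaData (an assumption here, since only the singleton branch is listed above) docFreq = 4 >>> 1 = 2, exactly the value the debugger shows. A small sketch of that decoding:

// stats bytes copied from the dump above
byte[] statBytes = {4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2};

// Lucene-style readVInt: 7 data bits per byte, high bit set means "more bytes follow"
int pos = 0;
int b = statBytes[pos++] & 0xFF;
int token = b & 0x7F;
for (int shift = 7; (b & 0x80) != 0; shift += 7) {
  b = statBytes[pos++] & 0xFF;
  token |= (b & 0x7F) << shift;
}

if ((token & 1) == 0) {
  System.out.println("docFreq = " + (token >>> 1));      // 4 >>> 1 == 2 for the term "am"
} else {
  System.out.println("singleton run length = " + (token >>> 1));
}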

The read path:

readByte:110, ByteBufferIndexInput (org.apache.lucene.store)
readVInt:121, DataInput (org.apache.lucene.store)
readVIntBlock:149, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
refillDocs:472, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:538, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:77, SlowImpactsEnum (org.apache.lucene.index)
advance:128, ImpactsDISI (org.apache.lucene.search)
nextDoc:133, ImpactsDISI (org.apache.lucene.search)
scoreAll:301, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)
queryTest:52, QueryTest (com.dinosaur.lucene.demo)

Where the .tim block data gets loaded:

  void loadBlock() throws IOException {

// Clone the IndexInput lazily, so that consumers
// that just pull a TermsEnum to
// seekExact(TermState) don't pay this cost:
ste.initIndexInput();

if (nextEnt != -1) {
// Already loaded
return;
}
// System.out.println("blc=" + blockLoadCount);

ste.in.seek(fp);
int code = ste.in.readVInt();
entCount = code >>> 1;
assert entCount > 0;
isLastInFloor = (code & 1) != 0;

assert arc == null || (isLastInFloor || isFloor)
: "fp=" + fp + " arc=" + arc + " isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;

// TODO: if suffixes were stored in random-access
// array structure, then we could do binary search
// instead of linear scan to find target term; eg
// we could have simple array of offsets

final long startSuffixFP = ste.in.getFilePointer();
// term suffixes:
final long codeL = ste.in.readVLong();
isLeafBlock = (codeL & 0x04) != 0;
final int numSuffixBytes = (int) (codeL >>> 3);
if (suffixBytes.length < numSuffixBytes) {
suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
}
try {
compressionAlg = CompressionAlgorithm.byCode((int) codeL & 0x03);
} catch (IllegalArgumentException e) {
throw new CorruptIndexException(e.getMessage(), ste.in, e);
}
compressionAlg.read(ste.in, suffixBytes, numSuffixBytes);
suffixesReader.reset(suffixBytes, 0, numSuffixBytes);

int numSuffixLengthBytes = ste.in.readVInt();
final boolean allEqual = (numSuffixLengthBytes & 0x01) != 0;
numSuffixLengthBytes >>>= 1;
if (suffixLengthBytes.length < numSuffixLengthBytes) {
suffixLengthBytes = new byte[ArrayUtil.oversize(numSuffixLengthBytes, 1)];
}
if (allEqual) {
Arrays.fill(suffixLengthBytes, 0, numSuffixLengthBytes, ste.in.readByte());
} else {
ste.in.readBytes(suffixLengthBytes, 0, numSuffixLengthBytes);
}
suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes);
totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP;

/*if (DEBUG) {
if (arc == null) {
System.out.println(" loadBlock (next) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
} else {
System.out.println(" loadBlock (seek) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " hasTerms?=" + hasTerms + " isFloor?=" + isFloor + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
}
}*/

// stats
int numBytes = ste.in.readVInt();
if (statBytes.length < numBytes) {
statBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
}
ste.in.readBytes(statBytes, 0, numBytes);
statsReader.reset(statBytes, 0, numBytes);
statsSingletonRunLength = 0;
metaDataUpto = 0;

state.termBlockOrd = 0;
nextEnt = 0;
lastSubFP = -1;

// TODO: we could skip this if !hasTerms; but
// that's rare so won't help much
// metadata
numBytes = ste.in.readVInt();
if (bytes.length < numBytes) {
bytes = new byte[ArrayUtil.oversize(numBytes, 1)];
}
ste.in.readBytes(bytes, 0, numBytes);
bytesReader.reset(bytes, 0, numBytes);

// Sub-blocks of a single floor block are always
// written one after another -- tail recurse:
fpEnd = ste.in.getFilePointer();
// if (DEBUG) {
// System.out.println(" fpEnd=" + fpEnd);
// }
}
We know Lucene splits an index into many files; here we only look at the inverted-index part.

Lucene calls the file that indexes Terms the Terms Index, with the suffix .tip. Postings are stored across .doc, .pos and .pay: .doc records the doc IDs and term frequencies, .pos records position information, and .pay records payload (and offset) data. The Terms Dictionary uses the suffix .tim; it is the bridge between a Term and its postings, storing each Term together with the file pointers into its postings data.

Overall, the Terms Index (.tip) lets you quickly locate the Term you want inside the Terms Dictionary (.tim), along with its postings file pointers and the Term's segment-level statistics.

Postings: in fact postings contain more than just DocIDs (the ordered sequence of document numbers); they also carry term frequencies, the positions of the Term inside each document, and payload data.

So the inverted index involves at least five kinds of files; this post does not cover all of them.
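
To see these files on disk, it is enough to list the index directory (a sketch; the path is an assumption). Note that when the compound file format is in use, most of them are packed into _N.cfs/_N.cfe instead of appearing as separate .tim/.tip/.doc/.pos/.pay files:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

try (Stream<Path> files = Files.list(Paths.get("/home/dai/index"))) {
  files.map(Path::getFileName).sorted().forEach(System.out::println);
}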

Related reading