
Lucene source code analysis


Lucene is divided into two parts:

  • Writing
    Writing means analyzing documents and writing the index to the file system.

  • Querying
    A query goes through tokenization, ranking, and top-k extraction to obtain the matching docids, then uses each docid to look up the corresponding stored content.

VInt

VInt is a variable-length integer encoding, little-endian (low-order bytes first): in each byte, a high bit of 1 means more bytes follow, so the high bit of the last byte is 0. A sketch of the encode/decode loop follows.
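A minimal sketch of that scheme, mirroring what DataOutput.writeVInt / DataInput.readVInt do (the helper names here are illustrative, not Lucene's):

```java
import java.io.ByteArrayOutputStream;

public class VIntDemo {
  // Encode: emit 7 data bits per byte, low-order group first;
  // set the high bit on every byte except the last.
  static byte[] writeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {        // more than 7 significant bits left
      out.write((value & 0x7F) | 0x80);   // 7 bits + continuation flag
      value >>>= 7;
    }
    out.write(value);                     // last byte: high bit is 0
    return out.toByteArray();
  }

  // Decode: accumulate 7 bits per byte until a byte whose high bit is 0.
  static int readVInt(byte[] buf) {
    int value = 0, shift = 0, i = 0;
    byte b;
    do {
      b = buf[i++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }

  public static void main(String[] args) {
    assert readVInt(writeVInt(300)) == 300; // 300 -> [0xAC, 0x02]
  }
}
```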

Relevant code

    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setUseCompoundFile(false); // write separate index files instead of a compound file
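For orientation, here is a minimal end-to-end write/search sketch in the spirit of the demo's IndexFiles/SearchFiles, assuming the Lucene 9.x API (paths, field names, and text are illustrative):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;

public class MiniDemo {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/tmp/index"));

    // Writing: analyze a document and flush it to index files.
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    iwc.setUseCompoundFile(false); // keep .fnm/.fdt/.tim etc. as separate files
    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
      Document doc = new Document();
      doc.add(new StringField("path", "/home/ubuntu/doc/hello.txt", Field.Store.YES));
      doc.add(new TextField("contents", "i am fine", Field.Store.NO));
      writer.addDocument(doc);
    }

    // Querying: term lookup -> scoring -> top-k -> fetch stored fields by docid.
    try (IndexReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      TopDocs top = searcher.search(new TermQuery(new Term("contents", "am")), 10);
      for (ScoreDoc sd : top.scoreDocs) {
        System.out.println(sd.score + " " + searcher.doc(sd.doc).get("path"));
      }
    }
  }
}
```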

Start debugging

### Debugging the Java code

java -agentlib:jdwp=transport=dt_socket,server=y,address=8000 -cp ./lucene-demo-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/core/build/libs/lucene-core-9.1.0-SNAPSHOT.jar:/home/ubuntu/lucene-9.1.0/lucene/queryparser/build/libs/lucene-queryparser-9.1.0-SNAPSHOT.jar org.apache.lucene.demo.SearchFiles

### Attaching jdb to the JVM
jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/

Inspecting the .fdt file

hexdump -C _0.fdt
00000000 3f d7 6c 17 1c 4c 75 63 65 6e 65 39 30 53 74 6f |?.l..Lucene90Sto|
00000010 72 65 64 46 69 65 6c 64 73 46 61 73 74 44 61 74 |redFieldsFastDat|
00000020 61 00 00 00 01 85 88 12 2b 0c 73 6b 95 30 38 76 |a.......+.sk.08v|
00000030 c9 0a 2a 52 29 00 00 0a 00 01 00 1c 02 06 03 07 |..*R)...........|
00000040 07 07 07 07 07 07 07 07 20 00 1a 60 2f 68 6f 6d |........ ..`/hom|
00000050 65 2f 60 75 62 75 6e 74 75 60 2f 64 6f 63 2f 6d |e/`ubuntu`/doc/m|
00000060 60 6f 6e 67 6f 2e 74 60 78 74 00 1a 2f 68 60 6f |`ongo.t`xt../h`o|
00000070 6d 65 2f 75 62 60 75 6e 74 75 2f 64 60 6f 63 2f |me/ub`untu/d`oc/|
00000080 68 65 6c 60 6c 6f 2e 74 78 74 c0 28 93 e8 00 00 |hel`lo.txt.(....|
00000090 00 00 00 00 00 00 c8 75 0a 41 |.......u.A|
0000009a

writeField

ubuntu@VM-0-3-ubuntu:~$ jdb -attach 8000 -sourcepath /home/ubuntu/lucene-9.1.0/lucene/demo/src/java/:/home/ubuntu/lucene-9.1.0/lucene/core/src/java/ 
Set uncaught java.lang.Throwable
Set deferred uncaught java.lang.Throwable
Initializing jdb ...
>
VM Started: No frames on the current call stack

main[1] stop in org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField
Deferring breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField.
It will be set after the class is loaded.
main[1] cont
> Set deferred breakpoint org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField(), line=276 bci=0
276 ++numStoredFieldsInDoc;

main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsWriter.writeField (Lucene90CompressingStoredFieldsWriter.java:276)
[2] org.apache.lucene.index.StoredFieldsConsumer.writeField (StoredFieldsConsumer.java:65)
[3] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:749)
[4] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[5] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[6] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[7] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[8] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[9] org.apache.lucene.index.IndexWriter.addDocument (IndexWriter.java:1,469)
[10] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:271)
[11] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[12] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[13] java.nio.file.Files.walkFileTree (Files.java:2,725)
[14] java.nio.file.Files.walkFileTree (Files.java:2,797)
[15] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[16] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)
main[1] list
272
273 @Override
274 public void writeField(FieldInfo info, IndexableField field) throws IOException {
275
276 => ++numStoredFieldsInDoc;
277
278 int bits = 0;
279 final BytesRef bytes;
280 final String string;
281
main[1] print field
field = "stored,indexed,omitNorms,indexOptions=DOCS<path:/home/ubuntu/doc/mongo.txt>"
main[1] print info
info = "org.apache.lucene.index.FieldInfo@32464a14"

Tokenization and the inverted index

main[1] where
[1] org.apache.lucene.index.IndexingChain$PerField.invert (IndexingChain.java:1,138)
[2] org.apache.lucene.index.IndexingChain.processField (IndexingChain.java:729)
[3] org.apache.lucene.index.IndexingChain.processDocument (IndexingChain.java:620)
[4] org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments (DocumentsWriterPerThread.java:241)
[5] org.apache.lucene.index.DocumentsWriter.updateDocuments (DocumentsWriter.java:432)
[6] org.apache.lucene.index.IndexWriter.updateDocuments (IndexWriter.java:1,531)
[7] org.apache.lucene.index.IndexWriter.updateDocument (IndexWriter.java:1,816)
[8] org.apache.lucene.demo.IndexFiles.indexDoc (IndexFiles.java:277)
[9] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:212)
[10] org.apache.lucene.demo.IndexFiles$1.visitFile (IndexFiles.java:208)
[11] java.nio.file.Files.walkFileTree (Files.java:2,725)
[12] java.nio.file.Files.walkFileTree (Files.java:2,797)
[13] org.apache.lucene.demo.IndexFiles.indexDocs (IndexFiles.java:206)
[14] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:157)

In-memory term representation

    IntBlockPool intPool,
    ByteBlockPool bytePool,
    ByteBlockPool termBytePool,

In memory, an inverted-index term is described by the pools above. intPool contains the following fields:

  • a two-dimensional array buffers[][]
  • int bufferUpto: the first-level index into buffers[][]; it is typically used as int[] buff = buffers[bufferUpto + offset]
  • int intUpto: the overall offset, i.e. the write position counted across all buffers
  • int intOffset: the offset of the head buffer

So what are the values stored in buffers[xxx][yyy]? They are offsets as well. Offsets into what?

intPool records offsets into bytePool and termBytePool, as the toy sketch below illustrates.
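To make that indirection concrete, here is a toy model (these are not Lucene's actual classes; names and sizes are illustrative) of int-pool slots recording write positions into a byte pool:

```java
// Toy model: intPool slots store offsets into bytePool, where the
// variable-length term/postings bytes actually live.
class ToyPools {
  int[][] intBuffers = new int[8][];   // "intPool.buffers"
  int bufferUpto = 0;                  // index of the current head buffer
  int intUpto = 0;                     // next free slot in the head buffer
  byte[] bytePool = new byte[1 << 16]; // flattened stand-in for "bytePool"
  int byteUpto = 0;                    // next free byte in the byte pool

  // Reserve an int slot that records where a term's bytes start.
  int addTerm(byte[] termBytes) {
    if (intBuffers[bufferUpto] == null) {
      intBuffers[bufferUpto] = new int[128];
    }
    intBuffers[bufferUpto][intUpto] = byteUpto; // slot -> offset into bytePool
    System.arraycopy(termBytes, 0, bytePool, byteUpto, termBytes.length);
    byteUpto += termBytes.length;
    return intUpto++;
  }
}
```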

Writing terms to the .tim file

Terms are written one by one:

main[1] where 
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlock (Lucene90BlockTreeTermsWriter.java:963)
[2] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.writeBlocks (Lucene90BlockTreeTermsWriter.java:709)
[3] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter.finish (Lucene90BlockTreeTermsWriter.java:1,105)
[4] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter.write (Lucene90BlockTreeTermsWriter.java:370)
[5] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write (PerFieldPostingsFormat.java:171)
[6] org.apache.lucene.index.FreqProxTermsWriter.flush (FreqProxTermsWriter.java:131)
[7] org.apache.lucene.index.IndexingChain.flush (IndexingChain.java:300)
[8] org.apache.lucene.index.DocumentsWriterPerThread.flush (DocumentsWriterPerThread.java:391)
[9] org.apache.lucene.index.DocumentsWriter.doFlush (DocumentsWriter.java:493)
[10] org.apache.lucene.index.DocumentsWriter.flushAllThreads (DocumentsWriter.java:672)
[11] org.apache.lucene.index.IndexWriter.doFlush (IndexWriter.java:4,014)
[12] org.apache.lucene.index.IndexWriter.flush (IndexWriter.java:3,988)
[13] org.apache.lucene.index.IndexWriter.shutdown (IndexWriter.java:1,321)
[14] org.apache.lucene.index.IndexWriter.close (IndexWriter.java:1,361)
[15] org.apache.lucene.demo.IndexFiles.main (IndexFiles.java:166)

Querying

main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector.getLeafCollector (TopScoreDocCollector.java:57)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:759)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[5] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[7] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[8] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Getting the term

Reading the term from the terms reader:

main[1] print fieldMap.get(field)
fieldMap.get(field) = "BlockTreeTerms(seg=_j terms=18,postings=20,positions=25,docs=2)"
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.terms (Lucene90BlockTreeTermsReader.java:294)
[2] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms (PerFieldPostingsFormat.java:353)
[3] org.apache.lucene.index.CodecReader.terms (CodecReader.java:114)
[4] org.apache.lucene.index.Terms.getTerms (Terms.java:41)
[5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:115)
[6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


Getting the corresponding output through FST arcs

Breakpoint hit: "thread=main", org.apache.lucene.util.fst.FST.findTargetArc(), line=1,412 bci=0
1,412 if (labelToMatch == END_LABEL) {

main[1] where
[1] org.apache.lucene.util.fst.FST.findTargetArc (FST.java:1,412)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
[3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
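findTargetArc walks the FST one arc per input byte, accumulating outputs along the path. As a rough mental model only (this is not Lucene's FST implementation, just a toy map-based trie with outputs):

```java
import java.util.HashMap;
import java.util.Map;

// Toy FST: each node has byte-labeled arcs, each arc carries an output
// fragment, and a term's value is the combination of outputs on the path
// (simplified here to summed long outputs).
class ToyFst {
  static class Node {
    Map<Byte, Node> arcs = new HashMap<>();
    Map<Byte, Long> outputs = new HashMap<>();
    boolean isFinal;
  }

  Node root = new Node();

  Long lookup(byte[] term) {
    Node node = root;
    long output = 0;
    for (byte label : term) {
      Node next = node.arcs.get(label);      // analogue of findTargetArc(label)
      if (next == null) {
        return null;                         // no arc: term absent
      }
      output += node.outputs.getOrDefault(label, 0L);
      node = next;
    }
    return node.isFinal ? output : null;     // analogue of the END_LABEL check
  }
}
```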

Opening the .tim file

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsReader.<init> (Lucene90BlockTreeTermsReader.java:135)
[2] org.apache.lucene.codecs.lucene90.Lucene90PostingsFormat.fieldsProducer (Lucene90PostingsFormat.java:427)
[3] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.<init> (PerFieldPostingsFormat.java:329)
[4] org.apache.lucene.codecs.perfield.PerFieldPostingsFormat.fieldsProducer (PerFieldPostingsFormat.java:391)
[5] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:118)
[6] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

The core function for assembling the top-k results is mergeAux, a helper that merges the top-k hits:

Step completed: "thread=main", org.apache.lucene.search.TopDocs.mergeAux(), line=291 bci=43
291 for (int shardIDX = 0; shardIDX < shardHits.length; shardIDX++) {

main[1] where
[1] org.apache.lucene.search.TopDocs.mergeAux (TopDocs.java:291)
[2] org.apache.lucene.search.TopDocs.merge (TopDocs.java:216)
[3] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:528)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Fetching document content by docid

Getting the document by docid:

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:594)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The seek method locates the document by offset; inside seek, curBuf is a java.nio.DirectByteBufferR:

525    
526 @Override
527 public void seek(long pos) throws IOException {
528 try {
529 => curBuf.position((int) pos);
530 } catch (IllegalArgumentException e) {
531 if (pos < 0) {
532 throw new IllegalArgumentException("Seeking to negative position: " + this, e);
533 } else {
534 throw new EOFException("seek past EOF: " + this);
main[1] print curBuf
curBuf = "java.nio.DirectByteBufferR[pos=60 lim=154 cap=154]"

main[1] list
168
169 // NOTE: AIOOBE not EOF if you read too much
170 @Override
171 public void readBytes(byte[] b, int offset, int len) {
172 => System.arraycopy(bytes, pos, b, offset, len);
173 pos += len;
174 }
175 }
main[1] where
[1] org.apache.lucene.store.ByteArrayDataInput.readBytes (ByteArrayDataInput.java:172)
[2] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Loading file data through off-heap memory

Breakpoint hit: "thread=main", org.apache.lucene.store.ByteBufferIndexInput.setCurBuf(), line=83 bci=0
83 this.curBuf = curBuf;

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput.setCurBuf (ByteBufferIndexInput.java:83)
[2] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.<init> (ByteBufferIndexInput.java:520)
[3] org.apache.lucene.store.ByteBufferIndexInput.newInstance (ByteBufferIndexInput.java:60)
[4] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:238)
[5] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[6] org.apache.lucene.index.SegmentInfos.readCommit (SegmentInfos.java:297)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:88)
[8] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[9] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[11] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[12] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

FileChannel's map

The corresponding Java class:

src\java.base\share\classes\sun\nio\ch\FileChannelImpl.java

// Creates a new mapping
private native long map0(int prot, long position, long length, boolean isSync)
    throws IOException;

The corresponding native C implementation:

src\java.base\unix\native\libnio\ch\FileChannelImpl.c

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
                                     jint prot, jlong off, jlong len, jboolean map_sync)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    // should never be called with map_sync and prot == PRIVATE
    assert((prot != sun_nio_ch_FileChannelImpl_MAP_PV) || !map_sync);

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }

    // if MAP_SYNC and MAP_SHARED_VALIDATE are not defined then it is
    // best to define them here. This ensures the code compiles on old
    // OS releases which do not provide the relevant headers. If run
    // on the same machine then it will work if the kernel contains
    // the necessary support otherwise mmap should fail with an
    // invalid argument error

#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif

    if (map_sync) {
        // ensure
        //  1) this is Linux on AArch64, x86_64, or PPC64 LE
        //  2) the mmap APIs are available at compile time
#if !defined(LINUX) || ! (defined(aarch64) || (defined(amd64) && defined(_LP64)) || defined(ppc64le))
        // TODO - implement for solaris/AIX/BSD/WINDOWS and for 32 bit
        JNU_ThrowInternalError(env, "should never call map on platform where MAP_SYNC is unimplemented");
        return IOS_THROWN;
#else
        flags |= MAP_SYNC | MAP_SHARED_VALIDATE;
#endif
    }

    mapAddress = mmap64(
        0,              /* Let OS decide location */
        len,            /* Number of bytes to map */
        protections,    /* File permissions */
        flags,          /* Changes are shared */
        fd,             /* File descriptor of mapped file */
        off);           /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (map_sync && errno == ENOTSUP) {
            JNU_ThrowIOExceptionWithLastError(env, "map with mode MAP_SYNC unsupported");
            return IOS_THROWN;
        }

        if (errno == ENOMEM) {
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}

Reading file contents from disk via mmap

MMapDirectory.openInput opens a FileChannel and maps it; the mapping bottoms out in the native map0 method shown above, which on Linux is implemented with mmap64.
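A minimal sketch of the JDK path that ends in map0/mmap64, using only standard java.nio calls (the file path is illustrative):

```java
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MapDemo {
  public static void main(String[] args) throws Exception {
    try (FileChannel ch = FileChannel.open(Paths.get("/tmp/index/_0.fdt"),
                                           StandardOpenOption.READ)) {
      // map() calls the native map0, which on Linux performs mmap64 with
      // PROT_READ / MAP_SHARED for a read-only mapping.
      MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
      buf.position(0);            // seek, just like ByteBufferIndexInput does
      byte[] header = new byte[4];
      buf.get(header);            // reads 4 bytes straight from the mapping
    }
  }
}
```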

main[1] list
228
229 /** Creates an IndexInput for the file with the given name. */
230 @Override
231 public IndexInput openInput(String name, IOContext context) throws IOException {
232 => ensureOpen();
233 ensureCanRead(name);
234 Path path = directory.resolve(name);
235 try (FileChannel c = FileChannel.open(path, StandardOpenOption.READ)) {
236 final String resourceDescription = "MMapIndexInput(path=\"" + path.toString() + "\")";
237 final boolean useUnmap = getUseUnmap();
main[1] print name
name = "_j.fnm"
main[1] where
[1] org.apache.lucene.store.MMapDirectory.openInput (MMapDirectory.java:232)
[2] org.apache.lucene.store.Directory.openChecksumInput (Directory.java:152)
[3] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:124)
[4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
[5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Reading the mmapped data

Where does the buffer produced by mmap actually get used?
Much like ordinary file I/O: seek to a position, then read bytes.

lucene\core\src\java\org\apache\lucene\store\ByteBufferIndexInput.java

@Override
public final void readBytes(byte[] b, int offset, int len) throws IOException {
  try {
    guard.getBytes(curBuf, b, offset, len);
  } catch (
      @SuppressWarnings("unused")
      BufferUnderflowException e) {
    int curAvail = curBuf.remaining();
    while (len > curAvail) {
      guard.getBytes(curBuf, b, offset, curAvail);
      len -= curAvail;
      offset += curAvail;
      curBufIndex++;
      if (curBufIndex >= buffers.length) {
        throw new EOFException("read past EOF: " + this);
      }
      setCurBuf(buffers[curBufIndex]);
      curBuf.position(0);
      curAvail = curBuf.remaining();
    }
    guard.getBytes(curBuf, b, offset, len);
  } catch (
      @SuppressWarnings("unused")
      NullPointerException npe) {
    throw new AlreadyClosedException("Already closed: " + this);
  }
}

Reading data after mmap

main[1] where
[1] jdk.internal.misc.Unsafe.copyMemory (Unsafe.java:782)
[2] java.nio.DirectByteBuffer.get (DirectByteBuffer.java:308)
[3] org.apache.lucene.store.ByteBufferGuard.getBytes (ByteBufferGuard.java:93)
[4] org.apache.lucene.store.ByteBufferIndexInput.readBytes (ByteBufferIndexInput.java:114)
[5] org.apache.lucene.store.BufferedChecksumIndexInput.readBytes (BufferedChecksumIndexInput.java:46)
[6] org.apache.lucene.store.DataInput.readString (DataInput.java:265)
[7] org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic (CodecUtil.java:202)
[8] org.apache.lucene.codecs.CodecUtil.checkHeader (CodecUtil.java:193)
[9] org.apache.lucene.codecs.CodecUtil.checkIndexHeader (CodecUtil.java:253)
[10] org.apache.lucene.codecs.lucene90.Lucene90FieldInfosFormat.read (Lucene90FieldInfosFormat.java:128)
[11] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:111)
[12] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[13] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[14] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[15] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[16] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[17] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[18] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[19] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Index file formats

.fnm file

Format reference

The .fnm file consists of these parts:

  • Header
  • FieldsCount: the number of fields
  • An array of length FieldsCount, where each element contains: [FieldName: the field's name, FieldNumber: the field's number, FieldBits, DocValuesBits, DocValuesGen, DimensionCount, DimensionNumBytes]
  • Footer

.fnm describes the basic per-field information; in effect, field metadata. A sketch that dumps this metadata follows the format reference below.



Field names are stored in the field info file, with suffix .fnm.

FieldInfos (.fnm) --> Header,FieldsCount, <FieldName,FieldNumber, FieldBits,DocValuesBits,DocValuesGen,Attributes,DimensionCount,DimensionNumBytes> ,Footer

Data types:

Header --> IndexHeader
FieldsCount --> VInt
FieldName --> String
FieldBits, IndexOptions, DocValuesBits --> Byte
FieldNumber, DimensionCount, DimensionNumBytes --> VInt
Attributes --> Map<String,String>
DocValuesGen --> Int64
Footer --> CodecFooter
Field Descriptions:
FieldsCount: the number of fields in this file.
FieldName: name of the field as a UTF-8 String.
FieldNumber: the field's number. Note that unlike previous versions of Lucene, the fields are not numbered implicitly by their order in the file, instead explicitly.
FieldBits: a byte containing field options.
The low order bit (0x1) is one for fields that have term vectors stored, and zero for fields without term vectors.
If the second lowest order-bit is set (0x2), norms are omitted for the indexed field.
If the third lowest-order bit is set (0x4), payloads are stored for the indexed field.
IndexOptions: a byte containing index options.
0: not indexed
1: indexed as DOCS_ONLY
2: indexed as DOCS_AND_FREQS
3: indexed as DOCS_AND_FREQS_AND_POSITIONS
4: indexed as DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
DocValuesBits: a byte containing per-document value types. The type recorded as two four-bit integers, with the high-order bits representing norms options, and the low-order bits representing DocValues options. Each four-bit integer can be decoded as such:
0: no DocValues for this field.
1: NumericDocValues. (DocValuesType.NUMERIC)
2: BinaryDocValues. (DocValuesType#BINARY)
3: SortedDocValues. (DocValuesType#SORTED)
DocValuesGen is the generation count of the field's DocValues. If this is -1, there are no DocValues updates to that field. Anything above zero means there are updates stored by DocValuesFormat.
Attributes: a key-value map of codec-private attributes.
PointDimensionCount, PointNumBytes: these are non-zero only if the field is indexed as points, e.g. using LongPoint
VectorDimension: it is non-zero if the field is indexed as vectors.
VectorSimilarityFunction: a byte containing distance function used for similarity calculation.
0: EUCLIDEAN distance. (VectorSimilarityFunction.EUCLIDEAN)
1: DOT_PRODUCT similarity. (VectorSimilarityFunction.DOT_PRODUCT)
2: COSINE similarity. (VectorSimilarityFunction.COSINE)
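To see what .fnm carries for a real index, the per-field metadata can be dumped through the reader API. A small sketch, assuming a Lucene 9.x classpath (the index path is illustrative):

```java
import java.nio.file.Paths;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;

public class DumpFieldInfos {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get("/home/ubuntu/index")))) {
      for (LeafReaderContext ctx : reader.leaves()) {
        // FieldInfos is the in-memory form of the .fnm data.
        for (FieldInfo fi : ctx.reader().getFieldInfos()) {
          System.out.println(fi.number + " " + fi.name
              + " indexOptions=" + fi.getIndexOptions()
              + " docValues=" + fi.getDocValuesType());
        }
      }
    }
  }
}
```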

.fdt

File path: lucene\backward-codecs\src\java\org\apache\lucene\backward_codecs\lucene50\Lucene50CompoundFormat.java

I could not find the .fdt format documented for the Lucene90 codec, only the 2.9.4 one, so that older .fdt description is used here as a stand-in.

main[1] print fieldsStreamFN
fieldsStreamFN = "_j.fdt"
main[1] list
124 numDocs = si.maxDoc();
125
126 final String fieldsStreamFN =
127 IndexFileNames.segmentFileName(segment, segmentSuffix, FIELDS_EXTENSION);
128 => ChecksumIndexInput metaIn = null;
129 try {
130 // Open the data file
131 fieldsStream = d.openInput(fieldsStreamFN, context);
132 version =
133 CodecUtil.checkIndexHeader(
main[1] where
[1] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init> (Lucene90CompressingStoredFieldsReader.java:128)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat.fieldsReader (Lucene90CompressingStoredFieldsFormat.java:133)
[3] org.apache.lucene.codecs.lucene90.Lucene90StoredFieldsFormat.fieldsReader (Lucene90StoredFieldsFormat.java:136)
[4] org.apache.lucene.index.SegmentCoreReaders.<init> (SegmentCoreReaders.java:138)
[5] org.apache.lucene.index.SegmentReader.<init> (SegmentReader.java:91)
[6] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:94)
[7] org.apache.lucene.index.StandardDirectoryReader$1.doBody (StandardDirectoryReader.java:77)
[8] org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run (SegmentInfos.java:809)
[9] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:109)
[10] org.apache.lucene.index.StandardDirectoryReader.open (StandardDirectoryReader.java:67)
[11] org.apache.lucene.index.DirectoryReader.open (DirectoryReader.java:60)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:105)

Loading doc contents into a Document object

The overall flow fetches the document content by docid:


@Override
public void visitDocument(int docID, StoredFieldVisitor visitor) throws IOException {

  final SerializedDocument doc = document(docID); // fetch the serialized doc by docID

  for (int fieldIDX = 0; fieldIDX < doc.numStoredFields; fieldIDX++) {
    final long infoAndBits = doc.in.readVLong();
    final int fieldNumber = (int) (infoAndBits >>> TYPE_BITS);
    final FieldInfo fieldInfo = fieldInfos.fieldInfo(fieldNumber);

    final int bits = (int) (infoAndBits & TYPE_MASK);
    assert bits <= NUMERIC_DOUBLE : "bits=" + Integer.toHexString(bits);

    switch (visitor.needsField(fieldInfo)) {
      case YES:
        readField(doc.in, visitor, fieldInfo, bits); // read via the input (the fd bound to it), i.e. the mmap64-mapped file; this is where the .fdt file is read
        break;
      ...
    }
  }
}
main[1] where
[1] org.apache.lucene.document.Document.add (Document.java:60)
[2] org.apache.lucene.document.DocumentStoredFieldVisitor.stringField (DocumentStoredFieldVisitor.java:74)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.readField (Lucene90CompressingStoredFieldsReader.java:246)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:640)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Building a SerializedDocument from a docid

The entry point is here:

org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document

The document method of Lucene90CompressingStoredFieldsReader:

main[1] where
[1] org.apache.lucene.store.ByteBufferIndexInput$SingleBufferImpl.seek (ByteBufferIndexInput.java:529)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:606)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[4] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[5] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[6] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[7] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
SerializedDocument document(int docID) throws IOException {
  if (state.contains(docID) == false) {
    fieldsStream.seek(indexReader.getStartPointer(docID)); // seek by offset into the mmap64-mapped file
    state.reset(docID);
  }
  assert state.contains(docID);
  return state.document(docID); // see the concrete implementation below; state is an instance of a static inner class
}

Now the implementation in that static inner class:

/**
 * Get the serialized representation of the given docID. This docID has to be contained in the
 * current block.
 */
SerializedDocument document(int docID) throws IOException {
  if (contains(docID) == false) {
    throw new IllegalArgumentException();
  }

  final int index = docID - docBase;
  final int offset = Math.toIntExact(offsets[index]);
  final int length = Math.toIntExact(offsets[index + 1]) - offset;
  final int totalLength = Math.toIntExact(offsets[chunkDocs]);
  final int numStoredFields = Math.toIntExact(this.numStoredFields[index]);

  final BytesRef bytes;
  if (merging) {
    bytes = this.bytes;
  } else {
    bytes = new BytesRef();
  }

  final DataInput documentInput;
  if (length == 0) {
    ...
  } else {
    fieldsStream.seek(startPointer); // seek to the chunk's mmap64 offset in the file
    decompressor.decompress(fieldsStream, totalLength, offset, length, bytes); // decompress the chunk data
    assert bytes.length == length;
    documentInput = new ByteArrayDataInput(bytes.bytes, bytes.offset, bytes.length); // wrap the decompressed bytes
  }

  return new SerializedDocument(documentInput, length, numStoredFields); // build the SerializedDocument
}
}

The loading process in detail:

 pos = 4
main[1] dump bytes
bytes = {
120, 116, 0, 26, 47, 104, 111, 109, 101, 47, 117, 98, 117, 110, 116, 117, 47, 100, 111, 99, 47, 104, 101, 108, 108, 111, 46, 116, 120, 116, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
main[1] print in
in = "MMapIndexInput(path="/home/ubuntu/index/_j.fdt")"
main[1] where
[1] org.apache.lucene.codecs.lucene90.LZ4WithPresetDictCompressionMode$LZ4WithPresetDictDecompressor.decompress (LZ4WithPresetDictCompressionMode.java:88)
[2] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader$BlockState.document (Lucene90CompressingStoredFieldsReader.java:595)
[3] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.document (Lucene90CompressingStoredFieldsReader.java:610)
[4] org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.visitDocument (Lucene90CompressingStoredFieldsReader.java:628)
[5] org.apache.lucene.index.CodecReader.document (CodecReader.java:89)
[6] org.apache.lucene.index.BaseCompositeReader.document (BaseCompositeReader.java:154)
[7] org.apache.lucene.index.IndexReader.document (IndexReader.java:380)
[8] org.apache.lucene.search.IndexSearcher.doc (IndexSearcher.java:380)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:214)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Loading and processing the term files

public SegmentTermsEnum(FieldReader fr) throws IOException {
  this.fr = fr;

  // if (DEBUG) {
  //   System.out.println("BTTR.init seg=" + fr.parent.segment);
  // }
  stack = new SegmentTermsEnumFrame[0];

  // Used to hold seek by TermState, or cached seek
  staticFrame = new SegmentTermsEnumFrame(this, -1);

  if (fr.index == null) {
    fstReader = null;
  } else {
    fstReader = fr.index.getBytesReader();
  }

  // Init w/ root block; don't use index since it may
  // not (and need not) have been loaded
  for (int arcIdx = 0; arcIdx < arcs.length; arcIdx++) {
    arcs[arcIdx] = new FST.Arc<>();
  }

  currentFrame = staticFrame;
  final FST.Arc<BytesRef> arc;
  if (fr.index != null) {
    arc = fr.index.getFirstArc(arcs[0]);
    // Empty string prefix must have an output in the index!
    assert arc.isFinal();
  } else {
    arc = null;
  }
  // currentFrame = pushFrame(arc, rootCode, 0);
  // currentFrame.loadBlock();
  validIndexPrefix = 0;
  // if (DEBUG) {
  //   System.out.println("init frame state " + currentFrame.ord);
  //   printSeekState();
  // }

  // System.out.println();
  // computeBlockStats().print(System.out);
}

Resolving arcs with getArc

private FST.Arc<BytesRef> getArc(int ord) {
  if (ord >= arcs.length) {
    @SuppressWarnings({"rawtypes", "unchecked"})
    final FST.Arc<BytesRef>[] next =
        new FST.Arc[ArrayUtil.oversize(1 + ord, RamUsageEstimator.NUM_BYTES_OBJECT_REF)];
    System.arraycopy(arcs, 0, next, 0, arcs.length);
    for (int arcOrd = arcs.length; arcOrd < next.length; arcOrd++) {
      next[arcOrd] = new FST.Arc<>();
    }
    arcs = next;
  }
  return arcs[ord];
}
Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc(), line=222 bci=0
222 if (ord >= arcs.length) {

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.getArc (SegmentTermsEnum.java:222)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:511)
[3] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Scoring all the matching documents

main[1] where
[1] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:300)
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[3] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] list
296 DocIdSetIterator iterator,
297 TwoPhaseIterator twoPhase,
298 Bits acceptDocs)
299 throws IOException {
300 => if (twoPhase == null) {
301 for (int doc = iterator.nextDoc();
302 doc != DocIdSetIterator.NO_MORE_DOCS;
303 doc = iterator.nextDoc()) {
304 if (acceptDocs == null || acceptDocs.get(doc)) {
305 collector.collect(doc);
main[1] print iterator
iterator = "org.apache.lucene.search.ImpactsDISI@6279cee3"

main[1] list
494 @Override
495 public int advance(int target) throws IOException {
496 // current skip docID < docIDs generated from current buffer <= next skip docID
497 // we don't need to skip if target is buffered already
498 => if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {
499
500 if (skipper == null) {
501 // Lazy init: first time this enum has ever been used for skipping
502 skipper =
503 new Lucene90SkipReader(
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.advance (Lucene90PostingsReader.java:498)
[2] org.apache.lucene.index.SlowImpactsEnum.advance (SlowImpactsEnum.java:77)
[3] org.apache.lucene.search.ImpactsDISI.advance (ImpactsDISI.java:135)
[4] org.apache.lucene.search.ImpactsDISI.nextDoc (ImpactsDISI.java:140)
[5] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:301)
[6] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[7] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[11] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[12] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[13] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[14] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The class behind the iterator is SegmentTermsEnum:

main[1] where
[1] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
[2] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[3] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[7] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[9] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[10] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] print termsEnum
termsEnum = "org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum@1a84f40f"

The getTermsEnum method obtains the term's statistics and file-position offsets; SegmentTermsEnum itself does not contain a doc iterator.

main[1] where
[1] org.apache.lucene.index.Term.bytes (Term.java:128)
[2] org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum (TermQuery.java:145)
[3] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[4] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


144 final TermsEnum termsEnum = context.reader().terms(term.field()).iterator();
145 => termsEnum.seekExact(term.bytes(), state);
146 return termsEnum;
147 }

Here term.bytes() is the value we searched for, so this is where the term's postings start being read (not fully verified yet, tentatively assumed).

Once the postings have been read, scoring begins. The scorer exposes an iterator that can walk all matching doc ids; the sketch below shows the consumption pattern.
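As a hedged sketch (this mirrors the scoreAll loop seen in the stack earlier; the method wrapper is mine, not Lucene's):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

class ScoreLeaf {
  // Mirrors Weight$DefaultBulkScorer.scoreAll: walk the postings iterator
  // and hand every matching doc to the collector (deletions omitted here).
  static void scoreLeaf(Weight weight, LeafReaderContext ctx, LeafCollector collector)
      throws IOException {
    Scorer scorer = weight.scorer(ctx);   // per-segment scorer over the postings
    if (scorer == null) {
      return;                             // no matching docs in this segment
    }
    collector.setScorer(scorer);          // collector will call scorer.score()
    DocIdSetIterator it = scorer.iterator();
    for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      collector.collect(doc);             // e.g. SimpleTopScoreDocCollector.collect
    }
  }
}
```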

main[1] list
348 // (needsFreq=false)
349 private boolean isFreqsRead;
350 private int singletonDocID; // docid when there is a single pulsed posting, otherwise -1
351
352 => public BlockDocsEnum(FieldInfo fieldInfo) throws IOException {
353 this.startDocIn = Lucene90PostingsReader.this.docIn;
354 this.docIn = null;
355 indexHasFreq = fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS) >= 0;
356 indexHasPos =
357 fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockDocsEnum.<init> (Lucene90PostingsReader.java:352)
[2] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.postings (Lucene90PostingsReader.java:258)
[3] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.impacts (Lucene90PostingsReader.java:280)
[4] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.impacts (SegmentTermsEnum.java:1,150)
[5] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:114)
[6] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Stack of the top-k collector

Breakpoint hit: "thread=main", org.apache.lucene.search.TopDocsCollector.populateResults(), line=64 bci=0
64 for (int i = howMany - 1; i >= 0; i--) {

main[1] where
[1] org.apache.lucene.search.TopDocsCollector.populateResults (TopDocsCollector.java:64)
[2] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:166)
[3] org.apache.lucene.search.TopDocsCollector.topDocs (TopDocsCollector.java:98)
[4] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:526)
[5] org.apache.lucene.search.IndexSearcher$2.reduce (IndexSearcher.java:505)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

The search process

main[1] dump collector
collector = {
org.apache.lucene.search.TopScoreDocCollector.docBase: 0
org.apache.lucene.search.TopScoreDocCollector.pqTop: instance of org.apache.lucene.search.ScoreDoc(id=1529)
org.apache.lucene.search.TopScoreDocCollector.hitsThresholdChecker: instance of org.apache.lucene.search.HitsThresholdChecker$LocalHitsThresholdChecker(id=1530)
org.apache.lucene.search.TopScoreDocCollector.minScoreAcc: null
org.apache.lucene.search.TopScoreDocCollector.minCompetitiveScore: 0.0
org.apache.lucene.search.TopScoreDocCollector.$assertionsDisabled: true
org.apache.lucene.search.TopDocsCollector.EMPTY_TOPDOCS: instance of org.apache.lucene.search.TopDocs(id=1531)
org.apache.lucene.search.TopDocsCollector.pq: instance of org.apache.lucene.search.HitQueue(id=1532)
org.apache.lucene.search.TopDocsCollector.totalHits: 0
org.apache.lucene.search.TopDocsCollector.totalHitsRelation: instance of org.apache.lucene.search.TotalHits$Relation(id=1533)
}
main[1] print collector
collector = "org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector@62bd765"

How the hit count is obtained

690      private <C extends Collector, T> T search(
691 Weight weight, CollectorManager<C, T> collectorManager, C firstCollector) throws IOException {
692 if (executor == null || leafSlices.length <= 1) {
693 search(leafContexts, weight, firstCollector);
694 => return collectorManager.reduce(Collections.singletonList(firstCollector));
695 } else {
696 final List<C> collectors = new ArrayList<>(leafSlices.length);
697 collectors.add(firstCollector);
698 final ScoreMode scoreMode = firstCollector.scoreMode();
699 for (int i = 1; i < leafSlices.length; ++i) {
main[1] where
[1] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:694)
[2] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[3] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[5] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[6] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

Starting from org.apache.lucene.search.TopScoreDocCollector.create and tracing upward, the collector already exists by org.apache.lucene.search.IndexSearcher.searchAfter. So where does the hit count get initialized?

Clearly, search fills in firstCollector's data; where is the value assigned?

protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
    throws IOException {

  // TODO: should we make this
  // threaded...?  the Collector could be sync'd?
  // always use single thread:
  for (LeafReaderContext ctx : leaves) { // search each subreader
    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(ctx);
    } catch (
        @SuppressWarnings("unused")
        CollectionTerminatedException e) {
      // there is no doc of interest in this reader context
      // continue with the following leaf
      continue;
    }
    BulkScorer scorer = weight.bulkScorer(ctx); // this is where the total hits are gathered
    if (scorer != null) {
      try {
        scorer.score(leafCollector, ctx.reader().getLiveDocs());
      } catch (
          @SuppressWarnings("unused")
          CollectionTerminatedException e) {
        // collection was terminated prematurely
        // continue with the following leaf
      }
    }
  }
}

From this last stack we can confirm that totalHits is assigned here: every call increments it by one, so it is plainly a counter, and what it counts is the matching hits. So where do the matches themselves come from?

We can only keep tracing up the stack.

main[1] where
[1] org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1.collect (TopScoreDocCollector.java:73)
[2] org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll (Weight.java:305)
[3] org.apache.lucene.search.Weight$DefaultBulkScorer.score (Weight.java:247)
[4] org.apache.lucene.search.BulkScorer.score (BulkScorer.java:38)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:770)
[6] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

@Override
public void collect(int doc) throws IOException {
  float score = scorer.score(); // compute the score via the pluggable score() callback

  // This collector relies on the fact that scorers produce positive values:
  assert score >= 0; // NOTE: false for NaN

  totalHits++; // the hit count is incremented here
  hitsThresholdChecker.incrementHitCount();

  if (minScoreAcc != null && (totalHits & minScoreAcc.modInterval) == 0) {
    updateGlobalMinCompetitiveScore(scorer);
  }

  if (score <= pqTop.score) {
    if (totalHitsRelation == TotalHits.Relation.EQUAL_TO) {
      // we just reached totalHitsThreshold, we can start setting the min
      // competitive score now
      updateMinCompetitiveScore(scorer);
    }
    // Since docs are returned in-order (i.e., increasing doc Id), a document
    // with equal score to pqTop.score cannot compete since HitQueue favors
    // documents with lower doc Ids. Therefore reject those docs too.
    return;
  }
  pqTop.doc = doc + docBase;
  pqTop.score = score;
  pqTop = pq.updateTop();
  updateMinCompetitiveScore(scorer);
}
};
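Stripped of Lucene specifics, the bookkeeping above is a bounded min-heap whose root is pqTop, the weakest competitive hit so far. A plain-Java analogue (illustrative only, not Lucene's HitQueue):

```java
import java.util.PriorityQueue;

// Keep the k highest-scoring hits in a min-heap whose root is the weakest
// hit collected so far (the analogue of pqTop in TopScoreDocCollector).
class TopK {
  static final class Hit {
    final int doc;
    final float score;
    Hit(int doc, float score) { this.doc = doc; this.score = score; }
  }

  final int k;
  final PriorityQueue<Hit> pq = new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
  int totalHits = 0;

  TopK(int k) { this.k = k; }

  void collect(int doc, float score) {
    totalHits++;                          // every match bumps the hit count
    if (pq.size() < k) {
      pq.add(new Hit(doc, score));        // heap not full yet: always keep
    } else if (score > pq.peek().score) { // beats the current weakest (pqTop)
      pq.poll();
      pq.add(new Hit(doc, score));
    }                                     // otherwise: not competitive, drop
  }
}
```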

Pushing further up the stack, we find that the scorer is created from the context:

/**
 * Optional method, to return a {@link BulkScorer} to score the query and send hits to a {@link
 * Collector}. Only queries that have a different top-level approach need to override this; the
 * default implementation pulls a normal {@link Scorer} and iterates and collects the resulting
 * hits which are not marked as deleted.
 *
 * @param context the {@link org.apache.lucene.index.LeafReaderContext} for which to return the
 *     {@link Scorer}.
 * @return a {@link BulkScorer} which scores documents and passes them to a collector.
 * @throws IOException if there is a low-level I/O error
 */
public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {

  Scorer scorer = scorer(context);
  if (scorer == null) {
    // No docs match
    return null;
  }

  // This impl always scores docs in order, so we can
  // ignore scoreDocsInOrder:
  return new DefaultBulkScorer(scorer);
}

Looking one level further up: bulkScorer calls back into a scorer method, and the implementation of that abstract method is org.apache.lucene.search.TermQuery$TermWeight.scorer.

This scorer method computes the hits from the context argument together with the enclosing TermQuery's term.

main[1] list
103 assert termStates == null || termStates.wasBuiltFor(ReaderUtil.getTopLevelContext(context))
104 : "The top-reader used to create Weight is not the same as the current reader's top-reader ("
105 + ReaderUtil.getTopLevelContext(context);
106 ;
107 => final TermsEnum termsEnum = getTermsEnum(context);
108 if (termsEnum == null) {
109 return null;
110 }
111 LeafSimScorer scorer =
112 new LeafSimScorer(simScorer, context.reader(), term.field(), scoreMode.needsScores()); // term here is the enclosing class's term, i.e. this$0.term
main[1] where
[1] org.apache.lucene.search.TermQuery$TermWeight.scorer (TermQuery.java:107)
[2] org.apache.lucene.search.Weight.bulkScorer (Weight.java:166)
[3] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:767)
[4] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:693)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:687)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

So advance is called later; the advance method below is what ultimately runs. It uses docTermStartFP, so where is that initialized?

It actually comes from the termStates: the initialization is docTermStartFP = termState.docStartFP;



lucene\core\src\java\org\apache\lucene\codecs\lucene90\Lucene90PostingsReader.java
@Override
public int advance(int target) throws IOException {
  // current skip docID < docIDs generated from current buffer <= next skip docID
  // we don't need to skip if target is buffered already
  if (docFreq > BLOCK_SIZE && target > nextSkipDoc) {

    if (skipper == null) {
      // Lazy init: first time this enum has ever been used for skipping
      skipper =
          new Lucene90SkipReader(
              docIn.clone(), MAX_SKIP_LEVELS, indexHasPos, indexHasOffsets, indexHasPayloads);
    }

    if (!skipped) {
      assert skipOffset != -1;
      // This is the first time this enum has skipped
      // since reset() was called; load the skip data:
      skipper.init(docTermStartFP + skipOffset, docTermStartFP, 0, 0, docFreq);
      skipped = true;
    }

    // always plus one to fix the result, since skip position in Lucene90SkipReader
    // is a little different from MultiLevelSkipListReader
    final int newDocUpto = skipper.skipTo(target) + 1;

    if (newDocUpto >= blockUpto) {
      // Skipper moved
      assert newDocUpto % BLOCK_SIZE == 0 : "got " + newDocUpto;
      blockUpto = newDocUpto;

      // Force to read next block
      docBufferUpto = BLOCK_SIZE;
      accum = skipper.getDoc(); // actually, this is just lastSkipEntry
      docIn.seek(skipper.getDocPointer()); // now point to the block we want to search
      // even if freqBuffer were not read from the previous block, we will mark them as read,
      // as we don't need to skip the previous block freqBuffer in refillDocs,
      // as we have already positioned docIn where in needs to be.
      isFreqsRead = true;
    }
    // next time we call advance, this is used to
    // foresee whether skipper is necessary.
    nextSkipDoc = skipper.getNextSkipDoc();
  }
  if (docBufferUpto == BLOCK_SIZE) {
    refillDocs();
  }

  // Now scan... this is an inlined/pared down version
  // of nextDoc():
  long doc;
  while (true) {
    doc = docBuffer[docBufferUpto];

    if (doc >= target) {
      break;
    }
    ++docBufferUpto;
  }

  docBufferUpto++;
  return this.doc = (int) doc;
}

@Override
public long cost() {
  return docFreq;
}
}
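Conceptually, advance(target) is a skip-then-scan: coarse jumps via the skip list, then a linear scan inside the current block. A simplified in-memory analogue (illustrative only, not Lucene's multi-level skip list):

```java
// Simplified analogue of BlockDocsEnum.advance: first coarse-skip whole
// blocks (the skip-list role), then linearly scan inside the current block.
class ToyPostings {
  final int[] docs;          // sorted doc ids
  final int blockSize = 128; // same spirit as BLOCK_SIZE
  int upto = 0;              // current position

  ToyPostings(int[] docs) { this.docs = docs; }

  int advance(int target) {
    // "skipper": jump block by block while the current block's last doc < target.
    while (upto + blockSize < docs.length && docs[upto + blockSize - 1] < target) {
      upto += blockSize;
    }
    // inlined nextDoc(): scan within the block.
    while (upto < docs.length && docs[upto] < target) {
      upto++;
    }
    return upto < docs.length ? docs[upto] : Integer.MAX_VALUE; // NO_MORE_DOCS
  }
}
```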

Next: how is termStates initialized? My first guess is that term is a member field of termStates.

With breakpoints, we finally land here:

main[1] list
178 }
179
180 @Override
181 public BlockTermState newTermState() {
182 => return new IntBlockTermState();
183 }
184
185 @Override
186 public void close() throws IOException {
187 IOUtils.close(docIn, posIn, payIn);
main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.newTermState (Lucene90PostingsReader.java:182)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.<init> (SegmentTermsEnumFrame.java:101)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.<init> (SegmentTermsEnum.java:76)
[4] org.apache.lucene.codecs.lucene90.blocktree.FieldReader.iterator (FieldReader.java:153)
[5] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:116)
[6] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[7] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[8] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[10] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[11] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[12] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[13] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1]

This should be the very core of the term lookup flow (I hope so):

main[1] list
113
114 private static TermsEnum loadTermsEnum(LeafReaderContext ctx, Term term) throws IOException {
115 final Terms terms = Terms.getTerms(ctx.reader(), term.field());
116 final TermsEnum termsEnum = terms.iterator();
117 => if (termsEnum.seekExact(term.bytes())) {
118 return termsEnum;
119 }
120 return null;
121 }
122
main[1] print term.bytes()
term.bytes() = "[61 6d]"
main[1] where
[1] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[2] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
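The same lookup expressed through the public API, as a sketch assuming Lucene 9.x (the field and term match the debug session above):

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

class TermLookup {
  // seekExact positions the enum on the term (FST index walk plus the
  // on-disk block scan traced above); postings() then opens the doc list.
  static void lookup(IndexReader reader) throws IOException {
    Terms terms = MultiTerms.getTerms(reader, "contents");
    TermsEnum te = terms.iterator();
    if (te.seekExact(new BytesRef("am"))) {          // bytes [61 6d]
      PostingsEnum pe = te.postings(null, PostingsEnum.FREQS);
      for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
        System.out.println("doc=" + doc + " freq=" + pe.freq());
      }
    }
  }
}
```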

At the very end the call should land here, scanning the terms of the block; the exact spot still needs verification, but this should be the path:

// Target's prefix matches this block's prefix; we
// scan the entries check if the suffix matches.
public SeekStatus scanToTermLeaf(BytesRef target, boolean exactOnly) throws IOException {

  // if (DEBUG) System.out.println("    scanToTermLeaf: block fp=" + fp + " prefix=" + prefix + "
  // nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
  // brToString(term));

  assert nextEnt != -1;

  ste.termExists = true;
  subCode = 0;

  if (nextEnt == entCount) {
    if (exactOnly) {
      fillTerm();
    }
    return SeekStatus.END;
  }

  assert prefixMatches(target);

  // TODO: binary search when all terms have the same length, which is common for ID fields,
  // which are also the most sensitive to lookup performance?
  // Loop over each entry (term or sub-block) in this block:
  do {
    nextEnt++;

    suffix = suffixLengthsReader.readVInt();

    // if (DEBUG) {
    //   BytesRef suffixBytesRef = new BytesRef();
    //   suffixBytesRef.bytes = suffixBytes;
    //   suffixBytesRef.offset = suffixesReader.getPosition();
    //   suffixBytesRef.length = suffix;
    //   System.out.println("      cycle: term " + (nextEnt-1) + " (of " + entCount + ") suffix="
    //       + brToString(suffixBytesRef));
    // }

    startBytePos = suffixesReader.getPosition();
    suffixesReader.skipBytes(suffix);

    // Loop over bytes in the suffix, comparing to the target
    final int cmp =
        Arrays.compareUnsigned(
            suffixBytes,
            startBytePos,
            startBytePos + suffix,
            target.bytes,
            target.offset + prefix,
            target.offset + target.length);

    if (cmp < 0) {
      // Current entry is still before the target;
      // keep scanning
    } else if (cmp > 0) {
      // Done! Current entry is after target --
      // return NOT_FOUND:
      fillTerm();

      // if (DEBUG) System.out.println("        not found");
      return SeekStatus.NOT_FOUND;
    } else {
      // Exact match!

      // This cannot be a sub-block because we
      // would have followed the index to this
      // sub-block from the start:

      assert ste.termExists;
      fillTerm();
      // if (DEBUG) System.out.println("        found!");
      return SeekStatus.FOUND;
    }
  } while (nextEnt < entCount);

  // It is possible (and OK) that terms index pointed us
  // at this block, but, we scanned the entire block and
  // did not find the term to position to. This happens
  // when the target is after the last term in the block
  // (but, before the next term in the index). EG
  // target could be foozzz, and terms index pointed us
  // to the foo* block, but the last term in this block
  // was fooz (and, eg, first term in the next block will
  // bee fop).
  // if (DEBUG) System.out.println("      block end");
  if (exactOnly) {
    fillTerm();
  }

  // TODO: not consistent that in the
  // not-exact case we don't next() into the next
  // frame here
  return SeekStatus.END;
}

// Target's prefix matches this block's prefix; we
// scan the entries check if the suffix matches.
public SeekStatus scanToTermNonLeaf(BytesRef target, boolean exactOnly) throws IOException {

// if (DEBUG) System.out.println(" scanToTermNonLeaf: block fp=" + fp + " prefix=" + prefix +
// " nextEnt=" + nextEnt + " (of " + entCount + ") target=" + brToString(target) + " term=" +
// brToString(target));

assert nextEnt != -1;

if (nextEnt == entCount) {
if (exactOnly) {
fillTerm();
ste.termExists = subCode == 0;
}
return SeekStatus.END;
}

assert prefixMatches(target);

// Loop over each entry (term or sub-block) in this block:
while (nextEnt < entCount) {

nextEnt++;

final int code = suffixLengthsReader.readVInt();
suffix = code >>> 1;

// if (DEBUG) {
// BytesRef suffixBytesRef = new BytesRef();
// suffixBytesRef.bytes = suffixBytes;
// suffixBytesRef.offset = suffixesReader.getPosition();
// suffixBytesRef.length = suffix;
// System.out.println(" cycle: " + ((code&1)==1 ? "sub-block" : "term") + " " +
// (nextEnt-1) + " (of " + entCount + ") suffix=" + brToString(suffixBytesRef));
// }

final int termLen = prefix + suffix;
startBytePos = suffixesReader.getPosition();
suffixesReader.skipBytes(suffix);
ste.termExists = (code & 1) == 0;
if (ste.termExists) {
state.termBlockOrd++;
subCode = 0;
} else {
subCode = suffixLengthsReader.readVLong();
lastSubFP = fp - subCode;
}

final int cmp =
Arrays.compareUnsigned(
suffixBytes,
startBytePos,
startBytePos + suffix,
target.bytes,
target.offset + prefix,
target.offset + target.length);

if (cmp < 0) {
// Current entry is still before the target;
// keep scanning
} else if (cmp > 0) {
// Done! Current entry is after target --
// return NOT_FOUND:
fillTerm();

// if (DEBUG) System.out.println(" maybe done exactOnly=" + exactOnly + "
// ste.termExists=" + ste.termExists);

if (!exactOnly && !ste.termExists) {
// System.out.println(" now pushFrame");
// TODO this
// We are on a sub-block, and caller wants
// us to position to the next term after
// the target, so we must recurse into the
// sub-frame(s):
ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, termLen);
ste.currentFrame.loadBlock();
while (ste.currentFrame.next()) {
ste.currentFrame = ste.pushFrame(null, ste.currentFrame.lastSubFP, ste.term.length());
ste.currentFrame.loadBlock(); // <-- the block bytes get loaded from the stream here
}
}

// if (DEBUG) System.out.println(" not found");
return SeekStatus.NOT_FOUND;
} else {
// Exact match!

// This cannot be a sub-block because we
// would have followed the index to this
// sub-block from the start:

assert ste.termExists;
fillTerm();
// if (DEBUG) System.out.println(" found!");
return SeekStatus.FOUND;
}
}

// It is possible (and OK) that terms index pointed us
// at this block, but, we scanned the entire block and
// did not find the term to position to. This happens
// when the target is after the last term in the block
// (but, before the next term in the index). EG
// target could be foozzz, and terms index pointed us
// to the foo* block, but the last term in this block
// was fooz (and, eg, first term in the next block will
// bee fop).
// if (DEBUG) System.out.println(" block end");
if (exactOnly) {
fillTerm();
}

// TODO: not consistent that in the
// not-exact case we don't next() into the next
// frame here
return SeekStatus.END;
}
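
Both scan loops decide between keep-scanning, NOT_FOUND and FOUND with a single unsigned, lexicographic byte comparison of the entry's suffix against the target's remaining bytes. A JDK-only illustration:

import java.util.Arrays;

public class SuffixCompare {
  public static void main(String[] args) {
    byte[] block = "amand".getBytes();  // two suffixes packed in one block: "am", "and"
    byte[] target = "and".getBytes();   // empty prefix here, so suffix == whole term

    // "am" vs "and": mismatch at 'm' < 'n', so negative -> keep scanning
    System.out.println(Arrays.compareUnsigned(block, 0, 2, target, 0, target.length));

    // "and" vs "and": equal, so zero -> SeekStatus.FOUND
    System.out.println(Arrays.compareUnsigned(block, 2, 5, target, 0, target.length));
  }
}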

How does the termState get deserialized?

Breakpoint hit: "thread=main", org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm(), line=194 bci=0
194 final IntBlockTermState termState = (IntBlockTermState) _termState;

main[1] where
[1] org.apache.lucene.codecs.lucene90.Lucene90PostingsReader.decodeTerm (Lucene90PostingsReader.java:194)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)


@Override
public void decodeTerm(
DataInput in, FieldInfo fieldInfo, BlockTermState _termState, boolean absolute)
throws IOException {
final IntBlockTermState termState = (IntBlockTermState) _termState;
final boolean fieldHasPositions =
fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS) >= 0;
final boolean fieldHasOffsets =
fieldInfo.getIndexOptions().compareTo(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS)
>= 0;
final boolean fieldHasPayloads = fieldInfo.hasPayloads();

if (absolute) {
termState.docStartFP = 0;
termState.posStartFP = 0;
termState.payStartFP = 0;
}

final long l = in.readVLong();
if ((l & 0x01) == 0) {
termState.docStartFP += l >>> 1;
if (termState.docFreq == 1) {
termState.singletonDocID = in.readVInt();
} else {
termState.singletonDocID = -1;
}
} else {
assert absolute == false;
assert termState.singletonDocID != -1;
termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);
}

if (fieldHasPositions) {
termState.posStartFP += in.readVLong();
if (fieldHasOffsets || fieldHasPayloads) {
termState.payStartFP += in.readVLong();
}
if (termState.totalTermFreq > BLOCK_SIZE) {
termState.lastPosBlockOffset = in.readVLong();
} else {
termState.lastPosBlockOffset = -1;
}
}

if (termState.docFreq > BLOCK_SIZE) {
termState.skipOffset = in.readVLong();
} else {
termState.skipOffset = -1;
}
}
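
decodeTerm packs two different things into the first VLong: the lowest bit says whether the remaining bits are a delta of docStartFP or a zig-zag-coded delta of singletonDocID (BitUtil.zigZagDecode). A self-contained sketch of both bit tricks; this is not Lucene code, just the same formulas:

public class VLongTricks {
  // same formulas as org.apache.lucene.util.BitUtil
  static long zigZagEncode(long v) { return (v >> 63) ^ (v << 1); }
  static long zigZagDecode(long v) { return (v >>> 1) ^ -(v & 1); }

  public static void main(String[] args) {
    long fpDelta = 42;                 // flag bit 0: payload is a docStartFP delta
    long l = fpDelta << 1;
    System.out.println(((l & 1) == 0) + " fpDelta=" + (l >>> 1)); // true fpDelta=42

    long idDelta = -3;                 // flag bit 1: signed singletonDocID delta
    long enc = (zigZagEncode(idDelta) << 1) | 1;
    System.out.println(((enc & 1) == 1)
        + " idDelta=" + zigZagDecode(enc >>> 1));                 // true idDelta=-3
  }
}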

In fact, ste (the SegmentTermsEnum) holds a reference to the term; the bytes 97 109 below are ASCII for "am":

main[2] dump ste.term.ref.bytes
ste.term.ref.bytes = {
97, 109, 0, 0, 0, 0, 0, 0
}
main[2] where
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData (SegmentTermsEnumFrame.java:476)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.termState (SegmentTermsEnum.java:1,178)
[4] org.apache.lucene.index.TermStates.build (TermStates.java:104)
[5] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[6] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[8] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[9] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[10] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[11] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

ste.in describes the file being read; the resourceDescription at the bottom shows it is the .tim terms dictionary, sliced out of the memory-mapped compound file _7.cfs:

 ste.in = {
$assertionsDisabled: true
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_FLOATBUFFER: instance of java.nio.HeapFloatBuffer(id=1473)
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_LONGBUFFER: instance of java.nio.HeapLongBuffer(id=1474)
org.apache.lucene.store.ByteBufferIndexInput.EMPTY_INTBUFFER: instance of java.nio.HeapIntBuffer(id=1475)
org.apache.lucene.store.ByteBufferIndexInput.length: 1993
org.apache.lucene.store.ByteBufferIndexInput.chunkSizeMask: 1073741823
org.apache.lucene.store.ByteBufferIndexInput.chunkSizePower: 30
org.apache.lucene.store.ByteBufferIndexInput.guard: instance of org.apache.lucene.store.ByteBufferGuard(id=1476)
org.apache.lucene.store.ByteBufferIndexInput.buffers: instance of java.nio.ByteBuffer[1] (id=1477)
org.apache.lucene.store.ByteBufferIndexInput.curBufIndex: 0
org.apache.lucene.store.ByteBufferIndexInput.curBuf: instance of java.nio.DirectByteBufferR(id=1479)
org.apache.lucene.store.ByteBufferIndexInput.curLongBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.curIntBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.curFloatBufferViews: null
org.apache.lucene.store.ByteBufferIndexInput.isClone: true
org.apache.lucene.store.ByteBufferIndexInput.$assertionsDisabled: true
org.apache.lucene.store.IndexInput.resourceDescription: "MMapIndexInput(path="/home/dai/index/_7.cfs") [slice=_7_Lucene90_0.tim]"
}

Related reading:

  public void nextLeaf() {
// if (DEBUG) System.out.println(" frame.next ord=" + ord + " nextEnt=" + nextEnt + "
// entCount=" + entCount);
assert nextEnt != -1 && nextEnt < entCount
: "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
nextEnt++;
suffix = suffixLengthsReader.readVInt();
startBytePos = suffixesReader.getPosition();
ste.term.setLength(prefix + suffix);
ste.term.grow(ste.term.length());
suffixesReader.readBytes(ste.term.bytes(), prefix, suffix);
ste.termExists = true;
}

public boolean nextNonLeaf() throws IOException {
// if (DEBUG) System.out.println(" stef.next ord=" + ord + " nextEnt=" + nextEnt + " entCount="
// + entCount + " fp=" + suffixesReader.getPosition());
while (true) {
if (nextEnt == entCount) {
assert arc == null || (isFloor && isLastInFloor == false)
: "isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;
loadNextFloorBlock();
if (isLeafBlock) {
nextLeaf();
return false;
} else {
continue;
}
}

assert nextEnt != -1 && nextEnt < entCount
: "nextEnt=" + nextEnt + " entCount=" + entCount + " fp=" + fp;
nextEnt++;
final int code = suffixLengthsReader.readVInt();
suffix = code >>> 1;
startBytePos = suffixesReader.getPosition();
ste.term.setLength(prefix + suffix);
ste.term.grow(ste.term.length());
suffixesReader.readBytes(ste.term.bytes(), prefix, suffix); // is this the key step?
if ((code & 1) == 0) {
// A normal term
ste.termExists = true;
subCode = 0;
state.termBlockOrd++;
return false;
} else {
// A sub-block; make sub-FP absolute:
ste.termExists = false;
subCode = suffixLengthsReader.readVLong();
lastSubFP = fp - subCode;
// if (DEBUG) {
// System.out.println(" lastSubFP=" + lastSubFP);
// }
return true;
}
}
}

This looks like the place where the term's location information in the file is read:

main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTermLeaf (SegmentTermsEnumFrame.java:593)
[2] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.scanToTerm (SegmentTermsEnumFrame.java:530)
[3] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.seekExact (SegmentTermsEnum.java:538)
[4] org.apache.lucene.index.TermStates.loadTermsEnum (TermStates.java:117)
[5] org.apache.lucene.index.TermStates.build (TermStates.java:102)
[6] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[7] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[8] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[9] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[10] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[11] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[12] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)
main[1] dump suffixBytes
suffixBytes = {
97, 109, 97, 110, 100, 98, 117, 116, 99, 97, 110, 100, 111, 104, 101, 108, 108, 111, 104, 105, 105, 105, 115, 105, 116, 107, 110, 111, 119, 109, 97, 121, 109, 111, 110, 103, 111, 110, 111, 116, 116, 114, 121, 119, 104, 97, 116, 119, 111, 114, 108, 100, 121, 111, 117, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
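
Decoded by hand, the 55 non-zero bytes above are the concatenation of this block's 18 term suffixes (the block prefix is empty, so each suffix is a whole term). The per-term lengths are the 02 03 03 ... run visible in the .tim hexdump further down. A throwaway sketch that slices them back apart, with the lengths copied from that dump:

public class SuffixDecode {
  public static void main(String[] args) {
    String suffixBytes = "amandbutcandohellohiiisitknowmaymongonottrywhatworldyou";
    int[] suffixLengths = {2, 3, 3, 3, 2, 5, 2, 1, 2, 2, 4, 3, 5, 3, 3, 4, 5, 3};
    int pos = 0;
    for (int len : suffixLengths) {
      System.out.println(suffixBytes.substring(pos, pos + len)); // am, and, but, can, ...
      pos += len;
    }
  }
}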

Where the docFreq statistics for the term are read:

main[1] list
451 // postings
452
453 // TODO: if docFreq were bulk decoded we could
454 // just skipN here:
455 => if (statsSingletonRunLength > 0) {
456 state.docFreq = 1;
457 state.totalTermFreq = 1;
458 statsSingletonRunLength--;
459 } else {
460 int token = statsReader.readVInt();
main[1] print statsSingletonRunLength
statsSingletonRunLength = 0
main[1] next
>
Step completed: "thread=main", org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnumFrame.decodeMetaData(), line=460 bci=80
460 int token = statsReader.readVInt();

main[1] list
456 state.docFreq = 1;
457 state.totalTermFreq = 1;
458 statsSingletonRunLength--;
459 } else {
460 => int token = statsReader.readVInt();
461 if ((token & 1) == 1) {
462 state.docFreq = 1;
463 state.totalTermFreq = 1;
464 statsSingletonRunLength = token >>> 1;
465 } else {
main[1] print statsReader
statsReader = "org.apache.lucene.store.ByteArrayDataInput@6b67034"
main[1] dump statsReader
statsReader = {
bytes: instance of byte[64] (id=1520)
pos: 0
limit: 16
}
main[1] dump statsReader.bytes
statsReader.bytes = {
4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
}
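
The stats block encodes (docFreq, totalTermFreq) pairs with a singleton-run optimization: an odd token means a run of docFreq=1 terms. A minimal re-implementation of the decoding loop, fed the exact 16 bytes from the dump above; it assumes the field records frequencies, and exploits that every VInt/VLong in this particular dump fits in one byte, so reading a byte stands in for both:

public class StatsDecode {
  public static void main(String[] args) {
    int[] stats = {4, 0, 9, 2, 1, 4, 0, 3, 2, 1, 1, 2, 1, 7, 2, 2};
    int pos = 0, singletonRun = 0, term = 0;
    while (pos < stats.length || singletonRun > 0) {
      int docFreq, totalTermFreq;
      if (singletonRun > 0) {       // still inside a run of docFreq==1 terms
        docFreq = 1; totalTermFreq = 1; singletonRun--;
      } else {
        int token = stats[pos++];
        if ((token & 1) == 1) {     // odd: singleton, high bits = run length
          docFreq = 1; totalTermFreq = 1; singletonRun = token >>> 1;
        } else {                    // even: docFreq, then a totalTermFreq delta
          docFreq = token >>> 1;
          totalTermFreq = docFreq + stats[pos++];
        }
      }
      System.out.println("term#" + (term++) + " docFreq=" + docFreq
          + " totalTermFreq=" + totalTermFreq);
    }
  }
}

term#0 comes out as docFreq=2, totalTermFreq=2 (the term "am" appears in both documents), and exactly 18 terms are produced, matching the block's 18 suffixes.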

The searched term "am" corresponds to this region of the .tim file:

00000000  3f d7 6c 17 12 42 6c 6f  63 6b 54 72 65 65 54 65  |?.l..BlockTreeTe|
00000010 72 6d 73 44 69 63 74 00 00 00 00 fe ea 80 e6 45 |rmsDict........E|
00000020 20 d8 56 64 1b 1b 1b 89 70 fe 67 0a 4c 75 63 65 | .Vd....p.g.Luce|
00000030 6e 65 39 30 5f 30 25 bc 03 61 6d 61 6e 64 62 75 |ne90_0%..amandbu|
00000040 74 63 61 6e 64 6f 68 65 6c 6c 6f 68 69 69 69 73 |tcandohellohiiis|
00000050 69 74 6b 6e 6f 77 6d 61 79 6d 6f 6e 67 6f 6e 6f |itknowmaymongono|
00000060 74 74 72 79 77 68 61 74 77 6f 72 6c 64 79 6f 75 |ttrywhatworldyou|
00000070 24 02 03 03 03 02 05 02 01 02 02 04 03 05 03 03 |$...............|
00000080 04 05 03 10 04 00 09 02 01 04 00 03 02 01 01 02 |................| <---- the stats block: 0x10 (= 16) is its length, then come the 16 bytes dumped from statsReader above
00000090 01 07 02 02 26 7a 3d 04 01 02 03 01 01 01 01 01 |....&z=.........|
000000a0 05 01 01 01 00 02 04 00 02 01 01 01 01 01 02 01 |................|
000000b0 01 01 02 01 01 01 01 05 01 03 01 05 a4 03 2f 68 |............../h|
000000c0 6f 6d 65 2f 75 62 75 6e 74 75 2f 64 6f 63 2f 68 |ome/ubuntu/doc/h|
000000d0 65 6c 6c 6f 2e 74 78 74 2f 68 6f 6d 65 2f 75 62 |ello.txt/home/ub|
000000e0 75 6e 74 75 2f 64 6f 63 2f 6d 6f 6e 67 6f 2e 74 |untu/doc/mongo.t|
000000f0 78 74 05 1a 01 03 04 82 01 01 03 c0 28 93 e8 00 |xt..........(...|
00000100 00 00 00 00 00 00 00 da 02 a3 a3 |...........|

So where does docFreq get assigned?

 currentFrame.state.docFreq = 2
main[1] list
1,113 assert !eof;
1,114 // if (DEBUG) System.out.println("BTR.docFreq");
1,115 currentFrame.decodeMetaData();
1,116 // if (DEBUG) System.out.println(" return " + currentFrame.state.docFreq);
1,117 => return currentFrame.state.docFreq;
1,118 }
1,119
1,120 @Override
1,121 public long totalTermFreq() throws IOException {
1,122 assert !eof;
main[1] where
[1] org.apache.lucene.codecs.lucene90.blocktree.SegmentTermsEnum.docFreq (SegmentTermsEnum.java:1,117)
[2] org.apache.lucene.index.TermStates.build (TermStates.java:107)
[3] org.apache.lucene.search.TermQuery.createWeight (TermQuery.java:227)
[4] org.apache.lucene.search.IndexSearcher.createWeight (IndexSearcher.java:885)
[5] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:686)
[6] org.apache.lucene.search.IndexSearcher.searchAfter (IndexSearcher.java:532)
[7] org.apache.lucene.search.IndexSearcher.search (IndexSearcher.java:542)
[8] org.apache.lucene.demo.SearchFiles.doPagingSearch (SearchFiles.java:180)
[9] org.apache.lucene.demo.SearchFiles.main (SearchFiles.java:150)

The read path when the postings are actually consumed:

readByte:110, ByteBufferIndexInput (org.apache.lucene.store)
readVInt:121, DataInput (org.apache.lucene.store)
readVIntBlock:149, Lucene90PostingsReader (org.apache.lucene.codecs.lucene90)
refillDocs:472, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:538, Lucene90PostingsReader$BlockDocsEnum (org.apache.lucene.codecs.lucene90)
advance:77, SlowImpactsEnum (org.apache.lucene.index)
advance:128, ImpactsDISI (org.apache.lucene.search)
nextDoc:133, ImpactsDISI (org.apache.lucene.search)
scoreAll:301, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:247, Weight$DefaultBulkScorer (org.apache.lucene.search)
score:38, BulkScorer (org.apache.lucene.search)
search:776, IndexSearcher (org.apache.lucene.search)
search:694, IndexSearcher (org.apache.lucene.search)
search:688, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:161, SearchFiles (com.dinosaur.lucene.skiptest)
queryTest:52, QueryTest (com.dinosaur.lucene.demo)
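
To drive the same read path from the API, it is enough to iterate the postings of the seeked term; nextDoc()/advance() is what ends up in refillDocs/readVIntBlock above. A sketch under the same assumptions as the earlier seekExact example:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class PostingsDemo {
  public static void main(String[] args) throws Exception {
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/home/ubuntu/index")))) {
      for (LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms("contents");
        if (terms == null) continue;
        TermsEnum te = terms.iterator();
        if (!te.seekExact(new BytesRef("am"))) continue;
        PostingsEnum pe = te.postings(null, PostingsEnum.FREQS); // the BlockDocsEnum from the stack
        for (int doc = pe.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = pe.nextDoc()) {
          System.out.println("doc=" + doc + " freq=" + pe.freq());
        }
      }
    }
  }
}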

Where is the .tim file initialized and read? loadBlock() pulls in one whole block:

  void loadBlock() throws IOException {

// Clone the IndexInput lazily, so that consumers
// that just pull a TermsEnum to
// seekExact(TermState) don't pay this cost:
ste.initIndexInput();

if (nextEnt != -1) {
// Already loaded
return;
}
// System.out.println("blc=" + blockLoadCount);

ste.in.seek(fp);
int code = ste.in.readVInt();
entCount = code >>> 1;
assert entCount > 0;
isLastInFloor = (code & 1) != 0;

assert arc == null || (isLastInFloor || isFloor)
: "fp=" + fp + " arc=" + arc + " isFloor=" + isFloor + " isLastInFloor=" + isLastInFloor;

// TODO: if suffixes were stored in random-access
// array structure, then we could do binary search
// instead of linear scan to find target term; eg
// we could have simple array of offsets

final long startSuffixFP = ste.in.getFilePointer();
// term suffixes:
final long codeL = ste.in.readVLong();
isLeafBlock = (codeL & 0x04) != 0;
final int numSuffixBytes = (int) (codeL >>> 3);
if (suffixBytes.length < numSuffixBytes) {
suffixBytes = new byte[ArrayUtil.oversize(numSuffixBytes, 1)];
}
try {
compressionAlg = CompressionAlgorithm.byCode((int) codeL & 0x03);
} catch (IllegalArgumentException e) {
throw new CorruptIndexException(e.getMessage(), ste.in, e);
}
compressionAlg.read(ste.in, suffixBytes, numSuffixBytes);
suffixesReader.reset(suffixBytes, 0, numSuffixBytes);

int numSuffixLengthBytes = ste.in.readVInt();
final boolean allEqual = (numSuffixLengthBytes & 0x01) != 0;
numSuffixLengthBytes >>>= 1;
if (suffixLengthBytes.length < numSuffixLengthBytes) {
suffixLengthBytes = new byte[ArrayUtil.oversize(numSuffixLengthBytes, 1)];
}
if (allEqual) {
Arrays.fill(suffixLengthBytes, 0, numSuffixLengthBytes, ste.in.readByte());
} else {
ste.in.readBytes(suffixLengthBytes, 0, numSuffixLengthBytes);
}
suffixLengthsReader.reset(suffixLengthBytes, 0, numSuffixLengthBytes);
totalSuffixBytes = ste.in.getFilePointer() - startSuffixFP;

/*if (DEBUG) {
if (arc == null) {
System.out.println(" loadBlock (next) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
} else {
System.out.println(" loadBlock (seek) fp=" + fp + " entCount=" + entCount + " prefixLen=" + prefix + " hasTerms?=" + hasTerms + " isFloor?=" + isFloor + " isLastInFloor=" + isLastInFloor + " leaf?=" + isLeafBlock);
}
}*/

// stats
int numBytes = ste.in.readVInt();
if (statBytes.length < numBytes) {
statBytes = new byte[ArrayUtil.oversize(numBytes, 1)];
}
ste.in.readBytes(statBytes, 0, numBytes);
statsReader.reset(statBytes, 0, numBytes);
statsSingletonRunLength = 0;
metaDataUpto = 0;

state.termBlockOrd = 0;
nextEnt = 0;
lastSubFP = -1;

// TODO: we could skip this if !hasTerms; but
// that's rare so won't help much
// metadata
numBytes = ste.in.readVInt();
if (bytes.length < numBytes) {
bytes = new byte[ArrayUtil.oversize(numBytes, 1)];
}
ste.in.readBytes(bytes, 0, numBytes);
bytesReader.reset(bytes, 0, numBytes);

// Sub-blocks of a single floor block are always
// written one after another -- tail recurse:
fpEnd = ste.in.getFilePointer();
// if (DEBUG) {
// System.out.println(" fpEnd=" + fpEnd);
// }
}
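
Against the .tim hexdump earlier, this header logic can be checked by hand: at offset 0x36 the block starts with the VInt 0x25 = 37 and the VLong bc 03 = 444. A tiny check (the byte offsets are my own reading of the dump):

// Hand-decode of the block header bytes (offset 0x36: 25 bc 03), following loadBlock():
public class BlockHeaderDecode {
  public static void main(String[] args) {
    int code = 0x25;                            // readVInt -> 37
    System.out.println("entCount=" + (code >>> 1)          // 18 entries
        + " isLastInFloor=" + ((code & 1) != 0));          // true
    long codeL = (0xbc & 0x7fL) | (0x03L << 7); // readVLong of bytes bc 03 -> 444
    System.out.println("numSuffixBytes=" + (codeL >>> 3)   // 55 = len("amandbut...you")
        + " isLeafBlock=" + ((codeL & 0x04) != 0)          // true
        + " compression=" + (codeL & 0x03));               // 0 = no compression
  }
}
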
We know Lucene splits the index across multiple files; here we only discuss the inverted-index part.

Lucene calls the index file that stores the terms the Terms Index; its suffix is .tip. The postings information is stored across .doc, .pay, and .pos, which record, respectively, the postings' DocIds plus term frequencies, the payload information, and the position information. The Terms Dictionary has the suffix .tim; it is the bridge between a Term and its Postings, storing each Term together with the file pointers into its Postings.

Overall, the Terms Index (.tip) lets you quickly locate the Term you want inside the Terms Dictionary (.tim), along with its Postings file pointers and the Term's statistics at segment scope.
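
To see these files on disk, listing the index directory is enough. This assumes the index was written with setUseCompoundFile(false); otherwise the per-extension files are packed into a single .cfs compound file, which is exactly what the MMapIndexInput slice in the ste.in dump above shows:

import java.nio.file.Paths;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ListIndexFiles {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get("/home/ubuntu/index"))) {
      for (String name : dir.listAll()) {
        System.out.println(name); // e.g. _0.tip _0.tim _0.doc _0.pos _0.pay ...
      }
    }
  }
}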


Postings: a posting list actually contains more than just DocIDs (the ordered sequence of document numbers is what we usually call the DocIDs); it also carries the term frequency, the Term's positions within each document, and the Payload data.

So the inverted index involves at least five kinds of files (.tip, .tim, .doc, .pos, .pay); this article does not cover all of them exhaustively.

Related reading