
fst


Background

FST stands for finite state transducer. Lucene compresses and stores much of its data (for example, the term dictionary) in this format.

FST example

Before introducing the FST, let's look at HashMap first.

HashMap semantics: key -> value, i.e., you put in a key and get back a value.

An FST is also a special kind of map, with much the same semantics: FST(key) = value.

Example

Take the following array of terms: [cat:5, dog:7, dogs:13]

  • key cat, value 5
  • key dog, value 7
  • key dogs, value 13

It ends up serialized into this structure:

[0, 116, 15, 97, 6, 5, 115, 31, 103, 7, 111, 6, 7, 100, 22, 4, 5, 99, 16]
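
For reference, this is roughly how such an FST can be built and queried through Lucene's FST API. This is a minimal sketch in the Lucene 9.x style; FSTCompiler's builder and the compile() signature have shifted across versions, so treat the exact names as assumptions:

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.FSTCompiler;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstDemo {
  public static void main(String[] args) throws Exception {
    // Outputs define how values are stored and combined along arcs.
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    FSTCompiler<Long> compiler =
        new FSTCompiler.Builder<>(FST.INPUT_TYPE.BYTE1, outputs).build();
    IntsRefBuilder scratch = new IntsRefBuilder();
    // Keys must be added in sorted order.
    compiler.add(Util.toIntsRef(new BytesRef("cat"), scratch), 5L);
    compiler.add(Util.toIntsRef(new BytesRef("dog"), scratch), 7L);
    compiler.add(Util.toIntsRef(new BytesRef("dogs"), scratch), 13L);
    FST<Long> fst = compiler.compile();
    System.out.println(Util.get(fst, new BytesRef("dog"))); // prints 7
  }
}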

Let's analyze this example byte by byte.

Reading from the tail of the array (..., 103, 7, 111, 6, 7, 100, 22, 4, 5, 99, 16), arc by arc:

  • flag=7, label='g' (103), target=7, nextArc=7
  • flag=6, label='o' (111), target=9, nextArc=9
  • flag=22, label='d' (100), output=7, target=11 (why 11? because pos=11 sits right before the 7), nextArc=11
  • flag=16, label='c' (99), output=5, target=4, nextArc=14

[0, 116, 15,| 97, 6,  | 5, 115, 31,| 103, 7, | 111, 6, | 7, 100, 22,| 4, 5, 99, 16]
------------| --------| -----------| --------| --------| -----------| -----------------------
(t,null)    | (a,null)| (s,5)      | (g,null)| (o,null)| (d,7)      | (c,5) flag=16, target=4

Flag constants:

| Constant | Value | Description |
| --- | --- | --- |
| BIT_LAST_ARC | 1<<1 | Marks this arc as the last arc of its node, similar to the right child of a binary tree, or the third child of a ternary tree |
| BIT_ARC_HAS_OUTPUT | 1<<4 | The arc has an output, i.e., some value is stored on it |
| BIT_TARGET_NEXT | 1<<2 | The target node is written immediately after this arc, so it needs no separate pointer; the arc's two nodes are stored together |
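
To make the flags concrete: the 'd' arc above has flag=22 = 16 + 4 + 2, i.e. BIT_ARC_HAS_OUTPUT | BIT_TARGET_NEXT | BIT_LAST_ARC (bit values as defined in org.apache.lucene.util.fst.FST):

public class FlagDemo {
  public static void main(String[] args) {
    int flags = 22; // the 'd' arc: 22 = 16 + 4 + 2
    System.out.println((flags & (1 << 1)) != 0); // BIT_LAST_ARC: true, last arc of its node
    System.out.println((flags & (1 << 2)) != 0); // BIT_TARGET_NEXT: true, target node stored right after
    System.out.println((flags & (1 << 4)) != 0); // BIT_ARC_HAS_OUTPUT: true, an output (7) is stored
  }
}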

Arc class analysis

The Arc class source is as follows:

public static final class Arc<T> {

  // *** Arc fields.

  private int label;

  private T output;

  private long target;

  private byte flags;

  private T nextFinalOutput;

  private long nextArc;

  private byte nodeFlags;

  // *** Fields for arcs belonging to a node with fixed length arcs.
  // So only valid when bytesPerArc != 0.
  // nodeFlags == ARCS_FOR_BINARY_SEARCH || nodeFlags == ARCS_FOR_DIRECT_ADDRESSING.

  private int bytesPerArc;

  private long posArcsStart;

  private int arcIdx;

  private int numArcs;

  // *** Fields for a direct addressing node. nodeFlags == ARCS_FOR_DIRECT_ADDRESSING.

  /**
   * Start position in the {@link FST.BytesReader} of the presence bits for a direct addressing
   * node, aka the bit-table
   */
  private long bitTableStart;

  /** First label of a direct addressing node. */
  private int firstLabel;

  /**
   * Index of the current label of a direct addressing node. While {@link #arcIdx} is the current
   * index in the label range, {@link #presenceIndex} is its corresponding index in the list of
   * actually present labels. It is equal to the number of bits set before the bit at {@link
   * #arcIdx} in the bit-table. This field is a cache to avoid to count bits set repeatedly when
   * iterating the next arcs.
   */
  private int presenceIndex;
}

| Field | Description |
| --- | --- |
| label | In key/value terms, a label is one slice of a key, and several labels make up a key; for example, "cat" splits into three labels: 'c', 'a', 't' |
| output | Likewise, an output is one slice of a value, and several outputs combine into a value |
| target | Offset of the arc's target node: for an arc src -> dst, target is dst's position, i.e., arr[target] is dst's node |
| flags | Assorted flag bits describing the state of this arc, packed into a bitmap |
| nextFinalOutput | As above: in key/value terms, this holds the last slice of the value, otherwise null |
| nextArc | When a node has several arcs (like sibling nodes in a multi-way tree), this is the offset of the next sibling arc |
| numArcs | How many arcs this node has, i.e., how many children the node has |
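
Putting the fields together, a lookup walks one arc per label and sums up the output slices along the way. The sketch below mirrors what Util.get(fst, input) does internally; the accessor names follow Lucene 9.x, and this helper is illustrative, not library code:

import java.io.IOException;
import org.apache.lucene.util.fst.FST;

class FstLookup {
  static Long lookup(FST<Long> fst, String key) throws IOException {
    FST.BytesReader in = fst.getBytesReader();
    FST.Arc<Long> arc = fst.getFirstArc(new FST.Arc<>());
    long value = 0; // PositiveIntOutputs combines output slices by addition
    for (int i = 0; i < key.length(); i++) {
      // follow the arc whose label matches the next input symbol
      if (fst.findTargetArc(key.charAt(i), arc, arc, in) == null) {
        return null; // no arc with this label: the key is absent
      }
      value += arc.output(); // each arc carries one slice of the value
    }
    // the key exists only if we stopped on a final arc; nextFinalOutput
    // carries the last slice of the value
    return arc.isFinal() ? value + arc.nextFinalOutput() : null;
  }
}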

Write path:

add:473, FSTCompiler (org.apache.lucene.util.fst)
compileIndex:504, Lucene90BlockTreeTermsWriter$PendingBlock (org.apache.lucene.codecs.lucene90.blocktree)
writeBlocks:725, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
finish:1105, Lucene90BlockTreeTermsWriter$TermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:370, Lucene90BlockTreeTermsWriter (org.apache.lucene.codecs.lucene90.blocktree)
write:172, PerFieldPostingsFormat$FieldsWriter (org.apache.lucene.codecs.perfield)
flush:135, FreqProxTermsWriter (org.apache.lucene.index)
flush:310, IndexingChain (org.apache.lucene.index)
flush:392, DocumentsWriterPerThread (org.apache.lucene.index)
doFlush:492, DocumentsWriter (org.apache.lucene.index)
flushAllThreads:671, DocumentsWriter (org.apache.lucene.index)
doFlush:4194, IndexWriter (org.apache.lucene.index)
flush:4168, IndexWriter (org.apache.lucene.index)
shutdown:1322, IndexWriter (org.apache.lucene.index)
close:1362, IndexWriter (org.apache.lucene.index)
doTestSearch:133, FstTest (com.dinosaur.lucene.demo)

Query path:

findTargetArc:1418, FST (org.apache.lucene.util.fst)
seekExact:511, SegmentTermsEnum (org.apache.lucene.codecs.lucene90.blocktree)
loadTermsEnum:111, TermStates (org.apache.lucene.index)
build:96, TermStates (org.apache.lucene.index)
createWeight:227, TermQuery (org.apache.lucene.search)
createWeight:904, IndexSearcher (org.apache.lucene.search)
search:687, IndexSearcher (org.apache.lucene.search)
searchAfter:523, IndexSearcher (org.apache.lucene.search)
search:538, IndexSearcher (org.apache.lucene.search)
doPagingSearch:158, SearchFiles (com.dinosaur.lucene.demo)
testSearch:128, SearchFiles (com.dinosaur.lucene.demo)

Jump details

How the jump works

public void decodeMetaData() throws IOException {

  // if (DEBUG) System.out.println("\nBTTR.decodeMetadata seg=" + segment + " mdUpto=" +
  // metaDataUpto + " vs termBlockOrd=" + state.termBlockOrd);

  // lazily catch up on metadata decode:
  final int limit = getTermBlockOrd();
  boolean absolute = metaDataUpto == 0;
  assert limit > 0;

  // TODO: better API would be "jump straight to term=N"???
  while (metaDataUpto < limit) {

    // TODO: we could make "tiers" of metadata, ie,
    // decode docFreq/totalTF but don't decode postings
    // metadata; this way caller could get
    // docFreq/totalTF w/o paying decode cost for
    // postings

    // TODO: if docFreq were bulk decoded we could
    // just skipN here:
    if (statsSingletonRunLength > 0) {
      state.docFreq = 1;
      state.totalTermFreq = 1;
      statsSingletonRunLength--;
    } else {
      int token = statsReader.readVInt();
      if ((token & 1) == 1) {
        state.docFreq = 1;
        state.totalTermFreq = 1;
        statsSingletonRunLength = token >>> 1;
      } else {
        state.docFreq = token >>> 1;
        if (ste.fr.fieldInfo.getIndexOptions() == IndexOptions.DOCS) {
          state.totalTermFreq = state.docFreq;
        } else {
          state.totalTermFreq = state.docFreq + statsReader.readVLong();
        }
      }
    }

    // metadata
    ste.fr.parent.postingsReader.decodeTerm(bytesReader, ste.fr.fieldInfo, state, absolute);

    metaDataUpto++;
    absolute = false;
  }
  state.termBlockOrd = metaDataUpto;
}
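
A worked reading of the stats branch above: when the VInt token has its low bit set, say token = 9 (binary 1001), the current term is a singleton (docFreq = totalTermFreq = 1) and statsSingletonRunLength becomes 9 >>> 1 = 4, so the next four terms are decoded as singletons without reading any more bytes. When the low bit is clear, docFreq = token >>> 1, and totalTermFreq is docFreq plus a VLong delta, unless the field indexes documents only (IndexOptions.DOCS), in which case totalTermFreq equals docFreq.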

Related reading

Inverted index


Compiling ES

gradle idea

It ran for quite a while:

BUILD SUCCESSFUL in 49m 34s 334 actionable tasks: 334 executed

ES stack trace

prepareRequest:61, RestCatAction (org.elasticsearch.rest.action.cat)
handleRequest:80, BaseRestHandler (org.elasticsearch.rest)
handleRequest:69, SecurityRestFilter (org.elasticsearch.xpack.security.rest)
dispatchRequest:240, RestController (org.elasticsearch.rest)
tryAllHandlers:337, RestController (org.elasticsearch.rest)
dispatchRequest:174, RestController (org.elasticsearch.rest)
dispatchRequest:324, AbstractHttpServerTransport (org.elasticsearch.http)
handleIncomingRequest:374, AbstractHttpServerTransport (org.elasticsearch.http)
incomingRequest:303, AbstractHttpServerTransport (org.elasticsearch.http)
channelRead0:66, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead0:31, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead:105, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:58, Netty4HttpPipeliningHandler (org.elasticsearch.http.netty4)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
channelRead:111, MessageToMessageCodec (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:323, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:297, ByteToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1434, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:965, DefaultChannelPipeline (io.netty.channel)
read:163, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:644, NioEventLoop (io.netty.channel.nio)
processSelectedKeysPlain:544, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:498, NioEventLoop (io.netty.channel.nio)
run:458, NioEventLoop (io.netty.channel.nio)
run:897, SingleThreadEventExecutor$5 (io.netty.util.concurrent)
run:834, Thread (java.lang)

And also:

handleRequest:97, BaseRestHandler (org.elasticsearch.rest)
handleRequest:69, SecurityRestFilter (org.elasticsearch.xpack.security.rest)
dispatchRequest:240, RestController (org.elasticsearch.rest)
tryAllHandlers:337, RestController (org.elasticsearch.rest)
dispatchRequest:174, RestController (org.elasticsearch.rest)
dispatchRequest:324, AbstractHttpServerTransport (org.elasticsearch.http)
handleIncomingRequest:374, AbstractHttpServerTransport (org.elasticsearch.http)
incomingRequest:303, AbstractHttpServerTransport (org.elasticsearch.http)
channelRead0:66, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead0:31, Netty4HttpRequestHandler (org.elasticsearch.http.netty4)
channelRead:105, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:58, Netty4HttpPipeliningHandler (org.elasticsearch.http.netty4)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
channelRead:111, MessageToMessageCodec (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:102, MessageToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:323, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:297, ByteToMessageDecoder (io.netty.handler.codec)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:340, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1434, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:362, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:348, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:965, DefaultChannelPipeline (io.netty.channel)
read:163, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:644, NioEventLoop (io.netty.channel.nio)
processSelectedKeysPlain:544, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:498, NioEventLoop (io.netty.channel.nio)
run:458, NioEventLoop (io.netty.channel.nio)
run:897, SingleThreadEventExecutor$5 (io.netty.util.concurrent)
run:834, Thread (java.lang)

Inverted index overview

What problem does the inverted index solve?

Suppose we have a document a.txt that contains the text "hello world, i am dinosaur".

We need to determine whether the word "world" appears in this document. How should we go about it?

When there are only a few documents, we can:

  • 1. Open the file
  • 2. Read its content from the beginning and check whether it contains "world"

Now suppose we have not just a.txt but also b.txt and c.txt. How do we tell whether a given word appears in these documents? And if it does, in which documents, and on which line?

If we keep the previous approach of reading each file from the beginning, it works when there are few documents, but with many documents it takes far too long to scan them all.

One of the problems the inverted index solves is how to quickly determine whether a word occurs in these documents, and if so, in which ones.

Related example

Baseline inverted index

An inverted index consists of two main parts (a minimal sketch follows this list):

  • The first part maps each word $t$ to two fields:
    • $f_{t}$: the number of documents that contain the word $t$; i.e., if $n$ documents contain $t$, then $f_{t} = n$
    • a pointer to the word's inverted list
  • The second part, the inverted list, is a list whose elements each contain two fields:
    • $docid$: the document's id, which you can think of as the document's primary key
    • $f_{d,t}$: the number of occurrences of the word $t$ in document $docid$
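
Below is a minimal sketch of this two-part structure in Java. The class and method names are made up for illustration; this is not the linked demo, whose output follows:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Posting {
  final int docId; // docid: the document's primary key
  int freq;        // f_{d,t}: occurrences of the word in this document
  Posting(int docId) { this.docId = docId; this.freq = 1; }
}

class InvertedIndex {
  // word -> inverted list; the list's size() plays the role of f_t
  private final Map<String, List<Posting>> index = new HashMap<>();

  // assumes documents are added in increasing docId order
  void add(int docId, String text) {
    for (String word : text.split("\\s+")) {
      List<Posting> postings = index.computeIfAbsent(word, w -> new ArrayList<>());
      Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
      if (last != null && last.docId == docId) {
        last.freq++; // same document again: bump f_{d,t}
      } else {
        postings.add(new Posting(docId)); // first hit in this document: f_t grows by one
      }
    }
  }

  List<Posting> lookup(String word) {
    return index.getOrDefault(word, Collections.emptyList());
  }
}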


I wrote a demo of this myself (GitHub link); its output is as follows:

 keeper  3|[{1 1} {4 1} {5 1}]
In 1|[{2 1}]
house 2|[{2 1} {3 1}]
nignt 2|[{4 1} {5 1}]
did 1|[{4 1}]
dark 1|[{6 1}]
old 4|[{1 1} {2 1} {3 1} {4 1}]
night 3|[{1 1} {5 1} {6 1}]
had 1|[{3 1}]
sleeps 1|[{6 1}]
keep 3|[{1 1} {3 1} {5 1}]
big 2|[{2 1} {3 1}]
keeps 3|[{1 1} {5 1} {6 1}]
the 6|[{1 1} {2 1} {3 1} {4 1} {5 1} {6 1}]
never 1|[{4 1}]
and 1|[{6 1}]
And 1|[{6 1}]
in 5|[{1 1} {2 1} {3 1} {5 1} {6 1}]
The 3|[{1 1} {3 1} {5 1}]
sleep 1|[{4 1}]
Where 1|[{4 1}]
town 2|[{1 1} {3 1}]
gown 1|[{2 1}]

Steps to build an inverted index

  • 1. Read the documents
  • 2. Tokenize the text
  • 3. Normalize the tokens (a sketch of steps 2-3 follows this list)
  • 4. Build the inverted index with term frequencies and offsets
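
As a small illustration of steps 2 and 3, the sketch below splits on non-word characters and lowercases each token; the demo output above skipped normalization, which is why "And"/"and" and "The"/"the" show up as separate terms:

import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SimpleTokenizer {
  // Step 2: split the text into tokens; step 3: normalize them (lowercase).
  static List<String> tokens(String text) {
    List<String> out = new ArrayList<>();
    for (String tok : text.split("\\W+")) {
      if (!tok.isEmpty()) {
        out.add(tok.toLowerCase(Locale.ROOT));
      }
    }
    return out;
  }
}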

Tokenization

https://www.cnblogs.com/forfuture1978/archive/2010/06/06/1752837.html

Lucene's stack trace; the main logic lives in the invert method:

incrementToken:48, FilteringTokenFilter (org.apache.lucene.analysis)
invert:812, DefaultIndexingChain$PerField (org.apache.lucene.index)
processField:442, DefaultIndexingChain (org.apache.lucene.index)
processDocument:406, DefaultIndexingChain (org.apache.lucene.index)
updateDocument:250, DocumentsWriterPerThread (org.apache.lucene.index)
updateDocument:495, DocumentsWriter (org.apache.lucene.index)
updateDocument:1594, IndexWriter (org.apache.lucene.index)
addDocument:1213, IndexWriter (org.apache.lucene.index)
indexDoc:198, IndexFiles (com.dinosaur)
visitFile:155, IndexFiles$1 (com.dinosaur)
visitFile:151, IndexFiles$1 (com.dinosaur)
walkFileTree:2670, Files (java.nio.file)
walkFileTree:2742, Files (java.nio.file)
indexDocs:151, IndexFiles (com.dinosaur)
main:113, IndexFiles (com.dinosaur)

The core of Lucene's tokenization is incrementToken, which fetches the next token.

For example:

Lucene's standard tokenizer:

private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class); // a final singleton attribute

@Override
public final boolean incrementToken() throws IOException {
  ...
  scanner.getText(termAtt); // the scanner returns the next term and writes it into termAtt
  ...
}
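
On the consuming side, a typical loop over incrementToken looks like this; a sketch against the standard TokenStream contract (reset() before the first call, end() after the last):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamDemo {
  public static void main(String[] args) throws Exception {
    try (StandardAnalyzer analyzer = new StandardAnalyzer();
         TokenStream ts = analyzer.tokenStream("body", "hello world, i am dinosaur")) {
      // the same attribute instance is refilled on every incrementToken() call
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // hello, world, i, am, dinosaur
      }
      ts.end();
    }
  }
}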