3 Inverted File Index¶
怎么构建搜索引擎¶
How can I find in which retrieved web pages?
Solution 1: Scan each page for the string “Computer Science”
Solution 2: Term-Document Incidence matrix
Solution 3: Compact Version - Inverted File Index
- Index is a mechanism for locating a given term in a text.
- Inverted file contains a list of pointers (e.g. the number of a page) to all occurrences of that term in the text.
Why do we keep “times” (frequency)?
取交集的时候,从频率小的词开始做,求交的时候效率最高
Index Generator
while ( read a document D ) {
while ( read a term T in D ) {
if ( Find( Dictionary, T ) == false )
Insert( Dictionary, T );
Get T’s posting list;
Insert a node to T’s posting list;
}
}
Write the inverted index to disk;
- Token Analyzer
- Vocabulary Scanner
- Vocabulary Insertor
- Memory management
分词¶
Word Stemming
Processing a word so that only its stem or root form is left.
比如说将Process, processing, processes, processed 记录为process
Stop Words
Some words are so common that almost every document contains them, such as “a” “the” “it”. It is useless to index them. They are called stop words. We can eliminate them from the original documents.
文章扫描¶
Solution 1: Search trees (B- trees, B+ trees, Tries, ...)
Solution 2: Hashing
What are the pros and cons of using hashing, comparing to using search trees?
优势是检索单个单词更快,劣势是不能范围查找,对多个连续单词结合起来检索较慢 (scanning in sequential order is not possible)
Distributed indexing¶
—— Each node contains index of a subset of collection
Solution 1: Term-partitioned index
劣势在于容灾能力较弱
Solution 2: Document-partitioned index
前者按照关键词将文件存在不同的主机上,全局建立索引,后者按照文件号将文件存在不同主机上,在每台主机上建立局部索引。前者查找性能更高,而后者更加可靠。
然而索引是动态变化的,索引需要定时更新,并建立主索引和辅助索引。
最简单的更新办法是周期性地对文档集从头开始进行索引重构。如果要求能够及时检索到新文档,那么一种方法是同时保持两个索引:一个是大的主索引,保存在磁盘中,另一个是小的用于存储新文档信息的辅助索引,辅助索引保存在内存中。检索时可以同时遍历两个索引并将结果合并。而文档的删除记录在一个无效位向量中,在返回检索结果之前可以利用它过滤掉已经删除的文档。每当辅助索引变得很大时,就将它合并到主索引中。
Compressing¶
一是将词典看为单一字符串,以消除用定长方法来存储单词所存在的空间浪费;二是docID的存储只记录与上一项docID的差值来减少docID存储长度。
Thresholding (阈值)¶
搜索时只关心前面的x%来提高效率
Document: only retrieve the top x documents where the documents are ranked by weight
- Not feasible for Boolean queries
比如说搜索computer science, 可能computer重要的文章和science重要的文章求交的时候并未搜到合起来的重要的文章。
- Can miss some relevant documents due to truncation
Query: Sort the query terms by their frequency in ascending order; search according to only some percentage of the original query terms
对出现的关键词在term dictionary中的频率进行排序,从频率较低的单词开始搜索(或者说定义一个值:这个词在这篇文章的频率/这个词在所有文章中的频率,这个值越高,那么这个词对于这篇文章越重要)
Measures for a search engine¶
客观指标¶
How fast does it index 建表的速度
- Number of documents/hour
How fast does it search 搜索的速度
- Latency as a function of index size
Expressiveness of query language 复杂搜索的能力
- Ability to express complex information needs
- Speed on complex queries
主观指标¶
Data Retrieval Performance Evaluation (after establishing correctness) 数据检索性能评估
-
Response time 相应时间
-
Index space 索引占用空间
Information Retrieval Performance Evaluation 信息检索性能评估
- How relevant is the answer set?
怎么定义答案集的相关程度?
Relevance measurement requires 3 elements:
- A benchmark document collection
- A benchmark suite of queries
- A binary assessment of either Relevant or Irrelevant for each query-doc pair
Relevant | Irrelevant | |
---|---|---|
Retrieved | \(R_R\) | \(I_R\) |
Not Retrieved | \(R_N\) | \(I_N\) |
Precision: \(P=\frac{R_R}{R_R+I_R}\) (精确度,检索到的东西有多少是有效的)
Recall: \(R=\frac{R_R}{R_R+R_N}\)(所有有意义的文档中,有多少是被检索到的)
评估曲线的一种方法:AUC (Areas under Curves) 曲线下方的面积