Introduction to Information Retrieval

Faster postings merges: Skip pointers/Skip lists
Recall basic merge

Walk through the two postings lists simultaneously, in time linear in the total number of postings entries.

Brutus: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128
Caesar: 1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31
If the list lengths are m and n, the merge takes O(m+n) operations. Can we do better? Yes (if the index isn’t changing too fast).
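As a concrete illustration, here is a minimal Python sketch of this linear merge; the function name and the example lists (taken from the Brutus/Caesar figure above) are purely illustrative.

```python
# Minimal sketch of the basic merge (intersection) of two sorted postings lists.
def intersect(p1, p2):
    """Walk both lists in step; O(len(p1) + len(p2)) comparisons."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])   # docID appears in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the list with the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(intersect(brutus, caesar))   # -> [2, 8]
```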
Augment postings with skip pointers (at indexing time)

Brutus: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers 2 -> 41 and 41 -> 128
Caesar: 1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers 1 -> 11 and 11 -> 31
Why? To skip postings that will not figure in the search results.
How? Where do we place skip pointers?
Query processing with skip pointers

Brutus: 2 -> 4 -> 8 -> 41 -> 48 -> 64 -> 128, with skip pointers 2 -> 41 and 41 -> 128
Caesar: 1 -> 2 -> 3 -> 8 -> 11 -> 17 -> 21 -> 31, with skip pointers 1 -> 11 and 11 -> 31
Suppose we’ve stepped through the lists until we have processed 8 on each list. We match it and advance. We then have 41 on the upper list and 11 on the lower list. 11 is smaller. But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings.
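A Python sketch of intersection with skip pointers follows. Representing skips as a dict from list positions to skip-target positions is only one illustrative choice; the skip placement mirrors the figure above.

```python
# Sketch of postings intersection that follows skip pointers where useful.
def intersect_with_skips(p1, skips1, p2, skips2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Follow skips on list 1 while the skip target does not overshoot p2[j].
            if i in skips1 and p1[skips1[i]] <= p2[j]:
                while i in skips1 and p1[skips1[i]] <= p2[j]:
                    i = skips1[i]
            else:
                i += 1
        else:
            if j in skips2 and p2[skips2[j]] <= p1[i]:
                while j in skips2 and p2[skips2[j]] <= p1[i]:
                    j = skips2[j]
            else:
                j += 1
    return answer

brutus = [2, 4, 8, 41, 48, 64, 128]
caesar = [1, 2, 3, 8, 11, 17, 21, 31]
brutus_skips = {0: 3, 3: 6}   # skip pointers 2 -> 41 and 41 -> 128
caesar_skips = {0: 4, 4: 7}   # skip pointers 1 -> 11 and 11 -> 31
print(intersect_with_skips(brutus, brutus_skips, caesar, caesar_skips))  # -> [2, 8]
```

In the trace described above, once 8 has been matched the algorithm sees 41 against 11, finds that 11’s skip target 31 is still ≤ 41, and jumps straight to 31, skipping 17 and 21.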
Where do we place skips?

Tradeoff: More skips means shorter skip spans, so a skip is more likely to be usable, but there are more comparisons against skip pointers. Fewer skips means few pointer comparisons, but then long skip spans mean few successful skips.
Placing skips

Simple heuristic: for a postings list of length L, use √L evenly-spaced skip pointers [Moffat and Zobel 1996]. This ignores the distribution of query terms. Easy if the index is relatively static; harder if L keeps changing because of updates. This definitely used to help; with modern hardware it may not, unless the index is memory-resident [Bahle et al. 2002].
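A minimal sketch of this heuristic, assuming the same dict-of-positions representation of skips used in the earlier sketch; the function name place_skips is just illustrative.

```python
import math

# Place roughly sqrt(L) evenly-spaced skip pointers on a postings list of
# length L, so each skip spans about sqrt(L) postings.
def place_skips(postings):
    L = len(postings)
    span = int(math.sqrt(L)) or 1          # skip span ~ sqrt(L)
    return {i: i + span for i in range(0, L - span, span)}

caesar = [1, 2, 3, 8, 11, 17, 21, 31]
print(place_skips(caesar))   # -> {0: 2, 2: 4, 4: 6} for span 2
```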
Positional postings and phrase queries

Many complex or technical concepts and many organization and product names are multiword compounds or phrases. Most recent search engines support a double-quotes syntax (“stanford university”) for phrase queries. As many as 10% of web queries are phrase queries, and many more are implicit phrase queries (such as person names), entered without double quotes.
1. Biword indexes

One approach to handling phrases is to consider every pair of consecutive terms in a document as a phrase. For example, the text Friends, Romans, Countrymen would generate the biwords:
friends romans
romans countrymen
In this model, we treat each of these biwords as a vocabulary term. The concept of a biword index can be extended to longer sequences of words; if the index includes variable-length word sequences, it is generally referred to as a phrase index.
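A minimal sketch of biword generation (the tokenization here is deliberately crude and purely illustrative):

```python
# Generate biwords: every pair of consecutive terms becomes a vocabulary term.
def biwords(text):
    terms = text.lower().replace(",", "").split()
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

print(biwords("Friends, Romans, Countrymen"))
# -> ['friends romans', 'romans countrymen']
```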
2. Positional indexes

A biword index is not the standard solution. Rather, a positional index is most commonly employed. Here, for each term in the vocabulary, we store postings of the form docID: ⟨position1, position2, . . .⟩, where each entry below lists docID, term frequency: ⟨positions⟩. For example:

to, 993427:
  ⟨1, 6: ⟨7, 18, 33, 72, 86, 231⟩;
   2, 5: ⟨1, 17, 74, 222, 255⟩;
   4, 5: ⟨8, 16, 190, 429, 433⟩;
   5, 2: ⟨363, 367⟩;
   7, 3: ⟨13, 23, 191⟩; . . .⟩

be, 178239:
  ⟨1, 2: ⟨17, 25⟩;
   4, 5: ⟨17, 191, 291, 430, 434⟩; . . .⟩
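One possible in-memory representation of such positional postings, shown here only as an illustrative sketch (it says nothing about the actual on-disk encoding):

```python
# Illustrative positional index: term -> {docID: [positions]}.
positional_index = {
    "to": {1: [7, 18, 33, 72, 86, 231],
           2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433],
           5: [363, 367],
           7: [13, 23, 191]},
    "be": {1: [17, 25],
           4: [17, 191, 291, 430, 434]},
}
```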
2. Positional indexes

To process a phrase query, we still need to access the inverted index entries for each distinct term. As before, we would start with the least frequent term and then work to further restrict the list of possible candidates. In the merge operation, the same general technique is used as before, but rather than simply checking that both terms are in a document, we also need to check that their positions of appearance in the document are compatible with the phrase query being evaluated.
Example: Satisfying phrase queries

Suppose the postings lists for ‘to’ and ‘be’ are as on the previous slide, and the query is “to be or not to be”. The postings lists to access are: to, be, or, not. We will examine intersecting the postings lists for ‘to’ and ‘be’. We first look for documents that contain both terms. Then, we look for places in the lists where there is an occurrence of ‘be’ with a token index one higher than a position of ‘to’, and then we look for another occurrence of each word with token index 4 higher than the first occurrence. In the above lists, the pattern of occurrences that is a possible match is:
to: ⟨. . . ; 4: ⟨. . . , 429, 433⟩; . . .⟩
be: ⟨. . . ; 4: ⟨. . . , 430, 434⟩; . . .⟩
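A minimal sketch of the positional check for the phrase “to be”, using the positions of document 4 from the lists above; the helper name phrase_matches is just illustrative.

```python
# Find positions of the first term that are immediately followed by the second.
def phrase_matches(pos_first, pos_second, offset=1):
    second = set(pos_second)
    return [p for p in pos_first if p + offset in second]

to_doc4 = [8, 16, 190, 429, 433]
be_doc4 = [17, 191, 291, 430, 434]
print(phrase_matches(to_doc4, be_doc4))   # -> [16, 190, 429, 433]
```

For the full query “to be or not to be” one would additionally require a second “to be” occurrence with token index 4 higher than the first; of the matches above, the pair at 429 and 433 satisfies this, which is the possible match shown in the lists.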