I'm reading the book "Introduction to Information Retrieval" and I have a practical doubt. The book has a chapter dedicated to index compression (dictionary and postings list), and for the postings list it explains how to encode the list of document IDs.
Example: the term "book" has the posting list "1 4 7 19 20 25", which means it is contained in these document IDs.
In this case the gaps between the IDs are "1 3 3 12 1 5", so with a linear scan we can retrieve each document ID by summing the previous gaps.
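As I understand it, decoding the gaps back into doc IDs is just a running sum, e.g.:

```python
from itertools import accumulate

gaps = [1, 3, 3, 12, 1, 5]
doc_ids = list(accumulate(gaps))  # running sum restores the original IDs
# doc_ids == [1, 4, 7, 19, 20, 25]
```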
Then it is explained that the gap between two DocIDs will be small in most cases, so it is possible to use an encoding such as variable byte codes or gamma encoding.
With gamma encoding we obtain:
- 1 -> 0
- 3 -> 101
- 3 -> 101
- 12 -> (binary 1100) -> (offset 100, length 3) -> (unary(3) = 1110) -> unary + offset = 1110100, which is the gamma code for 12
- 1 -> 0
- 5 -> (binary 101) -> (offset 01, length 2) -> (unary(2) = 110) -> unary + offset = 11001
So the result of my compression with gamma encoding for "1 3 3 12 1 5" is "01011011110100011001".
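To make my question concrete, this is my own sketch of the encoding step (assuming doc IDs start at 1 and are strictly increasing; gamma codes cannot represent 0):

```python
def gamma_encode(n: int) -> str:
    # Gamma code: unary code for the offset length, then the offset
    # (the binary representation of n with its leading 1 removed).
    offset = bin(n)[3:]  # strip '0b' and the leading 1
    return '1' * len(offset) + '0' + offset

def encode_gaps(doc_ids):
    # Turn a sorted list of doc IDs into gaps, then concatenate gamma codes.
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return ''.join(gamma_encode(g) for g in gaps)

# encode_gaps([1, 4, 7, 19, 20, 25]) == '01011011110100011001'
```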
My doubt is how to obtain this compression in practice. I don't understand how to actually store this data so that it reduces the space used. Right now I'm using pickle to store the posting list that I load from a txt file, which gets no benefit from it being a sorted list of integers:
import pickle

def pickle_data(path: str, posting_list):
    out = open(path, 'wb')
    pickle.dump(posting_list, out)
    out.close()
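What I imagine (but am not sure about) is that instead of pickling a Python list, I should pack the gamma bit string into bytes and write those raw bytes. This is my guess at how that could look (the 4-byte length header is my own assumption, just so a decoder would know where the zero padding starts):

```python
import struct

def bits_to_bytes(bits: str) -> bytes:
    # Pad with zeros up to a byte boundary, then pack 8 bits per byte.
    padded = bits + '0' * (-len(bits) % 8)
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))

def write_posting_list(path: str, bits: str):
    # Hypothetical on-disk layout: 4-byte little-endian bit count,
    # followed by the packed bytes.
    with open(path, 'wb') as out:
        out.write(struct.pack('<I', len(bits)))
        out.write(bits_to_bytes(bits))
```

With this, the 20-bit string for "1 3 3 12 1 5" would take 3 bytes of payload instead of a pickled list of Python ints.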
Can someone explain how to take advantage of this sorted list of (mostly) small integers?