LSH based Similarity Search for Ranking Data using Shingling (Ba/Ma)
Topic for a bachelor/master's thesis
Information retrieval (IR) is one of the most important tasks in the field of information science. The goal of IR is to extract useful information from a potentially large and unstructured database. One relevant form of IR is similarity search, where one seeks to find similar objects within the unstructured database. To this end, the underlying data need to admit a sensible measure of similarity, first to design reasonable algorithmic solutions for similarity search, and second to assess the quality of the latter. Given such a measure of similarity, one could use the k-Nearest-Neighbors (kNN) combined with an efficient index structure such as an M-Tree, R-Tree, or K-D Tree to perform similarity search. However, this approach is known to be computationally costly, especially for high-dimensional data.
An alternative approach is locality-sensitive hashing (LSH), which is a hash-based indexing technique for similarity search yielding satisfactory empirical results if instantiated with suitable hash-functions . Although there are by now a variety of LSH approaches for various types of complex data available, data in the form of preferences or rankings have received very little attention so far. The only work in this regard is . This is somewhat surprising, as preference/ranking data have recently gained importance for practical applications, especially for the improvement of search engines and recommendation systems . The goal of the thesis is to design and implement an LSH approach for ranking data based on an iterative process using the concept of (w-)shingling to extract meaningful multisets of pairwise preferences from the rankings. Based on this, one idea is to adopt the weighted MinHash algorithm using the generalized Jaccard similarity to define a suitable distance measure .
Good background in data mining or machine learning, experience with experimental evaluation, programming skills.
Dr. Viktor Bengs, Prof. Eyke Hüllermeier
-  J. Leskovec, A. Rajaraman, J. D. Ullman. Mining of massive data sets. Cambridge university press, 2020.
-  K. Pal, S. Michel. Efficient similarity search across top-k lists under the Kendall’s tau distance. In Proceedings of the 28th International Conference on Scientific and Statistical Database Management, pp. 1 - 12, 2016.
-  J. Fürnkranz, E. Hüllermeier. Preference Learning. Berlin Heidelberg: Springer Science and Business Media, 2011.
-  W. Wu, B. Li, L. Chen, J. Gao, C. Zhang. A review for weighted minhash algorithms. IEEE Transactions on Knowledge and Data Engineering, 2020.