Detecting Document Versions and Their Ordering in a Collection

Abstract

Given the iterative and collaborative nature of authoring, and the need to adapt documents for different audiences, people end up with a large number of versions of their documents. These additional versions increase the cognitive effort humans need for tasks such as finding the latest version of a document or organizing documents, and may degrade the performance of machine tasks such as document clustering or recommendation. To the best of our knowledge, the task of identifying and ordering the versions of documents in a collection has not been addressed in prior literature. In this paper, we propose a three-stage approach for identifying versions and ordering them correctly. We also create a novel dataset for this purpose from Wikipedia, which we are releasing to the research community (https://github.com/natwar-modani/versions). We show that our proposed approach significantly outperforms a state-of-the-art approach adapted from the closest previously known task, Near Duplicate Detection, which justifies defining this problem as a novel challenge.

Publication
In International Conference on Web Information Systems Engineering 2021
Vaidehi Patil
First-year PhD student

My research interests include Natural Language Processing and Machine Learning.