Near Duplicates Chain Finder

The Near Duplicates Chain Finder is a Java program that finds chains of near-duplicate documents. It runs on Java platform 1.7 and can be used on Windows, Mac, UNIX, Linux, etc. It is an addition to the Near Duplicates Finder, which searches for near-duplicate documents based on the internal text of a document. The Near Duplicates Finder works with different types of documents, including plain text, HTML, XML, PDF, Microsoft Office, OpenOffice, RTF, etc. More information about the Near Duplicates Finder is available on the product website. A chain is an ordered collection of documents that starts with a root document and is sorted by document differences. The last document in a chain can be quite different from the first one; however, the software lets you see the whole chain of changes in one set.
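To make the idea of a chain more concrete, here is a minimal sketch of one possible ordering. It is not the product's implementation: the class `ChainSketch`, the `Entry` type, the `buildChain` method, the file names, and the difference scores are all hypothetical, and it assumes that "sorted by document differences" means sorted by each document's difference from the root, with 0.0 standing for an exact duplicate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ChainSketch {

    /** Hypothetical chain entry: a document name and its difference from the root (0.0 = exact duplicate). */
    static final class Entry {
        final String name;
        final double difference;

        Entry(String name, double difference) {
            this.name = name;
            this.difference = difference;
        }
    }

    /** Returns the root followed by the other documents in order of increasing difference from the root. */
    static List<String> buildChain(String root, List<Entry> others) {
        List<Entry> sorted = new ArrayList<Entry>(others);
        Collections.sort(sorted, new Comparator<Entry>() {
            @Override
            public int compare(Entry a, Entry b) {
                return Double.compare(a.difference, b.difference);
            }
        });
        List<String> chain = new ArrayList<String>();
        chain.add(root); // the root (original) document always comes first
        for (Entry e : sorted) {
            chain.add(e.name);
        }
        return chain;
    }

    public static void main(String[] args) {
        // Made-up documents and scores, for illustration only.
        List<Entry> others = new ArrayList<Entry>();
        others.add(new Entry("draft-v3.txt", 0.12));
        others.add(new Entry("copy-of-original.txt", 0.0));
        others.add(new Entry("draft-v2.txt", 0.05));
        System.out.println(buildChain("original.txt", others));
    }
}
```

With this ordering, exact duplicates sit right after the root and the most heavily changed documents come last, which matches the description of a chain whose last document can differ considerably from the first.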

An example XML report of chains of near-duplicate documents is available, based on a collection of Project Gutenberg documents downloaded in October 2014. The collection contains 13,583 files located in 13,333 folders. The total size of the collection is 8.98 GB. Each chain starts with the root (original) document, followed by exact duplicates and near-duplicate documents presented as a tree. The first cluster contains many small documents whose content is comparable in size to the Gutenberg license, and the license text causes such documents to be reported as duplicates. On a regular home PC it took more than 3 hours to extract and process the documents from the Gutenberg collection, but less than 1.5 minutes to find all near-duplicates and build the chains.

The Chain Finder uses data produced by the Near Duplicates Cluster Finder. The chains are built based on the text similarity of the processed documents. The report shows two known issues:

1. Project Gutenberg documents have a large license text added to each document. If the content is small in relation to the size of the license, the software will report false near-duplicates because of the matching license text in each document.
2. The Chinese encoding in text files is not properly processed, resulting in Chinese text being discarded, which again produces false near-duplicates because of the long license text.

The first two (largest) clusters in the report display this problem.
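The effect of shared license text can be illustrated with a small sketch. It does not reproduce the similarity measure used by the software, which is not described here. As a stand-in it uses Jaccard similarity over sets of words, and the class `LicenseOverlapSketch`, the boilerplate string, and the two short document bodies are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class LicenseOverlapSketch {

    /** Splits text into a set of lower-case words. */
    static Set<String> words(String text) {
        return new HashSet<String>(Arrays.asList(text.toLowerCase().split("\\s+")));
    }

    /** Jaccard similarity: shared words divided by all distinct words in both texts. */
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        Set<String> common = new HashSet<String>(wordsA);
        common.retainAll(wordsB);
        Set<String> union = new HashSet<String>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) common.size() / union.size();
    }

    public static void main(String[] args) {
        // Invented boilerplate standing in for a long license block.
        String license = "this ebook is for the use of anyone anywhere at no cost "
                + "and with almost no restrictions whatsoever";
        // Two unrelated, very short document bodies (also invented).
        String bodyA = "red apples";
        String bodyB = "blue whales";

        // With the license attached, the shared boilerplate dominates the comparison.
        System.out.println("with license:    " + jaccard(license + " " + bodyA, license + " " + bodyB));
        // Without it, the two bodies have no words in common.
        System.out.println("without license: " + jaccard(bodyA, bodyB));
    }
}
```

When the body is short relative to the boilerplate, the two texts score as highly similar even though their actual content is unrelated. The same thing happens when the body is discarded entirely, as with the unprocessed Chinese text, because only the license remains to be compared.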

The home page for the Near Duplicates Finder software can be found on the SoftCorporation LLC website: http://www.softcorporation.com/products/neardup. There you can also find the latest release, as well as any other information you might need regarding this project.

For legal and licensing issues, please read the LICENSE.TXT file. This product uses the Derby and Log4J Java software developed by The Apache Software Foundation (http://www.apache.org/). See the Apache License in LICENSE-3RDPARTY.TXT.

For more information send a request to: info@softcorporation.com
