Near Duplicates Chain Finder


The Near Duplicates Chain Finder is a Java program that finds chains of near-duplicate documents. It runs on Java platform 1.7 and can be used on Windows, Mac, UNIX, Linux, etc. It is an add-on to the Near Duplicates Finder, which searches for near-duplicate documents based on the internal text of each document. The Near Duplicates Finder works with different types of documents, including Plain Text, HTML, XML, PDF, Microsoft Office, OpenOffice, RTF, etc. For more information about the Near Duplicates Finder, see its home page listed below. A chain is an ordered collection of documents that starts with a root document and is sorted by document differences. The last document in a chain can be quite different from the first one, but the software lets you see the whole chain of changes in one set.
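The idea of a chain can be sketched in a few lines of Java. This is only an illustration, not the product's code: the class name ChainSketch, the Entry type, the file names, and the assumption that every document carries a single difference score relative to the root (0.0 for an exact duplicate) are all hypothetical. The actual software may measure and order differences another way.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a chain; names are illustrative and are not taken
// from the Near Duplicates Chain Finder itself.
class ChainSketch {

    // A document in a chain with an ASSUMED difference score relative to
    // the root document (0.0 means an exact duplicate of the root).
    static final class Entry {
        final String path;
        final double difference;

        Entry(String path, double difference) {
            this.path = path;
            this.difference = difference;
        }
    }

    // Returns the chain as an ordered list of paths: the root first, then
    // the other documents sorted by increasing difference.
    static List<String> buildChain(String rootPath, List<Entry> others) {
        List<Entry> sorted = new ArrayList<Entry>(others);
        Collections.sort(sorted, new Comparator<Entry>() {
            @Override
            public int compare(Entry a, Entry b) {
                return Double.compare(a.difference, b.difference);
            }
        });
        List<String> chain = new ArrayList<String>();
        chain.add(rootPath);
        for (Entry e : sorted) {
            chain.add(e.path);
        }
        return chain;
    }

    public static void main(String[] args) {
        List<Entry> others = new ArrayList<Entry>();
        others.add(new Entry("draft-v3.txt", 0.20));
        others.add(new Entry("copy-of-original.txt", 0.00));
        others.add(new Entry("draft-v2.txt", 0.05));
        // Prints the root followed by the other documents, least different first.
        System.out.println(buildChain("original.txt", others));
    }
}
```

In this sketch the exact copy lands directly after the root and the most heavily edited draft lands last, which mirrors the description above: the end of a chain may differ noticeably from its start.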

An example XML report of chains of near-duplicate documents is available, based on a collection of Project Gutenberg documents downloaded in October 2014. The collection contains 13,583 files located in 13,333 folders. The total size of the collection is 8.98 GB. (Note: The Project Gutenberg collection has no affiliation with SoftCorporation LLC and was selected for demonstration purposes only. Some documents may be missing online.) Each chain starts with the root (original) document, followed by exact duplicates and near-duplicate documents presented as a tree. It took more than 3 hours to extract and process the text of the documents in this collection, but less than 1.5 minutes to find near-duplicates and build chains for all of them.

The Chain Finder uses data produced by the Near Duplicates Cluster Finder. The chains are built based on the text similarity of the processed documents. The report shows two known issues:

1. Project Gutenberg documents have a large license text added to each document. If the book content is small relative to the license, the software reports false near-duplicates because the license text matches in each document.
2. Chinese encoding in text files is not processed properly, so the Chinese text is discarded. This again produces false near-duplicates, because only the matching license text remains.

The two largest clusters in the report display this problem.
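The first known issue can be illustrated with a small sketch. It uses Jaccard similarity over word sets, which is an assumption chosen for simplicity; the similarity measure actually used by the software is not described here. The class name LicenseOverlapSketch and the sample texts are hypothetical, and the license string is a short stand-in for the much longer real license.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: shows how shared boilerplate can dominate a simple
// word-overlap similarity score when the document body is short.
class LicenseOverlapSketch {

    // Splits text into a set of lower-case words.
    static Set<String> words(String text) {
        String trimmed = text.toLowerCase().trim();
        if (trimmed.isEmpty()) {
            return new HashSet<String>();
        }
        return new HashSet<String>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        Set<String> intersection = new HashSet<String>(wordsA);
        intersection.retainAll(wordsB);
        Set<String> union = new HashSet<String>(wordsA);
        union.addAll(wordsB);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Stand-in for the license text appended to every document.
        String license = "this ebook is for the use of anyone anywhere at no cost";
        String bookA = "short poem about rain";
        String bookB = "brief essay on trade";

        // The two bodies share no words at all.
        System.out.println(jaccard(bookA, bookB));
        // With the same license appended, most words now match.
        System.out.println(jaccard(bookA + " " + license, bookB + " " + license));
    }
}
```

Two unrelated short texts score zero on their own, but once the same license is appended to both, the shared words outnumber the distinct ones and the pair looks like a near-duplicate. Removing known boilerplate before comparison would be one way to avoid this, though whether the software offers such an option is not stated here.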

The home page for the Near Duplicates Finder software can be found on the SoftCorporation LLC web site (http://www.softcorporation.com/products/neardup). There you can also find the latest release, as well as any other information you might need about this project. For commercial use of the software, please contact us by email: info@softcorporation.com.

For legal and licensing issues, please read the LICENSE.TXT file. This product uses the Derby and Log4J Java software developed by The Apache Software Foundation (http://www.apache.org/). For the Apache License, see LICENSE-3RDPARTY.TXT.



For more information, send a request to: info@softcorporation.com

