Near Duplicates Cluster Finder

The Near Duplicates Cluster Finder is a Java program that finds clusters of near-duplicate documents. It runs on Java platform 1.7 and can be used on Windows, Mac, UNIX, Linux, and other systems. It is an add-on to the Near Duplicates Finder, which searches for near-duplicate documents based on the internal text of each document. The Near Duplicates Finder works with many types of documents, including plain text, HTML, XML, PDF, Microsoft Office, OpenOffice, and RTF. See the Near Duplicates Finder page for more information.

An example XML report for clusters of near-duplicate documents is available, based on a collection of Enron corporate e-mails. The collection contains 517,431 files located in 3,499 folders, with a total size of 1.32 GB. The Enron e-mail library was made public by the Federal Energy Regulatory Commission and posted to the web during its investigation. Some e-mails were removed because of privacy concerns. (Note: The storage of Enron e-mails on the Internet has no affiliation with SoftCorporation LLC; the collection was selected here for demonstration purposes only. Some documents may be missing.) Each cluster starts with the pivot document, followed by the list of exact duplicates and near-duplicate documents sorted by similarity score. Only the internal text of a document is considered for comparison: if the same e-mail was sent to different people on different dates, even with different subjects, it is still considered the same e-mail. The score represents how close the text of the pivot is to the text of the near duplicate. For example, documents enron/maildir/allen-p/all_documents/370 and enron/maildir/smith-m/inbox/133 are found to be near duplicates with score = 0.93.

You can also view the near-duplicate documents as a chain, which is built by the Near Duplicates Chain Finder. See the Near Duplicates Chain Finder page for more information. A chain is an ordered collection of documents, starting from a root document and sorted by document differences. The last document in a chain can be quite different from the first one, but the software lets you see the whole sequence of changes in one set.

Depending on the configuration of the Near Duplicates Cluster Finder, pre-processing all documents in the Enron e-mail collection can take from 1 to 3 hours (running on a laptop with a single thread). Building clusters for all Enron documents can then take up to another 3 hours. Once the clusters are identified, however, different reports can be created within a couple of minutes: for example, a report of all documents similar to a selected one, or a report that sorts the documents in a cluster by size, last-modified date, or file name. You can also quickly add documents to or remove documents from an existing collection.

The Near Duplicates Cluster Finder can be used as a background tool integrated with other software. Besides the XML report, output can be produced in a text format suitable for processing by other software:

2013/12/23 20:33:41,1,0,1.00,"\collections\enron\maildir\allen-p\all_documents\356"
2013/12/23 20:33:41,1,1,1.00,"\collections\enron\maildir\allen-p\discussion_threads\551"
2013/12/23 20:33:41,1,2,1.00,"\collections\enron\maildir\allen-p\notes_inbox\45"
2013/12/23 20:33:41,1,3,0.96,"\collections\enron\maildir\grigsby-m\all_documents\111"
2013/12/23 20:33:42,1,4,0.96,"\collections\enron\maildir\grigsby-m\discussion_threads\227"
2013/12/23 20:33:42,1,5,0.96,"\collections\enron\maildir\grigsby-m\notes_inbox\111"
2013/12/23 20:33:42,1,6,0.95,"\collections\enron\maildir\ermis-f\all_documents\224"
2013/12/23 20:33:42,1,7,0.95,"\collections\enron\maildir\ermis-f\discussion_threads\294"
2013/12/23 20:33:42,1,8,0.95,"\collections\enron\maildir\ermis-f\notes_inbox\173"
2013/12/23 20:33:42,1,9,0.95,"\collections\enron\maildir\smith-m\inbox\166"
2013/12/23 20:33:42,1,10,0.94,"\collections\enron\maildir\jones-t\all_documents\11559"
2013/12/23 20:33:42,1,11,0.94,"\collections\enron\maildir\jones-t\notes_inbox\4542"
2013/12/23 20:33:42,1,12,0.94,"\collections\enron\maildir\skilling-j\deleted_items\502"
2013/12/23 20:33:42,1,13,0.94,"\collections\enron\maildir\skilling-j\inbox\370"
2013/12/23 20:33:42,1,14,0.94,"\collections\enron\maildir\mclaughlin-e\all_documents\1010"
2013/12/23 20:33:42,1,15,0.94,"\collections\enron\maildir\mclaughlin-e\discussion_threads\839"
2013/12/23 20:33:42,1,16,0.94,"\collections\enron\maildir\mclaughlin-e\notes_inbox\24"
2013/12/23 20:33:42,1,17,0.94,"\collections\enron\maildir\storey-g\deleted_items\142"
2013/12/23 20:33:42,1,18,0.94,"\collections\enron\maildir\mckay-j\perfmgmt\17"
2013/12/23 20:33:42,1,19,0.94,"\collections\enron\maildir\campbell-l\all_documents\1761"
2013/12/23 20:33:42,1,20,0.94,"\collections\enron\maildir\campbell-l\discussion_threads\1592"
2013/12/23 20:33:42,1,21,0.94,"\collections\enron\maildir\campbell-l\notes_inbox\588"
2013/12/23 20:33:42,1,22,0.94,"\collections\enron\maildir\mckay-j\perfmgmt\18"
2013/12/23 20:33:42,1,23,0.94,"\collections\enron\maildir\hendrickson-s\inbox\6"
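
As an illustration of how other software might consume this text report, here is a minimal Java sketch that parses the lines and groups them by cluster. The column meanings are assumptions inferred from the sample above, not from a published format description: a timestamp, a cluster number, a position within the cluster (0 presumably being the pivot), the similarity score, and the quoted document path. The class and method names (ClusterReportReader, Entry, parseLine, groupByCluster) are hypothetical and are not part of the product.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Near Duplicates Cluster Finder.
final class ClusterReportReader {

    /** One report line. Field meanings are inferred from the sample output. */
    static final class Entry {
        public final String timestamp; // first column, kept as text
        public final int clusterId;    // assumed: second column
        public final int position;     // assumed: third column, 0 = pivot
        public final double score;     // fourth column, similarity score
        public final String path;      // fifth column, surrounding quotes removed

        Entry(String timestamp, int clusterId, int position, double score, String path) {
            this.timestamp = timestamp;
            this.clusterId = clusterId;
            this.position = position;
            this.score = score;
            this.path = path;
        }
    }

    /** Parses a single report line into an Entry. */
    static Entry parseLine(String line) {
        // Limit the split to 5 parts so commas inside the path are preserved.
        String[] parts = line.split(",", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected 5 fields: " + line);
        }
        String path = parts[4].trim();
        // Strip one pair of surrounding double quotes, if present.
        if (path.length() >= 2 && path.startsWith("\"") && path.endsWith("\"")) {
            path = path.substring(1, path.length() - 1);
        }
        return new Entry(parts[0].trim(),
                Integer.parseInt(parts[1].trim()),
                Integer.parseInt(parts[2].trim()),
                Double.parseDouble(parts[3].trim()),
                path);
    }

    /** Groups report lines by cluster number, keeping the original line order. */
    static Map<Integer, List<Entry>> groupByCluster(List<String> lines) {
        Map<Integer, List<Entry>> clusters = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Entry entry = parseLine(line);
            List<Entry> members = clusters.get(entry.clusterId);
            if (members == null) {
                members = new ArrayList<>();
                clusters.put(entry.clusterId, members);
            }
            members.add(entry);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Two lines copied from the sample report above.
        List<String> sample = Arrays.asList(
            "2013/12/23 20:33:41,1,0,1.00,\"\\collections\\enron\\maildir\\allen-p\\all_documents\\356\"",
            "2013/12/23 20:33:41,1,3,0.96,\"\\collections\\enron\\maildir\\grigsby-m\\all_documents\\111\"");
        for (Entry entry : groupByCluster(sample).get(1)) {
            System.out.println(entry.position + " " + entry.score + " " + entry.path);
        }
    }
}
```

The sketch does not handle escaped quotes inside a path, since the sample does not show how the report would encode them.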

The home page for the Near Duplicates Finder software can be found on the SoftCorporation LLC web site: http://www.softcorporation.com/products/neardup. There you can also find the latest release, as well as any other information you might need about this project. For commercial use of the software, please contact info@softcorporation.com.

For legal and licensing issues, please read the LICENSE.TXT file. This product uses the Derby and Log4J Java software developed by The Apache Software Foundation (http://www.apache.org/). See the Apache License in LICENSE-3RDPARTY.TXT.

For more information, send a request to info@softcorporation.com.
