norcom.de/News Deduplication

ABOUT NORCOM

Intelligent message management / duplicate detection

The task

Huge amounts of news reports from various agencies flow into an editorial system every day. It is often unclear how the reports came about and from which sources the information originally came. However, knowledge of the sources is a prerequisite for checking the authenticity of the information and detecting false reports. Research is made more difficult by the large number of near-duplicates, and content searches often return too many redundant hits. A solution was created that identifies all reports similar to a given report, filters out duplicates and near-duplicates, and uses them to build a report history in the form of a family tree.

The challenge

Searching for similar messages requires comparing each message with every other message. Even with a relatively small number of messages, this pairwise comparison reaches the limits of today's computing capacity.
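To make the scale concrete: comparing every message with every other one among n messages takes n(n-1)/2 comparisons. The short Python sketch below computes this for a few message volumes. The volumes are assumptions chosen for illustration, not figures from the project, and the function name is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered message pairs among n messages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Illustrative (assumed) message volumes, not figures from the project.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} messages -> {pairwise_comparisons(n):>15,} comparisons")
```

At 100,000 messages this is already roughly five billion comparisons, and the count grows quadratically with the number of messages.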

Our solution

First, an algorithm was trained that recognizes new messages and automatically assigns them to message categories (sport, business, etc.) based on their content. However, the search space reduced in this way is still too large to find similar messages efficiently using standard methods. A hashing technique (locality-sensitive hashing) was therefore used, which assigns each message a numerical hash in such a way that similar messages receive similar hashes. By storing the hashes in a look-up table, all messages similar to a given message can be retrieved quickly, and a message history can be created from the data they contain.
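The methods list for this project names Locality Sensitive Hashing, but the text does not say which scheme was used. As one possible illustration, the sketch below uses SimHash, a scheme in which similar texts tend to receive hashes that differ in only a few bits, and stores slices ("bands") of each hash in look-up tables. All names here (simhash, SimHashIndex, bands, max_distance) are hypothetical, and the whitespace tokenization is a simplifying assumption.

```python
import hashlib
from collections import defaultdict

BITS = 64  # length of the SimHash fingerprint (assumed)


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a single token."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str) -> int:
    """SimHash: each token votes +1/-1 per bit; the sign gives the final bit."""
    votes = [0] * BITS
    for token in text.lower().split():  # naive tokenization (assumption)
        h = token_hash(token)
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if votes[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


class SimHashIndex:
    """Look-up tables keyed by slices (bands) of the fingerprint."""

    def __init__(self, bands: int = 4):
        self.bands = bands
        self.width = BITS // bands
        self.tables = [defaultdict(set) for _ in range(bands)]
        self.hashes = {}

    def _keys(self, h: int):
        mask = (1 << self.width) - 1
        return [(h >> (b * self.width)) & mask for b in range(self.bands)]

    def add(self, msg_id: str, text: str) -> None:
        h = simhash(text)
        self.hashes[msg_id] = h
        for table, key in zip(self.tables, self._keys(h)):
            table[key].add(msg_id)

    def query(self, text: str, max_distance: int = 3):
        """Return ids of stored messages within max_distance bits of text."""
        h = simhash(text)
        candidates = set()
        # Only messages sharing at least one identical band are examined.
        for table, key in zip(self.tables, self._keys(h)):
            candidates |= table.get(key, set())
        return sorted(
            c for c in candidates if hamming(self.hashes[c], h) <= max_distance
        )


# Usage with made-up example messages.
index = SimHashIndex()
index.add("msg-1", "Central bank raises interest rates by half a point")
index.add("msg-2", "Local team wins the championship after extra time")
print(index.query("Central bank raises interest rates by half a point"))
```

With 4 bands, any stored fingerprint that differs from the query in at most 3 bits must match it exactly in at least one band, so no such message is missed. A larger max_distance would need more bands to keep that guarantee.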

The customer benefit

Thanks to deduplication, editors can focus on the essential news. Arranging the messages in a message tree supports research into how the information they contain originated.
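How the message tree is built is not described here. One simple way to picture it, offered only as a sketch under stated assumptions, is to link each message to the most similar earlier message. The Message fields, the Jaccard word-overlap similarity, and the 0.5 threshold below are all hypothetical choices. In a real system the candidates would come from the hash look-up table rather than from a scan over all earlier messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: str
    timestamp: int       # e.g. seconds since epoch (assumed field)
    words: frozenset     # set of words in the message (assumed representation)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of words two messages have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def build_tree(messages, threshold: float = 0.5) -> dict:
    """Map each message id to the id of its most similar earlier message.

    Messages with no sufficiently similar predecessor map to None and
    form the roots of the tree.
    """
    ordered = sorted(messages, key=lambda m: m.timestamp)
    parents = {}
    for i, msg in enumerate(ordered):
        best_id, best_score = None, threshold
        for earlier in ordered[:i]:
            score = jaccard(msg.words, earlier.words)
            if score >= best_score:
                best_id, best_score = earlier.msg_id, score
        parents[msg.msg_id] = best_id
    return parents


# Usage with made-up messages.
msgs = [
    Message("a", 1, frozenset({"bank", "raises", "rates"})),
    Message("b", 2, frozenset({"bank", "raises", "rates", "again"})),
    Message("c", 3, frozenset({"team", "wins"})),
]
print(build_tree(msgs))
```

Following the parent links from any message back to a root gives one candidate history of that message.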

Topic overview

Project characteristics

Our role

Support of the customer by data scientists and data engineers

Our activities

Automation of the preparation and indexing of documents
Establishing analysis and machine learning pipelines to classify the documents
Extraction of information from the unstructured document content

Technologies & methods

Applications: Eagle
Data / databases: Elastic, HBase
Languages / Frameworks: Python (Anaconda Stack), Hadoop, Spark
Methods: Natural Language Processing, Information Extraction, Machine Learning, Locality Sensitive Hashing
