ABOUT NORCOM
Intelligent message management / duplicate detection
The task
Huge numbers of news reports from various agencies flow into an editorial system every day. It is often unclear how the reports came about and from which sources the information originally came. However, knowledge of the sources is a prerequisite for checking the authenticity of the information and detecting false reports. Research is made more difficult by the large number of near-duplicates, and searches for content often yield too many redundant hits. A solution was created that identifies all reports similar to a given report, filters out duplicates and near-duplicates, and uses them to create a report history in the form of a family tree.
The challenge
Searching for similar messages requires comparing each message with every other message. Even with a relatively small number of messages, this approach reaches the limits of today's computing capacity.
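To illustrate why exhaustive comparison does not scale, the short sketch below counts the unordered pairs among n messages, n(n-1)/2. The function name and the message counts are illustrative assumptions, not figures from the project.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered message pairs among n messages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical archive sizes, chosen only to show the quadratic growth.
    for n in (1_000, 100_000, 1_000_000):
        print(f"{n:>9,} messages -> {pairwise_comparisons(n):,} comparisons")
```

The number of comparisons grows quadratically, so a tenfold increase in messages means roughly a hundredfold increase in work.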
Our solution
First, an algorithm was trained that recognizes new messages and automatically assigns them to message categories (sport, business, etc.) based on their content. However, the search space reduced in this way is still too large to find similar messages efficiently using standard methods. Locality-sensitive hashing was therefore used: each message is assigned a numerical hash in such a way that similar messages receive similar hashes. With the hashes stored in a look-up table, all messages similar to a given message can be retrieved easily, and a message history can be built from the data they contain.
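The hashing and look-up idea described above can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes SimHash as the locality-sensitive hash, a 64-bit fingerprint split into 4 bands, and an in-memory dictionary as the look-up table. All names (simhash, LookupTable, BITS, BANDS, the sample messages) are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

BITS = 64   # fingerprint length (assumption)
BANDS = 4   # number of bands used as look-up keys (assumption)


def tokens(text):
    """Lowercased word tokens of a message."""
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """SimHash fingerprint: messages sharing most tokens differ in few bits."""
    weights = [0] * BITS
    for tok in tokens(text):
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


class LookupTable:
    """Maps each band of a fingerprint to the ids of messages sharing it."""

    def __init__(self):
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def _keys(self, fp):
        width = BITS // BANDS
        mask = (1 << width) - 1
        return [(band, (fp >> (band * width)) & mask) for band in range(BANDS)]

    def add(self, msg_id, text):
        fp = simhash(text)
        self.fingerprints[msg_id] = fp
        for key in self._keys(fp):
            self.buckets[key].add(msg_id)

    def similar(self, text, max_distance=3):
        """Ids of stored messages within max_distance bits of the query."""
        fp = simhash(text)
        candidates = set()
        for key in self._keys(fp):
            candidates |= self.buckets.get(key, set())
        return sorted(
            m for m in candidates
            if hamming(fp, self.fingerprints[m]) <= max_distance
        )


if __name__ == "__main__":
    table = LookupTable()
    table.add("msg-1", "Central bank raises interest rates by a quarter point")
    table.add("msg-2", "Local team wins the cup final after extra time")
    print(table.similar("Central bank raises interest rates by a quarter point"))
```

With 4 bands and a maximum distance of 3 bits, any two fingerprints within that distance must agree exactly on at least one band, so the look-up only inspects messages that share a bucket instead of scanning the whole archive.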
The customer benefit
Thanks to deduplication, editors can focus on the essential news. Arranging the messages in a message tree supports research into how the information they contain originated.
Project characteristics
Our role
Support of the customer by data scientists and data engineers
Our activities
Automation of the preparation and indexing of documents
Establishing analysis and machine learning pipelines to classify the documents
Extraction of information from the unstructured document content
Technologies & methods
Applications: Eagle
Data / databases: Elastic, HBase
Languages / Frameworks: Python (Anaconda Stack), Hadoop, Spark
Methods: Natural Language Processing, Information Extraction, Machine Learning, Locality Sensitive Hashing