10 May Unlocking Hidden Insights Using Semantic Enrichment
How RhinoDox Uses Natural Language Processing (NLP) for Entity Extraction
When storing documents inside an Intelligent Content Platform, it’s standard practice to identify their key attributes, or metadata, and then manually extract and store those values alongside the document content for search and retrieval. But what about insights that can only be gathered by examining the document content itself? Reading the content by hand is simple for one or two documents, but it quickly becomes impractical when you are dealing with thousands.
That’s where RhinoDox comes in. Our platform automatically identifies entities like people, places and organizations across different types of documents and systems, which lets you build a complete picture of how all of those things interact with your organization.
When your documents are uploaded to the RhinoDox platform, we extract the content and run it through our NLP pipeline. Using NLP, we perform named entity recognition to find people, places and organizations. This is achieved by tokenizing each sentence and analyzing the parts of speech of the resulting tokens. The pipeline can be run with different models, allowing for customized analysis of different document types and of documents in languages other than English.
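To make the idea concrete, here is a minimal sketch of entity extraction in Python. It is not the RhinoDox pipeline: it swaps the trained model for a deliberately crude rule (runs of capitalized tokens, checked against small lookup tables), and the names `KNOWN_ORGS`, `KNOWN_PLACES`, `tokenize`, `classify` and `extract_entities`, as well as the sample sentence, are assumptions made up for illustration.

```python
import re

# Hypothetical lookup tables standing in for a trained NER model.
KNOWN_ORGS = {"RhinoDox"}
KNOWN_PLACES = {"Chicago"}

# A word token (letters, digits, apostrophes, hyphens) or one punctuation/digit character.
TOKEN_RE = re.compile(r"[A-Za-z][A-Za-z0-9'-]*|[^\sA-Za-z]")


def tokenize(text):
    """Split text into word tokens and single-character non-letter tokens."""
    return TOKEN_RE.findall(text)


def classify(span):
    """Assign a coarse label to a run of capitalized tokens, or None."""
    name = " ".join(span)
    if name in KNOWN_ORGS:
        return "ORG"
    if name in KNOWN_PLACES:
        return "PLACE"
    if len(span) >= 2:
        return "PERSON"  # crude guess: a multi-word capitalized run
    return None  # e.g. an ordinary capitalized word at the start of a sentence


def extract_entities(text):
    """Return (name, label) pairs for the capitalized token runs found in text."""
    entities = []
    span = []
    for token in tokenize(text) + [""]:  # the empty sentinel flushes the last run
        if token and token[0].isupper():
            span.append(token)
            continue
        if span:
            label = classify(span)
            if label:
                entities.append((" ".join(span), label))
            span = []
    return entities


if __name__ == "__main__":
    print(extract_entities("Marty McKenna works at RhinoDox in Chicago."))
```

A rule like this breaks easily (a capitalized job title such as "Senior Engineer" would be mislabeled as a person), which is exactly why a real pipeline relies on part-of-speech analysis and a trained model rather than capitalization alone.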
NLP is purely language-based, but we can further enrich our entities with external data sources like DBpedia, a programmatically queryable dataset extracted from Wikipedia. We could also introduce a third-party data source like Salesforce to get information that isn’t public, such as companies, job positions or even more sensitive internal data. As an example, my boss could upload a document like my employment contract into our RhinoDox account. From the insights tab, he would be able to see that a person named Marty McKenna is an employee at RhinoDox with a position of Senior Engineer. He can also see a list of related documents that reference the same entity, like my I-9, W-2, time off requests and more.
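The enrichment and related-documents behavior described here can be sketched roughly as follows. This is only an illustration under stated assumptions: `EntityIndex`, `directory_lookup`, `FAKE_DIRECTORY` and the file names are hypothetical, and the lookup is an in-memory stub rather than a real call to DBpedia or Salesforce.

```python
from collections import defaultdict

# Hypothetical stand-in for an external source such as a CRM or DBpedia.
FAKE_DIRECTORY = {
    "Marty McKenna": {"employer": "RhinoDox", "position": "Senior Engineer"},
}


def directory_lookup(name):
    """Return extra attributes for an entity, or an empty dict if unknown."""
    return dict(FAKE_DIRECTORY.get(name, {}))


class EntityIndex:
    """Maps each entity name to the documents that mention it."""

    def __init__(self, lookup):
        self._lookup = lookup  # callable: entity name -> dict of attributes
        self._docs = defaultdict(set)

    def add_document(self, doc_id, entity_names):
        """Record that doc_id mentions each of the given entities."""
        for name in entity_names:
            self._docs[name].add(doc_id)

    def insights(self, name):
        """Combine external attributes with the documents mentioning the entity."""
        return {
            "entity": name,
            "attributes": self._lookup(name),
            "related_documents": sorted(self._docs.get(name, ())),
        }


if __name__ == "__main__":
    index = EntityIndex(directory_lookup)
    index.add_document("employment-contract.pdf", ["Marty McKenna", "RhinoDox"])
    index.add_document("w-2.pdf", ["Marty McKenna"])
    print(index.insights("Marty McKenna"))
```

Passing the lookup in as a callable keeps the index independent of any particular data source, so a public source and an internal one could be swapped or combined without changing how documents are indexed.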
Semantically enriching your documents using NLP and external data sources allows you to add context and additional info that simply isn’t available with traditional ECM systems. The result is a powerful platform that allows you to gain new insight from existing documents. Check out our website to learn more.
Marty McKenna is a Senior Developer at RhinoDox. He spends much of his time working on the API that powers the RhinoDox platform. Prior to RhinoDox, he worked in search and semantics software, in the Digital Marketing field. When Marty’s not busy building graphs or optimizing queries, he enjoys writing music.