Applying Human Language Understanding to DLP’s Biggest Challenges
Arjun Sambamoorthy, on Sep 20 2019
At every organization, the mission of security staff is to prevent a data breach – the loss of confidential company and client information to an untrusted external destination. It can happen online, across emails, across shared documents, or across physical assets.
Data Loss Prevention or Data Leak Protection (DLP) should be a priority in any proactive strategy to prevent a data breach. With today’s security solutions, however, it is a challenging, burdensome and costly process, and attacks are still getting through.
There are three different types of data leaks:
- Data breaches and theft due to vulnerabilities in an organization’s software applications and network perimeter.
- Data leaks and credential theft resulting from Business Email Compromise (BEC) and targeted spear phishing attacks.
- Accidental or malicious data loss, including accidentally sending confidential information to a misaddressed recipient with the same or similar name or address, or insiders sharing confidential data with outside parties.
Organizations employ DLP and security solutions across their platform, from the network to the application layer, in order to protect their assets. These solutions work well to a certain extent. The general methods these solutions use include inspecting data and sensitive content, and applying filters, blocking and other remediation features. They also typically apply policies for the acceptable use of content or data under specific contexts.
But what these solutions lack is the ability to understand the data from a human language and communications point of view. If you think about it, this is important because the majority of the communication within and across organizations is textual data authored by humans – and this is why today’s biggest attacks happen through human communications.
Consider the 2019 Verizon DBIR, which includes information from 41,686 reported security incidents and 2,013 data breaches from 86 countries. It includes clues about the types of attacks that bypass traditional security solutions, featuring statistics like:
- C-Suite executives are 12 times more likely to be targeted in social engineering attacks than other employees
- 90% of malware arrived via email
- 34% of attacks involved insiders
Security solutions and products need to understand textual communications better, and to apply security that is based on that understanding.
The Limits of Pattern and Keyword Matching
Current DLP products rely heavily on pattern matching with regular expressions (regex) to identify and classify data that is sensitive and critical to the organization. Regex is helpful for finding a specific pattern in a blob of text, but it has no understanding of the semantics and context of that text, which results in imprecise detection and excessive noise (false positives).
To illustrate this better, consider this statement, "Jane thinks it is really cold at your place, and by the way, her email address is email@example.com. Please invite her to the party and keep your place warm."
Most regex-based DLP solutions configured to identify Health Insurance Portability and Accountability Act (HIPAA) violations would flag this statement, because it matches a keyword for a medical condition (cold) and contains personally identifiable information (Jane’s name and her email address).
On the other hand, a human reading the statement would clearly understand that “cold” refers to the temperature, and not to a medical condition. Because a regex cannot make that distinction, statements like this one generate many false positives, leading to alert fatigue for security analysts.
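The false positive described above can be reproduced with a few lines of Python. This is only a minimal sketch of keyword and pattern matching, not the rule set of any real DLP product: the keyword list, the email pattern, and the function name naive_hipaa_flag are all assumptions made for illustration.

```python
import re

# Hypothetical rule set: a real DLP product would ship far larger lists.
CONDITION_KEYWORDS = re.compile(r"\b(cold|flu|diabetes|asthma)\b", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def naive_hipaa_flag(text):
    """Flag text that contains both a condition keyword and an email address.

    The check looks only at surface patterns; it has no notion of what
    the matched words mean in context.
    """
    conditions = CONDITION_KEYWORDS.findall(text)
    emails = EMAIL_PATTERN.findall(text)
    return bool(conditions and emails), conditions, emails


statement = (
    "Jane thinks it is really cold at your place, and by the way, "
    "her email address is email@example.com. Please invite her to "
    "the party and keep your place warm."
)

flagged, conditions, emails = naive_hipaa_flag(statement)
print(flagged, conditions, emails)
```

The statement is flagged even though "cold" describes the temperature, because the rule only checks that a condition keyword and an email address appear in the same text.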
Pattern matching still has value in identifying structured data, like spreadsheets. The evolutionary path toward better DLP is to have a hybrid approach of using regex along with newer machine learning (ML) techniques – including Natural Language Understanding (NLU), Natural Language Processing (NLP) and deep learning – to help organizations better understand the semantics of the data.
The Challenge of Managing and Enforcing Policies
Complying with data protection regulations such as HIPAA, SOX, or PCI DSS requires organizations to develop and enforce data governance policies that dictate the use, storage and transmission of sensitive data. Enforcing these policies can be challenging.
Organizations typically use multiple tools and services to implement their data governance frameworks. This requires identifying and cataloging the data, monitoring where the data resides, and understanding how different people need to access and use the data. Ultimately, all of this has to be translated into security policy configurations on one or more DLP solutions. It takes constant work and rework by the security team to create and update policy rules. Teams often add more policies without deleting or modifying existing ones, because they fear losing protection.
Deep Learning and Language Understanding
As organizations use hundreds of applications to share data, it can be challenging for security admins to maintain the necessary data governance controls because they lack context. Manual policy management workflows cannot keep up with the pace of evolution of modern collaboration apps like Slack, Teams, Dropbox and Box.
The latest advances in deep learning and natural language models make it possible to learn and understand the human language context and to automate policy enforcement for collaboration workflows. This is the approach we’re taking here at Armorblox.
We can bring DLP to a new level of detection efficacy and ease of use because these new ML techniques can automatically classify and learn what is confidential and what needs to be protected based on historical communication patterns. By analyzing the content exchanged and shared between individuals across different applications, the platform can more accurately detect sensitive content with fewer false positives. This makes it easier to protect your organization’s data, enforce data governance policies, and remain compliant with data privacy regulations.
For example, the Armorblox NLU platform would help prevent you from accidentally sharing your W-2 information with John Smith in sales instead of John Smith in HR, because it has learned the communication patterns of the different individuals from historical data over time. The system would know that it is not normal to share this type of information with John Smith in sales, and would raise an alert asking whether that is what you meant to do.
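To make the idea of a learned sharing baseline concrete, here is a deliberately simple sketch. It is not how the Armorblox platform works internally; it swaps the ML model for plain frequency counting, and it assumes that some upstream classifier has already labeled each message with a content category. The class name SharingBaseline, the category label "tax_form", and the email addresses are hypothetical.

```python
from collections import Counter


class SharingBaseline:
    """Count how often a sender has shared a content category with a recipient.

    A stand-in for a learned model: it only remembers past
    (sender, recipient, category) combinations.
    """

    def __init__(self, min_count=1):
        self.counts = Counter()
        # Fewer past observations than this is treated as unusual.
        self.min_count = min_count

    def observe(self, sender, recipient, category):
        """Record one historical message."""
        self.counts[(sender, recipient, category)] += 1

    def is_unusual(self, sender, recipient, category):
        """Return True if this kind of sharing has rarely or never happened."""
        return self.counts[(sender, recipient, category)] < self.min_count


baseline = SharingBaseline()
# Hypothetical history: tax forms have only ever gone to John Smith in HR.
for _ in range(3):
    baseline.observe("you@corp.example", "john.smith.hr@corp.example", "tax_form")

to_hr = baseline.is_unusual("you@corp.example", "john.smith.hr@corp.example", "tax_form")
to_sales = baseline.is_unusual("you@corp.example", "john.smith.sales@corp.example", "tax_form")
print(to_hr, to_sales)
```

In this sketch, sending a tax form to the John Smith in sales is reported as unusual because there is no history of it, which is the point at which a system could ask the sender to confirm the recipient.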
It could also create policies that help protect the organization from accidental or malicious data loss. For example, if there is a top secret project to be kept confidential between a group of six people, Armorblox can automatically raise alerts if the data is shared outside that group of people.
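The group-based check in this example can be sketched as a simple set comparison. This is an illustration under stated assumptions, not the product’s actual policy engine: the function name recipients_outside_group and the addresses are hypothetical, and the sketch assumes the message has already been identified as belonging to the confidential project.

```python
def recipients_outside_group(recipients, allowed_group):
    """Return the recipients who are not members of the allowed group.

    Addresses are compared case-insensitively. An empty result means the
    message stays inside the group and no alert is needed.
    """
    allowed = {address.lower() for address in allowed_group}
    return sorted(r for r in recipients if r.lower() not in allowed)


# Hypothetical six-person project group.
project_group = [
    "ana@corp.example",
    "ben@corp.example",
    "chen@corp.example",
    "dara@corp.example",
    "eli@corp.example",
    "fay@corp.example",
]

outsiders = recipients_outside_group(
    ["Ben@corp.example", "partner@vendor.example"], project_group
)
if outsiders:
    print("Alert: confidential project data shared with", outsiders)
```

The hard part in practice is the step this sketch assumes away: deciding which messages and documents belong to the confidential project in the first place.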
NLU also helps reduce false positives because it continuously learns and improves over time, monitoring platforms for attributes and patterns of communication. By analyzing historical data and adjusting its baselines for normal sharing of sensitive data, it improves its detection efficacy.