
Reinventing DLP with Natural Language Understanding


Arjun Sambamoorthy

Traditional DLP is outdated, costly, and ineffective. Learn how using NLU and DLP together can help solve the growing problem of data breaches.

The mission of every organization’s security staff is to prevent a data breach: the loss of confidential information to an untrusted external destination. It can happen online, across emails, shared documents, or physical assets.

There are three types of data leaks:

  • Data breaches and theft due to software application and network perimeter vulnerabilities.
  • Data leaks and credential theft resulting from Business Email Compromise (BEC) and targeted spear phishing attacks.
  • Accidental or malicious data loss, including sending sensitive information to a misaddressed recipient and sharing confidential data with unauthorized parties, both internal and external.

Data Loss Prevention or Data Leak Protection (DLP) should be a priority to prevent a data breach. However, with today’s security solutions, it is a challenging, burdensome, and costly process, and attacks are still getting through.

Natural Language Understanding, or NLU, is a newer machine learning technique that helps organizations better understand data semantics. NLU goes beyond pattern matching to provide a broader interpretation of threats based on historical data and behavior.

This article will discuss the shortcomings of traditional DLP, challenges in preventing data leaks, and the growing relationship between NLU and DLP in cybersecurity.

DLP: What’s Missing?

Organizations employ DLP and security solutions across their platform, from the network to the application layer, to protect their assets. These solutions work well to a certain extent: they inspect data and sensitive content, apply filters and blocking, and offer other remediation features. They also typically apply policies for the acceptable use of content or data under specific contexts.

But what these solutions lack is the ability to understand the data from a human language and communications point of view. This is important because humans author most communication within organizations – and most attacks happen through those communications.

Consider the 2021 Verizon DBIR, which includes information from 79,635 reported security incidents and 5,258 data breaches from 88 countries. It contains clues about the types of attacks that bypass traditional security solutions, featuring statistics like:

  • Email-based phishing attacks accounted for 36% of data breaches.
  • Employees are still the weak link in security; humans were involved in 85% of breaches.
  • BEC used for social engineering attacks increased 15 times (not percent!).
  • 10% of all data breaches involved ransomware.
  • The vast majority of organizations are neglecting threat detection and response.

It is imperative for security solutions and products to better understand textual communications and apply security measures based on that understanding.

The Limits of Pattern and Keyword Matching

Current DLP products rely heavily on pattern-matching regular expressions (regex) to identify and classify an organization’s sensitive and critical data. Regex helps search for a specific pattern in a blob of text, but it lacks a fundamental understanding of text semantics and context, resulting in significant false positives and alert fatigue for security analysts.

To illustrate this better, consider this statement:

"Jane thinks it is really cold at your place, and by the way, her email address is Please invite her to the party and keep your place warm."

A human reading the statement would clearly understand the context of “cold,” which refers to the temperature, not an illness.

On the other hand, most regex-based DLP solutions configured to identify HIPAA (Health Insurance Portability and Accountability Act) violations would flag this statement because it matches a keyword for a medical condition (cold) and contains PII (Jane’s name and her email address).
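To make the false positive concrete, here is a minimal Python sketch of a keyword-plus-pattern rule of the kind described above. It is an illustration under stated assumptions, not any vendor's actual rule set: the keyword list, the email pattern, the function name `naive_hipaa_flag`, and the placeholder address `jane@example.com` are all hypothetical.

```python
import re

# Hypothetical keyword list and email pattern; real DLP rule sets are far larger.
CONDITION_KEYWORDS = re.compile(r"\b(cold|flu|fever|diabetes)\b", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def naive_hipaa_flag(text: str) -> bool:
    """Flag text containing both a condition keyword and an email address."""
    has_condition = CONDITION_KEYWORDS.search(text) is not None
    has_email = EMAIL_PATTERN.search(text) is not None
    return has_condition and has_email


# The email address below is a hypothetical placeholder,
# not part of the original statement.
message = (
    "Jane thinks it is really cold at your place, and by the way, her email "
    "address is jane@example.com. Please invite her to the party and keep "
    "your place warm."
)

flagged = naive_hipaa_flag(message)
print("Flagged as possible HIPAA violation:", flagged)
```

The rule fires because it only checks whether the tokens are present. It has no way to tell that "cold" describes the temperature of a room rather than a diagnosis, which is the gap that language understanding is meant to close.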

Pattern matching still has value in identifying structured data, like spreadsheets. The evolutionary path toward better DLP is a hybrid approach that combines regex with newer machine learning (ML) techniques, including NLU, Natural Language Processing (NLP), and deep learning.

The Challenge of Managing and Enforcing Policies

Complying with data protection regulations such as HIPAA, SOX, or PCI DSS requires organizations to develop and implement data governance policies dictating sensitive data use, storage, and transmission. However, enforcing these policies can be challenging.

Organizations typically use multiple tools and services to implement their data governance frameworks. This requires:

  • Identifying and cataloging the data
  • Monitoring where the data exists 
  • Understanding various data access and/or usage levels

Ultimately, all of this has to be translated into security policy configurations on one or more DLP solutions, so security teams must constantly create and update policy rules. Unfortunately, new policies are often added without deleting or modifying existing ones because companies fear losing protection.

Deep Learning and Language Understanding

As organizations use hundreds of applications to share data, it can be challenging for security admins to maintain the necessary data governance controls. Manual policy management workflows cannot keep up with the evolutionary pace of modern collaboration apps like Slack, Teams, Dropbox, and Box.

The latest advances in deep learning and natural language models make it possible for technology to learn and understand human language context. This understanding enables policy enforcement automation for collaboration workflows, which is the approach we’re taking here at Armorblox.

Armorblox brings DLP to a new level of detection efficacy because ML techniques can automatically learn and classify what needs to be protected based on historical communication patterns.

Armorblox’s Advanced Data Loss Prevention Keeps Your Organization Safe

By analyzing the content exchanged and shared between individuals across different applications, Armorblox’s Advanced Data Loss Prevention can more accurately detect sensitive content with fewer false positives. This makes it easier to protect your organization’s data, enforce data governance policies, and remain compliant with data privacy regulations.

Here are two examples of Armorblox Advanced Data Loss Prevention in action:

  • Our DLP helps prevent you from accidentally sharing your W2 information with John Smith in Sales instead of John Smith in HR. Because it has learned each individual’s communication patterns from historical data over time, the system would know that it’s not normal to share this information with John Smith in Sales. It could then provide an alert, asking you to verify that this was your intention.
  • Our DLP creates policies that help protect organizations against accidental and malicious exposure of sensitive data. For example, if there is a top-secret project to be kept confidential between a group of six people, Armorblox can automatically raise alerts if people share data with unauthorized recipients outside the group.

NLU helps reduce false positives because it continuously learns and improves over time. By monitoring for attributes and patterns of communication, analyzing historical data, and consistently adjusting baselines for normal sharing of sensitive data, its detection efficacy continuously improves.
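The idea of a baseline for normal sharing can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Armorblox's implementation: it swaps the learned models for a plain frequency count, and the function names, the content categories, and the `corp.example` addresses are all hypothetical. It also assumes some earlier step has already classified each message into a category such as `tax_form`.

```python
from collections import Counter


def build_baseline(history):
    """Count how often each (sender, recipient, category) triple has occurred."""
    return Counter(history)


def is_unusual(baseline, sender, recipient, category, min_count=1):
    """Return True when this sender has shared this category with this
    recipient fewer than min_count times in the recorded history."""
    return baseline[(sender, recipient, category)] < min_count


# Hypothetical history of past sharing events: (sender, recipient, category).
history = [
    ("you@corp.example", "john.smith.hr@corp.example", "tax_form"),
    ("you@corp.example", "john.smith.hr@corp.example", "tax_form"),
    ("you@corp.example", "john.smith.sales@corp.example", "sales_report"),
]
baseline = build_baseline(history)

# Sending a tax form to John Smith in Sales has no precedent, so it is flagged.
alert_sales = is_unusual(
    baseline, "you@corp.example", "john.smith.sales@corp.example", "tax_form"
)
# Sending it to John Smith in HR matches past behavior, so it is not.
alert_hr = is_unusual(
    baseline, "you@corp.example", "john.smith.hr@corp.example", "tax_form"
)
print("Alert for Sales:", alert_sales, "| Alert for HR:", alert_hr)
```

A production system would need to update the counts as new messages arrive and would likely weigh recency and content similarity rather than rely on a fixed threshold, but the sketch shows why historical context can separate the two John Smiths where a keyword rule cannot.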

Learn more about how Armorblox is reinventing DLP with its NLU platform, and contact us for a customized demo of our solution.
