Data leakage in a Google world

By Barry Shteiman

The causes of data leakage range from simple misconfiguration or improper classification of data, which makes it possible for Web servers to publish private or sensitive information, to users storing sensitive data where they shouldn’t, whether unwittingly or not. Even established business practices such as site scraping can lead to leakage of information. Site scraping is a known method for gathering competitive intelligence in which company A automatically reads company B’s website for available data such as price tables, and uses it to cut its own prices and remain on top.

Search Engines

By their very nature, search engines are the Internet’s biggest and most public indexers. Search engines analyse websites, indexing them for the benefit of everyone who has ever done an Internet search. One urban legend even states that Google has a complete copy of the entire public Internet on file for data mining and analysis purposes.

As consumers and users of the Internet, we like to believe that most organisations do their best to remove sensitive information from their websites, FTP sites and other front-facing business applications. As it turns out, however, this is not always the case.

Google Tables Search

Google has always been a pioneer of search algorithms, search visibility and advanced indexing, remaining one step ahead of other search engines and introducing new ways to tag images by context, and even to index FTP sites for content. The recently added Table Search capabilities, however, really bring to the surface the impact of data leakage in an indexed world.

This is not to say that Google is causing data leakage, but abilities like “Indexed FTP”, “Search by image”, and now “Table Search” offer new ways to discover and extract data that would otherwise have remained undiscovered.

Here’s one scary example (you can think of other interesting tables: PII, salaries, credit card numbers, …)

We used:  


Which is a structured representation of:  

What is the security takeaway?

The takeaway here is that Web security is more important now than ever before. Obviously, it doesn’t make sense to block Google from indexing your site (a business driver), but you should be aware of what content you are allowing access to, and who is accessing it.

Companies should:

1. Implement web application security to mitigate hacker risk.

2. Validate the content that is accessible via their web servers on a regular basis and/or implement policies to check for outgoing data.

3. Implement policies to mitigate bots that may scrape their websites for available content.
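As a rough illustration of the second recommendation, the following Python sketch scans a page that a web server is about to serve and flags table cells that look sensitive. It is only a sketch under stated assumptions: the keyword list `SENSITIVE_HEADERS`, the helper names and the choice of a Luhn checksum to spot possible credit card numbers are all hypothetical choices made for this example, not a description of any particular product or policy.

```python
import re
from html.parser import HTMLParser

# Hypothetical header keywords that suggest a table holds sensitive data.
SENSITIVE_HEADERS = {"salary", "ssn", "password", "card number"}

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


class TableCellCollector(HTMLParser):
    """Collect the text of every <th> and <td> cell in an HTML page."""

    def __init__(self):
        super().__init__()
        self.cells = []   # list of (tag, text) pairs
        self._tag = None  # cell tag currently open, if any
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            self.cells.append((tag, "".join(self._buf).strip()))
            self._tag = None


def find_sensitive_cells(html: str) -> list:
    """Return human-readable findings for table cells that look sensitive."""
    collector = TableCellCollector()
    collector.feed(html)
    collector.close()
    findings = []
    for tag, text in collector.cells:
        if tag == "th" and text.lower() in SENSITIVE_HEADERS:
            findings.append(f"sensitive header: {text}")
        for match in CARD_CANDIDATE.finditer(text):
            digits = re.sub(r"\D", "", match.group())
            if luhn_valid(digits):
                findings.append("possible card number ending " + digits[-4:])
    return findings
```

A check like this could run periodically against a crawl of the organisation’s own public pages, or inline on outgoing responses. It will produce both false positives and false negatives, so its findings are best treated as prompts for manual review rather than as a verdict.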

The bottom line is that no one really wants to block Google from indexing their website. Controlling the content that the website serves, however, is important. Organisations should note that once content is up there, the “search giants” will index it, and with ever-evolving search mechanisms it will become easier to come across leaked information.

About the author

Barry Shteiman is Senior Security Strategist, and Tal Be’ery is Web Security Research Team Leader, at Imperva.
