The media have finally started to write some really nice reports on data sludge. I like this Wall Street Journal article, opening the black box of the science of secretly reading your emails.

If you use a smartphone, it is very likely that you have agreed to some app's terms and conditions, which allow them to download your emails en masse from one or more of your favorite cloud email providers, such as Gmail, Yahoo! Mail, Outlook, etc.

There was already an infamous example of this that came to light last year. Unroll Me is a service that helps you unsubscribe from unwanted mailing lists. When you set up this service, it requests access to your emails. That is how it finds out which mailing lists you can be unsubscribed from. It turns out that Unroll Me isn't really about helping you reduce email clutter - in fact, its main business is mining your inbox for shopping receipts, which can be sold to businesses that - you guessed it - want to sell you more stuff, which - you guessed it - probably means you'll receive more spam, net net. Oops. That was the sound when the company's management learned their little data sludge scheme went public.

For the story, I'm linking to this commentary in Venture Beat by someone who audaciously slammed management for audaciously claiming that their data sludge scheme was par for the course in the tech industry. This guy actually screamed: "The analogy between Unroll Me and Google or Facebook is audacious. Not to say haughty." He went on to claim that Google and Facebook keep all their data in-house. That was false in 2017, and looks even worse in light of recent revelations about data practices at those two companies.

Google, for example, has allowed, and recently expanded, third-party developers' access to Gmail emails, according to the Wall Street Journal. Just like Facebook, Google has no control over how these third parties use the data. It has some language requiring these developers to agree to certain standards, but those are unenforceable, and not enforced.

The WSJ article includes quotes from various participants in this data sludge industry that are false, intentionally or not.

First, they repeatedly claim they don't "read" our emails. Let's do a thought experiment here. I want to know if you are a racist. Unbeknownst to you, I got my hands on all the emails in your Gmail account stretching back 10 years. I write a computer program to look for various keywords like the N word. The program tabulates for me how many times you used each word, which days of the week you tend to say such words, which people you use those words with, the number of variations of each such word you have in your vocabulary, how many of your friends partake in such conversations and how many times they use racist terms, etc. Based on this report, I conclude that you are a racist. I may even conclude that certain friends of yours are also racist. According to various interviewees in the WSJ article, I drew that conclusion without "reading your emails."
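To make the thought experiment concrete, here is a minimal Python sketch of that kind of tabulation. Everything in it is made up for illustration: the `profile_mailbox` function, the message fields (`to`, `sent`, `body`), the sample messages, and the watchlist, which uses harmless placeholder words. It is not the code of any company mentioned in the article, just a rough picture of how little it takes.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical placeholder keywords; a real profiler would use its own list.
WATCHLIST = {"casino", "loan"}
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def profile_mailbox(emails, watchlist=WATCHLIST):
    """Tabulate watchlist hits by word, by weekday, and by correspondent."""
    by_word = Counter()
    by_weekday = Counter()
    by_recipient = Counter()
    for msg in emails:
        # Lowercase the body and split it into crude word tokens.
        tokens = re.findall(r"[a-z']+", msg["body"].lower())
        hits = [t for t in tokens if t in watchlist]
        if not hits:
            continue
        by_word.update(hits)
        by_weekday[DAY_NAMES[msg["sent"].weekday()]] += len(hits)
        by_recipient[msg["to"]] += len(hits)
    return {"by_word": by_word,
            "by_weekday": by_weekday,
            "by_recipient": by_recipient}


# Invented sample mailbox (assumed message layout, not a real email API).
emails = [
    {"to": "alice@example.com", "sent": date(2018, 7, 2),
     "body": "Lost again at the casino. Need a loan."},
    {"to": "bob@example.com", "sent": date(2018, 7, 3),
     "body": "Lunch on Friday?"},
    {"to": "alice@example.com", "sent": date(2018, 7, 9),
     "body": "Casino trip this weekend?"},
]

report = profile_mailbox(emails)
print(report["by_word"].most_common())
print(report["by_recipient"].most_common())
```

No human being opens a single message here, yet the report says what the account holder talks about, on which days, and with whom. Swap in a different watchlist and the same few lines produce a different label.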

(Lest you think the example is far-fetched, we recently heard that Facebook had tagged thousands of users with the label "treason," a segment which can be purchased by advertisers - or anyone willing to pay for this data.)

Second, the companies interviewed for the article, e.g. Return Path, basically claim that they have only had human beings read emails once or twice. That is simply a lie. You can't build any kind of predictive model without getting intimate with the data. Further, to understand how these models work, you have to review actual cases. Finally, when something unexpected happens, you have to look at the email contents to understand why.

These technologies have some possible benefits. If such benefits outweigh the potential harm, then consumers would gladly adopt them. The data industry should be much more transparent. Transparency would push developers to maximize benefits while reducing the levels of harm.


We've been tracking data sludge for years. For more, read this thread.