Last week I was contacted by someone alerting me to the presence of a spam list. A big one. That’s a bit of a relative term though because whilst I’ve loaded “big” spam lists into Have I been pwned (HIBP) before, the largest to date has been a mere 393m records and belonged to River City Media. The one I’m writing about today is 711m records which makes it the largest single set of data I’ve ever loaded into HIBP. Just for a sense of scale, that’s almost one address for every single man, woman and child in all of Europe. This blog post explains everything I know about it.
Firstly, the guy who contacted me is Benkow moʞuƎq and he’s done some really interesting malware and spambot analysis in the past. During our communication over the last week, I had a read of his piece on Spambot safari #2 – Online Mail System which is a good example of the sort of work he’s been doing (it’s also a good example of how dodgy some of this spammer code is!). He went on to explain how he’d located a machine used by the “Onliner Spambot” and pointed me to a path on an IP address with directory listing enabled:
I’ve obfuscated a bunch of info here because as of the time of writing, the server is still up and I don’t want to give away any information that could result in the data being spread further. The IP address is actually based in the Netherlands and Benkow and I have been in touch with a trusted source there who’s communicating with law enforcement in an attempt to get it shut down ASAP. Until that time, I’m not going to share file names in their entirety although I’ll certainly describe anything of relevance in them.
Before I dive into the data, Benkow has posted a dedicated piece on the mechanics of this spambot that’s worth a read. You can also find a great story on ZDNet from Zack Whittaker which is a good overview of the situation. The gap I want to fill here is to explain what I can about the data because there’ll be a very large number of people finding themselves on HIBP and wondering what on earth is going on. If you haven’t already read Benkow’s piece, there are 2 important classes of data you need to understand:
- Email addresses. That’s it – just masses and masses of email addresses used to deliver spam to. In some cases, a single file may contain tens or even hundreds of millions of addresses.
- Email addresses and passwords. Benkow explains that these are used in an attempt to abuse the owners’ SMTP server in order to deliver spam. I also believe that many of these may simply be aggregations from various other breach sources I’ll talk about a little later on.
Getting on to the data itself, the first place to start is with an uncomfortable truth: my email address is in there. Twice:
That first file is the 14GB one from the earlier directory listing whilst the second is 131MB. In many cases, I found the same data in both the former larger file and a subsequent smaller one. Interestingly, as you can see from the suffix above, both refer to “UK” (I’m certainly not from the United Kingdom) whilst others refer to “AU” (although I’m not in there). There are no other 2 letter country codes represented in the file names but clearly when we’re talking many hundreds of millions of addresses here, a heap of them are from other locations so take those suffixes with a grain of salt.
One of the files with the “NewFile_” prefix contained over 43k rows associated with the Roads and Maritime Services of my neighbouring state here in Australia:
Every row contains [email protected] in quotes followed by “[email protected]” and then predominantly .com.au domains, albeit with over 13k .ru domains. This email address is used to send notifications relating to the “E-Tag” device installed on your car windscreen so that you can pay tolls. I know this because I’ve received a bunch of them in the past:
I’ll take a stab at it and say that there aren’t many legitimate drivers using the New South Wales toll road system with Russian email addresses! Clearly, the constant alias on every one of these accounts is auto-generated. Interestingly, I saw a similar pattern with the B2B USA Businesses spam list I loaded last month with many comments like this:
I received a domain alert on this one. Went through the process, turned out to be an invented address ([email protected]). Address doesn’t exist.
— Peter Bance (@peterbance) July 19, 2017
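Going back to the toll notification rows for a moment: a split like "mostly .com.au, but over 13k .ru" is the kind of thing a quick suffix tally brings out. Here's a minimal sketch of that idea, not the tooling actually used on this data. The function names, the list of suffixes and the sample addresses are all made up for illustration.

```python
from collections import Counter

# Hypothetical suffixes of interest; anything else falls back to the last label.
KNOWN_SUFFIXES = (".com.au", ".gov.au", ".ru")


def domain_suffix(address: str, known_suffixes=KNOWN_SUFFIXES) -> str:
    """Return the first matching known suffix, else "." plus the last domain label."""
    domain = address.rsplit("@", 1)[-1].strip().lower()
    for suffix in known_suffixes:
        if domain.endswith(suffix):
            return suffix
    return "." + domain.rsplit(".", 1)[-1]


def tally_suffixes(addresses):
    """Count addresses per domain suffix, skipping rows with no "@" at all."""
    return Counter(domain_suffix(a) for a in addresses if "@" in a)


# Invented sample rows, not taken from the dump.
sample = [
    "alice@example.com.au",
    "bob@example.ru",
    "carol@example.com.au",
    "dave@example.org",
    "not-an-address",
]
counts = tally_suffixes(sample)
for suffix, total in counts.most_common():
    print(suffix, total)
```

A tally like this says nothing about whether an address is real; it only shows where the domains cluster, which is enough to make an odd pattern stand out.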
There’s also some pretty poorly parsed data in there which I suspect may have been scraped off the web. For example, Employe[email protected]bowelcanceruk.org.uk appears twice:
The first file is the same one my own email address was in and the second is the same file name structure albeit with a different number in it. And if you’re wondering why I’ve publicly listed someone else’s address, it’s because it’s already publicly listed:
But of course, the data in the dump has a bunch of junk prefixed to the address, junk which appears to be an HTML file name and may indicate the “address” was scraped off the web and the parsing simply wasn’t done very well. The point here is that there’s going to be a bunch of addresses here that simply aren’t very well-formed so whilst the “711 million” headline is technically accurate, the number of real humans in the data is going to be somewhat less.
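To make the "poorly parsed" point concrete, here's a rough sketch of how a junk file name glued to the front of an address might be stripped off and the result sanity checked. It assumes the junk always ends in a web page extension such as .html, which is only a guess based on the example above. The regular expressions and the `clean_address` helper are hypothetical, and the address pattern is a loose approximation rather than a full RFC 5322 validator.

```python
import re
from typing import Optional

# Loose approximation of an address: local part, "@", dotted domain, alphabetic TLD.
ADDRESS_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)

# Assumed junk shape: everything up to and including a page file extension.
JUNK_PREFIX_RE = re.compile(r"^.*\.(?:html?|php|aspx?)", re.IGNORECASE)


def clean_address(raw: str) -> Optional[str]:
    """Strip an assumed file-name prefix and return a plausible address, else None."""
    candidate = raw.strip().strip('"')
    local, sep, domain = candidate.rpartition("@")
    if not sep:
        return None  # no "@" at all, so not an address
    local = JUNK_PREFIX_RE.sub("", local)
    cleaned = f"{local}@{domain}".lower()
    return cleaned if ADDRESS_RE.fullmatch(cleaned) else None


# Invented examples, not rows from the dump.
print(clean_address("contact.htmlEmployee@example.org"))
print(clean_address('"alice@example.com.au"'))
print(clean_address("bob@localhost"))
```

This is a heuristic and it will get some rows wrong. A legitimate local part that happens to contain ".php" or ".htm" would be truncated, which is exactly why counts of "real" addresses in data like this are always going to be fuzzy.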
And then we get into passwords. One file is named numerically and contains 1.2m rows like this:
A random selection of a dozen different email addresses checked against HIBP showed that every single one of them was in the LinkedIn data breach. Now this is interesting because assuming that’s the source, all those passwords were exposed as SHA1 hashes (no salt) so it’s quite possible these are just a small sample of the 164m addresses that were in there and had readily crackable passwords.
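For anyone wondering why unsalted SHA1 makes passwords "readily crackable", the sketch below shows the basic idea: without a salt, the same password always produces the same hash, so one precomputed table of common passwords can be checked against every leaked hash at once. This is a toy illustration with an invented four-word list and hypothetical helper names, not a description of how any particular data set was actually cracked.

```python
import hashlib


def sha1_hex(password: str) -> str:
    """Unsalted SHA1 of a password, as a hex string."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def build_lookup(wordlist):
    """Precompute hash -> password once; reusable against every unsalted hash."""
    return {sha1_hex(word): word for word in wordlist}


def recover(leaked_hashes, lookup):
    """Return the subset of leaked hashes whose password is in the lookup table."""
    return {h: lookup[h] for h in leaked_hashes if h in lookup}


# Invented data: a tiny word list and three "leaked" hashes.
wordlist = ["123456", "password", "linkedin", "qwerty"]
lookup = build_lookup(wordlist)
leaked = [
    sha1_hex("password"),
    sha1_hex("linkedin"),
    sha1_hex("correct horse battery staple"),
]
recovered = recover(leaked, lookup)
print(len(recovered), "of", len(leaked), "recovered")
```

With a per-user salt, that single lookup table stops working because each account would need its own table, and a deliberately slow password hashing algorithm raises the cost further still.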
A similar file (with a similar naming structure) contains 4.2m email address and password pairs, this time with every single account having a hit on the massive Exploit.In combo list. This should give you an appreciation of how our data is redistributed over and over again once it’s out there in the public domain.
Other files have equal or even greater numbers; one has 29m rows of email address and password pairs, the former of which consistently shows up in HIBP, albeit without an exclusive data breach pattern per the previous two examples. Another is named in a fashion that suggests it’s a large Aussie set of addresses and true to its name, there are 12.5m rows in there which would mean roughly one for every 2 people in the country. A large portion of those don’t show up in HIBP at all, including the 379k .gov.au ones which appear to be a mixture of legitimate, fake and malformed addresses.
Yet another file contains over 3k records with email, password, SMTP server and port (both 25 and 587 are common SMTP ports):
This immediately illustrates the value of the data: thousands of valid SMTP accounts give the spammer a nice range of mail servers to send their messages from. There are many files like this too; another one contained 142k email addresses, passwords, SMTP servers and ports.
Some of the data was quite a jumble of info, for example the file with 20m rows of Russian email addresses surrounded by what appears to be file names containing SQL:
And it goes on and on. Email addresses, passwords and SMTP servers and ports spread across tens of gigabytes of files. It took HIBP 110 data breaches over a period of 2 and a half years to accumulate 711m addresses and here we go, in one fell swoop, with that many concentrated in a single location. It’s a mind-boggling amount of data.
The above examples are by no means exhaustive, rather they’re intended to illustrate just how diverse the data is. It helps explain both the massive number of records and the inevitable responses I’ll get of “there are addresses in there which aren’t real”. It also illustrates how broad the sources of data inevitably are; finding yourself in this data set unfortunately doesn’t give you much insight into where your email address was obtained from nor what you can actually do about it. I have no idea how this service got mine, but even for me with all the data I see doing what I do, there was still a moment where I went “ah, this helps explain all the spam I get”. And that’s the unfortunate reality for all of us: our email addresses are a simple commodity that’s shared and traded with reckless abandon, used by unscrupulous parties to bombard us with everything from Viagra offers to promises of Nigerian prince wealth. That, unfortunately, is life on the web today.
All 711 million records are now searchable in HIBP.