Data Science Engineering Manager at WhiteHat Security, Trustee at Farset Labs and Vault Artist Studios
So I have a piece of coursework for a CS module I’m taking at Queen’s University Belfast and one of the focal points of it is the recent RockYou! SQL-injection breach that released 32million passwords into the internet, and I thought I’d have a closer look at that list.
I ‘acquired’ the password list from your regular neighbourhood tracker, and thought I could walk through the process of getting a probability-sorted password dictionary.
(The ‘-S 2048K’ memory restriction on the ‘sort’ program is to avoid Dreamhost locking out my process for being over-memory)
tar -xvzf UserAccount-passwords.tgz
Having a look at the head of the resultant ‘UserAccount-passwords.txt’ file shows:
$ head UserAccount-passwords.txt
32million entries in arbitrary order arn’t really that useful, so I sorted them alphabetically first (-d)
sort -d -S 2048K UserAccount-passwords.txt -o UserAccount-passwords.sorted.txt
And getting a head again gave a whole pile of blank lines, so to get rid of them use this handy sed expression
$ sed ‘/^$/d’ UserAccount-passwords.sorted.txt > UserAccount-passwords.sorted.unblanked.txt
So our first ten passwords are now:
$ head UserAccount-passwords.sorted.unblanked.txt
Loooots of duplicates, so we’ll get rid of them
uniq -cd UserAccount-passwords.sorted.unblanked.txt UserAccount-passwords.uniq.txt
The -d flag means that we only want to know about entries that appear at least twice, and the -c means we only want one line for each password and a count for how often it appears (This reduced the number of lines in the list from 32,603,048 non-blank entries to 2,459,759), giving a first ten of:
Still sorted alphabetically, so sort reverse-numerically to get most popular entries at the top.
sort -nr -S 2048K UserAccount-passwords.uniq.txt -o UserAccount-passwords.uniq.sorted.txt
Giving our top 20 most popular passwords (sorry guys, but this is really depressing)
$ head -20 UserAccount-passwords.uniq.sorted.txt
There really is no hope for us…
More analysis to come when I can be bothered, and potentially some attempts at breaking into a VM with simulated user accounts.