So what can you do with 32 Million Passwords...
Andrew Bolster
Senior R&D Manager (Data Science) at Black Duck Software and Treasurer @ Bsides Belfast and NI OpenGovernment Network
So I have a piece of coursework for a CS module I’m taking at Queen’s University Belfast, and one of its focal points is the recent RockYou! SQL-injection breach that released 32 million passwords onto the internet, so I thought I’d have a closer look at that list.
I ‘acquired’ the password list from your regular neighbourhood tracker, and thought I could walk through the process of getting a probability-sorted password dictionary.
(The ‘-S 2048K’ memory restriction on the ‘sort’ commands below is there to stop Dreamhost locking out my process for being over-memory.)
tar -xvzf UserAccount-passwords.tgz
Having a look at the head of the resultant ‘UserAccount-passwords.txt’ file shows:
$ head UserAccount-passwords.txt
password
mekster11
mekster11
mekster11
progr4sm
khas8950
emilio1
holiday2
caitlin1
purblanca
32 million entries in arbitrary order aren’t really that useful, so I sorted them first, using dictionary order (-d), which compares only letters, digits and blanks:
sort -d -S 2048K UserAccount-passwords.txt -o UserAccount-passwords.sorted.txt
Taking a head again gave a whole pile of blank lines, so to get rid of them use this handy sed expression:
$ sed '/^$/d' UserAccount-passwords.sorted.txt > UserAccount-passwords.sorted.unblanked.txt
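A quick way to check that the blank lines really are gone is to count them before and after. This is only a sketch on a made-up five-line sample (the temp file and variable names are mine, not part of the original run):

```shell
# Made-up sample: three blank lines mixed in with two passwords.
sample=$(mktemp)
printf '%s\n' '' '' '!' 123456 '' > "$sample"

# grep -c prints the number of matching lines; '^$' matches empty lines only.
blank_before=$(grep -c '^$' "$sample")

# Same count after the sed filter. grep exits non-zero when the count is 0,
# hence the '|| true' so the script carries on under 'set -e'.
blank_after=$(sed '/^$/d' "$sample" | grep -c '^$' || true)

echo "blank lines before: $blank_before, after: $blank_after"
rm -f "$sample"
```

On the real file the same two counts can be taken against UserAccount-passwords.sorted.txt and the unblanked output.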
So our first ten passwords are now:
$ head UserAccount-passwords.sorted.unblanked.txt
!
!!!!
!!!!!
!!!!!
!!!!!
!!!!!
!!!!!
!!!!!
!!!!!
!!!!!
Loooots of duplicates, so we’ll get rid of them
uniq -cd UserAccount-passwords.sorted.unblanked.txt UserAccount-passwords.uniq.txt
The -c flag prefixes each distinct password with a count of how often it appears, and the -d flag keeps only passwords that appear at least twice, so anything used by just one account is dropped. (This reduced the list from 32,603,048 non-blank entries to 2,459,759 lines.) The first ten are:
$ head UserAccount-passwords.uniq.txt
12 !!!!!
67 !!!!!!
3 !!!!!!!
3 !!!!!!!!
8 !!!!!!!!!!
2 !!!"""£££
2 !!!$$$
2 !!!???
2 !!!@@@
2 !!""££
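To see what the two uniq flags do in isolation, here is a minimal sketch on a made-up, already-sorted sample. The awk step is my addition, only there to strip the padding uniq puts in front of the counts; it assumes the passwords contain no spaces.

```shell
# Made-up stand-in for the sorted, unblanked list.
sample=$(mktemp)
printf '%s\n' 123456 123456 123456 abc123 monkey monkey > "$sample"

# -c prefixes each distinct line with its count;
# -d drops lines that occur only once (abc123 here).
counted=$(uniq -cd "$sample" | awk '{print $1, $2}')

echo "$counted"
rm -f "$sample"
```

This should print "3 123456" and "2 monkey". Dropping the -d would keep abc123 with a count of 1, which is what you want if the dictionary should cover every password rather than just the repeated ones.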
That’s still sorted alphabetically, so sort reverse-numerically on the count to get the most popular entries at the top.
sort -nr -S 2048K UserAccount-passwords.uniq.txt -o UserAccount-passwords.uniq.sorted.txt
Giving our top 20 most popular passwords (sorry guys, but this is really depressing)
$ head -20 UserAccount-passwords.uniq.sorted.txt
290729 123456
79076 12345
76789 123456789
59462 password
49952 iloveyou
33291 princess
21725 1234567
20901 rockyou
20553 12345678
16648 abc123
16227 nicole
15308 daniel
15163 babygirl
14726 monkey
14331 lovely
14103 jessica
13984 654321
13981 michael
13488 ashley
13456 qwerty
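Since the aim is a probability-sorted dictionary, the counts can be turned into shares of the whole list. The sketch below assumes the right denominator is the 32,603,048 non-blank entries counted earlier; add_share is a hypothetical helper name, and it assumes passwords without spaces.

```shell
# Hypothetical helper: add_share TOTAL
# Reads "count password" lines on stdin and appends each password's
# share of TOTAL as a percentage with two decimal places.
add_share() {
    LC_ALL=C awk -v total="$1" \
        '{ printf "%s %s %.2f%%\n", $1, $2, 100 * $1 / total }'
}

# The top two entries from the list above, against the non-blank total.
shares=$(printf '%s\n' '290729 123456' '79076 12345' | add_share 32603048)
echo "$shares"
```

On those two lines this gives 0.89% for 123456 and 0.24% for 12345, so roughly one account in every 112 used the single most popular password.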
There really is no hope for us…
More analysis to come when I can be bothered, and potentially some attempts at breaking into a VM with simulated user accounts.
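For reference, the steps above can probably be chained into one pipeline, skipping the intermediate files. This is a sketch rather than what was actually run: rank_passwords is a hypothetical name, and the demo uses a made-up sample instead of the real list.

```shell
# Hypothetical helper: rank_passwords FILE
# Strips blank lines, sorts, counts repeated entries, then ranks by count.
rank_passwords() {
    sed '/^$/d' "$1" \
        | sort -d -S 2048K \
        | uniq -cd \
        | sort -nr -S 2048K
}

# Made-up sample with blank lines and arbitrary order.
sample=$(mktemp)
printf '%s\n' monkey 123456 '' abc123 123456 monkey 123456 '' > "$sample"

# awk only strips the padding uniq adds in front of each count.
ranked=$(rank_passwords "$sample" | awk '{print $1, $2}')
echo "$ranked"
rm -f "$sample"
```

The trade-off is that nothing is kept between stages, so a step that gets killed for using too much memory means starting again from the raw file.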