User Password Analysis

As a data engineer, I always find it interesting to work on large, unique data sets.  The recently released Yahoo user details (453K records) yield many insights.  For example, even though password cracking has been well known for a long time, a large number of users still choose simple passwords, sometimes as simple as "password" or "123456".  Here are the top 10 passwords and the number of users who used each:


123456       1667
password      780
welcome       437
ninja         333
abc123        250
123456789     222
12345678      208
sunshine      205
princess      202
qwerty        172
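
A ranking like the one above can be produced with a simple frequency count. The following is a minimal sketch, assuming the passwords have already been loaded into a Python list; the function name and the sample data are hypothetical and are not taken from the original analysis.

```python
from collections import Counter

def top_passwords(passwords, n=10):
    """Return the n most common non-empty passwords with their user counts."""
    # Empty strings are skipped so missing passwords do not show up in the ranking.
    counts = Counter(p for p in passwords if p)
    return counts.most_common(n)

# Hypothetical sample; the real data set has roughly 453K records.
sample = ["123456", "password", "123456", "ninja", "", "password", "123456"]
print(top_passwords(sample, n=2))
```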

It is also interesting to see how many users had a unique password, that is, one not used by anyone else in this data set.  About 10.6K users had no password at all, probably because of a data issue, so they were excluded from most of the calculations.  Only ~304K users (69%) had unique passwords.
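
One way to compute these figures is sketched below, assuming that "unique" means a password that appears exactly once in the data set and that empty passwords are dropped before the share is calculated. The function and key names are hypothetical.

```python
from collections import Counter

def unique_password_stats(passwords):
    """Count empty passwords and users whose password appears exactly once."""
    non_empty = [p for p in passwords if p]
    counts = Counter(non_empty)
    # A password with a count of 1 belongs to exactly one user.
    unique_users = sum(1 for c in counts.values() if c == 1)
    share = unique_users / len(non_empty) if non_empty else 0.0
    return {
        "empty": len(passwords) - len(non_empty),
        "unique_users": unique_users,
        "unique_share": share,
    }

# Hypothetical sample: one empty password, two unique ones, one shared by two users.
print(unique_password_stats(["a1", "b2", "b2", "", "c3"]))
```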

Another interesting insight: if a password is used by more than one user, it is likely to be an ordinary dictionary word or phrase ("whatever", "iloveyou"), a proper name ("jordon", "ginger"), a number (123321), or something easily guessed (for example "q1w2e3r4" or "asdfgh", both keyboard patterns on a QWERTY layout).  Even when just two users shared a password, it was fairly likely to be guessable, and that likelihood rises quickly with each additional user.  Under these circumstances, even if the service provider stores passwords hashed (with MD5, SHA, or similar), a dictionary or brute-force attack can recover the passwords of these users.
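
To illustrate why hashing alone does not protect guessable passwords, here is a minimal sketch of a dictionary lookup against unsalted MD5 hashes. It assumes the provider stored plain, unsalted MD5 digests, which is an assumption for illustration and not something the data set confirms. The candidate list and function names are hypothetical, and this is a dictionary attack, not an exhaustive brute-force search.

```python
import hashlib

def md5_hex(text):
    """Return the hex MD5 digest of a string (unsalted, for illustration only)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def dictionary_attack(hashed_passwords, candidates):
    """Map each stored hash to its plaintext if the plaintext is in the candidate list."""
    # Hash every candidate once, then look up each stored hash.
    lookup = {md5_hex(word): word for word in candidates}
    return {h: lookup[h] for h in hashed_passwords if h in lookup}

# Hypothetical stored hashes: one common password, one random-looking one.
leaked = [md5_hex("iloveyou"), md5_hex("Xk9#vT2q")]
candidates = ["whatever", "iloveyou", "asdfgh"]
recovered = dictionary_attack(leaked, candidates)
print(list(recovered.values()))
```

Only the common password is recovered; the random-looking one stays unknown because it is not in the candidate list.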

Breaking the data down by email service provider showed the following.  As expected, Yahoo had the most users (x-axis), while smaller providers ("others" in the chart) had a larger share of users with unique passwords (71.7%).  At the same time, gmail and live users had an average password length above 8.87 characters.  Password length is represented by the size of the bubble.


A bigger bubble positioned higher on the y-axis is better, as it represents more users with unique passwords and longer password strings.  See table below for more details.
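
The three quantities behind the chart (user count, share of unique passwords, and average password length per provider) could be computed along these lines. This is a sketch under stated assumptions: records are (email, password) pairs, the domain-to-provider mapping is hypothetical, and uniqueness is judged across the whole data set.

```python
from collections import Counter, defaultdict

# Hypothetical mapping; any domain not listed falls into "others".
PROVIDERS = {"yahoo.com": "yahoo", "gmail.com": "gmail", "live.com": "live"}

def provider_of(email):
    """Return the provider bucket for an email address."""
    domain = email.rsplit("@", 1)[-1].lower()
    return PROVIDERS.get(domain, "others")

def provider_stats(records):
    """Compute users, percent with unique passwords, and mean length per provider."""
    records = [(e, p) for e, p in records if p]  # drop empty passwords
    counts = Counter(p for _, p in records)      # frequency over all providers
    grouped = defaultdict(list)
    for email, password in records:
        grouped[provider_of(email)].append(password)
    stats = {}
    for provider, pws in grouped.items():
        unique = sum(1 for p in pws if counts[p] == 1)
        stats[provider] = {
            "users": len(pws),
            "unique_pct": 100.0 * unique / len(pws),
            "avg_length": sum(len(p) for p in pws) / len(pws),
        }
    return stats

# Hypothetical records for illustration.
records = [
    ("a@yahoo.com", "123456"),
    ("b@yahoo.com", "123456"),
    ("c@gmail.com", "tr0ub4dor&3"),
    ("d@example.org", "ginger"),
]
for name, row in sorted(provider_stats(records).items()):
    print(name, row)
```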



An even more interesting analysis looks for people's or places' names in passwords.  One can use the US Social Security Administration's lists of popular names, which go back as far as 1880.  Many more passwords turned out to be simply one of these names, and even more matches can be found by allowing minor modifications such as changing i to 1 or o to 0 (zero).
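
The name matching with character substitutions might look like the sketch below. The 1-to-i and 0-to-o substitutions come from the text; the other two are assumed additions. The small name set is a hypothetical stand-in for the Social Security Administration lists, which are not loaded here.

```python
# Undo common digit-for-letter substitutions before matching.
# "1" -> "i" and "0" -> "o" are from the text; "3" -> "e" and "4" -> "a" are assumed extras.
SUBSTITUTIONS = str.maketrans({"1": "i", "0": "o", "3": "e", "4": "a"})

def normalize(password):
    """Lowercase the password and reverse the substitutions above."""
    return password.lower().translate(SUBSTITUTIONS)

def matches_name(password, names):
    """Return True if the normalized password is exactly one of the given names."""
    return normalize(password) in names

# Hypothetical stand-in for a list of popular names.
names = {"jordan", "ginger", "olivia"}
for candidate in ["G1nger", "0l1v1a", "q1w2e3r4"]:
    print(candidate, matches_name(candidate, names))
```

This only catches passwords that are a name and nothing else; names with a suffix such as a birth year would need a prefix or substring check.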

With so many users choosing simple passwords, service providers and websites should require stronger passwords, enforcing the rules at registration or at each login.  Users should also be forced to change their passwords every few months.  It might be even better if every computer came equipped with a fingerprint or eye reader for user authentication, avoiding this whole password mess.
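
As a rough illustration of enforcement at registration, a site could reject passwords that fail a few basic checks. This is a minimal sketch, not a complete policy: the minimum length of 9 and the specific rules are assumptions, and the blocklist is just the top 10 passwords from this data set.

```python
# Blocklist seeded with the top 10 passwords found in this data set.
COMMON = {
    "123456", "password", "welcome", "ninja", "abc123",
    "123456789", "12345678", "sunshine", "princess", "qwerty",
}

def password_problems(password, min_length=9, blocklist=COMMON):
    """Return a list of reasons to reject the password; an empty list means it passes."""
    problems = []
    if len(password) < min_length:  # min_length is an assumed threshold
        problems.append("too short")
    if password.lower() in blocklist:
        problems.append("commonly used")
    if password.isdigit() or password.isalpha():
        problems.append("single character class")
    return problems

print(password_problems("123456"))
print(password_problems("c0rrect-h0rse-7"))
```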