With 2014 World Cup fever all around, it is interesting to do some analysis and dig deeper into the stats. Here is an attempt made over a weekend.
I pulled some publicly available data on all World Cups from 1930 to 2006; after cleaning it up for my purposes, it had the following entries for each match/game:
Country, Year, FIFA_Winner, Country_Code,
Goals_For, Goals_Against, Matches, Penalties,
Won, Drawn, Lost, Corners, Offsides,
Shots_On_Goal, Free_Kicks, etc.
My first attempt was to look at how the countries cluster together, since the clustering is easy to validate with some prior knowledge of the World Cup. For example, one would expect Brazil, Germany, Argentina and a few others to cluster together.
As in any statistical analysis, it is a bit of a challenge to decide how to handle missing values. In the above data, fields like "Shots on Goal, Shots Wide, Free Kicks, Corners" were not available until 2002. These values can either be set to 0 or be replaced with the mean of the available data (over the available period) with a function like
mean_vec <- function(vec) {
  # Mean of the non-missing values, used to fill the NA entries
  m <- mean(vec, na.rm = TRUE)
  vec[is.na(vec)] <- m
  return(vec)
}
where you replace each 'NA' with the mean of the non-missing values. It can be applied either column-wise or row-wise through the apply function. Applied column-wise, it fills in the grand mean of each column, which introduces its own errors into the model. Better would be to use the mean at the country level (simple, straightforward, and it works better for data with Gaussian distribution characteristics) or other techniques including regression substitution, most probable value substitution, etc. For some more details see http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
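As a rough illustration of the country-level mean, here is a minimal sketch. It assumes a data frame with the Country and Corners fields listed earlier; the toy values and the helper name impute_by_group are made up for illustration and are not from the original data set.

```r
# Hypothetical toy data; only the column names follow the field list above.
wc <- data.frame(
  Country = c("Brazil", "Brazil", "Brazil", "Chile", "Chile"),
  Year    = c(1998, 2002, 2006, 1998, 2002),
  Corners = c(NA, 30, 40, NA, 12)
)

# Replace each NA with the mean of the same group (here: the same country).
# If a group has no observed values at all, fall back to the grand mean.
impute_by_group <- function(vec, group) {
  out <- ave(vec, group, FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  })
  out[is.na(out)] <- mean(vec, na.rm = TRUE)
  out
}

wc$Corners <- impute_by_group(wc$Corners, wc$Country)
print(wc)
```

The fallback to the grand mean only matters for countries that never appeared after the statistic started being recorded.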
Computing the sum of squared errors (SSE) for a range of cluster counts yielded the chart below. With the elbow/bend between 4 and 6, a minimum of 4 clusters would be sufficient. I chose 10 for the analysis below.
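An elbow chart of this kind can be sketched as follows. This is only an assumption about the approach: it uses k-means on random stand-in data (the matrix x is hypothetical), and reads the SSE from the tot.withinss component that kmeans returns.

```r
set.seed(42)

# Hypothetical stand-in for the cleaned per-country feature matrix.
x <- scale(matrix(rnorm(60 * 4), ncol = 4))

# Within-cluster sum of squared errors for k = 1..10.
max_k <- 10
sse <- sapply(1:max_k, function(k) {
  kmeans(x, centers = k, nstart = 20)$tot.withinss
})

# The "elbow" is where adding clusters stops reducing the SSE by much.
plot(1:max_k, sse, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster SSE")
```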
With 10 clusters, it resulted in the following dendrogram:
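A dendrogram cut into 10 groups can be produced along these lines. This is a sketch on made-up data, assuming hierarchical clustering with Ward linkage on scaled features; the linkage method and distance used for the chart above are not stated, so treat both as assumptions.

```r
set.seed(1)

# Hypothetical feature matrix: 30 teams, 4 numeric statistics each.
x <- matrix(rnorm(30 * 4), ncol = 4,
            dimnames = list(paste0("Team", 1:30), NULL))

# Hierarchical clustering on Euclidean distances of the scaled features.
hc <- hclust(dist(scale(x)), method = "ward.D2")

# Assign each team to one of 10 clusters.
groups <- cutree(hc, k = 10)

# Draw the dendrogram and outline the 10 clusters.
plot(hc, cex = 0.7, main = "Teams clustered into 10 groups")
rect.hclust(hc, k = 10, border = "red")
```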
How do the soccer powerhouses like Brazil, Germany and a few others (cluster 7 from the left in the above diagram) compare with the rest? One metric is how many goals they score in each match and how many they concede. Density plots are one way to visualize this: I plotted a 3-dimensional density with "Goals For" on the X axis and "Goals Against" on the Y axis. I left Sweden out of the list for now. The result is a twin peak at 1 and 2 goals in favor with ~0.5 goals against per game. Contrast this with the other countries below.
Comparing with the 7 other countries from the last cluster (#10 in the above dendrogram), I get a different density plot, where the peak occurs at ~0.6 goals in favor and ~2 goals against per game.
PS: Note the difference in scales between these two plots. It would be interesting to superimpose one on the other with the same scale along all 3 dimensions.
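One way to put two such density surfaces on the same scale is sketched below. It assumes a 2D kernel density estimate via kde2d from the MASS package (which ships with standard R installs); the goal figures are simulated and the names strong and weak are hypothetical.

```r
library(MASS)
set.seed(7)

# Hypothetical goals-per-game averages for two clusters of teams.
strong <- data.frame(gf = rgamma(100, shape = 9, rate = 5),
                     ga = rgamma(100, shape = 3, rate = 5))
weak   <- data.frame(gf = rgamma(100, shape = 3, rate = 5),
                     ga = rgamma(100, shape = 10, rate = 5))

# Common X/Y limits so both densities are evaluated on the same grid.
lims <- c(range(c(strong$gf, weak$gf)), range(c(strong$ga, weak$ga)))
d_strong <- kde2d(strong$gf, strong$ga, n = 40, lims = lims)
d_weak   <- kde2d(weak$gf, weak$ga, n = 40, lims = lims)

# Common Z limit so the peak heights are directly comparable.
zmax <- max(d_strong$z, d_weak$z)
op <- par(mfrow = c(1, 2))
persp(d_strong$x, d_strong$y, d_strong$z, zlim = c(0, zmax),
      theta = 30, phi = 25, xlab = "Goals For", ylab = "Goals Against",
      zlab = "Density", main = "Strong cluster")
persp(d_weak$x, d_weak$y, d_weak$z, zlim = c(0, zmax),
      theta = 30, phi = 25, xlab = "Goals For", ylab = "Goals Against",
      zlab = "Density", main = "Weak cluster")
par(op)
```

Side-by-side panels with shared limits are used here instead of a literal overlay, since two opaque surfaces in one persp plot tend to hide each other.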
A heat map is another visualization with more detail, including the deviation of each variable (represented by the light blue vertical lines below). Compare "Games Lost and Goals Against" with "Games Won and Goals For" for the two clusters below, and also look at Shots on Goal.
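A basic heat map of this sort can be sketched with base R's heatmap function. The matrix below is invented for illustration, and base heatmap does not draw the per-variable deviation lines mentioned above, so this is only an approximation of the chart.

```r
set.seed(3)

# Hypothetical per-team totals; column names follow the field list above.
m <- cbind(
  Won           = rpois(8, 10),
  Lost          = rpois(8, 6),
  Goals_For     = rpois(8, 30),
  Goals_Against = rpois(8, 20),
  Shots_On_Goal = rpois(8, 60)
)
rownames(m) <- paste0("Team", 1:8)

# Scale each column so variables with large ranges do not dominate the colors.
hm <- heatmap(m, scale = "column", margins = c(8, 6))
```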
More (part II) analysis at: http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-part-ii.html