World Cup Data Analysis For Fun - Part II

Continuing from Part I ( http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-i.html ), following chart shows density of number of goals scored by country in a world cup tournament.  Black line in the fore ground is the average density of goals.

Some interesting facts:
* Purple peak is Wales with four goals in 1958 and that is the only year they played.
* Organge-yellowish peak is Bolivia scoring no goals twice and one goal once
* Large percentage (~80%) score no more than 10 goals in each tournament





Goals For Summary (per country per cup):
  • Min. :        0.0
  • 1st Qu.:     2.0
  • Median :   4.0
  • Mean :      5.7
  • 3rd Qu.:    8.0
  • Max. :     27.0
Goal Against Summary (per country per cup):

  • Min.   :     0.0
  • 1st Qu.:    4.0
  • Median :  5.0
  • Mean   :   5.7                                                                         
  • 3rd Qu.:   7.0  
  • Max.   :  17.0

While it is low number of goals scored in a each world cup (see chart above) it is also interesting to see the trend over many decades of all goals (scored + allowed) per game.  Here I applied the LOWESS (locally weighted scatter plot smoothing) non-parametric regression to better fit the data (blue line).


  
Though early in early years there were lot more goals each game, in the recent past (after 1970) it has stabilized around 2.7 goals per game.  But how do soccer power houses (Argentina, Brasil, Germany, etc.) compare with seven other countries chosen from another cluster (See part 1).  As one would expect you have to score more than you allow :) and represented by gray dashed line on Y-axis i.e,

Goals Scored / Goals Allowed > 1




The colored year shows the winner of the World Cup on that year while the size of the bubble shows the total goals (Scored plus Allowed).  Six countries won all world cups between 1930 and 2006 except for the years 1930 and 1950 when Uruguay won and there were no world cups during 1942, 1946.

The outlier you see at the left top screen (BR, 10) is when Brazil scored 10 goals but allowed only 1 goal in 1986 in 5 matches while Argentina was the world cup winner scoring 14 goals and allowing 5 goals in 7 matches.

And the bottom (US, 0.14) big dot is for when US scored 1 goal and allowed 7 goals in 1934.

World Cup Data Analysis For Fun - Part I

With the world cup fever of 2014 around it is interesting to do some analysis and dig deeper through stats.  Here is an attempt during a weekend.

I pulled some publicly available data of all world cups from 1930 to 2006 and after cleaning it up for my purpose it had the following entries for each match/game:
Country, Year, FIFA_Winner, Country_Code,
Goals_For, Goals_Against, Matches, Penalties,
Won, Drawn, Lost, Corners, Offsides,
Shots_On_Goal, Free_Kicks, etc.


My first attempt was to take a look at how the countries cluster together and it would also be easy to validate the clustering with some prior knowledge of world cup. For example, one would expect Brazil, Germany, Argentina and few others possibly cluster together.

As in any statistical analysis it is bit of challenge to decide how to handle missing values.  In the above data, fields like "Shots on Goal, Shots Wide, Free Kicks, Corners" were not available up until 2002.  Either these values can be set to 0 or handle with mean of the available data (over the available period) with function like

mean_vec < - function(vec) {
    m <- b=""> mean(vec, na.rm = TRUE)
    vec[is.na(vec)] <- b=""> m
    return(vec)
}


where you replace 'NA' with mean.  It could be used either column-wise or row-wise through apply function.  It is grand mean of each column which introduces its own errors into model.  Better would be to have mean at country level (a simple and straight forward and works better for data with Gaussian distribution characteristics) or other techniques including regression substitution, most probable value sub., etc.  For some more details see http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

Running the sum-squared-error (SSE) yielded the below chart. With the elbow/bend between 4 and 6 it would be sufficient to have minimum 4 clusters. I choose 10 for below analysis.




With 10 clusters it resulted in following dendogram:



How do the Soccer power houses like Brazil, Germany and few others (cluster 7 from left in the above diagram) would compare with few others.  One metric is how many goals do they score in each match while allowing some.  Density plots would be one visualization where I plotted 3 dimensional density with "Goals For" in X axis and "Goals Against" in Y axis. I left Sweden from list for now.  Here is a twin peak with 1 and 2 goals in favor while ~0.5 goals against per game.  Contrast this with one other countries below.






Comparing with the 7 other countries from the last cluster (#10 in the above dendogram), I get different density plot where peak happens with ~0.6 goals in favor while ~2 goals against per game.

PS: Note the difference in scales between these two plots.  It will be interesting super impose one above the other with the same scale along 3 dimensions.




Use of heat map is another visualization with more details including deviation of each variable (represented by light blue vertical lines below).  Compare below "Games Lost and Goals Against" with "Games Won and Goals For" for the two clusters.  Also Shots on Goal.



More (part II) analysis at: http://www.hiregion.com/2014/06/world-cup-data-analysis-for-fun-part-ii.html