Wow, it’s been almost a year since the last post. Here’s some non-science to get me back in shape.

It’s been a couple of weeks since the FIFA World Cup final. Amazing falls of favorites and rises of underdogs are what people always look for in such events. Croatia, who had barely qualified for the tournament, performed brilliantly right up to the very last game, losing only to the young France team. When discussing the teams’ chances, one can’t help but compare their countries’ population sizes. Indeed, isn’t it amazing that Croatia, a nation of 4.1 million people, could leave behind England (55 million), Russia (144 million), Argentina (44 million), and Nigeria (186 million)? Or is it not? Below is a semi-serious attempt to analyze the relationship between population and football performance.

I started with two primary data sources: the FIFA ranking and a table of population statistics (the latter was googled more or less at random; I believe there are better sources out there).

Data pre-processing involved renaming countries to synchronize the two data sets and excluding two countries whose population metrics would be skewed: the United Kingdom, which is represented in FIFA by four football federations (England, Scotland, Wales, and Northern Ireland), and Denmark (Denmark + Faroe Islands + FIFA-inactive Greenland). The remaining countries were then re-ranked.

The easiest thing to compare is the countries’ actual ranks in the FIFA ranking and by population.

So there seems to be an appreciable positive correlation between population rank and FIFA rank (r = 0.497). However, one can easily see a dense cluster of points in the upper right corner. What if we just omit the countries with population rank > 150? The Pearson correlation coefficient drops significantly but still isn’t zero (r = 0.165). If you are interested, country #150 by population is Bahrain, with a population of 1.57 million and FIFA rank #107.
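For readers who want to reproduce this kind of check, here is a minimal Python sketch of a Pearson correlation between two rank columns, with an optional cutoff on population rank. The function names and the toy ranks are invented for illustration; they are not the data or the code behind the numbers above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(rows, max_pop_rank=None):
    """Correlate (population rank, FIFA rank) pairs.

    If max_pop_rank is given, countries ranked below it by population
    (i.e. with a larger rank number) are dropped first.
    """
    kept = [(p, f) for p, f in rows
            if max_pop_rank is None or p <= max_pop_rank]
    return pearson_r([p for p, _ in kept], [f for _, f in kept])


# Made-up toy ranks, for illustration only: (population rank, FIFA rank).
toy_ranks = [(1, 3), (2, 1), (3, 4), (4, 2), (5, 6), (6, 5)]

print(rank_correlation(toy_ranks))                  # all six "countries"
print(rank_correlation(toy_ranks, max_pop_rank=4))  # smallest two dropped
```

In this toy set the correlation is carried entirely by the two smallest "countries", which is the same kind of effect the cluster in the corner of the plot can produce.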

But rank might not be the best way of comparing performance. After all, it’s discrete and, as was correctly pointed out on Twitter, the difference in performance between rank 1 (Germany) and 44 (Bulgaria) is not necessarily the same as between 151 (Kuwait) and 194 (San Marino). In population, the differences are even more dramatic because the distribution is close to exponential.

So let’s get more quantitative. For football performance, FIFA points seem to be a good metric. Let’s explore how they correlate with population variables.

One could immediately jump to the conclusion that population size and FIFA points don’t correlate (r = 0.0797). But we are not going to give up so easily. The distributions of population size, population density, and land area are highly skewed, so a log transform might be a good way to correct for that. For the sake of completeness, I’ve applied it to FIFA points, too. But before that, note the weak negative correlation between FIFA points and fertility rate (r = −0.295) and the positive correlations of FIFA points with median age (r = 0.403) and percentage of urban population (r = 0.381). So yes, developed countries tend to play football better.

To analyze the data further, I transformed the variables in the following way:

    logpop = log10(pop)
    logden = log10(Density)
    logland = log10(Land.Area)
    migr = log10((pop + Migrants)/pop)
    upop = log10(UrbanPop * pop)
    logpts = log10(Pts) # excluded 0 pts countries
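The same transformations can be sketched in Python for a single country record. This is an illustration under assumptions rather than the code used for the plots: the dictionary keys simply mirror the variable names above, `UrbanPop` is assumed to be a fraction between 0 and 1, `Migrants` is assumed to be a net head count, and the example row is invented.

```python
import math


def transform(row):
    """Return the log-scaled variables for one country.

    Countries with 0 FIFA points return None, because log10(0)
    is undefined (they are excluded from the analysis).
    """
    if row["Pts"] <= 0:
        return None
    pop = row["pop"]
    return {
        "logpop": math.log10(pop),
        "logden": math.log10(row["Density"]),
        "logland": math.log10(row["Land.Area"]),
        # Relative population change due to migration, on a log scale.
        "migr": math.log10((pop + row["Migrants"]) / pop),
        # Net urban population (assumes UrbanPop is a fraction, not a percent).
        "upop": math.log10(row["UrbanPop"] * pop),
        "logpts": math.log10(row["Pts"]),
    }


# Invented example row, not taken from the real data set.
example = {
    "pop": 1_000_000,
    "Density": 100,
    "Land.Area": 10_000,
    "Migrants": 0,
    "UrbanPop": 0.5,
    "Pts": 1000,
}
print(transform(example))
```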

Now all population metrics are much more symmetrically distributed. Also, almost all correlations are stronger on the log-log scale than on the semi-log scale (2nd and 1st rows in the correlation plot, respectively):

Wow, now the correlation between football performance and population is quite noticeable (r = 0.562)! What’s even more interesting, the net urban population correlates even better (r = 0.601). So rural folks don’t contribute as much to a country’s football skill, probably because they have better things to do. No correlation with migration was found, despite what wicked tongues might say about the French team.

But again, if the smallest countries are excluded (population rank > 150), the correlation coefficient falls back to r = 0.098 (red line on the plot below). On that plot, the blue circle is France, the red square is Croatia, the red square with a blue outline is Iceland (the best team for its population size), and the green circle is Pakistan (the worst team with a huge population).

So clearly there is a positive effect of population on football performance, but it disappears once a country reaches some critical size. What is this size? To answer this question, I calculated correlation coefficients between log-population and log-FIFA points for a range of minimum country sizes from 10,000 to 300,000,000. Below is the result:

The correlation coefficient gradually decreases as the lower limit on population size increases and reaches zero at around 7 million. So that’s a good estimate of the critical size at which a further gain in population won’t increase a country’s chances of being better at football. At this point they’d better find other ways to improve. Actually, being larger than 20 million but smaller than 100 million looks like a bit of a disadvantage.
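The sweep over minimum country sizes can be sketched in Python roughly as follows. This is a hedged illustration, not the original analysis code: the function names, the `min_countries` guard, and the toy `(population, points)` pairs are all assumptions made for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation, or None if either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlation_by_min_population(records, thresholds, min_countries=3):
    """For each lower population limit, correlate log10(pop) with log10(pts).

    records: iterable of (population, fifa_points) pairs.
    Returns a list of (threshold, r), with r = None when fewer than
    min_countries remain above the threshold.
    """
    results = []
    for limit in thresholds:
        kept = [(math.log10(pop), math.log10(pts))
                for pop, pts in records
                if pop >= limit and pts > 0]  # 0-point countries are skipped
        if len(kept) < min_countries:
            results.append((limit, None))
            continue
        xs, ys = zip(*kept)
        results.append((limit, pearson_r(xs, ys)))
    return results


# Invented toy data: points grow with population, then level off.
toy = [(1e4, 10), (1e5, 100), (1e6, 1000),
       (1e7, 900), (1e8, 1100), (1e9, 1000)]

for limit, r in correlation_by_min_population(toy, [1e4, 1e5, 1e6]):
    print(f"min population {limit:>13,.0f}: r = {r}")
```

Plotting r against the threshold and looking for the point where it crosses zero gives the kind of "critical size" estimate discussed above.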

So what about Croatia? For countries at least as large as Croatia, the correlation coefficient is 0.043. Is it meaningful? Well, by medical research standards, everything below 0.3 is negligible. Judging from the plot above, this means that for countries bigger than 200,000 people no meaningful correlation between football skill and population exists. So even Iceland would not be outstanding by medical research standards. Now you see why drug discovery is so damn difficult?

UPDATE.

*By popular demand, here are correlations of FIFA points with some economic stats: Human Development Index (dataset from the UN, latest data from 2015) and log-GDP (dataset from the World Bank). Since some countries have no data for these two metrics, they were omitted, so the correlation coefficients between log-FIFA points and population differ from those above. Fun fact that was new to me: FIFA has 18 more members than the UN!*