Simulation project data

Data used and background research

The World Happiness Report is produced by the United Nations Sustainable Development Solutions Network in partnership with the Ernesto Illy Foundation. The first World Happiness Report was published in 2012, the latest in 2019. The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be.

Each year the report has focused on a different theme, such as how the new science of happiness explains personal and national variations in happiness and how well-being is a critical component of how the world measures its economic and social development. Over the years it has looked at changes in happiness levels in the countries studied and the underlying reasons, and at the measurement and consequences of inequality in the distribution of well-being among countries and regions. The 2017 report emphasized the importance of the social foundations of happiness while the 2018 report focused on migration. The latest World Happiness Report (2019) focused on happiness and the community and on how happiness has evolved over the past dozen years, in particular the technologies, social norms, conflicts and government policies that have driven those changes.

Increasingly, happiness is considered to be the proper measure of social progress and the goal of public policy. Happiness indicators are being used by governments, organisations and civil society to help with decision making. Experts believe that measurements of well-being can be used to assess the progress of nations. The World Happiness reports review the state of happiness in the world and show how the new science of happiness explains personal and national variations in happiness.

The underlying source of the happiness scores in the World Happiness Report is the Gallup World Poll - a set of nationally representative surveys undertaken in many countries across the world. The main life evaluation question asked in the poll is based on the Cantril ladder. Respondents are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2016-2018. The overall happiness scores and ranks were calculated after a study of the underlying variables.

Happiness and life satisfaction are considered central research areas in the social sciences.

The variables on which the national and international happiness scores are calculated are very real and quantifiable. These include socio-economic indicators such as GDP and life expectancy, as well as other life evaluation questions regarding freedom, perception of corruption, and family or social support. Differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score according to the World Happiness Reports.

The variables used reflect what has been broadly found in the research literature to be important in explaining national-level differences in life evaluations. Some important variables, such as unemployment or inequality, do not appear because comparable international data are not yet available for the full sample of countries. The variables are intended to illustrate important lines of correlation rather than to reflect clean causal estimates, since some of the data are drawn from the same survey sources, some are correlated with each other (or with other important factors for which we do not have measures), and in several instances there are likely to be two-way relations between life evaluations and the chosen variables (for example, healthy people are overall happier, but as Chapter 4 in the World Happiness Report 2013 demonstrated, happier people are overall healthier).

The World Happiness Reports and data are available from the World Happiness Report website. The latest report is The World Happiness Report 2019[7]. The World Happiness Report is available for each year from 2012 to 2019 containing data for the prior year. For each year there is an Excel file with several sheets, including one sheet with annual data for different variables over a number of years and other sheets containing the data for the calculation of the World Happiness score for that year. Some of the data, such as Log GDP per capita, are forecast from previous years where the data was not yet available at the time of the report. Kaggle also hosts part of the World Happiness datasets for the reports from 2015 to 2019.

The full datasets are available online in excel format by following a link under the Downloads section on the World Happiness Report[8] website to https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls.

Kaggle Datasets[9] also has some datasets in csv format.

I have downloaded both the latest csv and excel files to the /data folder in this repository. The data from the excel file is the most comprehensive as it includes data for several years from 2008 to 2019. There are two main excel sheets in the 2019 data. The sheet named Figure2.6 contains the main variables used as predictors of the happiness score in the 2019 World Happiness Report. The data is contained in columns A to K while the remaining columns contain actual tables and figures used in the World Happiness annual report. A second sheet called Table2.1 contains all the available data from 2008 up to 2019 and provides more detail.

The happiness scores and the world happiness ranks are recorded in Figure 2.6 of the 2019 report with a breakdown of how much each individual factor impacts or explains the happiness of each country studied rather than actual measurements of the variables. The actual variables themselves are in Table 2.1 of the World Happiness Report. There are some other variables included in the report which have a smaller effect on the happiness scores.

In this project I will focus on the main determinants of the happiness scores as reported in the World Happiness Reports. These are income, life expectancy, social support, freedom, generosity and corruption. The columns in the df6 dataframe correspond to the columns in the Figure 2.6 data from the World Happiness Report of 2019. The values in these columns describe the extent to which each of the 6 factors contributes to the happiness evaluation of each country.

The actual values of these variables are in the Table 2.1 data of the World Happiness Report. The variables are:

Life Ladder
Log GDP per capita / Income
Social Support / Family
Healthy Life Expectancy at birth
Freedom to make life choices
Generosity
Perceptions of Corruption

Life Ladder:

The variable named Life Ladder is a happiness score, or measure of subjective well-being, from the Gallup World Poll. It is the national average response to the question of life evaluations. The English wording of the question is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?” This measure is also referred to as the Cantril life ladder. [10] (Statistical Appendix 1 for Chapter 2 of World Happiness Report 2019, by John F. Helliwell, Haifang Huang and Shun Wang) The values in the dataset are real numbers representing national averages. They could vary between 0 and 10, but in practice the observed range is much narrower.

Log GDP per capita:

GDP per capita is a measure of a country’s economic output that accounts for its number of people. It divides the country’s gross domestic product by its total population. That makes it a good measurement of a country’s standard of living. It tells you how prosperous a country feels to each of its citizens. [11]

Gross Domestic Product per capita is an approximation of the value of goods produced per person in the country, equal to the country’s GDP divided by the total number of people in the country. It is usually expressed in local currency or a standard unit of currency in international markets such as the US dollar. GDP per capita is an important indicator of economic performance and can be used for cross-country comparisons of average living standards. To compare GDP per capita between countries, purchasing power parity (PPP) is used to create parity between different economies by comparing the cost of a basket of similar goods.

GDP per capita can be used to compare the prosperity of countries with different population sizes.

There are very large differences in income per capita across the world. As average income increases over time the distribution of GDP per capita gets wider. Therefore the log of income per capita is taken when the growth is approximately proportional. When $x(t)$ grows at a proportional rate, $\log x(t)$ grows linearly. [12] (Introduction to Economic Growth lecture, MIT)
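The link between proportional growth and the log can be checked numerically. The sketch below uses a hypothetical starting income and growth rate (both made-up values, not taken from the report) and shows that equal proportional steps in the level become equal additive steps in the natural log.

```python
import numpy as np

# Hypothetical values for illustration only (not from the report).
start_income = 1_000.0   # income per capita in year 0
growth_rate = 0.03       # 3% proportional growth per year
years = np.arange(0, 11)

# Proportional growth: x(t) = x(0) * (1 + g)^t
income = start_income * (1 + growth_rate) ** years

# Year-on-year changes in the level get larger over time...
level_steps = np.diff(income)

# ...but year-on-year changes in the log are constant,
# which means log x(t) is a straight line in t.
log_steps = np.diff(np.log(income))

print(level_steps.round(2))
print(log_steps.round(4))
```

Every entry of `log_steps` equals `log(1 + growth_rate)`, which is why a series with steady percentage growth plots as a straight line on a log scale.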

Per capita GDP has a unimodal but skewed distribution. The log of GDP per capita, log(total GDP / population), has a more symmetrical distribution. [13] (Sustainable Development Econometrics lecture)

The natural log is often used in economics as it can make it easier to see the trends in the data and the log of the values can fit a normal distribution.
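As a rough illustration of the effect of the log on a skewed distribution, the sketch below draws synthetic values from a log-normal distribution. This is only a stand-in for right-skewed income data (an assumption of the example, not the report's data), and it compares the sample skewness before and after taking the natural log.

```python
import numpy as np
import pandas as pd

# Synthetic, made-up "GDP per capita" values: a log-normal sample is used
# purely as a stand-in for a right-skewed income distribution.
rng = np.random.default_rng(seed=1)
gdp_per_capita = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=1_000))

# Natural log of every value.
log_gdp_per_capita = np.log(gdp_per_capita)

# Sample skewness: near 0 for a symmetric distribution,
# clearly positive for a long right tail.
raw_skew = gdp_per_capita.skew()
log_skew = log_gdp_per_capita.skew()

print(f"skewness of raw values: {raw_skew:.2f}")
print(f"skewness of log values: {log_skew:.2f}")
```

With real data the log values will not be exactly normal, so a histogram or skewness check like this is worth running on the actual column before assuming a distribution.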

Healthy Life Expectancy at Birth.

Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository.

Healthy life expectancy (HALE) is a form of health expectancy that applies disability weights to health states to compute the equivalent number of years of good health that a newborn can expect. The indicator Healthy Life Years (HLY) at birth measures the number of years that a person at birth is still expected to live in a healthy condition. HLY is a health expectancy indicator which combines information on mortality and morbidity. [14]Eurostat.

Overall, global HALE at birth in 2015 for males and females combined was 63.1 years, 8.3 years lower than total life expectancy at birth. In other words, poor health resulted in a loss of nearly 8 years of healthy life, on average globally. Global HALE at birth for females was only 3 years greater than that for males. In comparison, female life expectancy at birth was almost 5 years higher than that for males. HALE at birth ranged from a low of 51.1 years for African males to 70.5 years for females in the WHO European Region. The equivalent “lost” healthy years (LHE = total life expectancy minus HALE) ranged from 13% of total life expectancy at birth in the WHO African Region to 10% in the WHO Western Pacific Region. [15] (www.who.int)
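The LHE definition can be worked through with the two global figures quoted above. This is just the arithmetic implied by the quoted numbers; the variable names are my own.

```python
# Figures quoted above from the WHO summary for 2015 (both sexes combined).
hale_at_birth = 63.1          # healthy life expectancy, in years
lost_healthy_years = 8.3      # LHE = total life expectancy - HALE

# Total life expectancy implied by those two figures.
total_life_expectancy = hale_at_birth + lost_healthy_years

# Lost healthy years as a share of total life expectancy.
lost_share = lost_healthy_years / total_life_expectancy

print(f"total life expectancy: {total_life_expectancy:.1f} years")
print(f"share of life expectancy lost to poor health: {lost_share:.1%}")
```

The global share comes out between the 10% and 13% regional extremes quoted in the text, as expected.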

Social Support is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”. [10](Statistical Appendix 1 for Chapter 2 of World Happiness Report 2019, by John F. Helliwell, Haifang Huang and Shun Wang)
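Because the responses are coded 0 or 1, the national average is simply the share of respondents answering yes. A minimal sketch with made-up responses for two hypothetical countries, ignoring any survey weights the poll may apply:

```python
import pandas as pd

# Made-up individual responses for two hypothetical countries:
# 1 = "yes, I have someone to count on", 0 = "no".
responses = pd.DataFrame({
    "country": ["A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "social_support": [1, 1, 1, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 variable is the proportion of 1s in each group.
national_average = responses.groupby("country")["social_support"].mean()
print(national_average)
```

This is why the Social Support values in the dataset lie between 0 and 1.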

I have read in the data from the excel sheet by selecting certain columns. There are 1704 rows in the excel sheet Table 2.1 and 156 rows in sheet Figure2.6. Table 2.1 contains the data on which the happiness scores in Figure 2.6 of the World Happiness Report for 2019 are calculated. When the Table2.1 data is filtered for 2018 data there are only 136 rows. The report does note that where values were not yet available at the time, values were interpolated based on previous years.
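A minimal sketch of reading selected columns and filtering for one year. The column names in `TABLE21_COLUMNS` are assumptions and should be checked against the sheet headers, and `load_table21` and `filter_year` are hypothetical helper names, not the code originally used. The filter is demonstrated on a tiny made-up frame so the example runs without the Excel file.

```python
import pandas as pd

# Assumed column names; check them against the actual sheet headers.
TABLE21_COLUMNS = [
    "Country name", "Year", "Life Ladder", "Log GDP per capita",
    "Social support", "Healthy life expectancy at birth",
    "Freedom to make life choices", "Generosity", "Perceptions of corruption",
]


def load_table21(path, columns=TABLE21_COLUMNS):
    """Read only the selected columns from the Table2.1 sheet."""
    # Reading .xls files needs an Excel engine such as xlrd installed.
    return pd.read_excel(path, sheet_name="Table2.1", usecols=columns)


def filter_year(table, year, year_column="Year"):
    """Return the rows of `table` for a single year."""
    return table[table[year_column] == year].reset_index(drop=True)


# Tiny made-up frame standing in for the real sheet.
sample = pd.DataFrame({
    "Country name": ["X", "X", "Y"],
    "Year": [2017, 2018, 2018],
    "Life Ladder": [5.1, 5.3, 6.8],
})
sample_2018 = filter_year(sample, 2018)
print(sample_2018)
```

With the real file, the same two steps would be `filter_year(load_table21(path), 2018)`, where `path` points at the downloaded Excel file in the data folder.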

Here I am going to merge the two spreadsheets and filter for just the columns I need. I then need to clean up the dataframe names used throughout this project.

To make the dataframes easier to work with, and also to merge the two dataframes, I will rename the columns.
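A sketch of the renaming step on two small made-up frames. Both the original and the new column names here are illustrative assumptions. The point is that the key column ends up with the same name in both frames, so the merge can use `on="Country"`.

```python
import pandas as pd

# Made-up rows; the original column names are assumptions.
table21 = pd.DataFrame({
    "Country name": ["X", "Y"],
    "Life Ladder": [5.3, 6.8],
    "Log GDP per capita": [9.1, 10.4],
})
figure26 = pd.DataFrame({
    "Country": ["X", "Y"],
    "Happiness score": [5.2, 6.9],
})

# Short, consistent names; the key column gets the same name in both frames.
table21 = table21.rename(columns={
    "Country name": "Country",
    "Life Ladder": "Life_Satisfaction",
    "Log GDP per capita": "Log_GDP_per_cap",
})
figure26 = figure26.rename(columns={"Happiness score": "Happiness_Score"})

# Left join: keep every row of table21 and attach the score where available.
merged = table21.merge(figure26, on="Country", how="left")
print(merged.columns.tolist())
```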

Adding Region information:

The World Happiness Report notes that for the world as a whole, happiness is normally distributed, but when the global population is split into ten geographic regions, the resulting distributions vary greatly in both shape and average values.

Therefore I think it is important to look at the geographic regions when studying the properties of this dataset. The region data is not included in the data available from the World Happiness Report, but it was included alongside the country in the csv files for 2015 and 2016 on Kaggle. (For some reason the datasets for later years on Kaggle had a single Country or region variable that only contained the country.) The following segment of code is used to add the geographic regions to the data files I am using for this project. First I renamed the columns containing the country, and then used pandas merge to join the dataframes on matching country names to get the geographic regions into my dataset. This dataframe was then written to a csv file.

For merging dataframes I referred to the pandas documentation and a blog post by Chris Albon, pandas merge with a right join[16]. A Region variable has been added to the data in order to look at the distributions across regions.
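A minimal sketch of the region merge using a right join. The country and region values are placeholders, and `kaggle_2015` and `happiness` are hypothetical names for the Kaggle file with regions and the World Happiness Report data respectively.

```python
import pandas as pd

# Made-up stand-ins: `kaggle_2015` plays the role of the Kaggle csv that
# carries a Region column, `happiness` the World Happiness Report data.
kaggle_2015 = pd.DataFrame({
    "Country": ["A", "B"],
    "Region": ["Region 1", "Region 2"],
})
happiness = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Life_Satisfaction": [7.0, 4.5, 4.9],
})

# A right join keeps every row of the right-hand frame (`happiness`);
# countries without a match in `kaggle_2015` get NaN for Region.
with_regions = kaggle_2015.merge(happiness, on="Country", how="right")

# List the countries that still need a region.
missing_region = with_regions.loc[with_regions["Region"].isna(), "Country"].tolist()
print(with_regions)
print("countries without a region:", missing_region)
```

Listing the unmatched countries straight after the merge makes it easy to see which regions have to be looked up by hand.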

I now have a merged dataframe containing the data from Table 2.1 in the World Happiness Report of 2019 with the geographic regions added, as well as the 2018 Happiness Score from the Figure 2.6 data. However some countries were missing a region value as these countries were not included in the 2015 dataset. For these countries I looked up the geographic region and added it to the dataframe to replace the NaN values.

Add the missing region values:
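One way to do this, sketched with placeholder countries and regions, is a hand-written lookup dictionary (`manual_regions`, a hypothetical name) applied only where Region is NaN:

```python
import numpy as np
import pandas as pd

# Made-up frame with two countries that have no region yet.
merged = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Region": ["Region 1", np.nan, np.nan],
})

# Hypothetical lookup table, filled in by hand for the countries with no region.
manual_regions = {"B": "Region 2", "C": "Region 1"}

# Only the NaN entries are replaced; existing Region values are left alone.
merged["Region"] = merged["Region"].fillna(merged["Country"].map(manual_regions))
print(merged)
```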

I added in a region code but this is no longer of use. I will leave it in for now but will remove it later. To create the region code I first looked at how the countries are split between the geographic regions, and then created a new column with a region number from 1 to 10. The numbers are not ordered in any meaningful way; I just started with 1 for the region with the greatest number of countries. To do this I add a new RegionCode column that starts as a copy of the Region name, and then replace each Region name string in this new column with an integer from 1 to 10. (There is probably a better way of doing this; where did not work for me when there were 10 options to choose from.)
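One possible alternative to replacing the strings one at a time is to build the name-to-number mapping from the region counts and apply it with `map`. This is a sketch on made-up data with three placeholder regions, not the code originally used:

```python
import pandas as pd

# Made-up data with three placeholder regions.
df = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Region": ["R1", "R2", "R2", "R3", "R2", "R1"],
})

# Count countries per region, largest first.
region_counts = df["Region"].value_counts()

# Number the regions 1, 2, 3, ... in that order (1 = most countries).
region_codes = {region: code for code, region in enumerate(region_counts.index, start=1)}

# Map each Region name to its integer code in a single step.
df["RegionCode"] = df["Region"].map(region_codes)
print(region_codes)
```

The same lines work unchanged for ten regions, since the dictionary is built from whatever values appear in the Region column.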

Write the datasets with the Regions added to csv files:

Read in the prepared datasets for analysis:
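The write and read steps can be sketched together as a round trip. The example writes to a temporary folder with a hypothetical file name, whereas the project itself uses its data folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up frame standing in for one of the prepared datasets.
df18 = pd.DataFrame({
    "Country": ["A", "B"],
    "Region": ["R1", "R2"],
    "Life_Satisfaction": [5.3, 6.8],
})

# A temporary folder is used here; the project itself writes to its data folder.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "Table2_1_2018.csv"   # hypothetical file name

    # index=False avoids writing the row index as an extra unnamed column.
    df18.to_csv(csv_path, index=False)

    # Read the file back in while the temporary folder still exists.
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape)
```

Writing with `index=False` means the file read back has exactly the same columns as the dataframe that was written.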

Note that the Table 2.1 data includes some rows with missing values, as some countries were added to the World Happiness Report in recent years for which the data was not available. Also, some of the data in Table 2.1 was not available for 2018 at the time the 2019 report was published, so some values were imputed or interpolated from previous years' values. Statistical Appendix 1 for Chapter 2[10] of the World Happiness Report for 2019 outlines how imputation is used for missing values when trying to decompose a country’s average ladder score into components explained by the 6 hypothesized underlying determinants (GDP per person, healthy life expectancy, social support, perceived freedom to make life choices, generosity and perception of corruption).

All the data I am using to figure out the distribution of variables have now been prepared:

df6: This contains the data from Figure 2.6 of the World Happiness Report with region values added
df: This contains the data from Table 2.1 of the World Happiness Report with region values added
df18: This contains the data from Table 2.1 filtered for 2018.

Subset the datasets to work with variables of interest.

Here I first create a smaller dataframe from df18 called dfh containing the variables of interest. This is just to make the dataframes easier to work with and print. The initial work on this project was done looking only at 2018 data. However there are a few missing values that were not available at the time. Therefore I create another dataframe containing only the relevant variables for each year from 2012 to 2018.

Below I subset the larger file (from Table 2.1 data in World Happiness Report 2019 which contains data for several years prior to 2018) to include only the main variables of interest for this project.
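A sketch of the subsetting step on a small made-up frame. The column names and the 2012 to 2018 range are assumptions for illustration.

```python
import pandas as pd

# Made-up frame standing in for the full Table 2.1 data.
df = pd.DataFrame({
    "Country": ["A", "A", "A", "B"],
    "Year": [2010, 2012, 2018, 2018],
    "Life_Satisfaction": [5.0, 5.1, 5.3, 6.8],
    "Log_GDP_per_cap": [9.0, 9.1, 9.2, 10.4],
    "Other_variable": [0.1, 0.2, 0.3, 0.4],
})

columns_of_interest = ["Country", "Year", "Life_Satisfaction", "Log_GDP_per_cap"]

# Keep rows for 2012 to 2018 inclusive and only the columns of interest.
in_range = df["Year"].between(2012, 2018)
df_years = df.loc[in_range, columns_of_interest].reset_index(drop=True)
print(df_years)
```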

Missing values

The dataframe called df_years contains all the data from Table 2.1 of the 2019 World Happiness Report data for the years between 2011 and 2018. There are some missing values in some columns. I will write this to csv. To see only the rows containing missing data I use isnull().any(axis=1) as per Stack Overflow[17]. There are not many missing values overall.
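The missing-value check can be sketched on a small made-up frame as follows. `isnull().sum()` counts the missing values per column, and `isnull().any(axis=1)` flags every row that has at least one missing value.

```python
import numpy as np
import pandas as pd

# Made-up frame with two missing values.
df_years = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Life_Satisfaction": [5.3, np.nan, 6.8],
    "Generosity": [0.1, 0.2, np.nan],
})

# Number of missing values in each column.
missing_per_column = df_years.isnull().sum()

# Only the rows that contain at least one missing value.
rows_with_missing = df_years[df_years.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```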