title: "Simulation" date: 2021-04-17T00:29:16Z draft: true author : Angela C year: "2021" month: "2021/04" categories:

  • NumPy
  • Simulation

tags:

  • jupyter

Programming for Data Analysis Project 2019.

In [ ]:
 

1. Introduction and Project Overview:

This notebook contains the body of the work for my submission for the Programming in Data Analysis Project 2019 as part of the Higher Diploma in Data Analytics at GMIT.

Objectives of the Project:

The problem statement from the Programming for Data Analysis Project 2019 instructions [1] is as follows:

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:

  • Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
  • Investigate the types of variables involved, their likely distributions, and their relationships with each other.
  • Synthesise/simulate a data set as closely matching their properties as possible.
  • Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook. Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

The pdf document is included in this repository for reference.


About this notebook and python libraries used.

This project was mainly developed using the python [2] and the following packages:

  • Seaborn [3] is a Python data visualization library for making attractive and informative statistical graphics in Python.
  • Pandas [4] provides data analysis tools and is designed for working with tabular data that contains an ordered collection of columns where each column can have a different value type.
  • Numpy.random [5] is a subpackage of the NumPy package for working with random numbers. NumPy is one of the most important packages for numerical and scientific computing in Python.

The goal of the project

The end goal of this project is to simulate a real-world phenomenon across at least one hundred data points across at least 4 different variables. A dataset must be simulated or synthesised. The instructions note that it is ok to base the synthesised dataset on an actual real-world dataset but the main task is to create a synthesised data set.

1. Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.

The real-world phenomenon I have chosen is the World Happiness Score, in particular the main determinants of happiness at country levels across the world as reported in the World Happiness Report [6].

The variables on which the national and international happiness scores are calculated are very real and quantifiable. These include socio-economic indicators such as Gross Domestic Product (GDP), life expectancy as well as some life evaluation questions regarding freedom, perception of corruption, family or social support. Differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score according to the World Happiness Reports.

The aim of the World Happiness report is to see what countries or regions rank the highest in overall happiness and each of the six factors contributing to happiness. Over the years the reports looked at how country ranks and scores changed and whether any country experienced a significant increase or decrease in happiness.

The researchers studied how 6 different factors contribute to the happiness scores and the extent of each effect. These are economic production, social support, life expectancy, freedom, absence of corruption, and generosity. They looked at how these factors contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others. These factors describe the extent to which these factors contribute in evaluating the happiness in each country.

2. Investigate the types of variables involved, their likely distributions, and their relationships with each other.

First I will investigate the variables in the datasets used by the World Happiness Reports. I will study their distributions by looking at descriptive statistics and plots such as histograms and boxplots. I will explore relationships that may exist between the variables using visualisations such as scatterplot, pairplots etc and statistics such as correlation and covariance statistics.

With this information in mind, I will try to create a simulated dataset that is as close to the real world phenomenon as possible.

3. Synthesise/simulate a data set as closely matching their properties as possible.

Having studied the distributions of the real dataset by looking at statistics and plot I will use Python to simulate the data, focusing on using the numpy.random package as much as possible but using other Python libraries as may be required. I will look at how simulation is performed and what must be considered when simulating a dataset such as this one. I will look at how each of the variables are distributed and how they could be simulated. I will also consider the relationships between the variables. While it might be relatively straightforward to simulate a single variable, modelling the real-world correlations between the variables will be challenging. As there is much inequality in the world, this will be reflected in the distribution of the variables that model factors such as income and life expectancy. Therefore I will need to look at regional variations and how this would affect the simulation of data. The distributions are likely to vary between regions, particularly between the lesser developed countries and the countries of the more developed world.

4. Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook.

All the analysis will be documented in this notebook which means that the document is quite lengthy and might take some time to load. The first section of code involves reading in the real world dataset and getting it into a state where it is ready to be analysed. The data is available in excel and csv files which is left unchanged. As the files containing the 2018 did not include the geographic regions of the countries studied, I had to add these to the data by merging with an earlier dataset. Some other manipulation such as renaming columns, dropping unnecessary columns, adding region codes etc is documented below. The end result of this is written to a csv files to the data folder in this repository.


About data simulation.

The goad of this project is to simulate a dataset. Simulating data is used for a number of reasons. Monte carlo simulation are used to simulate real world problems using repeated random sampling while simulated data is very useful for learning and demonstration purposes. Data can be simulated before the real world data is collected to help identify the type of tests and programs that need to be run. Collecting data requires resources of time and money whereas data can be simulated easily using computer programs.

Statistical analysis can be performed on the simulated data in advance of collecting the real data this process can be repeated as many times as needed. By studying simulated data you can become more familiar with the different kinds of data distributions and be in a better position to make decisions about the data and what to do with it such as how to measure it and how much is required. Simulations produce multiple sample outcomes. Experiments can be run by modifying inputs and seeing how this changes the output. The process of generating a random sample can be repeated many many times which will allow you to see how often you would expect to get the outcomes you get. Repeating the process gives multiple outcomes which can then be averaged across all simulations.

When data is collected, it is often only a small sample of data from the overall population of interest. The researchers of the World Happiness Reports did not collect all the data about the variables of interest. The typical sample size used per country was 1000 people while some countries had more than one survey per year and others had less. A sample is a subset of numbers from a distribution and the bigger the sample size the more it resembles the distribution from which it is drawn. Depending on the distribution the data is drawn from, some numbers will occur more often than others. Sample statistics are descriptions of the data that can be calculated from the sample dataset and then be used to make inferrences about the population. The population parameters are of most interest. These are the characteristics of the actual population from which a sample dataset is taken. Samples are used to estimate the parameters of the population. The sample mean is the mean of the numbers in the sample while the population mean is the mean of the entire population but it is not always possible to study the entire population directly. The law of large numbers refers to how as a sample size increases, the sample mean gets closer to the true population mean. Under the law of large numbers the more data that is collected, the closer the sample statistics will get to the actual true population parameters.

The sampling distribution of the sample means is when you collect many samples from the population and calculate the sample means on each sample. If you know the type of distribution you could sample some data from this distribution, calculate the means or any other sample statistic of the samples and plot them using a histogram to show the distribution of the sample statistic. The sampling distributions can tell you what to expect from your data.

Simulation can be used to find out what the sample looks like if it comes from that particular distribution. This information can be used to make inferences about whether the sample came from particular distribution or not. The sampling distribution of a statistic varies as a function of sample size. Small sample taken from the distribution will probably have sample statistics such as sample means that vary quite a bit from sample to sample and therefore the sampling distribution will be quite wide. Larger samples are more likely to have similar statistics and a narrower sampling distribution.

As the size of the samples increases, the mean of the sampling distribution approaches the mean of the population. The sampling distribution is itself a distribution and has some variance. The standard deviation of the sampling distribution is known as the standard error. As the sample size increases, the standard error of the sample mean decreases. According to the central limit theorem, as the sample size increases the sampling distribution of the mean begins to look more like a normal distribution, no matter what the the shape of the population distribution is.

Large experiments are considered more reliable than smaller ones. If you take a big enough sample, the sample mean gives a very good estimate of the population mean.

When simulating a random variable, you first need to define the possible outcomes of the random variable. To do this you can use the sample statistics from the sample dataset. Using simulated data therefore allows you to identify coding errors as you know what the outcomes should be.

Resampling methods are another way of simulating data and involve resampling with replacements. Bootstrap resampling is the most common method.

For this section I referred to an online book called Answering Questions with Data (Textbook): Introductory Statistics for Psychology Students by Matthew J C Crump [6].


2. Investigate the types of variables involved, their likely distributions, and their relationships with each other

Having identified a real-world phenomenon to simulate, the next step is to investigate the types of variables involved, their likely distributions, and their relationships with each other. To do so I need data.

Available Data and research

The World Happines Report is produced by the United Nations Sustainable Development Solutions Network in partnership with the Ernesto Illy Foundation. The first World Happiness Report was published in 2012, the latest in 2019. The World Happiness Report is a landmark survey of the state of global happiness that ranks 156 countries by how happy their citizens perceive themselves to be. Each year the report has focused in on a different aspect of the report such as how the new science of happiness explains personal and national variations in happiness and how well-being is a critical component of how the world measures its economic and social development. Over the years it looked at changes in happiness levels in the countries studies and the underlying reasons, the measurement and consequences of inequality in the distribution of well-being among countries and regions. The 2017 report emphasized the importance of the social foundations of happiness while the 2018 report focused on migration. The latest World Happiness Report (2019) focused on happiness and the community and happiness has evolved over the past dozen years. It focused on the technologies, social norms, conflicts and government policies that have driven those changes.

Increasingly, happiness is considered to be the proper measure of social progress and the goal of public policy. Happiness indicators are being used by governments, organisations and civil society to help with decision making. Experts believe that measurements of well-being can be used to assess the progress of nations. The World Happiness reports review the state of happiness in the world and show how the new science of happiness explains personal and national variations in happiness.

The underlying source of the happiness scores in the World Happiness Report is the Gallup World Poll - a set of nationally representative undertaken in many countries across the world. The main life evaluation question asked in the poll is based on the Cantril ladder. Respondents are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The rankings are from nationally representative samples, for the years 2016-2018. The overall happiness scores and ranks were calculated after a study of the underlying variables.

Happiness and life satisfaction are considered as central research areas in social sciences.

The variables on which the national and international happiness scores are calculated are very real and quantifiable. These include socio-economic indicators such as gdp, life expectancy as well as other life evaluation questions regarding freedom, perception of corruption, family or social support. Differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score according to the World Happiness Reports.

The variables used reflect what has been broadly found in the research literature to be important in explaining national-level differences in life evaluations. Some important variables, such as unemployment or inequality, do not appear because comparable international data are not yet available for the full sample of countries. The variables are intended to illustrate important lines of correlation rather than to reflect clean causal estimates, since some of the data are drawn from the same survey sources, some are correlated with each other (or with other important factors for which we do not have measures), and in several instances there are likely to be two-way relations between life evaluations and the chosen variables (for example, healthy people are overall happier, but as Chapter 4 in the World Happiness Report 2013 demonstrated, happier people are overall healthier).

The World Happiness Reports and data are available from the Worldhappiness website. The latest report is The World Happiness Report 2019[7]. The World Happiness Report is available for each year from 2012 to 2019 containing data for the prior year. For each year there is an excel file with several sheets including one sheet with annual data for different variables over a number of years and other sheets containing the data for the calculation of the World Happiness score for that year. Some of the data such as Log GDP per capita are forecast from the previous years where the data was not yet available at the time of the report. Kaggle also hosts part of the World Happiness datasets for the reports from 2015 to 2019.

The full datasets are available online in excel format by following a link under the Downloads section on the World Happiness Report[8] website to https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls.

Kaggle Datasets[9] also has some datasets in csv format.

I have downloaded both the latest csv and excel files to the /data folder in this repository. The data from the excel file is the most comprehensive as it includes data for several years from 2008 to 2019. There are two main excel sheets in the 2019 data. The sheet named Figure2.6 contains the main variables used as predictors of the happiness score in the 2019 World Happiness Report. The data is contained in columns A to K while the remaining columns contain actual tables and figures used in the World Happiness annual report. A second sheet called Table2.1 contains all the available data from 2008 up to 2019 and provides more detail.

The happiness scores and the world happiness ranks are recorded in Figure 2.6 of the 2019 report with a breakdown of how much each individual factor impacts or explains the happiness of each country studied rather than actual measurements of the variables. The actual variables themselves are in Table 2.1 of the World Happiness Report. There are some other variables included in the report which have a smaller effect on the happiness scores.

In this project I will focus on the main determinants of the Happiness scores as reported in the World Happiness reports. These are income, life expectancy, social support, freedom, generosity and corruption. The happiness scores and the world happiness ranks are recorded in Figure 2.6 of the 2019 report with a breakdown of how much each individual factor impacts or explains the happiness of each country studied rather than actual measurements of the variables. The columns in df6 dataframe correspond to the columns in Figure 2.6 data from the World Happiness Report of 2019. The values in the columns describe the extent to which the 6 factors contribute in evaluating the happiness in each country.

The actual values of these variables are in Table 2.1 data of the World Happiness Report. There are some other variables included in the report which have a smaller effect on the happiness scores.

  • Life Ladder
  • Log GDP per capita / Income
  • Social Support / Family
  • Healthy Life Expectancy at birth
  • Freedom to make life choices
  • Generosity
  • Perceptions of Corruption
  • Life Ladder The variable named Life Ladder is a Happiness score or subjective well-being from the Gallup World Poll It is the national average response to the question of life evaluations. The English wording of the question is “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?” This measure is also referred to as Cantril life ladder. 10 The values in the dataset are real numbers representing national averages. They could vary between 0 and 10 but in reality the range is much smaller in between these numbers.

  • Log GDP per capita:

GDP per capita is a measure of a country's economic output that accounts for its number of people. It divides the country's gross domestic product by its total population. That makes it a good measurement of a country's standard of living. It tells you how prosperous a country feels to each of its citizens.11

Gross Domestic Product per capita is an approximation of the value of goods produced per person in the country, equal to the country's GDP divided by the total number of people in the country. It is usually expressed in local currency or a standard unit of currency in international markets such as the US dollar. GDP per capita is an important indicator of economic performance and can be used for cross-country comparisons of average living standards. To compare GDP per capita between countries, purchasing power parity (PPP) is used to create parity between different economies by comparing the cost of a basket of similar goods.

GDP per capita can be used to compare the prosperity of countries with different population sizes.

There are very large differences in income per capita across the world. As average income increases over time the distribution of gdp per capita gets wider. Therefore the log of income per capita is taken when the growth is approximately proportional. When $x (t)$ grows at a proportional rate, $log x (t)$ grows linearly. [12]Introduction to Economic Growth Lecture MIT.

Per capita GDP is a unimodal but skewed distribution. The log of GDP per capita = log (Total GDP per capita/ population) is a more symmetrical distribution. [13]Sustainable Development Econometrics Lecture

The natural log is often used in economics as it can make it easier to see the trends in the data and the log of the values can fit a normal distribution.

  • Healthy Life Expectancy at Birth.

Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository.

Healthy life expectancy (HALE) is a form of health expectancy that applies disability weights to health states to compute the equivalent number of years of good health that a newborn can expect. The indicator Healthy Life Years (HLY) at birth measures the number of years that a person at birth is still expected to live in a healthy condition. HLY is a health expectancy indicator which combines information on mortality and morbidity. [14]Eurostat.

Overall, global HALE at birth in 2015 for males and females combined was 63.1 years, 8.3 years lower than total life expectancy at birth. In other words, poor health resulted in a loss of nearly 8 years of healthy life, on average globally. Global HALE at birth for females was only 3 years greater than that for males. In comparison, female life expectancy at birth was almost 5 years higher than that for males. HALE at birth ranged from a low of 51.1 years for African males to 70.5 years for females in the WHO European Region. The equivalent “lost” healthy years (LHE = total life expectancy minus HALE) ranged from 13% of total life expectancy at birth in the WHO African Region to 10% the WHO Western Pacific Region. [15]www.who.int.

  • Social Support is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”. 10

Load Python Libraries:

Here I load the python libraries and set up some formatting for the notebook.

In [1]:
# import python libraries using common alias names
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# display matplotlib plots inline
%matplotlib inline

# check what version of packages are installed.
print("NumPy version",np.__version__, "pandas version ",pd.__version__, "seaborn version",sns.__version__  )  # '1.16.2'

# set print options with floating point precision if 4, summarise long arrays using threshold of 5, suppress small results
np.set_printoptions(precision=4, threshold=5, suppress=True)  # set floating point precision to 4
 # set options to display max number of rows
pd.options.display.max_rows=8 
# I did change the max number of rows to display when developing the project to see more rows as necessary.
NumPy version 1.18.5 pandas version  1.2.3 seaborn version 0.11.0
In [2]:
import warnings
warnings.filterwarnings('ignore')
In [3]:
#!ls data # see what files are in the data folder of this repository

Prepare the dataset for analysis

Here I load some real-world dataset for the World Happiness Report into python. The data is available in the data folder of this repository which is what I use for this project. The dataset can also be read in directly from the URL. Data for other years is also available by following the links from the World Happiness Report website.

# read the data directly from the url or alternatively from the data folder in this repository
url="https://s3.amazonaws.com/happiness-report/2019/Chapter2OnlineData.xls"
# The entire data from Table2.1 sheet
WH = pd.read_excel(url, sheet_name='Table2.1')
# The data from the sheet Figure2.6, columns A to K
whr18 = pd.read_excel(url,sheet_name='Figure2.6', usecols="A:K")
In [4]:
# import the data from the data folder in this repository
Table2_1 = pd.read_excel('data/Chapter2OnlineData2019.xls', sheet_name='Table2.1', usecols="A:I")
# The data from the sheet Figure2.6, selected columns between A and K
Fig2_6 = pd.read_excel('data/Chapter2OnlineData2019.xls',sheet_name='Figure2.6', usecols="A:B, F:K")
# read in only selected columns from Figure2.6
Fig2_6_x = pd.read_excel('data/Chapter2OnlineData2019.xls',sheet_name='Figure2.6', usecols="A:B")
# the 2019 data, same values as Kaggle data except not including the rank but including the whiskers or intervals
print("The shape of the data from Table2.1 is: \n",Table2_1.shape)
print("The shape of the data from Figure 2.6 is: \n",Fig2_6.shape)
The shape of the data from Table2.1 is: 
 (1704, 9)
The shape of the data from Figure 2.6 is: 
 (156, 8)

I have read in the data from the excel sheet by selecting certain columns. There are 1704 rows in the excel sheet Table 2.1 and 156 rows in sheet Figure2.6. Table 2.1 contains the data on which the happiness scores in Figure 2.6 of the World Happiness Report for 2019 are calculated. When the Table2.1 data is filtered for 2018 data there are only 136 rows. The report does note that where values were not yet available at the time, values were interpolated based on previous years.

Here I am going to merge the two spreadsheets and just filter for the columns I need. I then need to cleanup the dataframe names throoughout this project.

In [5]:
#Table2_1.columns
In [6]:
# look at columnes for Figure 2.6 data
#Fig2_6.columns

To make the dataframe easier to work and also to merge the two dataframes I will rename the columns.

In [7]:
# only contains the country and happiness score for 2018
Fig2_6_x.columns
Out[7]:
Index(['Country', 'Happiness score'], dtype='object')
In [8]:
# rename the columns
Table2_1.rename(columns={'Country name':'Country', 'Life Ladder':'Life_Satisfaction','Log GDP per capita':'Log_GDP_per_cap',
                         'Social support':'Social_Support', 'Healthy life expectancy at birth':'Life_Expectancy','Freedom to make life choices':'Freedom',
                        'Perceptions of corruption':'Corruption_perceptions'}, inplace=True)
# look at the columns
#Table2_1.columns
In [9]:
# merge the two dataframes that came from the two spreadsheets on the Country variable.
merged= pd.merge(Fig2_6_x,Table2_1, on="Country",how="right")
# as the figure 2.6 data is for 2018 only and I am only interested in the Happiness Score I will chnage the column name and drop all the other columsn from Figure 2.6
merged =merged.rename(columns={'Happiness score':'Happiness_Score_2018'})

Adding Region information:

The World Happiness Report report does note somewhere that for the world as a whole, the distribution of happiness is normally distributed but when the global population is split into ten geographic regions, the resulting distributions vary greatly in both shape and average values.

Therefore I think it is important to look at the geographic regions when studying the properties of this dataset. The region data is not available in the data available from the World Happiness Report but it was included in the csv file for 2015 and 2016 on Kaggle in addition to the country. (For some reason the datasets for later years on Kaggle had a single Country or region variable that only contained the country). The following segment of code is used to add the geographic regions to the data files I am using for this project. First I had to rename the columns containing the country and then using pandas merge to join the dataframes based on the country names being equal to get the geographic regions into my dataset. This dataframe was then written to csv file.

For merging dataframes I referred to the pandas documentation and a blogpost by Chris Albon,pandas merge with a right join[16]. A Region variable has been added to the data in order to look at the distributions across regions.

In [10]:
# read in the kaggle dataset for 2015 as this contains the Regions as well as the country names.
k15 = pd.read_csv("data/2015.csv")
# extract the country and regions from the 2015 file:
Regions = k15.loc[:,['Country','Region']]
#Regions.shape # 158 rows
# see how many unique regions and countries
Regions.describe(include="object")
#merge the dataframes on country variable
merged=pd.merge(Regions, merged,on='Country',how='right')
#merged.describe()
# There should now be two non-numeric variables in the dataset. Country name and Region
#merged.describe(include="object")
# a peak at the dataset
print("\t DataFrame merged")
merged.head(3)
	 DataFrame merged
Out[10]:
Country Region Happiness_Score_2018 Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Freedom Generosity Corruption_perceptions
0 Afghanistan Southern Asia 3.2033 2008 3.723590 7.168690 0.450662 50.799999 0.718114 0.177889 0.881686
1 Afghanistan Southern Asia 3.2033 2009 4.401778 7.333790 0.552308 51.200001 0.678896 0.200178 0.850035
2 Afghanistan Southern Asia 3.2033 2010 4.758381 7.386629 0.539075 51.599998 0.600127 0.134353 0.706766

I now have a dataframe merged containing the data from Table 2.1 in the World Happiness Report of 2019 with the geographic regions added as well as the 2018 Happiness Score from the Figure 2.6 data. However some countries were missing a region value as these countries were not included in the 2015 dataset. In this case I looked up the geographic region and added them to the dataframe to replace the NaN values.

In [11]:
# write to csv
merged.to_csv("data/merged.csv")
In [12]:
# create a dataframe called df from the merged dataframe
df=merged

Add the missing region values:

In [13]:
# find how many rows have missing values for Region
df['Region'].isna().sum()
# print the rows with missing Region valuesBelize
df.loc[df.loc[:,'Region'].isna()]
# update the value of Region for the following countries
df.loc[df['Country']=="Taiwan Province of China",['Region']]="Southeastern Asia"
df.loc[df['Country']=="Namibia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Somalia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Hong Kong S.A.R. of China",['Region']]="Southeastern Asia"
df.loc[df['Country']=="South Sudan",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Gambia",['Region']]="Sub-Saharan Africa"
df.loc[df['Country']=="Belize",['Region']]="Latin America and Caribbean"
df.loc[df['Country']=="Cuba",['Region']]="Latin America and Caribbean"
df.loc[df['Country']=="Guyana",['Region']]="Latin America and Caribbean"
# checking to make sure all regions have values now
df['Region'].isna().sum()
Out[13]:
0

I added in a region code but this is no longer of use. I will leave it in for now but will remove it later. To create the region code I looked at the proportion of the overall number of countries that are in each region and then created a new column with the region number. The numbers assigned are not ordered as such - I just started with 1 for the region with the greatest number of countries. To do this I will add a new column for the RegionCode that has the Region name to start and then replace the string with the Region name in this new column with an integer from 1 to 10. (There is probably a better way of doing this - where did not work for me when there was 10 options to choose from). First I look at how the countries are split between geographic regions and then assign a number from 1 to 10 for the region with the highest number of countries although there is no order as such.

In [14]:
# just looking again at the proportion of countries over the 10 regions
df.Region.value_counts()/len(df.Region)
Out[14]:
Sub-Saharan Africa             0.221244
Central and Eastern Europe     0.200117
Latin America and Caribbean    0.151995
Western Europe                 0.135563
                                 ...   
Southern Asia                  0.045775
Eastern Asia                   0.029343
North America                  0.015258
Australia and New Zealand      0.014085
Name: Region, Length: 10, dtype: float64
In [15]:
# add a new column called RegionCode with the Region 
df['RegionCode']=df['Region']
df.head()
# replace the new regionCode with a number for each region as follows:
df['RegionCode']=df["RegionCode"].replace("Sub-Saharan Africa",1)
df['RegionCode']=df["RegionCode"].replace("Central and Eastern Europe",2)
df['RegionCode']=df["RegionCode"].replace("Western Europe",3)
df['RegionCode']=df["RegionCode"].replace("Latin America and Caribbean",4)
df['RegionCode']=df["RegionCode"].replace("Middle East and Northern Africa",5)
df['RegionCode']=df["RegionCode"].replace("Southeastern Asia",6)
df['RegionCode']=df["RegionCode"].replace("Southern Asia",7)
df['RegionCode']=df["RegionCode"].replace("Eastern Asia",8)
df['RegionCode']=df["RegionCode"].replace("Australia and New Zealand",9)
df['RegionCode']=df["RegionCode"].replace("North America",10)

# convert to integer - might use this later for correlations.
#df["RegionCode"] = pd.to_numeric(dfh["RegionCode"])

Write the datasets with the Regions added to csv files:

In [16]:
# write the dataframes to a csv files:
df.to_csv("data/Table2_1.csv")

Read in the prepared datasets for analysis:

Note that Table 2.1 data includes some rows where there are some missing values as there were some countries added to the World Happiness Report in recent years for which the data was not available. Also some of the data in Table 2.1 was not available for 2018 at the time of the 2019 report being published. Some imputation was used or some interpolation from previous years values. Statistical Appendix 1 for Chapter 2[10] of the World Happiness Report for 2019 outlines how imputation is used for missing values when trying to decompose a country's average ladder score into components explained by the 6 hypothesized underlying determinants (GDP per person, healthy life expectancy, social support, perceived freedom to make life choice, generosity and perception of corruption).

All the data I am using to figure out the distribution of variables have now been prepared

  • df6: This contains the data from Figure 2.6 of the World Happiness Report with region values added
  • df: This contains the data from Table 2.1 of the World Happiness Report with region values added
  • df18: This contains the data from Table 2.1 filtered for 2018.
In [17]:
# read in the Table 2.1 data back in, set the index_col to be the first column
df = pd.read_csv("data/Table2_1.csv", index_col=0)
# look at top and bottom rows to see it all looks ok
df.tail(2)
df.head(2)
# Create a dataframe with 2018 data from the Table 2.1 of the World Happiness Report 2019:
df18=df.loc[df.loc[:,'Year']==2018]
print("\n \t The dataframe df18 containing the Table 2.1 data from the 2019 World Happiness Report.\n")
df18.head(2)
 	 The dataframe df18 containing the Table 2.1 data from the 2019 World Happiness Report.

Out[17]:
Country Region Happiness_Score_2018 Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Freedom Generosity Corruption_perceptions RegionCode
10 Afghanistan Southern Asia 3.2033 2018 2.694303 7.494588 0.507516 52.599998 0.373536 -0.084888 0.927606 7
21 Albania Central and Eastern Europe 4.7186 2018 5.004403 9.412399 0.683592 68.699997 0.824212 0.005385 0.899129 2
In [18]:
# see the dimensions of the dataframes
print(df.shape,df18.shape) #  
(1704, 12) (136, 12)

Subset the datasets to work with variables of interest.

Here I first create a smaller dataframe from df18 called dfh containing the variables of interest. This is just to make the dataframes easier to work with and print. The initial work on this project was done looking only at 2018 data. However there are few missing values that were not available at the time. Therefore I create another dataframe containing only the relevent variables for each year from 2012 to 2018.

Below I subset the larger file (from Table 2.1 data in World Happiness Report 2019 which contains data for several years prior to 2018) to include only the main variables of interest for this project.

In [19]:
# drop year column as this only contains 2018 anyway.
## create a data frame dfh that contains only a subset of the columns from df
df18 = df18.loc[:,['Country','Region','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy', 'RegionCode']]
df18.head()
Out[19]:
Country Region Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
10 Afghanistan Southern Asia 2.694303 7.494588 0.507516 52.599998 7
21 Albania Central and Eastern Europe 5.004403 9.412399 0.683592 68.699997 2
28 Algeria Middle East and Northern Africa 5.043086 9.557952 0.798651 65.900002 5
45 Argentina Latin America and Caribbean 5.792797 9.809972 0.899912 68.800003 4
58 Armenia Central and Eastern Europe 5.062449 9.119424 0.814449 66.900002 2
In [20]:
# using .loc to create a dataframe containing only the required columns for this project
dfx= df.loc[:,['Year','Country','Region','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy', 'RegionCode']]
# using .loc to subset only the rows contains years from 2011 to 2018.
df_years = dfx.loc[dfx.loc[:,'Year'].isin([2018,2017,2016,2015,2014,2013,2012,2011])]
# look at the columns
df_years.columns
print("\t DataFrame called df_years \n")
df_years.head()
	 DataFrame called df_years 

Out[20]:
Year Country Region Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
3 2011 Afghanistan Southern Asia 3.831719 7.415019 0.521104 51.919998 7
4 2012 Afghanistan Southern Asia 3.782938 7.517126 0.520637 52.240002 7
5 2013 Afghanistan Southern Asia 3.572100 7.522238 0.483552 52.560001 7
6 2014 Afghanistan Southern Asia 3.130896 7.516955 0.525568 52.880001 7
7 2015 Afghanistan Southern Asia 3.982855 7.500539 0.528597 53.200001 7

Missing values

The dataframe calleddf_years contains all the data from Table 2.1 of the 2019 World Happiness Report data for the years between 2011 and 2018. There are some missing values for some columns. I will write this to csv. To see only the rows containing missing data I use isnull().any(axis=1) as per stackoverflow[17]. There are not many missing values overall.

In [21]:
# to see how many observations for each year
#df_years.groupby('Year').count()
In [22]:
# write the csv file to csv 
df_years.to_csv("data/WHR_2012_2018.csv")
# read back in the csv file.
df_years= pd.read_csv("data/WHR_2012_2018.csv", index_col=0)
df_years.head()
Out[22]:
Year Country Region Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
3 2011 Afghanistan Southern Asia 3.831719 7.415019 0.521104 51.919998 7
4 2012 Afghanistan Southern Asia 3.782938 7.517126 0.520637 52.240002 7
5 2013 Afghanistan Southern Asia 3.572100 7.522238 0.483552 52.560001 7
6 2014 Afghanistan Southern Asia 3.130896 7.516955 0.525568 52.880001 7
7 2015 Afghanistan Southern Asia 3.982855 7.500539 0.528597 53.200001 7
In [23]:
# look at only rows with missing values: https://stackoverflow.com/a/30447205
# any(axis=1) to check for any missing values per row, then filter with boolean mask
nulldata=df_years[df_years.isnull().any(axis=1)]
# see how rows with missing values
print(f"There are {nulldata.shape[0]} rows with missing values out of a total of {df_years.shape[0]} rows for data from 2011 to 2018.")
There are 44 rows with missing values out of a total of 1138 rows for data from 2011 to 2018.
In [ ]:
 

</id>

The distribution of the data: What does the real data look like?

In order to be able to simulate data I need to know more about the data and what it looks like. I will go through each of the variables in a sub-section of their own but I will first look at the summary statistics and some visualisations of the distribution of the most important variables in the datasets. I will look a the similarities in the data and the differences in the data.

The distributions can be plotted to summarise the data visually using histograms and kernel density estimates plots. The histogram plots the distribution of the frequency counts across the bins. Distributions can have very different shapes and describe the data. Histograms can show how some of the numbers group together, the location and shape of the data. The height of the bars on a histogram indicate how much data there is, the minimum and maximum values show the range of the data. The width of the bars can be controlled by changing the bin sizes. Kernel Density Estimation[18] is a non-parametric way to estimate the probability density function of a random variable. Inferences about the population are made, based on a finite data sample.

The central tendency measures shows how common certain numbers are, how similar data points are to each other and where the most data tends to be located while the variance shows the spread of the data and how different the data points are.

Central Tendency measures:

  • The most frequently occurring number in the dataset is the mode.
  • The median is the middle number (s) in the data when they are ordered from smallest to largest.
  • The mean is the average of the data.
  • The mean can be influenced by very large or small numbers while the mode and median are not sensitive to larger numbers that do not occur very often. The mean is the balancing point of the data - the location in the data where the numbers on one side sum to the same amount as the numbers on the other side.

Variance measures:

  • The range is the width of the variation in the data, between the minimum and maximum or boundaries of the data.
  • The variance is the mean of the sum of the squared deviations of the data where the deviations is how far each values is from the mean.
  • The standard deviation is the square root of the variance and is in the same size as the data itself.

Correlation of the data variables.

  • Measures such as the covariance and correlation can show how the data variables might be related to each.
  • Scatterplots can be used to see how two variables might be related to each other and the strength and directions of any such relationships that exist.

Correlation is not the same as causation while lack of an obvious correlation does not mean there is no causation. Correlation between two variables could be due to a confounding or third variable that is not directly measured. Correlations can also be caused by random chance - these are called spurious correlations. These are all things to consider when looking at data and when attempting to simulate data.

Visualising the Distribution of the dataset to see the distributions and relationships between variables.

Pair Plots of the main variables in the dataset:

A seaborn pairplot can show if there are any obvious relationships between variables. The univariate distribution of each variable are shown on the diagonal. The pairplot below is for the data for 2018. The scatterplots in the pairplot shows a positive linear relationship between each of the variables Income (Log GDP per capita), Healthy Life Expectancy, Social support and satisfactions with life (Life Ladder). While the life ladder distribution looks more normally distributed than the other variables it has 2 distinct peaks. The other distributions are left skewed. The 2016 World Happiness Report[10] notes how for the world as a whole, the distribution of world happiness is very normally distributed about the median answer of 5, with the population-weighted mean being 5.4 but when the global population is split into ten geographic regions, the resulting distributions vary greatly in both shape and average values. Only two regions—the Middle East and North Africa, and Latin America and the Caribbean— have more unequally distributed happiness than does the world as a whole.

Therefore taking this into account it is important to look at the data on a regional basis. The second pairplot uses the hue semantic to colour the points by region and shows the distributions for each variables for each region on the diagonal. The distribution plots show very different shapes for each variable when broken down by regions. Sub-Saharan Africa stands out as a region that has far different levels of life satisfaction, income and life expectancy than most other regions. Regions such as Western Europe, North America and Australia and New Zealand have distributions that are centred around much higher values than regions such as Sub-Saharan Africa and South Asia. The distributions for North America, Australia and New Zealand are very narrow as these regions contain a very small number of countries compared to the other regions.

On a regional level the distributions look more normally distributed for all the four variables but with different locations and spreads. This means that any simulation of the variables must take the separate distributions into account.

In [24]:
# pairplot of the variables, drop missing values
print("\nDistribution of Life satisfaction, Income, Social Support and Healthy Life Expectancy globally  and then by region\n")
sns.pairplot(df18.iloc[:,:6].dropna());

# could also plot the pairplot for all years from 2011 to 2018.
#sns.pairplot(df_years.iloc[:,1:7].dropna());

# pairplots showing the Region as the colour of the dots.
sns.pairplot(df18.iloc[:,:6].dropna(), hue="Region", palette="bright");
Distribution of Life satisfaction, Income, Social Support and Healthy Life Expectancy globally  and then by region

The pairplots show a distinction between the different geographic regions for most variables. Sub-Saharan Africa stands out as a region that has far different levels of life satisfaction, income and life expectancy than most other regions. The histograms for each variable are plotted again individually to see the distributions more clearly for each variable.

Distribution of each variable.

A closer look at each of the variables in the histograms below.

In [25]:
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.histplot(df18['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Life Ladder globally 2018");
sns.histplot(df18['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Income globally 2018");
sns.histplot(df18['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Social support globally 2018");
sns.histplot(df18['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Life expectancy globally 2018");  
plt.tight_layout();

The distribution of Life Ladder variable looks to be normally distributed whiile there is some left skew in the other variables. Normally distributed data is considered the easiest to work with as normal distributions can be compared by looking at their means and standard deviations. Many statistical methods assume variables are normally distributed and others work better with normality. -Sustainable Development lectures[13]


Central Tendency and variance statistics of the variables.

Here are the summary statistics for the variables for 2018 and also over the period from 2012 to 2018 inclusive. The 2018 statistics are similar enough to the amalgamated 2012 to 2018 statistics.

In [26]:
# summary statistics of the dataset (2018)- just showing the main variables of interest
print("The data available for 2018 only: \n")
df18.iloc[:,0:6].describe()
The data available for 2018 only: 

Out[26]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
count 136.000000 127.000000 136.000000 132.000000
mean 5.502134 9.250394 0.810544 64.670832
std 1.103461 1.186589 0.116332 6.728247
min 2.694303 6.541033 0.484715 48.200001
25% 4.721326 8.346278 0.739719 59.074999
50% 5.468088 9.415703 0.836641 66.350002
75% 6.277691 10.166517 0.905608 69.075001
max 7.858107 11.453928 0.984489 76.800003
In [27]:
# summary statistics of data from 2012 to 2018
print("Summary statistics for all years from 2012 to 2018 based on Table 2.1 data. \n")
df_years.iloc[:,3:7].describe()
Summary statistics for all years from 2012 to 2018 based on Table 2.1 data. 

Out[27]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
count 1138.000000 1112.000000 1132.000000 1112.000000
mean 5.426942 9.259777 0.806205 63.625778
std 1.127066 1.180980 0.119364 7.228575
min 2.661718 6.465948 0.290184 36.860001
25% 4.596455 8.370898 0.741112 58.375001
50% 5.358060 9.448528 0.828651 65.480000
75% 6.267663 10.197885 0.903454 68.500000
max 7.858107 11.770276 0.987343 76.800003

Boxplots to show the central tendancy, symmetry, skew and outliers.

Boxplots can show the central tendency, symmetry and skew of the data as well as any outliers. The rectangular boxes are bounded by the hinges representing the lower (1st) quartile and upper (3rd quartile). The median of 50th percentile is shown by the line through the box. The whiskers show the minimum and maximum values of the data excluding any outliers. A distribution is symmetric if the median is in the centre of the box and the whiskers are the same length.
While none of the variables look symmetric, the distribution of Life ladder is less asymmetric than the other variables. A skewed distribution has the median closer to the shorter whisker as is the case for Healthy Life Expectancy, Social support and Log GDP per capita. A positive/right skewed distribution has a longer top whisker than bottom while a negatively skewed / left skewed distribution has a longer lower whisker as is the case for Log GDP per capita, Social support and Healthy life expectancy.

In [28]:
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(1,4, figsize=(12,3))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.boxplot(x=df18['Life_Satisfaction'], ax=axes[0], color="r");
sns.boxplot(x=df18['Log_GDP_per_cap'], ax=axes[1], color="g");
sns.boxplot(x=df18['Social_Support'], ax=axes[2],color="b");
sns.boxplot(x=df18['Life_Expectancy'], ax=axes[3],color="y");
plt.tight_layout();

Distribution of the data by geographic region.

As the pairplot showed earlier, there are wide variations between regions and for this region I will look at the distributions of each variable on a geographic regional basis. Below are the mean and standard deviation statistics showing how the central tendency and spread of the distributions vary across regions. The mean and standard deviations at a global level are then shown. There is quite a difference in the statistics on a regional basis which is lost in the overall statistics.

Mean and Standard Deviation by Region:

In [29]:
# look at 2018 statistics by Region, round for printing
# just look at specific columns
print("The mean and standard deviation for the variables on a regional basis. \n")
df18.iloc[:,:6].groupby(['Region']).agg([np.mean, np.std]).round(4)
The mean and standard deviation for the variables on a regional basis. 

Out[29]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
mean std mean std mean std mean std
Region
Australia and New Zealand 7.2736 0.1367 10.6112 0.1552 0.9470 0.0097 73.4000 0.2828
Central and Eastern Europe 5.6472 0.6164 9.6812 0.6087 0.8722 0.0817 66.7673 2.2000
Eastern Asia 5.5575 0.3296 10.0508 0.5844 0.8533 0.0737 70.0500 5.7076
Latin America and Caribbean 5.9530 0.7548 9.2870 0.6722 0.8475 0.0820 66.9333 3.3755
... ... ... ... ... ... ... ... ...
Southeastern Asia 5.5089 0.6609 9.1068 0.6375 0.8222 0.0609 64.6889 5.7106
Southern Asia 4.2989 0.9597 8.3931 0.6878 0.6888 0.1109 61.0333 5.1613
Sub-Saharan Africa 4.5200 0.6775 7.9125 0.8863 0.6918 0.0986 55.8273 3.7124
Western Europe 6.8981 0.6808 10.7237 0.3230 0.9118 0.0451 72.8105 0.7593

10 rows × 8 columns

In [30]:
# mean and standard deviation at the global level in the dataset for 2018
print("The mean and standard deviation at a global level for 2018: \n")
df18.iloc[:,1:6].agg([np.mean, np.std])
The mean and standard deviation at a global level for 2018: 

Out[30]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
mean 5.502134 9.250394 0.810544 64.670832
std 1.103461 1.186589 0.116332 6.728247
In [31]:
# mean and standard deviation at the global level in the dataset for 2012 to 2018
print("The mean and standard deviation at a global level for years from 2012 to 2018 \n")
df_years.iloc[:,1:7].agg([np.mean,np.std])
The mean and standard deviation at a global level for years from 2012 to 2018 

Out[31]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
mean 5.426942 9.259777 0.806205 63.625778
std 1.127066 1.180980 0.119364 7.228575

The statistics for 2018 and for all years from 2012 to 2018 are very similar. I am focusing on 2018 but if I need extra data I can use the dataset with all years from 2012 to 2018.

In [32]:
df18.groupby(['Region']).describe().round(4)
Out[32]:
Life_Satisfaction Log_GDP_per_cap ... Life_Expectancy RegionCode
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Region
Australia and New Zealand 2.0 7.2736 0.1367 7.1770 7.2253 7.2736 7.3220 7.3703 2.0 10.6112 ... 73.500 73.6 2.0 9.0 0.0 9.0 9.0 9.0 9.0 9.0
Central and Eastern Europe 26.0 5.6472 0.6164 4.6206 5.1844 5.6662 6.1360 7.0342 25.0 9.6812 ... 68.425 71.1 26.0 2.0 0.0 2.0 2.0 2.0 2.0 2.0
Eastern Asia 4.0 5.5575 0.3296 5.1314 5.3813 5.6291 5.8052 5.8402 4.0 10.0508 ... 73.950 75.0 4.0 8.0 0.0 8.0 8.0 8.0 8.0 8.0
Latin America and Caribbean 18.0 5.9530 0.7548 3.6149 5.7993 6.0558 6.3491 7.1411 18.0 9.2870 ... 68.725 71.3 18.0 4.0 0.0 4.0 4.0 4.0 4.0 4.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Southeastern Asia 10.0 5.5089 0.6609 4.4106 5.1653 5.3396 5.9760 6.4670 8.0 9.1068 ... 67.200 76.8 10.0 6.0 0.0 6.0 6.0 6.0 6.0 6.0
Southern Asia 6.0 4.2989 0.9597 2.6943 3.9636 4.4497 4.8074 5.4716 6.0 8.3931 ... 64.100 67.2 6.0 7.0 0.0 7.0 7.0 7.0 7.0 7.0
Sub-Saharan Africa 34.0 4.5200 0.6775 3.3346 4.0488 4.4510 4.9260 5.8817 34.0 7.9125 ... 57.900 66.4 34.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
Western Europe 20.0 6.8981 0.6808 5.4093 6.5157 7.0403 7.4081 7.8581 17.0 10.7237 ... 73.400 74.4 20.0 3.0 0.0 3.0 3.0 3.0 3.0 3.0

10 rows × 40 columns

In [33]:
print("Central Tendency, Skew and outliers of  Life Satisfaction, Income, Social Support and Life Expectancy by region \n ")
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(12,8))
sns.boxplot(y="Region", x="Life_Satisfaction", data=df18, ax=axes[0,0]);
sns.boxplot(y="Region", x="Log_GDP_per_cap", data=df18, ax=axes[0,1]);
sns.boxplot(y="Region", x="Social_Support", data=df18, ax=axes[1,0]);
sns.boxplot(y="Region", x="Life_Expectancy", data=df18, ax=axes[1,1]);
plt.tight_layout();
Central Tendency, Skew and outliers of  Life Satisfaction, Income, Social Support and Life Expectancy by region 
 

When the distributions of each of the variables representing life satisfaction, gdp per capita, social support and life expectancy at birth are broken down by regional level, they tell a very different story to the distributions of the variables at a global level. There are some regions that overlap with each other and there are regions at complete opposite ends of the spectrum. I think this further clarifies the need to look at the variables on a regional level when simulating the data.

For example the boxplots show that the median values of Log GPD per capita fall into roughly 3 groups (similar to the boxplots for social support) by regions with Western Europe, North America and Australia and New Zealand having the highest scores, while Southern Asia and Sub Saharan Africa have the lowest median scores and are the most variable along with the Middle East and Nortern Africa region. There is no overlap at all between the measurements for the richer regions (such as Western Europe, Australia and North America) and the poorer regions (Sub_saharan Africa and Southern Asia). The above boxplots by geographic regions show that there is great variations between regions in the distribution of GDP per capita. The boxplots also show some outliers where there are a few countries from each geographic regions which are more similar to countries in another geographic region than in their own region.


Clustering to see groups in the countries.

Colouring the points by regions in the plots above showed that some regions had very similar distributions to other regions and very disimilar to other regions. For instance Southern Asia and Sub_Saharan Africa are always close to each other and their distribution barely overlap if at all with regions such as Western Europe, North America and Austrlia and New Zealand. Therefore I am having a quick look at using Kmeans clustering to see how it would cluster the points rather than simulating data for the 10 regions as it is a small enough dataset.

Here I refer to a pythonprogramming blogpost on flat clustering[19]. (I just looked briefly at clustering to see if the countries could be separated into clear groups of regions and did not do a thorough clustering analysis. I chose 3 as the number of clusters to use based on the plots above but a full clustering analysis would indicate the best number of clusters).

In [34]:
##Adapted from code at: https://pythonprogramming.net/flat-clustering-machine-learning-python-scikit-learn/).
from sklearn.cluster import KMeans
# select the data to use. drop any missing values
x=df_years.loc[:,['Year','Life_Satisfaction','Log_GDP_per_cap','Social_Support','Life_Expectancy','RegionCode']]
x=x.dropna()
# initialise kmeans to be the KMeans algorithm (flat clustering) with the number of clusters.
kmeans=KMeans(n_clusters=3)
# fit the data
kmeans.fit(x)
# create a copy of the data
clusters=x.copy()
# add column with the predictions from the kmeans model
clusters['cluster_pred']=kmeans.fit_predict(x)

Look at the clusters to see how the regions are clustered:

In [35]:
clusters[['RegionCode','cluster_pred']]
Out[35]:
RegionCode cluster_pred
3 7 1
4 7 1
5 7 1
6 7 1
... ... ...
1700 1 1
1701 1 1
1702 1 1
1703 1 1

1094 rows × 2 columns

Plot the clusters and then plot the actual observations to visually compare with the geographic regions:

In [36]:
# set the figure size
plt.rcParams["figure.figsize"] = (12,3)
# use seaborn scatterplot, hue by cluster group
g= sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Social_Support'],hue=clusters['cluster_pred'],palette='rainbow')
# place legends outside the box - # https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Clusters: Social support and Life Ladder");
In [37]:
# scatterplot of life ladder and social support
g =sns.scatterplot(y = df_years['Social_Support'],x= df_years['Life_Satisfaction'],hue=df_years['Region'],palette="bright")
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Scatterplot of Social support vs Life Ladder");
In [38]:
clusters.columns
Out[38]:
Index(['Year', 'Life_Satisfaction', 'Log_GDP_per_cap', 'Social_Support',
       'Life_Expectancy', 'RegionCode', 'cluster_pred'],
      dtype='object')
In [39]:
# plot the clusters gdp versus life ladder
g= sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Log_GDP_per_cap'],hue=clusters['cluster_pred'],palette='rainbow')
# legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Clustering of Life Ladder vs Log GDP");
In [40]:
# scatterplot of actual log gdp vs life ladder
g =sns.scatterplot(y = df_years['Log_GDP_per_cap'],x= df_years['Life_Satisfaction'],hue=df_years['Region'], palette='bright')
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
# add title
plt.title("Scatterplot of Life Ladder vs Log GDP per capita");
In [41]:
# plot the clusters Life ladder vs Life Expectancy
g=sns.scatterplot(x=clusters['Life_Satisfaction'],y=clusters['Life_Expectancy'],hue=clusters['cluster_pred'],palette='rainbow')
plt.title("Clustering of Life Ladder vs Life Expectancy");
# place legends outside box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);
In [42]:
# scatterplot of life expectancy vs life ladder
g =sns.scatterplot(y = df_years['Life_Expectancy'],x= df_years['Life_Satisfaction'],hue=df_years['Region'], palette="bright")
# https://stackoverflow.com/q/53733755 to move legends outside of box
g.legend(loc='center left', bbox_to_anchor=(1.10, 0.5), ncol=1);

Relationships between the variables.

Correlation in the data:

The covariance and correlation statistics can be used to determine if there is a linear relationship between variables and if one variable tends to occur with large or small values of another variable. Covariance is a measure of the joint variability of two random variables while the Pearson correlation coefficient is the normalised version of the covariance and it shows by it's magnitude the strength of the linear relationship between two random variables. The covariance is a measure of how much two variables vary with each other and in what direction a variable changes when another variable changes. Pandas cov function can be used to get the covariances between pairs of variables. The covariances below are all positive which means that when one of the variables is above it's mean, the other variable is also likely to be above it's mean.

The correlation coefficients are easier to interpret than the covariances. If there is a strong positive relationship between two variables then the value of the correlation coefficient will be close to 1 while a strong negative relationship between two variables will have a value close to -1. If the correlation coefficient is zero this indicates that there is no linear relationship between the variables. Pandas corr function is used to get the correlations between pairs of variables. There seems to be a strong positive correlation between each of the pairings between the four variables with the strongest relationship between Healthy Life Expectancy and Log GDP per capita.

As noted the World Happiness Report researchers found that differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score. Income in the form of GDP per capita for a country has a strong and positive correlation with the averages country scores of the response to the Cantril Ladder question.

In [43]:
print("The covariance of the variables in the dataset: \n")
# pandas cov function on the 4 variables in the dataset
df18.iloc[:,:6].cov()
The covariance of the variables in the dataset: 

Out[43]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Life_Satisfaction 1.217627 0.992087 0.092331 5.549072
Log_GDP_per_cap 0.992087 1.407993 0.112260 6.782165
Social_Support 0.092331 0.112260 0.013533 0.595622
Life_Expectancy 5.549072 6.782165 0.595622 45.269314
In [44]:
print("The correlation between variables in the dataset: \n")
# pandas corr function on the 4 variables 
df18.iloc[:,:6].corr()
The correlation between variables in the dataset: 

Out[44]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Life_Satisfaction 1.000000 0.760206 0.719267 0.744264
Log_GDP_per_cap 0.760206 1.000000 0.792945 0.852848
Social_Support 0.719267 0.792945 1.000000 0.751514
Life_Expectancy 0.744264 0.852848 0.751514 1.000000
In [45]:
print("The correlation between variables in the dataset across all years from 2011 to 2018: \n")
df_years.iloc[:,3:7].corr()
The correlation between variables in the dataset across all years from 2011 to 2018: 

Out[45]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Life_Satisfaction 1.000000 0.778394 0.715501 0.751513
Log_GDP_per_cap 0.778394 1.000000 0.680559 0.840854
Social_Support 0.715501 0.680559 1.000000 0.632336
Life_Expectancy 0.751513 0.840854 0.632336 1.000000

Seaborn regplot or lmplot functions can be used to visualize a linear relationship between variables as determined through regression. Statistical models are used to estimate a simple relationship between sets of observations that can be quickly visualised. These can be more informative than looking at the numbers alone. The regression plots below show a positive linear relationship between the pairs of variables.

In [46]:
df18.columns
Out[46]:
Index(['Country', 'Region', 'Life_Satisfaction', 'Log_GDP_per_cap',
       'Social_Support', 'Life_Expectancy', 'RegionCode'],
      dtype='object')
In [47]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=df18, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=df18, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=df18, ax=axes[2])
# set the axis limits for this subplot.
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();
In [48]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Life_Expectancy",y="Log_GDP_per_cap", data=df18, ax=axes[0])
sns.regplot(x ="Life_Expectancy",y="Social_Support", data=df18, ax=axes[1])
sns.regplot(x ="Life_Expectancy",y="Life_Satisfaction", data=df18, ax=axes[2])
plt.tight_layout();
In [49]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Log_GDP_per_cap",y="Life_Expectancy", data=df18, ax=axes[0])
sns.regplot(x ="Log_GDP_per_cap",y="Social_Support", data=df18, ax=axes[1])
sns.regplot(x ="Log_GDP_per_cap",y="Life_Satisfaction", data=df18, ax=axes[2])
axes[2].set_ylim(0,12)
plt.tight_layout();
In [50]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(x ="Social_Support",y="Life_Expectancy", data=df18, ax=axes[0])
sns.regplot(x ="Social_Support",y="Log_GDP_per_cap", data=df18, ax=axes[1])
sns.regplot(x ="Social_Support",y="Life_Satisfaction", data=df18, ax=axes[2])
axes[2].set_ylim(0,12)
plt.tight_layout();

The covariance and correlation statistics suggest a strong linear relationship between the pairs of variables and the regression plots confirm this. Life satisfactions are positively correlated with income levels (Log GDP per capita), social support and healthy life expectancy. Income and life expectancy are very highly correlated.

(I might be interpreting the correlations incorrectly using the region codes. What I wanted to see was if the region the country is correlated with the variables as it would seem to be the case. Life expectancy and income are generally related to the region.)


3. Synthesise/simulate a data set as closely matching their properties as possible.

Having collected some data for the real-world phenomenon of the World Happiness Scores, I looked at the type of variables involved, their distributions and the relationships between the variables. In this section I will look more closely at the distributions of each of the variables and then simulate some data that hopefully will have properties that match the actual variables in the real sample dataset.

The aim of the World Happiness Scores is to see what countries or regions rank the highest in overall happiness and how each of the six factors (economic production, social support, life expectancy, freedom, absence of corruption, and generosity) contribute to happiness levels. The World Happiness Report researchers studied how the different factors contribute to the happiness scores and the extent of each effect. While these factors have no impact on the total score reported for each country, they were analysed to explain why some countries rank higher than others and they describe the extent to which these factors contribute in evaluating the happiness in each country.

As the differences in social support, incomes and healthy life expectancy are the three most important factors in determining the overall happiness score, this is what I focus on in this project.

Therefore I will simulate the following variables:

  • Regions and Countries

  • Life Ladder / Life Satisfaction

  • Log GDP per capita

  • Social Support

  • Healthy Life Expectancy at birth


Simulate Regions and countries.

There are countries from 10 geographic regions in the sample dataset for 2018. While the region was not in the actual dataset for 2018, I added it in from a dataset for a previous year (2015). This allowed me to see the how the values for each variable in the dataset differ by region.

Look at the clustered groups for 2018:

The rough clustering earlier divided the countries into 3 clustered regions. Of the clustered countries, 10 values were lost due to missing values. If I had more time then the missing values could be imputed based on previous years but for now there is at least 100 which is sufficient for this project.

In [51]:
#clusters for 2018 data
c18=clusters.loc[clusters.loc[:,'Year']==2018]
len(c18) # only 126 countries due to missing values in some rows
Cy=clusters # all the data for all years 
# the number of clusters for all years
len(Cy)
Out[51]:
1094

Look at the distribution of countries across the geographic regions.

Below I look at the actual proportions of countries in each of the 10 geographic regions which are then plotted. The countplot shows the distribution of countries in the real dataset over the 10 regions. To avoid the axes labels overlapping, I have followed advice on a blogpost drawing from data[20] to rotate the axes labels. Also to show the plot by descending order of the region with the most countries.

In [52]:
# see how many countries in each Region
df18.Region.value_counts()
# find how many values in total
print(f"There are {len(df18.Region)} countries in the dataset. ")
# calculate the proportion of countries in each region
prob=(df18.Region.value_counts()/len(df18.Region)).round(3)
#print(f"These countries are split over 10 geographic regions as follows: \n{df18.Region.value_counts()/len(df18.Region)}.3f %", sep="\n")
# make sure they sum to 1. 
prob.sum()
There are 136 countries in the dataset. 
Out[52]:
1.0
In [53]:
# set up the matplotlib plot
plt.rcParams["figure.figsize"] = (10,4)
# set the tick size
plt.rcParams["xtick.labelsize"] = 10
sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=df18.Region, order= df18['Region'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");

Distribution of countries by clustered region.

Here I will use the dataframe c18 created above which has the clustered regions added. All other data in it is the same as the original dataframes. The dataframe clusters contains all the data from df_yearss but with the addition of the predicted clusters.

In [54]:
# see the top of the dataframe
clusters.head()
Out[54]:
Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode cluster_pred
3 2011 3.831719 7.415019 0.521104 51.919998 7 1
4 2012 3.782938 7.517126 0.520637 52.240002 7 1
5 2013 3.572100 7.522238 0.483552 52.560001 7 1
6 2014 3.130896 7.516955 0.525568 52.880001 7 1
7 2015 3.982855 7.500539 0.528597 53.200001 7 1
In [55]:
# find how many values in total
print(f"There are {len(c18.cluster_pred)} countries in the dataset. ")
# calculate the proportion of countries in each region
CRprob=(c18.cluster_pred.value_counts()/len(c18.cluster_pred)).round(3)
print(f"These countries are split over 10 geographic regions as follows: \n{c18.cluster_pred.value_counts()/len(c18.cluster_pred)}.3f %", sep="\n")
# mnake sure they sum to 1. 
CRprob.sum()
There are 126 countries in the dataset. 
These countries are split over 10 geographic regions as follows: 
0    0.444444
2    0.309524
1    0.246032
Name: cluster_pred, dtype: float64.3f %
Out[55]:
1.0

Plot the distribution of countries in the clustered groups:

In [56]:
sns.set_palette("pastel")
# https://stackoverflow.com/q/46623583 order countplot by count.
chart= sns.countplot(x=c18.cluster_pred, order= c18['cluster_pred'].value_counts().index)
# rotate axes labels to avoid overlap - see <https://www.drawingfromdata.com/how-to-rotate-axis-labels-in-seaborn-and-matplotlib>
chart.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.title("Distribution of Countries in Geographic Regions in in the dataset 2018");

Simulate some geographic regions.

Here I use the numpy.random.choice function to generates a random sample of regions from a one dimensional array of made-up regions using the proportions from the real dataset.

In [57]:
# makeup region names:
Sim_regions = ['SimReg0','SimReg1','SimReg2']
# assign countries to region based on proportions in actual dataset.
p=[c18.cluster_pred.value_counts()/len(c18.cluster_pred)]
# simulate regions using the probabilities based on proportion in actual dataset
CR= np.random.choice(Sim_regions, 136, p=[p[0][0],p[0][1],p[0][2]])
CR
Out[57]:
array(['SimReg2', 'SimReg1', 'SimReg2', ..., 'SimReg1', 'SimReg0',
       'SimReg0'], dtype='<U7')

Make up some country names:

Here I am just going to create some country names by appending a digit to the string 'Country_'. I don't have the imagination to start creating over 100 new countries!

In [58]:
# create a list of countries by appending number to 'country'
countries =[]
# use for loop and concatenate number to string
for i in range(136):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    countries.append(countryname)

Create a dataframe to hold the simulated data:

I now have countries and regions for the simulated dataset which I will add to a dataframe.

In [59]:
# create a dataframe to hold the simulated variables.
sim_df = pd.DataFrame(data={'Sim_Country':countries, 'Sim_Region':CR})
sim_df
Out[59]:
Sim_Country Sim_Region
0 Sim_Country_0 SimReg2
1 Sim_Country_1 SimReg1
2 Sim_Country_2 SimReg2
3 Sim_Country_3 SimReg0
... ... ...
132 Sim_Country_132 SimReg1
133 Sim_Country_133 SimReg1
134 Sim_Country_134 SimReg0
135 Sim_Country_135 SimReg0

136 rows × 2 columns

Plot of the distribution of simulated regions:

In [60]:
# set the figure size
plt.rcParams["figure.figsize"] = (10,4)
sns.set_palette("muted")
# countplot of simulated region, order by counts
sns.countplot(x=sim_df.Sim_Region, order= sim_df['Sim_Region'].value_counts().index)
# add title
plt.title("Distribution of Simulated Regions in the Simulated dataset");

The countplot above shows the simulated regions in the simulated dataset, ordered by the count of the countries in each region. Comparing it to the actual dataset shows that the split looks good.


Simulate Life Satisfaction / Life Ladder variable:

The Life Ladder scores and rankings are based on answers to the Cantril Ladder question from the Gallup World Poll where the participants rate their own lives on a scale of 0 to 10 with the best possible life for them being a 10 and the worst a 0. The rankings are from nationally representative samples for the years between 2016 and 2018 and are based entirely on the survey scores using weights to make the estimates representative.

The World Happiness Report shows how levels of GDP, life expectancy, generosity, social support, freedom, and corruption - contribute to making life evaluations higher in each country than they are in Dystopia, a hypothetical country that has values equal to the world’s lowest national averages for each of the six factors. The report itself looks at life evaluations for the 2016 to 2018 period and the rankings across all countries in the study. Here are the statistics for the Life Ladder variable for the 2018 dataset.

Life Ladder / Life Satisfaction is a non-negative real number with statistics as follows. The minimum possible value is 0 and the maximum is 10 although the range of values in the actual dataset is smaller. The values in the dataset are national averages of the answers to Cantril Ladder question. The range of values vary across regions.

Descriptive statistics of the Life Ladder variable in the dataset for 2018:

In [61]:
df18['Life_Satisfaction'].describe()
Out[61]:
count    136.000000
mean       5.502134
std        1.103461
min        2.694303
25%        4.721326
50%        5.468088
75%        6.277691
max        7.858107
Name: Life_Satisfaction, dtype: float64

Plot of the distribution of Life Ladder for 2018:

The distribution of Life Ladder is plotted below. The plot on the left shows the distribution for the dataframe containing 2018 data only and the plot on the right shows the distribution for the dataframe containing all years.

In [62]:
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Life Ladder for 2018
sns.distplot(df18['Life_Satisfaction'], label="Life Ladder 2018", ax=axes[0])
# distribution of life ladder from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life Ladder for 2018 only")
#axes[0].legend()
# kdeplot of the life ladder for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Life_Satisfaction'], label="Life Ladder - all years", ax=axes[1])
sns.distplot(c18['Life_Satisfaction'], label="Life Ladder - 2018",ax=axes[1])
axes[1].set_title("Distribution of Life Ladder for 2012 to 2018");
#axes[1].legend();

The distribution plots for the Life Ladder variable does suggest that it is normally distributed but there are some tests that can be used to clarify this. A blogpost on tests for normality[21] at machinelearningmastery.com outlines how it is important when working with a sample of data to know whether to use parametric or nonparametric statistical methods. If methods used assume a Gaussian distribution when it is not the case then findings can be incorrect or misleading. In some cases it is enough to assume the data is normal enough to use parametric methods or to transform the data to be normal enough.

Parametric statistical methods assume that the data has a known and specific distribution, often a Gaussian distribution. If a data sample is not Gaussian, then the assumptions of parametric statistical tests are violated and nonparametric statistical methods must be used.[21]

Normality tests can be used to check if your data sample is from a Gaussian distribution or not.

  • Statistical tests calculate statistics on the data to quantify how likely it is that the data was drawn from a normal distribution.
  • Graphical methods plot the data to qualitatively evaluate if the data looks Gaussian.

Tests for normality include the Shapiro_Wilk Normality Test, the D'Agostino and Pearson's Test, the Anderson-Darling Test.

The histograms above show the characteristic bell-shape curve of the normal distribution and indicates that the Life Ladder variable is gaussian or approximately normally distributed.

The Quantile-Quantile Plot (QQ plot) is another plot that can be used to check if the distribution of a data sample. It generates its own sample of the idealised distribution that you want to compare your sample of data to. The idealised samples are divided into groups and each data point in the sample is paired with a similar member from the idealised distribution at the same cumulative distribution.

A scatterplot is drawn with the idealised values on the x-axis and the data sample on the y-axis. The resulting points are plotted as a scatter plot with the idealized value on the x-axis and the data sample on the y-axis. If the result is a straight line of dots on the diagonal from the bottom left to the top right this indicates a perfect match for the distribution whereas if the dots deviate far from the diagonal line.

QQ plot of Life Ladder:

Here a QQ plot is created for the Life Ladder sample compared to a Gaussian distribution (the default).

In [63]:
# import the qqplot function 
from statsmodels.graphics.gofplots import qqplot
# Life Ladder observations from the dataset. can use c18 dataframe or dfh
data = c18['Life_Satisfaction']
# plot a q-q plot, draw the standardised line
qqplot(data, line='s') 
plt.title("QQ plot to test for normality of Life Ladder");

Tests for normality:

The QQ plot does seem to indicate to me that the Life Ladder is normally distributed. The scatter plots of points do mostly follow the diagonal pattern for a sample from a Gaussian distribution. I will use some of the normality tests as outlined on the blogpost on tests for normality[21]. These tests assume the sample was drawn from a Gaussian distribution and test this as the null hypothesis. A test-statistic is calculated and a p-value to interpret the test statistic. A threshold level called alpha is used to interpret the test. This is typically 0.05. If the p-value is less than or equal to alpha, then reject the null hypothesis. If the p-value is greater than alpha then fail to reject the null hypothesis. In general a larger p-value indicates that the sample was likely to have been drawn from a Gaussian distribution. A result above 5% doesn't mean the null hypothesis is true but that it is very likely true given the evidence available.

Shapiro-Wilk Normality Test

In [64]:
# adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import the shapiro test from scipy stats
from scipy.stats import shapiro
# calculate the test statistic and the p-value to interpret the test on the Life Ladder sample
stat, p = shapiro(df18['Life_Satisfaction'])
# interpret the test
alpha= 0.05
print('stat=%.3f, p=%.3f' % (stat, p))
alpha = 0.05
if p > alpha:
    print('The sample of Life Satisfaction/ Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')
stat=0.991, p=0.520
The sample of Life Satisfaction/ Life Ladder looks Gaussian (fail to reject H0)

D’Agostino’s K^2 Test

This test calculates the kurtosis and skewness of the data to see if the data distribution is not normal like. The skew is a measure of asymmetry in the data while hurtosis how much of the distribution is in the tails.

In [65]:
# D'Agostino and Pearson's Test adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/
# import from scipy.stats
from scipy.stats import normaltest
# normality test
stat, p = normaltest(df18['Life_Satisfaction'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
# interpret
alpha = 0.05
if p > alpha:
    print('The sample of Life Ladder looks Gaussian (fail to reject H0)')
else:
    print('The sample of Life Ladder does not look Gaussian (reject H0)')
Statistics=2.435, p=0.296
The sample of Life Ladder looks Gaussian (fail to reject H0)

Anderson-Darling Test:

This is a statistical test that can be used to evaluate whether a data sample comes from one of among many known data samples adn can be used to check whether a data sample is normal. The test is a modified version of the Kolmogorov-Smirnov test[22] which is a nonparametric goodness-of-fit statistical test. The Anderson-Darling test returns a list of critical values rather than a single p-value. The test will check against the Gaussian distribution (dist=’norm’) by default.

In [66]:
# Anderson-Darling Test - adapted from https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

from scipy.stats import anderson
# normality test
result = anderson(df18['Life_Satisfaction'])
print('Statistic: %.3f' % result.statistic)
p = 0
# checking for a range of critical values
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < result.critical_values[i]:
        # fail to reject the null that the data is normal if test statistic is less than the critical value.
        print('%.3f: %.3f, data looks normal (fail to reject H0)' % (sl, cv))
    else:
        # reject the null that the data is normal if test statistic is greater than the critical value.
        print('%.3f: %.3f, data does not look normal (reject H0)' % (sl, cv))
Statistic: 0.235
15.000: 0.560, data looks normal (fail to reject H0)
10.000: 0.638, data looks normal (fail to reject H0)
5.000: 0.766, data looks normal (fail to reject H0)
2.500: 0.893, data looks normal (fail to reject H0)
1.000: 1.062, data looks normal (fail to reject H0)

The sample of Life Ladder for 2018 does appear to be normally distributed based on the above tests. Therefore I can go ahead and simulate data for this variable using the normal distribution using the sample mean and standard deviation statistics.

Simulate Life ladder variable:

I will now simulate 'Life Ladder' using the numpy.random.normal function. This takes 3 arguments - loc for the mean or centre of the distributon, scale for the spread or width of the distribution - the standard deviation, and size for the number of samples to draw from the distribution. Without setting a seed the actual values will be different each time but the shape of the distribution should be the same.

In [67]:
print(f" The mean of the sample dataset for 2018 is {c18['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of the sample dataset for 2018 is {c18['Life_Satisfaction'].std():.3f} ")
c18.shape
 The mean of the sample dataset for 2018 is 5.500 
 The standard deviation of the sample dataset for 2018 is 1.098 
Out[67]:
(126, 7)
In [68]:
df_years.loc[df_years.loc[:,'Year']==2016]
Out[68]:
Year Country Region Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
8 2016 Afghanistan Southern Asia 4.220169 7.497038 0.559072 53.000000 7
19 2016 Albania Central and Eastern Europe 4.511101 9.337532 0.638411 68.099998 2
26 2016 Algeria Middle East and Northern Africa 5.340854 9.541166 0.748588 65.500000 5
43 2016 Argentina Latin America and Caribbean 6.427221 9.830088 0.882819 68.400002 4
... ... ... ... ... ... ... ... ...
1665 2016 Vietnam Southeastern Asia 5.062267 8.672080 0.876324 67.500000 6
1676 2016 Yemen Middle East and Northern Africa 3.825631 7.299221 0.775407 55.099998 5
1688 2016 Zambia Sub-Saharan Africa 4.347544 8.203072 0.767047 54.299999 1
1701 2016 Zimbabwe Sub-Saharan Africa 3.735400 7.538829 0.768425 54.400002 1

142 rows × 8 columns

Creating subsets of the data for each in order to be able to see how the distribution varies from year to year.

In [69]:
# subset of the data for each year
c17= Cy.loc[Cy.loc[:,'Year']==2017]
c16= Cy.loc[Cy.loc[:,'Year']==2016]
c15= Cy.loc[Cy.loc[:,'Year']==2015]
c14= Cy.loc[Cy.loc[:,'Year']==2014]
In [70]:
# plot the distribution of life satisfaction variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Satisfaction'], label="2014")
sns.kdeplot(c15['Life_Satisfaction'], label="2015")
sns.kdeplot(c16['Life_Satisfaction'], label="2016")
sns.kdeplot(c17['Life_Satisfaction'], label="2017")
sns.kdeplot(c18['Life_Satisfaction'], label="2018")
plt.title("Distribution of Life Satisfaction for each year 2014 to 2017");

Plot some simulated data.

Here some data is simulated for the Life Satisfaction/Ladder variable is simulated using the mean and standard deviation for the 2018 data. As the variable appears to be normally distributed it would be ok to do this. However as mnay of the other varaibles are not normally distributed at the dataset level but do look gaussian at the clustered region level I will be simulating the other variables this way.

In [71]:
sns.kdeplot(c18['Life_Satisfaction'], label="Life Satisfaction 2018", shade="True")
# simulate data based on statistics from the 2018 sample dataset
# simulate life ladder using sample mean and standard deviation
for i in range(10):
        x= np.random.normal(c18['Life_Satisfaction'].mean(),c18['Life_Satisfaction'].std(),130)
        sns.kdeplot(x)
# plot the distribution of the actual dataset

# plot the simulated data 
plt.title("Actual data versus the Simulated data for Life Ladder 2018")
plt.legend();

The above plots of the distribution of Life_Satisfaction illustrate how the distribution changes slightly for each simulation due to the randomness.

Next plot the distribution for the 4 clustered regions.

In [72]:
c18.groupby('cluster_pred').count()
Out[72]:
Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
cluster_pred
0 56 56 56 56 56 56
1 31 31 31 31 31 31
2 39 39 39 39 39 39

The distributions of Life Ladder for each of the 3 clustered regions.

In [73]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Satisfaction'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Satisfaction'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Satisfaction'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life Ladder for the 3 clusters for 2017");
plt.legend();
In [74]:
# separate out the clustered regions for all years from 2011 to 2018
cluster0 =clusters.loc[clusters.loc[:,'cluster_pred']==0]
cluster1 =clusters.loc[clusters.loc[:,'cluster_pred']==1]
cluster2 =clusters.loc[clusters.loc[:,'cluster_pred']==2]
## separate out the clustered regions for 2018 data
c18_0 =c18.loc[c18.loc[:,'cluster_pred']==0]
c18_1 =c18.loc[c18.loc[:,'cluster_pred']==1]
c18_2 =c18.loc[c18.loc[:,'cluster_pred']==2]

The distribution of the Life Satisfaction (Life Ladder) variable looks quite normal when taken over all the countries in the dataset. When shown for each of the 3 clustered regions, the distributions are all still normal looking but with very different shapes and locations.

Mean and Standard deviation for each region group:

In [75]:
# cluster 0
print(f" The mean of the sample dataset for 2018 is {cluster0['Life_Satisfaction'].mean():.3f} ")
print(f" The standard deviation of the sample dataset for 2018 is {cluster0['Life_Satisfaction'].std():.3f} ")
c18.shape
 The mean of the sample dataset for 2018 is 5.410 
 The standard deviation of the sample dataset for 2018 is 0.741 
Out[75]:
(126, 7)

simulate for each cluster based on their statistics

In [76]:
LLc0= np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31)
LLc1= np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39)
LLc2= np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),56)
In [77]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(LLc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(LLc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(LLc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life Satisfaction for 3 region clusters");

Illustrating how the distributions changed for each simulation due to random variation.

In [78]:
f,axes=plt.subplots(1,3, figsize=(13,3))
for i in range(5):
        sns.kdeplot(np.random.normal(c18_0['Life_Satisfaction'].mean(),c18_0['Life_Satisfaction'].std(),31), shade=True, ax=axes[0])
    
for i in range(5):
        sns.kdeplot(np.random.normal(c18_1['Life_Satisfaction'].mean(),c18_1['Life_Satisfaction'].std(),39), shade=True, ax=axes[1])
for i in range(5):
        sns.kdeplot(np.random.normal(c18_2['Life_Satisfaction'].mean(),c18_2['Life_Satisfaction'].std(),59), shade=True, ax=axes[2])

plt.suptitle("Distributions of 5 simulations for Life Satisfaction for each of the 3 groups")
       
Out[78]:
Text(0.5, 0.98, 'Distributions of 5 simulations for Life Satisfaction for each of the 3 groups')
In [ ]:
 

Simulating Social Support

Social Support:

The variable 'Social support' was the result of a question in the Gallop World Poll with the national average of the binary responses for each country to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”.

The distribution from the dataset shows that it is left skewed. The scatterplots above showed that it is positively correlated with life satisfaction, income levels and healthy life expectancy. The boxplots below show that the median values fall into roughly 3 groups by regions with Western Europe, North America and Australia and New Zealand having the highest scores, while Southern Asia and Sub Saharan Africa have the lowest median scores and a wider spread.

Like Life ladder variable, the values are the result of national averages to questions in the Gallup World Poll. In this case the question had a binary answer 0 or 1 and therefore the actual values in the dataset range between 0 and 1. Social Support is a non-negative real values. The statistics vary from region to region.

In [79]:
df18['Social_Support'].describe()
Out[79]:
count    136.000000
mean       0.810544
std        0.116332
min        0.484715
25%        0.739719
50%        0.836641
75%        0.905608
max        0.984489
Name: Social_Support, dtype: float64

Plot of the distribution of Social Support for 2018:

The distribution of Life Ladder is plotted below. The plot on the left shows the distribution for the dataframe containing 2018 data only and the plot on the right shows the distribution for the dataframe containing all years.

In [80]:
f,axes=plt.subplots(1,2, figsize=(12,3))
# distribution of the Social support for 2018
sns.distplot(df18['Social_Support'], label="Social Support 2018", ax=axes[0])
# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Social_Support'], label="Social Support - 2018", ax=axes[0])
axes[0].set_title("Distribution of Social Support for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(df_years['Social_Support'].dropna(), label="Social_Support - all years", ax=axes[1])
sns.distplot(c18['Social_Support'], label="Social_Support - 2018",ax=axes[1])
axes[1].set_title("Distribution of Social_Support for 2012 to 2018");
#axes[1].legend();
In [81]:
# plot the distribution of life satisfaction variable for each of the years 2014 to 2018
sns.kdeplot(c14['Social_Support'], label="2014")
sns.kdeplot(c15['Social_Support'], label="2015")
sns.kdeplot(c16['Social_Support'], label="2016")
sns.kdeplot(c17['Social_Support'], label="2017")
sns.kdeplot(c18['Social_Support'], label="2018")
plt.title("Distribution of Social_Support' for each year 2014 to 2017");
In [82]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Social_Support'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Social_Support'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Social_Support'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Social Support variable for the 3 clusters for 2017");
plt.legend();

The distribution of Social Support for all the countries in 2018 is a left-skewed distribution with a long tail to the left. When this is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped although the centres of the distribution and the spread vary widely. Cluster region 0 (green) has social support values centred around different values.

Mean and standard deviation for Social Support by clustered group:

In [83]:
c18.groupby('cluster_pred').mean()
Out[83]:
Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
cluster_pred
0 2018.0 5.322920 9.294176 0.820008 64.951786 3.946429
1 2018.0 4.415971 7.732756 0.669998 54.945161 1.290323
2 2018.0 6.614540 10.401640 0.900387 71.515385 4.102564

Simulate data for the Social Support variable for each of the cluster groups:

In [84]:
SSc0= np.random.normal(c18_0['Social_Support'].mean(),c18_0['Social_Support'].std(),31)
SSc1= np.random.normal(c18_1['Social_Support'].mean(),c18_1['Social_Support'].std(),39)
SSc2= np.random.normal(c18_2['Social_Support'].mean(),c18_2['Social_Support'].std(),56)

Plot the distributions of the simulated Social Support variable for the 3 groups:

In [85]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(SSc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(SSc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(SSc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Social Support for 3 region clusters");

Simulating Social Support variable using non-parametric methods:

When I first looked at simulating the social support variable I considered non-parametric ways of working with the data such as the bootstrap resampling method. Bootstrapping is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples. Samples are constructed by drawing observations from a large data sample one at a time and then returning the observation so that it could be drawn again. Therefore any given observation could be included in the sample more than once while some observations might never be drawn. I referred to a blogpost on Machinelearningmastery.com[23] which outlines the steps to implement it using the scikit-learn resample function[24] which takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling.

Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution function of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples with replacement, of the observed dataset (and of equal size to the observed dataset). The basic idea of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about a sample from resampled data. As the population is unknown, the true error in a sample statistic against its population value is unknown. In bootstrap-resamples, the 'population' is in fact the sample, and this is known; hence the quality of inference of the 'true' sample from resampled data (resampled → sample) is measurable. Wikipedia wiki on Bootstrapping)[25]

The process for building one sample is:

  • choose the size of the sample
  • While the size of the sample is less than the chosen size
    1. Randomly select an observation from the dataset
    2. Add it to the sample The number of repetitions must be large enough that meaningful repetitions can be calculated on the sample.

Based on the information above I use sampling with replacement to simulate the Social support variable using the sample size the same as the original dataset which is 136 observations. numpy.random.choice function can be used for this purpose. The mean of the Social support variable in the dataset can be considered as a single estimate of the mean of the population of social support while the standard deviation is an estimate of the variability. The simplest bootstrap method would involve taking the original data set of N(136) Social support values and sampling from it to form a new sample - the 'resample' or bootstrap sample that is also of size 136. If the bootstrap sample is formed using sampling with replacement from the original data sample with a large enough size then there should be very little chance that the bootstrap sample will be the exact same as the original sample. This process is repeated many times (thousands) and the mean computed for each bootstrap sample to get the bootstrap estimates which can then be plotted on a histogram which is considered an estimate of the shape of the distribution.

Sampling with replacement using np.random.choice:

Here I use a loop to draw multiple random samples with replacements from the dataset. Additionally the means for each sample can be calculated and then all the means from the different samples are plotted to show their distribution but for this project I just need a simulated data set.

In [86]:
# using the Social support data from the dataset.
social=df18['Social_Support']
# create a list to store the means of the samples, set the number of simulations
mean_social_sim, sims = [], 20
# use loop to create 100 samples - takes very long to do 1000 

# plot the distribution of social support variable from the dataset, add title
sns.kdeplot(social, label="Actual Data", shade=True)
for _ in range(sims):
    # draw a random sample from social with replacement and store it in social_sample
    social_sample=np.random.choice(social, replace=True, size=len(social))

    # plot the distribution of each sample
    sns.kdeplot(social_sample)
    # add title
    plt.title("Simulating using random sample with replacement")
    social_mean = np.mean(social_sample)
    # append the mean of each sample to mean_social
    mean_social_sim.append(social_mean)    

The main use of the bootstrap is making inferences about an estimate for a population parameter on sample data. For the purposes of this project I just need to simulate a single dataset. However by calculating the bootstrap means and comparing them to the dataset I am attempting to replicate shows that it is a suitable method here when the data does not follow a particular distribution.

In [87]:
df18['Social_Support'].describe()
Out[87]:
count    136.000000
mean       0.810544
std        0.116332
min        0.484715
25%        0.739719
50%        0.836641
75%        0.905608
max        0.984489
Name: Social_Support, dtype: float64
In [88]:
np.mean(mean_social_sim)
Out[88]:
0.8080325175953262

Simulating Log GDP per capita

The GDP per capita in the World Happiness Report dataset are in purchasing power parity at constant 2011 international dollar prices which are mainly from the World Development Indicators in 2018. Purchasing power parity is necessary when looking to compare GDP per capita between countries which is what the World Happiness Report seeks to do. Nominal GDP would be fine when just looking at a single country. The log of the GDP figures is taken.

Per capita GDP is the Total Gross Domestic Product for a country divided by its population and breaks down a country's GDP per person. As the World Happiness Report states this is considered a universal measure for gauging the prosperity of nations.

The earlier plots showed that the distribution of log GPD per capita in the dataset is not normally distributed. Per capita GDP is generally a unimodal but skewed distribution. There are many regional variations in income across the world. The distribution of Log GPD per capita appears somewhat left skewed.

The scatterplots above showed that it is positively correlated with life satisfaction, social support and healthy life expectancy.

The log of GDP per capita is used to represent the Gross Domestic Product per capita. It is a non-negative real value. Because the log of the GDP per capita figures are used the range is between 6 and 12 although this varies from region to region. The

In [89]:
df['Log_GDP_per_cap'].describe()
Out[89]:
count    1676.000000
mean        9.222456
std         1.185794
min         6.457201
25%         8.304428
50%         9.406206
75%        10.193060
max        11.770276
Name: Log_GDP_per_cap, dtype: float64
In [90]:
f,axes=plt.subplots(1,2, figsize=(12,3))

# distribution of social support from the dataframe with the clusters. some missing values
sns.distplot(c18['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - 2018", ax=axes[0])
axes[0].set_title("Distribution of Log_GDP_per_cap for 2018 only")
#axes[0].legend()
# kdeplot of the Social Support for all years in the extended dataset (WHR Table2.1)
sns.distplot(clusters['Log_GDP_per_cap'].dropna(), label="Log_GDP_per_cap - all years", ax=axes[1])
axes[1].set_title("Distribution of Log_GDP_per_cap for 2012 to 2018");
#axes[1].legend();
In [ ]:
 
In [91]:
# plot the distribution of Log GDP per capita variable for each of the years 2014 to 2018
sns.kdeplot(c14['Log_GDP_per_cap'], label="2014")
sns.kdeplot(c15['Log_GDP_per_cap'], label="2015")
sns.kdeplot(c16['Log_GDP_per_cap'], label="2016")
sns.kdeplot(c17['Log_GDP_per_cap'], label="2017")
sns.kdeplot(c18['Log_GDP_per_cap'], label="2018")
plt.title("Distribution of Log_GDP_per_cap' for each year 2014 to 2017");
In [92]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Log_GDP_per_cap'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Log_GDP_per_cap'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Log_GDP_per_cap'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Log_GDP_per_cap variable for the 3 clusters for 2017");
plt.legend();

When the distribution of Log GDP per capita is broken down by cluster groups of regions, the distribution of each cluster looks more normal shaped although the centres of the distribution and the spread vary widely as with the other variables.

Mean and standard deviation for Log GDP per capita by clustered group:

In [93]:
c18.groupby('cluster_pred').mean()
Out[93]:
Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
cluster_pred
0 2018.0 5.322920 9.294176 0.820008 64.951786 3.946429
1 2018.0 4.415971 7.732756 0.669998 54.945161 1.290323
2 2018.0 6.614540 10.401640 0.900387 71.515385 4.102564

Simulate data for the Log GDP per capita variable for each of the cluster groups:

In [94]:
GDPc0= np.random.normal(c18_0['Log_GDP_per_cap'].mean(),c18_0['Log_GDP_per_cap'].std(),31)
GDPc1= np.random.normal(c18_1['Log_GDP_per_cap'].mean(),c18_1['Log_GDP_per_cap'].std(),39)
GDPc2= np.random.normal(c18_2['Log_GDP_per_cap'].mean(),c18_2['Log_GDP_per_cap'].std(),56)

Plot the distributions of the simulated Log GDP per capita variable for the 3 groups:

In [95]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(GDPc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(GDPc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(GDPc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Log GDP per capita for 3 region clusters");

Healthy Life Expectancy

According to the World Happiness Report, Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository. Some interpolation and exterpolation was used to get the data for the years covered in the 2018 report. As some countries were not covered in the WHO data, other sources were used by the researchers.

The values are non-negative real numbers between 48 and 77 years but this varies from year to year as the statistics showed.

In [96]:
df18['Life_Expectancy'].describe()
Out[96]:
count    132.000000
mean      64.670832
std        6.728247
min       48.200001
25%       59.074999
50%       66.350002
75%       69.075001
max       76.800003
Name: Life_Expectancy, dtype: float64
In [97]:
c18.columns
Out[97]:
Index(['Year', 'Life_Satisfaction', 'Log_GDP_per_cap', 'Social_Support',
       'Life_Expectancy', 'RegionCode', 'cluster_pred'],
      dtype='object')
In [98]:
f,axes=plt.subplots(1,2, figsize=(12,3))

# distribution of Life_Expectancy from the dataframe with the clusters. some missing values
sns.distplot(c18['Life_Expectancy'].dropna(), label="Life_Expectancy - 2018", ax=axes[0])
axes[0].set_title("Distribution of Life_Expectancy for 2018 only")
#axes[0].legend()
# kdeplot of the Life_Expectancy from 2012 to 2018)
sns.distplot(clusters['Life_Expectancy'].dropna(), label="Life_Expectancy - all years", ax=axes[1])
axes[1].set_title("Distribution of Life_Expectancy for 2012 to 2018");
#axes[1].legend();
In [ ]:
 
In [99]:
# plot the distribution of Life_Expectancy variable for each of the years 2014 to 2018
sns.kdeplot(c14['Life_Expectancy'], label="2014")
sns.kdeplot(c15['Life_Expectancy'], label="2015")
sns.kdeplot(c16['Life_Expectancy'], label="2016")
sns.kdeplot(c17['Life_Expectancy'], label="2017")
sns.kdeplot(c18['Life_Expectancy'], label="2018")
plt.title("Distribution of Life_Expectancy' for each year 2014 to 2017");
In [100]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)

# plot for countries in these regions, set axes to use, labels to use
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==0]['Life_Expectancy'], label="Cluster region 0", color="g", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==1]['Life_Expectancy'], label="Cluster region 1", color="b", shade=True)
sns.kdeplot(c18.loc[c18.loc[:,'cluster_pred']==2]['Life_Expectancy'], label="cluster region 2", color="r", shade=True)
#sns.distplot(clusters.loc[clusters.loc[:,'cluster_pred']==3]['Life Ladder'], ax=axes[3], label="cluster region 3", color="y")
# add title
plt.suptitle("Actual Life_Expectancy variable for the 3 clusters for 2018");
plt.legend();

Looking at the Life Expectancy variable by cluster group, the distribution looks normal for the cluster region 1 while the other two groups are less so. There seems to be two peaks in the distribution for cluster group 0 which probably needs to be broken out more.

Mean and standard deviation for Healthy Life Expectancy at birth by clustered group:

In [101]:
c18.groupby('cluster_pred').mean()
Out[101]:
Year Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy RegionCode
cluster_pred
0 2018.0 5.322920 9.294176 0.820008 64.951786 3.946429
1 2018.0 4.415971 7.732756 0.669998 54.945161 1.290323
2 2018.0 6.614540 10.401640 0.900387 71.515385 4.102564

Simulate data for the Life Expectancy variable for each of the cluster groups:

In [102]:
LEc0= np.random.normal(c18_0['Life_Expectancy'].mean(),c18_0['Life_Expectancy'].std(),31)
LEc1= np.random.normal(c18_1['Life_Expectancy'].mean(),c18_1['Life_Expectancy'].std(),39)
LEc2= np.random.normal(c18_2['Life_Expectancy'].mean(),c18_2['Life_Expectancy'].std(),56)

Plot the distributions of the simulated Healthy Life Expectancy at birth variable for the 3 groups:

In [103]:
# set up the matploptlib figure, 
# https://seaborn.pydata.org/examples/distplot_options.html?highlight=tight_layout#distribution-plot-options
sns.set(style="white", palette="muted", color_codes=True)
# life satisfaction distribution for cluster region 0
sns.kdeplot(LEc0,color="y", shade=True)
# life satisfaction distribution for cluster region 1
sns.kdeplot(LEc1, color="skyblue", shade=True)
# life satisfaction distribution for cluster region 2
sns.kdeplot(LEc2, color="pink", shade=True)
# add title
plt.suptitle("Distribution of simulated Life_Expectancy for 3 region clusters");

Create a dataset with the simulated variables.

Now that the individual variables have been simulated, the next step is to assemble the variables into a dataframe and ensure that the overall distributions and relationships are similar to an actual dataset. Because of the variations between the three groups I will create a dataframe for each group and then concatenate the three dataframes together.

Simulated Dataframe for Group0:

Here I create 31 countries and add the simulated data created earlier using numpy.random functions.

In [104]:
# create a list of countries by appending number to 'country'
G0_countries =[]
# use for loop and concatenate number to string
for i in range(31):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G0_countries.append(countryname)
In [105]:
# create a dataframe and add the columns
G0=pd.DataFrame()
G0['Country']=G0_countries
In [106]:
# add columns based on simulations
G0['Life_Satisfaction']=LLc0
G0['Log_GDP_per_cap']=GDPc0
G0['Social_Support']=SSc0
G0['Life_Expectancy']=LEc0
# add group column
G0['Group']='Group0'
In [107]:
G0
Out[107]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 5.306706 8.921793 0.787441 66.048883 Group0
1 Sim_Country_1 5.651307 8.654395 0.809210 66.436883 Group0
2 Sim_Country_2 5.661618 9.308995 0.693635 63.622131 Group0
3 Sim_Country_3 4.861526 10.605480 0.782290 64.910602 Group0
... ... ... ... ... ... ...
27 Sim_Country_27 5.131967 8.575882 0.721815 65.468924 Group0
28 Sim_Country_28 6.445590 8.746669 1.001303 61.517664 Group0
29 Sim_Country_29 5.824301 9.300203 0.848021 62.704594 Group0
30 Sim_Country_30 5.016133 9.535686 0.969046 64.939119 Group0

31 rows × 6 columns

Simulated dataframe for group 1

Here I create a dataframe for the 39 group 1 simulated data points.

In [108]:
# create a list of countries by appending number to 'country'
G1_countries =[]
# use for loop and concatenate number to string
for i in range(39):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G1_countries.append(countryname)
In [109]:
# create a dataframe and add the columns
G1=pd.DataFrame()
G1['Country']=G1_countries
# add columns based on simulations
G1['Life_Satisfaction']=LLc1
G1['Log_GDP_per_cap']=GDPc1
G1['Social_Support']=SSc1
G1['Life_Expectancy']=LEc1
# add group column
G1['Group']='Group1'
G1
Out[109]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 4.671344 6.724028 0.641005 50.339410 Group1
1 Sim_Country_1 4.761840 7.475304 0.676521 54.053141 Group1
2 Sim_Country_2 5.851653 8.594777 0.591575 52.256857 Group1
3 Sim_Country_3 4.299704 9.815454 0.723505 54.739286 Group1
... ... ... ... ... ... ...
35 Sim_Country_35 4.904859 8.355346 0.576573 58.568376 Group1
36 Sim_Country_36 5.592356 6.747997 0.658175 59.476083 Group1
37 Sim_Country_37 4.909638 7.444697 0.721786 51.673735 Group1
38 Sim_Country_38 4.673152 7.425028 0.586048 55.680735 Group1

39 rows × 6 columns

In [ ]:
 

Simulated dataframe for group 2

Here I create a dataframe for the 56 group 1 simulated data points.

In [110]:
# create a list of countries by appending number to 'country'
G2_countries =[]
# use for loop and concatenate number to string
for i in range(56):
    countryname = 'Sim_Country_'+str(i)
    # append countryname  to list of countries
    G2_countries.append(countryname)
In [111]:
# create a dataframe and add the columns
G2=pd.DataFrame()
G2['Country']=G2_countries
# add columns based on simulations
G2['Life_Satisfaction']=LLc2
G2['Log_GDP_per_cap']=GDPc2
G2['Social_Support']=SSc2
G2['Life_Expectancy']=LEc2
# add group column
G2['Group']='Group2'
G2
Out[111]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 6.650582 10.486408 0.957040 71.204546 Group2
1 Sim_Country_1 5.948507 10.530889 0.832581 70.661832 Group2
2 Sim_Country_2 7.115427 9.467759 0.945339 69.148849 Group2
3 Sim_Country_3 5.896939 11.135984 0.843904 73.312323 Group2
... ... ... ... ... ... ...
52 Sim_Country_52 6.862937 10.019423 0.928102 71.754594 Group2
53 Sim_Country_53 5.862264 10.268521 0.907081 71.617507 Group2
54 Sim_Country_54 6.288760 11.645804 0.942329 76.936191 Group2
55 Sim_Country_55 6.509125 10.598213 0.901340 70.859460 Group2

56 rows × 6 columns

Create one dataframe from the three group dataframes:

Now I add the three dataframes together using the pandas pd.concat function.

In [112]:
# using the pandas concat function to add the dataframes together.
result=pd.concat([G0,G1,G2])
result.head()
Out[112]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 5.306706 8.921793 0.787441 66.048883 Group0
1 Sim_Country_1 5.651307 8.654395 0.809210 66.436883 Group0
2 Sim_Country_2 5.661618 9.308995 0.693635 63.622131 Group0
3 Sim_Country_3 4.861526 10.605480 0.782290 64.910602 Group0
4 Sim_Country_4 5.348437 9.782862 0.785385 66.617080 Group0
In [ ]:
 
In [113]:
sns.pairplot(result, hue='Group', palette="colorblind");
In [114]:
data2018 =c18.drop(['Year', 'RegionCode'], axis=1)
data2018.groupby('cluster_pred').count()
Out[114]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
cluster_pred
0 56 56 56 56
1 31 31 31 31
2 39 39 39 39
In [115]:
result.groupby('Group').describe()
Out[115]:
Life_Satisfaction Log_GDP_per_cap ... Social_Support Life_Expectancy
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Group
Group0 31.0 5.427230 0.576599 4.029621 5.049565 5.269363 5.799328 6.668034 31.0 9.285359 ... 0.870293 1.053083 31.0 65.301443 2.677880 57.921714 63.916421 65.468924 66.804120 69.822780
Group1 39.0 4.623613 0.684148 3.280790 4.271048 4.548910 4.955297 6.220091 39.0 7.982179 ... 0.698752 0.808642 39.0 55.698451 2.796176 48.970274 54.109545 55.680735 57.816293 61.551827
Group2 56.0 6.662692 0.790043 5.351351 5.998424 6.644296 7.096606 8.860939 56.0 10.469700 ... 0.928630 1.023850 56.0 71.560481 2.236689 63.705882 70.587843 71.637011 72.599097 77.238667

3 rows × 32 columns

In [116]:
data2018
data2018['cluster_pred']=data2018["cluster_pred"].replace(0,'Group2')
data2018['cluster_pred']=data2018["cluster_pred"].replace(1,'Group0')
data2018['cluster_pred']=data2018["cluster_pred"].replace(2,'Group1')
In [117]:
sns.pairplot(data2018, hue='cluster_pred', palette="bright", hue_order=['Group0','Group1','Group2']);

Simulated Life Satisfaction / Life Ladder variable:

Compare statistics from actual data for 2018 to simulated data:

In [118]:
data2018.groupby('cluster_pred').mean()
Out[118]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
cluster_pred
Group0 4.415971 7.732756 0.669998 54.945161
Group1 6.614540 10.401640 0.900387 71.515385
Group2 5.322920 9.294176 0.820008 64.951786
In [119]:
result.groupby('Group').mean()
Out[119]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Group
Group0 5.427230 9.285359 0.811026 65.301443
Group1 4.623613 7.982179 0.649658 55.698451
Group2 6.662692 10.469700 0.898550 71.560481

Plot actual vs simulated Life Satisfaction:

In [120]:
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Satisfaction'], ax=axes[0,0], shade=True, label="simulated")
sns.kdeplot(data2018['Life_Satisfaction'], ax=axes[0,0], shade=True,label="actual")
# set axes title
axes[0,0].set_title("Distribution of Life Ladder")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Log_GDP_per_cap'], ax=axes[0,1], shade=True, label="simulated")
sns.kdeplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], shade=True, label="actual")
# axes title
axes[0,1].set_title("Distribution of Log GDP per capita")
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="simulated")
sns.kdeplot(data2018['Social_Support'].dropna(), ax=axes[1,0], shade=True, label="simulated")
axes[1,0].set_title("Distribution of Social Support");
# plot the distributions of each of the main variables. first simulated and then actual for 2018
sns.kdeplot(result['Life_Expectancy'], ax=axes[1,1],shade=True, label="Simulated");
sns.kdeplot(data2018['Life_Expectancy'], ax=axes[1,1],shade=True, label="Simulated");
axes[1,1].set_title("Distribution of Simulated Life expectancy");  
plt.tight_layout();

They are not perfect matches but it will do for now! Every time a numpy.random function is run it will produce different numbers from the same distribution unless the seed is set. Now I will just check the correlations and any relationships in the data.

Comparing Relationships between the variables.

Relationships between the simulated variables:

In [121]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=result, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=result, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=result, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();

Relationships between the actual variables:

In [122]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,4))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=data2018, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=data2018, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=data2018, ax=axes[2])
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90)
plt.tight_layout();
In [123]:
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(result['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Simulated Life Ladder");
sns.distplot(result['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Simulated Income");
sns.distplot(result['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Simulated Social Support");
sns.distplot(result['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Simulated Life expectancy");  
plt.tight_layout();
In [124]:
# set up the subplots, style and palette
sns.set(style="ticks", palette="colorblind")
f,axes=plt.subplots(2,2, figsize=(9,9))
# plot the distributions of each of the main variables. At global level first. Look at Regional after
sns.distplot(data2018['Life_Satisfaction'].dropna(), ax=axes[0,0], bins=10, color="r");
# set axes title
axes[0,0].set_title("Distribution of Actual Life Ladder");
sns.distplot(data2018['Log_GDP_per_cap'].dropna(), ax=axes[0,1], bins=10, color="g");
axes[0,1].set_title("Distribution of Actual Income");
sns.distplot(data2018['Social_Support'].dropna(), ax=axes[1,0], bins=10, color="b");
axes[1,0].set_title("Distribution of Actual Social Support");
sns.distplot(data2018['Life_Expectancy'].dropna(), ax=axes[1,1], bins=10, color="y");
axes[1,1].set_title("Distribution of Actual Life expectancy");  
plt.tight_layout();
In [125]:
data2018
Out[125]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy cluster_pred
10 2.694303 7.494588 0.507516 52.599998 Group0
21 5.004403 9.412399 0.683592 68.699997 Group1
28 5.043086 9.557952 0.798651 65.900002 Group2
45 5.792797 9.809972 0.899912 68.800003 Group1
... ... ... ... ... ...
1654 5.005663 9.270281 0.886882 66.500000 Group2
1667 5.295547 8.783416 0.831945 67.900002 Group2
1690 4.041488 8.223958 0.717720 55.299999 Group0
1703 3.616480 7.553395 0.775388 55.599998 Group0

126 rows × 5 columns

The simulated data:

The final simulated dataframe containing the 6 variables is called result.

In [126]:
result
Out[126]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 5.306706 8.921793 0.787441 66.048883 Group0
1 Sim_Country_1 5.651307 8.654395 0.809210 66.436883 Group0
2 Sim_Country_2 5.661618 9.308995 0.693635 63.622131 Group0
3 Sim_Country_3 4.861526 10.605480 0.782290 64.910602 Group0
... ... ... ... ... ... ...
52 Sim_Country_52 6.862937 10.019423 0.928102 71.754594 Group2
53 Sim_Country_53 5.862264 10.268521 0.907081 71.617507 Group2
54 Sim_Country_54 6.288760 11.645804 0.942329 76.936191 Group2
55 Sim_Country_55 6.509125 10.598213 0.901340 70.859460 Group2

126 rows × 6 columns

In [ ]:
 

Using the multivariate normal distribution function.

An alternative way to simulating the dataset could be to use the multivariate normal distribution function. While the variables are not all normally distributed at the dataset level, when broken down by the regions they do appear to be gaussian with each groups distributions having very different centres and spreads to each other.

There is a numpy.random.multivariate function that can generated correlated data.

I found a thread at stack overflow referencing how to use the multivariate function to [generate correlated data] (https://stackoverflow.com/a/16026231)[26] Here I filter the data for each group and calculate the covariance on the first three variables.

In [127]:
data2018.groupby('cluster_pred').describe()
Out[127]:
Life_Satisfaction Log_GDP_per_cap ... Social_Support Life_Expectancy
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
cluster_pred
Group0 31.0 4.415971 0.730419 2.694303 3.997857 4.379262 4.924668 5.819827 31.0 7.732756 ... 0.739598 0.864215 31.0 54.945161 2.912598 48.200001 53.450001 55.299999 57.100000 59.599998
Group1 39.0 6.614540 0.740771 5.004403 6.109656 6.665904 7.205219 7.858107 39.0 10.401640 ... 0.930485 0.965962 39.0 71.515385 2.043286 68.300003 69.450001 72.300003 73.199997 75.000000
Group2 56.0 5.322920 0.704197 3.561047 4.840304 5.296465 5.902972 6.626592 56.0 9.294176 ... 0.882849 0.984489 56.0 64.951786 2.692346 58.500000 63.574999 65.950001 67.025000 68.199997

3 rows × 32 columns

In [128]:
# select clusters for each group
group0 =data2018.loc[data2018.cluster_pred=='Group0']
group1 =data2018.loc[data2018.cluster_pred=='Group1']
group2 =data2018.loc[data2018.cluster_pred=='Group2']
In [129]:
# get the covariances for each cluster group
group0cov = group0.iloc[:,:4].cov()
group1cov = group1.iloc[:,:4].cov()
group2cov = group2.iloc[:,:4].cov()
In [130]:
# get the mean of the variables for each group:
group0mean=group0.iloc[:,:4].mean()
group1mean=group1.iloc[:,:4].mean()
group2mean=group2.iloc[:,:4].mean()

Group0

In [131]:
# use number of samples based on number of countries in the grup
num_samples=31
# using group0 means of each variable
mu=group0mean
# using group0 covariances for each grouo
r=group0cov
# generate the random samples using `multivariate` function
g0=np.random.multivariate_normal(mu,r,size=num_samples)

Group1

In [132]:
# use number of samples based on number of countries in the grup
num_samples=39
# using group0 means of each variable
mu=group1mean
# using group0 covariances for each grouo
r=group1cov
# generate the random samples using `multivariate` function
g1=np.random.multivariate_normal(mu,r,size=num_samples)
g1.shape
Out[132]:
(39, 4)

Group2

In [133]:
# use number of samples based on number of countries in the grup
num_samples=56
# using group0 means of each variable
mu=group1mean
# using group0 covariances for each grouo
r=group1cov
# generate the random samples using `multivariate` function
g2=np.random.multivariate_normal(mu,r,size=num_samples)
g2.shape
Out[133]:
(56, 4)
In [134]:
# create dataframe from multivariate normal, u
MV0 =pd.DataFrame(g0)
# using countries from earlier as simulated earlier
MV0['Country']=G0_countries
# add a group column
MV0['Group']='group0'
# create dataframe
MV1 =pd.DataFrame(g1)
# using countries from earlier as simulated earlier
MV1['Country']=G1_countries
# add a group column
MV1['Group']='group1'
# create dataframe
MV2 =pd.DataFrame(g2)
# using countries from earlier as simulated earlier
MV2['Country']=G2_countries
# add a group column
MV2['Group']='group2'
# concat the three dataframes
MV = pd.concat([MV0,MV1,MV2])
# get the column names
MV.columns
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html#pandas-dataframe-rename
# rename the columns
MV=MV.rename(columns={0: "Life_Satisfaction", 1: "Log_GDP_per_cap",2:"Social_Support",3:"Life_Expectancy"})
MV
Out[134]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Country Group
0 4.784215 10.202274 0.753394 57.429434 Sim_Country_0 group0
1 6.244088 8.875449 0.754857 49.449422 Sim_Country_1 group0
2 4.650546 6.575641 0.638315 60.905908 Sim_Country_2 group0
3 4.671591 9.285341 0.579979 55.996092 Sim_Country_3 group0
... ... ... ... ... ... ...
52 6.859352 10.595678 0.956506 73.005038 Sim_Country_52 group2
53 5.800952 10.107949 0.875437 69.808952 Sim_Country_53 group2
54 6.320052 9.724057 0.870894 70.904885 Sim_Country_54 group2
55 7.067898 10.041613 0.896035 70.222111 Sim_Country_55 group2

126 rows × 6 columns

In [135]:
MV.corr()
Out[135]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Life_Satisfaction 1.000000 0.874035 0.882631 0.804469
Log_GDP_per_cap 0.874035 1.000000 0.908389 0.894697
Social_Support 0.882631 0.908389 1.000000 0.861688
Life_Expectancy 0.804469 0.894697 0.861688 1.000000
In [136]:
data2018.iloc[:,:4].corr()
Out[136]:
Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy
Life_Satisfaction 1.000000 0.762152 0.731221 0.740537
Log_GDP_per_cap 0.762152 1.000000 0.792845 0.852848
Social_Support 0.731221 0.792845 1.000000 0.757044
Life_Expectancy 0.740537 0.852848 0.757044 1.000000
In [137]:
# set up the matplotlib figure
f,axes=plt.subplots(1,3, figsize=(12,3))
# regression plot with x as a predictor of y variable. set the axes.
sns.regplot(y ="Life_Satisfaction",x="Log_GDP_per_cap", data=MV, ax=axes[0])
sns.regplot(y ="Life_Satisfaction",x="Social_Support", data=MV, ax=axes[1])
sns.regplot(y ="Life_Satisfaction",x="Life_Expectancy", data=MV, ax=axes[2])
# change the axis limited for Life expectancy
axes[2].set_ylim(0,10); axes[2].set_xlim(40,90);

4. Conclusion

The aim of the project was to create a data set by simulating a real-world phenomenon of our choosing. Then rather than collect data related to the phenomenon, model and synthesis such data using Python and the numpy.random package. In this notebook I first analysed an actual dataset for the phenomenon of World Happiness in order to understand the type of variables and the type of distributions they were likely to have come from and also to see how they were related to each other.

I then simulated the dataset, mainly using the numpy.random package but exploring some functionality from the scikit-learn package.

Finally I compared the results of the simulation to a real world dataset.

I believe the synthesised dataset closely matches the actual dataset. The results will change each time the code is run as the random seed is not set. In reality data samples would also vary from sample to sample due to randomness.

The dataset was a little more complicated that it first appeared to be, given the regional variation and also variation within regions. Given that the number of datapoints for a dataset such as this is limited, the number of datapoints that could be simulated is limited but I am satisfied with the results. Overall this project was a very good learning experience. It was also very eye-opening to see the levels of inequality across the globe.

The last section below lists the references used for this project.

The simulated dataset:

pd.options.display.max_rows=8

In [138]:
# change the max numbers of rows to display the simulated dataset.
pd.options.display.max_rows=130
result
Out[138]:
Country Life_Satisfaction Log_GDP_per_cap Social_Support Life_Expectancy Group
0 Sim_Country_0 5.306706 8.921793 0.787441 66.048883 Group0
1 Sim_Country_1 5.651307 8.654395 0.809210 66.436883 Group0
2 Sim_Country_2 5.661618 9.308995 0.693635 63.622131 Group0
3 Sim_Country_3 4.861526 10.605480 0.782290 64.910602 Group0
4 Sim_Country_4 5.348437 9.782862 0.785385 66.617080 Group0
5 Sim_Country_5 5.128717 9.817910 0.760056 57.921714 Group0
6 Sim_Country_6 5.269363 10.109246 0.761935 68.399157 Group0
7 Sim_Country_7 6.501682 8.112212 0.754724 64.210712 Group0
8 Sim_Country_8 4.775925 9.228186 0.698084 69.355376 Group0
9 Sim_Country_9 5.774356 8.475127 0.772971 66.751545 Group0
10 Sim_Country_10 4.878215 9.583725 1.053083 64.910144 Group0
11 Sim_Country_11 4.971545 9.843221 0.876955 66.535709 Group0
12 Sim_Country_12 5.041377 8.237230 0.935270 65.121138 Group0
13 Sim_Country_13 5.157232 9.935217 0.914311 66.977529 Group0
14 Sim_Country_14 5.105177 9.051786 0.822069 65.698418 Group0
15 Sim_Country_15 5.529221 8.909552 0.808504 69.575341 Group0
16 Sim_Country_16 6.112835 10.098701 0.645022 67.533733 Group0
17 Sim_Country_17 5.085967 9.902266 0.828094 69.822780 Group0
18 Sim_Country_18 6.106302 8.647881 0.892219 63.468165 Group0
19 Sim_Country_19 6.668034 8.887026 0.729624 64.394763 Group0
20 Sim_Country_20 4.029621 10.779107 0.683975 66.856696 Group0
21 Sim_Country_21 5.057753 7.962426 0.754665 65.793576 Group0
22 Sim_Country_22 5.851524 10.127415 0.908542 62.503728 Group0
23 Sim_Country_23 6.002350 8.352002 0.863631 68.506211 Group0
24 Sim_Country_24 5.654982 8.802262 0.693322 64.873640 Group0
25 Sim_Country_25 5.253160 9.731938 0.729333 61.137760 Group0
26 Sim_Country_26 5.041206 9.819737 0.857276 61.731023 Group0
27 Sim_Country_27 5.131967 8.575882 0.721815 65.468924 Group0
28 Sim_Country_28 6.445590 8.746669 1.001303 61.517664 Group0
29 Sim_Country_29 5.824301 9.300203 0.848021 62.704594 Group0
30 Sim_Country_30 5.016133 9.535686 0.969046 64.939119 Group0
0 Sim_Country_0 4.671344 6.724028 0.641005 50.339410 Group1
1 Sim_Country_1 4.761840 7.475304 0.676521 54.053141 Group1
2 Sim_Country_2 5.851653 8.594777 0.591575 52.256857 Group1
3 Sim_Country_3 4.299704 9.815454 0.723505 54.739286 Group1
4 Sim_Country_4 4.526663 8.752159 0.649671 53.802220 Group1
5 Sim_Country_5 4.665746 7.702158 0.580355 54.713613 Group1
6 Sim_Country_6 3.280790 9.497792 0.762002 52.950554 Group1
7 Sim_Country_7 4.332600 6.549067 0.544959 57.401968 Group1
8 Sim_Country_8 5.037402 7.993202 0.769817 57.928439 Group1
9 Sim_Country_9 5.279576 5.872378 0.625565 54.165948 Group1
10 Sim_Country_10 4.079668 9.248135 0.699928 48.970274 Group1
11 Sim_Country_11 3.434483 7.180820 0.736100 53.631864 Group1
12 Sim_Country_12 4.730052 7.865161 0.640194 61.551827 Group1
13 Sim_Country_13 4.417993 7.366929 0.588115 58.964043 Group1
14 Sim_Country_14 4.777392 8.975000 0.557432 57.160128 Group1
15 Sim_Country_15 5.205033 9.592819 0.535112 58.096569 Group1
16 Sim_Country_16 4.500506 9.300303 0.727104 57.704147 Group1
17 Sim_Country_17 6.082710 8.887950 0.574452 57.281150 Group1
18 Sim_Country_18 4.474548 6.565864 0.578081 57.996228 Group1
19 Sim_Country_19 4.242393 8.207752 0.678505 56.065283 Group1
20 Sim_Country_20 4.304773 7.727857 0.576695 55.615156 Group1
21 Sim_Country_21 5.312878 7.398653 0.697577 58.870289 Group1
22 Sim_Country_22 4.152087 7.470080 0.664083 54.883122 Group1
23 Sim_Country_23 3.511708 7.626213 0.630826 54.703135 Group1
24 Sim_Country_24 4.087854 8.349840 0.808642 55.791146 Group1
25 Sim_Country_25 5.516232 6.010996 0.751711 59.558816 Group1
26 Sim_Country_26 4.374513 8.275585 0.641192 53.109696 Group1
27 Sim_Country_27 3.513561 7.199095 0.668894 54.240348 Group1
28 Sim_Country_28 4.548910 9.169545 0.554520 57.026642 Group1
29 Sim_Country_29 5.000956 7.773946 0.671258 50.538825 Group1
30 Sim_Country_30 4.800033 7.991752 0.609611 57.686102 Group1
31 Sim_Country_31 4.019863 9.271410 0.559175 55.289283 Group1
32 Sim_Country_32 3.906991 8.715052 0.687227 57.173227 Group1
33 Sim_Country_33 6.220091 8.392975 0.736993 58.125733 Group1
34 Sim_Country_34 4.318359 7.791876 0.655685 54.456182 Group1
35 Sim_Country_35 4.904859 8.355346 0.576573 58.568376 Group1
36 Sim_Country_36 5.592356 6.747997 0.658175 59.476083 Group1
37 Sim_Country_37 4.909638 7.444697 0.721786 51.673735 Group1
38 Sim_Country_38 4.673152 7.425028 0.586048 55.680735 Group1
0 Sim_Country_0 6.650582 10.486408 0.957040 71.204546 Group2
1 Sim_Country_1 5.948507 10.530889 0.832581 70.661832 Group2
2 Sim_Country_2 7.115427 9.467759 0.945339 69.148849 Group2
3 Sim_Country_3 5.896939 11.135984 0.843904 73.312323 Group2
4 Sim_Country_4 5.581362 11.543183 0.907253 69.451951 Group2
5 Sim_Country_5 7.090332 10.213221 0.861714 63.705882 Group2
6 Sim_Country_6 7.570644 9.985238 0.972682 71.646964 Group2
7 Sim_Country_7 6.936207 10.711594 0.930214 67.391960 Group2
8 Sim_Country_8 6.683955 10.716176 0.911880 77.238667 Group2
9 Sim_Country_9 7.510457 9.937862 0.840780 70.116599 Group2
10 Sim_Country_10 7.159937 10.849546 0.819747 73.947126 Group2
11 Sim_Country_11 6.670528 9.869081 1.004287 70.925823 Group2
12 Sim_Country_12 5.691150 10.361646 0.908960 71.747314 Group2
13 Sim_Country_13 6.039334 9.247191 0.958541 72.298442 Group2
14 Sim_Country_14 6.454486 10.316801 0.861675 71.337509 Group2
15 Sim_Country_15 6.331236 11.105422 0.927258 72.221305 Group2
16 Sim_Country_16 5.351351 10.196518 0.813308 71.521464 Group2
17 Sim_Country_17 6.968510 10.605107 0.805385 74.367682 Group2
18 Sim_Country_18 6.848621 10.714149 0.903825 73.188295 Group2
19 Sim_Country_19 6.421923 9.681545 0.907251 72.737426 Group2
20 Sim_Country_20 6.944172 10.897829 0.893933 71.432948 Group2
21 Sim_Country_21 6.104733 11.312868 0.859266 71.270826 Group2
22 Sim_Country_22 5.971745 11.616137 0.910276 72.727849 Group2
23 Sim_Country_23 8.433459 10.599189 0.894080 74.493444 Group2
24 Sim_Country_24 6.432378 10.786862 0.888852 72.331226 Group2
25 Sim_Country_25 6.213734 10.604391 0.915690 68.635093 Group2
26 Sim_Country_26 6.012344 10.081934 0.813485 72.344020 Group2
27 Sim_Country_27 6.859502 10.534599 0.910122 70.365875 Group2
28 Sim_Country_28 7.972533 10.308342 0.908386 69.974966 Group2
29 Sim_Country_29 6.226960 10.723157 0.909400 71.627059 Group2
30 Sim_Country_30 5.958528 10.651351 0.938047 71.121344 Group2
31 Sim_Country_31 5.856611 11.501646 0.900802 73.917783 Group2
32 Sim_Country_32 5.838465 10.547440 0.972957 72.531078 Group2
33 Sim_Country_33 8.860939 10.577834 0.920326 71.458597 Group2
34 Sim_Country_34 7.026720 10.560634 1.023850 72.052767 Group2
35 Sim_Country_35 7.927715 10.396347 0.958713 73.626023 Group2
36 Sim_Country_36 5.767130 10.177779 0.940705 69.391173 Group2
37 Sim_Country_37 5.877582 10.433927 0.824850 72.478332 Group2
38 Sim_Country_38 6.007317 10.808531 0.818562 67.894686 Group2
39 Sim_Country_39 6.752764 10.738708 0.968577 74.145574 Group2
40 Sim_Country_40 7.317622 10.293289 0.798077 70.073409 Group2
41 Sim_Country_41 5.573241 10.029195 0.897342 70.913134 Group2
42 Sim_Country_42 7.610233 10.279829 0.889985 72.556180 Group2
43 Sim_Country_43 7.061405 10.079536 0.866913 71.741196 Group2
44 Sim_Country_44 7.011454 10.081431 0.916591 67.124498 Group2
45 Sim_Country_45 8.151749 9.806227 0.946049 71.236996 Group2
46 Sim_Country_46 7.901804 10.574144 0.878869 72.038218 Group2
47 Sim_Country_47 6.227294 9.934210 0.906127 69.434784 Group2
48 Sim_Country_48 7.206467 10.495159 0.880406 73.246613 Group2
49 Sim_Country_49 7.351988 10.415570 0.900912 70.182151 Group2
50 Sim_Country_50 6.638009 9.901553 0.783736 73.867776 Group2
51 Sim_Country_51 5.569607 10.346279 0.890414 71.811613 Group2
52 Sim_Country_52 6.862937 10.019423 0.928102 71.754594 Group2
53 Sim_Country_53 5.862264 10.268521 0.907081 71.617507 Group2
54 Sim_Country_54 6.288760 11.645804 0.942329 76.936191 Group2
55 Sim_Country_55 6.509125 10.598213 0.901340 70.859460 Group2

5. References

  • [1] Programming for Data Analytis Project Instructions.
  • [14] Eurostat.
      https://ec.europa.eu/eurostat

End