Data Simulation

The goal of this project was to simulate a dataset. Data is simulated for a number of reasons. Monte Carlo simulations model real-world problems using repeated random sampling, while simulated data is also very useful for learning and demonstration purposes. Data can be simulated before the real-world data is collected to help identify the type of tests and programs that need to be run. Collecting data requires time and money, whereas data can be simulated easily using computer programs.

Statistical analysis can be performed on the simulated data in advance of collecting the real data, and this process can be repeated as many times as needed. By studying simulated data you can become more familiar with the different kinds of data distributions and be in a better position to make decisions about the data, such as how to measure it and how much is required. Simulations produce multiple sample outcomes. Experiments can be run by modifying inputs and seeing how this changes the output. The process of generating a random sample can be repeated many times, which shows how often you would expect to get a particular outcome. Repeating the process gives multiple outcomes, which can then be averaged across all simulations.
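As a minimal sketch of repeating a random experiment and changing an input, the example below simulates 100 coin flips many times and estimates how often 60 or more heads appear. The coin-flip scenario, the function name `proportion_at_least` and all parameter values are assumptions chosen for illustration; they are not part of the project itself.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for reproducibility

def proportion_at_least(threshold, n_flips=100, p_heads=0.5, n_sims=20_000):
    """Estimate P(heads >= threshold) by repeating the experiment n_sims times."""
    # Each simulated experiment is the number of heads in n_flips flips
    heads = rng.binomial(n=n_flips, p=p_heads, size=n_sims)
    # The proportion of experiments reaching the threshold estimates the probability
    return (heads >= threshold).mean()

fair = proportion_at_least(60)                 # fair coin
biased = proportion_at_least(60, p_heads=0.6)  # same experiment, modified input

print(f"fair coin:   {fair:.3f}")
print(f"biased coin: {biased:.3f}")
```

Changing `p_heads` is the "modified input" here; comparing the two estimates shows how the output responds to it.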

When data is collected, it is often only a small sample from the overall population of interest. The researchers of the World Happiness Reports did not collect all the data about the variables of interest. The typical sample size used per country was 1000 people, while some countries had more than one survey per year and others had fewer. A sample is a subset of numbers from a distribution, and the bigger the sample size, the more it resembles the distribution from which it is drawn. Depending on the distribution the data is drawn from, some numbers will occur more often than others. Sample statistics are descriptions of the data that can be calculated from the sample dataset and then used to make inferences about the population. The population parameters are of most interest. These are the characteristics of the actual population from which a sample dataset is taken, and samples are used to estimate them.

The sample mean is the mean of the numbers in the sample, while the population mean is the mean of the entire population, but it is not always possible to study the entire population directly. The law of large numbers describes how, as the sample size increases, the sample mean gets closer to the true population mean. More generally, the more data that is collected, the closer the sample statistics tend to get to the true population parameters.
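The law of large numbers can be illustrated with a short simulation. This is a sketch only: the population (normal with mean 5.0 and standard deviation 2.0), the sample sizes and the helper name `sample_mean_error` are assumptions made for the example, not values taken from the World Happiness data.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility

# Hypothetical population parameters (assumed for illustration)
POP_MEAN, POP_SD = 5.0, 2.0

def sample_mean_error(n):
    """Draw a sample of size n and return |sample mean - population mean|."""
    sample = rng.normal(loc=POP_MEAN, scale=POP_SD, size=n)
    return abs(sample.mean() - POP_MEAN)

# Compare the estimation error for increasingly large samples
errors = {n: sample_mean_error(n) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>7}: |sample mean - population mean| = {err:.4f}")
```

Because each sample is random, the error is not guaranteed to shrink at every step, but it tends to become small for the largest sample.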

The sampling distribution of the sample mean is the distribution you get when you collect many samples from the population and calculate the mean of each sample. If you know the type of distribution, you can sample some data from it, calculate the mean (or any other sample statistic) of each sample, and plot the results as a histogram to show the distribution of the sample statistic. Sampling distributions can tell you what to expect from your data.

Simulation can be used to find out what a sample looks like if it comes from a particular distribution. This information can be used to make inferences about whether an observed sample came from that distribution or not. The sampling distribution of a statistic varies as a function of sample size. Small samples taken from the distribution will probably have sample statistics, such as sample means, that vary quite a bit from sample to sample, and therefore the sampling distribution will be quite wide. Larger samples are more likely to have similar statistics and a narrower sampling distribution.
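The sketch below simulates the sampling distribution of the mean for a small and a large sample size, then asks how often a simulated mean is at least as extreme as an observed one. The standard normal population, the sample sizes and the value of `observed_mean` are all hypothetical and chosen only to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility

def simulated_means(n, n_samples=10_000):
    """Means of n_samples samples of size n from a standard normal (assumed)."""
    return rng.normal(loc=0.0, scale=1.0, size=(n_samples, n)).mean(axis=1)

means_n5 = simulated_means(5)
means_n100 = simulated_means(100)
print(f"spread of sample means, n=5:   {means_n5.std():.3f}")
print(f"spread of sample means, n=100: {means_n100.std():.3f}")

# Hypothetical observed mean from a real sample of size 100
observed_mean = 0.4
# Proportion of simulated means at least as far from zero as the observed one
prop_as_extreme = (np.abs(means_n100) >= abs(observed_mean)).mean()
print(f"proportion at least as extreme: {prop_as_extreme:.4f}")
```

A very small proportion suggests that a mean like the observed one would rarely arise if the sample really came from the assumed distribution.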

The mean of the sampling distribution of the sample mean equals the population mean, and as the size of the samples increases, the individual sample means cluster more tightly around it. The sampling distribution is itself a distribution and has some variance. The standard deviation of the sampling distribution is known as the standard error. As the sample size increases, the standard error of the sample mean decreases. According to the central limit theorem, as the sample size increases the sampling distribution of the mean begins to look more like a normal distribution, whatever the shape of the population distribution (provided the population has a finite variance).
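Both claims can be checked numerically. The sketch below assumes a skewed population (exponential with mean 1 and standard deviation 1) and compares the empirical standard error with the usual formula, population standard deviation divided by the square root of the sample size. It also tracks the skewness of the sample means, which should move towards zero as the distribution becomes more normal. The population, the sample sizes and the helper `skewness` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed, for reproducibility

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# Assumed population: exponential with mean 1, so its standard deviation is 1
SIGMA = 1.0
results = {}
for n in (2, 30, 500):
    # 10,000 samples of size n; one mean per sample
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    results[n] = {
        "empirical_se": means.std(ddof=1),
        "theoretical_se": SIGMA / np.sqrt(n),
        "skewness": skewness(means),
    }

for n, r in results.items():
    print(f"n={n:>3}: SE={r['empirical_se']:.3f} "
          f"(theory {r['theoretical_se']:.3f}), skewness={r['skewness']:.2f}")
```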

Other things being equal, large experiments are considered more reliable than smaller ones. If you take a big enough sample, the sample mean gives a very good estimate of the population mean.

When simulating a random variable, you first need to define the possible outcomes of the random variable. To do this you can use the sample statistics from the sample dataset. Because the simulated data is generated from known parameters, you know what the analysis should recover, which makes coding errors easier to identify.

Resampling methods are another way of simulating data and involve repeatedly drawing new samples from the observed data. Bootstrap resampling, which draws with replacement, is one of the most common methods.
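A minimal sketch of bootstrap resampling follows. The "observed" sample is itself simulated here as a stand-in for real data, and the function name `bootstrap_means`, the number of resamples and the percentile interval are illustrative choices rather than the method used in this project.

```python
import numpy as np

rng = np.random.default_rng(seed=3)  # arbitrary seed, for reproducibility

# Stand-in for an observed sample (simulated purely for illustration)
observed = rng.normal(loc=5.0, scale=2.0, size=200)

def bootstrap_means(data, n_resamples=5_000):
    """Resample data with replacement and return the mean of each resample."""
    # Each row of idx picks len(data) positions, with repeats allowed
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)

boot = bootstrap_means(observed)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"sample mean:             {observed.mean():.3f}")
print(f"bootstrap SE:            {boot.std(ddof=1):.3f}")
print(f"95% percentile interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The spread of the bootstrap means approximates the standard error of the sample mean using only the observed sample, without assuming a particular population distribution.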

For this section I referred to an online book called Answering Questions with Data (Textbook): Introductory Statistics for Psychology Students by Matthew J C Crump [6].

Tech used: