Introduction to the Data Simulation project

Introduction and Project Overview:

This document introduces the Data Simulation project, completed for the Programming for Data Analysis Project 2019 as part of the Higher Diploma in Data Analytics at GMIT.

Objectives of the Project:

The problem statement from the Programming for Data Analysis Project 2019 instructions [1] is as follows:

For this project you must create a data set by simulating a real-world phenomenon of your choosing. You may pick any phenomenon you wish – you might pick one that is of interest to you in your personal or professional life. Then, rather than collect data related to the phenomenon, you should model and synthesise such data using Python. We suggest you use the numpy.random package for this purpose. Specifically, in this project you should:

Choose a real-world phenomenon that can be measured and for which you could collect at least one-hundred data points across at least four different variables.
Investigate the types of variables involved, their likely distributions, and their relationships with each other.
Synthesise/simulate a data set as closely matching their properties as possible.
Detail your research and implement the simulation in a Jupyter notebook – the data set itself can simply be displayed in an output cell within the notebook. Note that this project is about simulation – you must synthesise a data set. Some students may already have some real-world data sets in their own files. It is okay to base your synthesised data set on these should you wish (please reference it if you do), but the main task in this project is to create a synthesised data set.

The PDF document outlining the project requirements is included in the GitHub project repository for reference.

This project was mainly developed using Python 3 and the following packages:

seaborn is a Python data visualization library for making attractive and informative statistical graphics.
pandas provides data analysis tools and is designed for working with tabular data that contains an ordered collection of columns where each column can have a different value type.
numpy.random is a subpackage of the NumPy package for working with random numbers. NumPy is one of the most important packages for numerical and scientific computing in Python.

The end goal of this project is to simulate a real-world phenomenon with at least one hundred data points across at least four different variables. The data set must be simulated or synthesised rather than collected. The instructions note that it is okay to base the synthesised data set on an actual real-world data set, but the main task is to create a synthesised data set.
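
As a rough illustration of what meeting this requirement could look like, the sketch below uses numpy.random and pandas to build a data set of one hundred rows and four variables. The variable names (age, height_cm, weight_kg, group), the distributions, and every parameter value are assumptions chosen purely for demonstration; they are not the phenomenon or the values used in this project.

```python
import numpy as np
import pandas as pd

# Seeded generator so the simulated data set is reproducible.
rng = np.random.default_rng(seed=2019)

n = 100  # minimum number of data points required by the project brief

# Hypothetical variables, chosen only for illustration.
age = rng.integers(18, 66, size=n)               # whole years, 18 to 65 inclusive
height_cm = rng.normal(loc=170, scale=10, size=n)  # roughly bell-shaped

# Weight is built from height plus random noise, so the two variables
# are related rather than independent.
weight_kg = 0.9 * (height_cm - 100) + rng.normal(loc=0, scale=8, size=n)

# A categorical variable drawn with unequal (assumed) probabilities.
group = rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2])

df = pd.DataFrame(
    {
        "age": age,
        "height_cm": height_cm,
        "weight_kg": weight_kg,
        "group": group,
    }
)

print(df.head())
print(df.describe())
```

Generating one variable from another, as done here for weight and height, is one simple way to reflect the relationships between variables that the instructions ask to be investigated. A real simulation would take its distributions and parameters from research into the chosen phenomenon.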