Some machine learning and statistics assignments
This is a summary of the Tasks project I submitted for the Machine Learning and Statistics module.
The project consisted of four tasks that were assigned at various times during the semester.
The entire notebook containing the Python code, output and explanations can be read here.
Task 1: The square root of 2 to 100 decimal places using plain Python
The first task was to write a Python function that calculates and prints the square root of 2 to 100 decimal places in plain Python, without using any module from the Python standard library or elsewhere. This sounds easier than it was! Like pi, the square root of 2 is an irrational number: its decimal expansion is infinite with no recurring pattern.
This task looked at the following:
- Various methods for calculating the square root of a number, including the Babylonian method and Newton's method
- How irrational numbers are stored on computers
- The issues and limitations of floating-point arithmetic in Python and other programming languages, in particular the problems with accuracy at high levels of precision, and the problems that may arise when converting from one format to another.
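One way to sidestep floating-point limitations entirely is to run the Babylonian (Newton's) method on Python's arbitrary-precision integers. The sketch below is one possible approach, not necessarily the one used in the project: it scales 2 by $10^{200}$ so the integer square root carries 100 correct decimal digits.

```python
# A minimal sketch of the Babylonian (Newton's) method on integers,
# assuming we scale 2 by 10**200 so no floating-point arithmetic is needed.
def sqrt2_100_places():
    scaled = 2 * 10 ** 200           # 2 shifted left by 2 * 100 decimal places
    x = scaled                       # initial (over)estimate
    while True:
        y = (x + scaled // x) // 2   # Babylonian update: x -> (x + n/x) / 2
        if y >= x:                   # the integer iteration stops decreasing
            break                    # once x has reached floor(sqrt(n))
        x = y
    digits = str(x)                  # 101 digits: '1414213562...'
    return digits[0] + "." + digits[1:101]

print(sqrt2_100_places())
```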
Task 2: The Chi-squared ($\chi^2$) test for independence
$$\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}$$
The second task involved verifying the Chi-squared test statistic as reported in this Wikipedia article and calculating the p-value.
Wikipedia contributors, "Chi-squared test — Wikipedia, the free encyclopedia," 2020, [Online; accessed 1 November 2020]. Available: https://en.wikipedia.org/w/index.php?title=Chi-squared_test&oldid=983024096
| Collar       | A   | B   | C   | D   | Total |
|--------------|-----|-----|-----|-----|-------|
| White Collar | 90  | 60  | 104 | 95  | 349   |
| Blue Collar  | 30  | 50  | 51  | 20  | 151   |
| No Collar    | 30  | 40  | 45  | 35  | 150   |
| Total        | 150 | 150 | 200 | 150 | 650   |
The Chi-squared test for independence is a statistical hypothesis test. It is used to analyse whether two categorical variables are independent, by determining whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table.
This task involved the following:
- Explaining how the Chi-squared test of independence of a pair of random variables works.
- Calculating the Chi-squared statistic, using Pandas DataFrames to illustrate.
- Using the scipy.stats module to calculate the Chi-squared statistic and associated p-values.
- Interpreting the results of the Chi-squared tests.
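The statistic for the table above can be verified by hand before reaching for scipy.stats. A minimal sketch, computing the expected counts under independence directly so every term of the formula is visible:

```python
# Sketch: chi-squared statistic for the collar/city contingency table above,
# computed from first principles (no SciPy), matching the Wikipedia example.
observed = [
    [90, 60, 104, 95],   # White Collar
    [30, 50, 51, 20],    # Blue Collar
    [30, 40, 45, 35],    # No Collar
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: (row total * column total) / grand total
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)  # (rows-1)*(cols-1)
print(f"chi2 = {chi2:.2f}, dof = {dof}")            # chi2 = 24.57, dof = 6
```

With 6 degrees of freedom this statistic corresponds to a very small p-value, so the null hypothesis of independence is rejected. `scipy.stats.chi2_contingency` returns the same statistic, p-value, degrees of freedom and expected-frequency table in one call.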
Task 3: Compare the STDDEV.P and STDDEV.S Excel functions
Use NumPy to perform a simulation demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.
This task involved the following:
- A comparison of the population variance and the sample variance.
The population variance is calculated using the formula: $$\sigma^2=\frac{\sum(x_i-\mu)^2}{n}$$
The sample variance is calculated using the formula:
$$S^2=\frac{\sum(x_i-\bar{x})^2}{n-1}$$
$n-1$ is usually used in the sample variance formula instead of $n$ because using $n$ tends to produce an underestimate of the population variance. According to Wikipedia, using $n-1$ instead of $n$ in the formulas for the sample variance and sample standard deviation is known as Bessel's correction, named after Friedrich Bessel.
- I simulated data using NumPy's `standard_normal` function: 10,000 data points were put into 1,000 subarrays of size 10, to represent 1,000 samples of 10 observations each.
- The mean of each sample was calculated using NumPy's `mean()` function.
- The (biased) variances and standard deviations were calculated using NumPy's `var()` and `std()` functions.
- The unbiased variances and standard deviations were calculated using the same `var()` and `std()` functions with the `ddof` parameter set to 1, so that the divisor is $n-1$ instead of the default $n$.
- The sample means and variances from each sample were then plotted and compared to the population parameters.
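The numerical core of that simulation can be sketched in a few lines (the seed and sample shape here are illustrative assumptions, not the project's exact values):

```python
# Sketch of the simulation: 1,000 samples of 10 standard-normal observations,
# comparing the biased (ddof=0) and unbiased (ddof=1) standard deviations.
import numpy as np

rng = np.random.default_rng(0)             # seeded for reproducibility
samples = rng.standard_normal((1000, 10))  # 1,000 samples of 10 observations

biased_std = samples.std(axis=1, ddof=0)    # divisor n (population formula)
unbiased_std = samples.std(axis=1, ddof=1)  # divisor n-1 (Bessel's correction)

print("mean biased std:  ", biased_std.mean())
print("mean unbiased std:", unbiased_std.mean())
```

Averaged over the 1,000 samples, the `ddof=1` estimate sits noticeably closer to the true population standard deviation of 1, which is exactly the behaviour the plots show.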
- The plots above show that the sample means are clustered around the population mean of zero, which is shown by the vertical blue dashed line. Some sample means are below it, some are above, and some are quite far from the population mean, depending on the actual values in each sample. Similarly for the standard deviations: the dashed red horizontal line represents the population standard deviation of 1, and while the sample standard deviations are clustered around it, some are above, some below, and some more extreme than others. However, when the means of the sample standard deviations are plotted, you can see that the mean of the unbiased standard deviations is closer to the actual standard deviation of the population than the mean of the biased sample standard deviations.
- I also looked at the means of the sample standard deviations for various small sample sizes between 2 and 6. With a smaller sample size you are more likely to get a sample mean that is a bad estimate of the population mean, and more likely to significantly underestimate the sample variance. The plots can be seen in the project notebook.
If a sample is taken from this population, the sample mean may or may not be close to the population mean, depending on which data points are actually in the sample and whether they came from one end or the other of the underlying distribution. The mean of the sample will always fall inside the sample, but the true population mean could be outside it. For this reason, measuring deviations from the sample mean is more likely to underestimate the true variance. The standard deviation is the square root of the variance. Dividing by $n-1$ as shown above instead of $n$ gives a slightly larger sample variance, which is an unbiased estimate.
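A tiny numeric illustration of that argument (the seed and sample size are assumptions made for the sketch): for many small samples, squared deviations measured from each sample's own mean are smaller on average than deviations from the true population mean, which is why dividing by $n$ underestimates the variance.

```python
# Sketch: deviations from the sample mean vs. deviations from the true mean.
import numpy as np

rng = np.random.default_rng(1)
n = 3
samples = rng.standard_normal((10000, n))  # many small samples, true mean 0

sample_means = samples.mean(axis=1, keepdims=True)
ss_from_sample_mean = ((samples - sample_means) ** 2).sum(axis=1)
ss_from_true_mean = (samples ** 2).sum(axis=1)  # population mean is 0

# Dividing by n: biased low (expected value (n-1)/n = 2/3 here, not 1)
print(ss_from_sample_mean.mean() / n)
# Deviations from the true mean average out to the true variance of 1
print(ss_from_true_mean.mean() / n)
```

The sample mean is, by construction, the value that minimises the sum of squared deviations within the sample, so the first quantity can never exceed the second.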
Task 4: Use Scikit-learn to apply k-means clustering to Fisher's famous Iris dataset
The final task was to apply k-means clustering to Fisher's famous Iris dataset.
This task used the Scikit-learn package to perform the k-means clustering. For this task I took the following steps:
- Gained an overview of how the k-means clustering algorithm works in general.
- Visualised the Iris dataset.
- Preprocessed the data, including scaling and label encoding using `sklearn.preprocessing`.
- Identified the correct number of clusters using the elbow method and silhouette scores.
- Performed k-means clustering on the data using the `sklearn.cluster` module.
- Visualised the clusters found using k-means and compared them to the actual species of Iris.
- Used the `yellowbrick.cluster` library to visualise the silhouette scores.
- Evaluated the performance of the clustering model using Scikit-learn's `metrics` module.
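The scaling, clustering and evaluation steps above can be sketched in a few lines. This is a minimal illustration, assuming k = 3 and a fixed random seed rather than the project's exact settings:

```python
# Sketch: k-means on the scaled Iris data, evaluated with sklearn.metrics.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # scale features first

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Silhouette score: how well separated the clusters are (range -1 to 1)
sil = silhouette_score(X, kmeans.labels_)
# Adjusted Rand index: agreement between the clusters and the true species
ari = adjusted_rand_score(iris.target, kmeans.labels_)
print("silhouette:", sil)
print("ARI:", ari)
```

The adjusted Rand index is only available here because the Iris data happens to come with species labels; in a genuine unsupervised setting you would rely on internal measures such as the silhouette score and the elbow plot.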
Note: there was an error in the task instructions, as the lecturer meant for us to do k-nearest neighbours instead of k-means clustering. K-means clustering is an unsupervised learning technique, and you do not usually have labels as you do with the Iris dataset. K-means is not usually used for making predictions on new data; unsupervised learning is used to find patterns in the data or to discover new features. K-nearest neighbours, on the other hand, is a supervised learning algorithm used for classification and regression. It assumes that similar data points are close to each other.
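For contrast, a minimal k-nearest neighbours sketch on the same dataset (the split ratio, seed and k = 5 are assumptions for illustration): unlike k-means, it trains on the known species labels and then predicts labels for held-out flowers.

```python
# Sketch: k-nearest neighbours (supervised) on Iris, for contrast with k-means.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Each test point is classified by a majority vote of its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = knn.score(X_test, y_test)
print("test accuracy:", acc)
```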
Nonetheless, this gave me an opportunity to learn more about k-means clustering. I will do another notebook on the k-nearest neighbours algorithm, which I will include here later.