by Angela Carpenter.

This notebook contains my submission for the Machine Learning and Statistics Tasks 2020 Project as part of the requirement for the Machine Learning and Statistics module at GMIT as part of the Higher Diploma in Computing and Data Analytics programme.

Write a Python function called sqrt2 that calculates and prints to the screen the square root of 2 to 100 decimal places. Your code should not depend on any module from the standard library or otherwise. You should research the task first and include references and a description of your algorithm.

## The square root of two $\sqrt{2}$¶

The square root of a number is a value that when multiplied by itself gives the number. In geometrical terms the square root function maps the area of a square to its side length.

Square roots of positive numbers are not in general rational numbers, and so cannot be written as a terminating or recurring decimal expression. Therefore in general any attempt to compute a square root expressed in decimal form can only yield an approximation, though a sequence of increasingly accurate approximations can be obtained. 

The square root of two then is the number that when multiplied by itself gives the number two. Geometrically the square root of 2 is the the length of the diagonal of a square where each side of the square is of length 1.

The square root of 2, or the one-half power of 2, written in mathematics as $2{\sqrt {2}}$ or $2^{{1/2}}$ is the positive algebraic number that, when multiplied by itself, equals the number 2. Technically, it must be called the principal square root of 2, to distinguish it from the negative number with the same property.  The square root of two was probably the first number known to be irrational.

An irrational number is one which cannot be expressed as a ratio of two numbers, i.e. a fraction. Any number that can be expressed as a ratio of two integers, a decimal that either ends or repeats in a recurring manner or a percentage is said to be a rational number. All whole numbers are rational as they can be represented as fractions divided by 1. Greek mathematicians such as Pythagoras had believed that all numbers could be shown to be fractions, believing that the number line was made up entirely of fractions, because for any two fractions we can always find another fraction in between them. Because this process has no end there are infinitely many such points. Hippasus, a Greek scholar is believed to have first identified irrational numbers in the 5th century BCE as he worked on geometrical problems. He applied Pythagoras's theorem, which states that the square of the hypothenuse in a triangle is equal to the sum of the squares of the other two sides, to a right-angled triangle with both shorter sides equal to one. Hippasus realised that the length of the hypothesus of such as triangle was equal to the square root of two but that the square root of two could not be expressed as a ratio of two whole numbers, i.e. as a fraction. There is no rational number which can be multiplied by itself to give precisely two which means that the square root of two an irrational number. The number two is termed non-square or square free. The numbers 3, 5 and 7 and many others are also non-square with irrational square roots. Numbers such as 4, 9 and 16 are square numbers and do have square roots that are also whole numbers and therefore rational. The value of the square root of two cannot be determined accurately because it cannot be represented as a fraction in the form a/b where a and b are whole numbers and has an infinite number of decimals.

The square root of a number $x$ is rational if and only if $x$ is a rational number that can be represented as a ratio of two perfect squares. The square roots of perfect squares such as 1, 4, 9 and 16 are integers. In all other cases, the square roots of positive integers are irrational numbers, and therefore have non-repeating digits in any standard positional notation system.

The square root of 2 is a real number that is not rational. A real number gives a positive result when squared. The square root of two gives a positive result when squared of 2. An irrational number has an infinite number of decimals with no recurring pattern. This is the case for the square root of two. The square root of 2 truncuated to 50 decimal places is 1.41421356237309504880168872420969807856967187537694.

#### What does this all mean for computing?¶

In terms of computing, computer programs are made up of binary strings of 1's and 0's which can be converted into integers. Every computer program can be converted into one of the counting numbers from the set of $\mathbb{N}$. The natural numbers are often referred to as the counting numbers as they are used for counting and ordering. Natural numbers $\mathbb{N}$, excluding negative integers and zero, are often known as the counting numbers. They are discrete numbers in contrast to measuring numbers or real numbers where the values are continuous and represent a distance along a line. The natural numbers are the basis from which other number sets are built by extension.

According to Turing, all computer programs are countable which means they can be put into one-to-one correspondance with the natural numbers but Turing also realised that not all numbers can be calculated or computed. The set of $\mathbb{N}$ natural numbers are an infinite set but it is the smallest infinite set there is and is known as countable infinite. The set of measuring numbers are all the numbers on the number line that are used for quantifying and measuring things and there is an infinite amount of such numbers between any two discrete points on the number line.

A rational number can be expressed as the ratio of 2 whole numbers. It can be stored as an array of two numbers. A number is rational only if it's decimal expansion becomes periodic. Real numbers can be represented as an infinite decimal expansion. Real numbers such as $\sqrt{2}$ have an infinite number of places after the decimal point. Computers cannot really store such real numbers as there is only a finite number of bits available to store the number.

Floating point numbers are stored on a computer with a finite number of digits so there is a finite limit on how many decimal places they can actually store. Floating point numbers are actually stored using integers in two parts, for the whole number part before the decimal point (the characteristics) and the part of the whole after the decimal point (the mantissa).

The square root of two is an irrational number and therefore cannot be expressed in the ratio of two numbers in their most simplified form with the resulting implications being that the square root of two can never be stored in a computer. Instead an approximation of its value can be stored (in the same way an approximation for the irrational number $pi$ is stored). Irrational numbers are often stored as procedures which are used to calculate but not store the number.

For this reason caution is always required when calculation statistics or performing analytics on irrational numbers. It is possible to work theoretically or to use approximations but rounding errors could interfere will the calculations.

Chapter 5 of the Python Tutorial looks at the issues and limitations associated with floating point arithmetic. It illustrates how the decimal fraction 0.125 and the binary fraction 0.001 have identical values:

• Floating-point numbers are represented in computer hardware as base 2 (binary) fractions
• The decimal fraction 0.125 has value $\frac{1}{10}$ + $\frac{2}{100}$ + $\frac{5}{1000}$
• The binary fraction 0.001 has value $\frac{0}{2}$ + $\frac{0}{4}$ + $\frac{1}{8}$ The only real difference here is that the decimal fraction 0.125 is written in base 10 fractional notation while the binary fraction 0.001 above is written in base 2 notation.

Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. A consequence is that, in general, the decimal floating-point numbers you enter are only approximated by the binary floating-point numbers actually stored in the machine. 

The Python tutorial demonstrates how the fraction $\frac{1}{3}$ can be approximated as a base 10 fraction as 0.3 or 0.33 or 0.333 but that no matter how many 3's you use the result will never actually be $\frac{1}{3}$. It will be increasingly a better approximation of $\frac{1}{3}$ though.

The same tutorial also demonstrates how the decimal value 0.1 cannot be represented exactly as a base 2 fraction because in base 2, $\frac{1}{10}$ is the infinitely repeating fraction 0.0001100110011001100110011001100110011001100110011... and if you stop at any finite number of bits you get an approximation. On most machines today floats are approximated using a binary fraction where the numerator uses the first 53 bits starting with the most significant bit and the denominator is as a power of 2. They illustrate how the binary fraction for the decimal 0.1 is 3602879701896397 / 2**55 which is close to the true value of $\frac{1}{10}$ but it is not exactly equal to the true value of 0.1. However we generally don't notice that this is not the true value and only an approximation because of the way values are displayed. Python and other programming languages only prints a decimal approximation to the to the true decimal value of the binary approximation stored by the machine. The true decimal value of the binary approximation stored for 0.1 is actually 0.1000000000000000055511151231257827021181583404541015625 but it is printed as 0.1. Python displays a rounded value instead of all the digits as in most situations this is enough.

The important point is that even though what we see printed (0.01) looks like the exact value of $\frac{1}{10}$, the actual value that is stored is the nearest representable binary fraction.

The article continues by illustrating how there are many different decimal numbers which actually share the same nearest approximate binary fraction. For example the numbers 0.1 and 0.10000000000000001 and 0.1000000000000000055511151231257827021181583404541015625 are all approximated by 3602879701896397 / 2 ** 55 . These decimal values all have the same approximation.

The errors in Python float operations are inherited from the floating-point hardware, and on most machines are on the order of no more than 1 part in 2**53 per operation. That’s more than adequate for most tasks, but you do need to keep in mind that it’s not decimal arithmetic and that every float operation can suffer a new rounding error. 

String formatting can be used to produce a limited number of significant digits but this is simply rounding the display of the true machine value. For example format(math.pi, '.12g') gives 12 significant digits while format(math.pi, '.2f') gives 2 digits after the point.

Representation error refers to the fact that most decimal fractions cannot be represented exactly as binary (base 2) fractions in Python and other languages and therefore won’t display the exact decimal number you expect. Most machines use IEEE-754 floating point arithmetic and almost all platforms map Python floats to IEEE-754 “double precision” which contain 53 bits of precision. The computer strives to convert 0.1 to the closest fraction that it can of the form $J/2**N$ where J is an integer containing exactly 53 bits.

IEEE 754 has 3 basic components:

* The sign of the mantissa where 0 represents a positive number while 1 represents a negative number.
* The biased exponent
* The Normalised Mantissa


,.

Python does have some solutions for dealing with these issues. It's decimal module implements decimal arithmetic suitable for accounting applications and high-precision applications while the fractions module implements arithmetic based on rational numbers (so the numbers like $\frac{1}{3}$ can be represented exactly). The NumPy and scipy packages are recommended for mathematical and statistical operations. Python also has a float.as_integer_ratio() method which expresses the value of a float as a fraction. This can help when you want to know the exact value as a float. As the ratio is exact, it can be used to losslessly recreate the original value.

Both Python's math module and the numpy library have a function called sqrt for calulating square roots. math.sqrt(x) calculates the square root of whatever is passed in as x. While this cannot be used for Task 1 itself, it can be used to check the results. If you square the results of either of these functions applied to the number 2, the result will be 2.0000000000000004 which demonstrates exactly the problem with performing calculations on the irrational number square root of two!

## Methods for Calculating square roots¶

Finding the square root of a number is the inverse of finding the square of a number. The square of a number is the number times itself. The perfect squares are the squares of the whole numbers. The square root of a number is the number that when multiplied by itself gives the number.

The square roots of the perfect squares (e.g., 0, 1, 4, 9, 16) are integers. In all other cases, the square roots of positive integers are irrational numbers, and hence have non-repeating decimals in their decimal representations. 

While getting the square root of the perfect squares is straightforward enough, getting the square root of numbers that are not perfect squares is not so. A manual calculation would start by looking at the square roots of the nearest perfect squares on either side of the number in question and involves some guesswork to begin. The most common iterative method of square root calculation by hand is known as the "Babylonian method". This is a guess and divide method which takes a starting estimate to the square root $x_0$, divides this starting estimate into the number $\text{a}$ whose square root you are looking for and then finds the average of the quotient $x_0$ and the divisor $a$ to get the next estimate $x_{0+1}$.

Newtons Method which is also known as the Newton–Raphson method, can be used to get an estimate of the square root of a number. It is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-valued function. Newton's Method is one of many methods of computing square roots and is outlined in detail in the Wikipedia article. It outlines how the problem of finding the square root of a number $a$ can be rephrased as finding the zero of $f(x)=x2−a$ where $f$ is a single-variable function defined for a real variable $x$, with an initial guess of $x_0$ for a root of $f$. The derivative of the function $f'(x)=2x$. Rearranging the formula yields the Babylonian method of finding square roots :

$$x_{n+1}=x_{n}-{\frac{f(x_{n})}{f'(x_{n})}}=x_{n}-{\frac{x_{n}^{2}-a}{2x_{n}}}={\frac {1}{2}}{\biggl(}2x_{n}-{\Bigl(}x_{n}-{\frac{a}{x_{n}}}{\Bigr)}{\biggr)}={\frac {1}{2}}{\Bigl(}x_{n}+{\frac {a}{x_{n}}}{\Bigr )}$$

i.e. the arithmetic mean of the guess, $x_n$ and $\frac{a}{x_n}$

Using Newtons Method, regardless of the starting estimate, once the square of the estimate gets close enough to the actual number itself, then that is considered a good approximation for the square root. The Baylonian Method uses the same iterative scheme as the Newton–Raphson method yields when applied to the function $y = f(x) = x2 − a$, using the fact that its slope at any point is $dy/dx = f′(x) = 2x$.  You consider the nearest perfect squares on both sides of the number to make a starting guess of what the square root could be. The Babylonian method of guess and divide takes a starting estimate to the square root ($x_0$), divides this starting estimate into the number $a$ whose square root you are looking for and then finds the average of the quotient $x_0$ and the divisor $a$ to get the next estimate $x_{0+1}$.

Rick Wicklin explains simply in a blog post on Newton's method and an iterative square root algorithm how the algorithm taught in school for calculating the estimating a square root from an arbitrary guess is an iterative procedure. Given any positive number $S$ and any positive number $x_0$, apply the following formula: $$x_{n+1} = \frac{(x_n + S/x_n)}{2}$$ until the iteration converges to the square root of $S$. The same author has a post on the Babylonian method for finding square roots by hand. Suppose you are given any positive number $S$. To find it's square root do the following:

1. Make an initial guess. This guess can be any positive number. This is $x_0$
2. Improve the guess by applying the formula $x_1 = \frac{(x_0 + \frac{S}{x_0})}{2}$. The number $x_1$ is a better approximation to the square root of $S$.
3. Iterate until convergence. Apply the formula $x_{n+1}=\frac{(x_n+\frac{S}{x_n})}{2}$ until the process converges. Convergence is achieved when the digits of $x_{n+1}$ and $x_n$ agree to as many decimal places as you desire.

Wikipedia similarly describes the process: The algorithm repeats a simple calculation that results in a number closer to the actual square root each time it is repeated with its result as the new input. The motivation is that if x is an overestimate to the square root of a nonnegative real number a then $\frac{a}{x}$ will be an underestimate and so the average of these two numbers is a better approximation than either of them. However, the inequality of arithmetic and geometric means shows this average is always an overestimate of the square root (as noted below), and so it can serve as a new overestimate with which to repeat the process, which converges as a consequence of the successive overestimates and underestimates being closer to each other after each iteration. To find x:

1. Start with an arbitrary positive start value x. The closer to the square root of a, the fewer the iterations that will be needed to achieve the desired precision.
2. Replace x by the average of x and a/x.
3. Repeat from step 2, using this average as the new value of x.

That is, if an arbitrary guess for $\sqrt {a}$ is $x_0$, and $x_{n+1} =\frac{ (x_n + \frac{a}{x_n})}{2}$, then each $x_n$ is an approximation of $\sqrt{a}$ which is better for large $n$ than for small $n$. If $a$ is positive, the convergence is quadratic, which means that in approaching the limit, the number of correct digits roughly doubles in each next iteration. If $a = 0$, the convergence is only linear. Using the identity $\sqrt{a}=2^{-n}{\sqrt{4^{n}a}}$, the computation of the square root of a positive number can be reduced to that of a number in the range [1,4). This simplifies finding a start value for the iterative method that is close to the square root, for which a polynomial or piecewise-linear approximation can be used.

The time complexity for computing a square root with $n$ digits of precision is equivalent to that of multiplying two n-digit numbers.

### Coding up the square root function using Python.¶

The methods described above can be put into code using an iterative algorithm.

• $S$ represents the number whose square root is being sought.
• $x$ represents the changing estimates ($x_n$, $x_0$, $x_1$, $x_{n+1}$ for the estimates at each step).
• The algorithm takes the average of the estimate $x_n$ and the number $S$ divided by the estimate $x_n$. This will give a new estimate for the next iteration $x_{n+1}$.
• The values of the estimate $x$ is updated at each iteration of a while loop representing the $X_n$ and $x_{n+1}$ estimates.

First we make a guess of what the root of the number in question could be based on the square root of the nearest perfect squares on either side. Add the result of this to the number and question and divide by two to get the average as the new starting guess. Keep repeating.

1. Find two perfect square roots that the number in question lies between.

2. Take one of those square roots as an estimate and divide the number in question by it.

• S = 2
• The square root of 2 must lie between 1 and 2
• Taking 1 here as the estimate $x_0=1$
• Divide number in question by the starting estimate S/x0 = 2/1 = 2 = X1
3. Take the average of the latest estimate x1 (2) and S (from the previous step above) and the Number.

• x1 = (x0 +S/x0)/2 = (2 + 2/1)/2 = 2
• x2 = (x1 + S/x1)/2 = (2 + 2/2)/2 =1.5
• x3 = (x2 + S/x2)/2 = (1.5 + 2/1.5)/2 = 1.4166666666666665
• x4 = (x3 + S/x3)/2 = (1.4166666666666665 + 2/1.4166666666666665)/2 = 1.4142156862745097
• x5 = (x4 + S/x4)/2 = (1.4142156862745097 + 2/1.4142156862745097)/2 = 1.4142135623746899
• x6 = (x5 + S/x5)/2 = (1.4142135623746899 + 2/1.4142135623746899)/2 = 1.414213562373095
• x7 = (x6 + S/x6)/2 = (1.414213562373095 + 2/1.414213562373095)/2 = 1.414213562373095
• x8 = (x7 + S/x7)/2 = (1.414213562373095 + 2/1.414213562373095)/2 = 1.414213562373095
• x9 = (x8 + S/x8)/2 = (1.414213562373095 + 2/1.414213562373095)/2 = 1.414213562373095

math.sqrt(10) gives 3.1622776601683795

As the illustration shows here, once you get to the 5th iteration, the result is rounded.

In :
print(f"{(1.4166666666666665 + 2/1.4166666666666665)/2}")
print(f"{(1.4142156862745097 + 2/1.4142156862745097)/2}")
print(f"{(1.4142135623746899 + 2/1.4142135623746899)/2}")
print(f"{(1.414213562373095 + 2/1.414213562373095)/2}")
print(f"{(1.414213562373095 + 2/1.414213562373095)/2}")
# and on until you get the required accuracy

In :
# trying to get the square root of two by using the power of 1/2
200000000000000000000000000000000000000**(1/2)

Out:
1.414213562373095e+19
In :
import numpy as np
print(np.sqrt(2))
import math
print(math.sqrt(2))

1.4142135623730951
1.4142135623730951


The most this method yielded was 15 digits. While the algorithm itself will calculate approximately the square root of two (accurate to 15 digits at least), it will not print out 100 decimal places so easily. At some stage the result gets rounded and this is the main issue as outlined earlier when dealing with real numbers. Wikipedia records the square root of 2 to 50 decimal places as 1.41421356237309504880168872420969807856967187537694. Numpy's sqrt function returns the same answer as my function above to 15 decimal places. The square root of 2 using np.sqrt to 50 decimal places is 1.41421356237309514547462185873882845044136047363281 which differs from that recorded at Wikipedia.

I did try to increase the precision of the while loop to see if this would yield additional digits but the maximum was still only 15 decimal places. This is as expected given the background of how computers deals with floating point numbers. Unfortunately, most decimal fractions cannot be represented exactly as binary fractions. A consequence is that, in general, the decimal floating-point numbers you enter are only approximated by the binary floating-point numbers actually stored in the machine.  According to the Python Docs section on numeric types integers have unlimited precision.

Therefore the next logical step to me was to apply the function to a very large integer multiple of 2 as this would not have the same limitations as floats and then divide the result by the same integer multiple. In the same way that the square root of 4 is 2, the square root of 400 is 20, and so on. The whole number before the decimal point cannot be rounded. In this way the digits of the square root of two could be produced but with the decimal point in the incorrect place. Some string manipulation then to return the correct result.

Applying either Newtons formula or the equivalent Babylonian guess and divide formula resulted in 1.4142135623730951 being returned as the approximate square root of two (using precision of 0.0000000000001 to stop the while loop.) Applying the function to get the square root of $2x10^{200}$ gave the result in scientific notation. Therefore the 100 digits cannot be extracted at this stage. Nevertheless I compared the results to math.sqrt to see how they fare and the results are the same. Python only prints a decimal approximation to the true decimal value of the binary approximation stored by the machine. . While the printed result might look like the exact value, the actual stored value is the nearest representable binary fraction. While string formatting can be used to produce a limited number of significant digits, you are really just rounding the display of the true machine value. Also as noted earlier every float operation can suffer a new rounding error.

Here I put the above code into a function. Without setting a stopping value the loop will take a long time as will setting the stopping value smaller than 0.000000000000001. When applying the function to a large multiple of 2 $2x10^{200}$, the function did not stop with the precision value of 15 decimal places so I changed it to 0. The intention here was to use string formatting to get the digits out, remove the scientific notation at the end to and construct a string representation of the square root of 2 to 100 decimal places. So the result of the square root of $2x10^{200}$ is 1.414213562373095e+100 which is presented to us in scientific notation and therefore foils my attempt to get the 100 digits!. Attempting to apply formatting to get 100 significant digits does not return the same digits as reported on Wikipedia. It did however match the result from applying the math.sqrt function to get the square root of $2x10^{200}$ and then using string formatting to get the 100 significant digits.

In :
# S is number to get square root of, S= 2
def sqrt2(S):
# take the first estimate x to be the number itself
x = S
# repeat this until the difference between the Estimate Squared minus the number is small enough
# without setting a stopping value here, the loop will continue for a long time
while abs(x*x - S) > 0.000000000000001:
# Update the new estimate to be the current estimate minus the Number using Newtons formula
x -= ((x*x)- S)/ (2 * x)
# or equivalently the babylonian guess and divide formula x= (x + S/x)/2

return x
print(sqrt2(2))
# string formatting to get 100 significant digits
print(format(sqrt2(2),'.100g'))

1.4142135623730951
1.4142135623730951454746218587388284504413604736328125

In :
# define the function, using multiple of 2 and changing the precision
def sqrt2(S):
S = S * 10**200
# first estimate
x = S
# repeat until there is no difference between the squared estimate gives the square back
while abs(x*x -S) > 0:
# update the new estimate to be the mean of the current estimate and the number divided by the estimate

x= (x + S/x)/2

#x -= ((x*x)- S)/ (2 * x) # this gives 1 extra digit
return x

print(sqrt2(2))
# string formatting to get 100 significant digits
print(format(sqrt2(2),'.100g'))

1.414213562373095e+100
1.414213562373095027142412563281858698349164881791987548177900388860130684271654303022821004349852877e+100

##### Checking against the math module square root function math.sqrt.¶
In :
import math
# calculating the square root of 2e200
math_root2 = math.sqrt(2*(10**200))
print(math_root2)
format(math_root2,'.100g')

1.414213562373095e+100

Out:
'1.414213562373095027142412563281858698349164881791987548177900388860130684271654303022821004349852877e+100'

So back to the drawing board. A further bit of research found an article on getting the square roots of integers on geeksforgeeks.com using a binary search function to find the largest integer i whose square is less than or equal to the original number. The values of $i * i$ is monotonically increasing, so the problem can be solved using binary search. The code sample there, contributed by Nikita Tiwari, uses integer division and defines the lower and upper bounds, the midpoint and the answer at each iteration. The loop runs until the lower bound is less than or equal to the upper bound. Then it checks if the midpoint squared is equal to the number and narrows the search to one half of the search space.

If I apply this function below to get the square root of two it returns 1 as it is using integer division. However if I apply it to a very large multiple of two $2x10^{200}$ as outlined earlier I can get the integer result of the square root of $2x10^{200}$. While this is not the answer to the square root of two which is what the task here calls for, it does contain the first 100 digits of the square root of 2! Therefore in the function below, I adapt the code to include some string manipulation of the square root of $2x10^{200}$ and return the first 100 digits of the square root of 2. First calculate the square root of $2x10^{200}$, then convert the result (which is an integer) to a string and iterate through each digit of the string and concatenating them to a new string that begins with a '1' followed by a decimal point. In this way I have calculated and printed to the screen the square root of 2 to 100 decimal places.

In :
# adapted and extended from code on geekforgeeks contributed by Nikita Tiwari.

# Returns floor of square root of x
def sqrt2() :

# set up an empty string to hold the result
sqrt_string =""

S = 2 * 10**200
# Base cases is only required for the general case
if (S == 0 or S == 1) :
return S

# Do Binary Search to find the square root of S
lower = 1
upper = S
while (lower <= upper) :
x = (lower + upper) // 2

# If x squared is a perfect square then return x
if (x*x==S) :
return x

# As we need the floor, update the answer when mid*mid is smaller than x, and move closer to getting the square root of S
if (x*x < S) :
lower = x + 1
ans = x

else :

# If mid*mid is greater than x
upper = x-1

#return ans
# create a string version of the result
str_ans = str(ans)

for s in str_ans:
sqrt_string += s

# concatenate the strings to produce the square root of 2
sqrt_string = "1."  + sqrt_string[1:]
#print(f"The square root of 2 to 100 decimal places is \n{sqrt_string}")

# this will return an integer for square root of 2e200
#return ans

# instead return the string containing the first 100 digits of square root of 2
print(f"The first 100 digits of the square root of 2 are \n{sqrt_string}")
return sqrt_string

# call the function to print the first 100 digits of the square root of 2
sqrt2()

The first 100 digits of the square root of 2 are
1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727

Out:
'1.4142135623730950488016887242096980785696718753769480731766797379907324784621070388503875343276415727'

Nerdparadise.com has the digits of the square root of 2 to 10,000 places. Here are the first 100 digits as reported there: 1. 4142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727

These are the string results from the python function above:

14142135623 7309504880 1688724209 6980785696 7187537694 8073176679 7379907324 7846210703 8850387534 3276415727

### Some final notes on the square root of 2 task.¶

• The sqrt2() function above does yield the correct digits for the square root of 2. It is a string however and further numerical calculations cannot be performed on the string. Converting from one type to another will not always return results as expected.
• The first function above using the babylonian guess and divide method does return the exact same result as math.sqrt to 15 decimal places.
• This task illustrates the problems with accuracy to high levels of precision and also how when converting between types you won't always get what you expect. The lecture on data types and numbers highlighted that there are some irrational numbers that simply cannot be stored on a computer. Procedures can be written to produce the value when needed though. The article from the Python tutorial  on the problems with floating point numbers showed an example of how there are many different decimal numbers which actually share the same nearest approximate binary fraction. It demonstrated how the numbers 0.1 and 0.10000000000000001 and 0.1000000000000000055511151231257827021181583404541015625 are all approximated by 3602879701896397 / 2 ** 55.

• The first function above applied to get the square root of $2x10^{200}$ returned scientific notation as does the math.sqrt function when applied to such a large number. Using string formatting to display the actual digits did not print the correct digits to the screen. This shows that what we see printed out is not always the exact true value.

In reality, while we would be concerned with the accuracy of results, other considerations would be the computational cost and time. When I tried to run the earlier versions of the square root function, my computer struggled and I had to interrupt the process. The final function above uses bounds though. Newton's method uses approximations and while it converges towards the actual result, the number of steps to reach this is infinite. This is combined with the problems outlined earlier of computing and storing real numbers on a computer. I came across MIT course notes on the Square Roots via Newton’s Method online. The author S.G.Johnson notes that we are concerned not only with existence and correctness of the solutions, but with the time (and other computational resources, e.g. memory) required to compute the result. Some algorithms like Newtons method are intrinsically approximate and while they converge towards the desired result they never reach it in a finite number of steps. How fast they converge is a key question. He also points out that arithmetic with real numbers is approximate on a computer, because we approximate the set $\mathbb{R}$ of real numbers by the set $\mathbb{F}$ of floating-point numbers, and the result of every elementary operation (+,−,×,/) is rounded to the nearest element of $\mathbb{F}$. Therefore it is important to understand floating point numbers and how accumulation of these rounding errors affects different algorithms.

Wikipedia notes that the time complexity for computing a square root with $n$ digits of precision is equivalent to that of multiplying two n-digit numbers.

The Chi-squared test for independence is a statistical hypothesis test like a t-test. It is used to analyse whether two categorical variables are independent. The Wikipedia article gives the table below as an example , stating the Chi-squared value based on it is approximately 24.6. Use scipy.stats to verify this value and calculate the associated p value. You should include a short note with references justifying your analysis in a markdown cell.

A B C D Total
White Collar 90 60 104 95 349
Blue Collar 30 50 51 20 151
No Collar 30 40 45 35 150
Total 150 150 200 150 650

Wikipedia contributors, “Chi-squared test — Wikipedia, the free encyclopedia,” 2020, [Online; accessed 1-November-2020]. [Online]. Available: https://en.wikipedia. org/w/index.php?title=Chi-squared test&oldid=983024096

### Extract from Wikipedia Article in question:¶

A chi-squared test, also written as $\chi^2$ test, is a statistical hypothesis test that is valid to perform when the test statistic is chi-squared distributed under the null hypothesis, specifically Pearson's chi-squared test and variants thereof. Pearson's chi-squared test is used to determine whether there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis that there are no differences between the classes in the population is true, the test statistic computed from the observations follows a $\chi^2$ frequency distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true. Test statistics that follow a $\chi^2$ distribution occur when the observations are independent and normally distributed, which assumptions are often justified under the central limit theorem. There are also $\chi^2$ tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs. Chi-squared tests often refers to tests for which the distribution of the test statistic approaches the $\chi^2$ distribution asymptotically, meaning that the sampling distribution (if the null hypothesis is true) of the test statistic approximates a chi-squared distribution more and more closely as sample sizes increase. Chi-squared test

In the Wikipedia article from which the above table was taken, the Chi-Squared test is testing the null hypothesis of independence of a pair of random variables based on observations of the pairs, i.e to analyse whether two categorical variables are independent. Hypothesis testing is a decision making process for evaluating a hypothesis. A hypothesis is a claim about a population parameter. Hypothesis testing consists of the following steps:

• Define the population under study
• State the hypothesis to be investigated
• Select a sample from the population and collect the data
• Evaluate the evidence contained in the data and reach a conclusion.

In a statistical hypothesis problem there are always two and only two contradictory hypotheses. A Null Hypothesis $H_0$ is a claim or statement about population parameters that is assumed to be true until it is declared false. An Alternative Hypothesis $H_A$ or $H_1$ is the competing hypothesis, it is a claim about a population parameter that will be true if the null hypothesis is false. The null hypothesis is that each person's neighbourhood of residence is independent of the person's occupational classification. The decision whether or not to reject a null hypothesis $H_0$ is based on the evidence contained in the sample data.

The data here are tabulated on a two-way frequency table. Two-way frequency tables display frequencies or relative frequencies for two categorical variables. Categorical data is data where each observation belongs to one of a set of categories. It is often referred to as count data because the number of observations belonging to a particular category are counted. The data is bivariate as two variables are measured on each member of the sample. The data entries in the table refer to two categories. The two-way frequency table is used to examine the relationship between two categorical variables, in this case to see if the proportions of residents living in the neighbourhoods (belonging to the categories) A, B, C and D are the same for the occupational categories of White Collar, Blue Collar and No Collar.

The city has a population of 1,000,000 residents with four neighbourhoods, A, B ,C and D from which a random sample of 650 residents is taken and their occupation recorded as "white collar", "blue collar" or "no collar".

A $\chi^2$ test is used to determine if there is a statistically significant difference between the expected frequencies and the observed frequencies in one or more categories of a contingency table. If the null hypothesis that there are no differences between the classes in the population is true then the test statistic computed from the observations follows a $\chi^2$ frequency distribution. The test evaluates how likely the observed frequencies would be assuming the null hypothesis is true, the null hypothesis being that each person's neighborhood of residence is independent of the person's occupational classification.

A B C D Total
White Collar 90 60 104 95 349
Blue Collar 30 50 51 20 151
No Collar 30 40 45 35 150
Total 150 150 200 150 650

The cell counts are the numbers in a particular sample that belong to a particular category. They are frequency counts and the table is a frequency table. Relative frequences such as percentages or proportions could also be used to represent the same data in which case it would be a relative frequency table with the table entries showing the conditional distribution. The column and row totals are known as the marginal totals or the marginal distribution and and each set of marginal totals adds to a single grand total. In this case the row totals are the sizes of the samples selected from the different occuational populations (i.e. white-collar, blue-collar, no-collar). The column totals are the sizes of the samples selected from the different neighbourhood populations A, B, C and D.

The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification and therefore the neighbourhood category proportions are the same for all occupational classifications. Knowing a person's occupational classification therefore would not tell you anything about where they live. This hypothesis considers whether the proportion of workers living in the neighbourhoods "A", "B","C" and "D" are the same for the person's classification as White-Collar, Blue-Collar or No-Collar.

• $H_0$ : Category proportions in the different neightbourhoods are the same for all occupational classifications
• $H_A: H_0$ not true.

150 of the sample live in neighbourhood A. This figure can be used to estimate what proportion of the 1 million residents live in neighbourhood A and similarly estimates can be calculated for neighbourhoods B, C and D. In the same way, 349 of the sample are white-collar workers. This figure can be used to determine what proportion if the 1 million residents are white-collar workers. Again similar estimates can be calculated for blue-collar and no-collar workers. This figure can be used to determine what proportion of the 1 million residents are blue-collar workers. The proportions in each neighbourhood A, B C and D are calculated below based on the sample at 23.08%, 23.08%, 30.77% and 23.08% respectively. The proportions of each occupation type White Collar, Blue Collar and No Collar are 53.69%, 23.23% and 23.08% respectively.

We need to calculate the frequencies (cell counts) that would be expected if the null hypothesis was true. Assuming the null hypothesis is true, for a two-way table the expected cell counts for each cell = $\frac{\text{row total} \times \text{column total}}{Grand Total}$.

For example, by the assumption of independence under the null hypothesis we should expect the number of White Collar's in neighbourhood A to be: $$150 *\frac{349}{650} \approx 80.54$$ This is the value for the expected frequency in the first cell of the first row. The expected frequency for the remaining cells are calculated in a similar fashion to populate the expected frequency table.

In :
A, B, C, D =150, 150, 200,150
print(f"The column totals are {A} in neighbourhood A, {B} in B, {C} in C and {D} in  D ")
White, Blue, No = 349,151,150
print(f"The Row totals are {White} for White Collar, {Blue} for Blue Collar and {No} for No Collar. ")
Total = A+B+C+D
print(f"The Grand Total in the sample is {Total}.")
print("The proportions in each neighbourhood A, B C and D are {:.2%}, {:.2%}, {:.2%} and {:.2%} respectively.".format(A/Total,B/Total,C/Total,D/Total))
print("The proportions of each occupation type White Collar, Blue Collar and No Collar  {:.2%}, {:.2%} and {:.2%} respectively.".format(White/Total,Blue/Total,No/Total))

The column totals are 150 in neighbourhood A, 150 in B, 200 in C and 150 in  D
The Row totals are 349 for White Collar, 151 for Blue Collar and 150 for No Collar.
The Grand Total in the sample is 650.
The proportions in each neighbourhood A, B C and D are 23.08%, 23.08%, 30.77% and 23.08% respectively.
The proportions of each occupation type White Collar, Blue Collar and No Collar  53.69%, 23.23% and 23.08% respectively.

In :
print("The expected number of White Collars in each of the neighbourhoods are: A: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f}.".format(A*White/Total,B*White/Total,C*White/Total,D*White/Total))
print("The expected number of Blue Collars in each of the neighbourhoods are: \tA: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f}.".format(A*Blue/Total,B*Blue/Total,C*Blue/Total,D*Blue/Total))
print("The expected number of No Collars in each of the neighbourhoods are: \tA: {:.2f}, B: {:.2f}, C: {:.2f} and D: {:.2f} .".format(A*No/Total,B*No/Total,C*No/Total,D*No/Total))

The expected number of White Collars in each of the neighbourhoods are: A: 80.54, B: 80.54, C: 107.38 and D: 80.54.
The expected number of Blue Collars in each of the neighbourhoods are: 	A: 34.85, B: 34.85, C: 46.46 and D: 34.85.
The expected number of No Collars in each of the neighbourhoods are: 	A: 34.62, B: 34.62, C: 46.15 and D: 34.62 .


#### Comparing the expected and observed counts to calculate the Chi-Squared Test Statistic:¶

Once you have the expected frequencies or expected cell counts as well as the observed freqeuncies, the next step is to compare them. If the differences between them are so big that they could not be explained by sampling variation then the null hypothesis is rejected in favour of the alternative hypothesis.

The comparison of observed and expected cell counts is made by calculating the following for each cell:

$$\frac{(Observed Count - Expected Count)^2}{Expected Count}$$

For example the value for the first cell is $\frac{(90-80.54)^2}{80.54}\approx1.11$. Next the calculated value for all the cells are added together to get the chi-square test statistic.

$$\chi^2=\sum_{i=1}^{k}\frac{(O_i-E_i)^2}{E_i}$$

The sum of these quantities over all of the cells is the $\chi^2$ test statistic which is $\approx24.6$ as per the Wikipedia article. Assuming the null hypothesis to be true, $\chi^2$ is approximately chi-squared distributed with degrees of freedom being $(\text{number of rows}-1)({\text{number of columns}}-1)$ which is $(3-1)(4-1)=6$ in this case.

If the test statistic is improbably large according to that chi-squared distribution, then one rejects the null hypothesis of independence.

• Large values of $\chi^2$ suggest rejection of the Null hypothesis in favour of the alternative hypothesis.
• Small values of $\chi^2$ would support the null hypothesis.

If calculated manually the test statistic would be compared to the critical value (from a table such as the New Cambridge Statistical Tables) at a particular alpha level such as 0.05 taking into account the degrees of freedom.

• If the chi-squared test statistic is less than the chi-squared critical value with 6 degrees of freedom (at the chosen alpha level such as 0.05) then you would fail to reject the null hypothesis.
• If the chi-squared test statistic is greater than the chi-squared critical value, then the null hypothesis would be rejected and you would accept the alternative hypothesis that the data contains sufficient evidence of neighbourhood preference by occupational classification. The proportions of residents in the four neighbourhoods A, B, C and D are not the same for the 3 occupational classifications of white-collar, blue-collar and no-collar.

#### Calculations using Python¶

Here I will put the data from the contingency table into a Pandas dataframe and then create an expected frequency table based on the observed data. The SciPy stats package will actually calculate the value of the chi-squared test statistic. This is just for illustration and to verify the results.

I first created a dataframe containing the actual observed cell counts and from this created a dataframe with the expected cell counts that would be expected under the null hypothesis. The marginal row and column totals as well as the overall grand total are the same as totals for the observed frequencies tables. From this the formula above is applied to calculate the chi-squared test statistic. (note there may be some rounding here)

In :
import numpy as np
import pandas as pd

# create a pandas dataframe and add a row index
df=pd.DataFrame({"A":[90,30,30], "B":[60,50,40],"C":[104,51,45],"D":[95,20,35]})
df.index = ['White_collar', 'Blue_collar', 'No-Collar']
# add a total column to sum across each row
df["Total"] = df.sum(axis=1)
# add a row to sum across column totals
#(https://stackoverflow.com/questions/20804673/appending-column-totals-to-a-pandas-dataframe)
df.loc['Total']= df.sum()
df.style.format("{:.2f}")

Out:
A B C D Total
White_collar 90.00 60.00 104.00 95.00 349.00
Blue_collar 30.00 50.00 51.00 20.00 151.00
No-Collar 30.00 40.00 45.00 35.00 150.00
Total 150.00 150.00 200.00 150.00 650.00
In :
# get the marginal row totals
WhiteCollarTotal =df.iat[0, 4]
BlueCollarTotal=df.iat[1,4]
NoCollarTotal=df.iat[2,4]
# get the marginal column total
ATotal=df.iat[3,0]
BTotal=df.iat[3,1]
CTotal=df.iat[3,2]
DTotal=df.iat[3,3]
# get the grand total
GrandTotal = df.iat[3,4]

In :
# create a pandas dataframe using dictionaries containing zero values to start
dfE =pd.DataFrame({"A":[0,0,0], "B":[0,0,0],"C":[0,0,0],"D":[0,0,0]},dtype='float32')
dfE.index = ['White_collar', 'Blue_collar', 'No-Collar']

# Calculate the White collar expected frequencies
dfE.iat[0,0]=(WhiteCollarTotal*ATotal)/GrandTotal
dfE.iat[0,1]=(WhiteCollarTotal*BTotal)/GrandTotal
dfE.iat[0,2]=(WhiteCollarTotal*CTotal)/GrandTotal
dfE.iat[0,3]=(WhiteCollarTotal*DTotal)/GrandTotal
# Blue collar
dfE.iat[1,0]=(BlueCollarTotal*ATotal)/GrandTotal
dfE.iat[1,1]=(BlueCollarTotal*BTotal)/GrandTotal
dfE.iat[1,2]=(BlueCollarTotal*CTotal)/GrandTotal
dfE.iat[1,3]=(BlueCollarTotal*DTotal)/GrandTotal
# No Collar
dfE.iat[2,0]=(NoCollarTotal*ATotal)/GrandTotal
dfE.iat[2,1]=(NoCollarTotal*BTotal)/GrandTotal
dfE.iat[2,2]=(NoCollarTotal*CTotal)/GrandTotal
dfE.iat[2,3]=(NoCollarTotal*DTotal)/GrandTotal

# add a total column to sum across each row
dfE["Total"] = dfE.sum(axis=1)
# add a row to sum across column totals
dfE.loc['Total']= dfE.sum()
dfE.round(2)
#dfE.style.format("{:.2f}")
dfE.style.format("{:.2f}")
dfE

Out:
A B C D Total
White_collar 80.538460 80.538460 107.384613 80.538460 349.0
Blue_collar 34.846153 34.846153 46.461540 34.846153 151.0
No-Collar 34.615383 34.615383 46.153847 34.615383 150.0
Total 150.000000 150.000000 200.000000 150.000000 650.0
In :
# calculate the chi-squared statistic using the formula
chi = ((df-dfE)**2)/dfE
chi
chi["Total"] = chi.sum(axis=1)

chi.loc['Total']= chi.sum()
chi

Out:
A B C D Total
White_collar 1.111528 5.237601 0.106678 2.596724 9.052531
Blue_collar 0.673968 6.590084 0.443326 6.325182 14.032560
No-Collar 0.615384 0.837607 0.028846 0.004274 1.486111
Total 2.400880 12.665292 0.578851 8.926180 24.571203
In :
print(f"The Chi-squared test statistical is calculated as {chi['Total']['Total']}")

The Chi-squared test statistical is calculated as 24.57120296026715


The Chi-Squared test statistic here is calculated as 24.57120296026715 which when rounded to 1 decimal place is the same as the Chi-squared value in the Wikipedia article of $\approx24.6$. 

### Use Scipy.stats to verify this value.¶

The next step is to use the scipy.stats to verify this value and calculate the associated $p$ value.

## Scipy.stats functions for Chi-Squared tests¶

The scipy.stats module has a number of Statistical Tests. There is a chisquare function which is used to calculate a one-way chi-square test. This chi-square test function tests the null hypothesis that the categorical data has the given frequencies and takes as parameters an array of observed frequencies or occurences in each category, an array of expected frequencies in each category and the degrees of freedom.

There is another function, the chi2_contingency function which computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in a contingency table. It computes the expected frequencies based on the marginal sums under the assumption of independence.

The contingency.expected_freq(observed) function can be used to compute the expected frequencies from a contingency table. It takes as input parameters the n-dimensional contingency table of observed frequency values or counts and then computes the expected frequencies for the table based on the marginal sums, under the assumption that the groups associated with each dimension are independent.

#### The observed values.¶

The observed values must be passed into the function using a numpy array:

In :
# observed is the multidimensional arrays of values from the original table.
observed = np.array([[90,60,104,95],[30,50,51,20],[30,40,45,35]])
print("The table of observed values: \n")
observed

The table of observed values:


Out:
array([[ 90,  60, 104,  95],
[ 30,  50,  51,  20],
[ 30,  40,  45,  35]])

#### The expected frequency values:¶

The contingency.expected_freq(observed) function returns an array of expected frequencies, based on the marginal sums of the table. The expected values returned will be of type float. The shape of the expected values will be the same as the observed. Our table has 3 rows and 4 columns of observed values. Therefore the array of expected values will be the same. The expected values are calculated in the same way as I showed above behind the scenes using marginal sums of the table.

In :
# to get an array of the expected frequencies as calculated by the function
from scipy.stats.contingency import expected_freq
print("The expected frequency table: \n")
expected_freq(observed)

The expected frequency table:


Out:
array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
[ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
[ 34.61538462,  34.61538462,  46.15384615,  34.61538462]])
In :
# the shape of the observed array
observed.shape

Out:
(3, 4)

This chi-squared test statistic is a test for the independence of different categories of a population and should only be used when the dimension of observed is two or more. chi2_contingency computes a chi-square statistic. It first computes the expected frequencies and the degrees of freedom from the given contingency table. The degrees of freedom is calculated as follows: dof = observed.size - sum(observed.shape) + observed.ndim - 1.

The function returns four pieces of output:

1. The chi-squared test statistic as a float
2. The p-value of the test as a float
3. The degrees of freedom as an integer
4. The expected frequencies based on the marginal sum of the table of observed frequencies. This is returned as an ndarray of the same shape as the table of observed values.
In :
# import from scipy.stats
from scipy.stats import chi2_contingency
# Calculate the chi-squared test statistic, the p-value of the test, the degrees of freedom and the expected frequencies.
chi2_contingency(observed)
#print(f"The chi-squared statistic calculated using the scipy stats package is {chi2_contingency(obs)} ")

Out:
(24.5712028585826,
0.0004098425861096696,
6,
array([[ 80.53846154,  80.53846154, 107.38461538,  80.53846154],
[ 34.84615385,  34.84615385,  46.46153846,  34.84615385],
[ 34.61538462,  34.61538462,  46.15384615,  34.61538462]]))
In :
print(f"The degrees of freedom is {observed.size - sum(observed.shape) + observed.ndim - 1}")

The degrees of freedom is 6

In :
print(f"The chi-squared statistic calculated using the scipy stats package is {chi2_contingency(observed)} \n")
print(f"The p-value is {chi2_contingency(observed)} ")

The chi-squared statistic calculated using the scipy stats package is 24.5712028585826

The p-value is 0.0004098425861096696


### Interpreting the results:¶

The Chi-squared statistic calculated by scipy.stats returns the same value as I manually calculated above using Pandas dataframe and the actual formulas. It also agrees with the value as reported in the Wikipedia article.  The null hypothesis as stated in the article with the table is that each person's neighborhood of residence is independent of the person's occupational classification. The decision on whether or not to reject the null hypothesis is based on the evidence in the sample data. The result of the hypothesis test will either be:

• Reject the null hypothesis $H_0$ in favour of the alternative hypothesis $H_A$ when the data provides sufficient evidence that the alternative hypothesis is true. In this case the result is said to be statistically significant.
• Fail to the reject the null hypothesis $H_0$ when there is insufficient evidence that the alternative hypothesis $H_A$ is true. In this case the result is not statistically significant. Failing to reject the the null hypothesis does not mean that there is evidence that it is true. It just means that there is insufficient evidence to reject it.

In the example above first the population under study was defined as the hypothetical city of 1,000,000 residents with four neighbourhoods, A, B C and D. A random sample of 650 residents were taken and their occupation recorded as "white collar", "blue collar", or "no collar". The null hypothesis is that each person's neighborhood of residence is independent of the person's occupational classification. The evidence is evaluated using the Chi-squared statistical test which tests for independence of two categorical variables. A sample statistic is based on the a sample taken from a population. It corresponds to the population characteristic under investigation. It was calculated here as 24.5712 which when rounded is the same as the value on the Wikipedia page. The p-value was also calculated by scipy.stats as 0.0004098425861096696.

These values can be used to decide if the two categorical variables studied are independent or not. If the sample statistic is far enough away from the hypothesised value and in the direction of the alternative hypothesis $H_A$ then we can conclude that the null hypothesis $H_0$ is probably false. This would mean that the sample could not have come from a population with the hypothesised value and the null hypothesis is rejected. Otherwise we fail to reject the null hypothesis.

Any decision making procedure can lead to wrong decisions being made. A Type I Error is where the null hypothesis $H_0$ is rejected when it is true. A Type II Error is where we do not reject the null hypothesis when it is false. With statistical hypothesis testing the probabilility of making a Type I or a Type II error can be quantified and controlled. The probability of making a Type I error is known as the tests significance level and is referred to as $\alpha$ or alpha. The probability of a Type II error is known as $\beta$. The power of a test is $1-\beta$. Reducing the $\alpha$ value increases the $\beta$ value and vice versa. The only way to reduce $\beta$ without increasing $\alpha$ is by increasing the sample size. The value of $\alpha$ is usually set as 0.05 or 0.01. The null hypothesis is rejected if the sample statistic is far enough away from the hypothesis value.

When the observed and expected cell counts are compared, if the differences between them are too big to be explained by sampling variation then the null hypothesis is rejected in favour of the alternative hypothesis. The bigger the difference, the less likely a Type I error will arise. The definition of how big is based on $\alpha$. A test statistic is calculated which belongs to a known distribution, in this case the $\chi^2$ frequency distribution.

The null hypothesis is rejected if the chi-squared test statistic is greater than some critical value. The critical value is chosen in such as way that the probability of a Type I Error is equal to $\alpha$ and to do this you need to know the sampling distribution of $\chi^2$ when the null hypothesis is true. (The distribution is approximately a chi-squared distribution with k-1 degrees of freedom.) The critical value divides the possible values of the test statistic into a critical rejection region and a non-critical non-rejection region. If the test statistic lies in the rejection region then the null hypothesis is rejected, otherwise we fail to reject the null hypothesis.

Large values of the chi-squared statistic suggest rejection of the null hypothesis in favour of the alternative hypothesis. The overall difference between what was observed and what would have been expected under the null hypothesis is larger. Smaller values of the chi-squared test statistic are compatible with the null hypothesis.

The Chi-squared value is compared to the Chi-squared critical value which takes into account the alpha level of significance and the degrees of freedom. If the chi-squared test statistic is greater than the critical value of chi at that alpha level for the number of degrees of freedom, then the null hypothesis of independence is rejected and we can conclude that the data contains sufficient evidence of neighbourhood preference by occupational categories. Without computers you would have to refer to a statistical tables such as the New Cambridge Statistical tables  and compare your calculated test statistic with the critical values, using the alpha level and the degrees of freedom. The chi-squared critical value from page 41 of Table 8 in the tables is 12.59 and since the the chi-squared test statistic is greater than the critical value we can reject the null hypothesis and conclude that the data contains sufficient evidence of neighbourhood preference by occupational classification.

The chi2_contingency function from the SciPy Stats package calculates and provides all of the information required to make the same decision given the same data. It does not return the critical value but it does return the chi-squared test statistic, the p-value, the degrees of freedom and the expected frequencies.

The $\alpha$ value is the probability of rejecting the null hypothesis when it is true. It is generally chosen as 0.05 but a smaller alpha value of 0.01 can be used where more precision is required. If the p-value for the test is greater than the alpha value then the null hypothesis is not rejected.

The calculated p-value is 0.0004098425861096696 and this is less than the alpha level of 0.05 which means that the null hypothesis should be rejected in favour of the alternative hypothesis. We can conclude that the data supports the hypothesis that the choice of neighbourhood is not independent of occupational classification. The two categorical variables in this data are not independent. The data contains sufficient evidence of neighbourhood preference by occupational classification.

Note that the conditions of the Chi-squared test is that the approximation is good when every expected cell count is greater than or equal to 5, and the $n$ observed counts are a random sample from the population of interest.

The standard deviation of an array of numbers x is calculated using numpy as $np.sqrt(np.sum((x - np.mean(x))^2)/len(x))$ . However, Microsoft Excel has two different versions of the standard deviation calculation, STDDEV.P and STDDEV.S. The STDDEV.P function performs the above calculation but in the STDDEV.S calculation the division is by len(x)-1 rather than len(x).

• Research these Excel functions, writing a note in a Markdown cell about the difference between them. Then use numpy to perform a simulation demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample. Note that part of this task is to figure out the terminology in the previous sentence

### The Excel STDDEV.P and STDDEV.S functions.¶

To begin I refer to the Microsoft documentation here on these two Excel functions. The standard deviation is a measure of how widely values are dispersed from the average value (the mean). According to the Microsoft documentation, the STDEV function estimates standard deviation based on a sample. This function has been replaced with one or more new functions that may provide improved accuracy and whose names better reflect their usage. These are the STDDEV.P and STDDEV.S functions.

• The STDDEV.S function estimates standard deviation based on a sample (ignores logical values and text in the sample). STDEV.S assumes that its arguments are a sample of the population. The standard deviation is calculated using the "n-1" method. If your data represents the entire population, then compute the standard deviation using STDEV.P.

### Statistics¶

The aim of statistics is to make statements about the population using a sample. A parameter is a population characterestic while a sample statistic is any quantity computed from values in the sample. A sample of size $\text{n}$ is just one of many that can be drawn from a population of size $\text{N}$. While the value of a population parameter is fixed, the value of a statistic varies from sample to sample. It is a random variable and it's probability distribution known as it's sampling distribution.

The population mean takes the sum of every data point in the population and divide by the number of observations ($N$ data points) to get the population mean $\mu$. The sample mean $\bar{x}$ is the sum of the observations in a sample divided by the number of observations $\text{n}$. The next thing to consider is the dispersion of the data points from the mean. The variance is the the sum of the squared distances from the population mean divided by the total number of data points in the population, i.e. the average of the squared difference (of each data point) from the mean. The standard deviation, which is the square root of the variance, is the most commonly used measure of the dispersion of a data set. The larger the standard deviation, the greater the spread of the data.

A statistic is biased if, in the long run, it consistently over or underestimates the parameter it is estimating. More technically it is biased if its expected value is not equal to the parameter. A statistic is positively biased if it tends to overestimate the parameter; a statistic is negatively biased if it tends to underestimate the parameter. An unbiased statistic is not necessarily an accurate statistic. If a statistic is sometimes much too high and sometimes much too low, it can still be unbiased. It would be very imprecise, however. A slightly biased statistic that systematically results in very small overestimates of a parameter could be quite efficient. 

n-1 is usually used in the sample variance formula instead of n because using n tends to produce an underestimate of the population variance. For this reason n-1 is used to provide the appropriate correction for this tendency.

According to wikipedia, using n-1 instead of n in the formula for the sample variance and sample standard deviation is known as Bessel's Correction, named after Friedrick Bessel.

This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations. In estimating the population variance from a sample when the population mean is unknown, the uncorrected sample variance is the mean of the squares of deviations of sample values from the sample mean (i.e. using a multiplicative factor 1/n). In this case, the sample variance is a biased estimator of the population variance. Multiplying the uncorrected sample variance by the factor $\frac{n}{n-1}$ gives an unbiased estimator of the population variance

Bessel's correction is the degrees of freedom in the residuals vector $x_1-\bar{x},...,x_n - \bar{x}$ where $\bar{x}$ is the sample mean. There are n independent observations in the sample but only n-1 residuals because they sum to one. Generally Bessel's correction is an approach to reduce the bias due to finite sample size.

So there is more than one way that the sample variance could be calculated. The biased sample variance $S_n$ is a non-unbiased estimator of the population variance. From every one of the n data points in the sample you substract the sample mean $\bar{x}$ and divide by the number of data points n. To get the unbiased estimate instead, you divide by n-1 instead of n. As n-1 is a smaller number than n, then dividing by the smaller number will give a larger value for the sample variance. This unbiased estimate is usually denoted by $S_{n-1}$.

The population variance is calculated using the formula: $$\sigma^2\frac{=\sum(x_i-\mu)^2}{n}$$

The sample variance is calculated using the formula:

$$S^2\frac{=\sum(x_i-\bar{x})^2}{n-1}$$

### Simulation using Numpy to demonstrate.¶

The task requires using Python's NumPy package to perform a simulation demonstrating that the STDDEV.S calculation is a better estimate for the standard deviation of a population when performed on a sample.

The STDDEV.S function estimates standard deviation based on a sample, assumes that its arguments are a sample of the population and therefore the standard deviation is calculated using the "n-1" method. If the data represented the entire population, then the standard deviation should be computed using STDEV.P.

Numpy has a function std for calculating the standard deviation of an array of numbers. Referring to the numpy.std documentation, Numpy's standard deviation std is calculated as the square root of the average of the squared deviations from the mean, i.e. std = sqrt(mean(abs(x - x.mean())**2)) where the average squared deviations uses x.sum() / N where N = len(x). This makes it similar to Excels STDEV.P function.

The std function can take some optional parameters though, one of which is ddof which means delta degree of freedom can be specified. When the default of zero is used, the divisor is N but if this parameter is specified then the divisor becomes N-ddof. Therefore if this parameter is set to 1, the formula for the standard deviation changes from being the population standard deviation to the sample standard deviation.

The Numpy documentation notes that:

in standard statistical practice, ddof=1 provides an unbiased estimator of the variance of the infinite population. ddof=0 provides a maximum likelihood estimate of the variance for normally distributed variables. The standard deviation computed in this function is the square root of the estimated variance, so even with ddof=1, it will not be an unbiased estimate of the standard deviation per se.

Usually we are interested in measurements of a population but it is not practical to measure this parameter. Therefore we take representative samples from the population and get the average of the samples instead. A statistic is a good estimator for a population parameter and measures estimates of population parameters in a sample. The average of a sample taken from a population is unlikely to be the actual true average for the population. This process can be repeated many times though. NumPy random distribution functions can be used to imitate this process by simulating many many samples of a set sample size from a distribution. The mean and standard deviations from the samples can then be used to get the sampling distribution of the mean. While some random samples may have randomly big or randomly small values, overall you could take the average from each of the samples and relate these to the mean of the overall population you took them from.

Below I will simulate a population using NumPy's random module and from the population select some samples and calculate some sample statistics, then compare the sample statistics with the population parameters. I could use the standard normal distribution as the population, it has a mean of zero and a standard deviation and variance of 1. Alternatively any normal distribution can be used as the population by specifying the $\mu$ and $\sigma^2$ or any other distribution such as the Uniform distribution. In general you could take the average of the sample statistics from any population distribution and see how they relate to the population that the samples are drawn from. The mean is an estimator of the centre of the distribution. If all the possible samples were taken and the sample statistics were calculated were calculated for each sample, you could then take the mean of all the sample means and expect this to be close to the mean of the population and therefore a good estimator of the population mean. If you calculated the mean and variance for each of the samples and looked to see how they were distributed you would have the sampling distribution of the mean.

We would expect to see that sample statistics calculated from somes samples would be close to the population parameters but others might vary somewhat. This depends largely on the size of the sample and the actual random values that were drawn from the population which will vary from sample to sample. Some of the random samples will have means that randomly quite big or small or even more extreme. It depends on where exactly the samples came from in the population.

Given the background above we should expect to see that when using n instead of n-1 for the variance and standard deviation calculations that the true variance and standard deviations are being underestimated but when using n-1 they should be closer to the true population parameters.

I will sample 1,000 samples each containing 10 points, then calculate the sample mean, sample variance and sample standard deviation for each sample. The variable samps will hold the 1,000 samples of 10 data points. The mean of samps will not be exactly 0 as it is 10,000 points sampled from the standard normal distribution with mean 0 but it should be fairly close.

The same analysis could be easily applied to a normal distribution with a specified mean and standard deviation by specifying the loc and scale parameters as well as the size and substituting np.random.normal for np.random.standard_normal in the code below.

#### Import Python modules¶

In :
# import libraries using common alias names
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = 12, 8 # makes plots bigger


### The Standard Normal Distribution¶

The data below is generated from the standard normal distribution which has a mean of 0 and a variance of 1. In a standard normal distribution, 99.73% of the value fall between -3 and 3.

In :
plt.rcParams['figure.figsize'] = 12, 8 # makes plots big
fig = sns.histplot(np.random.standard_normal((100000)),label="$\mu=0,\ \sigma=1$")
#plt.legend(loc="best")
plt.xlabel=('mean')
plt.ylabel=("Standard deviation")
#plt.title("Histogram")
plt.suptitle('Standard Normal Distribution with $\mu$=0,$\sigma^2 =1$')
plt.show(fig) ### Simulating data¶

Numpy.random's standard_normal numpy.random.standard_normal(size=None) function samples from a standard Normal distribution (mean=0, stdev=1).

• A size parameter can be provided as an integer to return an array containing a sample of that size.
• If a tuple of integers (m,n,k) is provided as a size parameter then will generate m n k samples.

In the code below 1,000 samples of size 10 are generated from a standard normal population. In a standard normal distribution, 99.73% of the value fall between -3 and 3. Without setting a random seed, each time this code is run gives different statistics.

(The same analysis could be applied to a normal distribution with a specified mean and standard deviation by specifying the loc and scale parameters. For example in the code below use samps = np.random.normal(5, 2, size = (no_samples, sample_size)) to perform the same type of analysis on a standard distribution with a mean of 5 and a standard deviation of 2.)

In :
import numpy as np
np.set_printoptions(formatter={'float': lambda x: "{0:6.3f}".format(x)})

sample_size = 10
# 1000
no_samples = 1000
#size of the sample here is a tuple(m,n) to get m * n samples
samps = np.random.standard_normal((no_samples, sample_size))

# alternatively to run the analysis on a standard distribution
#samps = np.random.normal(5, 2, size = (no_samples, sample_size))

print(f"The mean of all the samples is {samps.mean()}")
print(f"The sample has a {samps.size} points")
print(samps.shape)

The mean of all the samples is 0.002566389717266284
The sample has a 10000 points
(1000, 10)


The 10,000 data points generated were put into 1,000 sub-arrays of size 10 to represent 1,000 samples each containing 10 data points in each. The mean of each sample of 10 can be obtained by applying the mean() function (at axis level 1) or manually by summing each (sample) array of 10 data points and dividing by the number of data points in the sample (10).

Similarly the variances and standard deviations can be calculated using the formula or applying the NumPy var and std functions.

In :
# the mean of all the sample means
sample_means = samps.mean(axis=1)
# samp.sum(axis=1) / sample_size
#print(sample_means)

# the mean of the sample means?
print(f"The mean of the sample means is {sample_means.mean()}")
print(f"The variance of the sample means is {sample_means.var()}")
print(f"The variance of the sample means is {sample_means.var(ddof=1)}")
print(f"The standard deviation of the sample means is {sample_means.std()}")
print(f"The standard deviation of the sample means is {sample_means.std(ddof=1)}")

The mean of the sample means is 0.002566389717266285
The variance of the sample means is 0.09980471803400522
The variance of the sample means is 0.09990462265666188
The standard deviation of the sample means is 0.3159188472282165
The standard deviation of the sample means is 0.31607692522020947

In :
# an array containing all the means of the samples
samps.mean(axis=1)

# manually calculating the mean of each sample using the foruma
mean = samps.sum(axis=1) / sample_size


### Sample statistics using numpy functions¶

As per the numpy.std documentation, setting the ddof parameter uses n-1 as the divisor instead of n to give the unbiased statistic.

In :
# array of the means of the samples
means = samps.mean(axis=1)
# array of the variances (biased using n)
vars_b = samps.var(axis=1)
# array of the variances (unbiased using n-1)
vars_ub = samps.var(axis=1, ddof=1)
# array of the standard deviations using biased (n)
stds_b = samps.std(axis=1)
# array of the sample standard deviations using unbiased (n-1)
stds_ub = samps.std(axis=1, ddof=1)


#### Sample Statistics using manual calculations¶

Here the sample variances and standard deviations are calculated manually using the formulas noted above.

In :
# newaxis converts the array into column vector
diff_from_mean = (samps - mean[:, np.newaxis])**2
# Calculate the biased variance for each of the samples using n
variances_b = diff_from_mean.sum(axis=1) / (sample_size)
# Calculate the unbiased variance for each of the samples using n-1
variances_ub = diff_from_mean.sum(axis=1) / (sample_size-1)
# Calculate the biased standard deviations for each of the samples using n
standard_deviations_b = np.sqrt(variances_b)
# Calculate the ubbiased standard deviations for each of the samples using n-1
standard_deviations_ub = np.sqrt(variances_ub)


#### Checking to see if the manual calculations are the same as using the NumPy var and std.¶

(To run this cell highlight the cell and click y).

# checking to see if the vars_b == variances_b vars_ub == variances_ub stds_b == standard_deviations_b stds_ub ==standard_deviations_ub

### Comparing the biased and unbiased statistics to the population parameters.¶

Using the formula for the biased variance gives a mean variance that is the exact same as using the numpy var() formula. Adjusting the variance formula to divide the differences from the mean by $n-1$ gives a mean variance that is very close to the true variance of the population.

Multiplying the uncorrected sample variance by the factor $\frac{n}{n-1}$ gives an unbiased estimator of the population variance.

In :
# using the var() formula
print(f"The mean of the variances using np.vars() is {vars_b.mean()}")
# the mean of the biased variance formula
print(f"The mean of the biased variance (variance calculated using n) is {variances_b.mean()}")
# mean of the unbiased formula
print(f"The mean of the unbiased variances (variance calculate using n-1) is {variances_ub.mean()}")
# the actual population variance (if my sample of 10000 points is taken as the full population for this demo)
print(f"The actual population variance is {samps.var()}")
# using Bessel correction by multiplying the biased sample variance by n/n-1
print(f"Using Bessel's correction gives {variances_b.mean() * (sample_size / (sample_size - 1.0))}")

print(f"The difference between the population variance and the means of the biased variances is {samps.var() - variances_b.mean()}")
print(f"The difference between the population variance and the means of the unbiased variances is {samps.var() - variances_ub.mean()}")

The mean of the variances using np.vars() is 0.8858265772434409
The mean of the biased variance (variance calculated using n) is 0.8858265772434409
The mean of the unbiased variances (variance calculate using n-1) is 0.9842517524927121
The actual population variance is 0.9856312952774461
Using Bessel's correction gives 0.9842517524927122
The difference between the population variance and the means of the biased variances is 0.09980471803400515
The difference between the population variance and the means of the unbiased variances is 0.0013795427847339559

In :
# using the std() formula
print(f"The mean of the standard deviations using np.std() is {stds_b.mean()}")
# the mean of the biased standard deviation formula
print(f"The mean of the biased standard deviation (standard deviation calculated using n) is {stds_b.mean()}")
# mean of the unbiased formula
print(f"The mean of the unbiased standard deviations (using n-1) is {stds_ub.mean()}")
# the actual population standard deviation (if my sample of 10000 points is taken as the full population for this demo)
print(f"The actual population standard deviation is {samps.std()}")

print(f"The difference between the population standard deviation and the means of the biased standard deviations is {samps.std() - stds_b.mean()}")
print(f"The difference between the population standard deviation and the means of the unbiased standard deviation is {samps.std() - stds_ub.mean()}")

The mean of the standard deviations using np.std() is 0.9144098068393425
The mean of the biased standard deviation (standard deviation calculated using n) is 0.9144098068393425
The mean of the unbiased standard deviations (using n-1) is 0.9638725681356453
The actual population standard deviation is 0.9927896530874232
The difference between the population standard deviation and the means of the biased standard deviations is 0.0783798462480807
The difference between the population standard deviation and the means of the unbiased standard deviation is 0.028917084951777916


The above comparisons show that using the unbiased variances and unbiased standard deviations on the samples drawn from the population give a closer estimate of the true population parameters than the unbiased formulas.

If the sample size for each sample drawn from the population is made smaller, the difference between the population standard deviation and the mean of the sample sample deviations is more notable. Note in the above I used a large sample drawn from the standard normal distribution as my population. The results could also be compared with the population parameters of the standard normal distribution which has a variance and standard deviation of 1.

### Plotting the sample mean and variances¶

Below I will plot the sample statistics from each of the samples and compare to the population parameters. (This is something I saw on a Khan Academy video but I don't know what application was actual used for plotting there. It didn't appear to be Python.) The sample mean and sample variance for each of the samples are calculated and plotted on the graph. A vertical line represents the actual population mean while a horizontal line represents the population variance. khan Academy

In :
x = sample_means
y = stds_b
z = stds_ub
a = stds_b.mean()
# mean of the unbiased formula
b = stds_ub.mean()

In :
%matplotlib inline
# set up the subplots and figure sizes
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(x =x,y=y, hue=y, ax=axes)
sns.scatterplot(x =x,y=z, hue=y, ax=axes)
axes.axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes.axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes.axhline(samps.std(), color='red', linewidth=2, linestyle=":")
axes.axhline(stds_b.mean(), color='green', linewidth=2)
axes.axhline(samps.std(), color='red', linewidth=3, linestyle=":")
axes.axhline(stds_ub.mean(), color='cyan', linewidth=2)
axes.set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes.set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes.set_title("Using biased standard deviations")
axes.set_title("Using unbiased standard deviations")
plt.show() The Khan academy video  also looked at the size of the samples. With a smaller sample size, you are much more likely to get a sample mean that is a bad estimate for the population mean and you are more likely to significantly underestimate the sample variance. Their demo showed that the cases where the sample variance is significantly underestimated are also disproportionately the cases where the means of those samples are way off from the true sample mean. For example, if you took many small samples of size 2 from the population, over many trials with samples of size 2 you see that the ratio of the biased sample variance to the population variance approaches half of the true population variance. With a sample size of 3 the ratio approaches two thirds of the true population variance. For sample size of 4 the ratio approaches $\frac{3}{4}$ of the true population variance. This illustrates that when using the biased estimate instead of approaching the population variance we approach $\frac{n-1}{n}$ times the population variance. Therefore to get the best estimate, an unbiased estimate of the true population variance, multiply by $\frac{n}{n-1}$ to get an unbiased estimate. $$\frac{(n-1)\sigma^2}{n}.\frac{n}{n-1}=\sigma^2$$

Below I have plotted the means of the sample standard deviations for various sample sizes between 2 and 6. The plots on the left used n in the denominator while the plots on the right used n-1.

The green horizontal line on the plot on the left (representing the mean of the biased standard deviations) moves further away from the population mean as the sample size gets smaller. When the sample size is only two, the green line is just over 0.5. For a sample size of 3, the mean of the biased standard deviations is above 0.6, for a sample size of 4 it is approx 0.8. Meanwhile the line representing the mean of the standard deviations calculated using n-1 do not vary as much.

The plots above show that the means are clusted around the population mean of zero which is shown by the vertical blue dashed line. Some sample means are below and some are above and some more are quite a bit away from the population mean. This would depend on the actual values in each sample. Similarly with the standard deviations. The dashed red horizontal line represents the population standard deviation of 1. Again while the sample sample standard are clustered around them, some sample standard deviations are above, some below and some more extreme than others. However when the mean of the sample standard deviations are plotted you can see that the mean of the unbiased standard deviations are closer to the actual standard devision of the population than the mean of the biased sample standard deviations.

In :
def plotx(s):

sample_size = s
# 1000
no_samples = 1000
#size of the sample here is a tuple(m,n) to get m * n samples
samps = np.random.standard_normal((no_samples, sample_size))

# the mean of all the sample means
sample_means = samps.mean(axis=1)

# array of the means of the samples
means = samps.mean(axis=1)
# array of the variances (biased using n)
vars_b = samps.var(axis=1)
# array of the variances (unbiased using n-1)
vars_ub = samps.var(axis=1, ddof=1)
# array of the standard deviations using biased (n)
stds_b = samps.std(axis=1)
# array of the sample standard deviations using unbiased (n-1)
stds_ub = samps.std(axis=1, ddof=1)

x = sample_means
y = stds_b
z = stds_ub
a = stds_b.mean()
# mean of the unbiased formula
b = stds_ub.mean()

# set up the subplots and figure sizes
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.scatterplot(x =x,y=y, hue=y, ax=axes)
sns.scatterplot(x =x,y=z, hue=y, ax=axes)
axes.axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes.axvline(samps.mean(), color='blue', linewidth=2, linestyle="--")
axes.axhline(samps.std(), color='red', linewidth=2, linestyle=":")
axes.axhline(stds_b.mean(), color='green', linewidth=2)
axes.axhline(samps.std(), color='red', linewidth=3, linestyle=":")
axes.axhline(stds_ub.mean(), color='cyan', linewidth=2)
axes.set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes.set(xlabel='Sample Means', ylabel='Sample Standard Deviations')
axes.set_title("Using biased standard deviations")
axes.set_title("Using unbiased standard deviations")
plt.show()
for i in range(2,8):
plotx(i)  