Python for Data Analysis
By Angela C
October 1, 2021
Reading time: 6 minutes.
Some notes on various Python libraries used for data analytics.
Some basic Python
- Variable assignment using the = operator.
- Calculations can be performed with variables.
- Data types such as strings (str), integers (int), floats (float) and booleans (bool).
- Many Python libraries including pandas for data analysis, numpy for scientific computing, matplotlib and seaborn for 2-d plotting, scikit-learn for machine learning, plotly for interactive visualisations, dash plotly for dashboards and many more.
Strings in Python
There are various methods for working with strings including .upper(), .lower() and .title(), .replace(), .count(), .strip() etc.
See post on Strings in blog section. Also String partitioning
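As a quick illustration of the string methods listed above, here is a minimal sketch. The variable names and the sample string are made up for the example.

```python
# Hypothetical sample string, used only for illustration.
text = "  python for data analysis  "

stripped = text.strip()                      # remove leading and trailing whitespace
upper = stripped.upper()                     # 'PYTHON FOR DATA ANALYSIS'
title = stripped.title()                     # 'Python For Data Analysis'
replaced = stripped.replace("data", "text")  # 'python for text analysis'
count_a = stripped.count("a")                # 4 occurrences of 'a'

print(upper, title, replaced, count_a, sep="\n")
```

Strings are immutable, so each method returns a new string and leaves the original unchanged.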
Lists
Subsets
- mylist[1] to select the item at index 1
- mylist[-2] to select the second last item
Slices
- mylist[1:4] to select items from index 1 up to (but not including) index 4
- mylist[:3] to select items before index 3
- mylist[:] to make a copy of the list
Lists of lists can be subset
- mylist[0][2] to select the list at index 0 and, from that, the item at index 2
- mylist[1][:3] to select the list at index 1 and, from that, the items up to (but not including) index 3
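The indexing and slicing rules above can be sketched as follows. The list contents and the name nested are made up for the example.

```python
# Hypothetical example lists, used only for illustration.
mylist = ["a", "b", "c", "d", "e"]

second = mylist[1]          # 'b' (indexing starts at 0)
second_last = mylist[-2]    # 'd' (negative indexes count from the end)
middle = mylist[1:4]        # ['b', 'c', 'd'] (index 4 is excluded)
first_three = mylist[:3]    # ['a', 'b', 'c']
list_copy = mylist[:]       # a new list with the same items

nested = [[1, 2, 3, 4], [5, 6, 7, 8]]
item = nested[0][2]         # 3: first inner list, then index 2
part = nested[1][:3]        # [5, 6, 7]: second inner list, items before index 3

print(second, second_last, middle, first_three, item, part)
```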
List operations
- list1 + list2 to concatenate two lists
- list1 * 3 to repeat a list three times
List methods
Lists are mutable. If you do not want to change the original list, make a copy first (for example with mylist[:] or mylist.copy()) and apply the changes to the copy. Apart from index() and count(), which only return information, all the methods below change the original list in place.
- mylist.index('a') to get the index of the first occurrence of an item in the list
- mylist.count('a') to count occurrences of an item in the list
- mylist.append('z') to append an item to the end of a list
- mylist.extend(['x', 'y']) to append each item of another list (or other iterable) to the end of the list
- mylist.remove('z') to remove an item from a list (first occurrence)
- del(mylist[0:2]) to remove the items at index 0 and 1
- mylist.reverse() to reverse a list
- mylist.pop() to remove and return the last item from the list
- mylist.pop(-2) to remove and return the second last item
- mylist.insert(1, 'x') to insert an item at index 1. The index must be provided
- mylist.sort() to sort a list
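A short walk-through of the list methods, with the state of the list shown after each step. The list contents are made up for the example.

```python
# Hypothetical example list, used only for illustration.
mylist = ["c", "a", "b", "a"]

first_a = mylist.index("a")   # 1: index of the first 'a'
n_a = mylist.count("a")       # 2: 'a' appears twice

mylist.append("z")            # ['c', 'a', 'b', 'a', 'z']
mylist.extend(["x", "y"])     # ['c', 'a', 'b', 'a', 'z', 'x', 'y']
mylist.remove("a")            # ['c', 'b', 'a', 'z', 'x', 'y'] (first 'a' only)
del mylist[0:2]               # ['a', 'z', 'x', 'y']
mylist.insert(1, "q")         # ['a', 'q', 'z', 'x', 'y']
last = mylist.pop()           # returns 'y', list is now ['a', 'q', 'z', 'x']
mylist.reverse()              # ['x', 'z', 'q', 'a']
mylist.sort()                 # ['a', 'q', 'x', 'z']

# To keep the original untouched, use sorted(), which returns a new list.
original = ["c", "a", "b"]
sorted_copy = sorted(original)

print(mylist, last, original, sorted_copy)
```

The in-place methods such as sort() and reverse() return None, so writing newlist = mylist.sort() does not give a sorted copy.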
Some Python packages
Numpy
Numpy: A Python library for creating and manipulating vectors and matrices. It is the core library for scientific computing in Python. Numpy provides high-performance multi-dimensional array objects and the tools for working with these arrays.
Numpy is usually imported using the alias np.
import numpy as np
Arrays can be created using np.array()
a = np.array([1,3,5,7])
b = np.array([[3,6,8,9], [2,5,7,9]])
c = np.array([[1.3, 3.5], [3, 5.4]], dtype=float)
Placeholders can be used:
- np.zeros() to create an array of zeros
- np.ones() to create an array of ones
- np.arange() to create an array with start, stop and optional step parameters.
np.arange(2, 20, 2)
array([ 2, 4, 6, 8, 10, 12, 14, 16, 18])
np.linspace() to create an array of evenly spaced values
np.linspace(0,5, 10)
array([0. , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
2.77777778, 3.33333333, 3.88888889, 4.44444444, 5. ])
np.full() to create a constant array
np.full((2,2),8)
array([[8, 8],
[8, 8]])
np.eye() to create an identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
np.random.random() to create an array of random values.
See post on Numpy Random project.
np.random.random((2,3))
array([[0.6447133 , 0.30962173, 0.87798529],
[0.01284644, 0.03490406, 0.84898446]])
np.empty() to create an array of uninitialized (arbitrary) data of the given shape, dtype, and order.
np.empty((2,4))
array([[3., 6., 8., 9.],
[2., 5., 7., 9.]])
Inspecting arrays:
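- a.shape for array dimensions
- len(a) for the length of the array (the size of its first dimension)
- a.ndim for the number of array dimensions
- a.size for the number of elements
- a.dtype for the data type of the elements
- a.dtype.name for the name of the data type
- a.astype(int) to convert to another data type
Note that shape, ndim, size and dtype are attributes rather than methods, so they are used without parentheses.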
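A minimal sketch of inspecting an array, reusing the 2-d array b created earlier in these notes:

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

shape = b.shape               # (2, 4): 2 rows, 4 columns
length = len(b)               # 2: size of the first dimension (rows)
ndim = b.ndim                 # 2 dimensions
size = b.size                 # 8 elements in total
b_float = b.astype(float)     # new array with the values converted to float

print(shape, length, ndim, size, b.dtype, b_float.dtype.name)
```

The integer dtype printed for b depends on the platform, while astype(float) gives float64.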
Numpy Data Types:
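- np.int64 signed 64-bit integer
- np.float32 single precision floating point (np.float64 is the standard double precision type)
- np.complex128 complex numbers represented by two 64-bit floats (128 bits in total)
- np.bool_ for boolean True and False values
- np.object_ for Python object type
- np.bytes_ for fixed-length byte string type (np.string_ in older NumPy versions)
- np.str_ for fixed-length unicode type (np.unicode_ in older NumPy versions)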
Operations for performing maths on arrays:
- np.add(a, b) same as a + b
- np.subtract(a, b) same as a - b
- np.divide(a, b) same as a / b
- np.multiply(a, b) same as a * b
- np.exp(a) for exponentials
- np.sqrt(b) for square roots
- np.cos(a), np.sin(a) etc. for element-wise cosines, sines etc.
- np.log(a) for element-wise natural logarithm
- a.dot(b) for dot product of two arrays
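The equivalence between the numpy functions and the arithmetic operators can be checked with a small sketch. The array values are made up for the example.

```python
import numpy as np

# Hypothetical example arrays, used only for illustration.
a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

total = np.add(a, b)            # same as a + b
difference = np.subtract(a, b)  # same as a - b
product = np.multiply(a, b)     # same as a * b
quotient = np.divide(a, b)      # same as a / b
roots = np.sqrt(a)              # element-wise square roots
dot = a.dot(b)                  # 1*1 + 4*2 + 9*3 = 36.0

print(total, difference, product, quotient, roots, dot)
```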
Comparison operators
- a == b element-wise comparison, resulting in an array of True and False values
- a > 2 element-wise comparison against a single value
- np.array_equal(a, b) array-wise comparison, resulting in a single True or False
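A small sketch of the difference between element-wise and array-wise comparison. The array b here is made up for the example.

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([1, 2, 5, 8])   # hypothetical second array

equal_mask = a == b                # array([ True, False,  True, False])
greater_mask = a > 2               # array([False,  True,  True,  True])
same_array = np.array_equal(a, b)  # False: not every element matches

print(equal_mask, greater_mask, same_array)
```

A single = is assignment, so a = b would rebind the name a rather than compare the arrays.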
Aggregate functions
Array-wise aggregation examples:
- a.sum()
- a.min()
An axis can be specified:
- a.min(axis=0) minimum value of each column (the minimum is taken down the rows)
- a.max(axis=1) maximum value of each row (the maximum is taken across the columns)
- a.cumsum() for cumulative sum
- a.mean() or np.mean(a) for mean
- a.std() or np.std(a) for standard deviation
- np.median(a) for median
- np.corrcoef(a) for correlation coefficients
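The effect of the axis argument is easiest to see on a 2-d array. This sketch reuses the array b from earlier in these notes.

```python
import numpy as np

b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

total = b.sum()                # 49: sum of all elements
col_min = b.min(axis=0)        # array([2, 5, 7, 9]): one value per column
row_max = b.max(axis=1)        # array([9, 9]): one value per row
row_cumsum = b.cumsum(axis=1)  # running total along each row
mean = b.mean()                # 6.125
median = np.median(b)          # 6.5

print(total, col_min, row_max, row_cumsum, mean, median, sep="\n")
```

axis=0 collapses the rows and leaves one result per column, while axis=1 collapses the columns and leaves one result per row.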
Copying arrays
- np.copy(a) or a.copy() to create an independent copy of the array with its own data
- a.view() to create a view of the array that shares the same data
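The difference between a copy and a view shows up once the original array is modified. A minimal sketch:

```python
import numpy as np

a = np.array([1, 3, 5, 7])

a_view = a.view()   # shares the same underlying data as a
a_copy = a.copy()   # has its own data

a[0] = 100          # change the original array

print(a_view[0])    # 100: the view sees the change
print(a_copy[0])    # 1: the copy does not
```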
Sorting arrays
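- a.sort() to sort an array in place
- a.sort(axis=0) to sort along axis 0 (the values within each column)
- a.sort(axis=1) to sort along axis 1 (the values within each row)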
Subsetting, slicing, indexing
Subsetting and slicing using [].
- a[:3] to select elements up to index 3 (at index 0, 1 and 2)
- a[1,2] to select the element at row 1, column 2 of a 2-d array. This gives the same result as a[1][2]
- a[:,1] to select column 1 of a 2-d array
- a[::-1] to reverse an array
- a[a>2] for boolean indexing
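A sketch of these selections, reusing the 1-d array a and the 2-d array b created earlier in these notes:

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]])

first_three = a[:3]     # array([1, 3, 5])
element = b[1, 2]       # 7: row 1, column 2
same_element = b[1][2]  # 7: same result using two indexing steps
column = b[:, 1]        # array([6, 5]): column 1 from every row
reversed_a = a[::-1]    # array([7, 5, 3, 1])
filtered = a[a > 2]     # array([3, 5, 7]): keep values where the mask is True

print(first_three, element, same_element, column, reversed_a, filtered)
```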
Array Manipulation
Transposing an array
- np.transpose(a) same as a.T
- a.ravel() to flatten an array
Changing array shape
- a.reshape(2,3) to reshape an array with 6 elements to a 2 by 3 array. If the size of one dimension is unknown, pass -1 for it and numpy infers it from the number of elements, for example a.reshape(2,-1).
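A minimal sketch of reshaping, transposing and flattening. The array is built with np.arange(6) purely for illustration.

```python
import numpy as np

a = np.arange(6)              # array([0, 1, 2, 3, 4, 5])

two_by_three = a.reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
inferred = a.reshape(2, -1)      # -1 lets numpy work out the 3 columns
transposed = two_by_three.T      # shape (3, 2), same as np.transpose(two_by_three)
flat = transposed.ravel()        # back to 1-d: array([0, 3, 1, 4, 2, 5])

print(two_by_three, inferred.shape, transposed.shape, flat, sep="\n")
```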
Adding or removing elements
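- np.resize(a, (2,3)) to return a new array with the specified shape (a.resize((2,3)) changes the array in place instead)
- np.append(a,b) to append items to an array
- np.insert() to insert items into an array
- np.delete() to delete items from an array
np.append(), np.insert() and np.delete() all return a new array and leave the original unchanged.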
Combining arrays
- np.concatenate((a,b), axis=0) to join arrays along an existing axis
- np.vstack((a,b)) to stack arrays vertically (row-wise)
- np.hstack((a,b)) to stack arrays horizontally (column-wise)
Splitting arrays
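- np.hsplit(a, 2) to split an array horizontally (column-wise) into 2 equal parts
- np.vsplit(a, 2) to split an array vertically (row-wise) into 2 equal parts
To split at a particular index rather than into equal parts, pass a list of indices, for example np.hsplit(a, [2]).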
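Stacking and splitting are roughly inverse operations, as this sketch shows. The two small arrays are made up for the example.

```python
import numpy as np

# Hypothetical example arrays, used only for illustration.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

vertical = np.vstack((a, b))      # shape (4, 2): b placed under a
horizontal = np.hstack((a, b))    # shape (2, 4): b placed to the right of a
joined = np.concatenate((a, b), axis=0)  # same result as vstack here

left, right = np.hsplit(horizontal, 2)   # two equal halves, each shape (2, 2)
first, rest = np.hsplit(horizontal, [1]) # split at column index 1

print(vertical.shape, horizontal.shape, left, right, first.shape, rest.shape)
```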
Pandas - for data wrangling
In a long-format dataframe, each row is a complete and independent observation, with one column holding the category and another holding its value. In a wide-format dataframe, the categories are spread across separate columns.
pd.pivot_table() and pd.pivot()
- pd.pivot_table(): To transform a long-format dataframe to wide format. Creates a spreadsheet-style pivot table as a DataFrame.
- pd.pivot_table() is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- index: the variables to remain untouched
- columns: the variables to be spread across more columns
- values: the numerical values to be aggregated or processed
The output of pivot_table() can be a DataFrame with a multi-index, for example when several index or columns variables are passed. This can be transformed to a regular index using the reset_index() and rename_axis() methods.
Some columns might be better represented as column names instead of values.
The output of pivot() can likewise have a multi-index, and can be transformed to a regular index in the same way using reset_index() and rename_axis().
.rename_axis() sets the name of the axis for the index or columns.
df.pivot: Pivot without aggregation that can handle non-numeric data
pd.pivot to pivot a dataframe spreading rows into columns
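A minimal sketch of going from long to wide format. The dataframe, its column names (city, year, sales) and the choice of aggfunc="sum" are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical long-format data: one row per city and year.
long_df = pd.DataFrame({
    "city": ["Cork", "Cork", "Galway", "Galway"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Spread the years across columns, aggregating sales with a sum.
wide = pd.pivot_table(long_df, index="city", columns="year",
                      values="sales", aggfunc="sum")

# Move 'city' back to a regular column and drop the 'year' axis name.
flat = wide.reset_index().rename_axis(None, axis=1)

# pivot() reshapes without aggregating, so each city/year pair must be unique.
wide_no_agg = long_df.pivot(index="city", columns="year", values="sales")

print(wide, flat, wide_no_agg, sep="\n\n")
```

If the same city and year appeared more than once, pivot_table() would aggregate the duplicates while pivot() would raise an error.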
Melting dataframes using the .melt method
- pd.melt() to transform wide to long. Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
- id_vars are the columns to keep as identifiers, with their values duplicated as needed.
- value_vars are columns to be taken and made as values, melted into a new column. If value_vars is not specified, then all columns that are not included in id_vars will be used as value_vars.
- var_name is optional and sets the name of the new column that holds the former column names.
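A minimal sketch of melting a wide dataframe back to long format. The data and the column names are made up for the example, and value_name is an additional pd.melt() parameter (not mentioned above) that names the new values column.

```python
import pandas as pd

# Hypothetical wide-format data: one column per year.
wide_df = pd.DataFrame({
    "city": ["Cork", "Galway"],
    "2020": [10, 7],
    "2021": [12, 9],
})

long_df = pd.melt(
    wide_df,
    id_vars="city",               # kept as an identifier column
    value_vars=["2020", "2021"],  # columns to unpivot into rows
    var_name="year",              # name for the column of former headers
    value_name="sales",           # name for the column of values
)

print(long_df)
```

Each city now appears once per year, so the two rows of the wide dataframe become four rows in the long one.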
Group by
The groupby method allows you to group rows of data together and call aggregation functions on the grouped rows.
'by' takes a column name or a list of the columns you want to group by.
Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
This can be used to group large amounts of data and compute operations on these groups.
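The split, apply and combine steps can be sketched on a small dataframe. The data and column names are made up for the example.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    "city": ["Cork", "Cork", "Galway", "Galway"],
    "year": [2020, 2021, 2020, 2021],
    "sales": [10, 12, 7, 9],
})

# Split the rows by city, then sum the sales within each group.
totals = df.groupby(by=["city"])["sales"].sum()

# Several aggregations can be computed at once with agg().
summary = df.groupby("city")["sales"].agg(["sum", "mean"])

print(totals)
print(summary)
```

The result is indexed by the grouping column, so totals["Cork"] gives the total for that group.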