Python for Data Analysis

By Angela C

October 1, 2021 in Python numpy

Reading time: 6 minutes.

Some notes on various Python libraries used for data analytics.

Some basic Python

  • Variable assignment using the = operator.
  • Calculations can be performed with variables.
  • Data types such as strings str, integers int, float, boolean bool.
  • Many Python libraries including pandas for data analysis, numpy for scientific computing, matplotlib and seaborn for 2-d plotting, scikit-learn for machine learning, plotly for interactive visualisations, dash plotly for dashboards and many more.

Strings in Python

There are various methods for working with strings including .upper(), .lower() and .title(), .replace(), .count(), .strip() etc.
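As a minimal sketch of the string methods listed above (the sample string is made up for illustration): strings are immutable, so each method returns a new string rather than changing the original.

```python
# Hypothetical sample string, used only for illustration.
text = "  python for data analysis  "

stripped = text.strip()                       # remove leading/trailing whitespace
upper = stripped.upper()                      # all upper case
title = stripped.title()                      # first letter of each word capitalised
replaced = stripped.replace("data", "numerical")
n_a = stripped.count("a")                     # how many times 'a' occurs

print(upper)
print(title)
print(replaced)
print(n_a)
```

The original text variable is unchanged after these calls.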

See post on Strings in blog section. Also String partitioning


Lists

Subsets

  • mylist[1] select item at index 1
  • mylist[-2] select second last item

Slices

  • mylist[1:4] select items from index 1 up to (but not including) index 4
  • mylist[:3] select items before index 3
  • mylist[:] a copy of the list

Lists of lists can be subset

  • list[0][2] select list in index 0, from that select items in index 2
  • list[1][:3] select list in index 1, from that select items up to index 3
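The subsetting and slicing rules above can be sketched as follows. The lists here are hypothetical examples, and the nested list is called nested rather than list to avoid shadowing the built-in name.

```python
# Hypothetical example lists.
mylist = ["a", "b", "c", "d", "e"]

second = mylist[1]         # item at index 1
second_last = mylist[-2]   # second last item
middle = mylist[1:4]       # index 1 up to, but not including, index 4
first_three = mylist[:3]   # items before index 3
list_copy = mylist[:]      # a (shallow) copy of the list

nested = [[1, 2, 3, 4], [5, 6, 7, 8]]
item = nested[0][2]        # list at index 0, then the item at index 2
part = nested[1][:3]       # list at index 1, then items before index 3

print(second, second_last, middle, first_three, item, part)
```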

List operations

  • list1 + list2 to add lists together
  • list1*3 to multiply a list

List methods

Lists are mutable. If you don’t want to change a list, work on a copy (for example mylist[:]) instead. Apart from index() and count(), which only read the list, all the methods below change the original list in place.

  • mylist.index('a') to get the index of an item in the list
  • mylist.count('a') to count occurrences of an item in the list
  • mylist.append('z') to append an item to the end of a list
  • mylist.extend(['x', 'y']) to append each item of an iterable to the end of the list
  • mylist.remove('z') to remove an item from a list (first occurrence)
  • del(mylist[0:2]) to remove the items at index 0 and 1
  • mylist.reverse() to reverse a list
  • mylist.pop() to remove the last item from the list
  • mylist.pop(-2) to remove the second last item
  • mylist.insert(1, 'x') to insert an item at index 1. The index must be provided
  • mylist.sort() to sort a list
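A short sketch of the list operations and methods above, using a made-up list called letters. The comments track the state of the list after each in-place change; note that + and * build new lists, while index() and count() only read the list.

```python
# Hypothetical example list.
letters = ["a", "b", "a"]

combined = letters + ["c"]   # new list; letters itself is unchanged
repeated = ["x"] * 3         # ['x', 'x', 'x']

idx = letters.index("a")     # 0, index of the first 'a'
cnt = letters.count("a")     # 2

letters.append("z")          # ['a', 'b', 'a', 'z']
letters.extend(["x", "y"])   # ['a', 'b', 'a', 'z', 'x', 'y']
letters.remove("a")          # ['b', 'a', 'z', 'x', 'y']
last = letters.pop()         # returns 'y'; list is ['b', 'a', 'z', 'x']
letters.insert(1, "q")       # ['b', 'q', 'a', 'z', 'x']
letters.sort()               # ['a', 'b', 'q', 'x', 'z']

print(letters)
```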

Some Python packages


Numpy

Numpy: A Python library for creating and manipulating vectors and matrices. It is the core library for scientific computing in Python. Numpy provides high-performance multi-dimensional array objects and the tools for working with these arrays.

Numpy is usually imported using the alias np.

import numpy as np

Arrays can be created using np.array()

a = np.array([1,3,5,7])
b = np.array([[3,6,8,9], [2,5,7,9]])
c = np.array([[1.3, 3.5], [3, 5.4]], dtype=float)

Placeholders can be used:

  • np.zeros to create an array of zeros
  • np.ones to create an array of ones
  • np.arange() to create an array with start, stop and optional step parameters.
np.arange(2, 20, 2)
array([ 2,  4,  6,  8, 10, 12, 14, 16, 18])
  • np.linspace() to create an array of evenly spaced values
np.linspace(0,5, 10)
array([0.        , 0.55555556, 1.11111111, 1.66666667, 2.22222222,
       2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.        ])
  • np.full() to create a constant array
np.full((2,2),8)
array([[8, 8],
       [8, 8]])
  • np.eye() to create an identity matrix
np.eye(3)
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
  • np.random.random() to create an array of random values.

See post on Numpy Random project.

np.random.random((2,3))
array([[0.6447133 , 0.30962173, 0.87798529],
       [0.01284644, 0.03490406, 0.84898446]])
  • np.empty() to create an array of uninitialized (arbitrary) data of the given shape, dtype, and order.
np.empty((2,4))
array([[3., 6., 8., 9.],
       [2., 5., 7., 9.]])

Inspecting arrays:

  • a.shape for array dimensions (an attribute, so no parentheses),
  • len(a) for the length of the array
  • a.ndim for the number of array dimensions
  • a.size for the number of elements
  • a.dtype for the data type of the elements
  • a.dtype.name for the name of the data type.
  • a.astype(int) to convert to another data type.
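A small sketch of inspecting an array. Note that shape, ndim, size and dtype are attributes and are read without parentheses, while astype() is a method. The dtype is set explicitly here only so that the result does not depend on the platform default.

```python
import numpy as np

# The 2 x 4 array used earlier, with an explicit integer dtype.
b = np.array([[3, 6, 8, 9], [2, 5, 7, 9]], dtype=np.int64)

shape = b.shape              # (2, 4): 2 rows, 4 columns
length = len(b)              # 2: length of the first dimension
ndim = b.ndim                # 2 dimensions
size = b.size                # 8 elements in total
dtype_name = b.dtype.name    # 'int64'
as_float = b.astype(float)   # new array converted to float64

print(shape, length, ndim, size, dtype_name, as_float.dtype)
```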

Numpy Data Types:

  • np.int64 signed 64-bit integer
  • np.float64 standard double precision floating point (np.float32 is single precision)
  • np.complex128 complex numbers represented by two 64-bit floats
  • np.bool_ for boolean True and False values
  • object for Python object type (the old np.object alias has been removed from NumPy)
  • np.bytes_ for fixed-length string type (formerly np.string_)
  • np.str_ for fixed-length unicode type (formerly np.unicode_)

Operations for performing maths on arrays:

  • np.add(a,b) same as a + b
  • np.subtract(a,b) same as a - b
  • np.divide(a,b) same as a / b
  • np.multiply(a,b) same as a * b
  • np.exp(a) for exponentials
  • np.sqrt(b) for square roots
  • np.cos(a), np.sin(a) etc for element wise cosines, sines etc
  • np.log(a) element wise natural logarithm
  • a.dot(b) for dot product of two arrays
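The element-wise operations above can be sketched with two small arrays of the same shape (hypothetical values chosen so the results are easy to check by hand):

```python
import numpy as np

# Hypothetical arrays of equal length.
a = np.array([1.0, 4.0, 9.0])
b = np.array([1.0, 2.0, 3.0])

total = np.add(a, b)           # same as a + b -> [ 2.,  6., 12.]
diff = np.subtract(a, b)       # same as a - b -> [0., 2., 6.]
ratio = np.divide(a, b)        # same as a / b -> [1., 2., 3.]
product = np.multiply(a, b)    # same as a * b -> [ 1.,  8., 27.]
roots = np.sqrt(a)             # [1., 2., 3.]
dot = a.dot(b)                 # 1*1 + 4*2 + 9*3 = 36.0

print(total, diff, ratio, product, roots, dot)
```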

Comparison operators

  • a == b element-wise comparison, resulting in an array of True and False values
  • a > 2 element-wise comparison against a single value
  • np.array_equal(a,b) array-wise comparison, resulting in a single True or False
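A minimal sketch of the difference between element-wise and array-wise comparison. Note the double equals sign (a single = is assignment) and that the NumPy function is np.array_equal. The second array is a made-up example.

```python
import numpy as np

a = np.array([1, 3, 5, 7])
b = np.array([1, 2, 5, 8])   # hypothetical second array

equal_mask = a == b          # [ True, False,  True, False]
greater = a > 2              # [False,  True,  True,  True]
same = np.array_equal(a, b)  # False: the arrays differ in two positions

print(equal_mask, greater, same)
```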

Aggregate functions

Array-wise aggregation examples:

  • a.sum()
  • a.min()

The axis can be specified:

  • a.min(axis=0) minimum value of each column (computed down the rows)
  • a.max(axis=1) maximum value of each row (computed across the columns)
  • a.cumsum() for cumulative sum
  • a.mean() or np.mean(a) for mean
  • a.std() or np.std(a) for standard deviation
  • np.median(a) for median
  • np.corrcoef(a) for correlation coefficients
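The effect of the axis argument is easiest to see on a small 2-D array. This sketch reuses the 2 x 4 array from earlier: axis=0 collapses the rows and gives one result per column, while axis=1 collapses the columns and gives one result per row.

```python
import numpy as np

b = np.array([[3, 6, 8, 9],
              [2, 5, 7, 9]])

total = b.sum()              # 49, over the whole array
col_min = b.min(axis=0)      # [2, 5, 7, 9], one value per column
row_max = b.max(axis=1)      # [9, 9], one value per row
running = b.cumsum()         # cumulative sum over the flattened array
mean = b.mean()              # 6.125
median = np.median(b)        # 6.5

print(total, col_min, row_max, running, mean, median)
```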

Copying arrays

  • np.copy(a) to copy an array or a.copy() to create a deep copy
  • a.view() to create a view of the array with same data
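A sketch of the difference between a copy and a view: a view shares its data with the original array, so writing through the view changes the original, while a copy is independent.

```python
import numpy as np

a = np.array([1, 3, 5, 7])

v = a.view()   # new array object looking at the same data
c = a.copy()   # independent copy of the data

v[0] = 100     # also changes a, because the data is shared
c[1] = -1      # does not affect a

print(a)       # [100   3   5   7]
print(np.shares_memory(a, v), np.shares_memory(a, c))
```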

Sorting arrays

  • a.sort() to sort an array in place
  • a.sort(axis=0) to sort along axis 0 (each column is sorted)
  • a.sort(axis=1) to sort along axis 1 (each row is sorted)

Subsetting, slicing, indexing

Subsetting and slicing using [].

  • a[:3] to select elements up to index 3 (at index 0, 1 and 2)
  • a[1,2] select elements at row 1 column 2.
    This is the same as a[1][2]
  • a[:,1] select all rows in column 1
  • a[::-1] to reverse an array
  • a[a>2] boolean indexing

Array Manipulation

Transposing an array

  • np.transpose(a) same as a.T
  • a.ravel() to flatten an array

Changing array shape

  • a.reshape(2,3) to reshape an array with 6 elements to a 2 by 3. If the number of elements is unknown use a.reshape(2,-1).
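A minimal sketch of reshaping, transposing and flattening, using a hypothetical 6-element array. Passing -1 asks NumPy to work out that dimension from the number of elements.

```python
import numpy as np

a = np.arange(6)            # [0, 1, 2, 3, 4, 5]

m = a.reshape(2, 3)         # [[0, 1, 2], [3, 4, 5]]
m_auto = a.reshape(2, -1)   # same result; -1 is inferred as 3
t = m.T                     # transpose, shape (3, 2)
flat = m.ravel()            # back to one dimension

print(m)
print(t)
print(flat)
```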

Adding or removing elements

  • a.resize() to change the shape of an array in place; np.resize(a, new_shape) returns a new array with the specified shape
  • np.append(a,b) to append items to an array
  • np.insert() to insert items into an array
  • np.delete() to delete items from an array
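A sketch of adding and removing elements. The np.append(), np.insert(), np.delete() and np.resize() functions all return new arrays and leave the original untouched; the values used are illustrative only.

```python
import numpy as np

a = np.array([1, 3, 5, 7])

appended = np.append(a, [9, 11])   # [ 1,  3,  5,  7,  9, 11]
inserted = np.insert(a, 1, 2)      # insert the value 2 before index 1
deleted = np.delete(a, [0])        # remove the item at index 0
resized = np.resize(a, (2, 3))     # new 2 x 3 array, repeating a as needed

print(appended, inserted, deleted)
print(resized)
```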

Combining arrays

  • np.concatenate((a,b), axis=0)
  • np.vstack((a,b)) to stack arrays vertically (row-wise)
  • np.hstack((a,b)) to stack arrays horizontally (column wise)

Splitting arrays

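  • np.hsplit(a, 2) to split an array horizontally (column-wise) into 2 equal parts; np.hsplit(a, [2]) splits at index 2
  • np.vsplit(a, 2) to split vertically (row-wise) into 2 equal parts; np.vsplit(a, [2]) splits at index 2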
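Combining and splitting can be sketched together with two hypothetical 2 x 2 arrays. An integer second argument to np.hsplit() or np.vsplit() gives the number of equal sections, while a list gives the indices to split at.

```python
import numpy as np

# Hypothetical 2 x 2 arrays.
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

v = np.vstack((a, b))                 # shape (4, 2): b stacked below a
h = np.hstack((a, b))                 # shape (2, 4): b stacked beside a
c = np.concatenate((a, b), axis=0)    # same result as vstack here

left, right = np.hsplit(h, 2)         # two equal column blocks
top, bottom = np.vsplit(v, 2)         # two equal row blocks
first, rest = np.hsplit(h, [1])       # split at column index 1

print(v.shape, h.shape, first.shape, rest.shape)
```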

Pandas - for data wrangling

In a long format dataframe, each row is a complete and independent representation. In a wide dataframe, categorical values are grouped.

pd.pivot_table() and pd.pivot()

  • pd.pivot_table(): To transform a long-format dataframe to wide format. Create a spreadsheet-style pivot table as a DataFrame.

  • pd.pivot_table() is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

  • index: the variables to remain untouched

  • columns: the variables to be spread across more columns

  • values: the numerical values to be aggregated or processed

The output of pivot_table() can be a DataFrame with a multi-index, for example when several index or columns keys are passed. This can be transformed to a regular index using the reset_index() and rename_axis() methods.

Some columns might be better represented as column names instead of values. pivot() reshapes a DataFrame in this way without aggregating, and its output can be flattened with reset_index() and rename_axis() as well.

.rename_axis() sets the name of the axis for the index or columns.

  • df.pivot : Pivot without aggregation that can handle non-numeric data

pd.pivot to pivot a dataframe spreading rows into columns
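A minimal sketch of both functions, assuming a small made-up long-format DataFrame (the city, year and sales columns are hypothetical). pivot_table() aggregates duplicate index/column pairs, whereas pivot() requires each pair to be unique.

```python
import pandas as pd

# Hypothetical long-format data; Cork has two rows for 2021.
long_df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork", "Cork"],
    "year": [2020, 2021, 2020, 2021, 2021],
    "sales": [10, 12, 7, 9, 11],
})

# Long to wide, averaging the duplicate (Cork, 2021) rows.
wide = pd.pivot_table(long_df, index="city", columns="year",
                      values="sales", aggfunc="mean")

# Flatten: move 'city' back to a column and drop the 'year' axis name.
flat = wide.reset_index().rename_axis(None, axis=1)

# pivot() does not aggregate, so duplicates are dropped first here.
unique_df = long_df.drop_duplicates(subset=["city", "year"])
spread = unique_df.pivot(index="city", columns="year", values="sales")

print(wide)
print(flat)
print(spread)
```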

Melting dataframes using the .melt method

  • pd.melt() to transform wide to long.
    Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

  • id_vars are values to keep as rows, duplicated as needed.

  • value_vars are columns to be taken and made as values, melted into a new column. If the value_vars is not specified, then all columns that are not included in id_vars will be used as value_vars.

  • var_name is optional
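The melt parameters above can be sketched on a small made-up wide DataFrame. The column names are hypothetical, and value_name is an additional optional parameter that names the new values column.

```python
import pandas as pd

# Hypothetical wide-format data: one column per year.
wide_df = pd.DataFrame({
    "city": ["Dublin", "Cork"],
    "y2020": [10, 7],
    "y2021": [12, 10],
})

# Wide to long: 'city' is kept as an identifier, the year columns
# are melted into a 'year' column and a 'sales' column.
long_df = pd.melt(wide_df, id_vars="city",
                  value_vars=["y2020", "y2021"],
                  var_name="year", value_name="sales")

print(long_df)
```

Each city now appears once per melted column, so the 2 x 3 wide table becomes a 4 x 3 long table.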

Group by

The groupby method allows you to group rows of data together and call aggregation functions on the grouped rows. The by parameter takes a column name or a list of the columns you want to group by.

Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
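A short sketch of the split, apply and combine steps, assuming a made-up DataFrame with hypothetical city and sales columns:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "city": ["Dublin", "Dublin", "Cork", "Cork", "Cork"],
    "sales": [10, 12, 7, 9, 11],
})

# Split the rows by city, then sum the sales within each group.
totals = df.groupby(by=["city"])["sales"].sum()

# Several aggregations can be applied at once with agg().
summary = df.groupby("city")["sales"].agg(["mean", "max"])

print(totals)
print(summary)
```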

Stacking and Unstacking

Melting