s

Pandas

By Angela C

March 3, 2021 in datasets dataframes pandas

Reading time: 6 minutes.

pandas is one of the most commonly used libraries in Python for data analysis and is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

https://pandas.pydata.org/

The pandas user guide and the 10 minutes to pandas tutorial in particular is the starting point. The tutorial outlines how to create data structures in pandas, get data into and out of pandas, view data, edit and filter data, merge datasets as well as how to perform group operations and to reshape data. pandas also has time-series and plotting functionality. The two main data structures are a Series which is a one-dimensional labeled array capable of holding any data type. A pandas ‘DataFrame’ is a 2-dimensional labeled data structure with columns of potentially different types. A DataFrame is like a spreadsheet or SQL table.

With pandas you can create DataFrames out of a dataset or subsets of a dataset. Data can be read in from a variety of file formats including CSV and text files, Excel, JSON, HTML, SQL etc.

Transformations can be applied to the rows and / or columns of a DataFrame.

  • DataFrame() function

    • data
    • index
    • rows
    • columns
  • pandas automatically creates an index.

    • set index to None to stop this
  • shape() to get the shape or size of the dataset

  • index to get the index of the dataframe

  • read_csv to read from a csv file

  • read_excel, read_json, read_html to read from other formats

    • can specify column names
    • can specify if the dataset has a header to use as the column names or not and which row to use
    • can specify how to parse dates using parse_dates argument
    • there are any other arguments can be provided
  • export data from a DataFrame to various file formats

    • to_csv
    • to_excel
    • to_json
    • to_html
    • these functions can take arguments such as
    • encoding for the encoding type such as “utf-8”
    • sep for symbols to separate on such as \t
    • write_excel requires you to create an instance of the writer object first
  • Select components of the dataframe:

    • loc to select rows and columns using labels
    • iloc to select rows and columns using the index of the entries.
    • can select columns using their name
  • Add new rows and columns

    • can use loc to add new rows and assign values to the new rows
    • can use loc to add new columns and provide the name of the new column
  • Add an index using set_index()

  • Deleting rows and columns

    • drop() function which takes axis and inplace arguments
    • axis to indicate which axis to delete, 0 for rows, 1 for columns.
    • inplace to indicate if the changes effects the existing dataframe or not. The default is not to change the existing dataframe.
      • can create a new dataframe with the changes
      • or set inplace=True to make the changes apply to the existing dataframe.
  • Rename rows and columns and the index

    • .rename() takes a dictionary with the old and new values
    • inplace=True to make the change in the existing dataframe
df = df.rename(columns = {'year': 'Year', 'maxtp' : 'max-tp', 'mintp' : 'min-tp'} )
  • Reset the index:

    • reset_index() function to revert to the original index if you have made changes
  • Reshape dataframes:

    • pivot() to create a new table with selected values and columns
    • stack() stacks the columns so that the columns become part of multi-level indices.
    • unstack()
    • melt()
  • Iterating over dataframe rows: iterrows() with a for loop

    for idx, row in dataframe.iterrows():
        do something
    
    

Plotting with Pandas

  • import matplotlib library * import matplotlib.pyplot as plt

    • plt.show() to show the plot
  • Some examples:

    • bar chart: df.value_counts(sort=False).plot.bar() to plot the values of a dataframe df as a bar chart
    • histogram: df.value_counts(sort=True).plot.hist(orientation='vertical') to plot a histogram vertically. The default is horizontal.
    • area plot : df.plot.area(stacked=False) to show how values change with respect to other variables. stacked=True to stack rather than overlap.
    • pie chart: df.value_counts(sort=False).plot.pie()
    • density plot: df.value_counts.plot.kde() to plot the probability distribution of a variable
    • scatter matrix: to generate a pair-wise scatter matrix for selected variables.
      • alpha to set the transparency
      • figsize to set the size of the figure
      • diagonals to specify what should be shown on the diagonals ( instead of a variable plotted against itself)
      • bootstrap_plot
  • can set the relative size of the figure using the figsize argument.


Reshaping data, Grouping data, Aggregating data

In a long format dataframe, each row is a complete and independent representation. In a wide dataframe, categorical values are grouped.

pd.pivot_table() and pd.pivot()

  • pd.pivot_table(): To transform a long-format dataframe to wide format. Create a spreadsheet-style pivot table as a DataFrame.

  • pd.pivot_table() is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

  • index: the variables to remain untouched

  • columns: the variables to be spread across more columns

  • values: the numerical values to be aggregated or processed

The output of pivot_table() is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index() and rename_axis() methods.

Some columns might be better represented as column names instead of values. The output of pivot() is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index() and rename_axis() methods.

.rename_axis() sets the name of the axis for the index or columns.

  • df.pivot : Pivot without aggregation that can handle non-numeric data

pd.pivot to pivot a dataframe spreading rows into columns

Melting dataframes using the .melt method

  • pd.melt() to transform wide to long.
    Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

  • id_vars are values to keep as rows, duplicated as needed.

  • value_vars are columns to be taken and made as values, melted into a new column. If the value_vars is not specified, then all columns that are not included in id_vars will be used as value_vars.

  • var_name is optional

Group by

The groupby method allows you to group rows of data together and aggregation functions to be callled on the grouped rows. 'by' takes a list with the columns you are interested to group.

Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

The .agg() method can be used to chain multiple methods to group by

  • Use .agg() method to groupby and chain multiple methpds
  • provide the aggregation functions as arguments in a list
  • custom functions can also be used when grouping and aggregating
  • you can pass a tuple containing the name of the variable to aggregate and the aggregate function to use as well as the name of the resulting column

Stacking and Unstacking


.apply() to apply a function along an axis of the dataframe

df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)

Apply a function along an axis of the dataframe. The function can be a built-in function or a custom function. The function is applied / broadcast to the dataframe and assigned to a series or a dataframe.

  • func : Function to apply to each column or row.

  • axis : the axis along which the function is applied.

  • 0 or ‘index’: apply function to each column.

  • 1 or ‘columns’: apply function to each row.

  • raw : bool, default False Determines if row or column is passed as a Series or ndarray object.