Pandas
By Angela C
March 3, 2021
Reading time: 6 minutes.
pandas is one of the most commonly used libraries in Python for data analysis and is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
The pandas user guide and the 10 minutes to pandas tutorial in particular is the starting point. The tutorial outlines how to create data structures in pandas, get data into and out of pandas, view data, edit and filter data, merge datasets as well as how to perform group operations and to reshape data. pandas also has time-series and plotting functionality.  The two main data structures are a Series which is a one-dimensional labeled array capable of holding any data type.  A pandas ‘DataFrame’ is a 2-dimensional labeled data structure with columns of potentially different types. A DataFrame is like a spreadsheet or SQL table.
With pandas you can create DataFrames out of a dataset or subsets of a dataset. Data can be read in from a variety of file formats including CSV and text files, Excel, JSON, HTML, SQL etc.
Transformations can be applied to the rows and / or columns of a DataFrame.
- 
DataFrame()function- data
- index
- rows
- columns
 
- 
pandas automatically creates an index. - set index to None to stop this
 
- 
shape()to get the shape or size of the dataset
- 
indexto get the index of the dataframe
- 
read_csvto read from a csv file
- 
read_excel,read_json,read_htmlto read from other formats- can specify column names
- can specify if the dataset has a headerto use as the column names or not and which row to use
- can specify how to parse dates using parse_datesargument
- there are any other arguments can be provided
 
- can specify column 
- 
export data from a DataFrame to various file formats - to_csv
- to_excel
- to_json
- to_html
 - these functions can take arguments such as
 - encodingfor the encoding type such as “utf-8”
- sepfor symbols to separate on such as- \t
 - write_excelrequires you to create an instance of the writer object first
 
- 
Select components of the dataframe: - locto select rows and columns using labels
- ilocto select rows and columns using the index of the entries.
- can select columns using their name
 
- 
Add new rows and columns - can use locto add new rows and assign values to the new rows
- can use locto add new columns and provide the name of the new column
 
- can use 
- 
Add an index using set_index()
- 
Deleting rows and columns - drop()function which takes- axisand- inplacearguments
- axisto indicate which axis to delete,- 0for rows,- 1for columns.
 - inplaceto indicate if the changes effects the existing dataframe or not. The default is not to change the existing dataframe.- can create a new dataframe with the changes
- or set inplace=Trueto make the changes apply to the existing dataframe.
 
 
- 
Rename rows and columns and the index - .rename()takes a dictionary with the old and new values
- inplace=Trueto make the change in the existing dataframe
 
df = df.rename(columns = {'year': 'Year', 'maxtp' : 'max-tp', 'mintp' : 'min-tp'} )
- 
Reset the index: - reset_index()function to revert to the original index if you have made changes
 
- 
Reshape dataframes: - pivot()to create a new table with selected values and columns
- stack()stacks the columns so that the columns become part of multi-level indices.
- unstack()
- melt()
 
- 
Iterating over dataframe rows: iterrows()with aforloopfor idx, row in dataframe.iterrows(): do something
Plotting with Pandas
- 
import matplotlib library * import matplotlib.pyplot as plt- plt.show()to show the plot
 
- 
Some examples: - bar chart: df.value_counts(sort=False).plot.bar()to plot the values of a dataframedfas a bar chart
- histogram: df.value_counts(sort=True).plot.hist(orientation='vertical')to plot a histogram vertically. The default is horizontal.
- area plot : df.plot.area(stacked=False)to show how values change with respect to other variables.stacked=Trueto stack rather than overlap.
- pie chart: df.value_counts(sort=False).plot.pie()
- density plot: df.value_counts.plot.kde()to plot the probability distribution of a variable
- scatter matrix: to generate a pair-wise scatter matrix for selected variables.
- alphato set the transparency
- figsizeto set the size of the figure
- diagonalsto specify what should be shown on the diagonals ( instead of a variable plotted against itself)
- bootstrap_plot
 
 
- bar chart: 
- 
can set the relative size of the figure using the figsizeargument.
Reshaping data, Grouping data, Aggregating data
In a long format dataframe, each row is a complete and independent representation. In a wide dataframe, categorical values are grouped.
pd.pivot_table() and pd.pivot()
- 
pd.pivot_table(): To transform a long-format dataframe to wide format. Create a spreadsheet-style pivot table as a DataFrame.
- 
pd.pivot_table()is also used for generating tables of summary statistics. The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.
- 
index: the variables to remain untouched
- 
columns: the variables to be spread across more columns
- 
values: the numerical values to be aggregated or processed
The output of pivot_table() is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index() and rename_axis() methods.
Some columns might be better represented as column names instead of values.
The output of pivot() is a DataFrame with a multi-index. This can be transformed to a regular index using reset_index() and rename_axis() methods.
.rename_axis() sets the name of the axis for the index or columns.
- df.pivot: Pivot without aggregation that can handle non-numeric data
pd.pivot to pivot a dataframe spreading rows into columns
Melting dataframes using the .melt method
- 
pd.melt()to transform wide to long.
 Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
- 
id_varsare values to keep as rows, duplicated as needed.
- 
value_varsare columns to be taken and made as values, melted into a new column. If thevalue_varsis not specified, then all columns that are not included inid_varswill be used asvalue_vars.
- 
var_nameis optional
Group by
The groupby method allows you to group rows of data together and aggregation functions to be callled on the grouped rows.
'by' takes a list with the columns you are interested to group.
Docstring: Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
This can be used to group large amounts of data and compute operations on these groups.
The .agg() method can be used  to chain multiple methods to group by
- Use .agg()method to groupby and chain multiple methpds
- provide the aggregation functions as arguments in a list
- custom functions can also be used when grouping and aggregating
- you can pass a tuple containing the name of the variable to aggregate and the aggregate function to use as well as the name of the resulting column
Stacking and Unstacking
.apply() to apply a function along an axis of the dataframe
df.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
Apply a function along an axis of the dataframe. The function can be a built-in function or a custom function. The function is applied / broadcast to the dataframe and assigned to a series or a dataframe.
- 
func: Function to apply to each column or row.
- 
axis: the axis along which the function is applied.
- 
0 or ‘index’: apply function to each column. 
- 
1 or ‘columns’: apply function to each row. 
- 
raw: bool, default False Determines if row or column is passed as a Series or ndarray object.