Tidy Data principles

Long format data

  • columns contain data about one and only one attribute
  • each row is a complete representations of a complete observation

A long form dataframe contains one observation per row and one variable per column.

Tidy Data and data visualisations

The structure of a dataset plays an important role in creating visualisations using plotly.express with very few lines of code. When the data is in a consistent format it becomes easier to analyse and visualise it.

Tidy data allows each marker on a chart to be independently represented by a row, with each value in the row belonging to a distinct column. Each column represents a separate variable and contains only one data type. This makes it easy to map visual attributes such as the colour, size and shape of markers to feaures of the dataset.

Some dataframes might have values spread across several columns. For example where a dataset might keep track of values over time, there might be a separate column for each year instead of repeating the values in other columns. A tidy or long-format dataframe would have more rows than columns but each row would be a complete and independent representation of an observation. A wide-form dataframe can be converted into a long-form / tidy dataframe using the melt method.

  • Set the columns to be kept as columns as id_vars. The values will be duplicated as needed to keep the mapping.
  • Set the columns to be made into values as value_vars. Every column other than those specified as the id_vars will be taken to be a value_vars unless you specify the value_vars
  • You can set the name for the new column using var_name. This is optional. Melting should resulting in a dataframe with less columns but more rows. Each row will contain a single value in a column named value. To remove any missing values under the value column, use .dropna(subset=['value'])

Pivoting dataframes

Pivoting dataframes does the reverse of melting. Sometimes the values in a column might be better off being represented independently in their own columns. You can use pivoting to convert the unique values from a single column and make them column names instead.

pivot takes an index which can be a list of columns that you want to have a unique combination of for each row. These are the columns that will not change

The disadvantage of the long format is storage as a lot of values are repeated so it is inefficient for storage. However having a consistent mapping makes it easier to create and change visualisations based on the data.

Tech used: