Iris_notebook

A collection of the Python scripts and files from my project repo, which I am cleaning up. Much of this is duplicated.

# first importing the following libraries
import numpy as np
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns


# save link to data and reference the link 
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# the data read in does not have any column names. 
# Specify header = None to avoid reading the first row of data as a header or column name

iris = pd.read_csv(csv_url, header = None)

#iris = pd.read_csv('iris_data.csv', header =  None)

# using the attribute information as the column names
col_names = ['Sepal_Length_cm','Sepal_Width_cm','Petal_Length_cm','Petal_Width_cm','Class']

iris =  pd.read_csv(csv_url, names = col_names)

# look at the top 5 observations
iris.head()

# look at the bottom 5 observations
iris.tail()

# How many rows in the iris DataFrame
len(iris)

# the shape or dimensions of the dataset
iris.shape

The DataFrame has 5 columns, with the first 4 being the attributes or features of the data set. The last column is the class or type of iris plant each observation belongs to. Each row corresponds to an individual observation of an iris plant.

species_type = iris['Class'].unique()
species_type

#(by default index_col is None, so the index is set to a range from 0 up to the number of rows.)
#The DataFrame has an index which was automatically assigned when the DataFrame was created on reading in the csv file. 
#The index is a RangeIndex with labels 0 to 149 (start=0, stop=150)
iris.index

# column names of the data
iris.columns

# can write the DataFrame to a comma separated file to save any changes including column names added
iris.to_csv('iris_data.csv')

Checking for any missing values in the dataset


pd.isnull(iris).sum()
iris.isnull().sum()
# the opposite of isnull is notnull. 
iris.notnull().sum()
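
The Iris data has no missing values, so the counts above are all zero. As a minimal sketch of what these calls return when values really are missing, the example below uses a small made-up frame. The `toy` DataFrame and its values are hypothetical and are not part of the Iris data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing measurements (not the Iris data)
toy = pd.DataFrame({
    "Sepal_Length_cm": [5.1, np.nan, 4.7],
    "Petal_Length_cm": [1.4, 1.4, np.nan],
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

missing_per_column = toy.isnull().sum()    # number of missing values in each column
present_per_column = toy.notnull().sum()   # number of non-missing values in each column
rows_with_missing = toy[toy.isnull().any(axis=1)]  # rows with at least one missing value

print(missing_per_column)
print(present_per_column)
print(rows_with_missing)
```

Summing works because True counts as 1 and False as 0.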

Indexing and Filtering data

Making a separate DataFrame for each class or species might make plotting and getting statistics easier. Here I try out the methods and functions described in the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-and-selecting-data

I want to select all the columns, but only the rows that belong to one class or species of the iris plant. You may access an index on a Series and a column on a DataFrame directly as an attribute:

# this gets a column of data based on attribute.
iris.Sepal_Length_cm.head()

# this gets the row of data corresponding to index 0 (the first row)
iris.iloc[[0]]

# can get a slice of data using slicing inside of []
#first 5 rows of data
iris[0:5]

# slice using label. (using the loc attribute)
iris.loc[0:5] # the index is labelled 0 to 149; a label slice includes the end label, so this returns 6 rows

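# getting values with a boolean array. This is used for checking a condition in a row is met
# compare a numeric column only; comparing the text Class column with a number raises a TypeError
iris.loc[0:5, 'Sepal_Length_cm'] > 5.1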
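
Slicing by position and slicing by label treat the end of the slice differently, which is easy to miss when the index is the default 0, 1, 2, ... range. The sketch below uses a hypothetical toy frame whose index labels deliberately start at 100 so that the two behaviours can be told apart.

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0, 1, 2, ...
toy = pd.DataFrame({"value": [10, 20, 30, 40, 50]},
                   index=[100, 101, 102, 103, 104])

by_position = toy.iloc[0:2]    # positions 0 and 1; the end position is excluded
by_label = toy.loc[100:102]    # labels 100, 101 and 102; the end label is included

print(by_position)
print(by_label)
```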


species_type =iris['Class'].unique()
species_type

You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame)

# select from the iris DataFrame only the rows where the Class equals the string "Iris-setosa"
iris[iris['Class'] == "Iris-setosa"].head()

#http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-indexing

# select from the iris DataFrame only the rows where the Class equals the string "Iris-setosa"
# save to a new DataFrame
iris_setosa = iris[iris['Class'] == "Iris-setosa"]

iris_setosa.head()

# how many setosas are there.
iris_setosa.count()

Summary Statistics of the Iris data set

# summary statistics for the entire dataset
iris.describe()

# a quick summary statistics for the Setosa variation only
iris_setosa.describe()

# select from the iris DataFrame only the rows where the Class equals the string "Iris-versicolor"
# save to a new DataFrame
iris_versicolor = iris[iris['Class'] == "Iris-versicolor"]
iris_versicolor.head()

iris_versicolor.describe()

# select from the iris DataFrame only the rows where the Class equals the string "Iris-virginica"
# save to a new DataFrame
iris_virginica = iris[iris['Class'] == "Iris-virginica"]
iris_virginica.head()

DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

Oftentimes you’ll want to match certain values with certain columns. Just make values a dict where the key is the column, and the value is a list of items you want to check for.

Combine DataFrame’s isin with the any() and all() methods to quickly select subsets of your data that meet a given criteria.

To select a row where each column meets its own criterion: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-isin

# Here subsetting the data to meet a given criteria
values =  {'Class': ['Iris-versicolor', 'Iris-virginica']}
row_mask = iris.isin(values).any(axis=1)  # axis must be passed by keyword in recent pandas versions
iris[row_mask].head(10)

iris[row_mask].tail(10)
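
To make the intermediate steps visible, here is a small sketch on a hypothetical toy frame (the `toy` data is made up for illustration). It shows the boolean DataFrame that `isin` returns for a dict, the row mask built from it, and, as an alternative I am assuming is equivalent here, the same mask built from the `Class` column alone with `Series.isin`.

```python
import pandas as pd

# Hypothetical toy data (not the full Iris data set)
toy = pd.DataFrame({
    "Petal_Length_cm": [1.4, 4.7, 6.0, 1.3],
    "Class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
})
values = {"Class": ["Iris-versicolor", "Iris-virginica"]}

bool_frame = toy.isin(values)       # same shape as toy; True only where Class matches
row_mask = bool_frame.any(axis=1)   # True for rows with at least one matching cell
series_mask = toy["Class"].isin(values["Class"])  # same mask built from the column alone

print(bool_frame)
print(toy[row_mask])
```

Columns that are not keys of the dict come back as all False, which is why `any(axis=1)` is enough to pick out the matching rows.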

# select from the iris DataFrame only the rows where the Class equals the string "Iris-setosa"
iris_setosa = iris[iris['Class'] == "Iris-setosa"]
print("Selecting from the iris dataframe only those rows containing the Class Iris-setosa")
print(iris_setosa.head())

# subsetting the data to meet a given criteria.

values =  {'Class': ['Iris-versicolor', 'Iris-virginica']}
row_mask = iris.isin(values).any(axis=1)  # axis must be passed by keyword in recent pandas versions
print(" subsetting the dataframe using Boolean masks")
print(iris[row_mask].head())

# GROUPBY
# can use the groupby functions to look at statistics at the species level
iris_grouped = iris.groupby("Class")
print(iris_grouped.mean())
print(iris_grouped.count())

http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method

DataFrame objects have a query() method that allows selection using an expression. You can get the value of the frame where column b has values between the values of columns a and c.

Using pure Python: df[(df.a < df.b) & (df.b < df.c)]
Using query: df.query('(a < b) & (b < c)')

iris[(iris.Sepal_Length_cm > iris.Sepal_Width_cm)].count()

iris.query('Sepal_Length_cm > Sepal_Width_cm & Petal_Length_cm > Petal_Width_cm').count()
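
The two styles can be checked against each other on a small frame. This is only a sketch; the `toy` DataFrame with columns `a`, `b` and `c` is hypothetical and mirrors the column names used in the pandas documentation example.

```python
import pandas as pd

# Hypothetical toy frame with the column names from the documentation example
toy = pd.DataFrame({"a": [1, 5, 2], "b": [2, 4, 3], "c": [3, 6, 1]})

with_mask = toy[(toy.a < toy.b) & (toy.b < toy.c)]   # boolean indexing
with_query = toy.query("(a < b) & (b < c)")          # same condition as a string expression

print(with_mask)
print(with_query)
```

Both selections should contain the same rows; `query` mainly saves repeating the DataFrame name in every condition.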

## Missing Values

## check for any missing values using pandas.isnull() or the opposite using pandas.notnull()
print(pd.isnull(iris).sum())

print(pd.notnull(iris).sum())

Selecting and Filtering.

Using an index with square brackets will return a Series corresponding to the column name. Can retrieve a column of data from the iris DataFrame using dict-like notation or by attribute.

# by attribute
iris.Petal_Width_cm.head()
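
For comparison, a short sketch of the dict-like notation next to attribute access, using a hypothetical toy frame. The variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame (not the Iris data)
toy = pd.DataFrame({"Petal_Width_cm": [0.2, 1.4, 2.5], "Class": ["a", "b", "c"]})

by_key = toy["Petal_Width_cm"]      # dict-like notation, works for any column name
by_attribute = toy.Petal_Width_cm   # attribute access, only for names that are valid Python identifiers
two_columns = toy[["Petal_Width_cm", "Class"]]  # a list of names returns a DataFrame

print(by_key.equals(by_attribute))
print(type(two_columns))
```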

# The index for the iris DataFrame at the moment is just a range of integers from 0 to 149
iris.loc[[0]]

# rows of the iris DataFrame can be retrieved by label using the loc attribute, or by position using iloc.
# The index operators can be used to select a subset of rows and columns.
# The index for the iris DataFrame at the moment is just a range of integers from 0 to 149
# retrieve as a Series
iris.loc[0]

# Boolean Indexing
# can use Boolean operators to select rows that meet certain conditions.

iris[iris.Sepal_Length_cm > 7]


# getting the rows at positions 0 to 2 and the columns at positions 2 to 4 (the end of an iloc slice is excluded)
iris.iloc[0:3,2:5].head()

# getting all columns for row at index 4
iris.iloc[4,:].head()

# slicing the DataFrame with [] selects rows by position; here the first 5 rows
print(iris[0:5])

Groupby

http://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html

"Group by" is a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.
Applying a function to each group independently. This can be an aggregation, such as computing a summary statistic for each group (group sums or means, group sizes and counts), or a transformation, which performs some group-specific computation and returns a like-indexed object.
Combining the results into a data structure.

pandas objects can be split on any of their axes.
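
The difference between an aggregation and a transformation can be sketched on a small frame. The `toy` data below is hypothetical. The aggregation returns one value per group, while the transformation returns a result with the same index as the original rows.

```python
import pandas as pd

# Hypothetical toy data (not the full Iris data set)
toy = pd.DataFrame({
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "Petal_Length_cm": [1.4, 1.6, 6.0, 5.0],
})

grouped = toy.groupby("Class")["Petal_Length_cm"]   # split by species

group_means = grouped.mean()                        # aggregation: one value per group
# transformation: the group mean is repeated for each row, so the result lines up with toy
centred = toy["Petal_Length_cm"] - grouped.transform("mean")

print(group_means)
print(centred)
```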


iris.head()

iris_grouped = iris.groupby("Class")
iris_grouped.mean()
iris_grouped

iris_grouped.count()

# summary statistics by group (species). Transposed to make it more readable
iris_grouped.describe().T

# can do summary statistics on the groups which should give the same results as the individual statistics 
# from the DataFrames where I separated the data by species

Fisher's paper

Fisher's paper has tables showing the observed means for two species and their differences. I want to try and get the same information as in Table 2 of Fisher's paper: Table II. Observed means for two species and their difference (cm).

iris.head()

iris.groupby("Class").head()
iris_grouped.mean().T

# group by class and then get the mean of each group, then transpose the rows and columns
iris.groupby("Class").mean().T

# group by class and then get the mean of each group and transpose the columns to rows.
# This gives the mean of each measurement for each class.
table2 =iris.groupby("Class").mean().T


# I want to get the values in Table 2 of Fisher's paper
# get the mean of each variable by class or species. 
# Transpose the data to get the rows as columns
# only want the columns up to versicolor. exclude Virginica species
# ('Class' is the name of the column index, not a column label, so slice from 'Iris-setosa')
table2.loc[:,'Iris-setosa':'Iris-versicolor']

# make it into a DataFrame called means
means =pd.DataFrame(table2)
means.head()

# in one go: create a dataframe from grouping the iris dataframe  by class and calculating the group means
pd.DataFrame(iris.groupby("Class").mean().T)

# add a new column for the difference in means between the Versicolor and Setosa species
means['diff (Versicolor - Setosa)'] = means['Iris-versicolor'] - means['Iris-setosa']

# add a new column for the difference in means between the Versicolor and Virginica species
means['diff (Versicolor - Virginica)'] = means['Iris-versicolor'] - means['Iris-virginica']

# add a new column for the difference in means between the Virginica and Setosa species
means['diff (Virginica - Setosa)'] = means['Iris-virginica'] - means['Iris-setosa']

means

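# 'diff' is not a column of means; use the Versicolor - Setosa difference column created above
means['ssd'] = means['Iris-versicolor'] * means['Iris-setosa'] - means['diff (Versicolor - Setosa)']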

The Iris-setosa means differ slightly from those in Fisher's paper. This may be due to the small discrepancies in the 35th and 38th Setosa observations in the UCI data set.


means.head()

# Here subsetting the data to meet a given criteria
values =  {'Class': ['Iris-versicolor', 'Iris-setosa']}
row_mask = iris.isin(values).any(axis=1)  # axis must be passed by keyword in recent pandas versions
iris[row_mask].head(10)

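# numeric_only=True leaves out the text Class column, which has no mean
iris_versicolor.mean(numeric_only=True) - iris_setosa.mean(numeric_only=True)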

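# The next two lines are syntax examples from the pandas groupby user guide.
# They refer to a DataFrame `df` with 'class' and 'order' columns, which is not defined at this point in the notebook.
# grouped = df.groupby('order', axis='columns')

# grouped = df.groupby(['class', 'order'])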

Sorting


# can sort the DataFrame by one or more of the columns.
# put the columns in the order to sort by
iris.sort_values(by =['Petal_Width_cm','Class'])
iris.sort_values(by =['Class','Petal_Length_cm'])
iris.sort_values(by =['Class','Sepal_Length_cm'])
iris.sort_values(by =['Class','Sepal_Width_cm'])

iris.sort_values(by =['Class','Petal_Length_cm'])

THE USE OF MULTIPLE MEASUREMENTS IN TAXONOMIC PROBLEMS by R. A. Fisher

When two or more populations have been measured in several characters, x1, …, xs, special interest attaches to certain linear functions of the measurements by which the populations are best discriminated. In the present paper the application of the same principle will be illustrated on a taxonomic problem; some questions connected with the precision of the processes employed will also be discussed.

Table I in Fisher's paper displays the measurements of the flowers of fifty plants each of the two species Iris setosa and Iris versicolor, which were found growing together in the same colony and measured by Dr E. Anderson. Four flower measurements are given. Fisher looked at the question of what linear function of the four measurements would maximise the ratio of the difference between the specific means to the standard deviation within species.

The sample of the third species given in Table I, Iris virginica, differs from the two other samples in not being taken from the same natural colony as they were, a circumstance which might considerably disturb both the mean values and their variabilities.
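
As a rough sketch of the question Fisher posed, and not a reproduction of his own calculation, the coefficients of such a linear function can be taken as proportional to S⁻¹d, where d is the difference between the two species means and S is the pooled within-species matrix of sums of squares and products. The example assumes scikit-learn's bundled copy of the data, which may differ slightly from Fisher's Table I, and all variable names are my own.

```python
import numpy as np
from sklearn import datasets

# scikit-learn's bundled copy of the data; target 0 = setosa, 1 = versicolor
data = datasets.load_iris()
X, y = data.data, data.target
setosa, versicolor = X[y == 0], X[y == 1]

# difference between the species means for the four measurements
d = versicolor.mean(axis=0) - setosa.mean(axis=0)

# pooled within-species sums of squares and products
S = (np.cov(setosa, rowvar=False) * (len(setosa) - 1)
     + np.cov(versicolor, rowvar=False) * (len(versicolor) - 1))

# coefficients of the linear function, proportional to S^-1 d
coeffs = np.linalg.solve(S, d)

# score each flower with the linear function and compare the two species
scores_setosa = setosa @ coeffs
scores_versicolor = versicolor @ coeffs
print(coeffs)
print(scores_setosa.mean(), scores_versicolor.mean())
```

Only the direction of `coeffs` matters here; multiplying all four coefficients by a constant leaves the ratio unchanged.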

Visualising the Iris data set


iris = sns.load_dataset("iris")
g = sns.PairGrid(iris)
g.map(plt.scatter);

https://seaborn.pydata.org/examples/scatterplot_categorical.html

sns.set(style="whitegrid", palette="muted")

# Load the example iris dataset
iris = sns.load_dataset("iris")

# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")

# Draw a categorical scatterplot to show each observation
sns.swarmplot(x="measurement", y="value", hue="species",
              palette=["r", "c", "y"], data=iris)

# https://seaborn.pydata.org/examples/jitter_stripplot.html

sns.set(style="whitegrid")
iris = sns.load_dataset("iris")

# "Melt" the dataset to "long-form" or "tidy" representation
iris = pd.melt(iris, "species", var_name="measurement")

# Initialize the figure
f, ax = plt.subplots()
sns.despine(bottom=True, left=True)

# Show each observation with a scatterplot
sns.stripplot(x="value", y="measurement", hue="species",
              data=iris, dodge=True, jitter=True,
              alpha=.25, zorder=1)

# Show the conditional means
sns.pointplot(x="value", y="measurement", hue="species",
              data=iris, dodge=.532, join=False, palette="dark",
              markers="d", scale=.75, ci=None)

# Improve the legend 
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[3:], labels[3:], title="species",
          handletextpad=0, columnspacing=1,
          loc="lower right", ncol=3, frameon=True)

# https://seaborn.pydata.org/examples/pair_grid_with_kde.html
sns.set(style="white")

df = sns.load_dataset("iris")

g = sns.PairGrid(df, diag_sharey=False)
g.map_lower(sns.kdeplot)
g.map_upper(sns.scatterplot)
g.map_diag(sns.kdeplot, lw=3)

# https://seaborn.pydata.org/examples/scatterplot_matrix.html
sns.set(style="ticks")

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species")

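# iris was melted to long form above, so use the wide-form df for the covariance and correlation matrices
# numeric_only=True leaves out the text species column
df.cov(numeric_only=True)
df.corr(numeric_only=True)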

From the scikit-learn tutorial

import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)

# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
indices = np.random.permutation(len(iris_X))
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test = iris_X[indices[-10:]]
iris_y_test = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train) 



knn.predict(iris_X_test)

print(iris_y_test)
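
The manual permutation split above can also be written with scikit-learn's own helpers. This is a sketch of an alternative rather than the tutorial's code: `train_test_split` does the shuffling and splitting, and `accuracy_score` summarises how many of the held-out observations were predicted correctly. The variable names are my own.

```python
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_data = datasets.load_iris()

# hold out 10 observations for testing, as in the manual permutation split
X_train, X_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target, test_size=10, random_state=0)

knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# fraction of the held-out observations that were classified correctly
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```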

iris.py

# iris.py
# Angela Carpenter
# This file contains my script for project 2019.

# 1. IMPORT PYTHON LIBRARIES

# In order to use python libraries that are not part of the standard python library, they first need to be imported.
# Here I import the pandas library, the matplotlib pyplot library and the seaborn library using short name aliases pd, plt and sns. 
# This seems to be the conventional way to import these particular packages.

print("First importing the python libraries")
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns

# help can be obtained using the python help function.
# help(pd) or help(pd.DataFrame.describe)

# 2. LOADING / READING IN THE IRIS DATA SET INTO PYTHON

# Create a variable `csv_url` and pass to it the url where the data set is available at 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'. 
csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# I have also saved the csv file to the folder or repository and can read it in from there in case for some reason the url is not available.

# Create a list of column names `col_names` using the iris attribute information available at the UCI machine learning repository.
# passing the column names to the names parameter of read_csv
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']

iris =  pd.read_csv(csv_url, names = col_names)

# to read the data from a csv file in the same folder as this script.
# iris = pd.read_csv('iris_data.csv', names = col_names)

# using the pandas DataFrame method head to return the first rows of the DataFrame and check that the file was correctly loaded
print("The first 10 rows of the iris dataframe:")
print(iris.head(10))

# using the pandas DataFrame method tail to return the last rows of the DataFrame and check that the file was correctly loaded
print("The last 10 rows of the iris dataframe:")
print(iris.tail(10))

# check the data types to ensure they have been correctly inferred by read_csv
print(iris.dtypes)

# 3. EXPLORING AND INVESTIGATING THE IRIS DATA SET 

# Having imported the iris data set from a csv file into a pandas `DataFrame`, all the attributes and methods of `DataFrame` objects can be used on the iris DataFrame object.

# First looking at the attributes of the iris DataFrame created from importing the iris data set above.

# Getting the number of axes / array dimensions of the iris DataFrame using ndim attribute
print(f"The iris DataFrame has {iris.ndim} dimensions")

# Look at the shape of the iris DataFrame as this shows the number of rows and columns in the table or matrix of data
# This will show how many rows (containing observations) and columns (containing features/variables)
print(f"The Iris data set consists of {iris.shape[0]} rows and {iris.shape[1]} columns corresponding to the rows and columns of the csv file.")

# the number of elements in the iris object.
print(f"The iris DataFrame has {iris.size} elements in total.")

# The DataFrame has both a row and a column index which were automatically assigned when the DataFrame was created.
# Get the column labels of the iris DataFrame using  'pandas.DataFrame.columns'
print("The column labels of the iris DataFrame are: ", *iris.columns, sep = "   ")

# the row index 
print(f" The index of the DataFrame is: ", iris.index)
print("The index for the rows are ",*iris.index)
print("This index was automatically assigned when the DataFrame was created above.")

# the dtypes (data types) of the iris DataFrame
print(f"The data types of iris DataFrame are as follows:")
print(iris.dtypes)

# DataFrame.ftypes (indication of sparse/dense and dtype) was deprecated in pandas 0.25 and removed in pandas 1.0,
# so this call is commented out.
# print(iris.ftypes)

# pandas.DataFrame.axes returns a list representing the axes of the iris DataFrame which shows the row axis labels and the column axis labels in that order. This returns the same information as the index and columns attributes
print(iris.axes)
#print("The row axis labels of the iris DataFrame are  ", *iris.axes[0])
print("The row axis labels of the iris DataFrame is a range from ", *iris.axes[0][[0]], *iris.axes[0][[1]], *iris.axes[0][[2]],"..." , *iris.axes[0][[-3]],*iris.axes[0][[-2]], *iris.axes[0][[-1]] )
print("The column axis labels of the iris DataFrame are as follows:\n ",*iris.axes[1])

# Now using some of the DataFrame methods to explore the Iris DataFrame

# Look at the first ten observations in the DataFrame. 
print(iris.head(10))

# Look at the last ten observations in the DataFrame
print(iris.tail(10))

# It is possible to check for missing values in the DataFrame using the pandas `isnull()` method.
# This shows that there are no missing values which is as expected for this particular data set.
# Detect missing values in the DataFrame. Sum the values instead of printing the boolean values as True = 1.
print("The number of null or missing values in the iris dataframe for each column: ")
print(iris.isnull().sum())

# Print a concise summary of the iris DataFrame.
print(f"A concise summary of the iris DataFrame: \n")
iris.info()

# Count non-NA cells for each column or row.
print(f"\n The number of non-NA cells for each column or row are: \n {iris.count()}")

# Using the `unique()` method on the 'Class' column to show how many different class or species of Iris flower is in the data set.

iris['Class'].unique()
species_type =iris['Class'].unique()
print("The following are the three class or species types of iris in the data set \n",*species_type, sep = " ")

# count the number of distinct observations for each column 
iris.nunique()

# look at the summary statistics of the DataFrame
print("Here are some summary statistics for the iris DataFrame: \n ")
print(iris.describe())

####   ####   ####   ####   ####   ####

# VISUALISATIONS OF THE IRIS DATA SET

# Make a histogram of the DataFrame for each of the four numeric columns in the iris data set.
# The number of bins can be specified. 

# pandas DataFrame.hist() plots the histograms of the columns on multiple subplots:
print("Histogram of the distribution of the iris data. Make sure to close the plot to continue. ") 
# iris.hist(alpha=0.8, bins=30, figsize=(12,8))

iris.hist(alpha=0.8, bins=30, figsize=(12,8))
plt.suptitle("Histogram of the Iris petal and sepal measurements")
plt.savefig("images/IrisHistograms.png")

# Boxplot can be drawn using DataFrame.plot.box(), or DataFrame.boxplot() 
# This is used to visualize the distribution of values within each column.

iris.plot.box(figsize=(6,4))
plt.suptitle("Boxplots of the Iris petal and sepal measurements")
# plt.show()
# I am going to save the resulting plot to a file rather than printing it here
# to print it to screen just uncomment the code in the line above:     plt.show()
plt.savefig("images/irisbox.png")

# boxplot just showing the distribution of each measurement variable on its own is not very useful 
# next look at boxplots by Class or species of iris plant using 'seaborn' 

# SEABORN PLOTS
# The appearance of the plot can be changed by setting the figure aesthetics.
# set the theme. (The default theme is called darkgrid). Set the color palette.
sns.set(style="ticks", palette="pastel")

# plotting 4 plots on a 2 by 2 grid, do not want to share the y axis between plots. Setting the figure size 
f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))
# pass a panda Series as the x and y parameters to the boxplot. 
# Using the Class column (categorical) and one of the sepal or petal measurements (numerical) for each subplot

# setting the hue = Class so that the points will be coloured on the plot according to their Class/species type.
sns.boxplot(x="Class", y="Sepal_Length", data=iris, ax=axes[0,1])
sns.boxplot(x="Class", y="Sepal_Width", data=iris, ax=axes[1,1])
sns.boxplot(x="Class", y="Petal_Length",data=iris, ax = axes[0,0])

sns.boxplot(x="Class", y="Petal_Width",hue = "Class",data=iris, ax=axes[1,0])

# adding a title to the plot
f.suptitle("Boxplot of the Petal and Sepal measurements by Iris plant Species")

plt.savefig("images/irisBoxbyClass.png")



# 4. EXPLORING IRIS DATA SET BY SPECIES
# There are many ways to filter the data


# subsetting the data to meet a given criteria using the isin operator and boolean masks. only selecting where class is iris-versicolor or Iris-virginica
values =  {'Class': ['Iris-versicolor', 'Iris-virginica']}
row_mask = iris.isin(values).any(axis=1)  # axis must be passed by keyword in recent pandas versions
iris[row_mask].head()

# select from the iris DataFrame only the rows where the Class equals the string "Iris-setosa"
iris_setosa = iris[iris['Class'] == "Iris-setosa"]
# print(iris_setosa.head())

# Using boolean masks to subset the data to meet a given criteria.
values =  {'Class': ['Iris-versicolor', 'Iris-virginica']}
row_mask = iris.isin(values).any(axis=1)  # axis must be passed by keyword in recent pandas versions
# print(iris[row_mask].head())


# GROUP BY 

print("using groupby to split the iris dataframe by Class of iris species")
# Using groupby functions to look at statistics at the class / species level
iris_grouped = iris.groupby("Class")

# Compute count of group, excluding missing values.
iris.groupby("Class").count()
print("The number of observations for each variable for each Iris species in the data set are as follows: \n \n",iris.groupby("Class").count())

# Groupby Class of Iris plant and return the mean of the remaining columns in each group.

print("The mean or average measurement for each group of Iris Species in the dataset is \n",iris.groupby('Class').mean())
iris.groupby('Class').mean()
# Group by Class of Iris plant and then return the first observations in each group
iris.groupby("Class").first()
print("the first observation in each Class of Iris plant in the Iris dataset are: \n  \n",iris.groupby("Class").first())

# Group by Class of Iris and then return the last observations in each group
print("the last observation in each Class of Iris plant in the Iris dataset are: \n  \n",iris.groupby("Class").last())
iris.groupby("Class").last()

# get the first 3 rows in each group 
iris.groupby("Class").head(3)
print("The first three rows for each Class of Iris plant in the Iris dataset are: \n\n",iris.groupby("Class").head(3))

# get the last 3 rows in each group
iris.groupby("Class").tail(3)
print("The last three rows for each Class of Iris plant in the Iris dataset are: \n\n",iris.groupby("Class").tail(3))

# get max of group values
iris.groupby("Class").max()
print("The maximum value for each measurement for each Class of Iris plant in the Iris dataset are: \n\n",iris.groupby("Class").max())

# get min of group values
iris.groupby("Class").min()
print("The minimum value for each measurement for each Class of Iris plant in the Iris dataset are: \n\n",iris.groupby("Class").min())

# There does not seem to be a range function to see the range of values so I am going to calculate these ranges here.
# by taking the differences between the minimum and the maximum values

iris_ranges = iris_grouped.max() - iris_grouped.min()
print(iris_ranges)

# sorting the range of values in ascending order, first by petal lengths, then petal widths and then by sepal lengths.
iris_ranges.sort_values(["Petal_Length","Petal_Width","Sepal_Length"])
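
The same ranges can be computed in a single aggregation by passing a function to `agg`. This is a sketch on a hypothetical toy frame; the data values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (not the full Iris data set)
toy = pd.DataFrame({
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-virginica"],
    "Petal_Length": [1.0, 1.9, 4.5, 6.9],
    "Petal_Width": [0.1, 0.6, 1.4, 2.5],
})

# range (max - min) of every measurement within each species, in one aggregation
toy_ranges = toy.groupby("Class").agg(lambda col: col.max() - col.min())
print(toy_ranges)
```

Subtracting the grouped minimum from the grouped maximum, as done above, gives the same table; `agg` just keeps it to one call.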

# these stats are available from the describe() summary function
# get mean of group values
iris.groupby("Class").mean()

# get median of group values
iris.groupby("Class").median()

print(iris_grouped.count())

print(iris_grouped.mean())

# Can look at the summary statistics for each class of Iris in the data set.
# The result can be transposed with .T to make it easier to read (see the commented-out line below).
print(iris.groupby("Class").describe())
# print(iris_grouped.describe())
# print(iris_grouped.describe().T)

# PAIRWISE SCATTER PLOTS

# SCATTER PLOTS OF THE IRIS DATA SET

print(" Here is a pairplot scatter matrix")
sns.pairplot(iris, hue="Class")

plt.savefig("images/irispairplots.png")

## correlation matrix of the iris dataset

# First getting the correlation between pairs of the measurement variables across the dataset

print("correlation between pairs of measurement variables for the dataset \n")
# numeric_only=True leaves out the non-numeric Class column
print(iris.corr(numeric_only=True))

print("correlation between pairs of measurement variables for the dataset by Class of iris \n")
print(iris.groupby("Class").corr())

Plots

# 1. IMPORT PYTHON LIBRARIES

print("First importing the python libraries")
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns


csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']

iris =  pd.read_csv(csv_url, names = col_names)
print(iris.head(10))




# print("Histogram of the distribution of the iris data. Make sure to close the plot to continue. ") 
# iris.diff().hist(alpha=0.8, bins=30, figsize=(12,8))
# plt.show()
# iris.drop(['Class'], axis=1).diff().hist()
# # plt.figure()
# # iris.diff().hist(color='k', alpha=0.5, bins=20)
# plt.show()


# sns.set(style="ticks", palette="pastel")
# f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))
# sns.boxplot(x="Class", y="Petal_Length",data=iris, ax = axes[0,0])
# sns.boxplot(x="Class", y="Sepal_Length", data=iris, ax=axes[0,1])
# sns.boxplot(x="Class", y="Petal_Width",hue = "Class",data=iris, ax=axes[1,0])
# sns.boxplot(x="Class", y="Sepal_Width", data=iris, ax=axes[1,1])
# # adding a title to the plot
# f.suptitle("Boxplot of the Petal and Sepal measurements by Iris plant Species")
# plt.show()

# sns.set(style="ticks", palette="deep")
# f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))

# sns.scatterplot(x="Petal_Length", y="Petal_Width", hue = "Class",data=iris, ax=axes[0,0])
# sns.scatterplot(x="Sepal_Length", y="Sepal_Width", hue="Class", data=iris, ax=axes[0,1])
# sns.scatterplot(x="Petal_Length", y="Sepal_Length", hue = "Class",data=iris, ax=axes[1,0])
# sns.scatterplot(x="Petal_Width", y="Sepal_Width", hue="Class", data=iris, ax=axes[1,1])
# f.suptitle("Scatterplots of the Petal and Sepal measurements by Iris plant Species")
# plt.show()


# http://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot

pairplot is used to plot pairwise relationships in a dataset. It creates a grid of Axes where each variable in the dataset will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.


# sns.set(style="ticks", color_codes=True)
# sns.pairplot(iris)
# plt.show()

# sns.set(style="ticks")
# sns.pairplot(iris, hue="Class")
# plt.show()

# sns.distplot(iris)
# plt.show()

# DataFrame.hist() plots the histograms of the columns on multiple subplots:
# print("Histogram of the distribution of the iris data. Make sure to close the plot to continue. ") 
# iris.hist(alpha=0.8, bins=30, figsize=(12,8))

"""
# PAIRPLOT
# this function will create a grid of Axes such that each variable in the dataframe will be shared in the y-axis across a single row and
# in the x-axis across a single column. 
# the diagonals show  a plot to show the univariate distribution of the data
# for the variable in that column.
sns.set(style="ticks")
sns.pairplot(iris, hue="Class")
plt.suptitle('iris plot')
# plt.show()
plt.savefig("irisplot.png")
# BOXPLOT 
iris.plot.box(figsize=(6,4))
plt.suptitle("Boxplots of the Iris petal and sepal measurements")
# plt.show()
# I am going to save the resulting plot to a file rather than printing it here
# to print it to screen just uncomment the code in the line above:     plt.show()
plt.savefig("images/irisbox.png")
# SEABORN PLOTS
# The appearance of the plot can be changed by setting the figure aesthetics.
# set the theme. (The default theme is called darkgrid). Set the color palette.
sns.set(style="ticks", palette="pastel")
# plotting 4 plots on a 2 by 2 grid, do not want to share the y axis between plots. Setting the figure size 
f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))
# pass a panda Series as the x and y parameters to the boxplot. 
# Using the Class column (categorical) and one of the sepal or petal measurements (numerical) for each subplot
# setting the hue = Class so that the points will be coloured on the plot according to their Class/species type.
sns.boxplot(x="Class", y="Sepal_Length", data=iris, ax=axes[0,1])
sns.boxplot(x="Class", y="Sepal_Width", data=iris, ax=axes[1,1])
sns.boxplot(x="Class", y="Petal_Length",data=iris, ax = axes[0,0])
sns.boxplot(x="Class", y="Petal_Width",hue = "Class",data=iris, ax=axes[1,0])
# adding a title to the plot
f.suptitle("Boxplot of the Petal and Sepal measurements by Iris plant Species")
plt.show()
"""

# SCATTER PLOTS OF THE IRIS DATA SET

f, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))
sns.scatterplot(x="Petal_Length", y="Sepal_Length", hue = "Class",data=iris, ax=axes[0])
sns.scatterplot(x="Petal_Width", y="Sepal_Width", hue="Class", data=iris, ax=axes[1])
f.suptitle("Scatterplots of the Iris petal and sepal measurements")
plt.show()

# DISTPLOT using Seaborn
# sns.distplot(iris, kde=False, rug=True)
# plt.savefig("IrisDistplot.png")

Seaborn plots


# note that the iris data set is also available through seaborn's load_dataset function (it is downloaded from seaborn's online data repository and cached locally), which I will use here.

# 1. IMPORT PYTHON LIBRARIES

print("First importing the python libraries")
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns

# I want to save my plots to a pdf file instead of to the screen
# https://stackoverflow.com/a/11329151

from matplotlib.backends.backend_pdf import PdfPages
pp= PdfPages("iris_plots.pdf")

csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']

iris =  pd.read_csv(csv_url, names = col_names)
print(iris.head(10))

iris2 = sns.load_dataset("iris")
# snsplot1 = sns.relplot(x="petal_length",y="petal_width", col ="species", hue = "species", data = iris2)

# set a white grid for the figure
# sns.set(style ="whitegrid")
# # using a facet grid and setting hue = Class which means the points will be coloured on the plot according to their Class/species.
# sns.relplot(x="Sepal_Length",y="Sepal_Width", data = iris, hue ="Class")
# # plt.show()
# pp.savefig(snsplot)



# apply the default seaborn theme, scaling, and color palette.
sns.set()
# draw a faceted scatter plot with multiple semantic variables, two numeric and one categorical.
# The numeric variables determine the position of each point on the axes
# sns.relplot(x="petal_length",y="petal_width", data = iris2, hue ="species")
# plt.show()

# sns.relplot(x="sepal_length",y="sepal_width", data = iris2, hue ="species")
# plt.show()


import matplotlib.pyplot as plt
f, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4))
sns.scatterplot(x="sepal_length",y="sepal_width", data = iris2, ax=axes[0], hue ="species")

sns.scatterplot(x="petal_length",y="petal_width", data = iris2, ax=axes[1], hue ="species")
plt.show()

sns.boxplot(x="species",y="petal_length", data = iris2)
plt.show()


f, axes = plt.subplots(1, 2, sharey=True, figsize=(6, 4))
sns.boxplot(x="species",y="sepal_width", data = iris2, ax=axes[0])
sns.boxplot(x="species",y="petal_width", data = iris2, ax=axes[1])

plt.show()


# sns.boxplot(x="Class", y="Petal_Length",data=iris, ax = axes[0,0])
# sns.boxplot(x="Class", y="Sepal_Length", data=iris, ax=axes[0,1])
# sns.boxplot(x="Class", y="Petal_Width",data=iris, ax=axes[1,0])
# sns.boxplot(x="Class", y="Sepal_Width", data=iris, ax=axes[1,1])

# plt.show()

g = sns.pairplot(iris, hue="Class")

plt.show()
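
The script above creates `pp = PdfPages("iris_plots.pdf")`, but no figure is ever written to it and it is never closed. Below is a minimal sketch of how figures could be collected into one PDF, using `PdfPages` as a context manager so the file is closed automatically. The `toy` frame is a few hand-typed rows standing in for `iris`, and the file name is hypothetical.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: nothing is shown on screen
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.backends.backend_pdf import PdfPages

# A few hand-typed rows standing in for the iris DataFrame (illustration only)
toy = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Sepal_Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})

# hypothetical output location, written to the temp directory for this sketch
pdf_path = os.path.join(tempfile.gettempdir(), "iris_plots_sketch.pdf")

with PdfPages(pdf_path) as pdf:
    # page 1: scatter plot
    fig1, ax1 = plt.subplots(figsize=(6, 4))
    toy.plot.scatter(x="Sepal_Length", y="Sepal_Width", ax=ax1)
    ax1.set_title("Sepal length vs sepal width")
    pdf.savefig(fig1)  # each savefig call appends one page
    plt.close(fig1)

    # page 2: box plot
    fig2, ax2 = plt.subplots(figsize=(6, 4))
    toy.plot.box(ax=ax2)
    ax2.set_title("Boxplots of the sepal measurements")
    pdf.savefig(fig2)
    plt.close(fig2)

    pages_written = pdf.get_pagecount()
```

Passing the figure explicitly to `pdf.savefig` avoids depending on which figure happens to be current, which matters once several plots are made in one script.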

Iris plots

# 1. IMPORT PYTHON LIBRARIES

print("First importing the python libraries")
import pandas as pd  
import matplotlib.pyplot as plt 
import seaborn as sns


csv_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width','Class']

iris =  pd.read_csv(csv_url, names = col_names)
print(iris.head(10))

# a simple scatter plot
# plot sepal length on the x axis and sepal width on the y axis.
iris.plot(kind="scatter", x = 'Sepal_Length', y="Sepal_Width", c= "DarkBlue")
plt.show()

# note that the axes do not start at the origin (0,0)

# a simple scatter plot
# plot petal length on the x axis and petal width on the y axis.
ax3 = iris.plot.scatter(x = 'Petal_Length', y='Petal_Width', c= 'Red')
plt.show()


# Histograms can be drawn by using the DataFrame.plot.hist() and Series.plot.hist() methods.
# A histogram can be stacked using stacked=True. Bin size can be changed using the bins keyword.
# DataFrame.hist() plots the histograms of the columns on multiple subplots:
# plt.figure()
# iris['Sepal_Length'].hist()
# plt.show()


# plt.figure()
# iris['Sepal_Length'].diff().hist()
# plt.show()



# Make a histogram of the DataFrame for each of the four numeric columns in the iris data set.
# The number of bins can be specified. 

# DataFrame.hist() plots the histograms of the columns on multiple subplots:
print("Histogram of the distribution of the iris data. Make sure to close the plot to continue. ") 
iris.hist(alpha=0.8, bins=30, figsize=(12,8))
plt.show()
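
The comments above mention `stacked=True` and the `bins` keyword, but only the unstacked `DataFrame.hist()` call is actually run. Below is a small sketch of the stacked variant with `DataFrame.plot.hist()`, which draws all selected columns on one set of axes instead of one subplot per column. The `toy` frame is a few hand-typed rows standing in for `iris`, and the bin count is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: nothing is shown on screen
import matplotlib.pyplot as plt
import pandas as pd

# A few hand-typed rows standing in for the iris DataFrame (illustration only)
toy = pd.DataFrame({
    "Sepal_Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Petal_Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

# stacked=True piles the bars of each column on top of each other per bin;
# bins sets how many intervals the shared value range is split into
ax = toy.plot.hist(stacked=True, bins=10, alpha=0.8, figsize=(8, 4))
ax.set_title("Stacked histogram of sepal and petal length")
ax.set_xlabel("cm")

n_bars = len(ax.patches)  # number of rectangles drawn
plt.close(ax.figure)
```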

# Boxplot can be drawn using DataFrame.plot.box(), or DataFrame.boxplot() 
# This is used to visualize the distribution of values within each column.

print("Boxplot of the distribution of the iris data. Make sure to close the plot to continue. ") 
iris.plot.box(figsize=(12,8))
plt.show()
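
`DataFrame.boxplot()` also takes a `by` argument, which groups the rows by a column before drawing, so a per-class comparison can be made with pandas alone. This is a minimal sketch under the assumption that a per-class view is what is wanted here. The `toy` frame is a few hand-typed rows standing in for `iris`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: nothing is shown on screen
import matplotlib.pyplot as plt
import pandas as pd

# A few hand-typed rows standing in for the iris DataFrame (illustration only)
toy = pd.DataFrame({
    "Petal_Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Petal_Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Class": ["Iris-setosa", "Iris-setosa",
              "Iris-versicolor", "Iris-versicolor",
              "Iris-virginica", "Iris-virginica"],
})

# one panel per measurement, one box per class within each panel
toy.boxplot(column=["Petal_Length", "Petal_Width"], by="Class", figsize=(8, 4))
fig = plt.gcf()  # the figure that boxplot just created
fig.suptitle("Petal measurements grouped by Class")

n_panels = len(fig.axes)
plt.close(fig)
```

The seaborn version in the next cell gives finer control over colours and layout, but the pandas call is enough for a quick look.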

# Now instead of using just pandas I am using the seaborn package to do some visualisations.
# get_ipython().run_line_magic('pinfo', 'sns.boxplot')

import seaborn as sns
import matplotlib.pyplot as plt


# The appearance of the plot can be changed by setting the figure aesthetics.
# set the theme. (The default theme is called darkgrid). Set the color palette.
sns.set(style="ticks", palette="pastel")

# plotting 4 plots on a 2 by 2 grid, do not want to share the y axis between plots. Setting the figure size 
f, axes = plt.subplots(2, 2, sharey=False, figsize=(12, 8))
# pass column names as the x and y parameters to the boxplot, with the DataFrame passed as data.
# Using the Class column (categorical) and one of the sepal or petal measurements (numerical) for each subplot

# setting hue = Class (done on the last subplot only) colours the boxes according to their Class/species type.
sns.boxplot(x="Class", y="Sepal_Length", data=iris, ax=axes[0,1])
sns.boxplot(x="Class", y="Sepal_Width", data=iris, ax=axes[1,1])
sns.boxplot(x="Class", y="Petal_Length",data=iris, ax = axes[0,0])

sns.boxplot(x="Class", y="Petal_Width",hue = "Class",data=iris, ax=axes[1,0])

# adding a title to the plot
f.suptitle("Boxplot of the Petal and Sepal measurements by Iris plant Species")
plt.show()