
Regression: Discuss and analyse whether there is a relationship between the total bill and tip amount

In this section I will look at the relationship between the total bill and the tip amount using regression. My primary references are the Fundamentals of Data Analytics lecture notes, chapter 9 of Experiment Design and Analysis by Howard J. Seltman on Simple Linear Regression[5], Wikipedia[6] and a HackerEarth blog post on regression analysis[7].

Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’, ‘covariates’, or ‘features’). The most common form of regression analysis is linear regression, in which a researcher finds the line (or a more complex linear function) that most closely fits the data according to a specific mathematical criterion.[8] Regression will be used here to see if there is a relationship between the tip amount (the dependent variable) and the total bill amount (the independent variable). Exploratory data analysis can show whether there is a case for trying a linear regression on the data.

Regression is the statistical method used to find the equation of the line that best fits the data. It describes the nature of the relationship between variables, which can be positive or negative, linear or non-linear. Regression can be used to see whether two or more variables are related, and if so, what the strength of the relationship is, what kind of relationship exists and whether predictions can be made from it. The goal of regression here is to relate two numerical variables in the Tips dataset to each other, specifically the tip amount and the total bill amount. Is the tip amount related to the total bill amount, how and by how much? Can you predict the tip amount if you know the bill amount?

Scatter plots to identify relationship between total bill and tip:

Plots such as scatter plots can help to identify trends and patterns in a dataset which might indicate a relationship.

The scatter plot below visualises the relationship between two numerical variables, total bill and tip amount. The correlation statistics below will then be used to put a numerical value on the strength and direction of the relationship. A scatter plot is a plot of the ordered pairs of numbers consisting of the independent variable x and the dependent variable y. It shows the joint distribution of two variables, where each point represents an observation in the dataset, and can be used to spot relationships that may exist. Here the scatter plot is drawn using the seaborn scatterplot function with the total bill shown along the horizontal axis and the tip amounts along the vertical axis. Each point is an actual observation in the Tips dataset with a total bill amount and the corresponding tip amount paid with that bill.

# create the plot
sns.scatterplot(x=df['total_bill'],y=df['tip'], hue=df["size"], style=df["sex"])
# add title
plt.title("Scatter plot of Total bill amount and Tip");

(Figure: scatter plot of total bill amount and tip, coloured by party size and styled by sex)

The scatter plot shows what appears to be a positive linear relationship between the two variables, with the points forming a rough diagonal line from the intersection of the axes up to the top right-hand corner. The tip amount does appear to rise with the bill amount, as would be expected, although some observations do not seem to follow this pattern. The plot shows a few relatively high tips for smaller total bills, but there are quite a number of observations where the tip seems small in comparison to the total bill. These are the points on the bottom right-hand side of the plot, under the (imaginary) line, and they do not follow the trend of higher bill amounts leading to correspondingly higher tips.

Correlation and Covariance of Total Bill and Tip amounts

For two quantitative variables such as the total bill amount and the tip amount, the covariance and correlation are statistics of interest which are used to determine whether a linear relationship between the variables exists, and whether large values of one variable tend to occur with large or small values of the other.

Covariance is a measure of the joint variability of two random variables and the (Pearson) correlation coefficient is the normalised version of the covariance which shows by its magnitude the strength of the linear relation.

The covariance shows how much two variables vary with each other and in what direction one variable will change when the other one does. If the covariance is positive, then when one measurement is above its mean the other measurement will more than likely be above its mean too, while with a negative covariance, when one variable is above its mean the other is likely to be below its mean. A zero covariance implies that the two variables vary independently of each other.
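To make the definition concrete, the sample covariance can be computed directly from the deviations about the means and checked against NumPy's np.cov. The bill and tip values below are made up for illustration, not taken from the Tips data:

```python
import numpy as np

# illustrative made-up bill and tip values, not the real Tips data
bills = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
tips = np.array([1.5, 2.0, 2.5, 3.5, 4.0])

# sample covariance from the definition: sum of products of deviations
# about the means, divided by (n - 1) as pandas and numpy do by default
n = bills.size
cov_manual = np.sum((bills - bills.mean()) * (tips - tips.mean())) / (n - 1)

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance
cov_np = np.cov(bills, tips)[0, 1]
print(cov_manual, cov_np)
```

Both values agree, and the positive sign says that above-average bills tend to go with above-average tips in this toy data.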

The correlation statistics are computed from pairs of observations. The correlation of the measurements can be obtained using the pandas corr method on the dataframe. If there is a strong positive relationship between the variables, the value of the correlation coefficient will be close to 1, while a strong negative relationship will have a correlation coefficient close to -1. A value close to zero would indicate that there is no linear relationship between the variables.

The correlation is easier to interpret than the covariance. Correlation does not depend on units of measurement and does not depend on which variable is x or y. $r$ is the symbol used for sample correlation and $\rho$ is the symbol for the population correlation.
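Since the correlation is just the covariance normalised by the two standard deviations, it can be verified against np.corrcoef, and its symmetry in x and y checked, again on made-up values rather than the real data:

```python
import numpy as np

# small made-up example, not the real Tips dataset
x = np.array([10.0, 15.0, 20.0, 25.0, 30.0])
y = np.array([1.5, 2.0, 2.5, 3.5, 4.0])

# correlation = covariance divided by the product of standard deviations
cov_xy = np.cov(x, y)[0, 1]
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# matches numpy's corrcoef, and swapping x and y gives the same value
print(r, np.corrcoef(x, y)[0, 1], np.corrcoef(y, x)[0, 1])
```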

Correlation and Covariance statistics for Total bill and Tip amount.

The corr function applied to the tips dataset returns a correlation matrix for the numerical variables.

# calculating covariance and correlation on the Tips dataset,
# subset to the numerical variables (non-numeric columns such as sex and smoker are excluded)

print("The covariances between the numerical variables in the Tips datasets are: \n" ,df.loc[:, ['total_bill','tip','size']].cov())
print("\n The correlations between the numerical variables in the Tips datasets are: \n" ,df.loc[:, ['total_bill','tip','size']].corr())
The covariances between the numerical variables in the Tips datasets are: 
             total_bill       tip      size
total_bill   79.252939  8.323502  5.065983
tip           8.323502  1.914455  0.643906
size          5.065983  0.643906  0.904591

 The correlations between the numerical variables in the Tips datasets are: 
             total_bill       tip      size
total_bill    1.000000  0.675734  0.598315
tip           0.675734  1.000000  0.489299
size          0.598315  0.489299  1.000000
print(f"The covariance between the total bill amount and tip amount is {df['total_bill'].cov(df['tip']):.4f}")
# correlation of total bill and tip amounts.
print(f"The correlation between the total bill and tip amount is {df['tip'].corr(df['total_bill']):.4f}")
The covariance between the total bill amount and tip amount is 8.3235
The correlation between the total bill and tip amount is 0.6757

The correlation between the total bill amount and the tip is positive and quite high at 0.68, which implies that the tip amount is indeed related to the total bill amount. The relationship is quite strong but the total bill does not seem to be the only factor. The size of the party can also be seen to have an influence here, with a positive correlation of 0.49 with the tip.
Correlation however is not the same as causation! There can be many possible relationships between variables that are correlated, such as a cause and effect relationship, and also reverse cause and effect. In other cases the observed correlation between variables may be due to the effects of one or more other variables, so while it might seem that the total bill is correlated with the tip amount, it is possible that other variables such as the size of the party or day of week have some influence. Sometimes a relationship between variables might even be just coincidental.

As the correlation coefficients and the scatter plots indicate that there is a linear relationship between total bill and tip amount the next step then is to look at regression.

In addition to scatter plots and joint distribution plots for two variables, seaborn has some regression plots that can be used to visualise relationships and patterns that exist in the data when exploring the data. Statistical models are used to estimate a simple relationship between sets of observations which can be quickly and easily visualised and can be more informative than looking at statistics and tables alone.

These regression plots are mainly used to visualise patterns in a dataset during the exploratory data analysis and are not meant to be used for statistical analysis as such. Other python packages such as statsmodels are recommended for looking at more quantitative measures concerning the fit of the regression models.

Seaborn has two main functions for visualising linear relationships through regression, regplot and lmplot, which produce similar output but have slightly different uses. lmplot has slightly more features, while regplot is an axes-level function and can draw onto specific axes, giving some control over the location of the plot.

First a scatterplot of two variables x and y is drawn and then a regression model y ~ x is fitted and plotted over the scatter plot together with a 95% confidence interval for the regression.

I will draw some regression models of tip modelled on total_bill amount. First a regression model tip ~ total_bill is fitted on top of a scatter plot of the two variables. A confidence interval for the regression is drawn using translucent bands around the regression line, which is estimated using a bootstrap. This feature can be turned off by setting ci to None.

If there are outliers in the dataset (and there seem to be some in the Tips dataset), a robust regression can be fitted. This uses a different loss function which downweights relatively large residuals, but takes a bit longer to run.
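A quick sketch of what such a robust fit looks like, on synthetic data with a few deliberate outliers rather than the Tips data itself (note that robust=True requires the statsmodels package to be installed):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# synthetic bill/tip-like data following tip = 1 + 0.15 * bill plus noise
rng = np.random.default_rng(0)
x = rng.uniform(5, 50, 60)
y = 1.0 + 0.15 * x + rng.normal(0, 0.5, 60)
y[:3] += 8  # a few large outliers

# robust=True downweights large residuals; ci=None skips the bootstrap bands
ax = sns.regplot(x=x, y=y, robust=True, ci=None)
ax.set_title("Robust regression on synthetic data with outliers");
```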

plt.figure(figsize=(7, 4))
sns.regplot(x="total_bill", y="tip", data=df)
plt.title("Regression: tip ~ total_bill");

(Figure: regression plot of tip ~ total_bill)

The regression plot above shows an upward trend in the tip amount as the total bill increases. A confidence interval represents the range in which the coefficients are likely to fall with a likelihood of 95%. The confidence interval is wider towards the top of the regression line.

Before going any further I will give a little overview of what I have learned about simple linear regression and how it can be used on the Tips dataset.

Simple Linear Regression.

Simple linear regression is a common method for looking at relationships between a single explanatory variable and a single quantitative outcome variable. It can be used here to explore the relationship between the total bill amount as the explanatory variable and the tip amount as the outcome variable. (Simple just means that there is only one explanatory variable.) In the Tips dataset there are 229 different observed values of the total bill amount, but there are many amounts in between these exact bill amounts that could be assumed to also be possible values of the explanatory variable. There are 123 unique tip amounts because it seems that some rounding up takes place, as we saw earlier.

The scatter and regression plots above show that there are many observations or points that seem to fall around an imaginary line. Any straight line is characterised by its slope and intercept and can be expressed as $y = mx + c$, where the slope of the line $m$ shows how much $y$ increases when $x$ is increased by one unit, and $c$ is the y-intercept of the line, which is the value of $y$ when $x$ is zero. Linear regression looks for the equation of the line that the points lie on and finds the best possible values for the slope and intercept. The best line to fit the data is the one that minimises the cost of the least squares line. This is when the sum of the squared vertical distances from each point to the regression line is at a minimum.

Regression looks at the relationship between the population mean of an outcome variable $Y$ and an explanatory variable $x$. A regression model can be used to find the expected value of the Tip amount as the outcome variable for a given value of the total bill amount as an explanatory variable. In other words to predict the tip amount expected if we know the total bill amount.

In regression there is usually a null hypothesis and an alternative hypothesis. The hypothesis here is whether there is a linear relationship between the total bill and the tip amount. The null hypothesis is that there is no relationship, $H_0: B_1 = 0$; the alternative hypothesis is that there is a relationship, $H_1: B_1 \neq 0$.

The model $Y = B_0 + B_1x + \epsilon$ linearly relates the tip amount and total bill amount: the tip amount ($Y$) for any particular total bill amount ($x$) is equal to the intercept parameter plus the slope parameter times the total bill amount, plus a random error term epsilon $\epsilon$. Taking expectations gives $E(Y|x) = B_0 + B_1x$. Regression finds estimates of the unknown intercept $B_0$ and slope $B_1$ parameters.

If the null hypothesis turns out to be true then the slope parameter is zero and the expected value or population mean of $Y$ will just be equal to the value of the intercept $B_0$. This would mean that the $x$ value (the total bill amount) has no effect on the $y$ value (tip amount) and therefore the tip amount would stay the same for all total bill amounts which is very unlikely.

The alternative hypothesis is that changes in $x$ are associated with changes in $Y$. That is, that changes in the total bill amounts are related to changes in the tip amounts.

A regression model would say that for each value of the total bill amount, the population mean of Tip amount over all the observations that have that particular total bill amount can be calculated using the simple linear expression $B_0 + B_1x$. The equation $E(Y|x) = B_0 + B_1x$ makes a prediction of the true mean of the Tip amount for any fixed value of the total bill amount. The exact values of the slope and intercept parameters are not known so estimates of the parameters are made and substituted into this equation. These parameters estimates are called coefficients.

The differences between what the model says the $y$ values should be (the fitted values $\hat{y}$) and the observed $y$ values are squared and summed. The vertical distances between the actual observed points and the best fit regression line are known as the residuals. The least squares line is the line with the smallest sum of squared residuals. Calculus can be used to find the values of $b_0$ and $b_1$ that give this minimum value and find the best fit line using the following equations:

$$b_1 = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \qquad \textrm{and} \qquad b_0 = \bar{y} - b_1 \bar{x} $$

$\hat{B_0}$ and $\hat{B_1}$ are often used instead of $b_0$ and $b_1$ to show these are statistics rather than parameters.

There are several assumptions that an ordinary least squares (OLS) regression model should satisfy. The first is that the regression model assumes linearity within the range of the observed values. Then there are important assumptions regarding the error term, which the blog post "7 Classical Assumptions of OLS"[11] by Jim Frost outlines, including the following:

  • The error term accounts for the variation in the dependent variable (tip amount) that the independent variable (total bill) does not explain, and should only be determined by random chance for the model to be unbiased. If this is the case the expected value of the error term is zero; otherwise part of the error term should be added to the regression model itself.
  • All independent variables should be uncorrelated with the error term.
  • All observations of the error term should be uncorrelated with each other.
  • The variance of the error term should be consistent for all observations.

An error is the deviation of the true outcome value from the population mean of the outcome for a given $x$ value. The error term is a population value that we don’t know so the residuals are used instead. The residuals are the sample estimate of the error for each observation - the difference between the observed value and the fitted value.

A linear model should produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables.
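These properties can be checked numerically: fitting a least squares line with numpy's polyfit on synthetic data (not the Tips data) and computing the residuals shows a mean of essentially zero and no correlation with the explanatory variable:

```python
import numpy as np

# synthetic data: a linear signal plus random noise
rng = np.random.default_rng(42)
x = rng.uniform(5, 50, 200)
y = 1.0 + 0.15 * x + rng.normal(0, 0.6, 200)

m, c = np.polyfit(x, y, 1)     # least-squares slope and intercept
residuals = y - (m * x + c)    # observed minus fitted values

# least squares with an intercept forces the residual mean to zero
# and makes the residuals uncorrelated with x
print(residuals.mean())
print(np.corrcoef(x, residuals)[0, 1])
```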


Finding the regression coefficients:

While seaborn’s regression plots can help us spot trends and possible relationships in the data, they do not actually give us the regression coefficients or the regression equation. According to the seaborn tutorial on visualising linear relationships in the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analyses. That is to say that seaborn is not itself a package for statistical analysis. To obtain quantitative measures related to the fit of regression models, you should use statsmodels.

The Fundamentals of Data Analysis lectures on Linear Regression demonstrated how to find the slope and intercept parameters to give the equation of the line that best describes the data points. First the actual observations are plotted using a scatter plot. The best line to fit the data is the one that minimises the cost of the least squares line. This is when the sum of the squared vertical distances from each individual data point to the regression line is at a minimum.

The best line is considered the one that minimises the cost $Cost(m,c)$ of the Least Squares Lines $ \sum_i (y_i - mx_i - c)^2 $.

  • Each $y_i$ is the corresponding value of Tip amount to each total bill amount $x_i$ in the dataset. These are actual measured values so $(x_i,y_i)$ is the ith point in the dataset.
  • The value $mx_i + c$ is what the model says that $y_i$ should have been.
  • $y_i - mx_i - c$ is the difference between the observed $y_i$ values and the value the model gives $(mx_i + c)$. These values are then squared.
  • Particular values of $m$ and $c$ will give the lowest values for this cost function which can be plotted on the scatter plot of actual observations.

The lecture demonstrated how several lines can be drawn as estimates for the model that best fits the data, with given slopes and intercepts. The cost is then calculated for each line and the line with the lowest cost is considered the best fit for the data. However the data does not always fit the line perfectly and therefore the cost is usually greater than zero.

If I was to do this for the Tips dataset I would start with a guess of 15% for the slope using the typical tip rate of 15% and 1 as the intercept given the minimum tip in the dataset is 1 dollar. Fortunately numpy’s polyfit function can do all this. There are also functions in packages such as statsmodels and scikit-learn.

Using the numpy polyfit function to find the best fit line

numpy’s polyfit is a function that can fit a polynomial to a set of x and y points. A linear equation is a polynomial of degree 1. polyfit returns a vector of coefficients that minimise the squared error - the estimates for the slope and intercept parameters. It can be used to fit lines in many dimensions and does the calculations involved in minimising the cost function.

# set x and y to be total bill and tip amount from the dataset
x, y = df['total_bill'], df['tip']
# use polyfit function on total bill and tip amount. polynomial degree is 1 for a linear equation
np.polyfit(x,y,1)
array([0.105 , 0.9203])
# explanatory variable x is total bill, outcome variable tip is y
x, y = df.total_bill, df.tip
# First calculate the mean total bill amount (mean of x) and the mean tip amount (mean of y)
x_avg, y_avg = np.mean(df.total_bill), np.mean(df.tip)
print("The mean bill amount is $%.3f " %x_avg, "and mean tip amount is $%.3f \n" %y_avg)
#  subtract means from each of the individual total_bill and tip values
x_zero= df['total_bill'] - np.mean(df['total_bill'])
y_zero= df['tip'] - np.mean(df['tip'])

# The best slope m is found by the following calculations:
m = np.sum(x_zero * y_zero) / np.sum(x_zero * x_zero)

# The best slope m from above is used to calculate the best intercept c
c = y_avg - m* x_avg

print("The slope m is calculated above to be %.4f and the intercept c  to be %.4f." %(m,c))
The mean bill amount is $19.786  and mean tip amount is $2.998 

The slope m is calculated above to be 0.1050 and the intercept c  to be 0.9203.
# Calculating the cost for various slope and intercept values

cost = lambda m,c: np.sum([(y[i] - m * x[i] - c)**2 for i in range(x.size)])
print("The cost using the above calculated slope  (m = %.3f) and intercept (c = %5.3f): %8.2f" % (m, c, cost(m, c)))
## first for a guess using a tip rate of 15% and an intercept of 1.1 (just above the minimum tip of 1 dollar)
print("Cost with m = %.3f and c = %5.3f: %8.2f" % (0.15, 1.1, cost(0.15, 1.1)))
# using the estimates from the polyfit function
print("Cost with m = %.3f and c = %5.3f: %8.2f" % (0.105, 0.9203, cost(0.105, 0.9203)))
The cost using the above calculated slope  (m = 0.105) and intercept (c = 0.920):   252.79
Cost with m = 0.150 and c = 1.100:   570.90
Cost with m = 0.105 and c = 0.920:   252.79

As the cost is not zero this indicates that the tip is not completely explained by the total bill amount.

The estimate for the slope $m$ and intercept parameter $c$ can be put into the regression equation to find the expected value of the tip amount for any total bill amount (within the range that the slope and intercept were calculated on). For example $E(Y|x) = B_0 + B_1x$ using the intercept estimate of 0.9203 as $B_0$ and the slope estimate 0.105 for $B_1$ would give the values below. We could then check how the model would predict the tip amount for a total_bill amount. In his book, Seltman notes that while it is reasonable to interpolate and make predictions for unobserved $x$ values between the observed $x$ values it would be unwise to extrapolate and make predictions outside of the range of $x$ values studied.

There are only 229 unique total bill amounts in the tips dataset but the linear regression model could be used to make tip predictions for any total bill amounts between the minimum 3.07 and maximum 50.81 dollars but not outside this range.

## apply the model to some total_bill amounts to get predictions of tips
B0, B1 = 0.9203, 0.105
# just selecting a range of total bills between 30 and 40 in steps of 3
for x in range(30,40,3):
    tips_y = B0 + B1*x 
    print(f"For a meal with a total bill of {x:.2f} the expected value of the tip amount is {tips_y:.2f}")
    
For a meal with a total bill of 30.00 the expected value of the tip amount is 4.07
For a meal with a total bill of 33.00 the expected value of the tip amount is 4.39
For a meal with a total bill of 36.00 the expected value of the tip amount is 4.70
For a meal with a total bill of 39.00 the expected value of the tip amount is 5.02

Having made some predictions the next step would be to check how the predictions did on actual observations where we know the total_bill amount and the tip amount.

# selecting some data that fall in the same range predicted for above.
df[(df.loc[:, 'total_bill'] >=30) & (df.loc[:,'total_bill'] <40)].sort_values(by='total_bill')
total_bill tip sex smoker day time size Tip% BillPP TipPP total_spent
210 30.06 2.00 Male Yes Sat Dinner 3 6.653360 10.0200 0.666667 32.06
219 30.14 3.09 Female Yes Sat Dinner 4 10.252157 7.5350 0.772500 33.23
44 30.40 5.60 Male No Sun Dinner 4 18.421053 7.6000 1.400000 36.00
187 30.46 2.00 Male Yes Sun Dinner 5 6.565988 6.0920 0.400000 32.46
... ... ... ... ... ... ... ... ... ... ... ...
56 38.01 3.00 Male Yes Sat Dinner 4 7.892660 9.5025 0.750000 41.01
112 38.07 4.00 Male No Sun Dinner 3 10.506961 12.6900 1.333333 42.07
207 38.73 3.00 Male Yes Sat Dinner 4 7.745933 9.6825 0.750000 41.73
23 39.42 7.58 Male No Sat Dinner 4 19.228818 9.8550 1.895000 47.00

22 rows × 11 columns
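Plugging a few of the observed rows above into the model shows how the predictions compare with the tips actually paid. A quick sketch using the intercept and slope estimates found earlier (0.9203 and 0.105):

```python
# a few observed (total_bill, tip) pairs taken from the rows above
observed = [(30.06, 2.00), (30.40, 5.60), (38.07, 4.00), (39.42, 7.58)]
B0, B1 = 0.9203, 0.105  # intercept and slope estimates from above

for bill, tip in observed:
    predicted = B0 + B1 * bill
    print(f"bill {bill:5.2f}: predicted tip {predicted:.2f}, "
          f"actual tip {tip:.2f}, residual {tip - predicted:+.2f}")
```

Some residuals are large relative to the tips themselves, which is consistent with the total bill explaining only part of the variation in tip amounts.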

There are packages with functions that do all this such as the scikit-learn package.

Using $R^2$ to show how much of the changes in Tip amounts is due to Total Bill amounts:

Coefficient of Determination: $R^2$

The correlation coefficient shows the strength and direction of the relationship between the tip and total bill amount variables. How much of the variance in tip amount ($y$) is actually explained by total bill amount ($x$) can be measured using the Coefficient of Determination, also known as R squared. The $R^2$ value is an estimate of how much of the change in the $y$ values (tip amount) is due to changes in the $x$ values (the total bill amounts) compared to all the other factors that affect the $y$ value.

$$ R^2 = 1 - \frac{\sum_i (y_i - m x_i - c)^2}{\sum_i (y_i - \bar{y})^2} $$

The Pearson correlation coefficient can be squared to get the R-squared value. Numpy has a function corrcoef() that calculates this coefficient. It returns a matrix of correlation coefficients between each pair of variables, which can be squared to get the Coefficient of Determination (R squared).

# Calculate the R-squared value for the Tips dataset using numpy corrcoef
np.corrcoef(df['total_bill'],df['tip'])
# just get a single value for correlation between total bill and tip from the correlation matrix
print(f"R-squared is: {np.corrcoef(df['total_bill'],df['tip'])[0][1]**2:.4f}")
R-squared is: 0.4566

There are clearly other factors that affect the tip amount other than the size of the total bill. According to the R-squared statistic only 46% of the variation in the tip amount is related to the total bill amount.

statsmodels

A quick look at how to use the statsmodels package for linear regression.

# import ols for ordinary least squares regression
from statsmodels.formula.api import ols
# fit the model
model = ols('tip ~ total_bill', data=df).fit()
# print the detailed statistics derived from the fit
print(model.summary())
# just the parameters for the intercept and coefficient
print(model.params)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    tip   R-squared:                       0.457
Model:                            OLS   Adj. R-squared:                  0.454
Method:                 Least Squares   F-statistic:                     203.4
Date:                Fri, 29 Nov 2019   Prob (F-statistic):           6.69e-34
Time:                        20:39:57   Log-Likelihood:                -350.54
No. Observations:                 244   AIC:                             705.1
Df Residuals:                     242   BIC:                             712.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.9203      0.160      5.761      0.000       0.606       1.235
total_bill     0.1050      0.007     14.260      0.000       0.091       0.120
==============================================================================
Omnibus:                       20.185   Durbin-Watson:                   2.151
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.750
Skew:                           0.443   Prob(JB):                     6.35e-09
Kurtosis:                       4.711   Cond. No.                         53.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Intercept     0.920270
total_bill    0.105025
dtype: float64
  • Adjusted R-squared reflects the fit of the model. A higher value indicates a better fit assuming certain conditions are met
  • Intercept coefficient is the Y intercept (this doesn't always make sense; here it would mean the expected output (tip) is equal to this amount when the total bill is zero).
  • total_bill coefficient represents the change in the output of Y (tip) due to a change of one unit in the total bill amount.
  • std err represents the level of accuracy of the coefficients. The lower the better.
  • P>|t| is the p-value. A p-value less than 0.05 is considered statistically significant
  • Confidence interval represents the range in which the coefficients are likely to fall with a likelihood of 95%.
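The fitted model can also make predictions through its predict method. A minimal sketch on a synthetic stand-in dataframe, so that the expected parameters are known exactly; on the real data the call would be the same with df in place of demo:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# synthetic stand-in for the Tips data: tip = 1 + 0.15 * bill, no noise
demo = pd.DataFrame({"total_bill": np.linspace(5, 50, 50)})
demo["tip"] = 1.0 + 0.15 * demo["total_bill"]

model = ols("tip ~ total_bill", data=demo).fit()

# with noiseless data the fit recovers the parameters exactly
print(model.params)

# predict expected tips for new bill amounts
new_bills = pd.DataFrame({"total_bill": [20.0, 35.0]})
print(model.predict(new_bills))
```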

Summary of regression of total bill and tip amount so far.

# correlation of total bill and tip amounts.
print("The correlation coefficient between total bill and tip amount is %.3f" %df['total_bill'].corr(df['tip']))
print(f"The coefficient of determination R squared is {np.corrcoef(df['total_bill'],df['tip'])[0][1]**2:.4f}")
print(f"The estimates for the slope and intercept parameters are {np.polyfit(df['total_bill'],df['tip'],1)[0]:.4f} and {np.polyfit(df['total_bill'],df['tip'],1)[1]:.4f}")
The correlation coefficient between total bill and tip amount is 0.676
The coefficient of determination R squared is 0.4566
The estimates for the slope and intercept parameters are 0.1050 and 0.9203
  • A scatterplot is used to show the shape of the relationship between the variables.

  • There is quite a strong positive relationship between total bill and tip amount but it is not perfectly linear.

  • There can be one or more independent variables and one dependent variable. While there is a relationship between the total bill amount and the tip amount, we could also see that there is a positive relationship between the size of the party and the tip amount.

  • The coefficient of determination (also known as R squared) is a better indicator of the strength of a linear relationship between total bill and tip amount than the correlation coefficient because it shows the percentage of the variation of the dependent variable (tip) that is directly attributed to the independent variable (total bill).

  • Numpy corrcoef() function calculates the Pearson correlation coefficient which can be squared to get the R-squared value.

  • While the correlation coefficient is 0.676, the $R^2$ value is lower at 0.457.

  • The coefficient of determination is obtained by squaring the correlation coefficient; it is often expressed as a percentage.

  • The standard error of the estimate is an estimate of the standard deviation of the y values about the predicted $\hat{y_i}$ values

  • The standard error of estimates can be used to construct a prediction interval
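The standard error of the estimate mentioned above can be computed as the square root of the sum of squared residuals divided by $n - 2$ (two degrees of freedom are lost because the slope and intercept were estimated). A short sketch on synthetic data where the true noise level is known:

```python
import numpy as np

# synthetic data with known noise standard deviation of 0.5
rng = np.random.default_rng(1)
x = rng.uniform(5, 50, 100)
y = 1.0 + 0.15 * x + rng.normal(0, 0.5, 100)

m, c = np.polyfit(x, y, 1)
fitted = m * x + c

# n - 2 degrees of freedom: two parameters (slope, intercept) were estimated
s = np.sqrt(np.sum((y - fitted) ** 2) / (x.size - 2))
print(s)  # should land near the true noise level of 0.5
```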

Seaborn regression plots to show the relationship between tip and total bill.

I will now look more at the regression plots in the seaborn package and see how to interpret them given the information on regression above. The lmplot or regplot functions visualise the relationship between the total bill and tip amount. The first plot here is the simple linear regression between tip as the outcome and total bill as the explanatory variable.

plt.figure(figsize=(7, 4))
sns.regplot(x="total_bill", y="tip", data=df)
plt.title("Simple Linear Regression: tip ~ total_bill");

png

The plot shows a positive relationship between total bill and tip amount, as the tip generally increases as the total bill increases. The confidence interval gets wider towards the upper end of the regression line. The correlation and R-squared statistics suggest that while there is a positive relationship between total bill and tip amount, other factors also influence the tip. I will next look at using the regression plots to condition on another variable to see how the relationship between total bill and tip changes as a function of a third variable.

Effect of another variable on the relationship between total bill and tip amount.

Conditioning on other variables.

How does the relationship between tip and total_bill amount change as a function of a third variable?

The lmplot function combines the regplot function with a FacetGrid and can be used to see if there are any interactions with up to three additional variables. The best way to separate out a relationship is to plot both levels on the same axes and to use colour to distinguish them, which I will do here.

I am going to use the lmplot function to see if there are any interactions affecting the relationship between total bill and tip amount, using colour through the hue semantic. Further variables could be added by drawing multiple “facets” so that each level of a variable appears in a different row or column of a grid. However, I do think this can get very complicated to read.

Effect of party size on the relationship between total bill and tip amount

Here I use lmplot to look at the relationship between total bill and tip amount taking the size of the party into account. Conditioning on party size shows the steepest line for parties of 1, followed by parties of 4, 3 and 2. For the larger parties the regression lines are almost flat. However, parties of 5 and 6 are not very common in the dataset, as can be seen from the sparsity of points, and there is overlap in the confidence intervals.

sns.lmplot(x="total_bill", y="tip", hue="size",data=df)
plt.title("Effect of party size on relationship between total bill and tip");

png

Effect of Sex of bill payer on the relationship between total bill and tip amount.

Here I have conditioned on the sex of the bill payer using the hue semantic. There are more male bill payers in the dataset than female, but the difference between the regression lines is very small: males pay slightly lower tips on average for smaller bills than females, while at the other end of the scale males pay slightly higher tips than females for higher bills. There are fewer females paying higher bill amounts, and the confidence intervals for males and females overlap.

sns.lmplot(x="total_bill", y="tip", hue="sex",data=df, palette=["pink","skyblue"])
plt.title("Effect of sex of bill payer on the relationship between total bill and tip amount");

png

Effect of Smoker on the relationship between total bill and tip amount.

Here I have conditioned on whether there is a smoker in the party using the hue semantic. It seems that non-smokers tend to be less generous for smaller bills but more generous at the higher end, while smokers pay higher tips on lower bills but seem to get less generous relative to the bill as the bill increases. The confidence interval for smokers is much wider at the higher end than for non-smokers. From the plots it does look like smoker status influences the tip amount.

sns.lmplot(x="total_bill", y="tip", hue="smoker",data=df, palette=["g","r"])
plt.title("Effect of smoker in the party on the relationship between total bill and tip amount"); 

png

Effect of Smoker and sex on the relationship between total bill and tip amount.

It seems to me that smoker status does have some influence on the relationship between the total bill and the tip amount, while the sex of the bill payer does not make much difference. The regression plots below show that higher bills tend to lead to higher tips from parties without smokers than from parties with smokers. The confidence intervals for males and females overlap, but the confidence intervals for smokers and non-smokers do not. In part 3 I will look more closely at the difference between male and female smokers.

sns.lmplot(x="total_bill", y="tip", hue="sex",col="smoker",data=df, palette=["pink","skyblue"]);

png

Effect of size, smoker and sex on the relationship between total bill and tip amount.

You can also fit a linear regression when one of the variables takes discrete values, for example the size variable, which only takes 6 possible values in the Tips dataset. In this case you can add some random noise, known as jitter, to the discrete values to make their distribution clearer, although this jitter does not influence the regression fit itself: sns.lmplot(x="size", y="tip", data=df, x_jitter=.05). Alternatively you can collapse over the observations in each discrete bin to plot an estimate of central tendency along with a confidence interval: sns.lmplot(x="size", y="tip", data=df, x_estimator=np.mean). The regression plot of tip against size shows that the tip amount increases with the size of the party up to a party size of 4, then seems to fall a little relative to size for parties of 5 and 6 people. There are very few parties of more than 4 in this dataset.

sns.lmplot(x="size", y="tip", hue="smoker",x_estimator=np.mean, data=df, col="sex",palette=smoker_pal);

png

Combining size, smoker and sex shows that there is an overall upward trend in tip amount as party size gets bigger. The tip amount increases by a larger amount for bigger parties with smokers and a female bill payer. Tip amount tends to increase with party size more for male non-smokers than for smokers in the same group.

Effect of Day of week and time on the relationship between total bill and tip amount.

The regression plots below show how the relationship between total bill and tip amount varies by day of the week and by whether it is lunch or dinner time. There are very few lunches in the dataset on days other than Thursday - in fact only 7 on Fridays and none at all on Saturdays or Sundays. Looking at dinners by day, the steepest line is for dinner on a Thursday, but this is based on a single observation so we can’t infer much from it. Of the other days, Saturday shows the steepest slope, so tips become more generous relative to the bill as the bill amount grows; on the other hand, people are more generous with tips for smaller bills. The confidence intervals for each day overlap.

# faceting by time and also by day of week
sns.lmplot(x='total_bill', y='tip', hue="day",col="time",data=df);

png

Fitting different regression models to look at the relationship between total bill and tip amount.

The seaborn tutorial shows ways of fitting different regression models to a dataset where simple linear regression might not be the most suitable. Both lmplot and regplot can be used to fit a polynomial regression model for higher order polynomials. To do so you use the order argument to the regression plot.

Regression of total bill and tip amount using higher order polynomials

The plots below show the difference between linear regression and higher order polynomials. In my opinion the polynomial of order 3 looks like a slightly better fit to the data than the first order linear regression line.

f, axes = plt.subplots(2, 2, figsize=(12, 6))
sns.regplot(x="total_bill", y="tip", data=df, ax=axes[0, 0], label="order = 1", ci=None); axes[0, 0].legend()
sns.regplot(x="total_bill", y="tip", data=df, order=2, ax=axes[0, 1], label="order = 2", ci=None); axes[0, 1].legend()
sns.regplot(x="total_bill", y="tip", data=df, order=3, ax=axes[1, 0], label="order = 3", ci=None); axes[1, 0].legend()
sns.regplot(x="total_bill", y="tip", data=df, order=4, ax=axes[1, 1], label="order = 4", ci=None); axes[1, 1].legend()
plt.suptitle("Trying higher order polynomial regression functions on Total Bill and Tip")
plt.show()

png
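
The polynomial fits that regplot draws can also be computed with numpy's polyfit and polyval, which exposes the coefficients the plots do not show. This is a sketch with illustrative values in place of the dataset columns:

```python
import numpy as np

# illustrative values standing in for the total_bill and tip columns
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12])

# least-squares polynomial fits of order 1 and order 3 (regplot's order argument)
coeffs1 = np.polyfit(total_bill, tip, 1)
coeffs3 = np.polyfit(total_bill, tip, 3)

# evaluate each fitted curve at the observed x values
fitted1 = np.polyval(coeffs1, total_bill)
fitted3 = np.polyval(coeffs3, total_bill)

# residual sums of squares; the higher-order fit is at least as close to the data
rss1 = np.sum((tip - fitted1) ** 2)
rss3 = np.sum((tip - fitted3) ** 2)
```

A lower residual sum of squares does not by itself mean the higher-order model is better - it may simply be overfitting.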

Robust regression of total bill and tip amount

A robust regression can be fitted to total bill and tip amount using different loss functions to deal with outliers by downweighting relatively large residuals.

sns.regplot(x="total_bill", y="tip", data=df, robust=True);

png

Logistic regression of total bill and tip

A logistic regression model can be used where the outcome y is a binary variable. This shows the estimated probability of y being 1 (or alternatively 0) for a given value of x. A tip falls into either the big tip category (1) or not (0). The seaborn tutorial shows a logistic regression fitted to the Tips data where a binary variable called ‘big_tip’ is first created for tips that are greater than 16% of the total bill amount. This plot shows that the probability of a “big_tip” actually falls as the size of the total bill gets bigger.

# big tip variable for tips greater than 16%

df["big_tip"] = (df.tip / df.total_bill) >.16
sns.lmplot(x="total_bill", y="big_tip",data=df,logistic=True,ci=None, y_jitter=.03)
plt.title("logistic regression of big_tip on total bill");
# Drop this variable again from df
df=df.drop("big_tip", axis=1)

png
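
The curve seaborn fits in that plot is the standard logistic (sigmoid) function. A minimal numpy sketch of its shape - the coefficients here are made up for illustration and are not the fitted values:

```python
import numpy as np

def logistic(x, b0, b1):
    # standard logistic curve: estimated probability that y = 1 at a given x
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

# made-up coefficients: a negative slope b1 means the estimated probability
# of a big tip falls as the total bill grows, as the plot above suggests
prob_small_bill = logistic(10.0, b0=1.0, b1=-0.08)
prob_large_bill = logistic(40.0, b0=1.0, b1=-0.08)
```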

Using a lowess smoother to fit a regression of total bill and tip

The seaborn tutorial shows another way of fitting a nonparametric regression using a lowess smoother. Lowess smoothing is used to fit a line to a scatter plot where noisy data values, sparse data points or weak interrelationships interfere with your ability to see a line of best fit. It is also used where least squares fitting doesn’t produce a line of good fit or is too labor-intensive to use. Lowess is a non-parametric strategy for fitting a smooth curve to data points. A parametric fit assumes the data follows some distribution, which can misrepresent the data, whereas non-parametric smoothers try to find a curve of best fit without assuming the data must fit some distribution shape.

sns.lmplot(x="total_bill", y="tip", data=df, lowess=True)
plt.title("using a lowess smoother")
plt.show()

png

Using Residual plots to check whether the simple regression model of tip ~ total bill is appropriate.

Seaborn’s residplot() function is used for checking whether the simple regression model is appropriate for a dataset. A residplot fits and removes a simple linear regression and then plots the residual value for each observation. Ideally, these values should be randomly scattered around y = 0, as explained earlier. If there is structure in the residuals, this suggests that simple linear regression is not appropriate.

If the residual plot does have a shape, this suggests non-linearity in the dataset. A funnel-shaped pattern suggests that the data suffers from heteroskedasticity, i.e. the error terms have non-constant variance. Hackerearth[7]

Applying the residplot to the data does seem to show this funnel shape, suggesting that the Tips dataset suffers from heteroskedasticity. It doesn’t look like the variance of the error terms is constant, as the residuals become more spread out from zero as the total bill amount increases.

sns.residplot(x="total_bill", y="tip", data=df, scatter_kws={"s": 80})
plt.title("Residual plot"); plt.show()

png
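
The residuals that residplot draws can be computed by hand: fit the least-squares line with numpy and subtract the predicted values. The standard error of the estimate mentioned earlier falls out of the same calculation. A sketch with illustrative values in place of the dataset columns:

```python
import numpy as np

# illustrative values standing in for the total_bill and tip columns
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12])

# least-squares line and its residuals (what residplot plots against total_bill)
slope, intercept = np.polyfit(total_bill, tip, 1)
residuals = tip - (slope * total_bill + intercept)

# residuals of a least-squares fit with an intercept sum to (numerically) zero;
# a funnel shape in their plot indicates heteroskedasticity
n = len(tip)
std_error_estimate = np.sqrt(np.sum(residuals ** 2) / (n - 2))
```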

According to a Statistics by Jim blog post on linear regression assumptions, if you want to see whether the residuals follow a normal distribution you should look at a normal probability plot: the points should follow a straight line. Such a plot can be drawn with the seaborn_qqplot package. The qqplot shows that this line is fairly straight for total bills, but only up to a certain total bill amount.

# qq plot for checking normality
import seaborn_qqplot as sqp

sqp.qqplot(df, x="total_bill", y="tip");

png


Regression using the percentage tip rate as the outcome variable

A case study of restaurant tipping[9] applied a general linear regression model to find the variables which best predicted the percentage tip given to a waiter and found that the size of the party was the most important predictor of tip rate where the tip expressed as a percentage of total bill actually decreases as the party size increases.

The aim of this project, however, was to look at the relationship between the total bill and tip amount, so the analysis above used the tip amount as the outcome. For completeness, the pairplots below show how tips change as other variables change, using the actual tip amount in the top row and the tip rate (tip amount / total bill amount) in the bottom row. The pairplots summarise the main relationships in the dataset between tip and the other variables individually, using scatter plots to show the joint distribution of two variables, with points coloured according to the day of the week. This gives some insight into the factors that may influence the tip along with the total bill amount, which will be looked at further in part 3.
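
The pairplot below uses a Tip% column that is not created in the snippets shown here. A minimal pandas sketch of how such a tip-rate column can be derived, using a small made-up frame in place of the real df:

```python
import pandas as pd

# small made-up frame standing in for the Tips DataFrame df
demo = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01],
                     "tip": [1.01, 1.66, 3.50]})

# tip rate: the tip expressed as a percentage of the total bill
demo["Tip%"] = demo["tip"] / demo["total_bill"] * 100
```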

# pairplot with tip in first row, tip% in second row as Y variables against other variables as X, conditioned on day
g = sns.pairplot(df, x_vars=["total_bill", "size","sex","smoker"],y_vars=["tip","Tip%"], hue="day")
plt.suptitle("Relationships in the Tips dataset");

png

Review of Part 2 on Regression: Is there a relationship between total bill and tip amount?

  • In this section I looked first at the scatter plots of total bill and tip amounts. Using scatter plots is a graphical method of depicting the relationship between one variable and another. A positive relationship between total bill and tip was visible where the tip amount tends to rise with the total bill amount.

  • The covariance and correlation statistics also indicated that there was a positive and fairly strong linear (0.68) relationship between the total bill amount and the tip given.

  • I looked at what linear regression actually is and how it is used to find the equation of the line that best fits the data. I outlined some of the steps involved and how the method could be applied to the total bill and tip variables in the Tips dataset to find a relationship and also to be able to predict an expected tip amount given a total bill amount.

  • Seaborn does not actually give you any statistical values such as the regression coefficients or the regression equation; there are many other packages that do this. I used the numpy polyfit function to fit a linear equation to the data that minimises the cost function.

  • The coefficient of determination, also known as R-squared was also calculated using numpy. While the correlation coefficients provide a measure of the strength and direction of the relationship between variables, the coefficient of determination is an estimate of how much of the changes in one variable such as tip are due to changes in the total bill amount.

  • The R-squared value of 0.456 indicates that only about 46% of the variation in the tip amount is due to the variation in the total bill amount, which means that there are other factors that account for some of the variation in tip amount.

  • The effect of the other variables in the dataset on the relationship between the total bill and tip amount were next explored. Seaborn lmplots were used here with hue semantics and facets to show the effects of other variables by conditioning the relationship between total bill and tip on one or more other variables. Smoker status in particular stood out as a variable of interest.

  • I then looked at using the tip as a percentage of total bill as the outcome variable with the total bill amount as the explanatory variable.

  • Using seaborn plots I also looked at various other regression models, such as logistic regression, which could be applied to the dataset if the tip variable were converted to a binary variable with 0 representing values under a particular percentage tip and 1 representing values over it.

  • Higher order polynomial regression functions were applied to the data to compare with the linear regression model.

  • Regression using a lowess smoother was also briefly explored. This could be suitable for the tips dataset as the least squares line doesn’t create a line of good fit for the higher total bill amounts.

  • I used the residual plot function to look at the error terms and this did seem to indicate that the data may suffer from heteroskedasticity, suggesting that the simple linear regression of tip ~ total_bill may not be an appropriate model to use.

  • Using the tip as a percentage of total bill showed that the customers in the dataset did tend to tip using percentages of the bill as guidelines, but the rate given fell somewhat as the group size got larger.

  • The plots showed that tip amount is certainly related to the total bill amount but as the total bill creeps up the tip rate seems to fall. Tip rate starts to fall off as the size of the party increases but more so for smokers than non-smokers.

Tip amount in general does rise with the amount of the total bill. Tip amount also increases as the size of the group increases, since the total bill increases with size. The tip percentage looks fairly constant but tends to fall as group size increases.

I will be looking further at the relationships between the variables in part 3.



Tech used:
  • Python
  • pandas
  • Seaborn
  • Jupyter