Logistic Regression Project

In this project we will work with a synthetic advertising data set that records whether or not a particular internet user clicked on an advertisement. We will build a model that predicts whether a user will click on an ad, based on that user's features.

This data set contains the following features:

'Daily Time Spent on Site': consumer time on site in minutes
'Age': customer age in years
'Area Income': Avg. Income of geographical area of consumer
'Daily Internet Usage': Avg. minutes a day consumer is on the internet
'Ad Topic Line': Headline of the advertisement
'City': City of consumer
'Male': Whether or not consumer was male
'Country': Country of consumer
'Timestamp': Time at which consumer clicked on Ad or closed window
'Clicked on Ad': 1 if the consumer clicked on the ad, 0 otherwise
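
The 'Timestamp' column is stored as text, so a model cannot use it directly. As a minimal sketch (not part of the original project), the snippet below parses one timestamp value taken from the first row of the data and derives a few numeric features from it. The helper name timestamp_features is hypothetical.

```python
from datetime import datetime

def timestamp_features(ts: str) -> dict:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string into simple numeric features."""
    parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {
        "hour": parsed.hour,          # 0-23
        "weekday": parsed.weekday(),  # Monday = 0 ... Sunday = 6
        "month": parsed.month,        # 1-12
    }

# Timestamp of the first row in the data set.
features = timestamp_features("2016-03-27 00:53:11")
print(features)
```

Whether such derived features actually help the classifier is something to check during exploration, not a given.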

Import Libraries

Import a few libraries you think you'll need (or just import them as you go along).

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
%matplotlib inline
sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

Get the Data

Read in the advertising.csv file and set it to a data frame called ad_data.

ad_data = pd.read_csv('advertising.csv')

Check the head of ad_data

ad_data.head()

Use info() and describe() on ad_data

ad_data.info()
ad_data.describe()
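
describe() summarises each numeric column with its count, mean, sample standard deviation, minimum, quartiles, and maximum. As a rough, dependency-free illustration, the sketch below computes a few of those statistics for the 'Age' values of the first five rows of ad_data. The variable names are illustrative only.

```python
import statistics

ages = [35, 31, 26, 29, 35]  # 'Age' for the first five rows of ad_data

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),   # sample standard deviation (n - 1), as pandas uses
    "min": min(ages),
    "50%": statistics.median(ages),  # the median is the 50% quartile
    "max": max(ages),
}
print(summary)
```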

Exploratory Data Analysis

Let's use seaborn to explore the data!

Try recreating the plots shown below!

Create a histogram of 'Age'

# distplot is deprecated since seaborn 0.11; histplot is its replacement for histograms
sns.histplot(ad_data['Age'], bins = 30, color = 'steelblue', alpha = 1)

Create a jointplot showing Area Income versus Age.

sns.jointplot(x = 'Age', y = 'Area Income', data = ad_data)

Create a jointplot showing the KDE distributions of 'Daily Time Spent on Site' vs. 'Age'.

sns.jointplot(x = 'Age', y = 'Daily Time Spent on Site', data = ad_data, kind = 'kde', color = 'red')

Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'

sns.jointplot(x = 'Daily Time Spent on Site', y = 'Daily Internet Usage', data = ad_data, color = 'green')

Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column.

sns.pairplot(ad_data, hue = 'Clicked on Ad', palette = 'bwr')

Logistic Regression

Now it's time to do a train test split, and train our model!

You'll have the freedom here to choose columns that you want to train on!

Split the data into training set and testing set using train_test_split

X = ad_data[['Daily Time Spent on Site','Daily Internet Usage']]

y = ad_data['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
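
train_test_split shuffles the rows and holds out a fraction of them for testing. The sketch below imitates that idea with the standard library only, assuming the data set has 1000 rows (consistent with the 300 test samples reported in the evaluation further down). It will not reproduce scikit-learn's exact shuffle, and simple_split is a hypothetical helper.

```python
import random

def simple_split(n_rows: int, test_size: float, seed: int):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)       # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

train_idx, test_idx = simple_split(n_rows=1000, test_size=0.3, seed=101)
print(len(train_idx), len(test_idx))
```

The seed plays the same role as random_state=101 in the call above: rerunning the split gives the same partition, which keeps the evaluation reproducible.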

Train and fit a logistic regression model on the training set.

logistic_model = LogisticRegression()

logistic_model.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
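
A fitted logistic regression model turns a weighted sum of the features into a probability with the logistic (sigmoid) function, and predicts class 1 when that probability exceeds 0.5. The sketch below shows the calculation for the two features used here. The intercept and coefficients are made-up placeholders, not the values learned by logistic_model.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for illustration only (NOT the fitted values).
intercept = 20.0
coef_time_on_site = -0.15     # per minute of 'Daily Time Spent on Site'
coef_internet_usage = -0.05   # per minute of 'Daily Internet Usage'

def click_probability(time_on_site: float, internet_usage: float) -> float:
    """Probability of class 1 ('clicked') under the placeholder parameters."""
    score = (intercept
             + coef_time_on_site * time_on_site
             + coef_internet_usage * internet_usage)
    return sigmoid(score)

p = click_probability(68.95, 256.09)   # feature values from the first row of ad_data
predicted_class = int(p > 0.5)
print(round(p, 3), predicted_class)
```

With the fitted model, the actual parameters are available as logistic_model.intercept_ and logistic_model.coef_.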

Predictions and Evaluations

Now predict values for the testing data.

y_predict = logistic_model.predict(X_test)

Create a confusion matrix and a classification report for the model.

print(confusion_matrix(y_test, y_predict))

[[155   2]
 [ 11 132]]
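
scikit-learn's confusion_matrix puts the true classes in rows and the predicted classes in columns, so the first row of the output above holds the true negatives and false positives, and the second row the false negatives and true positives. A short sketch that unpacks the printed matrix and derives the overall accuracy:

```python
# Values copied from the confusion matrix printed above.
matrix = [[155, 2],
          [11, 132]]

(tn, fp), (fn, tp) = matrix   # rows: true class 0, 1; columns: predicted class 0, 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
print(f"{tn + tp} of {total} test samples classified correctly ({accuracy:.1%})")
```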

print(classification_report(y_test, y_predict))

             precision    recall  f1-score   support

          0       0.93      0.99      0.96       157
          1       0.99      0.92      0.95       143

avg / total       0.96      0.96      0.96       300
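
The per-class figures in the report follow directly from the confusion matrix. As a check, the sketch below recomputes precision, recall, and F1 for both classes from the four counts. The helper prf is a hypothetical name.

```python
def prf(tp: int, fp: int, fn: int):
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the confusion matrix [[155, 2], [11, 132]].
tn, fp, fn, tp = 155, 2, 11, 132

class_1 = prf(tp=tp, fp=fp, fn=fn)   # 'clicked' treated as the positive class
class_0 = prf(tp=tn, fp=fn, fn=fp)   # roles swap when class 0 is the positive class

for label, scores in ((0, class_0), (1, class_1)):
    print(label, [round(s, 2) for s in scores])
```

Rounded to two decimals, these values agree with the rows of the report.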

First five rows of ad_data (output of ad_data.head(); the 'Clicked on Ad' column is not shown):

   Daily Time Spent on Site  Age  Area Income  Daily Internet Usage  Ad Topic Line                          City            Male  Country     Timestamp
0  68.95                     35   61833.90     256.09                Cloned 5thgeneration orchestration     Wrightburgh     0     Tunisia     2016-03-27 00:53:11
1  80.23                     31   68441.85     193.77                Monitored national standardization     West Jodi       1     Nauru       2016-04-04 01:39:02
2  69.47                     26   59785.94     236.50                Organic bottom-line service-desk       Davidton        0     San Marino  2016-03-13 20:35:42
3  74.15                     29   54806.18     245.89                Triple-buffered reciprocal time-frame  West Terrifurt  1     Italy       2016-01-10 02:31:19
4  68.37                     35   73889.99     225.58                Robust logistical utilization          South Manuel    0     Iceland     2016-06-03 03:36:18