The Titanic dataset


This is a very famous dataset and is often a student's first step in machine learning! We'll be trying to predict a classification: survived or deceased. Let's begin our understanding of implementing logistic regression in Python for classification. We'll use a "semi-cleaned" version of the Titanic dataset; if you use the dataset hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

My imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the Titanic training set

In [2]:
train=pd.read_csv("titanic_train.csv")
In [4]:
train.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

EDA

In [12]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="viridis");
In [26]:
sns.set_style("whitegrid")
#style must be one of white, dark, whitegrid, darkgrid, ticks
In [118]:
sns.countplot(x="Survived",data=train);
In [37]:
sns.distplot(train['Age'].dropna(),kde=False,bins=50)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d316608>
In [38]:
train['Age'].dropna().plot(kind="hist",bins=50)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d4350c8>
In [40]:
sns.countplot(x="SibSp",data=train)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d802e88>
In [48]:
train["Fare"].hist(bins=50,figsize=(10,4))
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185cbcd508>
In [61]:
import cufflinks as cf


cf.go_offline()
In [121]:
train["Age"].iplot(kind="hist",bins=50)
In [64]:
train.head()
Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [69]:
plt.figure(figsize=(12,8))
sns.boxplot(x="Pclass",y="Age",data=train)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x21863cd9a08>

Filling in missing data

In [ ]:
### Calculate the mean age of each passenger class and use it to fill in the missing ages
In [70]:
def impute_age(cols):
    Age=cols[0]
    Pclass=cols[1]
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
        
    else:
        return Age
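The ages hard-coded above come from eyeballing the boxplot; they could instead be computed directly from the data. A minimal sketch of that idea, using a hypothetical mini-DataFrame in place of the real training set:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the Titanic training data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [40.0, np.nan, 30.0, 28.0, np.nan, 24.0],
})

# Mean age per passenger class, computed from the data itself
class_means = df.groupby("Pclass")["Age"].mean()

# Fill each missing Age with its class mean in one line
df["Age"] = df["Age"].fillna(df["Pclass"].map(class_means))
```

This replaces the hand-written if/elif chain with a `groupby` + `map`, so the fill values stay in sync with the data automatically.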
In [75]:
train["Age"]=train[["Age","Pclass"]].apply(impute_age,axis=1)
In [74]:
train.columns
Out[74]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [125]:
sns.heatmap(train.isnull())
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x2186745a508>
In [80]:
train.drop("Cabin",axis=1,inplace=True)
In [81]:
train.columns
Out[81]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')
In [83]:
train.dropna(inplace=True)
In [84]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x21863f5bbc8>
In [86]:
sex=pd.get_dummies(train["Sex"],drop_first=True)
In [4]:
embark=pd.get_dummies(train["Embarked"],drop_first=True)
In [90]:
train=pd.concat([train,sex,embark],axis=1)
In [91]:
train.head()
Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked male Q S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 0 1
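Why `drop_first=True`? One dummy column per category would be perfectly collinear (the columns always sum to 1), which can confuse a linear model. Dropping the first level encodes it implicitly as all zeros. A quick toy example (the values C/Q/S mirror Embarked):

```python
import pandas as pd

# Toy column standing in for Embarked (values C, Q, S)
s = pd.Series(["S", "C", "Q", "S"])

# With drop_first=True the first level (C) is dropped;
# a row of all zeros then encodes C implicitly
dummies = pd.get_dummies(s, drop_first=True)
cols = list(dummies.columns)  # ['Q', 'S']
```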
In [94]:
train.drop(["Sex","Embarked","Name","Ticket"],axis=1,inplace=True)
In [95]:
train.drop("PassengerId",axis=1,inplace=True)
In [96]:
train.head()
Out[96]:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1

Train test split, and training the model

In [97]:
from sklearn.model_selection import train_test_split
In [98]:
x=train.drop('Survived',axis=1)
y=train['Survived']
In [99]:
X_train, X_test, y_train, y_test = train_test_split( x, y, test_size=0.3, random_state=101)
In [109]:
from sklearn.linear_model import LogisticRegression
In [110]:
logmodel=LogisticRegression()
In [112]:
logmodel.fit(X_train,y_train)
C:\Users\Taku\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:939: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[112]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
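The ConvergenceWarning above is harmless here, but the fix is exactly what the message suggests: scale the features and/or raise `max_iter`. A minimal sketch on synthetic data (the real notebook would use the Titanic features instead of `make_classification`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=101)

# Scaling the inputs (and allowing more iterations) lets lbfgs converge cleanly
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
```

Wrapping the scaler and model in a `Pipeline` also guarantees the test set is scaled with the training set's statistics, avoiding leakage.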
In [113]:
predictions=logmodel.predict(X_test)
In [114]:
from sklearn.metrics import classification_report
In [115]:
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       163
           1       0.82      0.71      0.76       104

    accuracy                           0.83       267
   macro avg       0.83      0.81      0.81       267
weighted avg       0.83      0.83      0.83       267

In [116]:
from sklearn.metrics import confusion_matrix
In [117]:
confusion_matrix(y_test,predictions)
Out[117]:
array([[147,  16],
       [ 30,  74]], dtype=int64)
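The classification report above can be recovered by hand from this confusion matrix (rows are actual, columns are predicted):

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, cols = predicted
cm = np.array([[147, 16],
               [30, 74]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy  = (tp + tn) / cm.sum()  # matches the 0.83 in the report
precision = tp / (tp + fp)        # class 1 precision, 0.82
recall    = tp / (tp + fn)        # class 1 recall, 0.71
```

Doing this arithmetic once by hand is a good sanity check that you know which axis of the matrix is "actual" and which is "predicted".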
In [ ]: