The Titanic dataset


This is a very famous dataset and is often a student's first step in machine learning! We'll be trying to predict a classification: survived or deceased. Let's begin our understanding of implementing logistic regression in Python for classification. We'll use a "semi-cleaned" version of the Titanic dataset; if you use the dataset hosted directly on Kaggle, you may need to do some additional cleaning not shown in this lecture notebook.

My imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the Titanic training set

In [2]:
train=pd.read_csv("titanic_train.csv")
In [4]:
train.head()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

EDA

In [12]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap="viridis");
In [26]:
sns.set_style("whitegrid")
#style must be one of white, dark, whitegrid, darkgrid, ticks
In [118]:
sns.countplot(x="Survived",data=train);
In [37]:
sns.distplot(train['Age'].dropna(),kde=False,bins=50)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d316608>
In [38]:
train['Age'].dropna().plot(kind="hist",bins=50)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d4350c8>
In [40]:
sns.countplot(x="SibSp",data=train)
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185d802e88>
In [48]:
train["Fare"].hist(bins=50,figsize=(10,4))
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x2185cbcd508>
In [61]:
import cufflinks as cf


cf.go_offline()
In [121]:
train["Age"].iplot(kind="hist",bins=50)
In [64]:
train.head()
Out[64]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [69]:
plt.figure(figsize=(12,8))
sns.boxplot(x="Pclass",y="Age",data=train)
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x21863cd9a08>

Filling in missing data

In [ ]:
### Calculate the mean age of each passenger class and use it to fill in the missing ages
In [70]:
def impute_age(cols):
    Age=cols[0]
    Pclass=cols[1]
    
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24
        
    else:
        return Age
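The ages hard-coded above come from eyeballing the boxplot; they could instead be computed directly from the data. A minimal sketch of that idea, using a hypothetical mini-DataFrame in place of the real training set:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for the Titanic training data
df = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age":    [40.0, np.nan, 30.0, 28.0, np.nan, 24.0],
})

# Mean age per passenger class, computed from the data itself
class_means = df.groupby("Pclass")["Age"].mean()

# Fill each missing Age with its class mean in one line
df["Age"] = df["Age"].fillna(df["Pclass"].map(class_means))
```

This replaces the hand-written if/elif chain with a `groupby` + `map`, so the fill values stay in sync with the data automatically.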
In [75]:
train["Age"]=train[["Age","Pclass"]].apply(impute_age,axis=1)
In [74]:
train.columns
Out[74]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [125]:
sns.heatmap(train.isnull())
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x2186745a508>
In [80]:
train.drop("Cabin",axis=1,inplace=True)
In [81]:
train.columns
Out[81]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked'],
      dtype='object')
In [83]:
train.dropna(inplace=True)
In [84]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[84]:
<matplotlib.axes._subplots.AxesSubplot at 0x21863f5bbc8>
In [86]:
sex=pd.get_dummies(train["Sex"],drop_first=True)
In [4]:
embark=pd.get_dummies(train["Embarked"],drop_first=True)
In [90]:
train=pd.concat([train,sex,embark],axis=1)
In [91]:
train.head()
Out[91]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked male Q S
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S 1 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C 0 0 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 S 0 0 1
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 S 0 0 1
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 S 1 0 1
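Why `drop_first=True`? One dummy column per category would be perfectly collinear (the columns always sum to 1), which can confuse a linear model. Dropping the first level encodes it implicitly as all zeros. A quick toy example (the values C/Q/S mirror Embarked):

```python
import pandas as pd

# Toy column standing in for Embarked (values C, Q, S)
s = pd.Series(["S", "C", "Q", "S"])

# With drop_first=True the first level (C) is dropped;
# a row of all zeros then encodes C implicitly
dummies = pd.get_dummies(s, drop_first=True)
cols = list(dummies.columns)  # ['Q', 'S']
```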
In [94]:
train.drop(["Sex","Embarked","Name","Ticket"],axis=1,inplace=True)
In [95]:
train.drop("PassengerId",axis=1,inplace=True)
In [96]:
train.head()
Out[96]:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1

Train test split, and training the model

In [97]:
from sklearn.model_selection import train_test_split
In [98]:
x=train.drop('Survived',axis=1)
y=train['Survived']
In [99]:
X_train, X_test, y_train, y_test = train_test_split( x, y, test_size=0.3, random_state=101)
In [109]:
from sklearn.linear_model import LogisticRegression
In [110]:
logmodel=LogisticRegression()
In [112]:
logmodel.fit(X_train,y_train)
C:\Users\Taku\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:939: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[112]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
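The ConvergenceWarning above is harmless here, but the fix is exactly what the message suggests: scale the features and/or raise `max_iter`. A minimal sketch on synthetic data (the real notebook would use the Titanic features instead of `make_classification`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic feature matrix
X, y = make_classification(n_samples=500, n_features=8, random_state=101)

# Scaling the inputs (and allowing more iterations) lets lbfgs converge cleanly
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
```

Wrapping the scaler and model in a `Pipeline` also guarantees the test set is scaled with the training set's statistics, avoiding leakage.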
In [113]:
predictions=logmodel.predict(X_test)
In [114]:
from sklearn.metrics import classification_report
In [115]:
print(classification_report(y_test,predictions))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86       163
           1       0.82      0.71      0.76       104

    accuracy                           0.83       267
   macro avg       0.83      0.81      0.81       267
weighted avg       0.83      0.83      0.83       267

In [116]:
from sklearn.metrics import confusion_matrix
In [117]:
confusion_matrix(y_test,predictions)
Out[117]:
array([[147,  16],
       [ 30,  74]], dtype=int64)
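The classification report above can be recovered by hand from this confusion matrix (rows are actual, columns are predicted):

```python
import numpy as np

# Confusion matrix from the cell above: rows = actual, cols = predicted
cm = np.array([[147, 16],
               [30, 74]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy  = (tp + tn) / cm.sum()  # matches the 0.83 in the report
precision = tp / (tp + fp)        # class 1 precision, 0.82
recall    = tp / (tp + fn)        # class 1 recall, 0.71
```

Doing this arithmetic once by hand is a good sanity check that you know which axis of the matrix is "actual" and which is "predicted".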
In [ ]: