Logistic Regression with Python

Machine Learning
AI Engineering
Implementing Logistic Regression for classification tasks using Python.
Author

DOSSEH Ameck Guy-Max Désiré

Published

September 5, 2025

Estimated reading time: ~30 minutes

Objectives

  • Use Logistic Regression for classification
  • Preprocess data for modeling
  • Implement Logistic Regression on real-world data

Classification with Logistic Regression

Assume you are working for a telecommunications company concerned about customer churn. The goal is to predict which customers are likely to leave.

Load the Telco Churn data

Telco Churn is a hypothetical dataset for predicting customer churn. Each row represents a customer with various demographic and service usage information.

Data Preprocessing

For this lab, we use a subset of fields: ‘tenure’, ‘age’, ‘address’, ‘income’, ‘ed’, ‘employ’, ‘equip’, and ‘churn’.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%203/data/ChurnData.csv"
churn_df = pd.read_csv(url)
churn_df = churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'churn']]
churn_df['churn'] = churn_df['churn'].astype('int')

Feature Selection

X = np.asarray(churn_df[['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']])
y = np.asarray(churn_df['churn'])

Standardization

X_norm = StandardScaler().fit(X).transform(X)
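Note that the scaler above is fit on the full feature matrix before splitting, which keeps the lab simple. A minimal sketch of a leakage-free alternative wraps scaling and the classifier in a pipeline, so the scaler only ever sees training rows; the data below are a synthetic stand-in, not the churn dataset:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features (values are illustrative)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=4)

# The scaler is fit on the training rows only, then reused at predict time
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)
acc = pipe.score(X_te, y_te)
```

The pipeline re-applies the training-set means and variances to the test rows, so no test information leaks into preprocessing.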

Splitting the dataset

X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.2, random_state=4)

Logistic Regression Classifier modeling

  • Model (binary outcome)

    \[P(y=1\mid x)=\sigma(\eta)=\frac{1}{1+e^{-\eta}},\quad \eta=x^\top \beta.\]

    Equivalently, the log-odds are linear:

    \[\log\frac{P(y=1\mid x)}{1-P(y=1\mid x)}=x^\top \beta.\]
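As a quick numeric sanity check (the index value is illustrative), the sigmoid and the log-odds are exact inverses of each other:

```python
import numpy as np

def sigmoid(eta):
    """Logistic link: maps a linear index to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

eta = 0.7                       # illustrative linear index x'beta
p = sigmoid(eta)
log_odds = np.log(p / (1 - p))  # recovers eta exactly
```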

  • Random-utility (discrete-choice) view

    Latent index model:

    \[y^\ast=x^\top\beta+\varepsilon,\quad y=\mathbf{1}\{y^\ast\ge 0\}.\]

    If \[\varepsilon\sim\text{Logistic}(0,1)\] (difference of i.i.d. Gumbel/Type‑I EV errors), then

    \[P(y=1\mid x)=\sigma(x^\top\beta).\]

    This is the same logit link used in discrete-choice models.
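A small Monte-Carlo sketch of this equivalence (the scalar beta and x are made up): drawing logistic errors in the latent index reproduces sigma(x'beta) as the fraction of ones.

```python
import numpy as np

rng = np.random.default_rng(42)
beta, x = 1.5, 0.8                       # illustrative coefficient and covariate
eps = rng.logistic(size=200_000)         # Logistic(0, 1) errors
y = (x * beta + eps >= 0).astype(int)    # y = 1{y* >= 0}

empirical = y.mean()                               # simulated P(y=1|x)
theoretical = 1.0 / (1.0 + np.exp(-x * beta))      # sigma(x'beta)
```

By symmetry of the logistic distribution, P(eps >= -x'beta) = sigma(x'beta), which the simulation confirms up to sampling noise.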

  • Likelihood, log-likelihood, cross-entropy

    With observations \[\{(y_i,x_i)\}_{i=1}^n\] and \[p_i=\sigma(x_i^\top\beta)\], the Bernoulli likelihood is

    \[L(\beta)=\prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i},\qquad \ell(\beta)=\sum_{i=1}^n\left[y_i\log p_i+(1-y_i)\log(1-p_i)\right].\]

    The empirical cross-entropy (what sklearn’s log_loss reports) is

    \[\text{CE}(\beta)=-\frac{1}{n}\,\ell(\beta).\]
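To confirm the correspondence, the hand-computed average negative log-likelihood matches sklearn's log_loss on a toy vector (the probabilities here are illustrative):

```python
import numpy as np
from sklearn.metrics import log_loss

y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.2, 0.6, 0.8, 0.3])   # predicted P(y=1); values are illustrative

# Average negative Bernoulli log-likelihood, computed by hand
ce_manual = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```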

  • Score, Hessian, convexity

    Stack rows in \[X\in\mathbb{R}^{n\times p}\], let \[p=(p_1,\dots,p_n)^\top\] and \[W=\mathrm{diag}(p_i(1-p_i))\]. Then

    \[\nabla\ell(\beta)=X^\top(y-p),\qquad \nabla^2\ell(\beta)=-X^\top W X.\]

    Since \[-\,\nabla^2\ell(\beta)\succeq 0\], \[\ell(\beta)\] is concave, so the negative log-likelihood (cross-entropy) is convex and has a unique minimizer when it exists.
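These formulas can be verified numerically: a central finite difference of the log-likelihood should match the closed-form score X'(y − p). The data below are random, purely for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50)
beta = rng.normal(size=3) * 0.1

def loglik(b):
    p = 1.0 / (1.0 + np.exp(-X @ b))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Closed-form score: X'(y - p)
grad = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta)))

# Central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(loglik(beta + eps * e) - loglik(beta - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```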

  • IRLS / Newton updates

    One Newton step yields

    \[\beta_{\text{new}}=\beta_{\text{old}}+(X^\top W X)^{-1}X^\top(y-p),\]

    which is the iteratively reweighted least squares (IRLS) algorithm. A WLS form uses the pseudo-response

    \[z=\eta+\frac{y-p}{p(1-p)},\quad \text{solve } (X^\top W X)\beta=X^\top W z.\]
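A bare-bones IRLS/Newton loop on simulated data (the true coefficients are made up) illustrates the update; at convergence it drives the score X'(y − p) to zero:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([0.5, -1.0, 2.0])                      # illustrative truth
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(int)

beta = np.zeros(3)
for _ in range(25):                       # Newton / IRLS iterations
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    W = prob * (1 - prob)                 # diagonal of the weight matrix
    grad = X.T @ (y - prob)               # score
    H = X.T @ (X * W[:, None])            # X' W X (negative Hessian)
    step = np.linalg.solve(H, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:      # converged
        break

final_score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta)))
```

Each iteration is exactly the Newton step above; the recovered beta sits near beta_true up to sampling error.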

  • Standard errors and tests (MLE)

    At the MLE \[\hat\beta\], the asymptotic covariance is

    \[\mathrm{Var}(\hat\beta)\approx (X^\top W X)^{-1}.\]

    Wald: \[\hat\beta_j/\mathrm{SE}(\hat\beta_j)\]; LR test: \[2(\ell_{\text{full}}-\ell_{\text{restr}})\sim\chi^2\].
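A sketch of Wald statistics from the observed information, using a near-unpenalized sklearn fit (a very large C stands in for no penalty; the data and true coefficients are simulated):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 400
X = rng.normal(size=(n, 2))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -0.5]))))).astype(int)

# Near-unpenalized fit; no intercept to keep the algebra short
clf = LogisticRegression(C=1e10, fit_intercept=False).fit(X, y)
b = clf.coef_[0]

p = 1.0 / (1.0 + np.exp(-X @ b))
W = p * (1 - p)
cov = np.linalg.inv(X.T @ (X * W[:, None]))   # (X' W X)^{-1} at the fit
se = np.sqrt(np.diag(cov))
z = b / se                                    # Wald statistics
```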

  • Marginal effects and odds ratios

    \[\frac{\partial}{\partial x_j}P(y=1\mid x)=p(1-p)\,\beta_j,\qquad \text{odds ratio}=\exp(\beta_j).\]
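With illustrative numbers, the marginal effect scales each coefficient by p(1 − p) at the evaluation point, while the odds ratio is simply exp(beta_j):

```python
import numpy as np

beta = np.array([0.4, -0.9])           # illustrative coefficients
x = np.array([1.0, 0.5])               # evaluation point

p = 1.0 / (1.0 + np.exp(-x @ beta))
marginal = p * (1 - p) * beta          # dP(y=1|x)/dx_j, largest near p = 0.5
odds_ratio = np.exp(beta)              # multiplicative change in odds per unit of x_j
```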

  • Regularization (as in sklearn by default)

    L2-penalized log-likelihood:

    \[\ell_\lambda(\beta)=\ell(\beta)-\frac{\lambda}{2}\lVert\beta\rVert_2^2,\]

    so

    \[\nabla\ell_\lambda(\beta)=X^\top(y-p)-\lambda\beta,\quad \nabla^2\ell_\lambda(\beta)=-X^\top W X-\lambda I.\]

    L1 replaces \[\lVert\beta\rVert_2^2\] with \[\lVert\beta\rVert_1\] (subgradients, promotes sparsity).
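In sklearn the penalty strength is set through C, which plays the role of an inverse regularization strength (roughly 1/λ), so smaller C means heavier shrinkage. A quick check on simulated data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
y = (rng.random(300) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)

norms = []
for C in [100.0, 1.0, 0.01]:          # sklearn's C is an inverse penalty strength
    clf = LogisticRegression(C=C).fit(X, y)
    norms.append(np.linalg.norm(clf.coef_))
# Heavier penalty (smaller C) yields a smaller coefficient norm
```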

  • Existence issues

    With (quasi-)separation, the unpenalized MLE may not exist (coefficients diverge). L2 regularization or bias-reduction methods (e.g., Firth's correction) restore a finite solution.
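A minimal demonstration of separation (four hand-picked points that the rule y = 1{x > 0} classifies perfectly): as the L2 penalty is relaxed, the fitted coefficient keeps growing, reflecting a divergent unpenalized MLE.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separated data: y = 1 exactly when x > 0
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])

# Relax the penalty step by step; any fixed L2 penalty keeps the
# coefficient finite, but it grows without bound as C increases
coefs = [LogisticRegression(C=C).fit(X, y).coef_[0, 0]
         for C in [1.0, 100.0, 10000.0]]
```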

LR = LogisticRegression().fit(X_train, y_train)
yhat = LR.predict(X_test)
yhat_prob = LR.predict_proba(X_test)

Feature Importance

coefficients = pd.Series(LR.coef_[0], index=churn_df.columns[:-1])
coefficients.sort_values().plot(kind='barh')
plt.title("Feature Coefficients in Logistic Regression Churn Model")
plt.xlabel("Coefficient Value")
plt.show()

Performance Evaluation

log_loss(y_test, yhat_prob)

The log loss is a measure of how well the model’s predicted probabilities match the actual outcomes. A lower log loss indicates better performance.

Feature Selection

Let’s experiment with adding and removing some features and assess how the model’s performance changes.

# Helper to recompute pipeline and log loss for a given feature list

def compute_log_loss_for(features):
    # Reload the full dataset so optional columns such as 'callcard' and
    # 'wireless' are available (churn_df was subset to the base fields above)
    df = pd.read_csv(url)

    # Build feature matrix and target
    X_local = np.asarray(df[features])
    y_local = np.asarray(df['churn'].astype(int))

    # Scale then split with fixed seed to match above
    X_local_norm = StandardScaler().fit_transform(X_local)
    X_tr, X_te, y_tr, y_te = train_test_split(X_local_norm, y_local, test_size=0.2, random_state=4)

    # Train and evaluate
    model = LogisticRegression()
    model.fit(X_tr, y_tr)
    y_proba = model.predict_proba(X_te)
    return log_loss(y_te, y_proba)

# Sanity check: baseline features from earlier section
base_features = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip']
base_log_loss = compute_log_loss_for(base_features)
base_log_loss

0.6257718410257236

#add 'wireless'
features_b = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'wireless']
log_loss_b = compute_log_loss_for(features_b)

#add both 'callcard' and 'wireless'
features_c = ['tenure', 'age', 'address', 'income', 'ed', 'employ', 'equip', 'callcard', 'wireless']
log_loss_c = compute_log_loss_for(features_c)

#remove 'equip'
features_d = ['tenure', 'age', 'address', 'income', 'ed', 'employ']
log_loss_d = compute_log_loss_for(features_d)

#remove 'income' and 'employ'
features_e = ['tenure', 'age', 'address', 'ed', 'equip']
log_loss_e = compute_log_loss_for(features_e)

pd.DataFrame({
    'Features': ['Base', 'Base + wireless', 'Base + callcard + wireless', 'Base - equip', 'Base - income - employ'],
    'Log Loss': [base_log_loss, log_loss_b, log_loss_c, log_loss_d, log_loss_e]
}).sort_values('Log Loss')

The best performance is obtained by removing the 'equip' feature from the base set, which shows that adding features does not necessarily improve the model.