Building Predictive Models with Python

Katlego Morwamohube
Software Developer & Security Specialist

Predictive modeling transforms raw data into actionable insights. Whether forecasting sales, predicting equipment failures, or classifying customer behavior, the workflow remains consistent. Let's walk through it.

1. Data Collection: Gather historical data from databases, APIs, or files.
2. Data Cleaning: Handle missing values, outliers, and inconsistencies.
3. Feature Engineering: Create meaningful variables that improve model performance.
4. Model Training: Select algorithms and train on prepared data.
5. Evaluation: Test performance using validation sets and metrics.
6. Deployment: Put the model into production with monitoring.

Data Preprocessing with Pandas

Real-world data is messy. Pandas makes it manageable:

Data Cleaning Pipeline
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load data
df = pd.read_csv('dataset.csv')

# Handle missing values (median is robust to outliers)
imputer = SimpleImputer(strategy='median')
df['age'] = imputer.fit_transform(df[['age']])

# Encode categorical variables
# (LabelEncoder is fine for tree models; prefer OneHotEncoder for linear models)
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

# Create new features before scaling, so derived values are built from raw units
df['total_value'] = df['price'] * df['quantity']
df['age_group'] = pd.cut(df['age'], bins=[0, 25, 50, 100],
                         labels=['young', 'middle', 'senior'])

# Feature scaling (in a real pipeline, fit the scaler on training data only
# to avoid leaking test-set statistics)
scaler = StandardScaler()
df[['price', 'quantity', 'total_value']] = scaler.fit_transform(
    df[['price', 'quantity', 'total_value']]
)
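To keep train/test leakage out of the preprocessing itself, the same steps can be bundled into a scikit-learn ColumnTransformer that is fit once on training data. A minimal sketch, using a toy frame with illustrative column names matching the snippet above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for dataset.csv (column names are illustrative)
df = pd.DataFrame({
    'age': [25, None, 47, 62],
    'category': ['a', 'b', 'a', 'c'],
    'price': [10.0, 12.5, 9.0, 20.0],
    'quantity': [1, 3, 2, 5],
})

numeric = ['age', 'price', 'quantity']
categorical = ['category']

# Each branch imputes/encodes/scales its own columns; fitting happens once,
# on training data only, so held-out rows never leak into the statistics
preprocess = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='median')),
        ('scale', StandardScaler()),
    ]), numeric),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 3 scaled numeric + 3 one-hot columns
```

A transformer built this way can be dropped straight into a Pipeline with the model, so `fit` and `predict` apply identical preprocessing automatically.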
Feature Engineering Tip

The best features often come from domain knowledge. A data scientist who understands the business context will outperform one who only knows algorithms.
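As an illustration of domain-driven features, consider a hypothetical order history: the business insight that *recency* and *frequency* of purchases matter more than raw spend translates directly into aggregated features (the column names here are invented for the example):

```python
import pandas as pd

# Hypothetical order history for three customers
orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 2, 3],
    'order_value': [50.0, 75.0, 20.0, 30.0, 25.0, 200.0],
    'days_ago':    [400, 10, 5, 60, 3, 2],
})

# Classic RFM-style features: one row per customer
features = orders.groupby('customer_id').agg(
    recency=('days_ago', 'min'),        # days since most recent order
    frequency=('order_value', 'size'),  # number of orders
    monetary=('order_value', 'sum'),    # total spend
).reset_index()

print(features)
```

Three derived columns like these often carry more signal for churn or repeat-purchase models than the raw transaction log ever would.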

Model Selection & Training

Start simple, then add complexity. A baseline model reveals whether your problem is even predictable:

Complete ML Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib

# Prepare data (all feature columns must be numeric at this point)
X = df.drop('target', axis=1)
y = df['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train model
model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

# Cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")

# Save model
joblib.dump(model, 'model.pkl')

Choosing the Right Algorithm
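One practical way to choose is to score a few candidates against a trivial baseline on the same cross-validation folds, and keep extra complexity only when it pays off. A sketch on synthetic data (the candidate set here is illustrative, not exhaustive):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

candidates = {
    'baseline (majority class)': DummyClassifier(strategy='most_frequent'),
    'logistic regression': LogisticRegression(max_iter=1000),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same folds for every model keeps the comparison fair
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in candidates.items()}

for name, score in results.items():
    print(f'{name}: {score:.3f}')
```

If a model barely beats the majority-class baseline, the problem (or the features) needs rethinking before any hyperparameter tuning.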

Model Deployment

A model that lives only in a Jupyter notebook provides zero business value. Deploy it:

Flask API for Model Serving
from flask import Flask, request, jsonify
import joblib
import pandas as pd

app = Flask(__name__)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    if not data:
        return jsonify({'status': 'error', 'message': 'no JSON body'}), 400

    df = pd.DataFrame([data])
    prediction = model.predict(df)[0]
    probability = model.predict_proba(df)[0].max()

    return jsonify({
        'prediction': int(prediction),
        'confidence': float(probability),
        'status': 'success'
    })

if __name__ == '__main__':
    # debug=True is for local development only; serve with gunicorn
    # or another WSGI server in production
    app.run(debug=True)
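Before deploying, it helps to exercise the endpoint without standing up a server. A self-contained sketch using Flask's built-in test client: it trains a throwaway model in-process instead of loading model.pkl, and the route is simplified to a bare `features` list, so the details differ from the app above:

```python
from flask import Flask, request, jsonify
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model trained in memory, so the example needs no files on disk
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    row = request.get_json()['features']
    return jsonify({'prediction': int(model.predict([row])[0])})

# The test client drives the route directly, with no running server
client = app.test_client()
resp = client.post('/predict', json={'features': [0.1, 0.2, 0.3, 0.4]})
print(resp.get_json())
```

The same pattern slots neatly into a pytest suite, so the API contract is checked on every commit.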

For production, containerize with Docker, deploy to AWS/GCP, and implement monitoring to detect model drift over time.
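One common way to quantify drift is the Population Stability Index (PSI), which compares the distribution of a feature at training time against live traffic. A minimal sketch; the 0.1/0.25 thresholds are conventional rules of thumb, not part of any library used here:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training sample and live data for one feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)       # live data matching training
shifted = rng.normal(0.5, 1, 10_000)  # live data that has drifted

print(population_stability_index(train, same))     # small: no drift
print(population_stability_index(train, shifted))  # larger: drift signal
```

Logging a PSI per feature on a schedule, and alerting when it crosses a threshold, is a simple first line of defense before reaching for a dedicated monitoring platform.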