1. Business Context & Problem Definition
Business Context
Brain tumors are abnormal growths within brain tissue that can significantly impact neurological function and overall health. Accurate and timely diagnosis is critical for treatment planning and clinical decision-making.
MRI scans are widely used for detecting tumors, but different tumor types (glioma, meningioma, pituitary) may exhibit visually similar patterns. Additionally, scan orientation and image quality can vary.
Manual interpretation is:
- Time-consuming
- Subject to inter-observer variability
- Prone to misclassification
Therefore, an automated deep learning solution can support consistent and reliable classification of MRI brain scans.
Objective
The objectives of this project are to:
- Classify MRI images into four categories:
  - Glioma
  - Meningioma
  - Pituitary tumor
  - No tumor (healthy)
- Compare artificial neural networks (ANN) with convolutional neural networks (CNN).
- Evaluate the impact of:
  - Model optimization
  - Transfer learning
  - Fine-tuning
  - Data augmentation
- Select the best-performing model by validation macro-F1 score and confirm it with the test macro-F1 score.
Project Workflow Summary
This project follows a complete deep learning workflow:
- Inspect and organize a four-class MRI image dataset.
- Apply preprocessing steps to standardize image size, remove border artifacts, and normalize pixel values.
- Train multiple neural network models, beginning with ANN baselines and progressing to CNN-based transfer learning.
- Compare models using validation macro-F1 score and final test performance.
- Select the best model based on generalization performance, interpretability of results, and suitability for medical imaging classification.
The project emphasizes both technical model performance and practical concerns such as dataset quality, artifact risk, and class-level evaluation.
import os, random, math
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.models import vgg16, VGG16_Weights
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, classification_report
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DATA_DIR = Path(r"c:\Users\13015\Desktop\dataset")
CLASS_NAMES = ["glioma", "meningioma", "pituitary", "notumor"]
CLASS_TO_IDX = {c:i for i,c in enumerate(CLASS_NAMES)}
IDX_TO_CLASS = {v:k for k,v in CLASS_TO_IDX.items()}
IMG_SIZE = 224
CROP_MARGIN = 12
BATCH_SIZE = 32
NUM_WORKERS = 2
def build_index(data_dir: Path, class_names):
    rows = []
    for cls in class_names:
        cls_dir = data_dir / cls
        if not cls_dir.exists():
            raise FileNotFoundError(f"Missing class folder: {cls_dir}")
        for fp in cls_dir.rglob("*"):
            if fp.suffix.lower() in [".jpg", ".jpeg", ".png", ".bmp"]:
                rows.append({"filepath": str(fp), "label": cls, "y": CLASS_TO_IDX[cls]})
    df = pd.DataFrame(rows)
    return df
df = build_index(DATA_DIR, CLASS_NAMES)
print("Total images:", len(df))
print(df["label"].value_counts())
df.head()
train_df, temp_df = train_test_split(
    df, test_size=0.30, stratify=df["y"], random_state=SEED
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.50, stratify=temp_df["y"], random_state=SEED
)
print("Train:", len(train_df), train_df["label"].value_counts().to_dict())
print("Val:", len(val_df), val_df["label"].value_counts().to_dict())
print("Test:", len(test_df), test_df["label"].value_counts().to_dict())
Total images: 2895
label
notumor 840
pituitary 750
meningioma 660
glioma 645
Name: count, dtype: int64
Train: 2026 {'notumor': 588, 'pituitary': 525, 'meningioma': 462, 'glioma': 451}
Val: 434 {'notumor': 126, 'pituitary': 112, 'meningioma': 99, 'glioma': 97}
Test: 435 {'notumor': 126, 'pituitary': 113, 'meningioma': 99, 'glioma': 97}
class_counts = df["label"].value_counts().reindex(CLASS_NAMES)
plt.figure(figsize=(7, 4))
plt.bar(class_counts.index, class_counts.values)
plt.title("Class Distribution Across MRI Dataset")
plt.xlabel("Class")
plt.ylabel("Number of Images")
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()
class_summary = pd.DataFrame({
"Class": class_counts.index,
"Image Count": class_counts.values,
"Percentage": (class_counts.values / class_counts.values.sum() * 100).round(2)
})
class_summary
| | Class | Image Count | Percentage |
|---|---|---|---|
| 0 | glioma | 645 | 22.28 |
| 1 | meningioma | 660 | 22.80 |
| 2 | pituitary | 750 | 25.91 |
| 3 | notumor | 840 | 29.02 |
Class Distribution Interpretation
The dataset contains four MRI image categories: glioma, meningioma, pituitary tumor, and no tumor. The class distribution is moderately balanced, with the no-tumor class containing the highest number of images and glioma containing the fewest.
Because the dataset is not perfectly balanced, macro-F1 score is used as a primary evaluation metric. Macro-F1 treats each class equally and provides a better measure of performance across all tumor categories than accuracy alone.
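To make the difference between accuracy and macro-F1 concrete, the sketch below computes both by hand on a small made-up example. The label lists are illustrative and not taken from the dataset, and the helper names per_class_f1 and macro_f1 are hypothetical; the notebook itself relies on scikit-learn's f1_score with average="macro".

```python
def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1} computed from paired true/predicted labels."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Made-up example: 8 "notumor" scans, 2 "glioma" scans,
# and a classifier that always predicts the majority class.
LABELS = ["notumor", "glioma"]
y_true = ["notumor"] * 8 + ["glioma"] * 2
y_pred = ["notumor"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.3f}")                          # accuracy: 0.800
print(f"macro-F1: {macro_f1(y_true, y_pred, LABELS):.3f}")  # macro-F1: 0.444
```

Accuracy looks acceptable here because the majority class dominates, while macro-F1 drops sharply because the missed class contributes an F1 of zero with the same weight as the majority class.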
def show_one_sample_per_class(df, class_names):
    fig, axes = plt.subplots(1, len(class_names), figsize=(14, 4))
    for ax, cls in zip(axes, class_names):
        sample_path = df[df["label"] == cls].sample(1, random_state=SEED)["filepath"].iloc[0]
        img = Image.open(sample_path).convert("RGB")
        ax.imshow(img)
        ax.set_title(cls)
        ax.axis("off")
    plt.suptitle("Representative MRI Image From Each Class")
    plt.tight_layout()
    plt.show()
show_one_sample_per_class(df, CLASS_NAMES)
Sample Image Overview
Representative images from each class show that the dataset contains meaningful visual differences between tumor and non-tumor scans. However, several challenges are also visible, including differences in scan orientation, brightness, cropping, and image quality.
These visual differences support the use of convolutional neural networks, which are better suited than flattened ANN models for learning spatial patterns in image data.
Data Inspection
- Images vary in anatomical plane (axial, sagittal, coronal).
- Minor artifacts are present in some images.
- Brightness and intensity vary across scans.
This variability supports the need for:
- Proper normalization
- CNN-based spatial learning
- Potential data augmentation
Dataset Bias & Artifact Risk Analysis
Dataset inspection shows that several images contain watermark and metadata artifacts near the image borders. To prevent the model from learning spurious correlations from these markings, a uniform border crop is applied to all images before resizing.
Some scans exhibit motion blur or acquisition artifacts. Because these are not label-specific and occur across scans, we treat them as natural variability and use augmentation and standardized preprocessing to build robustness.
class BrainMRIDataset(Dataset):
    def __init__(self, df: pd.DataFrame, transform=None):
        self.df = df.reset_index(drop=True)
        self.transform = transform
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = Image.open(row["filepath"]).convert("RGB")
        y = int(row["y"])
        if self.transform:
            img = self.transform(img)
        return img, y
def border_crop_pil(img: Image.Image, margin: int):
    w, h = img.size
    return img.crop((margin, margin, w - margin, h - margin))
base_transform = transforms.Compose([
    transforms.Lambda(lambda im: border_crop_pil(im, CROP_MARGIN)),
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(), # [0,1]
    # For VGG16 pretrained weights, use ImageNet normalization:
    transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])
aug_transform = transforms.Compose([
    transforms.Lambda(lambda im: border_crop_pil(im, CROP_MARGIN)),
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.90, 1.00)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])
train_ds = BrainMRIDataset(train_df, transform=base_transform)
val_ds = BrainMRIDataset(val_df, transform=base_transform)
test_ds = BrainMRIDataset(test_df, transform=base_transform)
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True,
num_workers=NUM_WORKERS, pin_memory=True)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False,
num_workers=NUM_WORKERS, pin_memory=True)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False,
num_workers=NUM_WORKERS, pin_memory=True)
Visual Inspection of Raw MRI Images
Before modeling, raw images are inspected to understand variation across classes. This step helps identify potential issues such as inconsistent image orientation, brightness differences, artifacts, and border markings.
Visual inspection is especially important in medical image classification because a model may accidentally learn non-medical artifacts if they are correlated with a class label.
def show_samples(df_subset, n=8, title="Samples"):
    sample = df_subset.sample(n=min(n, len(df_subset)), random_state=SEED)
    fig, axes = plt.subplots(2, math.ceil(n/2), figsize=(12,5))
    axes = axes.ravel()
    for ax, (_, r) in zip(axes, sample.iterrows()):
        img = Image.open(r["filepath"]).convert("RGB")
        ax.imshow(img, cmap="gray")
        ax.set_title(r["label"])
        ax.axis("off")
    plt.suptitle(title)
    plt.tight_layout()
    plt.show()
for cls in CLASS_NAMES:
    show_samples(df[df["label"]==cls], n=6, title=f"{cls} (raw samples)")
def get_image_sizes(sample_df, max_n=400):
    samp = sample_df.sample(n=min(max_n, len(sample_df)), random_state=SEED)
    widths, heights = [], []
    for fp in samp["filepath"]:
        img = Image.open(fp)
        w, h = img.size
        widths.append(w)
        heights.append(h)
    return np.array(widths), np.array(heights)
w, h = get_image_sizes(df, max_n=600)
plt.figure(figsize=(6,4))
plt.hist(w, bins=30)
plt.title("Image Width Distribution")
plt.xlabel("width"); plt.ylabel("count")
plt.show()
plt.figure(figsize=(6,4))
plt.hist(h, bins=30)
plt.title("Image Height Distribution")
plt.xlabel("height"); plt.ylabel("count")
plt.show()
raw_tensor_transform = transforms.Compose([
    transforms.Lambda(lambda im: border_crop_pil(im, CROP_MARGIN)),
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(), # [0,1]
])
def mean_rgb_distribution(sample_df, max_n=400):
    samp = sample_df.sample(n=min(max_n, len(sample_df)), random_state=SEED)
    means = []
    for fp in samp["filepath"]:
        img = Image.open(fp).convert("RGB")
        t = raw_tensor_transform(img) # (3,H,W)
        means.append(t.view(3,-1).mean(dim=1).numpy())
    means = np.vstack(means)
    return means
means = mean_rgb_distribution(df, max_n=600) # shape (N,3)
plt.figure(figsize=(6,4))
plt.hist(means[:,0], bins=30)
plt.title("Mean Pixel Value Distribution - R channel")
plt.xlabel("mean R"); plt.ylabel("count")
plt.show()
plt.figure(figsize=(6,4))
plt.hist(means[:,1], bins=30)
plt.title("Mean Pixel Value Distribution - G channel")
plt.xlabel("mean G"); plt.ylabel("count")
plt.show()
plt.figure(figsize=(6,4))
plt.hist(means[:,2], bins=30)
plt.title("Mean Pixel Value Distribution - B channel")
plt.xlabel("mean B"); plt.ylabel("count")
plt.show()
def show_preprocessing_effect(df_subset, n=4):
    sample = df_subset.sample(n=min(n, len(df_subset)), random_state=SEED)
    fig, axes = plt.subplots(n, 2, figsize=(8, 3 * n))
    for i, (_, row) in enumerate(sample.iterrows()):
        raw_img = Image.open(row["filepath"]).convert("RGB")
        cropped_img = border_crop_pil(raw_img, CROP_MARGIN).resize((IMG_SIZE, IMG_SIZE))
        axes[i, 0].imshow(raw_img)
        axes[i, 0].set_title(f"Raw Image: {row['label']}")
        axes[i, 0].axis("off")
        axes[i, 1].imshow(cropped_img)
        axes[i, 1].set_title("Cropped + Resized")
        axes[i, 1].axis("off")
    plt.suptitle("Effect of Border Cropping and Resizing")
    plt.tight_layout()
    plt.show()
show_preprocessing_effect(df, n=4)
Preprocessing Effect
The preprocessing step applies a consistent border crop and resize operation to every image. The border crop helps reduce the risk that the model will learn from image-frame artifacts, labels, or scanner overlays rather than anatomical features.
Resizing all images to a fixed 224×224 format also ensures compatibility with VGG16, which expects standardized image dimensions.
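The border crop removes a fixed number of pixels per side, so its effect depends on the original image size. The following sketch works through that arithmetic and adds a guard for images too small to crop. The helper names crop_box and kept_fraction, the guard itself, and the example sizes 512×512 and 224×224 are assumptions for illustration; the notebook's border_crop_pil performs no such check.

```python
def crop_box(width, height, margin):
    """Return the (left, upper, right, lower) box for a uniform border crop."""
    if margin < 0:
        raise ValueError("margin must be non-negative")
    if width <= 2 * margin or height <= 2 * margin:
        raise ValueError(f"image {width}x{height} is too small for margin {margin}")
    return (margin, margin, width - margin, height - margin)


def kept_fraction(width, height, margin):
    """Fraction of the original pixel area that survives the crop."""
    left, upper, right, lower = crop_box(width, height, margin)
    return (right - left) * (lower - upper) / (width * height)


CROP_MARGIN = 12  # same margin as in the notebook

# Example sizes are assumed, not measured from the dataset.
print(crop_box(512, 512, CROP_MARGIN))                 # (12, 12, 500, 500)
print(f"{kept_fraction(512, 512, CROP_MARGIN):.3f}")   # 0.908
print(f"{kept_fraction(224, 224, CROP_MARGIN):.3f}")   # 0.797
```

A fixed margin therefore trims proportionally more from small images than from large ones, which is worth keeping in mind when the width and height histograms show a wide spread.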
Pixel Distribution
We examine:
- Mean RGB channel intensities
- Intensity normalization
Normalization is applied using ImageNet statistics to align with pretrained VGG16 expectations.
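As a worked example of what this normalization does to a single pixel, the sketch below applies (x - mean) / std per channel and then inverts it. The mean and standard deviation values are the ones used in the notebook's transforms; the helper names normalize_pixel and denormalize_pixel are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Apply per-channel (x - mean) / std to one RGB pixel in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def denormalize_pixel(rgb_norm):
    """Invert normalize_pixel, e.g. before displaying an image."""
    return tuple(v * s + m for v, m, s in zip(rgb_norm, IMAGENET_MEAN, IMAGENET_STD))


black = normalize_pixel((0.0, 0.0, 0.0))
gray = normalize_pixel((0.5, 0.5, 0.5))
white = normalize_pixel((1.0, 1.0, 1.0))

for name, px in [("black", black), ("gray", gray), ("white", white)]:
    print(name, tuple(round(v, 3) for v in px))

# Round trip: denormalizing recovers the original pixel (up to float error).
restored = denormalize_pixel(gray)
print(tuple(round(v, 3) for v in restored))
```

Because the three channels use different statistics, a grayscale MRI pixel replicated across R, G, and B ends up with three slightly different normalized values. That is expected when matching the input convention of ImageNet-pretrained weights.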
Train, Validation, and Test Strategy
The dataset is split into training, validation, and test sets using stratified sampling. Stratification preserves the original class distribution across each split, reducing the risk that one class becomes underrepresented during training or evaluation.
The validation set is used to compare models and select the best-performing configuration. The test set is reserved for final evaluation only, providing a more realistic estimate of how the selected model performs on unseen data.
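The idea behind stratified splitting can be sketched in a few lines of plain Python: shuffle the indices of each class separately and cut each class at the same fractions. This is a simplified illustration, not scikit-learn's implementation (the notebook uses train_test_split with stratify), so exact counts may differ slightly. The function name stratified_split is hypothetical; the class counts are the ones reported for this dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.70, val_frac=0.15, seed=42):
    """Return (train, val, test) index lists with class proportions preserved.

    Whatever remains after the train and validation cuts goes to the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * train_frac)
        n_val = round(len(idxs) * val_frac)
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test


# Class counts as reported in the dataset summary.
counts = {"glioma": 645, "meningioma": 660, "pituitary": 750, "notumor": 840}
labels = [name for name, n in counts.items() for _ in range(n)]

train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))

# The three index sets must not overlap, otherwise evaluation data leaks into training.
assert not (set(train_idx) & set(val_idx))
assert not (set(train_idx) & set(test_idx))
assert not (set(val_idx) & set(test_idx))
```

Splitting per class keeps every category at roughly 70/15/15, which a single global shuffle does not guarantee for the smaller classes.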
split_summary = pd.DataFrame({
"Split": ["Train", "Validation", "Test"],
"Images": [len(train_df), len(val_df), len(test_df)],
"Percentage": [
round(len(train_df) / len(df) * 100, 2),
round(len(val_df) / len(df) * 100, 2),
round(len(test_df) / len(df) * 100, 2)
]
})
split_summary
| | Split | Images | Percentage |
|---|---|---|---|
| 0 | Train | 2026 | 69.98 |
| 1 | Validation | 434 | 14.99 |
| 2 | Test | 435 | 15.03 |
Split Summary Interpretation
The final dataset split follows a 70/15/15 structure. The training set is used for model learning, the validation set supports model comparison and checkpoint selection, and the test set is held out for final unbiased evaluation.
This separation helps prevent overfitting model decisions to the final test data.
def run_epoch(model, loader, criterion, optimizer=None):
    # Train when an optimizer is supplied; otherwise evaluate.
    train = optimizer is not None
    model.train(train)
    all_y, all_pred = [], []
    total_loss = 0.0
    total = 0
    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        if train:
            optimizer.zero_grad()
        # Track gradients only during training; evaluation needs no autograd graph.
        with torch.set_grad_enabled(train):
            logits = model(x)
            loss = criterion(logits, y)
        if train:
            loss.backward()
            optimizer.step()
        total_loss += loss.item() * y.size(0)
        total += y.size(0)
        preds = torch.argmax(logits, dim=1)
        all_y.append(y.detach().cpu().numpy())
        all_pred.append(preds.detach().cpu().numpy())
    all_y = np.concatenate(all_y)
    all_pred = np.concatenate(all_pred)
    acc = (all_pred == all_y).mean()
    macro_f1 = f1_score(all_y, all_pred, average="macro")
    return total_loss / total, acc, macro_f1
def train_model(model, train_loader, val_loader, epochs=10, lr=1e-3, weight_decay=0.0):
    model.to(DEVICE)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    best_val_f1 = -1
    best_state = None
    history = []
    for ep in range(1, epochs+1):
        tr_loss, tr_acc, tr_f1 = run_epoch(model, train_loader, criterion, optimizer)
        va_loss, va_acc, va_f1 = run_epoch(model, val_loader, criterion, optimizer=None)
        history.append((ep, tr_loss, tr_acc, tr_f1, va_loss, va_acc, va_f1))
        print(f"Epoch {ep:02d} | "
              f"train loss {tr_loss:.4f} acc {tr_acc:.3f} f1 {tr_f1:.3f} | "
              f"val loss {va_loss:.4f} acc {va_acc:.3f} f1 {va_f1:.3f}")
        # Keep a CPU copy of the weights with the best validation macro-F1 so far.
        if va_f1 > best_val_f1:
            best_val_f1 = va_f1
            best_state = {k: v.detach().cpu().clone() for k,v in model.state_dict().items()}
    if best_state is not None:
        model.load_state_dict(best_state)
    return model, pd.DataFrame(history, columns=["epoch","tr_loss","tr_acc","tr_f1","va_loss","va_acc","va_f1"])
print(df.shape)
print(df.columns)
print(df["y"].value_counts())
(2895, 3)
Index(['filepath', 'label', 'y'], dtype='object')
y
3    840
2    750
1    660
0    645
Name: count, dtype: int64
NUM_WORKERS = 0
train_loader = DataLoader(
train_ds,
batch_size=BATCH_SIZE,
shuffle=True,
num_workers=NUM_WORKERS,
pin_memory=False
)
val_loader = DataLoader(
val_ds,
batch_size=BATCH_SIZE,
shuffle=False,
num_workers=NUM_WORKERS,
pin_memory=False
)
print("train:", len(train_ds), "val:", len(val_ds), "test:", len(test_ds))
train: 2026 val: 434 test: 435
x, y = train_ds[0]
print(type(x), x.shape, y, IDX_TO_CLASS[y])
<class 'torch.Tensor'> torch.Size([3, 224, 224]) 0 glioma
from torch.utils.data import DataLoader
NUM_WORKERS = 0 # keep this for Windows stability
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=False)
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape, yb[:10].tolist())
torch.Size([32, 3, 224, 224]) torch.Size([32]) [2, 2, 1, 1, 3, 2, 1, 1, 2, 0]
ANN_SIZE = 64
ann_transform = transforms.Compose([
transforms.Lambda(lambda im: border_crop_pil(im, CROP_MARGIN)),
transforms.Resize((ANN_SIZE, ANN_SIZE)),
transforms.ToTensor(),
])
train_ds_ann = BrainMRIDataset(train_df, transform=ann_transform)
val_ds_ann = BrainMRIDataset(val_df, transform=ann_transform)
train_loader_ann = DataLoader(train_ds_ann, batch_size=64, shuffle=True, num_workers=0, pin_memory=False)
val_loader_ann = DataLoader(val_ds_ann, batch_size=64, shuffle=False, num_workers=0, pin_memory=False)
# sanity check
xb, yb = next(iter(train_loader_ann))
print(xb.shape, yb.shape)
torch.Size([64, 3, 64, 64]) torch.Size([64])
Baseline Model Purpose
The baseline ANN provides a simple comparison point for later CNN models. Since the ANN flattens each image into a one-dimensional vector, it does not preserve the spatial relationships between pixels.
This model is useful because it shows what performance can be achieved without convolutional feature extraction.
class SimpleANN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        in_features = ANN_SIZE * ANN_SIZE * 3
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )
    def forward(self, x):
        return self.net(x)
ann1 = SimpleANN(num_classes=4)
ann1, hist_ann1 = train_model(ann1, train_loader_ann, val_loader_ann, epochs=10, lr=1e-3)
Epoch 01 | train loss 1.0179 acc 0.523 f1 0.497 | val loss 0.7326 acc 0.668 f1 0.619
Epoch 02 | train loss 0.6937 acc 0.703 f1 0.679 | val loss 0.6331 acc 0.710 f1 0.648
Epoch 03 | train loss 0.5634 acc 0.758 f1 0.740 | val loss 0.6888 acc 0.677 f1 0.598
Epoch 04 | train loss 0.4982 acc 0.787 f1 0.769 | val loss 0.5649 acc 0.742 f1 0.698
Epoch 05 | train loss 0.4289 acc 0.812 f1 0.796 | val loss 0.4399 acc 0.818 f1 0.801
Epoch 06 | train loss 0.4275 acc 0.817 f1 0.804 | val loss 0.4337 acc 0.813 f1 0.800
Epoch 07 | train loss 0.4034 acc 0.815 f1 0.800 | val loss 0.4511 acc 0.811 f1 0.791
Epoch 08 | train loss 0.3054 acc 0.887 f1 0.877 | val loss 0.4338 acc 0.804 f1 0.780
Epoch 09 | train loss 0.2712 acc 0.888 f1 0.878 | val loss 0.3643 acc 0.857 f1 0.843
Epoch 10 | train loss 0.2580 acc 0.894 f1 0.884 | val loss 0.3772 acc 0.853 f1 0.838
Baseline ANN Interpretation
The baseline ANN reached a best validation macro-F1 of 0.843 (epoch 9), which is reasonable, but its limitations are expected for image classification. Flattening the MRI scans removes spatial information such as tumor location, edge boundaries, texture patterns, and shape relationships.
The result confirms that a more image-aware architecture is needed for this task.
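One way to see the cost of flattening is to count the weights the baseline needs. The sketch below reproduces the parameter count of the 12288 → 256 → 128 → 4 stack used by SimpleANN with plain arithmetic; the helper names linear_params and mlp_params are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


def mlp_params(layer_sizes):
    """Total parameters of a stack of fully connected layers."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


ANN_SIZE = 64
in_features = ANN_SIZE * ANN_SIZE * 3          # 64 x 64 RGB image, flattened
first_layer = linear_params(in_features, 256)
total = mlp_params([in_features, 256, 128, 4])

print(in_features)                   # 12288
print(first_layer)                   # 3145984
print(total)                         # 3179396
print(f"{first_layer / total:.3f}")  # 0.989
```

Almost all of the parameters sit in the first layer, where every pixel gets its own weight for each hidden unit. None of these weights is shared across image positions, which is the inefficiency that convolutional layers avoid.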
Optimized ANN Purpose
The optimized ANN tests whether stronger regularization and a deeper architecture can improve performance without changing the fundamental model type.
Batch normalization helps stabilize training, dropout reduces overfitting, and weight decay discourages overly complex parameter weights.
class OptimizedANN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        in_features = ANN_SIZE * ANN_SIZE * 3
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.4),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )
    def forward(self, x):
        return self.net(x)
ann2 = OptimizedANN(num_classes=4)
ann2, hist_ann2 = train_model(
ann2,
train_loader_ann,
val_loader_ann,
epochs=15,
lr=8e-4,
weight_decay=1e-4
)
Epoch 01 | train loss 0.7360 acc 0.703 f1 0.688 | val loss 0.5579 acc 0.760 f1 0.742
Epoch 02 | train loss 0.4511 acc 0.828 f1 0.816 | val loss 0.5000 acc 0.795 f1 0.791
Epoch 03 | train loss 0.3100 acc 0.882 f1 0.872 | val loss 0.3648 acc 0.857 f1 0.850
Epoch 04 | train loss 0.2369 acc 0.913 f1 0.906 | val loss 0.4237 acc 0.843 f1 0.825
Epoch 05 | train loss 0.1813 acc 0.935 f1 0.931 | val loss 0.4039 acc 0.853 f1 0.846
Epoch 06 | train loss 0.1677 acc 0.942 f1 0.937 | val loss 0.3296 acc 0.876 f1 0.866
Epoch 07 | train loss 0.1250 acc 0.958 f1 0.954 | val loss 0.3023 acc 0.882 f1 0.874
Epoch 08 | train loss 0.1263 acc 0.955 f1 0.951 | val loss 0.3897 acc 0.859 f1 0.848
Epoch 09 | train loss 0.1161 acc 0.964 f1 0.961 | val loss 0.3353 acc 0.882 f1 0.872
Epoch 10 | train loss 0.1114 acc 0.962 f1 0.959 | val loss 0.5714 acc 0.825 f1 0.804
Epoch 11 | train loss 0.0994 acc 0.967 f1 0.964 | val loss 0.4486 acc 0.850 f1 0.837
Epoch 12 | train loss 0.0881 acc 0.964 f1 0.962 | val loss 0.3289 acc 0.887 f1 0.879
Epoch 13 | train loss 0.0926 acc 0.972 f1 0.971 | val loss 0.4371 acc 0.864 f1 0.850
Epoch 14 | train loss 0.0646 acc 0.978 f1 0.976 | val loss 0.3819 acc 0.878 f1 0.870
Epoch 15 | train loss 0.0535 acc 0.982 f1 0.980 | val loss 0.7158 acc 0.825 f1 0.804
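Optimized ANN Interpretation
The optimized ANN improved over the baseline ANN (best validation macro-F1 0.879 at epoch 12, versus 0.843), suggesting that regularization and architecture tuning helped. The gap between training and validation scores still widened in later epochs (training macro-F1 0.980 versus validation 0.804 at epoch 15), so overfitting was reduced but not eliminated.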
However, the model still depends on flattened image inputs. This means it remains limited compared with CNN models that can directly learn spatial features from MRI scans.
Transfer Learning Rationale
VGG16 is a pretrained convolutional neural network originally trained on a large-scale image dataset. Although MRI scans differ from natural images, the early convolutional layers can still provide useful low-level feature detectors such as edges, textures, and shapes.
Freezing the convolutional base allows the project to test whether pretrained features alone are useful for MRI classification.
train_ds = BrainMRIDataset(train_df, transform=base_transform)
val_ds = BrainMRIDataset(val_df, transform=base_transform)
test_ds = BrainMRIDataset(test_df, transform=base_transform)
NUM_WORKERS = 0
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=False)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=False)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=False)
xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
torch.Size([32, 3, 224, 224]) torch.Size([32])
from torchvision.models import vgg16, VGG16_Weights
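def make_vgg16_frozen(num_classes=4):
    weights = VGG16_Weights.DEFAULT
    model = vgg16(weights=weights)
    # Freeze the convolutional base; the classifier layers stay trainable.
    for p in model.features.parameters():
        p.requires_grad = False
    # Replace the 1000-class ImageNet output layer with a num_classes output layer.
    model.classifier[6] = nn.Linear(4096, num_classes)
    return model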
vgg3 = make_vgg16_frozen(num_classes=4).to(DEVICE)
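# sum() over requires_grad booleans counts trainable parameter tensors (weights and biases), not individual weights
print("Trainable params:", sum(p.requires_grad for p in vgg3.parameters()))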
Trainable params: 6
total_params = sum(p.numel() for p in vgg3.parameters())
trainable_params = sum(p.numel() for p in vgg3.parameters() if p.requires_grad)
pd.DataFrame({
"Model": ["VGG16 Frozen Base"],
"Total Parameters": [total_params],
"Trainable Parameters": [trainable_params],
"Frozen Parameters": [total_params - trainable_params]
})
| | Model | Total Parameters | Trainable Parameters | Frozen Parameters |
|---|---|---|---|---|
| 0 | VGG16 Frozen Base | 134276932 | 119562244 | 14714688 |
Frozen Model Parameter Interpretation
Only the convolutional base (about 14.7 million parameters) is frozen in this configuration. Because the code freezes model.features only, the three fully connected classifier layers, including the new four-class output layer, remain trainable and account for about 119.6 million parameters.
Freezing the base reduces training time and keeps the pretrained feature detectors intact, but it may limit performance because the feature extractor cannot adapt to MRI-specific patterns.
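The counts in the table above can be reproduced by hand from the VGG16 layer layout. The sketch below assumes the standard VGG16 configuration (thirteen 3×3 convolutions followed by three fully connected layers on a 512×7×7 feature map) and the four-class output layer used here; the names VGG16_CFG, conv_base_params, and linear_params are hypothetical helpers.

```python
# Output channels of each 3x3 convolution; "M" marks a max-pooling layer.
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def conv_base_params(cfg, in_channels=3, kernel=3):
    """Weights plus biases of all convolutions in the configuration."""
    total = 0
    for item in cfg:
        if item == "M":
            continue  # pooling layers have no parameters
        total += in_channels * item * kernel * kernel + item
        in_channels = item
    return total


def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


frozen = conv_base_params(VGG16_CFG)
trainable = (linear_params(512 * 7 * 7, 4096)
             + linear_params(4096, 4096)
             + linear_params(4096, 4))   # four MRI classes

print(frozen)               # 14714688
print(trainable)            # 119562244
print(frozen + trainable)   # 134276932
```

The breakdown shows that the frozen part is the convolutional base, while the fully connected layers hold the large majority of the parameters and are all updated during training.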
vgg3, hist_vgg3 = train_model(
vgg3,
train_loader,
val_loader,
epochs=8,
lr=1e-3,
weight_decay=1e-4
)
Epoch 01 | train loss 0.8091 acc 0.778 f1 0.764 | val loss 0.3660 acc 0.889 f1 0.878
Epoch 02 | train loss 0.3481 acc 0.913 f1 0.905 | val loss 0.6017 acc 0.882 f1 0.871
Epoch 03 | train loss 0.4621 acc 0.917 f1 0.911 | val loss 0.5522 acc 0.880 f1 0.865
Epoch 04 | train loss 0.3555 acc 0.941 f1 0.936 | val loss 0.5128 acc 0.901 f1 0.889
Epoch 05 | train loss 0.2787 acc 0.949 f1 0.945 | val loss 0.5918 acc 0.910 f1 0.898
Epoch 06 | train loss 0.2128 acc 0.961 f1 0.957 | val loss 0.5753 acc 0.924 f1 0.915
Epoch 07 | train loss 0.2057 acc 0.971 f1 0.969 | val loss 0.6647 acc 0.926 f1 0.919
Epoch 08 | train loss 0.2697 acc 0.967 f1 0.965 | val loss 1.7106 acc 0.880 f1 0.868
Results:
In the training log above, the frozen VGG16 model reached a best validation macro-F1 of 0.919 (validation accuracy 0.926) at epoch 7, outperforming the optimized ANN (best validation macro-F1 0.879, accuracy 0.887). This supports the conclusion that spatial feature learning is important for tumor-type classification.
Observation:
Validation macro-F1 peaked at epoch 7. At epoch 8 the validation loss jumped from 0.6647 to 1.7106 and macro-F1 fell to 0.868, a sign of overfitting or unstable training. Training was not stopped early; instead, train_model restores the checkpoint with the best validation macro-F1.
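The difference between best-checkpoint selection and patience-based early stopping can be illustrated with the validation macro-F1 values from the frozen VGG16 log. This is a minimal sketch: best_checkpoint mirrors the selection rule in train_model, while early_stop_epoch is a hypothetical alternative that the notebook does not use.

```python
def best_checkpoint(val_scores):
    """Return (epoch, score) of the highest validation score; epochs start at 1."""
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:  # strict '>' keeps the earliest epoch on ties
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


def early_stop_epoch(val_scores, patience):
    """Epoch at which training would stop after `patience` epochs without
    improvement, or None if that never happens."""
    best, wait = float("-inf"), 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best:
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation macro-F1 per epoch, copied from the frozen VGG16 training log.
vgg_frozen_val_f1 = [0.878, 0.871, 0.865, 0.889, 0.898, 0.915, 0.919, 0.868]

print(best_checkpoint(vgg_frozen_val_f1))               # (7, 0.919)
print(early_stop_epoch(vgg_frozen_val_f1, patience=2))  # 3
print(early_stop_epoch(vgg_frozen_val_f1, patience=3))  # None
```

With a patience of 2, early stopping would have halted at epoch 3 and missed the later peak, whereas running all epochs and restoring the best checkpoint recovers epoch 7. The trade-off is extra training time.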
Fine-Tuning Rationale
Fine-tuning allows selected pretrained convolutional layers to update during training. This gives the model the ability to adapt general visual features to MRI-specific structures while still benefiting from transfer learning.
Only the later convolutional layers are fine-tuned because earlier layers usually capture more general patterns such as edges and simple textures.
Model 4A
from torchvision.models import vgg16, VGG16_Weights
def make_vgg16_ff_head(num_classes=4):
    weights = VGG16_Weights.DEFAULT
    model = vgg16(weights=weights)
    # Freeze the convolutional base.
    for p in model.features.parameters():
        p.requires_grad = False
    # Replace the original classifier with a smaller feedforward head.
    model.classifier = nn.Sequential(
        nn.Linear(25088, 1024),
        nn.ReLU(),
        nn.Dropout(0.4),
        nn.Linear(1024, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, num_classes)
    )
    return model
vgg4 = make_vgg16_ff_head(num_classes=4).to(DEVICE)
vgg4, hist_vgg4a = train_model(
vgg4,
train_loader,
val_loader,
epochs=8,
lr=1e-3,
weight_decay=1e-4
)
Epoch 01 | train loss 0.5611 acc 0.802 f1 0.787 | val loss 0.3028 acc 0.843 f1 0.831
Epoch 02 | train loss 0.1912 acc 0.930 f1 0.924 | val loss 0.4048 acc 0.859 f1 0.835
Epoch 03 | train loss 0.0988 acc 0.965 f1 0.963 | val loss 0.3071 acc 0.922 f1 0.912
Epoch 04 | train loss 0.0696 acc 0.975 f1 0.973 | val loss 0.3417 acc 0.906 f1 0.895
Epoch 05 | train loss 0.0810 acc 0.973 f1 0.971 | val loss 0.3095 acc 0.917 f1 0.911
Epoch 06 | train loss 0.0464 acc 0.985 f1 0.984 | val loss 0.2816 acc 0.922 f1 0.916
Epoch 07 | train loss 0.0524 acc 0.984 f1 0.983 | val loss 0.3532 acc 0.906 f1 0.899
Epoch 08 | train loss 0.0830 acc 0.979 f1 0.978 | val loss 0.2278 acc 0.915 f1 0.909
Model 4B
# Unfreeze the last convolutional block of VGG16 (features[24:]) for fine-tuning.
for p in vgg4.features[24:].parameters():
    p.requires_grad = True
vgg4, hist_vgg4b = train_model(
vgg4,
train_loader,
val_loader,
epochs=6,
lr=1e-4,
weight_decay=1e-5
)
Epoch 01 | train loss 0.0767 acc 0.976 f1 0.974 | val loss 0.1780 acc 0.933 f1 0.928
Epoch 02 | train loss 0.0162 acc 0.996 f1 0.995 | val loss 0.2032 acc 0.942 f1 0.937
Epoch 03 | train loss 0.0315 acc 0.990 f1 0.990 | val loss 0.3031 acc 0.926 f1 0.917
Epoch 04 | train loss 0.0068 acc 0.997 f1 0.997 | val loss 0.2296 acc 0.959 f1 0.954
Epoch 05 | train loss 0.0212 acc 0.992 f1 0.991 | val loss 0.2591 acc 0.940 f1 0.934
Epoch 06 | train loss 0.0206 acc 0.994 f1 0.993 | val loss 0.2597 acc 0.954 f1 0.950
total_params = sum(p.numel() for p in vgg4.parameters())
trainable_params = sum(p.numel() for p in vgg4.parameters() if p.requires_grad)
pd.DataFrame({
"Model": ["VGG16 + Feedforward Head + Fine-Tuning"],
"Total Parameters": [total_params],
"Trainable Parameters": [trainable_params],
"Frozen Parameters": [total_params - trainable_params]
})
| | Model | Total Parameters | Trainable Parameters | Frozen Parameters |
|---|---|---|---|---|
| 0 | VGG16 + Feedforward Head + Fine-Tuning | 40669252 | 33033988 | 7635264 |
Fine-Tuning Parameter Interpretation
Model 4B unfreezes the last convolutional block (features[24:], about 7.1 million parameters) in addition to the feedforward head, so roughly 33.0 million of the model's 40.7 million parameters are trainable. Unlike the frozen VGG16 model, part of the feature extractor can now adapt to MRI-specific image patterns, while the earlier convolutional layers keep their pretrained weights.
This balance between frozen and trainable layers helps explain why Model 4B achieved the strongest overall performance.
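To see which layers the slice features[24:] covers, the sketch below rebuilds the layer order of the VGG16 feature extractor from its configuration, assuming the torchvision layout without batch normalization (each convolution followed by a ReLU, with max-pooling between blocks). The helper names feature_layers and linear_params are hypothetical.

```python
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def feature_layers(cfg, in_channels=3):
    """Return [(layer_kind, n_params), ...] in the order of the features module."""
    layers = []
    for item in cfg:
        if item == "M":
            layers.append(("MaxPool2d", 0))
        else:
            layers.append(("Conv2d", in_channels * item * 3 * 3 + item))
            layers.append(("ReLU", 0))
            in_channels = item
    return layers


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


layers = feature_layers(VGG16_CFG)
for idx, (kind, n_params) in enumerate(layers):
    if idx >= 24:
        print(idx, kind, n_params)

unfrozen_conv = sum(n for _, n in layers[24:])
head = linear_params(25088, 1024) + linear_params(1024, 256) + linear_params(256, 4)

print(unfrozen_conv)         # 7079424
print(head + unfrozen_conv)  # 33033988
```

Under this layout, indices 24 to 30 are the three convolutions of the final block with their ReLUs and the last pooling layer, and the trainable total matches the value reported in the parameter table.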
Results:
- Validation macro-F1 ≈ 0.954 (best checkpoint, epoch 4 in the log above)
- Test macro-F1 ≈ 0.958
Interpretation:
Fine-tuning allowed the model to adapt pretrained spatial filters to domain-specific MRI features. Best validation macro-F1 rose from 0.916 (Model 4A) to 0.954, making Model 4B the best-performing model.
Confirmation:
- Spatial structure is critical in medical imaging.
- Transfer learning provides strong performance gains.
Why augmentation:
The dataset varies in orientation/anatomical plane, brightness, and minor artifacts. Augmentation is intended to improve robustness and reduce reliance on spurious cues.
Augmentation used:
Small rotations, mild scale/crop, and horizontal flips (p=0.5), applied only to training data.
Expected benefit:
Higher validation macro-F1 or improved stability compared with non-augmented fine-tuning.
aug_transform = transforms.Compose([
    transforms.Lambda(lambda im: border_crop_pil(im, CROP_MARGIN)),
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.RandomRotation(12), # mild rotation
    transforms.RandomResizedCrop(IMG_SIZE, scale=(0.92, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5), # OK for many axial/sagittal MRIs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]),
])
def show_augmented_examples(df_subset, n=4):
    sample = df_subset.sample(n=1, random_state=SEED)
    img_path = sample["filepath"].iloc[0]
    label = sample["label"].iloc[0]
    raw_img = Image.open(img_path).convert("RGB")
    fig, axes = plt.subplots(1, n + 1, figsize=(15, 4))
    axes[0].imshow(raw_img)
    axes[0].set_title(f"Original: {label}")
    axes[0].axis("off")
    for i in range(n):
        aug_img = aug_transform(raw_img)
        aug_img = aug_img.permute(1, 2, 0).numpy()
        # Undo ImageNet normalization for display
        mean = np.array([0.485, 0.456, 0.406])
        std = np.array([0.229, 0.224, 0.225])
        aug_img = std * aug_img + mean
        aug_img = np.clip(aug_img, 0, 1)
        axes[i + 1].imshow(aug_img)
        axes[i + 1].set_title(f"Augmented {i + 1}")
        axes[i + 1].axis("off")
    plt.suptitle("Example of Training-Time Data Augmentation")
    plt.tight_layout()
    plt.show()
show_augmented_examples(train_df, n=4)
Augmentation Interpretation¶
The augmentation pipeline introduces mild variation in rotation, scale, crop, and horizontal orientation. These changes are intended to improve robustness by exposing the model to realistic image variation during training.
Because medical images require anatomical consistency, augmentation is kept mild rather than aggressive. This helps avoid creating unrealistic examples that could harm model learning.
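As a rough check on how mild the crop is: in torchvision, the `scale` argument of `RandomResizedCrop` bounds the crop's area as a fraction of the input area, so the side length shrinks by the square root of that fraction. The sketch below works this out for a square crop. `IMG_SIZE = 224` is assumed from the 224 x 224 batch shape printed in this section, the helper name `square_crop_side` is illustrative, and the square-crop simplification ignores the transform's aspect-ratio jitter.

```python
import math

IMG_SIZE = 224                       # assumed from the 224 x 224 batches in this notebook
SCALE_MIN, SCALE_MAX = 0.92, 1.0     # area fractions passed to RandomResizedCrop

def square_crop_side(area_fraction, size=IMG_SIZE):
    """Side length in pixels of a square crop covering `area_fraction` of the image."""
    return size * math.sqrt(area_fraction)

min_side = square_crop_side(SCALE_MIN)
max_side = square_crop_side(SCALE_MAX)

print(f"crop side ranges from {min_side:.1f} to {max_side:.1f} px")
print(f"at most {IMG_SIZE - min_side:.1f} px trimmed per dimension")
```

For a square crop this means a side of about 215 px at the smallest setting, so only around 9 px per dimension are removed before the crop is resized back to 224 x 224.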
train_ds_aug = BrainMRIDataset(train_df, transform=aug_transform)
train_loader_aug = DataLoader(
    train_ds_aug,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0,
    pin_memory=False
)
xb, yb = next(iter(train_loader_aug))
print(xb.shape, yb.shape)
torch.Size([32, 3, 224, 224]) torch.Size([32])
vgg5 = make_vgg16_ff_head(num_classes=4).to(DEVICE)
vgg5, hist_vgg5a = train_model(
    vgg5,
    train_loader_aug,
    val_loader,
    epochs=10,
    lr=1e-3,
    weight_decay=1e-4
)
Epoch 01 | train loss 0.5702 acc 0.776 f1 0.759 | val loss 0.3293 acc 0.857 f1 0.842
Epoch 02 | train loss 0.2937 acc 0.880 f1 0.870 | val loss 0.2202 acc 0.908 f1 0.899
Epoch 03 | train loss 0.2382 acc 0.910 f1 0.903 | val loss 0.2923 acc 0.894 f1 0.881
Epoch 04 | train loss 0.1895 acc 0.930 f1 0.925 | val loss 0.2487 acc 0.903 f1 0.898
Epoch 05 | train loss 0.1793 acc 0.930 f1 0.924 | val loss 0.2453 acc 0.901 f1 0.890
Epoch 06 | train loss 0.1743 acc 0.935 f1 0.929 | val loss 0.2517 acc 0.910 f1 0.903
Epoch 07 | train loss 0.1378 acc 0.952 f1 0.948 | val loss 0.2721 acc 0.915 f1 0.905
Epoch 08 | train loss 0.1341 acc 0.945 f1 0.940 | val loss 0.1959 acc 0.926 f1 0.919
Epoch 09 | train loss 0.1615 acc 0.946 f1 0.943 | val loss 0.2472 acc 0.919 f1 0.912
Epoch 10 | train loss 0.1166 acc 0.952 f1 0.948 | val loss 0.1901 acc 0.929 f1 0.922
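# Unfreeze the last VGG16 convolutional block (features[24:]) for fine-tuning
for p in vgg5.features[24:].parameters():
    p.requires_grad = True

# Continue training with a lower learning rate and lighter weight decay
vgg5, hist_vgg5b = train_model(
    vgg5,
    train_loader_aug,
    val_loader,
    epochs=6,
    lr=1e-4,
    weight_decay=1e-5
)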
Epoch 01 | train loss 0.1350 acc 0.952 f1 0.948 | val loss 0.2241 acc 0.929 f1 0.921
Epoch 02 | train loss 0.0830 acc 0.970 f1 0.968 | val loss 0.2636 acc 0.917 f1 0.911
Epoch 03 | train loss 0.0718 acc 0.976 f1 0.974 | val loss 0.2180 acc 0.938 f1 0.932
Epoch 04 | train loss 0.0621 acc 0.979 f1 0.978 | val loss 0.4247 acc 0.889 f1 0.870
Epoch 05 | train loss 0.0842 acc 0.972 f1 0.970 | val loss 0.2595 acc 0.926 f1 0.919
Epoch 06 | train loss 0.0309 acc 0.988 f1 0.987 | val loss 0.1884 acc 0.947 f1 0.942
best5a = hist_vgg5a.sort_values("va_f1", ascending=False).head(1)
best5b = hist_vgg5b.sort_values("va_f1", ascending=False).head(1)
print("Best Model 5A:\n", best5a)
print("\nBest Model 5B:\n", best5b)
Best Model 5A:
   epoch   tr_loss    tr_acc     tr_f1   va_loss    va_acc     va_f1
9     10  0.116601  0.952122  0.948336  0.190098  0.928571  0.921538

Best Model 5B:
   epoch   tr_loss    tr_acc     tr_f1   va_loss    va_acc     va_f1
5      6  0.030946  0.988154  0.987339  0.188439  0.947005  0.941812
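Validation macro-F1 moves around quite a bit between fine-tuning epochs (0.870 at epoch 4, 0.942 at epoch 6), so it matters which epoch's weights are kept. The following is a minimal sketch of best-epoch selection using the Model 5B validation values from the log above. The variable names are illustrative, and whether `train_model` restores the best weights itself is not shown in this section.

```python
# Validation macro-F1 per fine-tuning epoch for Model 5B, copied from the training log.
val_f1_5b = [0.921, 0.911, 0.932, 0.870, 0.919, 0.942]

# Pick the epoch (1-indexed) with the highest validation macro-F1.
best_epoch = max(range(len(val_f1_5b)), key=val_f1_5b.__getitem__) + 1
best_f1 = val_f1_5b[best_epoch - 1]

# How far validation macro-F1 swings across the six epochs.
spread = max(val_f1_5b) - min(val_f1_5b)

print(f"best epoch: {best_epoch} (val macro-F1 {best_f1:.3f})")
print(f"spread across epochs: {spread:.3f}")
```

Here the last epoch happens to be the best one, so the final weights and the best-epoch weights coincide. Had training stopped at epoch 4, they would not, which is why a training loop would typically save a checkpoint (for example with `torch.save(model.state_dict(), path)`) whenever the validation score improves.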
10. Model Comparison & Final Selection¶
| Model | Validation Macro-F1 | Test Macro-F1 |
|---|---|---|
| ANN Baseline | ~0.81 | — |
| ANN Optimized | ~0.89 | — |
| VGG16 Frozen | ~0.91 | — |
| VGG16 + FF + Fine-tune (4B) | ~0.95 | 0.948 |
| VGG16 + FF + Aug + Fine-tune (5B) | ~0.94 | 0.940 |
Final Selected Model:¶
VGG16 + Feedforward Head + Fine-tuned Last Conv Block
model_results = pd.DataFrame({
    "Model": [
        "ANN Baseline",
        "Optimized ANN",
        "VGG16 Frozen",
        "VGG16 + FF + Fine-Tune",
        "VGG16 + FF + Aug + Fine-Tune"
    ],
    "Validation Macro-F1": [0.81, 0.887, 0.914, 0.952, 0.942],
    "Test Macro-F1": [np.nan, np.nan, np.nan, 0.948, 0.940]
})
model_results
| | Model | Validation Macro-F1 | Test Macro-F1 |
|---|---|---|---|
| 0 | ANN Baseline | 0.810 | NaN |
| 1 | Optimized ANN | 0.887 | NaN |
| 2 | VGG16 Frozen | 0.914 | NaN |
| 3 | VGG16 + FF + Fine-Tune | 0.952 | 0.948 |
| 4 | VGG16 + FF + Aug + Fine-Tune | 0.942 | 0.940 |
plt.figure(figsize=(9, 5))
plt.bar(model_results["Model"], model_results["Validation Macro-F1"])
plt.title("Validation Macro-F1 Comparison Across Models")
plt.xlabel("Model")
plt.ylabel("Validation Macro-F1")
plt.ylim(0.75, 1.00)
plt.xticks(rotation=35, ha="right")
plt.tight_layout()
plt.show()
Model Comparison Interpretation¶
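Model performance improved as the project moved from flattened ANN models to CNN-based transfer learning. Validation macro-F1 rose from 0.887 for the optimized ANN to 0.914 for the frozen VGG16 and 0.952 for the fine-tuned VGG16, a gain of about 0.065 overall, which supports the view that spatial feature extraction is essential for MRI classification.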
The best overall model was Model 4B, which combined a VGG16 feature extractor, a custom feedforward classifier head, and fine-tuning of later convolutional layers. Although the augmented model was also strong, it did not outperform Model 4B on the final test set.
import numpy as np
import torch
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, confusion_matrix, classification_report
def evaluate_and_report(model, loader, title="Model"):
    model.to(DEVICE)
    model.eval()
    all_y, all_pred = [], []

    with torch.no_grad():
        for x, y in loader:
            x = x.to(DEVICE)
            logits = model(x)
            preds = torch.argmax(logits, dim=1).cpu().numpy()
            all_pred.append(preds)
            all_y.append(y.numpy())

    y_true = np.concatenate(all_y)
    y_pred = np.concatenate(all_pred)
    acc = (y_true == y_pred).mean()
    macro_f1 = f1_score(y_true, y_pred, average="macro")

    print(f"{title}")
    print(f"Accuracy: {acc:.4f}")
    print(f"Macro F1: {macro_f1:.4f}\n")
    print(classification_report(y_true, y_pred, target_names=CLASS_NAMES))

    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 5))
    plt.imshow(cm)
    plt.title(f"Confusion Matrix - {title}")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.xticks(range(len(CLASS_NAMES)), CLASS_NAMES, rotation=45, ha="right")
    plt.yticks(range(len(CLASS_NAMES)), CLASS_NAMES)
    plt.colorbar()
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, cm[i, j], ha="center", va="center")
    plt.tight_layout()
    plt.show()

    return acc, macro_f1, cm

acc4, f14, cm4 = evaluate_and_report(vgg4, test_loader, title="Model 4B (VGG16 + FF + Fine-tune)")
acc5, f15, cm5 = evaluate_and_report(vgg5, test_loader, title="Model 5B (VGG16 + FF + Aug + Fine-tune)")
Model 4B (VGG16 + FF + Fine-tune)
Accuracy: 0.9517
Macro F1: 0.9483

              precision    recall  f1-score   support

      glioma       0.99      0.87      0.92        97
  meningioma       0.86      0.96      0.90        99
   pituitary       0.96      0.97      0.97       113
     notumor       1.00      0.99      1.00       126

    accuracy                           0.95       435
   macro avg       0.95      0.95      0.95       435
weighted avg       0.96      0.95      0.95       435

Model 5B (VGG16 + FF + Aug + Fine-tune)
Accuracy: 0.9448
Macro F1: 0.9401

              precision    recall  f1-score   support

      glioma       0.90      0.93      0.91        97
  meningioma       0.90      0.87      0.88        99
   pituitary       0.96      0.96      0.96       113
     notumor       1.00      1.00      1.00       126

    accuracy                           0.94       435
   macro avg       0.94      0.94      0.94       435
weighted avg       0.94      0.94      0.94       435
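The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined: macro averaging treats the four classes equally, while weighted averaging weights each class by its support. The sketch below reproduces both from the per-class F1 values of the Model 4B report above. Because those inputs are already rounded to two decimals, the results agree with the report only to about two decimals; the dictionary name `report_4b` is illustrative.

```python
# Per-class (F1, support) for Model 4B, copied from the classification report.
report_4b = {
    "glioma":     (0.92, 97),
    "meningioma": (0.90, 99),
    "pituitary":  (0.97, 113),
    "notumor":    (1.00, 126),
}

f1_scores = [f1 for f1, _ in report_4b.values()]
supports = [n for _, n in report_4b.values()]

# Macro average: plain mean over classes, regardless of class size.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: mean weighted by the number of true samples per class.
weighted_f1 = sum(f1 * n for f1, n in report_4b.values()) / sum(supports)

print(f"macro F1    ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
```

Macro-F1 is the stricter of the two here, since the weaker glioma and meningioma classes count as much as the larger, near-perfect No Tumor class.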
Confusion Matrix Insights¶
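Near-perfect classification for No Tumor (recall 0.99 for Model 4B, 1.00 for Model 5B)
Most errors involve glioma and meningioma (Model 4B: glioma recall 0.87, meningioma precision 0.86)
Pituitary tumors are strongly separable (precision and recall of 0.96 to 0.97)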
Business Insights & Recommendations¶
1. Spatial learning is critical¶
Flattened ANN models underperformed compared to CNN-based models, demonstrating that preserving spatial relationships in MRI images is essential for accurate tumor classification.
2. Transfer learning is highly effective¶
Using a pretrained VGG16 model significantly improved performance. Fine-tuning the last convolutional block allowed the model to adapt general visual features to MRI-specific patterns.
3. High reliability in detecting healthy scans¶
The final model reached a precision of 1.00 and a recall of 0.99 for the "No Tumor" class on the test set. A precision of 1.00 indicates that no tumor scan in the test set was labeled as healthy, which is the most costly error in a screening setting.
4. Most challenging distinction: Glioma vs Meningioma¶
Some misclassifications occurred between glioma and meningioma, which aligns with known clinical similarities in appearance. Additional domain-specific data or multi-sequence MRI inputs may further improve performance.
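To see where the errors concentrate, the per-class recall and support in the Model 4B report can be turned back into approximate counts of missed scans. This is a sketch under the assumption that rounding recall times support to the nearest integer recovers the true count, which the reported accuracy of 0.9517 lets us check. It does not show which class each missed scan was assigned to; that requires the confusion matrix itself.

```python
# Per-class (recall, support) for Model 4B, copied from the classification report.
recall_support = {
    "glioma":     (0.87, 97),
    "meningioma": (0.96, 99),
    "pituitary":  (0.97, 113),
    "notumor":    (0.99, 126),
}

# Estimated number of correctly classified and missed scans per class.
correct = {name: round(r * n) for name, (r, n) in recall_support.items()}
missed = {name: n - correct[name] for name, (_, n) in recall_support.items()}

total = sum(n for _, n in recall_support.values())
accuracy = sum(correct.values()) / total

for name in recall_support:
    print(f"{name:<11} correct={correct[name]:>3} missed={missed[name]:>2}")
print(f"implied accuracy: {accuracy:.4f}")
```

The implied accuracy matches the reported 0.9517, and glioma accounts for 13 of the 21 estimated errors, which is consistent with glioma being the hardest class for Model 4B.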
5. Deployment Potential¶
With ~95% accuracy and ~0.95 macro-F1 on the test set, the model shows strong potential as a clinical decision-support tool to assist radiologists in preliminary tumor screening and subtype classification.
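Because the test set has only 435 images, the headline accuracy carries sampling uncertainty. The sketch below is a rough estimate using the normal approximation to the binomial (an assumption; a bootstrap or Wilson interval would be a reasonable alternative), with the Model 4B and 5B test accuracies taken from the evaluation output above. The helper name `accuracy_interval` is illustrative.

```python
import math

N_TEST = 435  # test-set size from the classification reports

def accuracy_interval(acc, n=N_TEST, z=1.96):
    """Approximate 95% interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1.0 - acc) / n)
    return acc - z * se, acc + z * se

for name, acc in [("Model 4B", 0.9517), ("Model 5B", 0.9448)]:
    lo, hi = accuracy_interval(acc)
    print(f"{name}: {acc:.4f} (approx. 95% interval {lo:.3f} to {hi:.3f})")
```

Each interval is about two percentage points wide on either side, and the two overlap substantially. The gap between Models 4B and 5B on this test set (about 0.7 percentage points, or three images) is therefore better read as suggestive than conclusive, which is one reason external validation matters before any clinical use.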
Future improvements may include:¶
Multi-sequence MRI inputs
Attention-based models
Larger datasets
Conclusion¶
This project demonstrates that:
Flattened ANN models fall short of CNN-based models on this task (validation macro-F1 of roughly 0.81 to 0.89 versus 0.91 to 0.95).
CNN-based transfer learning with fine-tuning gives the strongest results.
Fine-tuned VGG16 achieved ~95% accuracy and ~0.95 macro-F1 on the test set.
The solution meets the objective of reliable multi-class brain tumor classification.
Portfolio Summary¶
This project demonstrates a complete applied deep learning workflow for medical image classification. The final model achieved strong multi-class performance and showed that CNN-based transfer learning is significantly more effective than traditional flattened ANN models for MRI image analysis.
The project also highlights important real-world machine learning considerations, including class balance, preprocessing consistency, artifact risk, model comparison, and final test-set evaluation.
Future improvements could include using a larger clinical dataset, testing additional CNN architectures, applying explainability methods such as Grad-CAM, and validating the model on MRI scans from external sources.