# MultiModelClassification

A multimodal classification project that combines text and image inputs to predict a target class. This repository includes ready-to-use pre-trained models, a simple application interface, and Jupyter/Python notebooks for experimentation and evaluation.
- Language composition: Jupyter Notebook (99.4%), Python (0.6%)
- Models included:
  - Text: BiLSTM + Attention (`text_classification_model.h5`)
  - Vision: ResNet50 (`image_classification_model.pt`)
  - Tokenizer: Pre-fitted tokenizer for the text pipeline (`tokenizer.joblib`)
Tip: This README includes Jekyll front matter and is optimized for the GitHub Pages Slate theme. When you enable GitHub Pages with the Slate theme, this page will render as your site’s homepage.
## Table of contents
- Features
- Repository structure
- Quickstart
- Data expectations
- Models and architecture
- Configuration
- Evaluation
- Reproducibility
- Troubleshooting
- Roadmap
- Contributing
- License
- Citations and acknowledgments
- Publish with GitHub Pages (Slate theme)
## Features
- Multimodal classification by fusing:
  - Visual signal from a ResNet50 encoder
  - Textual signal from a BiLSTM + Attention encoder
- Pre-trained weights included for fast, reproducible inference
- Simple application interface (`app.py`) to try the model end-to-end
- Notebook-first workflow (EDA, inference, and evaluation)
- Clean separation of assets (models, tokenizer) and notebooks
## Repository structure
```text
MultiModelClassification/
├── NotebooksPY/                    # Jupyter/Python notebooks for experimentation
├── app.py                          # Main application interface (run locally)
├── image_classification_model.pt   # Pretrained ResNet50 image classification model
├── text_classification_model.h5    # Pretrained BiLSTM + Attention text classification model
├── tokenizer.joblib                # Pre-fitted tokenizer for text preprocessing
├── requirements.txt                # Python dependencies
└── README.md                       # Project documentation (renders on GitHub Pages)
```
Notes:
- The included model artifacts enable immediate local inference without training.
- Large data should not be committed; point notebooks to your local data directory.
## Quickstart
1) Clone and set up an environment
```bash
git clone https://github.com/JaswanthRemiel/MultiModelClassification.git
cd MultiModelClassification

# Create and activate a virtual environment (choose one)
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
# or
conda create -n mmc python=3.10 -y && conda activate mmc
```
2) Install dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
If you need GPU acceleration for PyTorch, ensure the wheel matches your CUDA version (see PyTorch Get Started for the correct index-url). CPU-only is fine for smaller demos.
3) Run the app
The repository includes a main application interface in app.py. Depending on how the app is implemented, use one of the following:
- If it’s a standard Python script (CLI or simple server):

  ```bash
  python app.py
  ```

  Then open the printed local URL (commonly http://127.0.0.1:5000 or http://127.0.0.1:8000).
- If it’s a Streamlit app:

  ```bash
  pip install streamlit   # if not already installed
  streamlit run app.py
  ```

  Then visit the local Streamlit URL (commonly http://localhost:8501).
Tip: If the app prints usage help, run `python app.py --help`.
4) Run the notebooks
Launch Jupyter and open the notebooks inside `NotebooksPY/`:

```bash
jupyter lab
# or
jupyter notebook
```
Inside each notebook, adjust:
- Paths to your data and images
- Model checkpoint paths (defaults point to repo root)
- Batch size, device (`cpu`/`cuda`), and other hyperparameters as needed
## Data expectations
The typical multimodal row contains paired text and an image path. A common format is a CSV:
```csv
id,text,image_path,label
0001,"Short description for the image","data/images/0001.jpg","class_a"
0002,"Another description","data/images/0002.jpg","class_b"
```
Recommended layout:
```text
data/
  images/
    0001.jpg
    0002.jpg
  train.csv
  val.csv
  test.csv
```
- `image_path` can be absolute or relative to the notebook/app working directory.
- For multilabel targets, you can store a delimiter-separated string or adapt the notebook to multi-hot labels.
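A quick sanity check of a split file might look like this (a minimal sketch assuming pandas is installed and the CSV schema above; `label2id` is an illustrative name, not a repo variable):

```python
import pandas as pd
from pathlib import Path

# Load a split and verify that every referenced image exists.
df = pd.read_csv("data/train.csv")
missing = [p for p in df["image_path"] if not Path(p).is_file()]
print(f"{len(df)} rows, {len(missing)} missing images")

# Map class names to integer ids (adapt to your label set).
label2id = {name: i for i, name in enumerate(sorted(df["label"].unique()))}
df["label_id"] = df["label"].map(label2id)
```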
## Models and architecture
- Text encoder: BiLSTM + Attention
  - Pretrained weights: `text_classification_model.h5`
  - Tokenizer: `tokenizer.joblib` (ensure you use this tokenizer for consistent preprocessing)
- Image encoder: ResNet50
  - Pretrained weights: `image_classification_model.pt`
- Fusion/classifier:
  - Fusion and classification occur in the application/notebooks. Typical approaches include:
    - Concatenation of pooled text and image embeddings + MLP head (see the sketch below)
    - Late fusion (e.g., weighted voting) for quick baselines
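A minimal sketch of the concat + MLP approach (the embedding sizes, hidden width, and class count here are assumptions, not the repo's actual values; the real fusion lives in the app/notebooks):

```python
import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    """Concatenate pooled text and image embeddings, then classify with an MLP."""

    def __init__(self, text_dim=256, image_dim=2048, hidden_dim=512, num_classes=10):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, text_emb, image_emb):
        fused = torch.cat([text_emb, image_emb], dim=-1)  # (batch, text_dim + image_dim)
        return self.mlp(fused)

# Example: logits for a batch of 4 text/image embedding pairs
head = ConcatFusionHead()
logits = head(torch.randn(4, 256), torch.randn(4, 2048))
```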
Tips:
- Keep the tokenizer and text model versions in sync (see the loading sketch below).
- If you fine-tune or retrain, save new checkpoints under a `models/` folder and update paths in the app/notebooks.
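Loading the bundled artifacts might look like this (a sketch; whether the `.pt` file stores a pickled module or a state dict, and whether the Keras model needs `custom_objects` for its attention layer, depends on how they were saved):

```python
import joblib
import torch
from tensorflow import keras

# Pre-fitted tokenizer for the text pipeline (keep in sync with the text model)
tokenizer = joblib.load("tokenizer.joblib")

# BiLSTM + Attention text model; a custom attention layer may require
# keras.models.load_model(..., custom_objects={...}) depending on how it was saved.
text_model = keras.models.load_model("text_classification_model.h5")

# ResNet50 image model; on PyTorch >= 2.6 a pickled module also needs
# weights_only=False, and a plain state dict must be loaded into a model instead.
image_model = torch.load("image_classification_model.pt", map_location="cpu")
image_model.eval()
```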
## Configuration
Common configuration points (set in notebooks or at the top of `app.py`; a sketch follows this list):
- Paths: data root, models, tokenizer
- Device: `cpu` or `cuda`
- Inference settings: batch size, max tokens, image resize/crop dimensions
- Label mapping: class names and ids
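For instance, a small constants block (names and defaults here are illustrative, not the repo's actual variables):

```python
from pathlib import Path

import torch

# Paths (illustrative defaults pointing at the repo root)
DATA_ROOT = Path("data")
TEXT_MODEL_PATH = Path("text_classification_model.h5")
IMAGE_MODEL_PATH = Path("image_classification_model.pt")
TOKENIZER_PATH = Path("tokenizer.joblib")

# Device and inference settings
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH_SIZE = 32
MAX_TOKENS = 128          # max sequence length for the text pipeline
IMAGE_SIZE = (224, 224)   # common ResNet50 input size

# Label mapping (hypothetical classes; replace with your own)
ID2LABEL = {0: "class_a", 1: "class_b"}
```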
## Evaluation
Typical metrics for classification:
- Accuracy, Precision, Recall, F1 (macro/micro/weighted)
- ROC-AUC (binary, or one-vs-rest for multiclass)
- Confusion matrix and per-class breakdown
Most notebooks include cells to compute and visualize these. Save your runs’ metrics to CSV/JSON for comparison across configurations.
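A typical evaluation cell (a sketch; `y_true` and `y_pred` stand in for the outputs of your inference loop):

```python
import json

from sklearn.metrics import classification_report, confusion_matrix

# Replace these with the labels produced by your inference loop.
y_true = ["class_a", "class_b", "class_a", "class_b"]
y_pred = ["class_a", "class_b", "class_b", "class_b"]

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=["class_a", "class_b"]))

# Persist metrics as JSON for comparison across runs/configurations.
report = classification_report(y_true, y_pred, output_dict=True)
with open("metrics_run1.json", "w") as f:
    json.dump(report, f, indent=2)
```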
## Reproducibility
Set seeds and control non-determinism when benchmarking:
```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)
```
Keep notes of:
- Exact model/tag names and versions
- Dataset splits and preprocessing
- Environment details (Python, PyTorch, and CUDA/driver versions)
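A quick way to capture most of these from Python (a sketch):

```python
import platform

import torch

print("Python:", platform.python_version())
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)        # None on CPU-only builds
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```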
## Troubleshooting
- CUDA / GPU issues:
  - Ensure the installed PyTorch build matches your CUDA driver
  - Try CPU-only wheels if you don’t need GPU or are testing quickly
- Out-of-memory:
  - Reduce batch size, image resolution, or use mixed precision (AMP; see the sketch after this list)
- Tokenization mismatches:
  - Always use the provided `tokenizer.joblib` with `text_classification_model.h5`
- File not found:
  - Verify relative paths from the current working directory (Jupyter vs. app differs)
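For the AMP tip above, mixed-precision inference on CUDA can look like this (a sketch; the model and input are stand-ins, not the repo's actual objects):

```python
import torch

# Stand-ins for the real image model and a (batch, ...) input tensor on the GPU.
model = torch.nn.Linear(10, 2).cuda()
inputs = torch.randn(4, 10).cuda()

model.eval()
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(inputs)  # forward pass runs in float16 where safe
```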
## Roadmap
- Add configurable fusion strategies (concat + MLP, gated fusion, cross-attention)
- Provide a Colab notebook for zero-setup demos
- Add model cards and training notes for included checkpoints
- Optional CLI for batch inference on folders/CSVs
## Contributing
Contributions are welcome! You can:
- Open issues for bugs or feature requests
- Submit PRs that improve the notebooks, app UX, or documentation
- Share datasets/configs that highlight new use cases
Please include a clear description, reproduction steps (if applicable), and relevant logs/screenshots.
## Citations and acknowledgments
If you use this project, please consider citing the libraries and models you rely on:
- PyTorch
- torchvision
- scikit-learn
- Keras/TensorFlow (for the BiLSTM + Attention implementation if applicable)
- Any pre-trained backbones and datasets used in your experiments