Gemma Model Benchmark Suite

Goal

To democratize Large Language Model (LLM) analysis by providing an open-source evaluation toolkit that runs on accessible hardware, offering deep customization with minimal compute requirements.

Key Features

  • Broad Evaluation Scope: Evaluate 4+ LLM families (including Gemma, LLaMA 2, and Phi-2) across 5+ standard tasks (including MMLU) and 15+ metrics, including BLEU, ROUGE, BERTScore, and Toxicity (a metric-computation sketch follows this list).
  • Extreme Customization: The entire pipeline is configurable via YAML files, so users can easily plug in their own models, datasets, evaluation logic, and prompt templates (see the configuration sketch below).
  • Low-Resource Optimization: Engineered to run efficiently on free-tier hardware (such as Google Colab T4 GPUs) through 4-bit quantization and batch-size tuning.
  • Interactive Dashboard: Includes a Streamlit dashboard for visually exploring, comparing, and analyzing model results (see the dashboard sketch below).
  • Robustness and Quality: Built from modular components, with Pydantic-validated configuration and 73% Pytest coverage to ensure reliability.
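
To illustrate the metric layer described above, the snippet below computes ROUGE and BERTScore with the Hugging Face evaluate library. This is a minimal sketch: the use of evaluate (rather than the suite's own metric wrappers) and the example strings are assumptions.

    # Minimal metric-computation sketch using the Hugging Face "evaluate" library.
    # The toolkit may wrap these differently; this only shows the general pattern.
    import evaluate

    predictions = ["The cat sat on the mat."]
    references = ["A cat was sitting on the mat."]

    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")

    rouge_scores = rouge.compute(predictions=predictions, references=references)
    bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

    print("ROUGE-L:", rouge_scores["rougeL"])
    print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))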
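
The configuration sketch below shows one way a Pydantic-validated YAML config could be loaded; the field names and defaults are illustrative assumptions, not the suite's actual schema.

    # Sketch of loading and validating a run configuration from YAML with Pydantic.
    # Field names below are hypothetical placeholders.
    import yaml
    from pydantic import BaseModel, Field

    class BenchmarkConfig(BaseModel):
        model_name: str = "google/gemma-2b"
        tasks: list[str] = Field(default_factory=lambda: ["mmlu"])
        metrics: list[str] = Field(default_factory=lambda: ["rouge", "bertscore"])
        num_samples: int = 500
        load_in_4bit: bool = True
        batch_size: int = 8

    def load_config(path: str) -> BenchmarkConfig:
        with open(path) as f:
            raw = yaml.safe_load(f)
        return BenchmarkConfig(**raw)  # raises pydantic.ValidationError on bad fields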
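
Finally, a minimal sketch of the kind of Streamlit comparison view described above; the results file path and column names are assumed for illustration only.

    # Sketch of a Streamlit view for comparing benchmark results across models.
    # "results/benchmark_results.csv" and its columns are hypothetical.
    import pandas as pd
    import streamlit as st

    st.title("Gemma Model Benchmark Suite - Results")

    results = pd.read_csv("results/benchmark_results.csv")
    models = st.multiselect("Models", sorted(results["model"].unique()))
    metric = st.selectbox("Metric", ["rougeL", "bertscore_f1", "toxicity"])

    view = results[results["model"].isin(models)]
    st.bar_chart(view, x="model", y=metric)
    st.dataframe(view)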

Key Metrics & Impact

  • Performance on Free Hardware: Completes a 500-sample benchmark run in about 4 minutes with Gemma-2B in 4-bit mode on a single Colab T4 GPU (see the loading sketch after this list).
  • Democratized Access: Enables students, researchers, and practitioners without access to expensive hardware to conduct meaningful and complex LLM evaluations.
  • Community Adoption: The toolkit has been featured in tutorials and demos as a prime example of accessible LLM experimentation.
  • Open Science: Results were communicated through an in-depth Medium article, and the models and code are fully open-sourced on GitHub.
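
For context on the 4-bit setup behind the timing above, here is a minimal sketch of loading Gemma-2B in 4-bit with Transformers and bitsandbytes on a T4; the exact settings are illustrative, not the suite's pinned configuration.

    # Sketch of 4-bit loading on a Colab T4 via Transformers + bitsandbytes.
    # Gemma-2B is a gated model: accepting its license on Hugging Face is required.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs lack bfloat16 support
    )

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b",
        quantization_config=quant_config,
        device_map="auto",
    )

    inputs = tokenizer("Q: What is the capital of France?\nA:", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(output[0], skip_special_tokens=True))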

Tech Stack

  • Python, PyTorch, TensorFlow, Transformers, Hugging Face, Scikit-learn, Streamlit