Gemma Model Benchmark Suite

Goal

To democratize Large Language Model (LLM) analysis by providing an open-source evaluation toolkit that runs on accessible hardware, offering deep customization with minimal compute requirements.

Key Features

  • Broad Evaluation Scope: Evaluate 4+ LLM families (including Gemma, LLaMA 2, and Phi-2) across 5+ standard tasks (including MMLU) and 15+ metrics, including BLEU, ROUGE, BERTScore, and Toxicity (a metric-computation sketch follows this list).
  • Extreme Customization: The entire pipeline is configurable via YAML files, so users can easily plug in their own models, datasets, evaluation logic, and prompt templates (see the configuration sketch below).
  • Low-Resource Optimization: Engineered to run efficiently on free-tier hardware (such as Google Colab T4 GPUs) through 4-bit quantization and batch-size tuning.
  • Interactive Dashboard: Includes a Streamlit dashboard for visually exploring, comparing, and analyzing model results (see the dashboard sketch below).
  • Robustness and Quality: Built from modular components, with Pydantic-validated configuration and 73% Pytest coverage to ensure reliability.
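
To illustrate the metric layer described above, the snippet below computes ROUGE and BERTScore with the Hugging Face evaluate library. This is a minimal sketch: the use of evaluate (rather than the suite's own metric wrappers) and the example strings are assumptions.

    # Minimal metric-computation sketch using the Hugging Face "evaluate" library.
    # The toolkit may wrap these differently; this only shows the general pattern.
    import evaluate

    predictions = ["The cat sat on the mat."]
    references = ["A cat was sitting on the mat."]

    rouge = evaluate.load("rouge")
    bertscore = evaluate.load("bertscore")

    rouge_scores = rouge.compute(predictions=predictions, references=references)
    bert_scores = bertscore.compute(predictions=predictions, references=references, lang="en")

    print("ROUGE-L:", rouge_scores["rougeL"])
    print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))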
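
The configuration sketch below shows one way a Pydantic-validated YAML config could be loaded; the field names and defaults are illustrative assumptions, not the suite's actual schema.

    # Sketch of loading and validating a run configuration from YAML with Pydantic.
    # Field names below are hypothetical placeholders.
    import yaml
    from pydantic import BaseModel, Field

    class BenchmarkConfig(BaseModel):
        model_name: str = "google/gemma-2b"
        tasks: list[str] = Field(default_factory=lambda: ["mmlu"])
        metrics: list[str] = Field(default_factory=lambda: ["rouge", "bertscore"])
        num_samples: int = 500
        load_in_4bit: bool = True
        batch_size: int = 8

    def load_config(path: str) -> BenchmarkConfig:
        with open(path) as f:
            raw = yaml.safe_load(f)
        return BenchmarkConfig(**raw)  # raises pydantic.ValidationError on bad fields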
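
Finally, a minimal sketch of the kind of Streamlit comparison view described above; the results file path and column names are assumed for illustration only.

    # Sketch of a Streamlit view for comparing benchmark results across models.
    # "results/benchmark_results.csv" and its columns are hypothetical.
    import pandas as pd
    import streamlit as st

    st.title("Gemma Model Benchmark Suite - Results")

    results = pd.read_csv("results/benchmark_results.csv")
    models = st.multiselect("Models", sorted(results["model"].unique()))
    metric = st.selectbox("Metric", ["rougeL", "bertscore_f1", "toxicity"])

    view = results[results["model"].isin(models)]
    st.bar_chart(view, x="model", y=metric)
    st.dataframe(view)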

Key Metrics & Impact

  • Performance on Free Hardware: Completes a 500-sample benchmark run in about 4 minutes with Gemma-2B in 4-bit mode on a single Colab T4 GPU (see the loading sketch after this list).
  • Democratized Access: Enables students, researchers, and practitioners without access to expensive hardware to conduct meaningful and complex LLM evaluations.
  • Community Adoption: The toolkit has been featured in tutorials and demos as a prime example of accessible LLM experimentation.
  • Open Science: Results were communicated through an in-depth Medium article, and the models and code are fully open-sourced on GitHub.
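
For context on the 4-bit setup behind the timing above, here is a minimal sketch of loading Gemma-2B in 4-bit with Transformers and bitsandbytes on a T4; the exact settings are illustrative, not the suite's pinned configuration.

    # Sketch of 4-bit loading on a Colab T4 via Transformers + bitsandbytes.
    # Gemma-2B is a gated model: accepting its license on Hugging Face is required.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,  # T4 GPUs lack bfloat16 support
    )

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
    model = AutoModelForCausalLM.from_pretrained(
        "google/gemma-2b",
        quantization_config=quant_config,
        device_map="auto",
    )

    inputs = tokenizer("Q: What is the capital of France?\nA:", return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(output[0], skip_special_tokens=True))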

Tech Stack

  • Python, PyTorch, TensorFlow, Transformers, Hugging Face, Scikit-learn, Streamlit