Gemma Model Benchmark Suite
Goal
To democratize Large Language Model (LLM) analysis by providing an open-source evaluation toolkit that runs on accessible hardware, offering deep customization with minimal compute requirements.
Key Features
- Broad Evaluation Scope: Evaluate 4+ LLM families (e.g., Gemma, LLaMA 2, Phi-2) across 5+ standard tasks (e.g., MMLU) and 15+ metrics, including BLEU, ROUGE, BERTScore, and Toxicity (see the metrics sketch after this list).
- Extreme Customization: The entire pipeline is configurable via YAML files, so users can integrate their own models, datasets, evaluation logic, and prompt templates (see the configuration sketch after this list).
- Low-Resource Optimization: Engineered to run efficiently on free-tier hardware (like Google Colab T4 GPUs) through techniques like 4-bit quantization and batch-size tuning.
- Interactive Dashboard: Includes an interactive Streamlit dashboard for visual exploration, comparison of model performance, and analysis of results.
- Robustness and Quality: The framework is built from modular components, uses Pydantic-validated configuration, and maintains 73% Pytest coverage to ensure reliability.
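To make the YAML-driven customization and Pydantic validation concrete, here is a minimal sketch of the pattern. The schema is hypothetical: field names such as `load_in_4bit`, `tasks`, and `num_samples` are illustrative, not the suite's actual configuration keys.

```python
# Hypothetical config schema: illustrates YAML loading + Pydantic validation,
# not the suite's real field names.
from pydantic import BaseModel, Field
import yaml


class ModelConfig(BaseModel):
    name: str = "google/gemma-2b"      # Hugging Face model id
    load_in_4bit: bool = True          # enable 4-bit quantization
    batch_size: int = Field(8, gt=0)   # tuned for low-memory GPUs


class BenchmarkConfig(BaseModel):
    model: ModelConfig
    tasks: list[str] = ["mmlu"]        # task identifiers
    metrics: list[str] = ["bleu", "rouge", "bertscore"]
    num_samples: int = Field(500, gt=0)


def load_config(path: str) -> BenchmarkConfig:
    """Parse a YAML file and validate it against the schema."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    return BenchmarkConfig(**raw)      # raises ValidationError on bad fields
```

The benefit of this pattern is fail-fast behavior: a typo or out-of-range value in the YAML file is rejected with a ValidationError before any model is downloaded or a run begins.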
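On the metric side, the snippet below is a hedged example of how BLEU, ROUGE, and BERTScore can be computed with the Hugging Face `evaluate` library; the suite's own metric wrappers (and its toxicity scoring) may be implemented differently.

```python
# Illustrative metric computation with the `evaluate` library.
import evaluate

predictions = ["The model answered the question correctly."]
references = ["The model answered the question correctly."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")  # requires the `bert_score` package

# BLEU expects a list of reference sets per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```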
Key Metrics & Impact
- Performance on Free Hardware: Achieves a 4-minute benchmark run for 500 samples using the Gemma-2B model in 4-bit mode on a single Colab T4 GPU (a minimal loading sketch follows this list).
- Democratized Access: Enables students, researchers, and practitioners without access to expensive hardware to conduct meaningful and complex LLM evaluations.
- Community Adoption: The toolkit has been featured in tutorials and demos as a prime example of accessible LLM experimentation.
- Open Science: Results were shared in an in-depth Medium article, and the models and code are fully open-sourced on GitHub.
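As a rough illustration of the low-resource setup behind the T4 benchmark number, the sketch below loads Gemma-2B in 4-bit via Transformers and bitsandbytes so it fits comfortably in a 16 GB T4; the suite's actual loader, prompts, and generation settings are defined in its YAML configs and may differ.

```python
# Illustrative 4-bit load of Gemma-2B on a single GPU (e.g., a Colab T4).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # NF4 weights cut memory roughly 4x
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # T4 has no bfloat16 support
)

# Note: Gemma weights are gated; accepting the license on the Hub is required.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=quant_config,
    device_map="auto",                     # place layers on the available GPU
)

inputs = tokenizer("Question: What is 2 + 2?\nAnswer:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Combined with modest batch sizes, this kind of quantized loading is what keeps a 500-sample run within free-tier memory and time budgets.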
Tech Stack
- Python, PyTorch, TensorFlow, Transformers, Hugging Face, Scikit-learn, Streamlit