A comprehensive testing and benchmarking suite for Large Language Models (LLMs) focused on Python code generation. The project enables automatic quality assessment of generated code through various metrics and generates detailed HTML reports.
Install with Poetry:

```bash
# Clone the repository
git clone https://github.com/wronai/allama.git
cd allama

# Install dependencies
pip install poetry
poetry install

# Activate the virtual environment
poetry shell
```

Or install with pip:

```bash
# Clone the repository
git clone https://github.com/wronai/allama.git
cd allama

pip install .
```
Create or edit the `models.csv` file to configure your models:

```csv
model_name,url,auth_header,auth_value,think,description
mistral:latest,http://localhost:11434/api/chat,,,false,Mistral Latest on Ollama
llama3:8b,http://localhost:11434/api/chat,,,false,Llama 3 8B
gpt-4,https://api.openai.com/v1/chat/completions,Authorization,Bearer sk-...,false,OpenAI GPT-4
```
CSV columns (a usage sketch follows the list):

- `model_name` - Name of the model (e.g., `mistral:latest`, `gpt-4`)
- `url` - API endpoint URL
- `auth_header` - Authorization header name (if required, e.g., `Authorization`)
- `auth_value` - Authorization value (e.g., `Bearer your-api-key`)
- `think` - Whether the model supports the "think" parameter (`true`/`false`)
- `description` - Description of the model
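To make the columns concrete, here is a minimal, hypothetical sketch of how one `models.csv` row could be turned into a chat request. It is an illustration only, not allama's internal code; the use of the `requests` library and the exact payload fields are assumptions based on the Ollama/OpenAI-style chat endpoints shown above.

```python
import csv

import requests  # assumed HTTP client for this illustration


def load_models(path="models.csv"):
    """Read model rows from the CSV described above."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def ask_model(row, prompt, timeout=60):
    """Send one prompt to the endpoint configured in a CSV row (illustrative only)."""
    headers = {"Content-Type": "application/json"}
    if row["auth_header"] and row["auth_value"]:
        # e.g. Authorization: Bearer sk-... for OpenAI-compatible APIs
        headers[row["auth_header"]] = row["auth_value"]
    payload = {
        "model": row["model_name"],
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # request a single JSON response
    }
    # The "think" column is ignored in this sketch.
    return requests.post(row["url"], headers=headers, json=payload, timeout=timeout)


# Example usage:
# for row in load_models():
#     response = ask_model(row, "Write a Python function that adds two numbers.")
#     print(row["model_name"], response.status_code)
```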
The application is configured using external files, with `config.json` being the primary configuration file. This file, located in the root directory, contains all the main settings for the application:

- `prompts_file`: Path to the file containing test prompts (e.g., `prompts.json`).
- `evaluation_weights`: Points awarded for different code quality metrics.
- `timeouts`: Time limits for API requests and code execution.
- `report_config`: Settings for the generated HTML report, such as the title.
- `colors`: Color scheme used in the HTML report.

You can create your own configuration file (e.g., `my_config.yaml`) and use it with the `--config` flag at runtime.
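For orientation, the sketch below writes out a configuration file with the top-level sections listed above. The nested keys and values are illustrative assumptions, not the tool's actual defaults:

```python
import json

# Illustrative configuration only: the top-level sections come from the list
# above, but the nested keys and values are assumptions, not allama's defaults.
example_config = {
    "prompts_file": "prompts.json",
    "evaluation_weights": {        # points per quality check (names assumed)
        "code_compiles": 3,
        "expected_keywords": 2,
        "has_docstring": 1,
    },
    "timeouts": {                  # seconds
        "api_request": 60,
        "code_execution": 10,
    },
    "report_config": {"title": "Allama Benchmark Report"},
    "colors": {"pass": "#2ecc71", "fail": "#e74c3c"},
}

with open("my_config.json", "w") as f:
    json.dump(example_config, f, indent=2)
```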
`prompts.json` contains a list of test cases (prompts) that will be sent to the language models. Each prompt is a JSON object with the following keys:

- `name`: A descriptive name for the test (e.g., "Simple Addition Function").
- `prompt`: The full text of the prompt to be sent to the model.
- `expected_keywords`: A list of keywords that are expected to be present in the generated code.

The system will automatically generate default configuration files (`config.json` and `prompts.json`) if they don't exist when you run the tool. This means you can simply run the `allama` command without any setup, and the necessary configuration files will be created for you with sensible defaults.
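Based on the keys described above, a single test case might look like the sketch below. Both the file contents and the naive keyword check are illustrations, not allama's actual defaults or scoring logic:

```python
import json

# One hypothetical test case using the keys described above.
prompt_case = {
    "name": "Simple Addition Function",
    "prompt": "Write a Python function add(a, b) that returns the sum of a and b.",
    "expected_keywords": ["def", "return"],
}

# prompts.json holds a list of such objects.
with open("my_prompts.json", "w") as f:
    json.dump([prompt_case], f, indent=2)

# A naive keyword check of the kind an evaluator could perform (illustrative only).
generated_code = "def add(a, b):\n    return a + b\n"
found = [kw for kw in prompt_case["expected_keywords"] if kw in generated_code]
print(f"{len(found)}/{len(prompt_case['expected_keywords'])} expected keywords present")
```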
You can run tests with a custom configuration file (in either JSON or YAML format) using the `--config` or `-c` flag. The settings from your custom file will be merged with the defaults.
Example with JSON:

```bash
allama --config my_config.json
```

Example with YAML:

```bash
allama --config custom_settings.yaml
```
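Conceptually, the merge behaves like a recursive dictionary update in which keys from your file override the defaults. The function below is a sketch of that idea, not the tool's actual merge implementation:

```python
def deep_merge(defaults, overrides):
    """Recursively overlay user settings on top of default settings (illustrative)."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


defaults = {"timeouts": {"api_request": 60, "code_execution": 10}}
custom = {"timeouts": {"api_request": 120}}  # e.g. loaded from my_config.json

print(deep_merge(defaults, custom))
# {'timeouts': {'api_request': 120, 'code_execution': 10}}
```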
Allama generates comprehensive reports to help you analyze and compare model performance:
- `allama.html`: An interactive HTML report generated after each test run.
- `allama.json`: All test results are also saved in structured JSON, containing the complete information for each run.
- `*_summary.csv`: A CSV summary file with key metrics for quick analysis in spreadsheet applications.

The interactive HTML report lets you analyze and compare the results directly in the browser; to view it, simply open `allama.html` in any modern web browser after running tests.
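Because the results are also written as structured JSON, they are easy to post-process. The snippet below only inspects whatever top-level structure `allama.json` contains; the exact schema is not documented here, so no field names are assumed:

```python
import json

# Load the structured results and report their top-level shape.
with open("allama.json") as f:
    results = json.load(f)

if isinstance(results, dict):
    print("Top-level keys:", ", ".join(results.keys()))
elif isinstance(results, list):
    print(f"{len(results)} result entries")
```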
Allama allows you to publish your benchmark results to a central repository at allama.sapletta.com, making it easy to share and compare results with others:
```bash
# Run benchmark and publish results
allama --benchmark --publish

# Specify a custom server URL
allama --benchmark --publish --server-url https://your-server.com/upload.php
```
The publishing system includes:
After publishing, you’ll receive a URL where you can view your results online.
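Under the hood, publishing amounts to an HTTP upload of the local results to the configured server URL. The sketch below shows the general shape of such an upload; the multipart field name and the exact request format expected by the server are assumptions, not a documented API:

```python
import requests  # assumed HTTP client for this illustration

# The default endpoint is hosted at allama.sapletta.com; --server-url can point elsewhere.
SERVER_URL = "https://your-server.com/upload.php"

# Upload the local results file; the "file" field name is an assumption.
with open("allama.json", "rb") as f:
    response = requests.post(SERVER_URL, files={"file": f}, timeout=60)

print(response.status_code)
print(response.text)  # expected to contain the URL where the results can be viewed
```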
All benchmark results are automatically saved locally in a timestamped folder structure:
```text
data/
└── test_YYYYMMDD_HHMMSS/
    ├── allama.json     # Complete benchmark results
    ├── allama.html     # HTML report
    └── prompts.json    # Detailed prompt information
```
This allows you to keep a local history of benchmark runs and revisit or compare earlier results, as sketched below.
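For example, you can script over the timestamped folders using nothing more than the layout shown above (no allama internals are assumed):

```python
from pathlib import Path

# List the timestamped run folders that contain results.
for run_dir in sorted(Path("data").glob("test_*")):
    results_file = run_dir / "allama.json"
    if results_file.exists():
        size_kb = results_file.stat().st_size / 1024
        print(f"{run_dir.name}: allama.json ({size_kb:.1f} KB)")
```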
The benchmark server provides several visualization features:
```bash
# Install dependencies and setup
make install

# Run tests
make test

# Run all tests including end-to-end
make test-all

# Run benchmark suite
make benchmark

# Test a single model (set MODEL=name)
make single-model

# Generate HTML report
make report

# Run code formatters
make format

# Run linters
make lint
```
```bash
# Run all tests with default configuration
allama

# Run benchmark suite
allama --benchmark

# Test specific models
allama --models "mistral:latest,llama3:8b,gemma2:2b"

# Test a single model
allama --single-model "mistral:latest"

# Compare specific models
allama --compare "mistral:latest" "llama3:8b"

# Generate HTML report
allama --output benchmark_report.html

# Run with verbose output
allama --verbose

# Run with custom configuration
allama --config custom_config.json

# Test with a specific prompt
allama --single-model "mistral:latest" --prompt-index 0

# Set request timeout (in seconds)
allama --timeout 60
```
The system evaluates generated code against a set of weighted quality criteria, with the points for each criterion configured via `evaluation_weights` in `config.json` (see the sketch below).
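A reasonable mental model is a weighted sum over the individual checks, although the actual scoring may differ. The criterion names below are hypothetical placeholders, since the exact metrics depend on your configuration:

```python
# Hypothetical weights and check outcomes; the criterion names are placeholders,
# not allama's actual metric names.
weights = {"code_compiles": 3, "expected_keywords": 2, "has_docstring": 1}
checks = {"code_compiles": True, "expected_keywords": True, "has_docstring": False}

score = sum(points for name, points in weights.items() if checks.get(name))
max_score = sum(weights.values())
print(f"score: {score}/{max_score}")  # -> score: 5/6
```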
For the Ansible tests, create `tests/ansible/inventory.ini` with:

```ini
[all]
localhost ansible_connection=local
```
Example `models.csv` entries for different providers:

```csv
model_name,url,auth_header,auth_value,think,description
llama3:8b,http://localhost:11434/api/chat,,,false,Llama 3 8B
gpt-4,https://api.openai.com/v1/chat/completions,Authorization,Bearer sk-your-key,false,OpenAI GPT-4
claude-3,https://api.anthropic.com/v1/messages,x-api-key,your-key,false,Claude 3
local-model,http://localhost:8080/generate,,,false,Local Model
```
```text
allama/
├── allama/                 # Main package
│   ├── __init__.py         # Package initialization
│   ├── main.py             # Main module
│   ├── config_loader.py    # Configuration loading and generation
│   └── runner.py           # Test runner implementation
├── tests/                  # Test files
│   └── test_allama.py      # Unit tests
├── models.csv              # Model configurations
├── config.json             # Main configuration (auto-generated if missing)
├── prompts.json            # Test prompts (auto-generated if missing)
├── pyproject.toml          # Project metadata and dependencies
├── Makefile                # Common tasks
└── README.md               # This file
```
After running the benchmark, you’ll get:
If you encounter any issues or have questions:
Contributions are welcome! Please read our Contributing Guidelines for details on how to contribute to this project.
This project is licensed under the MIT License - see the LICENSE file for details.