This document explains how to benchmark NLP2CMD performance for single and sequential command processing.
The NLP2CMD benchmarking tool measures single-command latency, sequential-run latency, and throughput for each adapter:
# Generate a complete performance report
make report
# View the last benchmark report
make benchmark-view
# Clean benchmark reports
make benchmark-clean
# Run the main benchmark script
python3 benchmark_nlp2cmd.py
# Run benchmark WITHOUT cache (forces fresh LLM calls for every query)
python3 examples/benchmark_nlp2cmd.py --no-cache
# Run the sequential commands example
python3 examples/benchmark_sequential_commands.py
For true LLM performance testing without cache influence:
# Standard benchmark (uses cache and template pipeline)
make benchmark
# Benchmark without cache (pure LLM performance)
python3 examples/benchmark_nlp2cmd.py --no-cache
# The --no-cache flag disables:
# - Cache lookups (exact, fuzzy, similarity)
# - Template pipeline (1615 patterns)
# Forces fresh LLM calls for every query
Use --no-cache when:
Here's a sample benchmark output:

Overall Performance:
  Average single command latency: 32.0ms
  Average sequential latency: 25.8ms
  Average throughput: 38.81 commands/sec

Top Performers:
  Fastest single command: docker adapter
  Fastest sequential: docker adapter
  Highest throughput: docker adapter

Detailed Results by Adapter:

SHELL:
  Single command: 29.9ms
  Sequential avg: 25.8ms
  Throughput: 38.67 cmd/s
  Total sequential time: 258.6ms
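As a quick sanity check, the sample numbers above are internally consistent: throughput is the inverse of per-command latency, and the total time implies a 10-command sequential run (258.6 ms / 25.8 ms ≈ 10).

```python
# Throughput is the inverse of per-command latency, so the sample
# sequential latency of 25.8 ms should yield roughly the reported
# 38.67 cmd/s (the small gap is per-run overhead).
sequential_latency_ms = 25.8
throughput = 1000 / sequential_latency_ms  # commands per second
print(f"{throughput:.2f} cmd/s")  # 38.76 cmd/s

# Total time for a 10-command sequential run:
print(f"{sequential_latency_ms * 10:.1f} ms")  # 258.0 ms vs the reported 258.6 ms
```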
When using the LLM benchmark (examples/benchmark_nlp2cmd.py), additional files are generated in benchmark_output/:
(pattern_match=false or error), grouped by model and domain

Run benchmarks after code changes to detect performance regressions:
# Before changes
make report
mv benchmark_report.json benchmark_before.json
# After changes
make report
mv benchmark_report.json benchmark_after.json
# Compare results
jq '.summary.average_single_command_latency_ms' benchmark_before.json benchmark_after.json
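The comparison can also be automated. A minimal sketch in Python, assuming only the summary.average_single_command_latency_ms field shown in the jq one-liner above; the 10% threshold is an arbitrary choice:

```python
import json

def average_latency_ms(report_path):
    """Read the average single-command latency from a benchmark report."""
    with open(report_path) as f:
        return json.load(f)["summary"]["average_single_command_latency_ms"]

def is_regression(before_ms, after_ms, threshold_pct=10.0):
    """Flag a slowdown larger than threshold_pct percent."""
    return (after_ms - before_ms) / before_ms * 100 > threshold_pct

# Example with the sample numbers from this document:
print(is_regression(32.0, 33.5))  # ~4.7% slower: False, within the threshold
print(is_regression(32.0, 40.0))  # 25% slower: True, flag it
```

In CI you would pass benchmark_before.json and benchmark_after.json to average_latency_ms and exit nonzero when is_regression returns True.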
Use throughput metrics to determine system capacity:
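For example, a rough capacity estimate from the sample throughput (the 200 cmd/s peak-load figure is hypothetical):

```python
import math

# How many parallel workers are needed to absorb a target load,
# given the sample throughput of 38.81 cmd/s per worker?
throughput_cmd_s = 38.81   # from the sample report above
peak_load_cmd_s = 200      # hypothetical peak demand

workers_needed = math.ceil(peak_load_cmd_s / throughput_cmd_s)
print(workers_needed)  # 6
```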
Based on benchmark results, you can:
Edit benchmark_nlp2cmd.py and modify the command lists:
commands = {
    "shell": [
        "Your custom command 1",
        "Your custom command 2",
        # ... more commands
    ],
    # ... other adapters
}
Create custom benchmark scripts for specific use cases:
# Example: Test file operations specifically
file_commands = [
    "Find all .log files",
    "Compress old logs",
    "Delete temporary files",
    # ... more file operations
]
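A custom benchmark can be as small as a timing loop over such a list. In this sketch, translate() is a placeholder for whatever NLP2CMD entry point you actually call, not the real API:

```python
import time

def translate(query):
    return f"echo {query}"  # stand-in; replace with the real NLP2CMD call

def benchmark(queries):
    """Return the average translation latency in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        translate(q)
        latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies)

avg_ms = benchmark(["Find all .log files", "Compress old logs"])
print(f"avg latency: {avg_ms:.2f}ms")
```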
If you see high latency (>100ms per command):
For more consistent benchmarking:
If an adapter fails:
Add benchmarking to your CI pipeline:
# .github/workflows/benchmark.yml
- name: Run Benchmark
  run: |
    python3 benchmark_nlp2cmd.py

- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: benchmark-results
    path: |
      benchmark_report.json
      benchmark_results.csv