
NLP2CMD Performance Benchmarking

This document explains how to benchmark NLP2CMD performance for single and sequential command processing.

Overview

The NLP2CMD benchmarking tool measures single-command latency, sequential processing latency, throughput, and total processing time for each supported adapter.

Quick Start

Run Benchmark with Make

# Generate a complete performance report
make report

# View the last benchmark report
make benchmark-view

# Clean benchmark reports
make benchmark-clean

Run Benchmark Directly

# Run the main benchmark script
python3 benchmark_nlp2cmd.py

# Run benchmark WITHOUT cache (forces fresh LLM calls for every query)
python3 examples/benchmark_nlp2cmd.py --no-cache

# Run the sequential commands example
python3 examples/benchmark_sequential_commands.py

Benchmark Without Cache

For true LLM performance testing without cache influence:

# Standard benchmark (uses cache and template pipeline)
make benchmark

# Benchmark without cache (pure LLM performance)
python3 examples/benchmark_nlp2cmd.py --no-cache

# The --no-cache flag disables:
# - Cache lookups (exact, fuzzy, similarity)
# - Template pipeline (1615 patterns)
# Forces fresh LLM calls for every query

Use --no-cache when you want to measure raw LLM generation performance, for example when comparing models or checking whether a change to the LLM path has introduced a latency or accuracy regression.

Understanding the Results

Key Metrics

  1. Single Command Latency: Average time to process one command (in milliseconds)
  2. Sequential Latency: Average time per command when processing multiple commands sequentially
  3. Throughput: Number of commands processed per second
  4. Total Time: Total time to process all commands in sequence
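
These metrics can all be derived from per-command wall-clock timings. The sketch below shows how they relate; process_command() is a hypothetical stand-in for whatever the benchmark actually times, so treat it as illustrative rather than the benchmark's implementation.

import time

def summarize(process_command, commands):
    # Time each command and derive the sequential metrics reported above.
    timings = []
    for cmd in commands:
        start = time.perf_counter()
        process_command(cmd)          # hypothetical stand-in for the real call
        timings.append(time.perf_counter() - start)

    total_s = sum(timings)
    return {
        "sequential_latency_ms": 1000 * total_s / len(timings),  # avg per command
        "throughput_cmd_per_s": len(timings) / total_s,          # commands per second
        "total_time_ms": 1000 * total_s,                         # whole sequence
    }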

Performance Factors

Latency and throughput depend mainly on whether a query is served from the cache or template pipeline versus a fresh LLM call, on the adapter in use, and on available system resources (CPU, memory).

Example Results

Here’s a sample benchmark output:

πŸ“Š Overall Performance:
  Average single command latency: 32.0ms
  Average sequential latency: 25.8ms
  Average throughput: 38.81 commands/sec

πŸ† Top Performers:
  Fastest single command: docker adapter
  Fastest sequential: docker adapter
  Highest throughput: docker adapter

πŸ“ˆ Detailed Results by Adapter:

  SHELL:
    Single command: 29.9ms
    Sequential avg: 25.8ms
    Throughput: 38.67 cmd/s
    Total sequential time: 258.6ms

Files Generated

  1. benchmark_report.json: Detailed JSON report with all metrics
  2. benchmark_results.csv: CSV file suitable for plotting in Excel or other tools (see the plotting sketch after this list)
  3. sequential_benchmark_results.json: Results from the sequential commands example
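
If you prefer to plot the CSV programmatically instead of in Excel, a short script is enough. The column names in the sketch below are assumptions, not the file's documented schema, so check the actual header row first.

import csv
import matplotlib.pyplot as plt

# Column names here are assumptions -- check the header row of benchmark_results.csv.
with open("benchmark_results.csv", newline="") as f:
    rows = list(csv.DictReader(f))

adapters = [row["adapter"] for row in rows]
throughput = [float(row["throughput_cmd_per_s"]) for row in rows]

plt.bar(adapters, throughput)
plt.ylabel("commands/sec")
plt.title("Throughput by adapter")
plt.savefig("throughput_by_adapter.png")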

When using the LLM benchmark (examples/benchmark_nlp2cmd.py), additional files are generated in benchmark_output/:

  1. benchmark_results.json: Raw per-query results + aggregated accuracy stats
  2. benchmark_results.html: Interactive charts
  3. benchmark_command_errors.md: Auto-generated list of incorrect commands (pattern_match=false or error), grouped by model and domain

Use Cases

Performance Regression Testing

Run benchmarks after code changes to detect performance regressions:

# Before changes
make report
mv benchmark_report.json benchmark_before.json

# After changes
make report
mv benchmark_report.json benchmark_after.json

# Compare results
jq '.summary.average_single_command_latency_ms' benchmark_before.json benchmark_after.json
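
If you want a scriptable check, for example as a CI gate, a small comparison script can read both reports and fail when latency regresses. The sketch below relies only on the summary.average_single_command_latency_ms field used in the jq command above; the 10% threshold is an arbitrary example.

import json
import sys

THRESHOLD = 1.10  # fail if average latency grows by more than 10% (example value)

def avg_latency_ms(path):
    with open(path) as f:
        return json.load(f)["summary"]["average_single_command_latency_ms"]

before = avg_latency_ms("benchmark_before.json")
after = avg_latency_ms("benchmark_after.json")
print(f"before: {before:.1f}ms  after: {after:.1f}ms")

if after > before * THRESHOLD:
    sys.exit(f"Regression: latency increased by {100 * (after / before - 1):.1f}%")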

System Capacity Planning

Use the measured throughput to estimate how many commands a single NLP2CMD worker can handle over a given period and how many workers a target load requires.
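
As a rough sizing sketch using the sample throughput above (the peak-load figure is a made-up example, not a measurement):

import math

avg_throughput = 38.81   # commands/sec per worker, from the sample output above
peak_load = 100          # assumed peak demand in commands/sec (hypothetical figure)

print(f"One worker handles ~{avg_throughput * 3600:,.0f} commands/hour")
print(f"Workers needed at peak: {math.ceil(peak_load / avg_throughput)}")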

Optimization Targets

Based on benchmark results, you can identify the slowest adapters, set latency budgets for single and sequential processing, and verify that the cache and template pipeline deliver the expected speed-up over fresh LLM calls.

Customizing Benchmarks

Adding New Commands

Edit benchmark_nlp2cmd.py and modify the command lists:

commands = {
    "shell": [
        "Your custom command 1",
        "Your custom command 2",
        # ... more commands
    ],
    # ... other adapters
}

Testing Different Scenarios

Create custom benchmark scripts for specific use cases:

# Example: Test file operations specifically
file_commands = [
    "Find all .log files",
    "Compress old logs",
    "Delete temporary files",
    # ... more file operations
]
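
A minimal timing loop around such a list could look like the sketch below. The import path, constructor, and convert() method are assumptions, so swap in the actual NLP2CMD API; the point is reusing one instance and timing each query with time.perf_counter().

import time
from nlp2cmd import NLP2CMD   # import path assumed; adjust to the actual package layout

nlp = NLP2CMD()               # constructor arguments, if any, are assumptions
timings = []
for query in file_commands:
    start = time.perf_counter()
    nlp.convert(query)        # hypothetical method name; use the real API here
    timings.append(time.perf_counter() - start)

print(f"avg latency: {1000 * sum(timings) / len(timings):.1f}ms "
      f"({len(timings) / sum(timings):.2f} cmd/s)")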

Troubleshooting

High Latency

If you see high latency (>100ms per command):

  1. Check system resources (CPU, memory)
  2. Verify Python dependencies are properly installed
  3. Consider running on a more powerful machine

Inconsistent Results

For more consistent benchmarking:

  1. Close other applications
  2. Run multiple times and average the results
  3. Use the same system configuration for comparisons

Adapter Errors

If an adapter fails:

  1. Check that all dependencies are installed
  2. Verify the adapter configuration
  3. Check logs for specific error messages

Integrating with CI/CD

Add benchmarking to your CI pipeline:

# .github/workflows/benchmark.yml
- name: Run Benchmark
  run: |
    python3 benchmark_nlp2cmd.py
    
- name: Upload Results
  uses: actions/upload-artifact@v3
  with:
    name: benchmark-results
    path: |
      benchmark_report.json
      benchmark_results.csv

Performance Tips

  1. Reuse Instances: Create one NLP2CMD instance and reuse it for multiple commands
  2. Batch Processing: Process commands in batches for better throughput
  3. Choose Right Adapter: Use the adapter that best matches your domain
  4. Monitor Memory: Large numbers of commands may require more memory