Benchmarking

Comprehensive benchmarking suite for webMCP performance analysis. Compare models, measure efficiency, and optimize your implementations.

Benchmarking Capabilities

Comprehensive performance analysis across multiple dimensions

Performance Benchmarking

Compare processing speed and optimization efficiency

  • Processing time comparisons
  • Optimization speed analysis
  • Throughput measurements
  • Resource utilization tracking

Cost Benchmarking

Analyze cost efficiency across models and configurations

  • Cost per optimization
  • Model cost comparisons
  • ROI analysis across projects
  • Budget efficiency tracking

Quality Benchmarking

Measure optimization quality and accuracy

  • Token reduction accuracy
  • Semantic preservation
  • Optimization consistency
  • Error rate analysis

Model Performance Comparison

Latest benchmark results across supported AI models

| Model | Processing Time (s) | Cost per Optimization | Token Reduction (%) | Accuracy Score (%) | Throughput (ops/s) |
|---|---|---|---|---|---|
| GPT-4o | 1.2 | $0.0034 | 68.1 | 96.8 | 847 |
| Claude-3.5-Sonnet | 1.8 | $0.0052 | 71.3 | 98.2 | 563 |
| GPT-4 | 2.1 | $0.0045 | 64.9 | 94.7 | 478 |
| GPT-3.5-Turbo | 0.8 | $0.0012 | 62.4 | 91.3 | 1256 |
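
The derived columns can be reproduced directly from raw benchmark runs. The sketch below is a minimal example, assuming each run records its processing time in seconds and its cost in dollars (the same fields the BenchmarkRunner examples later in this section collect); the run shape is hypothetical.

JavaScript
// Derive the table metrics from an array of raw benchmark runs
// (hypothetical run shape: { processingTime, cost, tokenReduction, success })
function summarizeRuns(runs) {
  const ok = runs.filter(r => r.success);
  const totalTime = ok.reduce((sum, r) => sum + r.processingTime, 0);
  return {
    avgProcessingTime: totalTime / ok.length,                                        // seconds
    costPerOptimization: ok.reduce((sum, r) => sum + r.cost, 0) / ok.length,         // dollars
    avgTokenReduction: ok.reduce((sum, r) => sum + r.tokenReduction, 0) / ok.length, // percent
    throughput: ok.length / totalTime                                                // ops/s
  };
}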

Configuration Performance Comparison

Basic Optimization

Processing Time: 0.9s
Token Reduction: 58.2%
Cost Efficiency: 85.3%

Simple forms and basic interactions

Advanced Optimization

Processing Time: 1.7s
Token Reduction: 71.8%
Cost Efficiency: 94.7%

Complex forms with validation rules

Premium Optimization

Processing Time: 2.4s
Token Reduction: 78.9%
Cost Efficiency: 98.1%

Enterprise applications with complex workflows
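
Choosing a tier is a configuration change rather than a code change. The sketch below is a minimal example using the same WebMCPProcessor options that appear in the implementation examples later in this section; the workload names are hypothetical.

JavaScript
import { WebMCPProcessor } from '@webmcp/core';

// Map workload complexity to a compression level
// ('basic' | 'advanced' | 'premium', as used in the A/B test example below)
function createProcessor(workload) {
  const levels = {
    simpleForm: 'basic',
    validatedForm: 'advanced',
    enterpriseWorkflow: 'premium'
  };
  return new WebMCPProcessor({
    targetModel: 'gpt-4o',
    compressionLevel: levels[workload] ?? 'advanced'
  });
}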

Industry Benchmarks

Performance benchmarks across different industries and use cases

E-commerce

Avg Optimization: 69.4%
Avg Cost Savings: $0.0087

Common Use Cases:

  • Checkout forms
  • Product search
  • Customer support

Best Practices:

  • Focus on checkout flow optimization
  • Prioritize mobile form experiences
  • Implement progressive enhancement

Financial Services

Avg Optimization: 74.2%
Avg Cost Savings: $0.0156

Common Use Cases:

  • KYC forms
  • Loan applications
  • Account management

Best Practices:

  • Emphasize security and compliance
  • Optimize complex multi-step forms
  • Focus on accessibility standards

Healthcare

Avg Optimization: 71.8%
Avg Cost Savings: $0.0134

Common Use Cases:

  • Patient intake
  • Insurance forms
  • Appointment booking

Best Practices:

  • Ensure HIPAA compliance
  • Optimize interfaces for elderly users
  • Minimize form completion friction

SaaS

Avg Optimization: 67.9%
Avg Cost Savings: $0.0078

Common Use Cases:

  • User onboarding
  • Settings panels
  • Feature forms

Best Practices:

  • Streamline onboarding flows
  • Optimize settings interfaces
  • Focus on user engagement metrics

Implementation Examples

Practical examples for implementing benchmarking in your applications

Basic Performance Benchmarking

Set up performance benchmarking for your webMCP implementations

JavaScript
import { WebMCPProcessor, BenchmarkRunner } from '@webmcp/core';
import fs from 'node:fs';

class WebMCPBenchmark {
  constructor() {
    this.benchmarkRunner = new BenchmarkRunner({
      iterations: 100,
      warmupRuns: 10,
      includeMemoryMetrics: true,
      includeTokenMetrics: true
    });
    
    this.models = [
      'gpt-4o',
      'claude-3.5-sonnet', 
      'gpt-4',
      'gpt-3.5-turbo'
    ];
  }
  
  async runPerformanceBenchmark(htmlSamples) {
    const results = [];
    
    for (const model of this.models) {
      console.log(`Benchmarking ${model}...`);
      
      const processor = new WebMCPProcessor({
        targetModel: model,
        compressionLevel: 'advanced'
      });
      
      const modelResults = await this.benchmarkRunner.run({
        name: `${model}_performance`,
        processor,
        testCases: htmlSamples,
        metrics: ['processingTime', 'tokenReduction', 'memoryUsage', 'cost']
      });
      
      results.push({
        model,
        ...modelResults.summary
      });
    }
    
    return this.analyzeResults(results);
  }
  
  analyzeResults(results) {
    // Calculate relative performance
    const baselineSpeed = Math.min(...results.map(r => r.avgProcessingTime));
    const baselineCost = Math.min(...results.map(r => r.avgCost));
    
    return results.map(result => ({
      ...result,
      speedRatio: result.avgProcessingTime / baselineSpeed,
      costRatio: result.avgCost / baselineCost,
      efficiencyScore: this.calculateEfficiencyScore(result),
      ranking: 0 // Will be calculated after sorting
    })).sort((a, b) => b.efficiencyScore - a.efficiencyScore)
      .map((result, index) => ({ ...result, ranking: index + 1 }));
  }
  
  calculateEfficiencyScore(result) {
    // Weighted efficiency score
    const weights = {
      tokenReduction: 0.4,
      processingTime: 0.3,
      cost: 0.2,
      accuracy: 0.1
    };
    
    return (
      (result.avgTokenReduction * weights.tokenReduction) +
      ((1 / result.avgProcessingTime) * 100 * weights.processingTime) +
      ((1 / result.avgCost) * weights.cost) +
      (result.accuracyScore * weights.accuracy)
    );
  }
  
  generateReport(results) {
    const report = {
      timestamp: new Date().toISOString(),
      summary: {
        modelsCompared: results.length,
        bestPerformingModel: results[0].model,
        avgTokenReduction: results.reduce((sum, r) => sum + r.avgTokenReduction, 0) / results.length,
        avgProcessingTime: results.reduce((sum, r) => sum + r.avgProcessingTime, 0) / results.length
      },
      detailed: results,
      recommendations: this.generateRecommendations(results)
    };
    
    return report;
  }
  
  generateRecommendations(results) {
    const recommendations = [];
    const best = results[0];
    // Copy before sorting so the ranked results array is not mutated
    const fastest = [...results].sort((a, b) => a.avgProcessingTime - b.avgProcessingTime)[0];
    const cheapest = [...results].sort((a, b) => a.avgCost - b.avgCost)[0];
    
    recommendations.push({
      type: 'overall_best',
      message: `${best.model} provides the best overall efficiency with a score of ${best.efficiencyScore.toFixed(1)}`,
      model: best.model
    });
    
    if (fastest.model !== best.model) {
      recommendations.push({
        type: 'fastest_processing',
        message: `${fastest.model} offers the fastest processing at ${fastest.avgProcessingTime.toFixed(2)}s average`,
        model: fastest.model,
        useCase: 'High-throughput applications requiring minimal latency'
      });
    }
    
    if (cheapest.model !== best.model) {
      recommendations.push({
        type: 'most_cost_effective',
        message: `${cheapest.model} provides the lowest cost at $${cheapest.avgCost.toFixed(4)} per optimization`,
        model: cheapest.model,
        useCase: 'Budget-conscious applications with high volume'
      });
    }
    
    return recommendations;
  }
}

// Usage example
const benchmark = new WebMCPBenchmark();

// Sample HTML test cases
const htmlSamples = [
  `<form><input name="email" type="email"><button>Submit</button></form>`,
  `<form><input name="name"><input name="phone"><select name="country"><option>US</option></select><button>Save</button></form>`,
  // Add more test cases...
];

// Run benchmark
benchmark.runPerformanceBenchmark(htmlSamples)
  .then(results => {
    const report = benchmark.generateReport(results);
    console.log('Benchmark Report:', report);
    
    // Export results
    fs.writeFileSync('benchmark_results.json', JSON.stringify(report, null, 2));
  })
  .catch(console.error);

A/B Testing Framework

Compare different optimization configurations

JavaScript
import { WebMCPProcessor } from '@webmcp/core';

class WebMCPABTest {
  constructor() {
    this.experiments = new Map();
    this.results = [];
  }
  
  // Define A/B test experiment
  defineExperiment(name, configurations) {
    this.experiments.set(name, {
      name,
      configurations,
      status: 'defined',
      startTime: null,
      endTime: null,
      results: []
    });
  }
  
  // Run A/B test
  async runExperiment(experimentName, testData, options = {}) {
    const experiment = this.experiments.get(experimentName);
    if (!experiment) {
      throw new Error(`Experiment ${experimentName} not found`);
    }
    
    const {
      sampleSize = 100,
      confidenceLevel = 0.95,
      trafficSplit = 'equal' // equal, weighted, or custom array
    } = options;
    
    experiment.status = 'running';
    experiment.startTime = Date.now();
    
    console.log(`Starting A/B test: ${experimentName}`);
    
    const results = await this.executeVariants(
      experiment.configurations,
      testData,
      sampleSize,
      trafficSplit
    );
    
    experiment.results = results;
    experiment.endTime = Date.now();
    experiment.status = 'completed';
    
    const analysis = this.analyzeExperiment(experiment);
    this.results.push(analysis);
    
    return analysis;
  }
  
  async executeVariants(configurations, testData, sampleSize, trafficSplit) {
    const variantSizes = this.calculateTrafficSplit(configurations.length, sampleSize, trafficSplit);
    const results = [];
    
    for (let i = 0; i < configurations.length; i++) {
      const config = configurations[i];
      const variantSize = variantSizes[i];
      
      console.log(`Testing variant ${config.name} with ${variantSize} samples...`);
      
      const processor = new WebMCPProcessor(config.settings);
      const variantResults = [];
      
      // Run tests for this variant
      for (let j = 0; j < variantSize; j++) {
        const testCase = testData[j % testData.length];
        const startTime = Date.now();
        
        try {
          const result = await processor.parseWebMCP(testCase.html);
          const endTime = Date.now();
          
          variantResults.push({
            processingTime: endTime - startTime,
            tokenReduction: result.tokenReduction,
            cost: result.costAnalysis?.optimizedCost || 0,
            accuracy: this.calculateAccuracy(result, testCase.expected),
            success: true
          });
        } catch (error) {
          variantResults.push({
            processingTime: 0,
            tokenReduction: 0,
            cost: 0,
            accuracy: 0,
            success: false,
            error: error.message
          });
        }
      }
      
      // Calculate variant statistics
      const successfulResults = variantResults.filter(r => r.success);
      const variantStats = {
        name: config.name,
        configuration: config.settings,
        sampleSize: variantSize,
        successRate: (successfulResults.length / variantSize) * 100,
        avgProcessingTime: this.calculateMean(successfulResults.map(r => r.processingTime)),
        avgTokenReduction: this.calculateMean(successfulResults.map(r => r.tokenReduction)),
        avgCost: this.calculateMean(successfulResults.map(r => r.cost)),
        avgAccuracy: this.calculateMean(successfulResults.map(r => r.accuracy)),
        stdDevProcessingTime: this.calculateStdDev(successfulResults.map(r => r.processingTime)),
        rawResults: variantResults
      };
      
      results.push(variantStats);
    }
    
    return results;
  }
  
  analyzeExperiment(experiment) {
    const { results } = experiment;
    
    // Statistical significance testing
    const significanceTests = this.performSignificanceTests(results);
    
    // Determine winner
    const winner = this.determineWinner(results, significanceTests);
    
    // Calculate confidence intervals
    const confidenceIntervals = results.map(r => ({
      name: r.name,
      tokenReduction: this.calculateConfidenceInterval(
        r.rawResults.map(x => x.tokenReduction).filter(x => x > 0),
        0.95
      ),
      processingTime: this.calculateConfidenceInterval(
        r.rawResults.map(x => x.processingTime).filter(x => x > 0),
        0.95
      )
    }));
    
    return {
      experimentName: experiment.name,
      duration: experiment.endTime - experiment.startTime,
      variants: results,
      significanceTests,
      winner,
      confidenceIntervals,
      recommendations: this.generateABRecommendations(results, winner, significanceTests)
    };
  }
  
  performSignificanceTests(results) {
    const tests = [];
    
    // Compare each variant with the control (first variant)
    const control = results[0];
    
    for (let i = 1; i < results.length; i++) {
      const variant = results[i];
      
      // T-test for token reduction
      const tokenReductionTest = this.tTest(
        control.rawResults.map(r => r.tokenReduction),
        variant.rawResults.map(r => r.tokenReduction)
      );
      
      // T-test for processing time
      const processingTimeTest = this.tTest(
        control.rawResults.map(r => r.processingTime),
        variant.rawResults.map(r => r.processingTime)
      );
      
      tests.push({
        comparison: `${control.name} vs ${variant.name}`,
        tokenReduction: {
          pValue: tokenReductionTest.pValue,
          significant: tokenReductionTest.pValue < 0.05,
          improvement: variant.avgTokenReduction - control.avgTokenReduction
        },
        processingTime: {
          pValue: processingTimeTest.pValue,
          significant: processingTimeTest.pValue < 0.05,
          improvement: control.avgProcessingTime - variant.avgProcessingTime // Lower is better
        }
      });
    }
    
    return tests;
  }
  
  determineWinner(results, significanceTests) {
    // Calculate composite score for each variant
    const scores = results.map(result => ({
      name: result.name,
      score: this.calculateCompositeScore(result),
      result
    }));
    
    scores.sort((a, b) => b.score - a.score);
    
    const winner = scores[0];
    const runnerUp = scores[1];
    
    // Check if winner is statistically significant
    const relevantTest = significanceTests.find(test => 
      test.comparison.includes(winner.name)
    );
    
    return {
      variant: winner.name,
      score: winner.score,
      statisticallySignificant: relevantTest?.tokenReduction.significant || false,
      marginOfVictory: winner.score - runnerUp.score,
      confidence: relevantTest ? (1 - relevantTest.tokenReduction.pValue) * 100 : 0
    };
  }
  
  calculateCompositeScore(result) {
    // Weighted composite score
    return (
      result.avgTokenReduction * 0.4 +
      (100 - result.avgProcessingTime) * 0.3 + // Lower processing time is better
      (1 / result.avgCost) * 0.2 +
      result.avgAccuracy * 0.1
    );
  }
  
  // Helper methods for statistical calculations
  calculateMean(values) {
    return values.reduce((sum, val) => sum + val, 0) / values.length;
  }
  
  calculateStdDev(values) {
    const mean = this.calculateMean(values);
    const squaredDiffs = values.map(val => Math.pow(val - mean, 2));
    return Math.sqrt(this.calculateMean(squaredDiffs));
  }
  
  calculateTrafficSplit(variantCount, sampleSize, trafficSplit) {
    // Equal split by default; a custom array of fractions is also accepted
    if (Array.isArray(trafficSplit)) {
      return trafficSplit.map(fraction => Math.round(sampleSize * fraction));
    }
    return new Array(variantCount).fill(Math.floor(sampleSize / variantCount));
  }
  
  calculateAccuracy(result, expected) {
    // Simplified accuracy: distance between the optimized token count and the expected count
    // (assumes the result exposes an optimizedTokens field)
    if (!expected?.tokenCount || !result.optimizedTokens) return 100;
    const diff = Math.abs(result.optimizedTokens - expected.tokenCount);
    return Math.max(0, 100 - (diff / expected.tokenCount) * 100);
  }
  
  calculateConfidenceInterval(values, confidenceLevel) {
    // Normal-approximation interval around the sample mean (1.96 ≈ 95%, 2.576 ≈ 99%)
    const mean = this.calculateMean(values);
    const stdErr = this.calculateStdDev(values) / Math.sqrt(values.length);
    const z = confidenceLevel >= 0.99 ? 2.576 : 1.96;
    return { mean, lower: mean - z * stdErr, upper: mean + z * stdErr };
  }
  
  generateABRecommendations(results, winner, significanceTests) {
    const recommendations = [`Adopt ${winner.variant} (composite score ${winner.score.toFixed(1)})`];
    if (!winner.statisticallySignificant) {
      recommendations.push('Difference is not statistically significant; increase the sample size before rolling out.');
    }
    return recommendations;
  }
  
  tTest(sample1, sample2) {
    // Simplified t-test implementation
    const mean1 = this.calculateMean(sample1);
    const mean2 = this.calculateMean(sample2);
    const var1 = this.calculateStdDev(sample1) ** 2;
    const var2 = this.calculateStdDev(sample2) ** 2;
    const n1 = sample1.length;
    const n2 = sample2.length;
    
    const tStat = (mean1 - mean2) / Math.sqrt(var1/n1 + var2/n2);
    const df = n1 + n2 - 2;
    
    // Simplified p-value calculation (normally would use t-distribution)
    const pValue = 2 * (1 - this.normalCDF(Math.abs(tStat)));
    
    return { tStat, pValue, df };
  }
  
  normalCDF(x) {
    // Approximation of normal CDF
    return 0.5 * (1 + Math.sign(x) * Math.sqrt(1 - Math.exp(-2 * x * x / Math.PI)));
  }
}

// Usage example
const abTest = new WebMCPABTest();

// Define experiment
abTest.defineExperiment('optimization_levels', [
  {
    name: 'control_basic',
    settings: { compressionLevel: 'basic', targetModel: 'gpt-3.5-turbo' }
  },
  {
    name: 'variant_advanced',
    settings: { compressionLevel: 'advanced', targetModel: 'gpt-4o' }
  },
  {
    name: 'variant_premium',
    settings: { compressionLevel: 'premium', targetModel: 'claude-3.5-sonnet' }
  }
]);

// Run experiment
const testData = [
  { html: '<form>...</form>', expected: { tokenCount: 100 } },
  // More test cases...
];

abTest.runExperiment('optimization_levels', testData, {
  sampleSize: 300,
  confidenceLevel: 0.95
}).then(results => {
  console.log('A/B Test Results:', results);
  console.log('Winner:', results.winner.variant);
  console.log('Statistical Significance:', results.winner.statisticallySignificant);
});

Python Benchmarking Suite

Comprehensive benchmarking toolkit for Python applications

Python
import time
import statistics
import pandas as pd
import numpy as np
from scipy import stats
from webmcp_core import WebMCPProcessor
from typing import Any, Dict, List, Optional
import matplotlib.pyplot as plt
import seaborn as sns

class WebMCPBenchmarkSuite:
    def __init__(self, models: Optional[List[str]] = None):
        self.models = models or ['gpt-4o', 'claude-3.5-sonnet', 'gpt-4', 'gpt-3.5-turbo']
        self.results = []
        self.processors = {}
        
        # Initialize processors
        for model in self.models:
            self.processors[model] = WebMCPProcessor(target_model=model)
    
    def run_performance_benchmark(self, test_cases: List[str], iterations: int = 50) -> Dict:
        """Run comprehensive performance benchmark"""
        results = {}
        
        for model in self.models:
            print(f"Benchmarking {model}...")
            model_results = []
            
            processor = self.processors[model]
            
            for i in range(iterations):
                for test_case in test_cases:
                    start_time = time.time()
                    
                    try:
                        result = processor.parse_webmcp(test_case)
                        end_time = time.time()
                        
                        model_results.append({
                            'model': model,
                            'iteration': i,
                            'test_case_id': test_cases.index(test_case),
                            'processing_time': end_time - start_time,
                            'token_reduction': result.token_reduction,
                            'original_tokens': result.original_tokens,
                            'optimized_tokens': result.optimized_tokens,
                            'cost': result.cost_analysis.optimized_cost if hasattr(result, 'cost_analysis') else 0,
                            'success': True
                        })
                    except Exception as e:
                        model_results.append({
                            'model': model,
                            'iteration': i,
                            'test_case_id': test_cases.index(test_case),
                            'processing_time': 0,
                            'token_reduction': 0,
                            'original_tokens': 0,
                            'optimized_tokens': 0,
                            'cost': 0,
                            'success': False,
                            'error': str(e)
                        })
            
            results[model] = model_results
        
        return self.analyze_results(results)
    
    def analyze_results(self, raw_results: Dict) -> Dict:
        """Analyze benchmark results and generate statistics"""
        analysis = {}
        
        for model, results in raw_results.items():
            df = pd.DataFrame(results)
            successful_results = df[df['success'] == True]
            
            if len(successful_results) == 0:
                analysis[model] = {'error': 'No successful results'}
                continue
            
            analysis[model] = {
                'success_rate': len(successful_results) / len(results) * 100,
                'processing_time': {
                    'mean': successful_results['processing_time'].mean(),
                    'median': successful_results['processing_time'].median(),
                    'std': successful_results['processing_time'].std(),
                    'min': successful_results['processing_time'].min(),
                    'max': successful_results['processing_time'].max(),
                    'p95': successful_results['processing_time'].quantile(0.95)
                },
                'token_reduction': {
                    'mean': successful_results['token_reduction'].mean(),
                    'median': successful_results['token_reduction'].median(),
                    'std': successful_results['token_reduction'].std(),
                    'min': successful_results['token_reduction'].min(),
                    'max': successful_results['token_reduction'].max()
                },
                'cost': {
                    'mean': successful_results['cost'].mean(),
                    'median': successful_results['cost'].median(),
                    'total': successful_results['cost'].sum()
                },
                'throughput': len(successful_results) / successful_results['processing_time'].sum(),
                'raw_data': results
            }
        
        return analysis
    
    def compare_models(self, analysis: Dict) -> Dict:
        """Compare models and perform statistical tests"""
        comparisons = {}
        model_names = list(analysis.keys())
        
        for i in range(len(model_names)):
            for j in range(i + 1, len(model_names)):
                model_a = model_names[i]
                model_b = model_names[j]
                
                if 'error' in analysis[model_a] or 'error' in analysis[model_b]:
                    continue
                
                # Get raw data for statistical tests
                data_a = [r for r in analysis[model_a]['raw_data'] if r['success']]
                data_b = [r for r in analysis[model_b]['raw_data'] if r['success']]
                
                processing_times_a = [r['processing_time'] for r in data_a]
                processing_times_b = [r['processing_time'] for r in data_b]
                
                token_reductions_a = [r['token_reduction'] for r in data_a]
                token_reductions_b = [r['token_reduction'] for r in data_b]
                
                # Perform t-tests
                processing_time_test = stats.ttest_ind(processing_times_a, processing_times_b)
                token_reduction_test = stats.ttest_ind(token_reductions_a, token_reductions_b)
                
                comparisons[f"{model_a}_vs_{model_b}"] = {
                    'processing_time': {
                        'mean_difference': statistics.mean(processing_times_a) - statistics.mean(processing_times_b),
                        'p_value': processing_time_test.pvalue,
                        'significant': processing_time_test.pvalue < 0.05,
                        'winner': model_a if statistics.mean(processing_times_a) < statistics.mean(processing_times_b) else model_b
                    },
                    'token_reduction': {
                        'mean_difference': statistics.mean(token_reductions_a) - statistics.mean(token_reductions_b),
                        'p_value': token_reduction_test.pvalue,
                        'significant': token_reduction_test.pvalue < 0.05,
                        'winner': model_a if statistics.mean(token_reductions_a) > statistics.mean(token_reductions_b) else model_b
                    }
                }
        
        return comparisons
    
    def generate_visualizations(self, analysis: Dict, output_dir: str = './benchmark_plots'):
        """Generate benchmark visualization plots"""
        import os
        os.makedirs(output_dir, exist_ok=True)
        
        # Prepare data for plotting
        plot_data = []
        for model, data in analysis.items():
            if 'error' not in data:
                plot_data.append({
                    'model': model,
                    'avg_processing_time': data['processing_time']['mean'],
                    'avg_token_reduction': data['token_reduction']['mean'],
                    'avg_cost': data['cost']['mean'],
                    'throughput': data['throughput'],
                    'success_rate': data['success_rate']
                })
        
        df = pd.DataFrame(plot_data)
        
        # Processing Time Comparison
        plt.figure(figsize=(10, 6))
        sns.barplot(data=df, x='model', y='avg_processing_time')
        plt.title('Average Processing Time by Model')
        plt.ylabel('Processing Time (seconds)')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/processing_time_comparison.png')
        plt.close()
        
        # Token Reduction Comparison
        plt.figure(figsize=(10, 6))
        sns.barplot(data=df, x='model', y='avg_token_reduction')
        plt.title('Average Token Reduction by Model')
        plt.ylabel('Token Reduction (%)')
        plt.xticks(rotation=45)
        plt.tight_layout()
        plt.savefig(f'{output_dir}/token_reduction_comparison.png')
        plt.close()
        
        # Cost vs Performance Scatter Plot
        plt.figure(figsize=(10, 8))
        scatter = plt.scatter(df['avg_cost'], df['avg_token_reduction'], 
                            s=df['throughput']*10, alpha=0.7)
        
        for i, model in enumerate(df['model']):
            plt.annotate(model, (df.iloc[i]['avg_cost'], df.iloc[i]['avg_token_reduction']))
        
        plt.xlabel('Average Cost per Optimization')
        plt.ylabel('Average Token Reduction (%)')
        plt.title('Cost vs Performance (bubble size = throughput)')
        plt.tight_layout()
        plt.savefig(f'{output_dir}/cost_vs_performance.png')
        plt.close()
        
        # Multi-metric Radar Chart
        fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))
        
        metrics = ['avg_processing_time', 'avg_token_reduction', 'avg_cost', 'throughput', 'success_rate']
        angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
        angles += angles[:1]  # Complete the circle
        
        for _, row in df.iterrows():
            values = []
            for metric in metrics:
                # Normalize values to 0-100 scale
                if metric == 'avg_processing_time' or metric == 'avg_cost':
                    # Lower is better, so invert
                    normalized = 100 - (row[metric] / df[metric].max() * 100)
                else:
                    normalized = row[metric] / df[metric].max() * 100
                values.append(normalized)
            
            values += values[:1]  # Complete the circle
            ax.plot(angles, values, 'o-', linewidth=2, label=row['model'])
            ax.fill(angles, values, alpha=0.25)
        
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(metrics)
        ax.set_ylim(0, 100)
        plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
        plt.title('Multi-Metric Performance Comparison')
        plt.tight_layout()
        plt.savefig(f'{output_dir}/radar_comparison.png')
        plt.close()
        
        print(f"Visualizations saved to {output_dir}")
    
    def export_results(self, analysis: Dict, comparisons: Dict, filename: str = 'benchmark_results.xlsx'):
        """Export results to Excel file"""
        with pd.ExcelWriter(filename) as writer:
            # Summary sheet
            summary_data = []
            for model, data in analysis.items():
                if 'error' not in data:
                    summary_data.append({
                        'Model': model,
                        'Success Rate (%)': data['success_rate'],
                        'Avg Processing Time (s)': data['processing_time']['mean'],
                        'Avg Token Reduction (%)': data['token_reduction']['mean'],
                        'Avg Cost': data['cost']['mean'],
                        'Throughput (ops/s)': data['throughput']
                    })
            
            summary_df = pd.DataFrame(summary_data)
            summary_df.to_excel(writer, sheet_name='Summary', index=False)
            
            # Detailed results for each model
            for model, data in analysis.items():
                if 'error' not in data:
                    detailed_df = pd.DataFrame(data['raw_data'])
                    detailed_df.to_excel(writer, sheet_name=f'{model}_detailed', index=False)
            
            # Comparisons sheet
            comparison_data = []
            for comparison, data in comparisons.items():
                comparison_data.append({
                    'Comparison': comparison,
                    'Processing Time Winner': data['processing_time']['winner'],
                    'Processing Time P-Value': data['processing_time']['p_value'],
                    'Processing Time Significant': data['processing_time']['significant'],
                    'Token Reduction Winner': data['token_reduction']['winner'],
                    'Token Reduction P-Value': data['token_reduction']['p_value'],
                    'Token Reduction Significant': data['token_reduction']['significant']
                })
            
            comparison_df = pd.DataFrame(comparison_data)
            comparison_df.to_excel(writer, sheet_name='Statistical_Comparisons', index=False)
        
        print(f"Results exported to {filename}")

# Usage example
benchmark_suite = WebMCPBenchmarkSuite()

# Test cases
test_cases = [
    '<form><input name="email" type="email"><button>Submit</button></form>',
    '<form><input name="name"><input name="phone"><select name="country"><option>US</option></select><button>Save</button></form>',
    # Add more complex test cases...
]

# Run benchmark
analysis = benchmark_suite.run_performance_benchmark(test_cases, iterations=100)
comparisons = benchmark_suite.compare_models(analysis)

# Generate visualizations
benchmark_suite.generate_visualizations(analysis)

# Export results
benchmark_suite.export_results(analysis, comparisons)

print("Benchmark complete! Check the generated files for detailed results.")

Benchmarking Best Practices

Guidelines for conducting reliable and meaningful benchmarks

Benchmark Design

Use Representative Test Cases

Select test cases that represent your actual use cases

Include a variety of form complexities, element types, and real-world scenarios

Sufficient Sample Size

Use adequate sample sizes for statistical significance

Minimum 30 samples per variant, 100+ for reliable results
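
A rough per-variant sample size can be estimated with the standard two-sample formula n ≈ 2(z₁₋α/₂ + z₁₋β)²σ²/δ², where σ is the metric's standard deviation and δ is the smallest difference worth detecting. The sketch below is a minimal example with conventional values (α = 0.05, 80% power); it is not part of the webMCP API.

JavaScript
// Estimate samples per variant for a two-sample comparison
// z values: 1.96 for alpha = 0.05 (two-sided), 0.84 for 80% power
function requiredSampleSize(stdDev, minDetectableDifference, zAlpha = 1.96, zBeta = 0.84) {
  const n = 2 * Math.pow(zAlpha + zBeta, 2) * Math.pow(stdDev, 2) /
            Math.pow(minDetectableDifference, 2);
  return Math.ceil(n);
}

// e.g. detecting a 2-point token-reduction difference when the std dev is about 5
console.log(requiredSampleSize(5, 2)); // ≈ 98 samples per variant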

Control for Variables

Isolate variables to ensure fair comparisons

Keep hardware, network conditions, and test data consistent

Performance Metrics

Multiple Metrics

Track multiple performance dimensions

Include processing time, cost, accuracy, and quality metrics

Percentile Analysis

Look beyond averages to understand distribution

Track P50, P95, and P99 percentiles for a comprehensive view
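
Percentiles are straightforward to compute from per-run timings; the sketch below is a minimal, dependency-free example (nearest-rank method) rather than a webMCP utility.

JavaScript
// Nearest-rank percentile over an array of per-run processing times
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, index))];
}

const times = [0.8, 0.9, 1.1, 1.2, 2.4, 0.7, 1.0]; // e.g. processing times in seconds
console.log({
  p50: percentile(times, 50),
  p95: percentile(times, 95),
  p99: percentile(times, 99)
});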

Baseline Comparisons

Always compare against established baselines

Maintain consistent baseline configurations for trend analysis

Statistical Rigor

Statistical Significance

Use proper statistical tests for comparisons

Apply t-tests, ANOVA, or non-parametric tests as appropriate

Confidence Intervals

Report confidence intervals with point estimates

Use 95% confidence intervals for business decisions

Multiple Comparisons

Adjust for the multiple comparisons problem

Use Bonferroni correction or similar methods
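
With the Bonferroni correction, the significance threshold is divided by the number of comparisons, so each individual test must clear a stricter bar. A minimal sketch, independent of the webMCP API:

JavaScript
// Bonferroni correction: require p < alpha / (number of comparisons)
function bonferroniSignificant(pValues, alpha = 0.05) {
  const adjustedAlpha = alpha / pValues.length;
  return pValues.map(p => ({ p, adjustedAlpha, significant: p < adjustedAlpha }));
}

// e.g. three pairwise model comparisons: only p = 0.01 survives at alpha / 3 ≈ 0.0167
console.log(bonferroniSignificant([0.01, 0.04, 0.20]));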

Optimize with Confidence

Use comprehensive benchmarking to make data-driven optimization decisions