Run This Evaluation Tomorrow: A Step‑by‑Step Playbook for Apples‑to‑Apples VLM Comparisons
Even small inconsistencies—32 pixels here, a different stop sequence there—can swing vision–language model (VLM) scores by several points on flagship benchmarks. With APIs also introducing server-side variability, many “state-of-the-art” claims don’t replicate on a second pass. If you need credible, audit-ready results tomorrow, you must lock model versions, normalize inputs, fix decoding, and run the same harness across all models.
This article is a practical, end-to-end runbook for executing an apples-to-apples VLM evaluation using community tooling and reproducible artifacts. We’ll pin model IDs and endpoints (e.g., GLM-Image via ZhipuAI’s API alongside GPT‑4o, Claude, and Gemini), standardize prompts and seeds, and lean on VLMEvalKit and LMMS‑Eval for orchestration and official scoring. You’ll learn exactly how to set up deterministic containers and GPUs, implement a neutral prompting/decoding regime, preprocess images and video consistently, and run a balanced benchmark suite (MMBench, MM‑Vet, MMMU, VQA/GQA, COCO, RefCOCO, TextVQA/TextCaps/DocVQA/ChartQA, NLVR2/ScienceQA, and MSRVTT‑QA/NExT‑QA) with robust statistics and clean archival [18–38]. We’ll also cover JSON schema enforcement, Vision–RAG ablations, latency/cost tracking, and common pitfalls.
Architecture/Implementation Details
Preparation checklist
- Lock exact model IDs and API endpoints before any scoring; re‑validate right before the full run. For example: ZhipuAI GLM‑Image on the Zhipu Open Platform, the OpenAI GPT‑4o family, Anthropic’s vision‑capable Claude 3.5 models, and Google’s multimodal Gemini models. Record context/vision token limits and any JSON/tool‑calling modes [2–4][8–9].
- Freeze datasets and splits from their official sources: MMBench, MM‑Vet, MMMU, VQA v2, GQA, COCO with the COCO Caption toolkit, RefCOCO family, TextVQA/TextCaps, DocVQA/InfographicVQA, ChartQA, NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA, POPE/CHAIR/ImageNet‑C for robustness [36–38].
- Publish container image digests and determinism toggles; unit-test preprocessing to catch drift.
Environment setup
- Base OS: Ubuntu LTS image with pinned CUDA/cuDNN. Record CUDA driver/toolkit versions for reproducibility.
- GPU: Single A100 80GB or H100 80GB, MIG disabled, power limits logged.
- Containers: Pull by digest and record it in your run manifest. Prefer Docker’s immutable references and disable automatic layer updates.
Example Docker pull/run by digest:
docker pull nvcr.io/nvidia/pytorch@sha256:<digest>
docker run --gpus all --rm \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e CUBLAS_WORKSPACE_CONFIG=:4096:8 \
  -v $PWD:/workspace \
  nvcr.io/nvidia/pytorch@sha256:<digest> bash
Determinism switches
- Seeds: set at Python, NumPy, and PyTorch; control CUDA/cuDNN determinism where available.
- cuDNN: enable deterministic convolution algorithms; accept potential speed trade-offs.
- cuBLAS: set CUBLAS_WORKSPACE_CONFIG to a small, fixed workspace to stabilize GEMM selection.
import os, random, numpy as np, torch
# Stabilize cuBLAS GEMM algorithm selection (required by use_deterministic_algorithms on CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Seed every RNG the stack touches
random.seed(1337)
np.random.seed(1337)
torch.manual_seed(1337)
torch.cuda.manual_seed_all(1337)
# Deterministic cuDNN kernels; disable autotuning so algorithms don't change run to run
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
Prompt and decode templates
- Neutral system prompt (identical across models): “You are a helpful, precise vision-language assistant. Follow instructions exactly. If uncertain, say ‘not sure’. Avoid unsupported claims.”
- Decoding defaults: temperature=0.2, top_p=0.9 (top_k off unless required), and fixed stop sequences. Keep max output tokens per task constant (a shared decode-config sketch follows this list).
- JSON mode or function/tool calling if offered by the provider to reduce schema violations [8–9].
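To keep decoding identical across providers, it helps to define the settings once and merge them into every request. The sketch below is a minimal illustration of that pattern; DECODE_DEFAULTS mirrors the values above, and apply_decode_defaults is a hypothetical helper, not part of any provider SDK.
DECODE_DEFAULTS = {
    "temperature": 0.2,
    "top_p": 0.9,
    "max_tokens": 64,   # keep constant per task across all models
    "stop": ["\n\n"],   # example stop sequence; fix one set for the whole run
}

def apply_decode_defaults(request: dict, overrides: dict | None = None) -> dict:
    # Merge the shared decode settings into a provider-specific request payload.
    return {**request, **DECODE_DEFAULTS, **(overrides or {})}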
Input pipeline details
- Resize policy: cap the long side at 2048 px and preserve aspect ratio with letterboxing; log all preprocessing parameters (a letterbox sketch follows this list). If a model has lower internal caps, use deterministic tiling with overlap and stitching so no model is penalized by its native limit.
- Multi‑image prompts: enumerate fixed indices and preserve order.
- Short video: sample K=8 or 16 evenly spaced frames for models that accept multiple images; log frame indices so the inputs are identical across models.
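One deterministic way to implement the shared resize/letterbox step is with Pillow; this is a sketch under the 2048 px policy above, and the pad color and bicubic interpolation are choices made here, not requirements.
from PIL import Image

def letterbox(img: Image.Image, long_side: int = 2048, pad_color=(0, 0, 0)) -> Image.Image:
    # Scale so the longer edge equals long_side, preserving aspect ratio.
    w, h = img.size
    scale = long_side / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = img.convert("RGB").resize((new_w, new_h), Image.BICUBIC)
    # Pad onto a fixed square canvas so every model sees identical geometry.
    canvas = Image.new("RGB", (long_side, long_side), pad_color)
    canvas.paste(resized, ((long_side - new_w) // 2, (long_side - new_h) // 2))
    return canvas
Log long_side, interpolation, and pad color in the run manifest so the pipeline can be reproduced exactly.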
Robust API operations
- Log every request/response payload, headers, chosen regional endpoint, timestamps, and model/version ID.
- Use idempotency keys, exponential backoff, and jitter; randomize request order to reduce time‑of‑day artifacts.
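A minimal retry wrapper with exponential backoff and full jitter might look like the following; send_fn stands in for one API request via whatever client you use, and the broad except is only a placeholder for the provider's transient/rate-limit error types.
import random, time

def call_with_backoff(send_fn, max_retries: int = 5, base: float = 0.5, cap: float = 30.0):
    # send_fn is a zero-argument callable that performs a single API request.
    for attempt in range(max_retries + 1):
        try:
            return send_fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Full jitter: sleep a random duration up to the exponential cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))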
Benchmark execution
- Use community harnesses to avoid bespoke bugs: VLMEvalKit and LMMS‑Eval cover many datasets, ship official scoring scripts, and standardize prompt templates. Compare your relative rankings against OpenCompass leaderboards to catch dataset‑handling or split mistakes.
- For each dataset, honor official evaluation: COCO Caption toolkit (CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L); accuracy for classification/QA tasks per official definitions (e.g., MMBench/MMMU/VQA/GQA/NLVR2/ScienceQA) [20–22]. RefCOCO family reports IoU≥0.5 for grounding; restrict comparisons to models that output boxes and mark others “not supported”.
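For COCO Captions specifically, the official toolkit can also be driven from Python; this sketch assumes the pycocotools/pycocoevalcap packages and uses illustrative file paths.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/captions_val2017.json")        # ground-truth captions
coco_res = coco.loadRes("runs/coco_predictions.json")   # model predictions
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()     # score only images with predictions
coco_eval.evaluate()
print(coco_eval.eval)  # metric name -> score (CIDEr, SPICE, Bleu_4, METEOR, ROUGE_L)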
Unsupported features and answer formatting
- If a model cannot output bounding boxes or tool calls, mark that task as “not supported” rather than penalizing it.
- Avoid chain‑of‑thought unless uniformly allowed. Constrain to short answers where benchmarks require exactness.
Multi‑seed scheduling and storage
- Repeat generative items for multiple seeds (e.g., 5×) to estimate variance while keeping decode settings fixed.
- Folder conventions: datasets/<dataset>/<model>/<seed>/ for raw outputs; logs/ for request/response/latency; costs/ for token accounting; telemetry/ for hardware metrics.
Scoring and statistics
- Use official scripts where provided.
- Report 95% CIs via non‑parametric bootstrap (percentile or BCa).
- Use paired tests: McNemar’s for accuracy; paired permutation tests for continuous metrics. Apply Benjamini–Hochberg for multiple comparisons.
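As a concrete illustration, the percentile bootstrap and a paired permutation test can be implemented with NumPy alone; scores below is assumed to be one value per evaluation item, and BCa intervals or scipy.stats.bootstrap can be substituted.
import numpy as np

def bootstrap_ci(scores, n_boot: int = 10000, alpha: float = 0.05, seed: int = 1337):
    # Percentile bootstrap CI for the mean of per-item scores.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

def paired_permutation_pvalue(a, b, n_perm: int = 10000, seed: int = 1337):
    # Two-sided paired permutation (sign-flip) test on per-item differences a_i - b_i.
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    return float((np.abs((flips * d).mean(axis=1)) >= observed).mean())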
Structured outputs and schema enforcement
- Provide a strict JSON schema for tasks needing structure (extraction, grounding, tool calls). If the API supports JSON mode or function calling, enable it. Track invalid‑JSON rates and implement retry with backoff.
Tools and grounding
- Test function-calling success with simple utilities (calculator, OCR post‑processing) to measure integration reliability.
- For detection/grounding, Florence‑2 is a strong open baseline; normalize box JSON to a common format for IoU evaluation.
Vision–RAG evaluation
- Fix embeddings and the retrieval index across models. Run an ablation: retrieval OFF vs ON to measure incremental gains attributable to RAG rather than the base VLM.
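One way to hold retrieval constant is to build a single index from one frozen embedding model and reuse it for every VLM, toggling retrieval per run. The sketch below assumes FAISS and float32 embeddings produced by a separate, fixed embedding step; any frozen ANN index works the same way.
import numpy as np
import faiss

def build_index(doc_embeddings: np.ndarray) -> faiss.Index:
    # Inner-product index over L2-normalized float32 embeddings (cosine similarity).
    x = np.ascontiguousarray(doc_embeddings, dtype="float32")
    faiss.normalize_L2(x)
    index = faiss.IndexFlatIP(x.shape[1])
    index.add(x)
    return index

def retrieve(index: faiss.Index, query_embedding: np.ndarray, k: int = 5):
    q = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return ids[0].tolist()  # identical context IDs for every model, so deltas are attributable to RAG
Run each benchmark twice per model (retrieval OFF, then ON with these fixed IDs) and report the delta.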
Efficiency profiling and cost accounting
- Measure time‑to‑first‑token (TTFT) and time‑to‑last‑token (TLTT); report p50/p90/p99 latency under concurrency 1/8/32, in both streaming and non‑streaming modes (a timing sketch follows this list). Log VRAM/CPU RAM and average power for on‑prem runs.
- Compute per‑dataset costs using official provider pricing and vision‑token accounting; cross‑check with invoices or dashboards.
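Measuring TTFT/TLTT from a streaming response reduces to timing the first and last chunk; stream_fn below is a placeholder for whatever streaming call your client exposes.
import time

def time_stream(stream_fn):
    # stream_fn() returns an iterator of response chunks (tokens or deltas).
    start = time.perf_counter()
    ttft = None
    for _chunk in stream_fn():
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    tltt = time.perf_counter() - start          # time to last token
    return ttft, tltt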
Reporting and archival
- Publish per‑category results with confidence intervals, statistically significant deltas (e.g., vs GLM‑Image as your reference), latency/throughput visualizations, and cost per dataset.
- Archive all artifacts: prompts, seeds, configs, container digests, harness versions, request logs, and raw predictions.
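A run manifest can be a single JSON file written at launch; the fields below mirror the artifacts listed above, with placeholder values to fill in per run.
import json, time

manifest = {
    "run_id": time.strftime("%Y%m%d-%H%M%S"),
    "models": {"glm-image": "<pinned model/version ID>", "gpt-4o": "<pinned model/version ID>"},
    "container_digest": "sha256:<digest>",
    "harness": {"vlmevalkit": "<commit SHA>", "lmms-eval": "<commit SHA>"},
    "seeds": [1337, 1338, 1339, 1340, 1341],
    "decode": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 64},
    "preprocess": {"long_side_px": 2048, "letterbox": True, "video_frames": 8},
}
with open("runs/manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)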
Comparison Tables
Community harnesses and scope
| Harness | Datasets covered (examples) | Official scoring integration | Video support | Structured outputs | Notes |
|---|---|---|---|---|---|
| VLMEvalKit | MMBench, VQA/GQA, COCO, RefCOCO, TextVQA, ChartQA | Yes (e.g., COCO caption toolkit) | Partial via multi‑image frames | Can enforce formats via prompts | Actively maintained; wide model adapters |
| LMMS‑Eval | MMMU, ScienceQA, NLVR2, Text tasks | Yes | Limited | Schema via templates | Strong for reasoning/alignment suites |
| OpenCompass LB | Aggregated results across suites | N/A | N/A | N/A | Use to sanity‑check relative ranks |
Determinism switches (on‑prem/open models)
| Layer | Setting | How |
|---|---|---|
| Python/NumPy | Random seeds | random.seed, np.random.seed |
| PyTorch | torch.manual_seed; deterministic algorithms | torch.use_deterministic_algorithms(True) |
| cuDNN | Deterministic convs | torch.backends.cudnn.deterministic = True (benchmark = False) |
| cuBLAS | Fixed workspace | CUBLAS_WORKSPACE_CONFIG=:4096:8 |
| Containers | Immutable env | Pull/run by digest; record image SHA |
Benchmark suite and metrics
| Category | Datasets | Primary metrics |
|---|---|---|
| Core perception/reasoning | MMBench, MM‑Vet, MMMU, VQA v2, GQA | Accuracy; rubric score (MM‑Vet) [18–22] |
| Captioning | COCO Captions | CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L |
| Grounding | RefCOCO/+/g | IoU≥0.5 accuracy |
| OCR/doc/chart | TextVQA, TextCaps, DocVQA, InfographicVQA, ChartQA | Accuracy/EM/F1 depending on task [25–30] |
| Multi‑image/video | NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA | Accuracy [32–35] |
| Robustness | POPE, CHAIR, ImageNet‑C | Hallucination rates; degradation curves [36–38] |
Best Practices
- Pin everything, publish everything. Model IDs/endpoints, container digests, CUDA/cuDNN versions, harness SHAs, seeds, and prompts should all live in a run manifest.
- Keep inputs identical. Cap long side at 2048 px with letterboxing; if caps differ, tile deterministically and stitch consistently.
- Standardize decoding. temperature=0.2, top_p=0.9, fixed stop sequences and max tokens; avoid chain‑of‑thought unless uniformly allowed.
- Treat unsupported features fairly. Mark RefCOCO as “not supported” for models that cannot emit bounding boxes; don’t shoehorn free‑text coordinates into IoU scoring.
- Measure uncertainty. Multi‑seed runs for generative tasks; bootstrap 95% CIs with SciPy. Use paired tests and BH correction.
- Enforce structure. Use JSON mode or function calling when available; measure invalid‑JSON rate and retry with backoff.
- Instrument everything. Log TTFT/TLTT, p50/p90/p99, concurrency 1/8/32, streaming vs non‑streaming, VRAM/CPU/power. For APIs, capture region and timestamps to diagnose variance.
- Cost with receipts. Compute expected $ using provider pricing pages, then cross‑check invoices/dashboards.
- Grounding baseline. Include Florence‑2 for open‑vocabulary detection and normalize box JSON.
- Sanity‑check rankings. Compare your relative orderings to OpenCompass to flag dataset handling mistakes.
- Safety plumbing in eval. Track toxicity via a third‑party classifier and log refusal behavior on sensitive prompts. Preserve and report image provenance fields where present (C2PA).
Practical Examples
1) VLMEvalKit one‑liner for COCO Captions and VQA v2
# Example: evaluate a model adapter on COCO Captions and VQA v2
# (illustrative flags; check your VLMEvalKit version for the exact CLI)
vlmeval \
  --model openai:gpt-4o \
  --datasets coco_caption,vqa_v2 \
  --temperature 0.2 --top_p 0.9 --max_tokens 64 \
  --image_long_side 2048 --letterbox true \
  --seeds 1337,1338,1339,1340,1341 \
  --json_mode true \
  --out_dir runs/2026-01-15/coco_vqa_gpt4o
2) Strict JSON schema enforcement (Python)
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["answer"]
}

def call_with_retries(client, prompt, schema, retries=2):
    # client.generate is a placeholder for your provider adapter.
    for attempt in range(retries + 1):
        raw = client.generate(prompt, json_mode=True)  # JSON mode if supported
        try:
            resp = json.loads(raw) if isinstance(raw, str) else raw
            validate(instance=resp, schema=schema)
            return resp
        except (json.JSONDecodeError, ValidationError):
            if attempt == retries:
                raise
            # Log the invalid-JSON event here and retry (add backoff for API calls).
3) Consistent frame sampling for video QA
def sample_frames(total_frames, K=8):
    # Even spacing inclusive of both endpoints; fall back to all frames for short clips.
    if total_frames <= K:
        return list(range(total_frames))
    return [round(i * (total_frames - 1) / (K - 1)) for i in range(K)]
Use the same K and indices for all models that accept multi‑image inputs; log them alongside predictions.
4) cURL with idempotency and logging
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d @payload.json \
  -o logs/$(date +%s)-resp.json
Record the region (if selectable), timestamps, and the returned model ID; not every provider honors an Idempotency-Key header, so confirm support in its API docs. Repeat similarly for ZhipuAI, Anthropic, and Google endpoints.
5) COCO scoring with official toolkit
python eval_coco.py \
  --annotations annotations/captions_val2017.json \
  --results runs/.../coco_predictions.json \
  --metrics CIDEr SPICE BLEU METEOR ROUGE \
  --bootstrap 10000  # percentile CI
6) Florence‑2 grounding baseline
python florence2_infer.py \
  --images data/refcoco/val/*.jpg \
  --task grounding \
  --out runs/florence2/refcoco_boxes.json  # normalized [x0,y0,x1,y1]
Compare IoU≥0.5 accuracy with VLM outputs (only where the VLM emits boxes).
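The comparison itself boils down to box IoU against the ground truth; a minimal sketch (boxes as [x0, y0, x1, y1] in pixels, one predicted box per ground‑truth box) is:
def box_iou(a, b):
    # a, b: [x0, y0, x1, y1]
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def grounding_accuracy(preds, gts, thr=0.5):
    # Fraction of items whose predicted box overlaps the ground truth at IoU >= thr.
    return sum(box_iou(p, g) >= thr for p, g in zip(preds, gts)) / max(len(gts), 1)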
Conclusion
Replicable VLM comparisons aren’t about clever prompts; they’re about controls. Lock the model IDs and environments, unify preprocessing to a 2048‑px letterbox with deterministic tiling, fix decoding and seeds, and let established harnesses run the official metrics. Measure uncertainty with bootstrap CIs, track latency and cost under controlled concurrency, and archive every artifact so others can reproduce your numbers. If you follow this playbook, you’ll produce results that stand up to peer scrutiny and translate to reliable deployment decisions.
Key takeaways:
- Freeze models, datasets, containers, seeds, and prompts; log every API interaction and environment detail.
- Enforce identical input pipelines, decoding parameters, and structured-output schemas across models.
- Use VLMEvalKit/LMMS‑Eval with official scoring and multi‑seed runs; report 95% CIs and paired tests.
- Profile TTFT/TLTT and concurrency; compute costs from provider pages and cross‑check invoices.
- Publish everything: raw predictions, logs, configs, and container digests.
Next steps:
- Create a run manifest template and commit your first pinned config today.
- Dry‑run 100 samples across two models to validate prompts, schemas, and logging.
- Scale to the full suite with multi‑seed scheduling, then report CIs and significant deltas.
Looking ahead, expect harnesses to add deeper tool use, richer video suites, and provenance checks by default; until then, this runbook is your shortest path to apples‑to‑apples VLM evaluations that others can actually reproduce. ✅