Run This Evaluation Tomorrow: A Step‑by‑Step Playbook for Apples‑to‑Apples VLM Comparisons
Even small inconsistencies—32 pixels here, a different stop sequence there—can swing vision–language model (VLM) scores by several points on flagship benchmarks. With APIs also introducing server-side variability, many “state-of-the-art” claims don’t replicate on a second pass. If you need credible, audit-ready results tomorrow, you must lock model versions, normalize inputs, fix decoding, and run the same harness across all models.
This article is a practical, end-to-end runbook for executing an apples-to-apples VLM evaluation using community tooling and reproducible artifacts. We’ll pin model IDs and endpoints (e.g., GLM-Image via ZhipuAI’s API alongside GPT‑4o, Claude, and Gemini), standardize prompts and seeds, and lean on VLMEvalKit and LMMS‑Eval for orchestration and official scoring. You’ll learn exactly how to set up deterministic containers and GPUs, implement a neutral prompting/decoding regime, preprocess images and video consistently, and run a balanced benchmark suite (MMBench, MM‑Vet, MMMU, VQA/GQA, COCO, RefCOCO, TextVQA/TextCaps/DocVQA/ChartQA, NLVR2/ScienceQA, and MSRVTT‑QA/NExT‑QA) with robust statistics and clean archival [18–38]. We’ll also cover JSON schema enforcement, Vision–RAG ablations, latency/cost tracking, and common pitfalls.
Architecture/Implementation Details
Preparation checklist
- Lock exact model IDs and API endpoints before any scoring; re‑validate right before the full run. For example: ZhipuAI GLM‑Image on the Zhipu Open Platform, the OpenAI GPT‑4o family, Anthropic’s vision‑capable Claude 3.5 models, and Google’s multimodal Gemini models. Record context/vision token limits and any JSON/tool‑calling modes [2–4][8–9].
- Freeze datasets and splits from their official sources: MMBench, MM‑Vet, MMMU, VQA v2, GQA, COCO with the COCO Caption toolkit, RefCOCO family, TextVQA/TextCaps, DocVQA/InfographicVQA, ChartQA, NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA, POPE/CHAIR/ImageNet‑C for robustness [36–38].
- Publish container image digests and determinism toggles; unit-test preprocessing to catch drift.
Environment setup
- Base OS: Ubuntu LTS image with pinned CUDA/cuDNN. Record CUDA driver/toolkit versions for reproducibility.
- GPU: Single A100 80GB or H100 80GB, MIG disabled, power limits logged.
- Containers: Pull by digest and record it in your run manifest. Prefer Docker’s immutable references and disable automatic layer updates.
Example Docker pull/run by digest:
docker pull nvcr.io/nvidia/pytorch@sha256:<digest>
docker run --gpus all --rm \
  -e CUDA_VISIBLE_DEVICES=0 \
  -e CUBLAS_WORKSPACE_CONFIG=:4096:8 \
  -v $PWD:/workspace \
  nvcr.io/nvidia/pytorch@sha256:<digest> bash
Determinism switches
- Seeds: set at Python, NumPy, and PyTorch; control CUDA/cuDNN determinism where available.
- cuDNN: enable deterministic convolution algorithms; accept potential speed trade-offs.
- cuBLAS: set CUBLAS_WORKSPACE_CONFIG to a small, fixed workspace to stabilize GEMM selection.
import os, random, numpy as np, torch
# Stabilize cuBLAS GEMM algorithm selection (required by use_deterministic_algorithms on CUDA >= 10.2)
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Seed every RNG the stack touches
random.seed(1337)
np.random.seed(1337)
torch.manual_seed(1337)
torch.cuda.manual_seed_all(1337)
# Deterministic cuDNN kernels; disable autotuning so algorithms don't change run to run
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
Prompt and decode templates
- Neutral system prompt (identical across models): “You are a helpful, precise vision-language assistant. Follow instructions exactly. If uncertain, say ‘not sure’. Avoid unsupported claims.”
- Decoding defaults: temperature=0.2, top_p=0.9 (top_k off unless required), and fixed stop sequences. Keep max output tokens per task constant (a shared decode-config sketch follows this list).
- JSON mode or function/tool calling if offered by the provider to reduce schema violations [8–9].
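To keep decoding identical across providers, it helps to define the settings once and merge them into every request. The sketch below is a minimal illustration of that pattern; DECODE_DEFAULTS mirrors the values above, and apply_decode_defaults is a hypothetical helper, not part of any provider SDK.
DECODE_DEFAULTS = {
    "temperature": 0.2,
    "top_p": 0.9,
    "max_tokens": 64,   # keep constant per task across all models
    "stop": ["\n\n"],   # example stop sequence; fix one set for the whole run
}

def apply_decode_defaults(request: dict, overrides: dict | None = None) -> dict:
    # Merge the shared decode settings into a provider-specific request payload.
    return {**request, **DECODE_DEFAULTS, **(overrides or {})}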
Input pipeline details
- Resize policy: cap the long side at 2048 px and preserve aspect ratio with letterboxing; log all preprocessing parameters (a letterbox sketch follows this list). If a model has lower internal caps, use deterministic tiling with overlap and stitching so no model is penalized by its native limit.
- Multi‑image prompts: enumerate fixed indices and preserve order.
- Short video: sample K=8 or 16 evenly spaced frames for models that accept multiple images; log frame indices so the inputs are identical across models.
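One deterministic way to implement the shared resize/letterbox step is with Pillow; this is a sketch under the 2048 px policy above, and the pad color and bicubic interpolation are choices made here, not requirements.
from PIL import Image

def letterbox(img: Image.Image, long_side: int = 2048, pad_color=(0, 0, 0)) -> Image.Image:
    # Scale so the longer edge equals long_side, preserving aspect ratio.
    w, h = img.size
    scale = long_side / max(w, h)
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = img.convert("RGB").resize((new_w, new_h), Image.BICUBIC)
    # Pad onto a fixed square canvas so every model sees identical geometry.
    canvas = Image.new("RGB", (long_side, long_side), pad_color)
    canvas.paste(resized, ((long_side - new_w) // 2, (long_side - new_h) // 2))
    return canvas
Log long_side, interpolation, and pad color in the run manifest so the pipeline can be reproduced exactly.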
Robust API operations
- Log every request/response payload, headers, chosen regional endpoint, timestamps, and model/version ID.
- Use idempotency keys, exponential backoff, and jitter; randomize request order to reduce time‑of‑day artifacts.
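A minimal retry wrapper with exponential backoff and full jitter might look like the following; send_fn stands in for one API request via whatever client you use, and the broad except is only a placeholder for the provider's transient/rate-limit error types.
import random, time

def call_with_backoff(send_fn, max_retries: int = 5, base: float = 0.5, cap: float = 30.0):
    # send_fn is a zero-argument callable that performs a single API request.
    for attempt in range(max_retries + 1):
        try:
            return send_fn()
        except Exception:
            if attempt == max_retries:
                raise
            # Full jitter: sleep a random duration up to the exponential cap.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))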
Benchmark execution
- Use community harnesses to avoid bespoke bugs: VLMEvalKit and LMMS‑Eval cover many datasets, ship official scoring scripts, and standardize prompt templates. Compare your relative rankings against OpenCompass leaderboards to catch dataset‑handling or split mistakes.
- For each dataset, honor official evaluation: COCO Caption toolkit (CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L); accuracy for classification/QA tasks per official definitions (e.g., MMBench/MMMU/VQA/GQA/NLVR2/ScienceQA) [20–22]. RefCOCO family reports IoU≥0.5 for grounding; restrict comparisons to models that output boxes and mark others “not supported”.
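For COCO Captions specifically, the official toolkit can also be driven from Python; this sketch assumes the pycocotools/pycocoevalcap packages and uses illustrative file paths.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

coco = COCO("annotations/captions_val2017.json")        # ground-truth captions
coco_res = coco.loadRes("runs/coco_predictions.json")   # model predictions
coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()     # score only images with predictions
coco_eval.evaluate()
print(coco_eval.eval)  # metric name -> score (CIDEr, SPICE, Bleu_4, METEOR, ROUGE_L)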
Unsupported features and answer formatting
- If a model cannot output bounding boxes or tool calls, mark that task as “not supported” rather than penalizing it.
- Avoid chain‑of‑thought unless uniformly allowed. Constrain to short answers where benchmarks require exactness.
Multi‑seed scheduling and storage
- Repeat generative items for multiple seeds (e.g., 5×) to estimate variance while keeping decode settings fixed.
- Folder conventions: datasets/<dataset>/<model>/<seed>/ for raw outputs; logs/ for request/response/latency; costs/ for token accounting; telemetry/ for hardware metrics.
Scoring and statistics
- Use official scripts where provided.
- Report 95% CIs via non‑parametric bootstrap (percentile or BCa).
- Use paired tests: McNemar’s for accuracy; paired permutation tests for continuous metrics. Apply Benjamini–Hochberg for multiple comparisons.
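As a concrete illustration, the percentile bootstrap and a paired permutation test can be implemented with NumPy alone; scores below is assumed to be one value per evaluation item, and BCa intervals or scipy.stats.bootstrap can be substituted.
import numpy as np

def bootstrap_ci(scores, n_boot: int = 10000, alpha: float = 0.05, seed: int = 1337):
    # Percentile bootstrap CI for the mean of per-item scores.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

def paired_permutation_pvalue(a, b, n_perm: int = 10000, seed: int = 1337):
    # Two-sided paired permutation (sign-flip) test on per-item differences a_i - b_i.
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(d.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    return float((np.abs((flips * d).mean(axis=1)) >= observed).mean())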
Structured outputs and schema enforcement
- Provide a strict JSON schema for tasks needing structure (extraction, grounding, tool calls). If the API supports JSON mode or function calling, enable it. Track invalid‑JSON rates and implement retry with backoff.
Tools and grounding
- Test function-calling success with simple utilities (calculator, OCR post‑processing) to measure integration reliability.
- For detection/grounding, Florence‑2 is a strong open baseline; normalize box JSON to a common format for IoU evaluation.
Vision–RAG evaluation
- Fix embeddings and the retrieval index across models. Run an ablation: retrieval OFF vs ON to measure incremental gains attributable to RAG rather than the base VLM.
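One way to hold retrieval constant is to build a single index from one frozen embedding model and reuse it for every VLM, toggling retrieval per run. The sketch below assumes FAISS and float32 embeddings produced by a separate, fixed embedding step; any frozen ANN index works the same way.
import numpy as np
import faiss

def build_index(doc_embeddings: np.ndarray) -> faiss.Index:
    # Inner-product index over L2-normalized float32 embeddings (cosine similarity).
    x = np.ascontiguousarray(doc_embeddings, dtype="float32")
    faiss.normalize_L2(x)
    index = faiss.IndexFlatIP(x.shape[1])
    index.add(x)
    return index

def retrieve(index: faiss.Index, query_embedding: np.ndarray, k: int = 5):
    q = np.ascontiguousarray(query_embedding.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(q)
    _, ids = index.search(q, k)
    return ids[0].tolist()  # identical context IDs for every model, so deltas are attributable to RAG
Run each benchmark twice per model (retrieval OFF, then ON with these fixed IDs) and report the delta.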
Efficiency profiling and cost accounting
- Measure time‑to‑first‑token (TTFT) and time‑to‑last‑token (TLTT); report p50/p90/p99 latency under concurrency 1/8/32, in both streaming and non‑streaming modes (a timing sketch follows this list). Log VRAM/CPU RAM and average power for on‑prem runs.
- Compute per‑dataset costs using official provider pricing and vision‑token accounting; cross‑check with invoices or dashboards.
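Measuring TTFT/TLTT from a streaming response reduces to timing the first and last chunk; stream_fn below is a placeholder for whatever streaming call your client exposes.
import time

def time_stream(stream_fn):
    # stream_fn() returns an iterator of response chunks (tokens or deltas).
    start = time.perf_counter()
    ttft = None
    for _chunk in stream_fn():
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
    tltt = time.perf_counter() - start          # time to last token
    return ttft, tltt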
Reporting and archival
- Publish per‑category results with confidence intervals, statistically significant deltas (e.g., vs GLM‑Image as your reference), latency/throughput visualizations, and cost per dataset.
- Archive all artifacts: prompts, seeds, configs, container digests, harness versions, request logs, and raw predictions.
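A run manifest can be a single JSON file written at launch; the fields below mirror the artifacts listed above, with placeholder values to fill in per run.
import json, time

manifest = {
    "run_id": time.strftime("%Y%m%d-%H%M%S"),
    "models": {"glm-image": "<pinned model/version ID>", "gpt-4o": "<pinned model/version ID>"},
    "container_digest": "sha256:<digest>",
    "harness": {"vlmevalkit": "<commit SHA>", "lmms-eval": "<commit SHA>"},
    "seeds": [1337, 1338, 1339, 1340, 1341],
    "decode": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 64},
    "preprocess": {"long_side_px": 2048, "letterbox": True, "video_frames": 8},
}
with open("runs/manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)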
Comparison Tables
Community harnesses and scope
| Harness | Datasets covered (examples) | Official scoring integration | Video support | Structured outputs | Notes |
|---|---|---|---|---|---|
| VLMEvalKit | MMBench, VQA/GQA, COCO, RefCOCO, TextVQA, ChartQA | Yes (e.g., COCO caption toolkit) | Partial via multi‑image frames | Can enforce formats via prompts | Actively maintained; wide model adapters |
| LMMS‑Eval | MMMU, ScienceQA, NLVR2, Text tasks | Yes | Limited | Schema via templates | Strong for reasoning/alignment suites |
| OpenCompass LB | Aggregated results across suites | N/A | N/A | N/A | Use to sanity‑check relative ranks |
Determinism switches (on‑prem/open models)
| Layer | Setting | How |
|---|---|---|
| Python/NumPy | Random seeds | random.seed, np.random.seed |
| PyTorch | torch.manual_seed; deterministic algorithms | torch.use_deterministic_algorithms(True) |
| cuDNN | Deterministic convs | torch.backends.cudnn.deterministic = True (benchmark = False) |
| cuBLAS | Fixed workspace | CUBLAS_WORKSPACE_CONFIG=:4096:8 |
| Containers | Immutable env | Pull/run by digest; record image SHA |
Benchmark suite and metrics
| Category | Datasets | Primary metrics |
|---|---|---|
| Core perception/reasoning | MMBench, MM‑Vet, MMMU, VQA v2, GQA | Accuracy; rubric score (MM‑Vet) [18–22] |
| Captioning | COCO Captions | CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L |
| Grounding | RefCOCO/+/g | IoU≥0.5 accuracy |
| OCR/doc/chart | TextVQA, TextCaps, DocVQA, InfographicVQA, ChartQA | Accuracy/EM/F1 depending on task [25–30] |
| Multi‑image/video | NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA | Accuracy [32–35] |
| Robustness | POPE, CHAIR, ImageNet‑C | Hallucination rates; degradation curves [36–38] |
Best Practices
- Pin everything, publish everything. Model IDs/endpoints, container digests, CUDA/cuDNN versions, harness SHAs, seeds, and prompts should all live in a run manifest.
- Keep inputs identical. Cap long side at 2048 px with letterboxing; if caps differ, tile deterministically and stitch consistently.
- Standardize decoding. temperature=0.2, top_p=0.9, fixed stop sequences and max tokens; avoid chain‑of‑thought unless uniformly allowed.
- Treat unsupported features fairly. Mark RefCOCO as “not supported” for models that cannot emit bounding boxes; don’t shoehorn free‑text coordinates into IoU scoring.
- Measure uncertainty. Multi‑seed runs for generative tasks; bootstrap 95% CIs with SciPy. Use paired tests and BH correction.
- Enforce structure. Use JSON mode or function calling when available; measure invalid‑JSON rate and retry with backoff.
- Instrument everything. Log TTFT/TLTT, p50/p90/p99, concurrency 1/8/32, streaming vs non‑streaming, VRAM/CPU/power. For APIs, capture region and timestamps to diagnose variance.
- Cost with receipts. Compute expected $ using provider pricing pages, then cross‑check invoices/dashboards.
- Grounding baseline. Include Florence‑2 for open‑vocabulary detection and normalize box JSON.
- Sanity‑check rankings. Compare your relative orderings to OpenCompass to flag dataset handling mistakes.
- Safety plumbing in eval. Track toxicity via a third‑party classifier and log refusal behavior on sensitive prompts. Preserve and report image provenance fields where present (C2PA).
Practical Examples
1) VLMEvalKit one‑liner for COCO Captions and VQA v2
# Example: evaluate a model adapter on COCO Captions and VQA v2
# (illustrative flags; check your VLMEvalKit version for the exact CLI)
vlmeval \
  --model openai:gpt-4o \
  --datasets coco_caption,vqa_v2 \
  --temperature 0.2 --top_p 0.9 --max_tokens 64 \
  --image_long_side 2048 --letterbox true \
  --seeds 1337,1338,1339,1340,1341 \
  --json_mode true \
  --out_dir runs/2026-01-15/coco_vqa_gpt4o
2) Strict JSON schema enforcement (Python)
import json
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["answer"]
}

def call_with_retries(client, prompt, schema, retries=2):
    # client.generate is a placeholder for your provider adapter.
    for attempt in range(retries + 1):
        raw = client.generate(prompt, json_mode=True)  # JSON mode if supported
        try:
            resp = json.loads(raw) if isinstance(raw, str) else raw
            validate(instance=resp, schema=schema)
            return resp
        except (json.JSONDecodeError, ValidationError):
            if attempt == retries:
                raise
            # Log the invalid-JSON event here and retry (add backoff for API calls).
3) Consistent frame sampling for video QA
def sample_frames(total_frames, K=8):
    # Even spacing inclusive of both endpoints; fall back to all frames for short clips.
    if total_frames <= K:
        return list(range(total_frames))
    return [round(i * (total_frames - 1) / (K - 1)) for i in range(K)]
Use the same K and indices for all models that accept multi‑image inputs; log them alongside predictions.
4) cURL with idempotency and logging
curl https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Idempotency-Key: $(uuidgen)" \
  -H "Content-Type: application/json" \
  -d @payload.json \
  -o logs/$(date +%s)-resp.json
Record the region (if selectable), timestamps, and the returned model ID; not every provider honors an Idempotency-Key header, so confirm support in its API docs. Repeat similarly for ZhipuAI, Anthropic, and Google endpoints.
5) COCO scoring with official toolkit
python eval_coco.py \
  --annotations annotations/captions_val2017.json \
  --results runs/.../coco_predictions.json \
  --metrics CIDEr SPICE BLEU METEOR ROUGE \
  --bootstrap 10000  # percentile CI
6) Florence‑2 grounding baseline
python florence2_infer.py \
  --images data/refcoco/val/*.jpg \
  --task grounding \
  --out runs/florence2/refcoco_boxes.json  # normalized [x0,y0,x1,y1]
Compare IoU≥0.5 accuracy with VLM outputs (only where the VLM emits boxes).
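The comparison itself boils down to box IoU against the ground truth; a minimal sketch (boxes as [x0, y0, x1, y1] in pixels, one predicted box per ground‑truth box) is:
def box_iou(a, b):
    # a, b: [x0, y0, x1, y1]
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def grounding_accuracy(preds, gts, thr=0.5):
    # Fraction of items whose predicted box overlaps the ground truth at IoU >= thr.
    return sum(box_iou(p, g) >= thr for p, g in zip(preds, gts)) / max(len(gts), 1)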
Conclusion
Replicable VLM comparisons aren’t about clever prompts; they’re about controls. Lock the model IDs and environments, unify preprocessing to a 2048‑px letterbox with deterministic tiling, fix decoding and seeds, and let established harnesses run the official metrics. Measure uncertainty with bootstrap CIs, track latency and cost under controlled concurrency, and archive every artifact so others can reproduce your numbers. If you follow this playbook, you’ll produce results that stand up to peer scrutiny and translate to reliable deployment decisions.
Key takeaways:
- Freeze models, datasets, containers, seeds, and prompts; log every API interaction and environment detail.
- Enforce identical input pipelines, decoding parameters, and structured-output schemas across models.
- Use VLMEvalKit/LMMS‑Eval with official scoring and multi‑seed runs; report 95% CIs and paired tests.
- Profile TTFT/TLTT and concurrency; compute costs from provider pages and cross‑check invoices.
- Publish everything: raw predictions, logs, configs, and container digests.
Next steps:
- Create a run manifest template and commit your first pinned config today.
- Dry‑run 100 samples across two models to validate prompts, schemas, and logging.
- Scale to the full suite with multi‑seed scheduling, then report CIs and significant deltas.
Looking ahead, expect harnesses to add deeper tool use, richer video suites, and provenance checks by default; until then, this runbook is your shortest path to apples‑to‑apples VLM evaluations that others can actually reproduce. ✅