
Run This Evaluation Tomorrow: A Step‑by‑Step Playbook for Apples‑to‑Apples VLM Comparisons

From container digests and seeds to VLMEvalKit, JSON schemas, and Vision‑RAG ablations, a practical guide for reproducible results

By AI Research Team

Even small inconsistencies—32 pixels here, a different stop sequence there—can swing vision–language model (VLM) scores by several points on flagship benchmarks. With APIs also introducing server-side variability, many “state-of-the-art” claims don’t replicate on a second pass. If you need credible, audit-ready results tomorrow, you must lock model versions, normalize inputs, fix decoding, and run the same harness across all models.

This article is a practical, end-to-end runbook for executing an apples-to-apples VLM evaluation using community tooling and reproducible artifacts. We’ll pin model IDs and endpoints (e.g., GLM-Image via ZhipuAI’s API alongside GPT‑4o, Claude, and Gemini), standardize prompts and seeds, and lean on VLMEvalKit and LMMS‑Eval for orchestration and official scoring. You’ll learn exactly how to set up deterministic containers and GPUs, implement a neutral prompting/decoding regime, preprocess images and video consistently, and run a balanced benchmark suite (MMBench, MM‑Vet, MMMU, VQA/GQA, COCO, RefCOCO, TextVQA/TextCaps/DocVQA/ChartQA, NLVR2/ScienceQA, and MSRVTT‑QA/NExT‑QA) with robust statistics and clean archival [18–38]. We’ll also cover JSON schema enforcement, Vision–RAG ablations, latency/cost tracking, and common pitfalls.

Architecture/Implementation Details

Preparation checklist

  • Lock exact model IDs and API endpoints before any scoring; re‑validate right before the full run. For example: ZhipuAI GLM‑Image on the Zhipu Open Platform, the OpenAI GPT‑4o family, Anthropic Claude 3.5 Sonnet (vision‑capable), and Google Gemini Vision models. Record context/vision token limits and any JSON/tool-calling modes [2–4][8–9].
  • Freeze datasets and splits from their official sources: MMBench, MM‑Vet, MMMU, VQA v2, GQA, COCO with the COCO Caption toolkit, RefCOCO family, TextVQA/TextCaps, DocVQA/InfographicVQA, ChartQA, NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA, POPE/CHAIR/ImageNet‑C for robustness [36–38].
  • Publish container image digests and determinism toggles; unit-test preprocessing to catch drift.
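
A minimal sketch of that preprocessing drift check; the `pipeline.letterbox` import, fixture path, and golden digest are placeholders for your own pipeline and recorded values:

import hashlib
from PIL import Image

from pipeline import letterbox  # placeholder import: your pinned preprocessing function

def preprocess_digest(path):
    # Hash the exact pixel bytes a model would receive; any resize/letterbox drift changes the digest.
    return hashlib.sha256(letterbox(Image.open(path)).tobytes()).hexdigest()

def test_preprocessing_is_stable():
    # Golden digest recorded once from the pinned container; update it only deliberately.
    assert preprocess_digest("tests/fixtures/sample.jpg") == "<recorded sha256 digest>"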

Environment setup

  • Base OS: Ubuntu LTS image with pinned CUDA/cuDNN. Record CUDA driver/toolkit versions for reproducibility.
  • GPU: Single A100 80GB or H100 80GB, MIG disabled, power limits logged.
  • Containers: Pull by digest and record it in your run manifest. Prefer Docker’s immutable references and disable automatic layer updates.

Example Docker pull/run by digest:

docker pull nvcr.io/nvidia/pytorch@sha256:<digest>
docker run --gpus all --rm \
 -e CUDA_VISIBLE_DEVICES=0 \
 -e CUBLAS_WORKSPACE_CONFIG=:4096:8 \
 -e CUDNN_DETERMINISTIC=1 \
 -v $PWD:/workspace \
 nvcr.io/nvidia/pytorch@sha256:<digest> bash

Determinism switches

  • Seeds: set at Python, NumPy, and PyTorch; control CUDA/cuDNN determinism where available.
  • cuDNN: enable deterministic convolution algorithms; accept potential speed trade-offs.
  • cuBLAS: set CUBLAS_WORKSPACE_CONFIG to a small, fixed workspace to stabilize GEMM selection.

import os, random, numpy as np, torch

# Fix the cuBLAS workspace and force deterministic cuDNN kernels before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["CUDNN_DETERMINISTIC"] = "1"

# Seed every RNG layer the stack touches.
random.seed(1337)
np.random.seed(1337)
torch.manual_seed(1337)
torch.cuda.manual_seed_all(1337)

# Fails loudly if an op lacks a deterministic implementation.
torch.use_deterministic_algorithms(True)

Prompt and decode templates

  • Neutral system prompt (identical across models): “You are a helpful, precise vision-language assistant. Follow instructions exactly. If uncertain, say ‘not sure’. Avoid unsupported claims.”
  • Decoding defaults: temperature=0.2, top_p=0.9 (top_k off unless required), and fixed stop sequences. Keep max output tokens per task constant (a shared config sketch follows this list).
  • JSON mode or function/tool calling if offered by the provider to reduce schema violations [8–9].
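
To guarantee the same regime everywhere, define the prompt and decode settings once and map them onto each provider's SDK in your adapters. A minimal sketch; the key names are illustrative and the stop sequence is a placeholder you fix once and reuse:

# Shared, provider-agnostic settings; model adapters translate these keys
# into each SDK's parameter names so every run uses identical values.
SYSTEM_PROMPT = (
    "You are a helpful, precise vision-language assistant. Follow instructions "
    "exactly. If uncertain, say 'not sure'. Avoid unsupported claims."
)

DECODE_DEFAULTS = {
    "temperature": 0.2,
    "top_p": 0.9,
    "stop": ["<fixed stop sequence>"],  # choose once, reuse for every model
    "max_tokens": 64,                   # constant per task, not per model
}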

Input pipeline details

  • Resize policy: cap long side at 2048 px, preserve aspect with letterboxing; log all preprocessing parameters (see the letterbox sketch after this list). If a model has lower internal caps, use deterministic tiling with overlap and stitching so no model is penalized by its native limit.
  • Multi‑image prompts: enumerate fixed indices and preserve order.
  • Short video: sample K=8 or 16 evenly spaced frames for models that accept multiple images; log frame indices so the inputs are identical across models.
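
A letterbox sketch with Pillow (>= 9.1); padding to a square canvas is one interpretation of the policy, so fix and log whichever variant you choose:

from PIL import Image  # Pillow >= 9.1 for Image.Resampling

def letterbox(img: Image.Image, long_side: int = 2048, pad_color=(0, 0, 0)) -> Image.Image:
    # Cap the longer edge at `long_side`, preserving aspect ratio (never upscale).
    w, h = img.size
    scale = min(1.0, long_side / max(w, h))
    new_w, new_h = max(1, round(w * scale)), max(1, round(h * scale))
    resized = img.convert("RGB").resize((new_w, new_h), Image.Resampling.BICUBIC)
    # Pad the shorter edge to a square canvas so every model sees identical framing.
    side = max(new_w, new_h)
    canvas = Image.new("RGB", (side, side), pad_color)
    canvas.paste(resized, ((side - new_w) // 2, (side - new_h) // 2))
    return canvas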

Robust API operations

  • Log every request/response payload, headers, chosen regional endpoint, timestamps, and model/version ID.
  • Use idempotency keys, exponential backoff, and jitter; randomize request order to reduce time‑of‑day artifacts.
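
A minimal retry sketch with exponential backoff and full jitter; `request_fn` and the caught exception type are placeholders for your client:

import random
import time

def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0, max_delay=30.0):
    # request_fn: zero-argument callable that issues one API request (placeholder).
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:  # narrow this to your client's retryable error types
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))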

Benchmark execution

  • Use community harnesses to avoid bespoke bugs: VLMEvalKit and LMMS‑Eval cover many datasets, official scripts, and standardized templates. Cross‑check relative ranks against the OpenCompass leaderboards to catch dataset‑handling or split mistakes.
  • For each dataset, honor official evaluation: COCO Caption toolkit (CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L); accuracy for classification/QA tasks per official definitions (e.g., MMBench/MMMU/VQA/GQA/NLVR2/ScienceQA) [20–22]. RefCOCO family reports IoU≥0.5 for grounding; restrict comparisons to models that output boxes and mark others “not supported”.

Unsupported features and answer formatting

  • If a model cannot output bounding boxes or tool calls, mark that task as “not supported” rather than penalizing it.
  • Avoid chain‑of‑thought unless uniformly allowed. Constrain to short answers where benchmarks require exactness.

Multi‑seed scheduling and storage

  • Repeat generative items for multiple seeds (e.g., 5×) to estimate variance while keeping decode settings fixed.
  • Folder conventions: datasets/<model>/<dataset>/<seed>/ for raw outputs; logs/ for request/response/latency; costs/ for token accounting; telemetry/ for hardware metrics (a small layout helper follows this list).
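
One way to materialize this layout; the segment names are illustrative, so adapt them to your own convention:

import os

def run_dirs(root, model, dataset, seed):
    # Illustrative layout: datasets/<model>/<dataset>/<seed>/ plus shared logs/costs/telemetry.
    paths = {
        "raw": os.path.join(root, "datasets", model, dataset, str(seed)),
        "logs": os.path.join(root, "logs"),
        "costs": os.path.join(root, "costs"),
        "telemetry": os.path.join(root, "telemetry"),
    }
    for p in paths.values():
        os.makedirs(p, exist_ok=True)
    return paths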

Scoring and statistics

  • Use official scripts where provided.
  • Report 95% CIs via non‑parametric bootstrap (percentile or BCa); a sketch follows this list.
  • Use paired tests: McNemar’s for accuracy; paired permutation tests for continuous metrics. Apply Benjamini–Hochberg for multiple comparisons.
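
A sketch of per-item bootstrap CIs and Benjamini–Hochberg correction; it assumes SciPy and statsmodels are installed and that `scores` and `p_values` are per-item metrics and per-comparison p-values you have already computed:

import numpy as np
from scipy.stats import bootstrap
from statsmodels.stats.multitest import multipletests

def mean_ci(scores, confidence=0.95, seed=1337):
    # Non-parametric BCa bootstrap over items; returns (mean, low, high).
    scores = np.asarray(scores, dtype=float)
    res = bootstrap((scores,), np.mean, confidence_level=confidence,
                    method="BCa", random_state=seed)
    return scores.mean(), res.confidence_interval.low, res.confidence_interval.high

def bh_correct(p_values, alpha=0.05):
    # Benjamini–Hochberg FDR control across model-vs-reference comparisons.
    reject, p_adj, _, _ = multipletests(p_values, alpha=alpha, method="fdr_bh")
    return reject, p_adj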

Structured outputs and schema enforcement

  • Provide a strict JSON schema for tasks needing structure (extraction, grounding, tool calls). If the API supports JSON mode or function calling, enable it. Track invalid‑JSON rates and implement retry with backoff.

Tools and grounding

  • Test function-calling success with simple utilities (calculator, OCR post‑processing) to measure integration reliability.
  • For detection/grounding, Florence‑2 is a strong open baseline; normalize box JSON to a common format for IoU evaluation.
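
A small helper for the IoU≥0.5 criterion, assuming boxes normalized to the [x0, y0, x1, y1] format used here:

def iou(box_a, box_b):
    # Boxes are [x0, y0, x1, y1]; returns intersection-over-union in [0, 1].
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_correct(pred_box, gt_box, threshold=0.5):
    # RefCOCO-style success criterion: predicted box matches ground truth at IoU >= 0.5.
    return iou(pred_box, gt_box) >= threshold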

Vision–RAG evaluation

  • Fix embeddings and the retrieval index across models. Run an ablation: retrieval OFF vs ON to measure incremental gains attributable to RAG rather than the base VLM.
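
A sketch of the OFF/ON ablation loop; `retrieve` and `model.generate` are placeholders for your frozen retriever/index and model adapters:

def rag_ablation(models, questions, retrieve, k=4):
    # Same frozen retriever and index for every model; only the RAG toggle varies.
    results = {}
    for name, model in models.items():
        for use_rag in (False, True):
            answers = []
            for q in questions:
                context = retrieve(q, k=k) if use_rag else []  # placeholder retriever
                answers.append(model.generate(question=q, context=context))
            results[(name, "rag_on" if use_rag else "rag_off")] = answers
    return results  # score both arms with the same official scripts, then compare the deltas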

Efficiency profiling and cost accounting

  • Measure time‑to‑first‑token (TTFT) and time‑to‑last‑token (TLTT); report p50/p90/p99 latency under concurrency 1/8/32, in both streaming and non‑streaming modes. Log VRAM/CPU RAM and average power for on‑prem runs (see the timing sketch after this list).
  • Compute per‑dataset costs using official provider pricing and vision‑token accounting; cross‑check with invoices or dashboards.
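
A minimal timing and cost sketch; `stream_fn` (a callable yielding text chunks) and the per‑1K‑token prices are placeholders for your client and the provider's published rates:

import time

def timed_stream(stream_fn):
    # stream_fn: zero-argument callable returning an iterator of text chunks (placeholder).
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_fn():
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        chunks.append(chunk)
    tltt = time.perf_counter() - start          # time to last token
    return "".join(chunks), ttft, tltt

def request_cost(prompt_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    # Plug in the provider's published per-1K-token rates, including vision-token accounting.
    return prompt_tokens / 1000 * price_in_per_1k + output_tokens / 1000 * price_out_per_1k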

Reporting and archival

  • Publish per‑category results with confidence intervals, statistically significant deltas (e.g., vs GLM‑Image as your reference), latency/throughput visualizations, and cost per dataset.
  • Archive all artifacts: prompts, seeds, configs, container digests, harness versions, request logs, and raw predictions.
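
A minimal run-manifest sketch; the field names are illustrative, and the point is that one pinned file captures everything a reader needs to reproduce the run:

import json

manifest = {
    "models": {"glm-image": "<pinned model ID>", "gpt-4o": "<pinned model ID>"},
    "endpoints": {"zhipuai": "<regional endpoint>", "openai": "<regional endpoint>"},
    "container_digest": "nvcr.io/nvidia/pytorch@sha256:<digest>",
    "cuda": {"driver": "<version>", "toolkit": "<version>", "cudnn": "<version>"},
    "harness": {"vlmevalkit_sha": "<commit>", "lmms_eval_sha": "<commit>"},
    "seeds": [1337, 1338, 1339, 1340, 1341],
    "decoding": {"temperature": 0.2, "top_p": 0.9, "max_tokens": 64},
    "preprocessing": {"long_side": 2048, "letterbox": True, "video_frames": 8},
}

with open("run_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)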

Comparison Tables

Community harnesses and scope

Harness | Datasets covered (examples) | Official scoring integration | Video support | Structured outputs | Notes
VLMEvalKit | MMBench, VQA/GQA, COCO, RefCOCO, TextVQA, ChartQA | Yes (e.g., COCO caption toolkit) | Partial via multi‑image frames | Can enforce formats via prompts | Actively maintained; wide model adapters
LMMS‑Eval | MMMU, ScienceQA, NLVR2, text tasks | Yes | Limited | Schema via templates | Strong for reasoning/alignment suites
OpenCompass LB | Aggregated results across suites | N/A | N/A | N/A | Use to sanity‑check relative ranks

Determinism switches (on‑prem/open models)

Layer | Setting | How
Python/NumPy | Random seeds | random.seed, np.random.seed
PyTorch | Manual seeds; deterministic algorithms | torch.manual_seed, torch.use_deterministic_algorithms(True)
cuDNN | Deterministic convolutions | CUDNN_DETERMINISTIC=1
cuBLAS | Fixed workspace | CUBLAS_WORKSPACE_CONFIG=:4096:8
Containers | Immutable environment | Pull/run by digest; record image SHA

Benchmark suite and metrics

Category | Datasets | Primary metrics
Core perception/reasoning | MMBench, MM‑Vet, MMMU, VQA v2, GQA | Accuracy; rubric score (MM‑Vet) [18–22]
Captioning | COCO Captions | CIDEr, SPICE, BLEU‑4, METEOR, ROUGE‑L
Grounding | RefCOCO/+/g | IoU≥0.5 accuracy
OCR/doc/chart | TextVQA, TextCaps, DocVQA, InfographicVQA, ChartQA | Accuracy/EM/F1 depending on task [25–30]
Multi‑image/video | NLVR2, ScienceQA, MSRVTT‑QA, NExT‑QA | Accuracy [32–35]
Robustness | POPE, CHAIR, ImageNet‑C | Hallucination rates; degradation curves [36–38]

Best Practices

  • Pin everything, publish everything. Model IDs/endpoints, container digests, CUDA/cuDNN versions, harness SHAs, seeds, and prompts should all live in a run manifest.
  • Keep inputs identical. Cap long side at 2048 px with letterboxing; if caps differ, tile deterministically and stitch consistently.
  • Standardize decoding. temperature=0.2, top_p=0.9, fixed stop sequences and max tokens; avoid chain‑of‑thought unless uniformly allowed.
  • Treat unsupported features fairly. Mark RefCOCO as “not supported” for models that cannot emit boxes; don’t shoehorn free‑text coordinate guesses into IoU scoring.
  • Measure uncertainty. Multi‑seed runs for generative tasks; bootstrap 95% CIs with SciPy. Use paired tests and BH correction.
  • Enforce structure. Use JSON mode or function calling when available; measure invalid‑JSON rate and retry with backoff.
  • Instrument everything. Log TTFT/TLTT, p50/p90/p99, concurrency 1/8/32, streaming vs non‑streaming, VRAM/CPU/power. For APIs, capture region and timestamps to diagnose variance.
  • Cost with receipts. Compute expected $ using provider pricing pages, then cross‑check invoices/dashboards.
  • Grounding baseline. Include Florence‑2 for open‑vocabulary detection and normalize box JSON.
  • Sanity‑check rankings. Compare your relative orderings to OpenCompass to flag dataset handling mistakes.
  • Safety plumbing in eval. Track toxicity via a third‑party classifier and log refusal behavior on sensitive prompts. Preserve and report image provenance fields where present (C2PA).

Practical Examples

1) VLMEvalKit one‑liner for COCO Captions and VQA v2

# Illustrative invocation (flag names vary by harness version): evaluate a model adapter on COCO Captions and VQA v2
vlmeval \
 --model openai:gpt-4o \
 --datasets coco_caption,vqa_v2 \
 --temperature 0.2 --top_p 0.9 --max_tokens 64 \
 --image_long_side 2048 --letterbox true \
 --seeds 1337,1338,1339,1340,1341 \
 --json_mode true \
 --out_dir runs/2026-01-15/coco_vqa_gpt4o

2) Strict JSON schema enforcement (Python)

import time
from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "answer": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
    },
    "required": ["answer"]
}

def call_with_retries(client, prompt, schema, retries=2):
    for attempt in range(retries + 1):
        resp = client.generate(prompt, json_mode=True)  # JSON mode if supported
        try:
            validate(resp, schema)
            return resp
        except ValidationError:
            if attempt == retries:
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff between retries

3) Consistent frame sampling for video QA

def sample_frames(total_frames, K=8):
    # Even spacing inclusive of endpoints; indices repeat if total_frames < K.
    if K <= 1 or total_frames <= 1:
        return [0] * max(K, 1)
    return [round(i * (total_frames - 1) / (K - 1)) for i in range(K)]

Use the same K and indices for all models that accept multi‑image inputs; log them alongside predictions.

4) cURL with idempotency and logging

curl https://api.openai.com/v1/chat/completions \
 -H "Authorization: Bearer $OPENAI_API_KEY" \
 -H "Idempotency-Key: $(uuidgen)" \
 -H "Content-Type: application/json" \
 -d @payload.json \
 -o logs/$(date +%s)-resp.json

Record region (if selectable), timestamps, and the returned model ID. Repeat similarly for ZhipuAI, Anthropic, and Google endpoints.

5) COCO scoring with official toolkit

python eval_coco.py \
 --annotations annotations/captions_val2017.json \
 --results runs/.../coco_predictions.json \
 --metrics CIDEr SPICE BLEU METEOR ROUGE \
 --bootstrap 10000 # percentile CI 

6) Florence‑2 grounding baseline

python florence2_infer.py \
 --images data/refcoco/val/*.jpg \
 --task grounding \
 --out runs/florence2/refcoco_boxes.json # normalized [x0,y0,x1,y1]

Compare IoU≥0.5 accuracy with VLM outputs (only where the VLM emits boxes).

Conclusion

Replicable VLM comparisons aren’t about clever prompts; they’re about controls. Lock the model IDs and environments, unify preprocessing to a 2048‑px letterbox with deterministic tiling, fix decoding and seeds, and let established harnesses run the official metrics. Measure uncertainty with bootstrap CIs, track latency and cost under controlled concurrency, and archive every artifact so others can reproduce your numbers. If you follow this playbook, you’ll produce results that stand up to peer scrutiny and translate to reliable deployment decisions.

Key takeaways:

  • Freeze models, datasets, containers, seeds, and prompts; log every API interaction and environment detail.
  • Enforce identical input pipelines, decoding parameters, and structured-output schemas across models.
  • Use VLMEvalKit/LMMS‑Eval with official scoring and multi‑seed runs; report 95% CIs and paired tests.
  • Profile TTFT/TLTT and concurrency; compute costs from provider pages and cross‑check invoices.
  • Publish everything: raw predictions, logs, configs, and container digests.

Next steps:

  • Create a run manifest template and commit your first pinned config today.
  • Dry‑run 100 samples across two models to validate prompts, schemas, and logging.
  • Scale to the full suite with multi‑seed scheduling, then report CIs and significant deltas.

Looking ahead, expect harnesses to add deeper tool use, richer video suites, and provenance checks by default; until then, this runbook is your shortest path to apples‑to‑apples VLM evaluations that others can actually reproduce. ✅

Sources & References

open.bigmodel.cn
ZhipuAI Open Platform (API) Primary API reference to pin GLM-Image model IDs, endpoints, and request/response handling for reproducible evaluations.
platform.openai.com
OpenAI Models (GPT-4o and others) Documents model IDs, capabilities, and limits needed for consistent prompting/decoding and logging across runs.
platform.openai.com
OpenAI Vision Guide Provides guidance on image inputs, JSON mode, and multimodal prompting to standardize evaluation inputs and outputs.
platform.openai.com
OpenAI Function/Tool Calling Supports structured outputs and tool-calling evaluations with enforced JSON schemas and reduced hallucination.
docs.anthropic.com
Anthropic Claude Vision Docs Defines Claude Vision model usage and constraints to align prompts, decoding, and capability checks across models.
ai.google.dev
Google Gemini API Models Specifies Gemini Vision model IDs and features for consistent endpoint pinning and capability documentation.
ai.google.dev
Google Gemini Vision Guide Details image handling and multimodal prompting needed to standardize preprocessing and templates for Gemini.
openai.com
OpenAI API Pricing Used to compute dataset-level and per-request cost accounting and to cross-check observed costs.
docs.anthropic.com
Anthropic Pricing Provides official rates for cost estimation and sensitivity analysis across datasets and regions.
ai.google.dev
Google Gemini Pricing Provides pricing inputs for reproducible cost accounting of multimodal evaluations.
github.com
Microsoft Florence-2 GitHub Open grounding/detection baseline used to normalize box JSON and compare IoU against VLM outputs.
github.com
MMBench (OpenCompass/MMBench) Official benchmark and scripts for broad multimodal reasoning with per-category breakdowns.
mm-vet.github.io
MM-Vet Benchmark Open-ended generative evaluation with rubric-based scoring used to complement closed-form QA.
mmmu-benchmark.github.io
MMMU Benchmark Expert multi-discipline reasoning benchmark providing category-level diagnostics in the suite.
visualqa.org
VQA v2 Dataset Core visual question answering dataset with official splits and scoring used in the evaluation.
cs.stanford.edu
GQA Dataset Compositional scene understanding benchmark required for standardized QA evaluation.
cocodataset.org
COCO Dataset (Captions) Standard captioning dataset whose official toolkit yields comparable text-generation metrics.
github.com
COCO Caption Evaluation Toolkit Official scoring scripts for CIDEr/SPICE/BLEU/METEOR/ROUGE used with bootstrap CIs.
textvqa.org
TextVQA OCR-in-the-wild QA dataset included to assess text reading and reasoning capabilities.
textvqa.org
TextCaps Reading-aware captioning dataset included to evaluate OCR-conditioned generation.
docvqa.org
DocVQA Document VQA suite measuring layout- and page-aware comprehension for structured tasks.
infographicvqa.github.io
InfographicVQA Tests visually dense, infographic-style document reasoning and extraction.
chartqa.github.io
ChartQA Chart understanding benchmark used to evaluate quantitative reasoning over plots.
github.com
RefCOCO/RefCOCO+/RefCOCOg (refer) Official references for referring expression grounding tasks with IoU≥0.5 evaluation.
lil.nlp.cornell.edu
NLVR2 Multi-image compositional reasoning dataset requiring fixed index enumeration and ordering.
scienceqa.github.io
ScienceQA Image subset used to evaluate instruction following and short-form reasoning with visuals.
github.com
MSRVTT-QA Short-video QA dataset used with fixed frame sampling policies for fairness.
github.com
NExT-QA Temporal reasoning over video clips with standardized frame sampling across models.
github.com
POPE Object hallucination stress test to quantify spurious mentions in VLM outputs.
arxiv.org
Object Hallucination in Image Captioning (CHAIR) Metric and analysis for hallucinated objects in generated captions.
github.com
ImageNet-C (Corruptions) Corruption suite for measuring robustness and degradation curves under distribution shift.
github.com
VLMEvalKit Community harness providing dataset adapters, standardized prompts, and official scorers.
opencompass.org.cn
OpenCompass Leaderboards (Multimodal) External reference to sanity-check relative rankings and dataset handling.
github.com
LMMS-Eval Evaluation harness for multimodal reasoning suites complementary to VLMEvalKit.
pytorch.org
PyTorch Reproducibility/Randomness Authoritative guidance on seeds, deterministic algorithms, and CUDA/cuDNN controls.
docs.scipy.org
SciPy Bootstrap CI Statistical method for computing non-parametric 95% confidence intervals over items.
c2pa.org
C2PA Specification Standard for provenance metadata used to test preservation and reporting in safety/trust checks.
developers.perspectiveapi.com
Perspective API Third-party classifier used to quantify toxicity rates during safety evaluations.
www.nvidia.com
NVIDIA A100 Hardware reference for reproducible on-prem benchmarking and power/memory telemetry.
www.nvidia.com
NVIDIA H100 Hardware reference for reproducible on-prem benchmarking and power/memory telemetry.
docs.docker.com
Docker Docs Source for pulling containers by digest and documenting immutable environments.
docs.nvidia.com
NVIDIA CUDA Docs Reference for CUDA/cuBLAS/cuDNN behaviors and determinism environment variables.
