| name | tao-generate-referring-expressions |
| description | "Four-step image referring-expression pipeline: turns images plus KITTI bounding-box labels into region descriptions, scene captions, grounded referring expressions, and (optionally) verified expressions via VLM distillation. Use when the user wants to generate referring-expression annotations from images with KITTI labels, build region descriptions, produce grouped grounding phrases tied to bboxes, run a double-check verification pass on grounding expressions, auto-label traffic / scene images for referring datasets, or run the image_referring_expression pipeline. Triggers include 'referring expression', 'region description', 'KITTI labels', 'spatial relationship annotation', 'auto-label image referring expression', 'image_referring_expression'." |
| license | Apache-2.0 |
| compatibility | Requires docker + nvidia-container-toolkit + at least one VLM endpoint (Gemini API key or OpenAI-compatible). |
| metadata | author: NVIDIA Corporation version: "0.1.0" |
| tags | - image - referring-expression - kitti - bounding-boxes - auto-label - vlm |
| allowed-tools | Read Bash Write |
Image Referring Expression Pipeline
Generate referring-expression and grounding annotations from images with KITTI-format bounding box labels. A single VLM (Gemini or any OpenAI-compatible endpoint) runs four steps: per-object region descriptions, holistic image captions, grouped grounding expressions tied to bboxes, and an optional double-check verification pass.
Purpose
Transform (image, KITTI labels) pairs into a unified annotations.jsonl containing rich, grounded referring expressions. The VLM acts as a "teacher" annotator: Steps 0-1 see the image; Step 2 groups Step 0 outputs into grouping phrases with bbox lists; Step 3 (optional) re-examines those bboxes against the image and corrects mismatches.
Pipeline Architecture
Step 0: Region expression โโโ
โโโโถ Step 2: Grounding expression โโโถ [Step 3: Double check]
Step 1: Image caption โโโโโโโ (optional)
- Step 0 (region_expr) โ VLM emits one short discriminative phrase per KITTI bbox (
bbox_2d, type, color, description).
- Step 1 (image_caption) โ VLM emits a holistic, location-agnostic scene caption.
- Step 2 (grounding_expr) โ VLM groups Step 0 objects into grouping phrases and returns one bbox list per group, optionally using Step 1's caption as extra context.
- Step 3 (double_check) โ VLM re-checks each Step 2 bbox against the image; bad matches are removed, slightly-off boxes get tightened.
Steps 0 and 1 run in parallel within a single thread pool (they only depend on the seed records). Each step writes its own step_<N>_*/annotations.jsonl and skips already-processed images on re-run unless workflow.force_reprocess: true.
Instructions
Initial setup
When a user wants to run this pipeline, walk through these steps:
-
Images: Ask for data.image_dir, the directory containing .jpg, .jpeg, or .png images.
-
KITTI labels: Ask for data.kitti_label_dir, the directory containing one .txt label file per image. Each label line must use KITTI format: <type> <truncated> <occluded> <alpha> <bbox_left> <bbox_top> <bbox_right> <bbox_bottom> .... Lines with fewer than 8 fields are silently skipped. Set this even for Step 1-only runs because Steps 0 and 2 require it.
-
Resume from existing annotations: If the user already has a unified annotations.jsonl from a previous run, set data.input_annotations_jsonl to that file instead of seeding from data.image_dir and data.kitti_label_dir.
-
API access: Ask the user which VLM endpoint they want to use. Present these five options and act on the choice:
- Gemini โ set
vlm.backend: "gemini"; require GOOGLE_API_KEY (env var or vlm.gemini.api_key).
- NIM (e.g.
https://inference-api.nvidia.com/v1) โ set vlm.backend: "openai"; collect base_url, model_name, and api_key.
- TAO inference microservice (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
- Running โ collect
base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
- Not running โ guide the user through the
skills/applications/tao-run-inference-service skill, which stands up a local TAO inference microservice with an OpenAI-compatible API. Before promising a specific model, check skills/applications/tao-run-inference-service/references/service.yaml for valid_network_arch_config_basenames. Once the server is up, collect base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
- vLLM (self-hosted, OpenAI-compatible). Confirm whether the server is already running:
- Running โ collect
base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
- Not running โ follow references/vllm_server.md to install and launch a vLLM server, then collect
base_url, model_name, and (optionally) api_key; set vlm.backend: "openai".
- Custom (any other OpenAI-compatible endpoint) โ set
vlm.backend: "openai"; collect base_url, model_name, and (optionally) api_key.
If the user has no endpoint and does not want to set one up, stop and help resolve API access first.
-
Workflow steps: Choose one of:
- Full pipeline:
["0", "1", "2", "3"]
- No caption generation:
["0", "2", "3"], where Step 2 falls back to image-only context
- No verification:
["0", "1", "2"]
- Custom subset: any supported subset of steps
-
Output format: Choose one of:
jsonl: unified schema only
legacy: byte-compatible .txt.stepN files only
both: writes both formats and is the default for downstream tooling
Running the pipeline
The pipeline runs inside the TAO Toolkit container via the auto_label CLI:
auto_label generate -e /path/to/spec.yaml \
results_dir=/results \
image_referring_expression.data.image_dir=/data/images \
image_referring_expression.data.kitti_label_dir=/data/labels \
image_referring_expression.vlm.gemini.api_key=$GOOGLE_API_KEY
Generate a default spec: auto_label default_specs results_dir=/results module_name=auto_label, then set autolabel_type: "image_referring_expression". All fields support Hydra dot-notation overrides on the command line.
See references/configuration.md for the full YAML structure, all parameters, model/endpoint setup, and error patterns.
Recommended pilot workflow
- Run on 5-10 images with all four steps.
- Inspect
step_0_region_expr/annotations.jsonl โ are object types, colors, and discriminating phrases accurate?
- Inspect
step_2_grounding_expr/annotations.jsonl โ are objects grouped sensibly, and do bbox coordinates match the described groups?
- Inspect
step_3_double_check/annotations.jsonl โ were mismatched bboxes removed or tightened? Are any new errors introduced (rare)?
- If quality is insufficient, switch the VLM to a stronger model (e.g.
gemini-2.5-pro or a larger Qwen3-VL endpoint), raise media_resolution / max_output_tokens, then re-run with workflow.force_reprocess=true.
- Scale to the full dataset once satisfied.
Configuration
Key configuration fields (full reference in references/configuration.md):
| Field | Default | Description |
|---|
workflow.steps | ["0","1","2","3"] | Which steps to execute (0=region_expr, 1=image_caption, 2=grounding_expr, 3=double_check) |
workflow.max_workers | 4 | Parallel threads per step (watch API rate limits) |
workflow.force_reprocess | false | Ignore cached per-step outputs and reprocess from scratch |
workflow.output_format | "jsonl" (set to "both" in the default spec) | "jsonl", "legacy", or "both" |
vlm.backend | "gemini" | "gemini" or "openai" (OpenAI-compatible endpoint) |
data.image_dir | required | Directory of input images (.jpg / .jpeg / .png) |
data.kitti_label_dir | required (unless resuming) | Directory of KITTI-format .txt label files |
data.input_annotations_jsonl | "" | Optional pre-seeded annotations.jsonl (skips KITTI seeding) |
Inputs
Two ways to seed the pipeline:
- Image directory + KITTI labels (default). Set
data.image_dir and data.kitti_label_dir. The orchestrator walks the image directory, reads the matching <stem>.txt KITTI file, parses bboxes (fields 0 + 4-7), reads each image's width/height via PIL, and writes a seed_annotations.jsonl to results_dir/.
- Pre-seeded annotations JSONL (resume / pre-computed regions). Set
data.input_annotations_jsonl to a file with one {"image_id", "image_path", "width", "height", "kitti_bboxes": [...]} object per line.
Outputs
All outputs go to results_dir/:
seed_annotations.jsonl โ initial per-image records (unless input_annotations_jsonl was supplied).
step_0_region_expr/annotations.jsonl โ adds regions[] (each with bbox/bbox_2d, type, color, description).
step_1_image_caption/annotations.jsonl โ adds caption (string).
step_2_grounding_expr/annotations.jsonl โ adds expressions[] (each {text, instances: [{bbox: [x1,y1,x2,y2]}]}).
step_3_double_check/annotations.jsonl โ same shape as Step 2, with bboxes removed/updated.
results_dir/annotations.jsonl โ copy of the last completed step's output.
- When
workflow.output_format is "legacy" or "both", each step also writes byte-compatible step_<N>_*/labels/<stem>.txt.stepN files for the original 2d-data-engine tooling.
Prerequisites
- Container:
nvcr.io/nvidia/tao/tao-toolkit:6.26.3-pyt
- API access: At least one VLM endpoint (Gemini API key or OpenAI-compatible endpoint capable of image input)
- PIL / Pillow: Required to read image dimensions during seeding (already present in the TAO container)