Evaluation Metrics

This document explains the evaluation metrics used in the current HA-VLN evaluator. Understanding these metrics will help you interpret your agent's behavior during development and validation.

Overview

The current evaluator exposes four core human-aware metrics:

Strict Success (SR)
Trajectory Collision Rate (TCR)
Navigation Error (NE)
Collision Rate (CR)

The evaluator also keeps the underlying environment success signal. In HA-VLN, SR is stricter than plain success because it also accounts for human-collision behavior.

Current Evaluator Outputs

At the episode level, the current evaluator writes fields such as:

success
goal_distance
collision_count
baseline_collision_count
adjusted_collision_count
collision_indicator
strict_success

At the summary level, the current evaluator exposes two closely related summary views:

score_summary.json, which may write SR, TCR, CR, and NE
stats_ckpt_0_<split>.json, which keeps the raw aggregated metric keys, including distance_to_goal

In the current implementation, NE is the summary name for the aggregated distance_to_goal value.

Metric Definitions

1. Strict Success (`SR`)

Definition: Mean strict-success value across episodes.

At the episode level, the current implementation defines:

SR_episode = success * int(TCR == 0)

This means an episode contributes 1 only if:

the environment reports success == 1
the adjusted trajectory collision count TCR is zero

The dataset-level summary is then:

SR = Σ(SR_episode) / (Total episodes)

Higher is better.

Important note:

SR is not the same as plain environment success
in the current evaluator, SR is a strict success metric
the exported summary value is a mean in [0, 1], not a percentage unless you multiply it by 100 for presentation

2. Trajectory Collision Rate (`TCR`)

Definition: Average adjusted human-collision count per episode after subtracting the precomputed unavoidable collision component used by the metric implementation.

At the episode level, the current implementation defines:

TCR_episode = max(0, collisions.count - unavoidable_collision_baseline)

The dataset-level summary is then:

TCR = Σ(TCR_episode) / (Total episodes)

Lower is better.

Interpretation:

TCR = 0: no counted human-collision events after adjustment
TCR > 0: some counted human-collision events occurred after adjustment

3. Navigation Error (`NE`)

Definition: Mean final distance-to-goal value across episodes.

Units: meters

In the current evaluator implementation, NE is the summary name for the aggregated environment metric distance_to_goal.

NE = Σ(distance_to_goal at episode end) / (Total episodes)

Lower is better.

4. Collision Rate (`CR`)

Definition: Mean episode-level collision indicator across episodes.

At the episode level, the current implementation defines:

CR_episode = min(TCR_episode, 1)

So in the current evaluator:

an episode contributes 0 if its adjusted collision count is zero
an episode contributes 1 if its adjusted collision count is one or more

The dataset-level summary is then:

CR = Σ(CR_episode) / (Total episodes)

Lower is better.

Important note:

the exported summary value is a mean in [0, 1]
it is often interpreted like a rate, but the current evaluator does not multiply it by 100

Human-Aware Interpretation

What Counts as a Collision Here?

The human-aware metrics are meant to reflect interaction with dynamic humans rather than only static-scene collisions.

Why `TCR` and `CR` Both Matter

TCR measures how much adjusted human-collision behavior accumulates across episodes
CR measures how often an episode has at least one adjusted collision event

Together they help distinguish frequency from severity.

Why `success` and `SR` Both Matter

success tells you whether the episode satisfied the environment success condition
SR tells you whether the episode was both successful and collision-clean under the strict HA-VLN rule

This is why two agents with similar plain success can still differ on the stricter HA-VLN SR metric.

Practical Implications for Development

When iterating on your own agent, use these metrics together rather than optimizing only one of them.

Useful questions to ask are:

does the agent reach goals reliably under plain success?
does it also preserve strict success under SR?
are failures caused more by navigation error or by adjusted human collisions?

Notes

this page follows the current code path in HASimulator/metric.py and agent/eval.py
if future evaluator code changes the exported metric definitions, the documentation should be updated to match the code

Evaluation Metrics

Overview

Current Evaluator Outputs

Metric Definitions

1. Strict Success (SR)

2. Trajectory Collision Rate (TCR)

3. Navigation Error (NE)

4. Collision Rate (CR)

Human-Aware Interpretation

What Counts as a Collision Here?

Why TCR and CR Both Matter

Why success and SR Both Matter

Practical Implications for Development

Notes

Further Reading

1. Strict Success (`SR`)

2. Trajectory Collision Rate (`TCR`)

3. Navigation Error (`NE`)

4. Collision Rate (`CR`)

Why `TCR` and `CR` Both Matter

Why `success` and `SR` Both Matter