Evaluation Metrics
This document explains the evaluation metrics used in the current HA-VLN evaluator. Understanding these metrics will help you interpret your agent's behavior during development and validation.
Overview
The current evaluator exposes four core human-aware metrics:
- Strict Success (SR)
- Trajectory Collision Rate (TCR)
- Navigation Error (NE)
- Collision Rate (CR)
The evaluator also keeps the underlying environment success signal. In HA-VLN, SR is stricter than plain success because it also accounts for human-collision behavior.
Current Evaluator Outputs
At the episode level, the current evaluator writes fields such as:
- success
- goal_distance
- collision_count
- baseline_collision_count
- adjusted_collision_count
- collision_indicator
- strict_success
At the summary level, the current evaluator exposes two closely related summary views:
- score_summary.json, which may write SR, TCR, CR, and NE
- stats_ckpt_0_<split>.json, which keeps the raw aggregated metric keys, including distance_to_goal
In the current implementation, NE is the summary name for the aggregated distance_to_goal value.
Metric Definitions
1. Strict Success (SR)
Definition: Mean strict-success value across episodes.
At the episode level, the current implementation defines:
SR_episode = success * int(TCR_episode == 0)
This means an episode contributes 1 only if:
- the environment reports success == 1
- the adjusted trajectory collision count TCR_episode is zero
The dataset-level summary is then:
SR = Σ(SR_episode) / (Total episodes)
Higher is better.
Important note:
- SR is not the same as plain environment success
- in the current evaluator, SR is a strict success metric
- the exported summary value is a mean in [0, 1], not a percentage unless you multiply it by 100 for presentation
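The SR definition above can be sketched in a few lines of Python. This is a minimal illustration of the formulas on this page, not the code in the evaluator; the function names and the sample values are hypothetical.

```python
from typing import Sequence


def strict_success_episode(success: int, tcr_episode: float) -> int:
    """Return 1 only if the episode succeeded with zero adjusted collisions."""
    return int(success) * int(tcr_episode == 0)


def strict_success_rate(successes: Sequence[int],
                        tcr_episodes: Sequence[float]) -> float:
    """Mean of the per-episode strict-success values, in [0, 1]."""
    if len(successes) != len(tcr_episodes) or not successes:
        raise ValueError("expected two non-empty sequences of equal length")
    values = [strict_success_episode(s, t)
              for s, t in zip(successes, tcr_episodes)]
    return sum(values) / len(values)


# Hypothetical data for four episodes (not taken from any real run).
successes = [1, 1, 0, 1]
tcr_episodes = [0, 2, 0, 0]

sr = strict_success_rate(successes, tcr_episodes)
print(sr)                   # 0.5
print(f"{sr * 100:.1f}%")   # 50.0% (presentation only)
```

In this sample, plain success would be 0.75, but the second episode had adjusted collisions, so it does not count toward SR.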
2. Trajectory Collision Rate (TCR)
Definition: Average adjusted human-collision count per episode after subtracting the precomputed unavoidable collision component used by the metric implementation.
At the episode level, the current implementation defines:
TCR_episode = max(0, collisions.count - unavoidable_collision_baseline)
The dataset-level summary is then:
TCR = Σ(TCR_episode) / (Total episodes)
Lower is better.
Interpretation:
- TCR = 0: no counted human-collision events after adjustment
- TCR > 0: some counted human-collision events occurred after adjustment
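As a rough sketch of the adjustment step, the following Python mirrors the two TCR formulas above. The function names and the per-episode numbers are hypothetical; how the evaluator actually obtains the unavoidable-collision baseline is not shown here.

```python
from typing import Sequence


def tcr_episode(collision_count: int, unavoidable_baseline: int) -> int:
    """Adjusted collision count, clamped so it never goes below zero."""
    return max(0, collision_count - unavoidable_baseline)


def trajectory_collision_rate(collision_counts: Sequence[int],
                              baselines: Sequence[int]) -> float:
    """Average adjusted collision count per episode."""
    if len(collision_counts) != len(baselines) or not collision_counts:
        raise ValueError("expected two non-empty sequences of equal length")
    adjusted = [tcr_episode(c, b) for c, b in zip(collision_counts, baselines)]
    return sum(adjusted) / len(adjusted)


# Hypothetical raw counts and baselines for four episodes.
collision_counts = [3, 1, 0, 5]
baselines = [1, 2, 0, 1]

print([tcr_episode(c, b) for c, b in zip(collision_counts, baselines)])
# [2, 0, 0, 4]
print(trajectory_collision_rate(collision_counts, baselines))  # 1.5
```

The second episode shows the clamp: its raw count is below the baseline, so it contributes 0 rather than a negative value.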
3. Navigation Error (NE)
Definition: Mean final distance-to-goal value across episodes.
Units: meters
In the current evaluator implementation, NE is the summary name for the aggregated environment metric distance_to_goal.
NE = Σ(distance_to_goal at episode end) / (Total episodes)
Lower is better.
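A minimal sketch of the NE aggregation, assuming you already have one final distance_to_goal value per episode. The function name and the distances are hypothetical.

```python
from typing import Sequence


def navigation_error(final_distances: Sequence[float]) -> float:
    """Mean final distance to goal across episodes, in meters."""
    if not final_distances:
        raise ValueError("expected at least one episode")
    return sum(final_distances) / len(final_distances)


# Hypothetical final distances (meters) for four episodes.
final_distances = [0.5, 2.0, 3.5, 1.0]
print(navigation_error(final_distances))  # 1.75
```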
4. Collision Rate (CR)
Definition: Mean episode-level collision indicator across episodes.
At the episode level, the current implementation defines:
CR_episode = min(TCR_episode, 1)
So in the current evaluator:
- an episode contributes 0 if its adjusted collision count is zero
- an episode contributes 1 if its adjusted collision count is one or more
The dataset-level summary is then:
CR = Σ(CR_episode) / (Total episodes)
Lower is better.
Important note:
- the exported summary value is a mean in [0, 1]
- it is often interpreted like a rate, but the current evaluator does not multiply it by 100
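The CR formulas above reduce each episode to a 0/1 indicator before averaging. A small sketch, with hypothetical function names and sample values:

```python
from typing import Sequence


def cr_episode(tcr_episode: int) -> int:
    """Collision indicator: 0 if no adjusted collisions, otherwise 1."""
    return min(tcr_episode, 1)


def collision_rate(tcr_episodes: Sequence[int]) -> float:
    """Fraction of episodes with at least one adjusted collision, in [0, 1]."""
    if not tcr_episodes:
        raise ValueError("expected at least one episode")
    return sum(cr_episode(t) for t in tcr_episodes) / len(tcr_episodes)


# Hypothetical adjusted collision counts for four episodes.
tcr_episodes = [2, 0, 0, 4]
print([cr_episode(t) for t in tcr_episodes])  # [1, 0, 0, 1]
print(collision_rate(tcr_episodes))           # 0.5
```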
Human-Aware Interpretation
What Counts as a Collision Here?
The human-aware metrics are meant to reflect interaction with dynamic humans rather than only static-scene collisions.
Why TCR and CR Both Matter
- TCR measures how much adjusted human-collision behavior accumulates across episodes
- CR measures how often an episode has at least one adjusted collision event
Together they help distinguish frequency (how many episodes involve a collision, CR) from severity (how many collisions accumulate, TCR).
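To make the difference concrete, the sketch below compares two hypothetical agents whose adjusted collision counts are invented for illustration. Both collide in the same number of episodes, but one accumulates far more collisions when it does.

```python
def tcr(adjusted_counts):
    """Average adjusted collision count per episode."""
    return sum(adjusted_counts) / len(adjusted_counts)


def cr(adjusted_counts):
    """Fraction of episodes with at least one adjusted collision."""
    return sum(min(c, 1) for c in adjusted_counts) / len(adjusted_counts)


# Hypothetical adjusted collision counts over four episodes each.
agent_a = [1, 1, 0, 0]
agent_b = [5, 3, 0, 0]

print(cr(agent_a), tcr(agent_a))  # 0.5 0.5
print(cr(agent_b), tcr(agent_b))  # 0.5 2.0
```

CR alone would rate the two agents as equal; TCR shows that the second agent's collision episodes are much worse.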
Why success and SR Both Matter
- success tells you whether the episode satisfied the environment success condition
- SR tells you whether the episode was both successful and collision-clean under the strict HA-VLN rule
This is why two agents with similar plain success can still differ on the stricter HA-VLN SR metric.
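A short illustration with two hypothetical agents, using invented per-episode values. Both reach the goal in the same episodes, but only one does so without adjusted collisions.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def strict(successes, adjusted_counts):
    """Per-episode strict success: succeeded and zero adjusted collisions."""
    return [s * int(c == 0) for s, c in zip(successes, adjusted_counts)]


# Hypothetical outcomes over four episodes; both agents succeed in the same three.
successes = [1, 1, 1, 0]
adjusted_a = [0, 0, 0, 0]   # agent A never collides after adjustment
adjusted_b = [0, 2, 1, 0]   # agent B collides in two successful episodes

print(mean(successes))                      # 0.75 plain success for both
print(mean(strict(successes, adjusted_a)))  # 0.75
print(mean(strict(successes, adjusted_b)))  # 0.25
```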
Practical Implications for Development
When iterating on your own agent, use these metrics together rather than optimizing only one of them.
Useful questions to ask are:
- does the agent reach goals reliably under plain success?
- does it also preserve strict success under SR?
- are failures caused more by navigation error or by adjusted human collisions?
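One way to approach the last question is to bucket episodes by why they missed strict success. The sketch below assumes each episode record is a dict carrying the success and adjusted_collision_count fields listed earlier on this page; the dict layout, the bucket names, and the sample records are hypothetical, so adapt them to whatever your evaluator run actually writes.

```python
from collections import Counter


def triage(episode: dict) -> str:
    """Label an episode by which condition of strict success it violated."""
    reached_goal = episode["success"] == 1
    collision_clean = episode["adjusted_collision_count"] == 0
    if reached_goal and collision_clean:
        return "strict_success"
    if reached_goal:
        return "collision_only"        # reached the goal but hit a human
    if collision_clean:
        return "navigation_only"       # no collisions but missed the goal
    return "navigation_and_collision"


# Hypothetical episode records (not taken from any real run).
episodes = [
    {"success": 1, "adjusted_collision_count": 0},
    {"success": 1, "adjusted_collision_count": 2},
    {"success": 0, "adjusted_collision_count": 0},
    {"success": 0, "adjusted_collision_count": 1},
    {"success": 1, "adjusted_collision_count": 0},
]

buckets = Counter(triage(e) for e in episodes)
for label, count in sorted(buckets.items()):
    print(label, count)
```

A large collision_only bucket suggests the policy navigates well but needs better human avoidance, while a large navigation_only bucket points at the navigation side.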
Notes
- this page follows the current code path in HASimulator/metric.py and agent/eval.py
- if future evaluator code changes the exported metric definitions, the documentation should be updated to match the code