Test Your Agent
This page is for researchers and developers who already have a runnable HA-VLN agent and want to validate it.
From a development perspective, testing answers one question:
- can my agent run correctly on the public HA-VLN workflow?
What to Test
Before considering your agent ready, test the following:
- the agent can start correctly in the HA-VLN environment
- the required data paths are resolved correctly
- the agent behaves correctly in dynamic human scenes
- the expected outputs and metrics can be produced
Recommended Testing Path
1. Test on public development splits
Run your agent on the public validation workflow first. This is the safest way to catch:
- path issues
- configuration mistakes
- broken environment synchronization
- missing dependencies
- incorrect metric handling
2. Inspect collision-aware behavior
Because HA-VLN is human-aware, test not only navigation success but also collision-related behavior.
Use:
Recommended First Validation Loop
A practical first loop is:
- run your package on a public split such as
val_unseen - confirm that the evaluation script starts cleanly and uses the intended HA-VLN task configuration
- inspect the result files produced by your current local workflow
- read the exported score summary before debugging deeper issues
Depending on your launch script, exported files often include:
score_summary.jsonepisode_metrics.jsonstats_ckpt_0_<split>.json- evaluation logs
Practical Testing Checklist
Before you move on, confirm that:
- your agent loads the correct HA-VLN task configuration
- the required public data exists under the expected paths
- your agent can finish evaluation runs without runtime errors
- the outputs you care about are written correctly
- your metrics are interpretable and consistent with HA-VLN definitions
Common Failure Patterns
Config or path mismatch
If the environment boots but evaluation fails quickly, check:
- the task config path used by your method
- checkpoint paths inside the mounted package
- whether your code is still using host-side absolute paths instead of container paths
Metrics look surprising
If plain success looks reasonable but strict success SR looks low, revisit:
In the current evaluator, SR is a strict success metric rather than a direct alias for plain environment success. For summary files, score_summary.json reports NE, while stats_ckpt_0_<split>.json keeps the underlying aggregated key name distance_to_goal.
If You Need More Detail
For low-level integration and evaluation details, see: