Refactor navigation commands into its own tool #35

ryanhoangt · 2024-12-23T08:43:43Z

Description

This PR is to:

Create a new tool for the 2 nav commands. No changes are made to the str_replace_editor.

TODO:

Run eval on OH
Add tests for navigator

This is an alternative to #5.

Related Issue

Close #28

jpollard-cs

Love this! I've left some comments and questions - hopefully they are helpful! I want to dig a bit further into some other things such as output tree rendering, but thought I would share what I have so far.

jpollard-cs · 2025-01-03T01:49:27Z

openhands_aci/navigator/__init__.py

+
+TOOL_DESCRIPTION = """Custom navigation tool for navigating to symbols in a codebase
+* If there are multiple symbols with the same name, the tool will print all of them
+* The `jump_to_definition` command will print the FULL definition of the symbol, along with the absolute path to the file


have you tested cases where the definition is for a dependency which has not yet been resolved / downloaded? I don't mean anything exotic like dynamically downloaded dependencies, but a simple case of finding the definition from an external dependency which hasn't yet been installed by a package manager or what have you? just curious if this can be surfaced in a meaningful way that OH is able to reason about and install the dependencies itself or prompt the user to do so?

jpollard-cs · 2025-01-03T02:22:33Z

tests/integration/test_symbol_navigator.py

+    with pytest.raises(Exception) as exc_info:
+        navigator(command='invalid_command', symbol_name='MyClass')
+
+    assert 'Unrecognized command' in str(exc_info.value)


Are there further plans to test E2E that OH is using the Navigator CLI when it would by expected to do so or does that introduce too much bias / reduce agency? It would seem that there might at least be some basic tests along these lines where the obvious choice should be to use this tool. I'm not really concerned with influencing the agency of the OH "agent", but more saying that it would be good to have a way to test the efficacy of this tool and it's relevant prompts such that we have confidence that 1.) OH is well aware of it as a valid option when it is contextually appropriate and that 2.) it is able to appropriately call the right CLI commands and effectively navigate the results (understanding this may not be perfect perhaps this could be based on some threshold percentage?). My thinking here is that this would help build confidence that the OH eval results are actually testing the desired functionality. Also are eval results the sole arbiter of whether or not a new feature will make the cut and if so where can I learn more about how that process works and how it is measured? A concern is what might happen if, for instance, this led to significantly lower token usage, but performed slightly worse perhaps because more brute force methods pull more code into the context. It could be that ancillary work to endow OH with a better ability to logically reason about code could unlock improved eval scores. In other words how do you avoid throwing away perfectly good tools?

jpollard-cs · 2025-01-03T02:27:20Z

openhands_aci/editor/editor.py


 Command = Literal[
    'view',
    'create',
    'str_replace',
    'insert',
    'undo_edit',
-    # 'jump_to_definition', TODO:


not sure if there's a formal board and backlog item to link these TODO items to, but in my experience these kinds of things often never get done otherwise

jpollard-cs · 2025-01-03T02:33:25Z