0.0.7
What's Changed
- Sync with Triton 24.04
- Bump TRT-LLM version to 0.9.0
- Add support for `llama-2-7b-chat`, `llama-3-8b`, and `llama-3-8b-instruct` for both vLLM and TRT-LLM
- Improve error checking and error messages when building TRT-LLM engines
- Log the underlying `convert_checkpoint.py` and `trtllm-build` commands for reproducibility/visibility
- Don't call `convert_checkpoint.py` if converted weights are already found
- Call `convert_checkpoint.py` via subprocess to improve total memory usage
- Attempt to clean up failed TRT-LLM models in the model repository if import or engine building fails, rather than leaving the model repository in an unfinished state
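The skip-and-subprocess behavior described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the paths, flags, and the `convert_checkpoint_if_needed` helper are hypothetical, and the real `convert_checkpoint.py` arguments vary by model and TRT-LLM version.

```python
import subprocess
import sys
from pathlib import Path

def convert_checkpoint_if_needed(model_dir: Path, converted_dir: Path) -> None:
    """Run convert_checkpoint.py in a child process, skipping if outputs exist.

    Running conversion in a subprocess means its peak memory is returned
    to the OS when the child exits, instead of lingering in the parent
    for the rest of the engine build.
    """
    # Skip conversion entirely if converted weights are already found.
    if converted_dir.exists() and any(converted_dir.iterdir()):
        print(f"Found existing converted weights in {converted_dir}, skipping")
        return

    # Hypothetical invocation; real flags depend on the model and TRT-LLM version.
    cmd = [
        sys.executable, "convert_checkpoint.py",
        "--model_dir", str(model_dir),
        "--output_dir", str(converted_dir),
    ]
    # Log the exact command for reproducibility/visibility.
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```

Isolating the conversion in a subprocess also makes the cleanup-on-failure behavior simpler: a non-zero exit raises `CalledProcessError`, which the caller can catch to remove the half-built model directory.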
- Update tests to wait for both HTTP and GRPC server endpoints to be ready before testing
  - Fixes intermittent `ConnectionRefusedError` in CI tests
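The endpoint-readiness wait can be sketched as a generic polling helper. The `is_ready` predicate here is a stand-in (an assumption, not this project's API) for checks such as tritonclient's `client.is_server_ready()`, run against both the HTTP and GRPC clients before tests start:

```python
import time

def wait_until_ready(is_ready, timeout_s: float = 30.0,
                     interval_s: float = 0.5) -> bool:
    """Poll a readiness predicate until it returns True or the timeout expires.

    Swallows ConnectionRefusedError while the server is still starting,
    which is exactly the intermittent failure seen when tests race the
    server's HTTP/GRPC listeners.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if is_ready():
                return True
        except ConnectionRefusedError:
            pass  # server not listening yet; retry after a short sleep
        time.sleep(interval_s)
    return False
```

A test would call this once per protocol, e.g. with a lambda wrapping each client's readiness check, and fail fast if either endpoint never comes up.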
Full Changelog: 0.0.6...0.0.7