NeMo Inspector is a tool designed to help you analyze Large Language Model (LLM) generations. It provides two main pages:
- Inference Page: Interactively generate and analyze model responses.
- Analyze Page: Explore and manipulate existing generations, apply filters, sorting criteria, and compute statistics.
The Inference page allows you to experiment with model prompts and responses in real-time, adjusting various parameters. The Analyze page lets you load previously generated outputs and apply filtering, sorting, labeling, and statistic calculations for in-depth exploration.
- Clone and Install the Tool:
git clone git@github.com:NVIDIA/NeMo-Inspector.git
cd NeMo-Inspector
pip install .
- Launch the Tool:
nemo_inspector
This will start a local server that you can access through your browser.
The Inference page allows you to generate responses using an LLM and analyze them immediately. It supports two generation modes:
- Prompt-based Mode: You write the entire prompt that will be sent to the model.
- Template-based Mode: You select from predefined templates, fill in placeholders, and let the tool automatically construct the final prompt.
The Inference page utilizes NeMo-Skills pipelines for inference.
The Analyze page helps you work with pre-generated outputs. To use it, provide paths to the generation files using command-line arguments. For example:
nemo_inspector --inspector_params.model_prediction \
generation1='/path/to/generation1/output-greedy.jsonl' \
generation2='/path/to/generation2/output-rs*.jsonl'
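The output-rs*.jsonl pattern suggests that one generation name can cover several files. As a rough sketch of how such a pattern could be expanded and read, using only the standard glob and json modules: the helper load_generation and the demo records are hypothetical, and the tool's own loader may work differently.

```python
import glob
import json
import os
import tempfile


def load_generation(pattern: str) -> dict:
    """Return {file_path: [record, ...]} for every JSONL file matching `pattern`."""
    loaded = {}
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            # JSONL stores one JSON object per non-empty line.
            loaded[path] = [json.loads(line) for line in handle if line.strip()]
    return loaded


# Hypothetical demo data: two random-seed outputs covering the same two questions.
demo_dir = tempfile.mkdtemp()
for seed in (0, 1):
    file_path = os.path.join(demo_dir, f"output-rs{seed}.jsonl")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps({"question": "2 + 2", "generation": "4"}) + "\n")
        handle.write(json.dumps({"question": "3 * 3", "generation": "9"}) + "\n")

generation2 = load_generation(os.path.join(demo_dir, "output-rs*.jsonl"))
```

Here generation2 holds one list of records per matched file, which mirrors the idea of several runs grouped under a single generation name.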
Once loaded, the Analyze page lets you:
- Sort and Filter Results: Apply custom filtering and sorting functions to refine the displayed data.
- Compare Generations: View outputs from multiple generation runs side-by-side.
- Modify and Label Data: Update or annotate samples and save the changes for future reference.
- Compute Statistics: Generate both custom and general statistics to summarize your data.
The tool supports two filtering modes: Filter Files mode and Filter Questions mode. You can define custom filtering functions in Python and run them directly in the UI.
- In Filter Files mode, the filtering function is run on each sample across the different files simultaneously.
- The input to the filtering function is a dictionary whose keys are generation names and whose values are the JSON objects for that sample.
- The custom function should return a Boolean value (True to keep the sample, False to filter it out).
Example of a custom filtering function:
def custom_filtering_function(error_message: str) -> bool:
    # Implement your logic here
    return 'timeout' not in error_message

# This line will be used for the filtering:
custom_filtering_function(data['generation1']['error_message'])
Note: The last line of the custom filtering function is used for filtering. All preceding lines are just for computation.
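The "last line decides" convention can be pictured with a small evaluator. This is only a sketch of one way to implement it with the standard ast module: the helper run_snippet is hypothetical, and nothing here describes how NeMo Inspector evaluates snippets internally.

```python
import ast


def run_snippet(source: str, data: dict):
    """Run all statements except the last, then return the last expression's value."""
    tree = ast.parse(source)
    if not tree.body or not isinstance(tree.body[-1], ast.Expr):
        raise ValueError("the snippet must end with an expression")
    *setup, last = tree.body
    namespace = {"data": data}
    # Preceding lines only prepare helpers such as function definitions.
    exec(compile(ast.Module(body=setup, type_ignores=[]), "<snippet>", "exec"), namespace)
    # The final expression decides whether the sample is kept.
    return eval(compile(ast.Expression(body=last.value), "<snippet>", "eval"), namespace)


SNIPPET = """
def custom_filtering_function(error_message: str) -> bool:
    return 'timeout' not in error_message

custom_filtering_function(data['generation1']['error_message'])
"""

# Hypothetical samples in the Filter Files shape: generation name -> JSON object.
kept = run_snippet(SNIPPET, {"generation1": {"error_message": ""}})
dropped = run_snippet(SNIPPET, {"generation1": {"error_message": "timeout after 30s"}})
```

With these two samples, kept is True and dropped is False, so only the first sample would survive the filter.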
To apply multiple conditions to multiple generations, use the && separator. For instance:
data['generation1']['is_correct'] && not data['generation2']['is_correct']
Important: In Filter Files mode, do not write multi-generation conditions without using &&. Each condition should be separated by &&.
- In Filter Questions mode, the function filters each question across multiple files without filtering out entire files.
- The input is a dictionary that maps generation names to lists of JSON data for that question.
In this mode, you write conditions without the && operator. For example:
data['generation1'][0]['is_correct'] and not data['generation2'][0]['is_correct']
This example keeps only the questions where the first generation is correct and the second generation is incorrect. You can also compare fields directly:
data['generation1'][0]['is_correct'] != data['generation2'][0]['is_correct']
Note: These examples cannot be used in Filter Files mode.
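To make the Filter Questions input shape concrete, the sketch below applies both expressions to a few made-up questions. The records are hypothetical, and the sketch assumes the same convention as for filtering functions in general: a question is retained when the expression is true.

```python
# Hypothetical records: each generation name maps to a list of samples for one question.
questions = [
    {"generation1": [{"is_correct": True}], "generation2": [{"is_correct": False}]},
    {"generation1": [{"is_correct": True}], "generation2": [{"is_correct": True}]},
    {"generation1": [{"is_correct": False}], "generation2": [{"is_correct": True}]},
]


def first_right_second_wrong(data: dict) -> bool:
    # Same expression as the first example above.
    return data["generation1"][0]["is_correct"] and not data["generation2"][0]["is_correct"]


def disagree(data: dict) -> bool:
    # Same expression as the second example above.
    return data["generation1"][0]["is_correct"] != data["generation2"][0]["is_correct"]


kept_by_first = [i for i, data in enumerate(questions) if first_right_second_wrong(data)]
kept_by_second = [i for i, data in enumerate(questions) if disagree(data)]
# kept_by_first == [0], kept_by_second == [0, 2]
```

The second expression is broader: it retains every question on which the two generations disagree, whichever one is correct.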
Sorting functions are similar to filtering functions, but there are key differences:
- Scope: Sorting functions operate on individual data entries (not dictionaries with multiple generations).
- Cross-Generations: Sorting cannot be applied across multiple generations at once. You must sort one generation at a time.
A correct sorting function might look like this:
def custom_sorting_function(generation: str):
    # Sort by the length of the generation text
    return len(generation)

# This line will be used for the sorting:
custom_sorting_function(data['generation'])
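Outside the UI, the same idea corresponds to using the function as a sort key over the samples of a single generation. The sample records below are hypothetical, and ascending order is an assumption made for the sketch, not a statement about the order the tool displays.

```python
def custom_sorting_function(generation: str) -> int:
    # Sort by the length of the generation text
    return len(generation)


# Hypothetical samples from one generation; each entry is a single JSON record.
samples = [
    {"generation": "a long answer"},
    {"generation": "short"},
    {"generation": "mid one"},
]

# Each record is passed on its own, matching the single-entry scope described above.
ordered = sorted(samples, key=lambda data: custom_sorting_function(data["generation"]))
ordered_texts = [data["generation"] for data in ordered]
# ordered_texts == ["short", "mid one", "a long answer"]
```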
NeMo Inspector supports two types of statistics:
- Custom Statistics: Applied to the samples of a single question (for each generation). Default custom statistics include:
  - correct_responses
  - wrong_responses
  - no_responses
- General Custom Statistics: Applied across all questions and all generations. Default general custom statistics include:
  - dataset size
  - overall number of samples
  - generations per sample
You can modify the existing Custom and General Custom Statistics functions or define your own.
Custom Statistics Example:
def unique_error_counter(datas):
    # `datas` is a list of JSONs (one per file) for a single question
    unique_errors = set()
    for data in datas:
        unique_errors.add(data.get('error_message'))
    return len(unique_errors)

def number_of_runs(datas):
    return len(datas)

# Map function names to functions
{'unique_errors': unique_error_counter, 'number_of_runs': number_of_runs}
General Custom Statistics Example:
def overall_unique_error_counter(datas):
    # `datas` is a list of lists of dictionaries,
    # where datas[question_index][file_index] is a JSON record
    unique_errors = set()
    for question_data in datas:
        for file_data in question_data:
            unique_errors.add(file_data.get('error_message'))
    return len(unique_errors)

# Map function names to functions
{'unique_errors': overall_unique_error_counter}
Note: The final line in both the Custom and General Custom Statistics code blocks should be a dictionary mapping function names to their corresponding functions.
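To see how the two mappings differ in scope, the sketch below applies them to a tiny made-up dataset. The records and the variable names custom_stats, general_stats, and all_questions are hypothetical. The sketch assumes that each Custom Statistics function receives the records of one question and each General Custom Statistics function receives the whole nested list.

```python
def unique_error_counter(datas):
    # `datas` is a list of JSON records (one per file) for a single question.
    return len({data.get("error_message") for data in datas})


def number_of_runs(datas):
    return len(datas)


def overall_unique_error_counter(datas):
    # `datas[question_index][file_index]` is a JSON record.
    return len({record.get("error_message") for question in datas for record in question})


custom_stats = {"unique_errors": unique_error_counter, "number_of_runs": number_of_runs}
general_stats = {"unique_errors": overall_unique_error_counter}

# Hypothetical data: two questions, two files each.
all_questions = [
    [{"error_message": ""}, {"error_message": "timeout"}],
    [{"error_message": ""}, {"error_message": ""}],
]

# Custom statistics: one result dictionary per question.
per_question = [
    {name: fn(question) for name, fn in custom_stats.items()} for question in all_questions
]
# General custom statistics: one result dictionary for the whole dataset.
overall = {name: fn(all_questions) for name, fn in general_stats.items()}
```

For this data the first question yields 2 unique errors and the second yields 1, while the overall count is 2 because the empty message is shared across questions.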
You can update each sample in the dataset programmatically. The last line of the code block must evaluate to the updated sample dictionary:
# For example, strip leading and trailing whitespace from the "generation" field
{**data, 'generation': data['generation'].strip()}
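As a sketch of what this expression does when applied to every sample in turn: the records and the wrapper update_sample are hypothetical, and how the tool persists the result is not shown here.

```python
# Hypothetical samples; only the "generation" field is expected to change.
samples = [
    {"generation": "  42\n", "is_correct": True},
    {"generation": "7", "is_correct": False},
]


def update_sample(data: dict) -> dict:
    # Same expression as above: copy every field, then override "generation".
    return {**data, "generation": data["generation"].strip()}


updated = [update_sample(data) for data in samples]
```

Because {**data, ...} builds a new dictionary, the original records are left unmodified and every other field is carried over unchanged.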