This is the online appendix for a paper submitted to the Software Quality Days 2025 conference.
Automated generation of test oracles is a critical area of research in software quality assurance. One effective technique is the detection of invariants by analyzing dynamic execution data. However, a common challenge of these approaches is the detection of false-positive invariants. This paper investigates the potential of Large Language Models (LLMs) to assist in filtering dynamically detected invariants, aiming to reduce the manual effort involved in discarding incorrect ones. We conducted experiments with various GPT models from OpenAI on a dataset of invariants detected from the dynamic execution of a REST API. Employing a zero-shot Chain-of-Thought prompting methodology, we guided the LLMs to articulate the reasoning behind their decisions. Our findings indicate that classification performance improves significantly with detailed instructions and strategic prompt design (the best model achieving 80.7% accuracy on average), with some performance differences between invariant types.
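As a rough illustration of this setup, the sketch below builds a zero-shot Chain-of-Thought classification prompt in Java (the project's language). The wording, operation, and invariant are illustrative placeholders, not the exact prompts used in the experiments; the real prompts are stored per invariant in the result directories.

```java
// Simplified sketch of a zero-shot Chain-of-Thought classification prompt.
// All strings here are illustrative stand-ins, not the prompts used in the study.
public final class CotPromptSketch {

    static String buildPrompt(String operation, String invariant) {
        return "You are reviewing likely invariants detected from dynamic executions of a REST API.\n"
             + "API operation: " + operation + "\n"
             + "Detected invariant: " + invariant + "\n"
             + "Is this invariant a true-positive (it holds in general) or a false-positive\n"
             + "(an artifact of the observed executions)?\n"
             + "Let's think step by step, then give a final verdict of 'true-positive' or 'false-positive'.";
    }

    public static void main(String[] args) {
        System.out.println(buildPrompt(
                "GET /users/{id}",          // hypothetical operation
                "return.id == input.id"));  // hypothetical invariant
    }
}
```

The "Let's think step by step" instruction is what makes the prompt zero-shot Chain-of-Thought: the model is nudged to produce its reasoning before the verdict, without any worked examples.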
- Directory InvariantClassifier contains the Java Gradle project for executing the experiments and analyzing the data
- Sub-Directory agora_data contains the data from the original study used for the experiments (the execution data is not included due to size limitations)
- Sub-Directory results contains the result data for our first experiment setup
- Sub-Directory results_improved_prompt contains the result data for our second experiment setup
- The results are grouped by API operation and LLM.
- The results for each invariant contain the prompt generated for interacting with the LLM, the reasoning response in Markdown, and a JSON file with the data used for further processing.
- Directory plots contains a host of plots for the experiment results (some of which are in the full paper, others had to be cut because of space limitations)
Results for the first experiment setup (directory results):

GPT-3.5:
| | GPT verdict: true-positive | GPT verdict: false-positive |
|---|---|---|
| Label: true-positive | 195 | 674 |
| Label: false-positive | 96 | 766 |
- Accuracy: 0.5551704217215483
- F1-Score: 0.33620689655172414
- Precision: 0.6701030927835051
- Recall: 0.2243958573072497
- TPR: 0.2243958573072497
- FPR: 0.11136890951276102
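The reported metrics follow directly from the confusion-matrix cells; a minimal Java sketch, using the GPT-3.5 numbers above, reproduces them:

```java
// Derives the reported metrics from the confusion matrix above
// (rows: ground-truth label, columns: GPT verdict).
public final class Metrics {
    public static void main(String[] args) {
        double tp = 195; // label true-positive, verdict true-positive
        double fn = 674; // label true-positive, verdict false-positive
        double fp = 96;  // label false-positive, verdict true-positive
        double tn = 766; // label false-positive, verdict false-positive

        double accuracy  = (tp + tn) / (tp + tn + fp + fn);               // ~0.5552
        double precision = tp / (tp + fp);                                // ~0.6701
        double recall    = tp / (tp + fn);                                // ~0.2244 (= TPR)
        double f1        = 2 * precision * recall / (precision + recall); // ~0.3362
        double fpr       = fp / (fp + tn);                                // ~0.1114

        System.out.printf("acc=%.4f  f1=%.4f  precision=%.4f  recall=%.4f  fpr=%.4f%n",
                accuracy, f1, precision, recall, fpr);
    }
}
```

The same formulas apply to every confusion matrix below; only the cell values differ.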
GPT-4o-mini:
| | GPT verdict: true-positive | GPT verdict: false-positive |
|---|---|---|
| Label: true-positive | 69 | 800 |
| Label: false-positive | 4 | 858 |
- Accuracy: 0.5355285961871751
- F1-Score: 0.1464968152866242
- Precision: 0.9452054794520548
- Recall: 0.07940161104718067
- TPR: 0.07940161104718067
- FPR: 0.004640371229698376
Results for the second experiment setup with the improved prompt (directory results_improved_prompt):

GPT-3.5:
| | GPT verdict: true-positive | GPT verdict: false-positive |
|---|---|---|
| Label: true-positive | 783 | 86 |
| Label: false-positive | 684 | 178 |
- Accuracy: 0.5551704217215483
- F1-Score: 0.6703767123287672
- Precision: 0.5337423312883436
- Recall: 0.9010356731875719
- TPR: 0.9010356731875719
- FPR: 0.7935034802784223
GPT-4o-mini:
| | GPT verdict: true-positive | GPT verdict: false-positive |
|---|---|---|
| Label: true-positive | 581 | 288 |
| Label: false-positive | 129 | 733 |
- Accuracy: 0.7590987868284229
- F1-Score: 0.7359088030398987
- Precision: 0.8183098591549296
- Recall: 0.6685845799769851
- TPR: 0.6685845799769851
- FPR: 0.14965197215777262
GPT-4o:
| | GPT verdict: true-positive | GPT verdict: false-positive |
|---|---|---|
| Label: true-positive | 764 | 105 |
| Label: false-positive | 229 | 633 |
- Accuracy: 0.8070479491623339
- F1-Score: 0.8206229860365198
- Precision: 0.7693856998992951
- Recall: 0.8791714614499425
- TPR: 0.8791714614499425
- FPR: 0.265661252900232