# A Framework for Automated Collection and Analysis of Incidents on LLM Services
This repository contains the web application for the FAILS project. It is built using React for the frontend and Flask for the backend.
Large Language Model (LLM) services have rapidly become essential tools for applications ranging from customer support to content generation, yet their distributed nature makes them prone to failures that impact reliability and uptime. Existing tools for analyzing service incidents are either closed-source, lack comparative capabilities, or fail to provide comprehensive insights into failure trends and recovery patterns. To address these gaps, we present FAILS (Framework for Analysis of Incidents and Outages of LLM Services), an open-source system designed to collect, analyze, and visualize incident data from leading LLM providers. FAILS enables users to explore temporal trends, assess reliability metrics such as Mean Time to Recovery (MTTR) and Mean Time Between Failures (MTBF), and gain insights into service co-dependencies using modern LLM-assisted analysis. With a web-based interface and advanced plotting tools, FAILS helps researchers, engineers, and decision-makers understand and mitigate disruptions in LLM services.
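For reference, MTTR is the average time from an incident's start to its resolution, and MTBF is the average time between the onsets of consecutive incidents. The snippet below is a minimal illustrative sketch of computing both metrics from incident timestamps; the DataFrame columns and sample data are hypothetical and do not reflect FAILS's actual schema.

```python
# Illustrative only: computes MTTR and MTBF from incident start/end times.
# The column names ("start", "end") and data are hypothetical examples.
import pandas as pd

incidents = pd.DataFrame({
    "start": pd.to_datetime(["2024-01-01 10:00", "2024-01-03 08:00", "2024-01-07 14:00"]),
    "end":   pd.to_datetime(["2024-01-01 12:30", "2024-01-03 09:00", "2024-01-07 18:00"]),
}).sort_values("start")

mttr = (incidents["end"] - incidents["start"]).mean()    # mean time to recover from an incident
mtbf = incidents["start"].diff().dropna().mean()         # mean time between failure onsets
print(f"MTTR: {mttr}, MTBF: {mtbf}")
```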
## Prerequisites

- Node.js and npm
- Python 3.11 (versions 3.12 and 3.13 were tested and do not work)
- OpenAI API key (optional, but required for the AI plot analysis feature)
## Installation

- **Install Node.js and npm:**
  If you haven't installed Node.js and npm yet, download and install Node.js from the official Node.js website. The installer also includes npm, the package manager for Node.js.
- **Install frontend dependencies:**
  Navigate to the `client` directory and install the dependencies:

  ```bash
  cd client
  npm install
  ```
- **Set up the Python virtual environment:**
  Navigate to the `llm_analysis` directory and create a virtual environment:

  ```bash
  cd llm_analysis
  python -m venv venv
  ```

  Activate the virtual environment:

  - On macOS and Linux:

    ```bash
    source venv/bin/activate
    ```

  - On Windows:

    ```bash
    .\venv\Scripts\activate
    ```
- **Install backend dependencies:**
  With the virtual environment activated, install the dependencies using `pip`:

  ```bash
  pip install -r requirements.txt
  ```
- **Configure environment variables:**
  Create a `.env` file in the `server/scripts` directory with your API keys:

  ```
  OPENAI_API_KEY=your_openai_api_key_here
  ```

  Replace `your_openai_api_key_here` with your actual OpenAI API key.
- **Start the backend server:**
  For development with auto-reload, go to the `server` directory, ensure the virtual environment is activated, then run:

  ```bash
  python app.py
  ```

  This starts the Flask server on http://localhost:5000.

  For production deployment using Gunicorn:

  ```bash
  cd server
  chmod +x start.sh stop.sh  # Make scripts executable (first time only)
  ./start.sh                 # Start the server
  ./stop.sh                  # Stop the server when needed
  ```

  The server will be available at http://localhost:5000 (a quick smoke-test sketch follows these setup steps).
- **Start the frontend development server:**
  In the `client` directory, run:

  ```bash
  npm start
  ```

  This starts the React development server on http://localhost:3000.
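Once both servers are running, a quick check confirms the backend is reachable. This is a minimal sketch, not part of the repository; it assumes only that the Flask server is listening on port 5000 and prints whatever status code the root route returns.

```python
# Minimal smoke test (illustrative, not part of the repository).
# Assumes only that the Flask backend is listening on localhost:5000.
import requests

try:
    response = requests.get("http://localhost:5000/", timeout=5)
    print(f"Backend reachable, status code: {response.status_code}")
except requests.ConnectionError:
    print("Backend not reachable: is the Flask server running?")
```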
## Features

### Dashboard

The main dashboard provides visualization and analysis of LLM service incidents through various plots and metrics.
### Chat Interface

An interactive chat interface allows users to analyze incident patterns and get AI-powered insights about service reliability. The chat interface:

- Maintains conversation context for follow-up questions
- Provides markdown-formatted responses
- Supports natural language queries about:
  - Common failure patterns
  - Service reliability trends
  - Impact analysis
  - Recovery time patterns
  - Root cause categorization
Example queries:
- "Sort the service providers by number of incidents in total in the entire dataset and give the timeframe!"
- "Tell me more about the impact levels of incidents"
The analysis is powered by GPT-4o-mini and uses the historical incident data to provide data-backed insights.
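To illustrate how a chat like this keeps context across turns, here is a minimal sketch of multi-turn conversation with gpt-4o-mini using the OpenAI Python SDK. It shows the general pattern only, not the repository's actual implementation; the system prompt and message handling are hypothetical.

```python
# Illustrative sketch of multi-turn chat with gpt-4o-mini; not the
# repository's actual implementation. The system prompt is hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system",
            "content": "You answer questions about LLM service incident data."}]

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    answer = reply.choices[0].message.content
    # Appending the assistant's reply is what preserves context for follow-ups.
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Which provider had the most incidents?"))
print(ask("And what was its average recovery time?"))  # follow-up relies on context
```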
### AI Plot Analysis

The application includes an AI-powered plot analysis feature that can analyze visualizations and provide insights. To use this feature:
- **Setup Requirements:**
  - Ensure you have a valid OpenAI API key
  - Add the API key to your `.env` file as described above
  - Make sure you're running the application in production mode using the `start.sh` script
- **Using the Feature:**
  - Generate plots by selecting services and a date range
  - Once plots are displayed, find the "AI Plot Analysis" section below the plots
  - Choose either:
    - A single plot to analyze a specific visualization
    - "Analyze All Plots" for a comprehensive summary
  - Click "Analyze Plot" to generate AI insights
- **Analysis Types:**
  - Single Plot Analysis: provides detailed insights about a specific visualization
  - All Plots Analysis: generates a comprehensive summary of all plots, highlighting key patterns and insights (a sketch of how image-based analysis can work follows this list)
- **Troubleshooting:**
  - If you see a "Please use production server" message, ensure you're running the server using `start.sh`
  - Verify your API key is correctly set in the `.env` file
  - Check the server logs for any API-related errors
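As referenced above, the sketch below shows one way plot images can be sent to an OpenAI vision-capable model for analysis. It is an illustrative assumption about the general approach, not the repository's actual code; the file path is hypothetical.

```python
# Illustrative sketch of sending a plot image to gpt-4o-mini for analysis.
# Not the repository's actual code; "plot.png" is a hypothetical path.
import base64
from openai import OpenAI

client = OpenAI()
with open("plot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the key trends in this incident plot."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```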
## Data Collection

The application includes scripts to collect and update incident data from various LLM providers. There are two main data collection scripts:
- **Regular Data Updates** (collects recent incidents):

  ```bash
  cd server/scripts
  python run_incident_scrapers.py
  ```

  This script:
  - Collects new incidents from OpenAI, Anthropic, Character.AI, and StabilityAI
  - Updates the existing incident database with new data
  - Runs both `StabilityAI.py` and `incident_scraper_oac.py`
- **Historical Data Collection** (one-time collection of all historical incidents):

  ```bash
  cd server/scripts/data_gen_modules
  python incident_scraper_oac_historical.py
  ```

  This script:
  - Collects all available historical incidents
  - Creates a complete historical database
  - Should be run only once, when setting up a new instance
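The troubleshooting notes below mention WebDriver errors and the `--headless=new` flag, so the scrapers appear to drive Chrome through Selenium. The sketch below shows that general pattern under this assumption; the status-page URL and CSS selector are hypothetical, not the scrapers' actual targets.

```python
# Illustrative Selenium setup of the kind the scrapers appear to use.
# The URL and CSS selector below are hypothetical examples.
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # remove this line to watch the browser while debugging

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://status.example.com/history")                      # hypothetical status page
    incidents = driver.find_elements(By.CSS_SELECTOR, ".incident-title")  # hypothetical selector
    for incident in incidents:
        print(incident.text)
finally:
    driver.quit()
```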
### Troubleshooting Data Collection

If you encounter issues during data collection:
- **Check the Logs:**
  - View `server/logs/incident_scrapers.log` for detailed error messages
  - Common issues include network timeouts and parsing errors
- **Browser Issues:**
  - If you see WebDriver errors, ensure Chrome is properly installed
  - For debugging, try running without headless mode by removing the `--headless=new` option (see the sketch above)
- **Data Validation Failures:**
  - Check that the source websites haven't changed their structure
  - Verify network connectivity to all provider status pages
## Screenshots

![mainpage](https://private-user-images.githubusercontent.com/77168983/403483095-e31dfd2c-54d6-4a3b-ba23-d1c8fd5fb1bc.png)
![datatable](https://private-user-images.githubusercontent.com/77168983/403483504-57fe0198-43fd-41ae-93f5-53c7fc3788bd.png)
![chatbot](https://private-user-images.githubusercontent.com/77168983/403483566-0d927fd0-bffa-4362-9fd2-9c5f2dc609f8.png)
![llmanalysis](https://private-user-images.githubusercontent.com/77168983/403483619-9ebb9e69-0444-41be-888c-c816642895f6.png)
## Acknowledgements

Code by Nishanthi Srinivasan, Bálint László Szarvas, and Sándor Battaglini-Fischer.
Many thanks to Xiaoyu Chu and Prof. Dr. Ir. Alexandru Iosup for their support!