How to evaluate the models' performance through metrics such as MASE? #75
Comments
Thanks for your interest @nate-gillman! We used gluonts for computing metrics. Here's an example for the
Important: While many datasets in GluonTS have the same name as the ones used in the paper, they may differ from the evaluation in the paper in crucial aspects such as prediction length and number of rolls.
You will need to install:
import numpy as np
import torch
from gluonts.dataset.repository import get_dataset
from gluonts.dataset.split import split
from gluonts.ev.metrics import MASE, MeanWeightedSumQuantileLoss
from gluonts.itertools import batcher
from gluonts.model.evaluation import evaluate_forecasts
from gluonts.model.forecast import SampleForecast
from tqdm.auto import tqdm
from chronos import ChronosPipeline
# Load dataset
batch_size = 32
num_samples = 20
dataset = get_dataset("m4_hourly")
prediction_length = dataset.metadata.prediction_length
# Load Chronos
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cuda:0",
    torch_dtype=torch.bfloat16,
)
# Split dataset for evaluation
_, test_template = split(dataset.test, offset=-prediction_length)
test_data = test_template.generate_instances(prediction_length)
# Generate forecast samples
forecast_samples = []
for batch in tqdm(batcher(test_data.input, batch_size=batch_size)):
    context = [torch.tensor(entry["target"]) for entry in batch]
    forecast_samples.append(
        pipeline.predict(
            context,
            prediction_length=prediction_length,
            num_samples=num_samples,
        ).numpy()
    )
forecast_samples = np.concatenate(forecast_samples)
# Convert forecast samples into gluonts SampleForecast objects
sample_forecasts = []
for item, ts in zip(forecast_samples, test_data.input):
    forecast_start_date = ts["start"] + len(ts["target"])
    sample_forecasts.append(
        SampleForecast(samples=item, start_date=forecast_start_date)
    )
# Evaluate
metrics_df = evaluate_forecasts(
    sample_forecasts,
    test_data=test_data,
    metrics=[
        MASE(),
        MeanWeightedSumQuantileLoss(np.arange(0.1, 1.0, 0.1)),
    ],
)
metrics_df
|
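For readers who want to see what the quantile-loss metric in the snippet above measures, here is a minimal NumPy sketch for a single series. It assumes the common convention of a pinball loss scaled by 2, normalized by the sum of absolute target values and averaged over quantile levels. The function names are hypothetical, and GluonTS aggregates over the whole dataset, so this is an illustration and not the library's implementation.

```python
import numpy as np


def quantile_loss(target, quantile_forecast, q):
    """Pinball loss at level q, summed over time steps and scaled by 2 (assumed convention)."""
    diff = target - quantile_forecast
    return 2.0 * np.sum(np.maximum(q * diff, (q - 1.0) * diff))


def mean_weighted_quantile_loss(target, samples, quantile_levels):
    """Average over quantile levels of the quantile loss divided by sum(|target|).

    target:  array of shape (horizon,)
    samples: array of shape (num_samples, horizon), e.g. sample paths from a model
    """
    target = np.asarray(target, dtype=float)
    samples = np.asarray(samples, dtype=float)
    abs_target_sum = np.sum(np.abs(target))
    losses = []
    for q in quantile_levels:
        # Empirical quantile of the sample paths at each time step
        q_forecast = np.quantile(samples, q, axis=0)
        losses.append(quantile_loss(target, q_forecast, q) / abs_target_sum)
    return float(np.mean(losses))


# Toy data: every sample path is the same point forecast [8, 22]
toy_target = np.array([10.0, 20.0])
toy_samples = np.tile(np.array([8.0, 22.0]), (20, 1))
toy_wql = mean_weighted_quantile_loss(toy_target, toy_samples, np.arange(0.1, 1.0, 0.1))
print(toy_wql)
```

With identical sample paths every quantile collapses to the point forecast, so the result reduces to twice the absolute error divided by the sum of absolute targets.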
Thanks a ton for your quick reply!! That answers my question. |
Keeping this open as a FAQ. |
I have a dataframe that I tried to convert to a GluonTS dataset so I can use the above code, but I keep getting errors. I'd appreciate any advice.
OUTPUT:
OUTPUT:
|
@yeongnamtan there's no need to use
then you can apply |
@lostella I tried what you suggested and get this error:
AttributeError Traceback (most recent call last)
AttributeError: 'list' object has no attribute 'test'
My code is below:
# Load dataset
batch_size = 32
dataset = [
# Split dataset for evaluation
_, test_template = split(dataset.test, offset=-prediction_length)
# Generate forecast samples
forecast_samples = []
# Convert forecast samples into gluonts SampleForecast objects
sample_forecasts = []
# Evaluate
metrics_df = evaluate_forecasts( |
@yeongnamtan you should do
and not
Also, if you wrap your code snippets between triple-backticks, the code will display with a better format, see https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks |
thank you very much for your help. It is working now. |
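The exchange above concerns a dataset built as a plain Python list, not one loaded with `get_dataset`. As a rough sketch of that idea, the following converts a long-format pandas DataFrame into a list of entries with `start` and `target` fields. The helper name, the column names, and the toy data are all hypothetical, and the assumption that such a list can be passed directly to `split` (with no `.test` attribute) is inferred from the error message in the thread.

```python
import numpy as np
import pandas as pd


def dataframe_to_entries(df, item_col, time_col, target_col, freq):
    """Build one {"item_id", "start", "target"} entry per series in a long-format DataFrame."""
    entries = []
    for item_id, group in df.groupby(item_col):
        group = group.sort_values(time_col)
        entries.append(
            {
                "item_id": item_id,
                # Period of the first observation; later steps are implied by freq
                "start": pd.Period(group[time_col].iloc[0], freq=freq),
                "target": group[target_col].to_numpy(dtype=float),
            }
        )
    return entries


# Toy long-format frame with two daily series (hypothetical data)
toy_df = pd.DataFrame(
    {
        "series": ["a"] * 4 + ["b"] * 4,
        "timestamp": list(pd.date_range("2024-01-01", periods=4, freq="D")) * 2,
        "value": np.arange(8, dtype=float),
    }
)
toy_dataset = dataframe_to_entries(toy_df, "series", "timestamp", "value", freq="D")
print(len(toy_dataset), toy_dataset[0]["start"], toy_dataset[1]["target"])
```

A list like `toy_dataset` has no `.test` attribute, which matches the `AttributeError` reported above; under the stated assumption it would be passed to `split` as the dataset itself.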
Hi, I ran the exact same code that @abdulfatir provided above twice for
MASE[0.5] = 0.738313
MASE[0.5] = 0.734823 |
Which model are you using and what's the |
model: |
Ah, okay. This discrepancy may be due to an off-by-one issue we recently fixed in the inference code (See #73). Fixing this improved our results. We have not updated the paper with the new results yet. |
Got it, totally valid! Appreciate it :) |
Update: We have just open-sourced the datasets used in the paper (thanks @shchur!). Please check the updated README. We have also released an evaluation script and backtest configs to compute the WQL and MASE numbers as reported in the paper. Please follow the instructions in this README to evaluate on the in-domain and zero-shot benchmarks. |
Hi Chronos team--
Howdy!! I'm a PhD student in the States and I'm using this as a baseline for my research... thanks for building this model :)
I'm currently implementing evaluation metrics like in the paper to work for the Chronos model, and I'm starting with MASE. One thing that's unclear to me at the moment: in Appendix D of the arXiv preprint, the authors say that the MASE computation involves a seasonality parameter $S$ from the seasonal naive model.
What seasonality parameter should I use to obtain metrics similar to those in the paper? In other scenarios, I've seen some people try to automatically compute a seasonality $S$ for each dataset; I've also seen people use information about the original dataset to select $S$ (e.g. if it's a taxi dataset with hourly counts, then choosing $S=7*24$ would be a reasonable heuristic); and I've seen other people just use $S=1$, but that to me seems like a "seasonal very naive model".
Thanks in advance for your help!!
Cheers
Nate
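To make the role of $S$ in the question concrete, here is a minimal NumPy sketch of the standard MASE definition: the mean absolute forecast error divided by the in-sample mean absolute error of a seasonal naive forecast with lag $S$. The function names and toy numbers are illustrative assumptions; the sketch does not claim which value of $S$ was used in the paper.

```python
import numpy as np


def seasonal_naive_scale(context, seasonality=1):
    """In-sample mean absolute error of the seasonal naive forecast y[t] = y[t - S]."""
    context = np.asarray(context, dtype=float)
    if len(context) <= seasonality:
        raise ValueError("context must be longer than the seasonality")
    return np.mean(np.abs(context[seasonality:] - context[:-seasonality]))


def mase(context, target, forecast, seasonality=1):
    """Mean absolute forecast error scaled by the seasonal naive in-sample error."""
    target = np.asarray(target, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    scale = seasonal_naive_scale(context, seasonality)
    return float(np.mean(np.abs(target - forecast)) / scale)


# Toy series: a linear trend, so the lag-S naive error equals S
toy_context = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
toy_target = np.array([7.0, 8.0])
toy_forecast = np.array([6.0, 9.0])

mase_s1 = mase(toy_context, toy_target, toy_forecast, seasonality=1)
mase_s2 = mase(toy_context, toy_target, toy_forecast, seasonality=2)
print(mase_s1, mase_s2)
```

The same forecast gets a different score depending on $S$, because only the denominator changes. This is why the choice of seasonality matters when comparing against published numbers.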