Add support for exogenous regressors (#125)
* Simplify implementations of SARIMA, ETS, VectorAR.

We remove code duplication from these files, and we also remove the
online training features of ETS. This should make the ETS model easier
to use at inference time, since time_series_prev is no longer entangled
with online inference features.

* Bugfixes.

* Simpler TransformSequence implementation.

* Refactor argument order for anomaly detector train

We change the order from (train_data, anomaly_labels, train_config,
post_rule_train_config) to (train_data, train_config, anomaly_labels,
post_rule_train_config). This brings it in line with the base signature.
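
A minimal sketch of why this re-ordering matters for callers. `ToyDetector` is a hypothetical stand-in and not a Merlion class; only the parameter names and their new order come from the message above. Passing the optional arguments by keyword avoids the mis-binding that an old-style positional call would now produce.

```python
from typing import Any, Dict, Optional


class ToyDetector:
    """Hypothetical stand-in for an anomaly detector (not a Merlion class)."""

    def train(
        self,
        train_data: Any,
        train_config: Optional[Any] = None,
        anomaly_labels: Optional[Any] = None,
        post_rule_train_config: Optional[Any] = None,
    ) -> Dict[str, Any]:
        # Return the bound arguments so the binding can be inspected.
        return {
            "train_data": train_data,
            "train_config": train_config,
            "anomaly_labels": anomaly_labels,
            "post_rule_train_config": post_rule_train_config,
        }


data = [1.0, 2.0, 9.0]
labels = [0, 0, 1]
detector = ToyDetector()

# Old-style positional call: the labels now land in the train_config slot.
misbound = detector.train(data, labels)

# Keyword call: unaffected by the change in argument order.
bound = detector.train(data, anomaly_labels=labels)
```

Callers that passed labels positionally under the old order would need to switch to the keyword form (or swap the arguments) after this change.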

* Allow custom color plots.

* Add exogenous regressor support for Prophet.

* Add exog regressor support for SARIMA.

* Add exog_data param to downstream models.

We allow ForecastingDetector, ForecasterEnsemble, and LayeredForecaster
models to accept the exog_data param in both the train() and forecast()
methods. However, layered models (especially AutoML) do not yet support
training with exogenous data.
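
A rough sketch of the calling pattern, assuming a toy model. `ToyExogForecaster` and its least-squares fit are hypothetical and not Merlion's implementation; only the `exog_data` parameter name and its presence in both `train()` and `forecast()` come from the message above. The point is that exogenous values must be supplied at training time and again for the forecast horizon.

```python
from typing import List, Optional


class ToyExogForecaster:
    """Hypothetical forecaster: mean baseline plus a slope on one exogenous series."""

    def __init__(self) -> None:
        self.mean = 0.0
        self.exog_mean = 0.0
        self.slope = 0.0
        self.trained_with_exog = False

    def train(self, train_data: List[float], exog_data: Optional[List[float]] = None) -> None:
        self.mean = sum(train_data) / len(train_data)
        self.trained_with_exog = exog_data is not None
        if exog_data is not None:
            # Ordinary least-squares slope of the target on the exogenous series.
            self.exog_mean = sum(exog_data) / len(exog_data)
            num = sum((x - self.exog_mean) * (y - self.mean) for x, y in zip(exog_data, train_data))
            den = sum((x - self.exog_mean) ** 2 for x in exog_data)
            self.slope = num / den if den else 0.0

    def forecast(self, horizon: int, exog_data: Optional[List[float]] = None) -> List[float]:
        if not self.trained_with_exog:
            return [self.mean] * horizon
        if exog_data is None or len(exog_data) < horizon:
            raise ValueError("exog_data must cover the whole forecast horizon")
        return [self.mean + self.slope * (x - self.exog_mean) for x in exog_data[:horizon]]


model = ToyExogForecaster()
model.train([2.0, 4.0, 6.0], exog_data=[1.0, 2.0, 3.0])
prediction = model.forecast(2, exog_data=[4.0, 5.0])
```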

* Make base train() abstract.

* Make exogenous pre-processing more rigorous.

* Add exog regressor support to evaluators.

* Add exog regressor test for Prophet.

* Abstract away the notion of grid search.

* Add ensemble & evaluator test coverage for exog.

* Fix build failures.

* Remove exog_data reference from SeasonalityLayer

* More fixes.

* Add exogenous regressor support to layered models.

* Silence Prophet deserialization warnings.

* Fix 2-layer AutoSarima bug.

* Change train to _train for DetectorEnsemble.

* Fix typos.

* Slight cleanup of ensemble code.

* Simplify ensemble cross-val to use evaluators.

* Add save/load test coverage for exog.

* Make layers aware of exogenous regressors.

* Deprecate Python 3.6 & update version.

* More rigorous handling of kwargs in LayeredModel.

* Rename ForecasterWithExog to ForecasterExogBase

* More robust support for inverse transforms.

Use named variables rather than integer indexing. This ensures that we
can invert multivariate forecasts.
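
To illustrate the idea, here is a small sketch assuming a per-variable scaling transform. `NamedScale` is hypothetical and not Merlion's code. Because the parameters are keyed by variable name, the inverse still works when the forecast contains only some of the training variables, or lists them in a different order, which is where integer indexing would pick the wrong parameters.

```python
from typing import Dict, List, Tuple


class NamedScale:
    """Hypothetical min-max scaling transform keyed by variable name."""

    def __init__(self) -> None:
        self.params: Dict[str, Tuple[float, float]] = {}

    def train(self, data: Dict[str, List[float]]) -> None:
        for name, values in data.items():
            lo, hi = min(values), max(values)
            # Fall back to a unit scale for constant series.
            self.params[name] = (lo, (hi - lo) or 1.0)

    def invert(self, data: Dict[str, List[float]]) -> Dict[str, List[float]]:
        restored = {}
        for name, values in data.items():
            lo, scale = self.params[name]  # lookup by name, not by position
            restored[name] = [v * scale + lo for v in values]
        return restored


history = {"sales": [10.0, 20.0, 30.0], "temp": [0.0, 50.0, 100.0]}
transform = NamedScale()
transform.train(history)

# The forecast holds only "temp": position 0 here, but position 1 in training.
forecast = transform.invert({"temp": [0.5, 1.0]})
```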

* Fix how model_kwargs is set in layered models.

* Make time series more JSON-compatible.

* Various bugfixes.

* More systematic post-processing of forecasts.

* Fix docs error.

* Remove RMSE value assertions from boostingtrees.

* Skip univariate VectorAR test as before.

* Skip spark tests on Python 3.10

* Optimize application of inverse transforms.

The inverse of many transforms is just the identity. This commit adds an
optimization which skips applying the inverse altogether if this is the
case.
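
One way such an optimization could look, as a sketch only. The class names and the `inverse_is_identity` flag are assumptions made for illustration and are not taken from the Merlion source. Each transform declares whether its inverse is a no-op, and the sequence skips those when inverting.

```python
from typing import List, Sequence, Tuple


class Transform:
    """Hypothetical base class; the flag name is an assumption."""

    inverse_is_identity = False

    def invert(self, values: List[float]) -> List[float]:
        raise NotImplementedError


class Passthrough(Transform):
    """Stand-in for a transform whose inverse changes nothing."""

    inverse_is_identity = True

    def invert(self, values: List[float]) -> List[float]:
        return list(values)


class Shift(Transform):
    def __init__(self, offset: float) -> None:
        self.offset = offset

    def invert(self, values: List[float]) -> List[float]:
        return [v - self.offset for v in values]


def invert_sequence(transforms: Sequence[Transform], values: List[float]) -> Tuple[List[float], int]:
    """Invert in reverse order, skipping identity inverses; also count the calls made."""
    applied = 0
    for transform in reversed(transforms):
        if transform.inverse_is_identity:
            continue  # nothing to undo, so avoid the copy entirely
        values = transform.invert(values)
        applied += 1
    return values, applied


restored, n_applied = invert_sequence([Passthrough(), Shift(5.0), Passthrough()], [6.0, 7.0])
```

Here only the `Shift` inverse is actually executed; the two pass-through steps are skipped.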

* Reduce size of walmart_mini to prevent OOM errors.

* Update pyspark session fixture.

* Make test_vector_ar smaller.

* Remove python3.6 fallback code from conj priors.

* models.anomaly.utils -> models.utils.torch_utils

* Update default settings of ARIMA models.

The enforce_invertibility=True and enforce_stationarity=True settings
were previously causing segfaults in the unit tests because of an
out-of-memory error. The tests now use a smaller data size to
circumvent the error.

* Fix failures from SARIMA model update.

* Try unpersisting dataframes in spark tests.

This could ameliorate OOM issues (if that's the cause of test failures).

* Increase Spark network timeout for unit tests.

* Run spark tests separately for 3.8/3.9.

* Update test_forecast_ensemble.

* Add tutorial on exogenous regressors.

* Add auto-retry to tests.

* Use cached docs from gh-pages branch.

Also improve git robustness of build_docs.sh.
aadyotb authored Oct 3, 2022
1 parent ab90426 commit a0d9f9c
Showing 88 changed files with 2,376 additions and 6,772 deletions.
10 changes: 5 additions & 5 deletions .github/workflows/docs.yml
@@ -9,18 +9,18 @@ on:
types: [ published ]

jobs:
build:
docs:

runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: '3.9'
python-version: '3.10'
- name: Install dependencies
run: |
sudo apt-get update -y
@@ -29,7 +29,7 @@ jobs:
- name: Build Sphinx docs
run: |
docs/build_docs.sh
timeout-minutes: 60
timeout-minutes: 10
- name: Deploy to gh-pages
uses: peaceiris/actions-gh-pages@v3
if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'release' }}
6 changes: 3 additions & 3 deletions .github/workflows/publish.yml
@@ -8,11 +8,11 @@ jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: '3.x'
python-version: '3.10'
- name: Install dependencies
run: |
python -m pip install --upgrade pip setuptools build
80 changes: 38 additions & 42 deletions .github/workflows/tests.yml
@@ -1,4 +1,4 @@
name: build
name: tests

on:
push:
@@ -7,18 +7,18 @@ on:
branches: [ main ]

jobs:
build:
tests:

runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: ["3.6", "3.7", "3.8", "3.9", "3.10"]
python-version: ["3.7", "3.8", "3.9", "3.10"]

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

@@ -34,45 +34,41 @@ jobs:
- name: Test with pytest
id: test
run: |
# Get a comma-separated list of the directories of all python source files
source_files=$(for f in $(find merlion -iname "*.py"); do echo -n ",$f"; done)
script="import os; print(','.join({os.path.dirname(f) for f in '$source_files'.split(',') if f}))"
source_modules=$(python -c "$script")
# A BLAS bug causes high-dim multivar Bayesian LR test to segfault in 3.6. Run the test first to avoid.
if [[ $PYTHON_VERSION == 3.6 ]]; then
python -m pytest -v tests/change_point/test_conj_prior.py
coverage run --source=${source_modules} -L -m pytest -v --ignore tests/change_point/test_conj_prior.py
else
coverage run --source=${source_modules} -L -m pytest -v
fi
# Obtain code coverage from coverage report
coverage report
coverage xml -o .github/badges/coverage.xml
COVERAGE=`coverage report | grep "TOTAL" | grep -Eo "[0-9\.]+%"`
echo "##[set-output name=coverage;]${COVERAGE}"
# Choose a color based on code coverage
COVERAGE=${COVERAGE/\%/}
if (($COVERAGE > 90)); then
COLOR=brightgreen
elif (($COVERAGE > 80)); then
COLOR=green
elif (($COVERAGE > 70)); then
COLOR=yellow
elif (($COVERAGE > 60)); then
COLOR=orange
else
COLOR=red
fi
echo "##[set-output name=color;]${COLOR}"
uses: nick-fields/retry@v2
env:
PYTHON_VERSION: ${{ matrix.python-version }}
with:
max_attempts: 3
timeout_minutes: 40
command: |
# Get a comma-separated list of the directories of all python source files
source_files=$(for f in $(find merlion -iname "*.py"); do echo -n ",$f"; done)
script="import os; print(','.join({os.path.dirname(f) for f in '$source_files'.split(',') if f}))"
source_modules=$(python -c "$script")
# Run tests & obtain code coverage from coverage report.
coverage run --source=${source_modules} -L -m pytest -v -s
coverage report && coverage xml -o .github/badges/coverage.xml
COVERAGE=`coverage report | grep "TOTAL" | grep -Eo "[0-9\.]+%"`
echo "##[set-output name=coverage;]${COVERAGE}"
# Choose a color based on code coverage
COVERAGE=${COVERAGE/\%/}
if (($COVERAGE > 90)); then
COLOR=brightgreen
elif (($COVERAGE > 80)); then
COLOR=green
elif (($COVERAGE > 70)); then
COLOR=yellow
elif (($COVERAGE > 60)); then
COLOR=orange
else
COLOR=red
fi
echo "##[set-output name=color;]${COLOR}"
- name: Create coverage badge
if: ${{ github.ref == 'refs/heads/main' && matrix.python-version == '3.8' }}
if: ${{ github.ref == 'refs/heads/main' && matrix.python-version == '3.10' }}
uses: emibcn/[email protected]
with:
label: coverage
@@ -81,8 +77,8 @@ jobs:
path: .github/badges/coverage.svg

- name: Push badge to badges branch
uses: s0/git-publish-subdir-action@develop
if: ${{ github.ref == 'refs/heads/main' && matrix.python-version == '3.8' }}
uses: s0/git-publish-subdir-action@v2.5.1
if: ${{ github.ref == 'refs/heads/main' && matrix.python-version == '3.10' }}
env:
REPO: self
BRANCH: badges
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -74,7 +74,7 @@ transform used to process the data before giving it to the model, if the `transf
given when initializing the config.

See our implementation of [SARIMA](merlion/models/forecast/sarima.py) for a fairly simple example of what this looks
like in practice, and this [notebook](examples/forecast/ForecastNewModel.ipynb) for a step-by-step walkthrough of a
like in practice, and this [notebook](examples/forecast/4_ForecastNewModel.ipynb) for a step-by-step walkthrough of a
minimal example.

### Forecaster-Based Anomaly Detectors
@@ -92,7 +92,7 @@ this class into an `ForecasterDetectorClass`. You need to do the following thing

See our implementation of a [Prophet-based anomaly detector](merlion/models/anomaly/forecast_based/prophet.py) for an
example of what this looks like in practice, as well as the forecaster tutorial
[notebook](examples/forecast/3_ForecastNewModel.ipynb).
[notebook](examples/forecast/4_ForecastNewModel.ipynb).

## Data Pre-Processing Transforms
To implement a new data pre-processing transform, begin by reading the
@@ -127,7 +127,7 @@ You can add support for a new dataset of time series by implementing an appropri
[`ts_datasets`](ts_datasets), and uploading the raw data (potentially compressed) to the [`data`](data) directory.
If your dataset has labeled anomalies, it belongs in [`ts_datasets.anomaly`](ts_datasets/ts_datasets/anomaly). If it
does not have labeled anomalies, it belongs in [`ts_datasets.forecast`](ts_datasets/ts_datasets/forecast). See the
[API docs](https://opensource.salesforce.com/Merlion/latest/ts_datasets.html) for more details.
[API docs](https://opensource.salesforce.com/Merlion/ts_datasets.html) for more details.

Once you've implemented your data loader class, add it to the top-level namespace of the module
([`ts_datasets/ts_datasets/anomaly/__init__.py`](ts_datasets/ts_datasets/anomaly/__init__.py) or
2 changes: 2 additions & 0 deletions Dockerfile
@@ -15,3 +15,5 @@ RUN pip install pyarrow "./"
COPY apps /opt/spark/apps
RUN chmod g+w /opt/spark/apps
USER ${spark_uid}
COPY emissions.csv emissions.csv
COPY emissions.json emissions.json
9 changes: 1 addition & 8 deletions benchmark_forecast.py
@@ -336,15 +336,8 @@ def train_model(
config=ForecastEvaluatorConfig(train_window=train_window, horizon=horizon, retrain_freq=retrain_freq),
)

# Initialize train config
train_kwargs = {}
if type(model).__name__ == "AutoSarima":
train_kwargs = {"train_config": {"enforce_stationarity": True, "enforce_invertibility": True}}

# Get Evaluate Results
train_result, test_pred = evaluator.get_predict(
train_vals=train_vals, test_vals=test_vals, train_kwargs=train_kwargs, retrain_kwargs=train_kwargs
)
train_result, test_pred = evaluator.get_predict(train_vals=train_vals, test_vals=test_vals)

rmses = evaluator.evaluate(ground_truth=test_vals, predict=test_pred, metric=ForecastMetric.RMSE)
smapes = evaluator.evaluate(ground_truth=test_vals, predict=test_pred, metric=ForecastMetric.sMAPE)