Caching of daily prices in backtest #1469

PurpleHazeIan · 2024-11-29T13:11:35Z

Should the get_daily_prices() method of class RawData be cached? Its docstring describes it as a 'KEY OUTPUT', but the decorator used is @input, not @output().

I ran some tests, naively thinking that reading daily prices would be the simplest of activities, and very fast. But get_daily_prices() is not fast, at least on Windows (I don't use any other OS to compare), because the sim_data class that it calls does a pandas resample to business days (.resample('1B')), and this is very slow.
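
To look at that cost in isolation, here is a minimal sketch that times a business-day resample on synthetic hourly data. The helper names and the synthetic series are made up for illustration, and it assumes the downsampling keeps the last price of each business day; the actual sim data code may differ in detail, and timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def make_hourly_prices(n_hours: int = 24 * 365 * 10, seed: int = 0) -> pd.Series:
    """Synthetic hourly random-walk prices on weekdays only (illustrative data)."""
    index = pd.date_range("2015-01-01", periods=n_hours, freq=pd.Timedelta(hours=1))
    index = index[index.dayofweek < 5]  # drop Saturdays and Sundays
    rng = np.random.default_rng(seed)
    return pd.Series(100.0 + rng.standard_normal(len(index)).cumsum(), index=index)


def to_daily(prices: pd.Series) -> pd.Series:
    """Downsample to business days, keeping the last price of each day (assumed)."""
    return prices.resample("1B").last()


if __name__ == "__main__":
    prices = make_hourly_prices()
    start = time.perf_counter()
    daily = to_daily(prices)
    elapsed = time.perf_counter() - start
    print(f"{len(prices)} rows -> {len(daily)} rows in {elapsed:.4f}s")
```

Multiplying the measured time by the number of instruments and by the number of calls per instrument gives a rough idea of what an uncached read costs over a whole backtest.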

In my experience (40 rules, 45 instruments, no parameters estimated, i.e. all read from yaml), changing the decorator from @input to @output() reduced the duration of the run_systems step by around 25%, by cutting the time to calculate all 1800 scaled and capped forecasts by around 75%.

Perhaps we should not be motivated by squeezing down runtimes, but caching daily prices - which are the basis of many calculations downstream - would seem right even if it made no great difference. I saw a comment on Ideas for improvements #1420 about not having to rerun entire backtests. Simple changes that improve the speed of backtests could play into that space.

tgibson11 · 2024-11-29T15:32:13Z

There is likely a trade off here between performance and memory usage.

Whether that trade off is worthwhile will probably be a matter of opinion, but I think we should at least understand the effect before making this kind of change.

Would you mind re-running your test and noting the effect on memory use?

1 reply

PurpleHazeIan Nov 29, 2024
Author

This is rough and ready and the result of one run for each scenario. Without knowing how to measure and log memory usage, I simply observed values displayed in Windows Task Manager for the python process.
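
For anyone wanting something more repeatable than Task Manager, a rough sketch using the standard library's tracemalloc follows. It only sees allocations made while tracing is switched on and that go through allocators tracemalloc hooks into, so it will generally understate the whole-process figure. The peak_memory helper and the stand-in workload are made up for illustration.

```python
import tracemalloc
from contextlib import contextmanager


@contextmanager
def peak_memory(label: str, results: dict):
    """Record the peak traced allocation (in MB) for the enclosed block."""
    tracemalloc.start()
    try:
        yield
    finally:
        _current, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results[label] = peak / 1024**2


if __name__ == "__main__":
    results = {}
    with peak_memory("stand-in workload", results):
        # Replace this with the step being measured, e.g. building the system.
        data = [float(i) for i in range(1_000_000)]
    print(results)
```

Wrapping each phase of the run in its own `with` block would give one peak figure per phase, which is easier to compare across decorator choices than eyeballing a process monitor.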

Note that I have modified things a little so that all capped_forecasts are evaluated up front in my version of runSystemCarryTrendDynamic(), via an extra method in ForecastCombine(). This is a holdover from a fork I saw which proposed using multiprocessing to speed up forecast calculation, which I don't actually do. So for our purposes here run_systems can be divided into three phases: (1) calculating all capped_forecasts; (2) choosing which forecasts to use for each instrument; (3) applying instrument weights, calculating notional positions and pickling the system. In phase 1 memory use climbs to a plateau; in phase 2 it stays constant or rises; in phase 3 it rises rapidly to a peak before the process ends (so I am certainly not seeing the full instantaneous maximum).

So the results (on a laptop with 16GB, not memory constrained):
Decorator @output(): (1) 1.210 GB (2) steady at 1.2 GB (3) 4.3 GB; pickle size on disk 2730 MB
Decorator @input: (1) 1.254 GB (2) rising to 2.0 GB (3) 4.5 GB; pickle size on disk 2768 MB

I would conclude from the similarity of these numbers, despite the sample size of one, that the extra caching is not adding significantly to memory requirements (with @output() the numbers are actually slightly lower) but is delivering a significant runtime reduction.

The github adjusted prices for 252 instruments occupy 243MB in CSV, so perhaps say 105MB in parquet (my adjusted prices for 45 futures are 14MB in CSV and 6MB in parquet) - how much memory could a cached copy realistically occupy?
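
As a back-of-envelope check on that question, the sketch below measures a synthetic float64 series on a business-day index. The 50-year history length and the helper name are assumptions for illustration, not figures from the repo data. Each row costs about 16 bytes: 8 for the float64 value and 8 for the datetime64 index entry.

```python
import numpy as np
import pandas as pd


def daily_series_bytes(n_years: int = 50, start: str = "1975-01-01") -> int:
    """Memory footprint in bytes of a synthetic daily price series."""
    # Roughly 260 business days per year (assumed, ignoring holidays).
    index = pd.bdate_range(start=start, periods=n_years * 260)
    prices = pd.Series(np.zeros(len(index)), index=index)
    return int(prices.memory_usage(index=True, deep=True))


if __name__ == "__main__":
    per_instrument = daily_series_bytes()
    for n_instruments in (45, 252):
        total_mb = n_instruments * per_instrument / 1024**2
        print(f"{n_instruments} instruments: ~{total_mb:.1f} MB")
```

On these assumptions a cached daily series should come out at a few hundred kilobytes per instrument at most, which is small next to the multi-gigabyte peaks reported above.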

I'm happy if this is not taken up in the repo - but was so surprised by the impact that I thought it worth sharing!

PurpleHazeIan · 2024-11-29T18:26:42Z

Author

Another argument for assuming the memory impact is small is to examine just how much is cached in the system object's cache (which might be an important driver of memory usage). For each instrument almost 500 items are cached (system with 40 rules). This includes get_raw_forecast twice for each rule (one from rules and one from forecastScaleCap), get_scaled_forecast for each rule, get_capped_forecast for each rule. At least 160 series each of the same length as daily_prices. One more is nothing.

1 reply

tgibson11 Nov 29, 2024

I prefer not to assume, but based on the results you described above, the impact does seem to be relatively minimal.

vishalg · 2024-11-29T18:47:05Z

Would you be able to provide a patch for this? I run my system on a cloud server (with limited memory) and I'll be happy to test.

3 replies

tgibson11 Nov 29, 2024

Sure. Is the decorator the only thing that needs to be changed?

vishalg Nov 29, 2024

I am not sure, one for @PurpleHazeIan perhaps

tgibson11 Nov 29, 2024

Sorry, didn't realize you weren't the OP.

bug-or-feature · 2024-11-29T19:10:17Z

Collaborator

From the docs on caching

Similarly most stages contain 'input' methods, which do no calculations but get the 'output' from an earlier stage and then 'serve' it to the rest of the stage. These exist to simplify changing the internal wiring of a stage and reduce the coupling between methods from different stages. These should also never cache; or again we'll be caching the same data multiple times (see stage wiring).

5 replies

tgibson11 Nov 29, 2024

This isn't entirely clear to me, but we're dealing with the RawData stage here. Does it even have inputs from a preceding stage? It doesn't look like it to me, but I haven't looked much at the stage wiring stuff before.

It gets data from futuresSimData, which is not a stage, and doesn't do any caching.

Logically, it doesn't make sense to me that these very early-stage methods would be getting data directly from Parquet when all the later-stage data that they are used to calculate has been cached. Could lead to inconsistent results if there is newer data in Parquet that is not reflected in cached data.

In any case, if we're changing get_daily_prices, we should also be changing get_hourly_prices. Possibly get_prices_at_natural_frequency too, although that one seems less necessary.

PurpleHazeIan Nov 29, 2024
Author

@vishalg Yes, the only change I made was to replace @input with @output() - see the original code snippet from rawdata.py below.
@bug-or-feature I note that the docs do say that. But in this case, adding caching does not result in the data being cached multiple times, because at present it is not cached at all. get_daily_prices() is called 6 times per instrument by runSystemCarryTrendDynamic() and it is very expensive (unexpectedly so, although I am not in a position to rule out that this is a consequence of how pandas business day resampling is implemented on Windows). Notwithstanding all that, there is a discrepancy between the decorator and the docstring - see code snippet. (And, separately, get_raw_forecast() does seem to be cached twice, as @output() of Rules and @diagnostic() of forecastScaleCap - I think these are both accidental oversights in complex code.)

    @input
    def get_daily_prices(self, instrument_code) -> pd.Series:
        """
        Gets daily prices

        :param instrument_code: Instrument to get prices for
        :type trading_rules: str

        :returns: Tx1 pd.DataFrame

        KEY OUTPUT
        """
        self.log.debug(
            "Calculating daily prices for %s" % instrument_code,
            instrument_code=instrument_code,
        )

PurpleHazeIan Nov 29, 2024
Author

Correction - get_raw_forecast() is not cached in the base forecastScaleCap, but it is in volAttenForecastScaleCap, which I use.

vishalg Nov 30, 2024

In that case, it would make sense for this to be cached again. forecastScaleCap simply fetches the result from rules_stage and passes it on. On the other hand, volAttenForecastScaleCap does further processing before returning the data.

PurpleHazeIan Nov 30, 2024
Author

Yes, that seems right. Once volAttenForecastScaleCap has cached its version, the version from rules_stage should never be called directly again, despite rules_stage 'owning' it as an output.

PurpleHazeIan · 2024-12-02T19:42:28Z

Author

Before closing discussion …

Adding (or removing) caching doesn't of itself interfere with stage wiring, as far as I understand it. Caching can even be turned off (backtests become glacially slow).

Stage wiring seems to me to be the convention that when any stage needs (as an example) capped_forecasts, it calls any_stage.get_capped_forecasts() for its input which calls forecastScaleCap.get_capped_forecasts() for its output. forecastScaleCap sensibly does the caching (output() decorator) and any_stage doesn't (input decorator does nothing). The data could be cached on one side of this interface or the other or both or neither; the results will be the same, except for passing through the decorator function, and the implications for memory usage and calculation time.
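
A toy sketch of that convention, in case it helps. The decorators and classes here are simplified stand-ins written for illustration (cache_output, passthrough_input, ToyForecastScaleCap and ToyDownstreamStage are all made-up names), not the real pysystemtrade implementations.

```python
import functools


def cache_output(method):
    """Stand-in for an output-style decorator: cache results per (method, args)."""
    @functools.wraps(method)
    def wrapper(self, *args):
        key = (method.__name__, args)
        if key not in self.cache:
            self.cache[key] = method(self, *args)
        return self.cache[key]
    return wrapper


def passthrough_input(method):
    """Stand-in for an input-style decorator: no caching, just call through."""
    return method


class ToyForecastScaleCap:
    def __init__(self):
        self.cache = {}
        self.calculations = 0

    @cache_output
    def get_capped_forecast(self, instrument_code, rule_name):
        self.calculations += 1  # count how often the expensive work really runs
        return f"capped forecast for {instrument_code}/{rule_name}"


class ToyDownstreamStage:
    def __init__(self, forecast_stage):
        self.forecast_stage = forecast_stage

    @passthrough_input
    def get_capped_forecast(self, instrument_code, rule_name):
        # Input method: serves the upstream output to the rest of this stage.
        return self.forecast_stage.get_capped_forecast(instrument_code, rule_name)


if __name__ == "__main__":
    upstream = ToyForecastScaleCap()
    downstream = ToyDownstreamStage(upstream)
    for _ in range(6):
        # "INSTR_A" and "rule_1" are placeholder labels.
        downstream.get_capped_forecast("INSTR_A", "rule_1")
    print(upstream.calculations)
```

Six calls through the input method trigger a single calculation, because the upstream output is cached. If nothing upstream caches, as with get_daily_prices() reading from sim data, every one of those calls repeats the expensive work.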

As @tgibson11 has said, the thing with rawdata.get_daily_prices() being marked as input is that there is no corresponding output() as the stage before doesn't do caching. Yet all the forecasts and positions that derive from daily_prices are cached. This just struck me as an oversight (and an expensive one when I saw the impact on calculation time). There may be other methods which would benefit too.

Separately, but linked, if memory usage is a real constraint for some, then it might be worth reviewing what gets cached or not (not everything decorated diagnostic() is needed in production). I wrote some logging for the cache. On my own production backtest - 40 rules, 45 instruments, using myFuturesRawData and volAttenForecastScaleCap stages for extra functionality - I observed …

30037 items were calculated and placed in the cache.
1 was retrieved 64970 times - base_system.get_instrument_list().
6524 were retrieved 31224 times, so 4-5 times each on average.
23512 were never retrieved. Several of these were forecast series in sets of 40x45=1800. It might be advantageous not to cache 45x40 scaled_forecasts, which are not needed after capped_forecasts are available and not retrieved from cache, to make way for 45 daily_prices, as an example.
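
The kind of cache logging described above could be sketched along these lines. This is a simplified illustration with made-up names (CountingCache, get_or_compute), not the logging that produced the numbers above and not the system cache's real interface.

```python
from collections import Counter


class CountingCache:
    """Dict-backed cache that counts how often each stored item is retrieved."""

    def __init__(self):
        self._items = {}
        self.retrievals = Counter()

    def get_or_compute(self, key, compute):
        if key in self._items:
            self.retrievals[key] += 1  # served from cache
        else:
            self._items[key] = compute()  # first request: calculate and store
            self.retrievals[key] = 0  # register the key with zero retrievals
        return self._items[key]

    def summary(self) -> dict:
        counts = list(self.retrievals.values())
        return {
            "items_cached": len(counts),
            "never_retrieved": sum(1 for c in counts if c == 0),
            "total_retrievals": sum(counts),
        }


if __name__ == "__main__":
    cache = CountingCache()
    for _ in range(3):
        cache.get_or_compute(("daily_prices", "INSTR_A"), lambda: [1.0, 2.0])
    cache.get_or_compute(("scaled_forecast", "INSTR_A", "rule_1"), lambda: [0.5])
    print(cache.summary())
```

Items that show up as never retrieved are the candidates for not caching at all when memory is tight.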


PurpleHazeIan · 2024-12-05T21:46:54Z

Author

Reopened as I see @robcarver17 has looked in recently - Rob, can you think of any reason why rawdata.get_daily_prices() should not be cached?

1 reply

robcarver17 Dec 6, 2024
Maintainer

No, it's completely logical that we cache an expensive read operation when we first run it; similarly that we cache an expensive calculation that we're going to reuse.
