Replies: 22 comments 24 replies
-
I've considered this before. I guess it would make sense to keep MongoDB as the basic engine, although we could write to an intermediate layer (anything that can store a dictionary, a bit like the rather ugly timed object class) and then from there to mongo.
-
Doing some @-ing to get more interest in this thread: @bug-or-feature
-
Pandas already has a .to_dict() method for data frames, so in theory one could just write this dict to mongo, and there are plenty of toy examples on the web showing this. There are probably some issues with converting dates and weird data types, but nothing insurmountable, and we don't have to address every case since the type of data is well known (all floats, for the data currently in Arctic). My biggest question is around speed: something like the adjusted price series for Gold has nearly 50,000 rows and counting. Arctic seems blindingly fast, but then it's been written to cope with tick data, which isn't required here; it probably also has to deal with other special cases that aren't relevant here.
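For the curious, a minimal sketch of that round trip, assuming pymongo and made-up database/collection names; the date index is reset to a column so pymongo can serialise the timestamps as BSON datetimes:

```python
# Hedged sketch: DataFrame <-> mongo via .to_dict(). The database and
# collection names here are hypothetical, for illustration only.
import pandas as pd
from pymongo import MongoClient

def write_frame_to_mongo(df: pd.DataFrame, client: MongoClient):
    collection = client["pysystemtrade"]["adjusted_prices"]  # hypothetical
    # reset_index() turns an unnamed DatetimeIndex into an "index" column;
    # pd.Timestamp subclasses datetime, so pymongo can encode it directly
    records = df.reset_index().to_dict(orient="records")
    collection.insert_many(records)

def read_frame_from_mongo(client: MongoClient) -> pd.DataFrame:
    collection = client["pysystemtrade"]["adjusted_prices"]
    records = list(collection.find({}, {"_id": 0}))
    return pd.DataFrame(records).set_index("index")
```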
-
What problem are we trying to resolve here? If it's just that we cannot update to the latest pandas and arctic, I think we should wait. There are a couple of live PRs (man-group/arctic#887 and man-group/arctic#908) that hint that the issue with the deprecation of pandas.Panel might go away pretty soon. More generally, in my experience it's a really bad idea to write your own code for anything! ALWAYS try to use someone else's. Writing your own should be a last resort.
-
Present, and following along. I'd ask the same as Andy: what problem are we trying to solve? My gut feeling is that Arctic and Mongo are both overkill for what we are doing, at least with only daily data. OTOH, they seem to be working well at the moment, other than the pandas version issue, which is surmountable and will hopefully be resolved soon. Unless there is some significant benefit to be gained that I'm not seeing, I say if it ain't broke, don't fix it.
-
I'm not sure that AHL are using Arctic anymore, although this is pure rumour and speculation. I like the idea of storing raw data from ib_insync in mongo (useful for debugging) and then periodically building a timeseries in parquet from the ib_insync messages or historical data (csv/parquet/structured mongo collection). The performance of reading parquet data is fantastic; I would expect improvements in our IO speed.
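A hedged sketch of that pipeline, with hypothetical collection names, field names, and file paths: raw messages land in mongo for debugging, and a periodic job rebuilds a parquet timeseries from them:

```python
# Sketch only: rebuild a parquet timeseries from raw records kept in
# mongo. Collection name, "date" field, and output path are all assumptions.
import pandas as pd
from pymongo import MongoClient

def rebuild_parquet_from_raw(client: MongoClient, path: str):
    raw = client["raw_data"]["ib_historical_bars"]  # hypothetical collection
    records = list(raw.find({}, {"_id": 0}))
    df = pd.DataFrame(records).set_index("date").sort_index()
    df.to_parquet(path)

# Reading it back is one fast columnar call:
# prices = pd.read_parquet("/data/parquet/GOLD.parquet")  # hypothetical path
```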
-
Interesting new section on the Arctic README about a next generation version: https://github.com/man-group/arctic There has been a lot of activity in the project, but they haven't got to the bits we want yet...
-
As has been said before, I don't think there is any massive hurry, but if arctic is never going to be upgraded in its open source variant to cope with later pandas libraries, at some point the bullet will have to be bitten. At the same time I'd want to take some data that I've stored in my weird funky 'timed storage' mode and put that into time series (the main exception being optimal positions, where a flexible record and class need to be stored). There is already an issue raised for this.

I've heard good things about parquet, and am certainly leaning towards that for time series data. I don't know how well it would handle the concurrent nature of pysystemtrade, and of course there would need to be a specified file structure. .csv backups would be easier though: just crawl the parquet file structure and write .csvs out (sketched below).

The question then arises as to what to do about the non time series data. The irony is that my original choice of mongo was purely to run Arctic on top of it... Of course there is no hurry; it could stay in mongo forever in reality. But I agree that it does seem to be a rather industrial sized option if we aren't also storing time series data in the same place. I'm not sure I want to go all the way back to using SQLite; I don't think it copes well with multiple reads/writes from different threads, and I'm not sure I ever want to write any SQL ever again. I think Redis would be an obvious alternative, but I don't know enough about databases to judge if that's a step in the right direction, or if the Redis/mongo decision is purely an ideological one.

I also note in passing that any non time series data where the record structure is fixed could be represented as a dataframe and thus put into parquet. That probably accounts for the overwhelming proportion of the data; the only exceptions I can think of offhand are the log records and (again) the optimal position tables. However there are ways to get round the latter, such as using strategy specific dataframes. Given I don't think I've ever actually searched the log records (I'm happier looking at diagnostic output), there is probably a better solution there, such as just appending log output to a single text file which is occasionally cleaned. Logging is also the main instance where there are concurrent writes to the same collection, which I think would be problematic with parquet. Although potentially suitable for dataframes, it may also not make sense to put the order stack state and algo information into parquet; something like Redis or mongo would make more sense there.

After this stream of consciousness, I think it might be possible to switch to 100% parquet or something close to it. Certainly that would be simpler.
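For illustration, a minimal sketch of that backup crawl, assuming a hypothetical parquet file tree mirrored out as .csvs with the same layout:

```python
# Sketch only: walk a parquet directory tree and write each file out
# as a .csv under a mirror directory. Both roots are hypothetical.
from pathlib import Path
import pandas as pd

def backup_parquet_tree_to_csv(parquet_root: str, csv_root: str):
    for parquet_file in Path(parquet_root).rglob("*.parquet"):
        relative = parquet_file.relative_to(parquet_root)
        csv_file = Path(csv_root) / relative.with_suffix(".csv")
        csv_file.parent.mkdir(parents=True, exist_ok=True)
        pd.read_parquet(parquet_file).to_csv(csv_file)
```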
-
Having gone through the exercise of replacing Mongo and Arctic with SQL Server, I can certainly say there is definitely a benefit on the NoSQL side of the argument: not having to write adapter classes for each table, figure out what is being stored, and deal with idiosyncrasies in the code (optimisedPosition in Mongo not being the Optimised Position used in the code :) ). Now that I've done that and it works, I can focus on optimising it, but I suspect I'm going to end up with less performance when creating and writing Multiple and Adjusted Prices. That's a trade-off I'm happy with in my case, to be able to query data with operators like ">" !!
-
Rob,
-
Nearly a year and a half later, but maybe finally some movement...
-
Ugh. I just set up pysystemtrade on Linux Mint 21, and let me tell you, it was a nightmare. Linux Mint 21 comes with python 3.10, but pysystemtrade requires python 3.8, because of its pandas version, because of Arctic. That means getting a second python version installed without messing up the system's python, which was not exactly trivial. (I ended up building python 3.8.16 from source... there might be simpler ways, but I don't think any of them are going to be THAT simple, and there is a real possibility of messing up the system's python instance if you're not careful.) I think this is becoming a more pressing issue. As time goes on, it will become more and more difficult to get pysystemtrade running on a new OS.
-
I'm going to start working on the simplification part of this now (https://github.com/robcarver17/pysystemtrade/issues/754) first, to make things easier for the move to parquet, which now looks inevitable. (With the caveat that I need to test the performance of parquet on large tables with concurrent reads, as sketched below; concurrent writes are less of an issue if we assume that logging is dropped from the database and goes just into text files.)
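A rough sketch of the kind of concurrent-read test meant here; the file path and reader count are arbitrary assumptions:

```python
# Sketch only: hammer one large parquet file with concurrent reads
# and time them. Path and worker count are made up for illustration.
import time
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

PARQUET_FILE = "/data/parquet/adjusted_prices/GOLD.parquet"  # hypothetical

def timed_read(_):
    start = time.perf_counter()
    pd.read_parquet(PARQUET_FILE)
    return time.perf_counter() - start

with ThreadPoolExecutor(max_workers=8) as pool:
    timings = list(pool.map(timed_read, range(8)))
print(f"max read time across 8 concurrent readers: {max(timings):.3f}s")
```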
-
I strongly disagree with this. You just need the right tools and the right process. I've created an installation guide, see #1065. Using the steps there, it becomes trivial. I have done it successfully on all sorts of flavours of Linux, and on macOS, including ARM silicon. As for parquet, I'm neutral. But I'd like to be sure we're doing it for the right reasons. And I'd hope the task was "to add support for parquet" rather than "move to parquet".
-
I'm hearing lots of good things about DuckDB. Plays nicely with CSV, Parquet, Pandas. Kind of an SQLite for big data. Crazy fast.
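For illustration, assuming a recent duckdb package and a hypothetical parquet file, the whole query-parquet-into-pandas trip is a couple of lines:

```python
# Sketch only: DuckDB can query a parquet file in place and hand back
# a pandas DataFrame. File path and column name are hypothetical.
import duckdb

df = duckdb.sql(
    "SELECT * FROM '/data/parquet/GOLD.parquet' WHERE price > 1800"
).df()
```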
-
So the folks from MAN have published the ArcticDB source now under the BSL 1.1 license. Leaving aside whether their current Additional Use Right breaches the 2nd covenant under the BSL license, it does look firmly pay-to-play for any production use. It becomes Apache 2.0 in 24 months' time... As @bug-or-feature notes, DuckDB looks pretty interesting for this use case. Another option that might be interesting to consider would be Ibis. This project is looking to provide a standardized dataframe-like API over pluggable backend databases (including in-memory stores like DuckDB and Polars, big analytic stores like Druid and ClickHouse, simple SQL DBs such as SQLite and MySQL, and some other interesting stores such as HeavyDB, which is a GPU-accelerated DB). So it potentially provides a unified API for system metadata, instrument history data, and maybe even logging.
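A sketch of the unified-API idea, assuming a recent ibis with the DuckDB backend installed; the parquet path and column names are hypothetical:

```python
# Sketch only: the same ibis expression code would run against any
# supported backend; here DuckDB reads a (hypothetical) parquet file.
import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB behind the ibis API
prices = con.read_parquet("/data/parquet/GOLD.parquet")
recent = prices.filter(prices.price > 1800).order_by("date").to_pandas()
```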
-
Arctic 1.82.0 just released, with support for pandas<2, numpy<2
-
I met James Monroe, ex AHL CTO, who came to hear me speak in London recently. He's now in charge of releasing Arctic as a commercial product... He offered me a free production license, but I guess that wouldn't extend to you guys.
-
"I think this is becoming a more pressing issue. As time goes on, it will become more and more difficult to get pysystemtrade running on a new OS." Well of course my laptop died, and a new battery did not revive it, apparently a known problem with the USB-C charging port which decays over time. So I'm currently in this position, and it looks like I will be trying out the pyenv solution to see if it works.... if not then expect a very quick decision on this and some fairly frantic coding. |
-
Hi All 👋, This is James @ ArcticDB. We're happy to work with you all on making ArcticDB a possible backend if that's of interest, including things like MongoDB support. As @bug-or-feature says, we're chatting next week. We super appreciate that you've all been users of the original Arctic.
-
As per the roadmap, once I have a version working with parquet and up-to-date libraries, I'm obviously very happy if someone can get some kind of arctic working again with the python/pandas/etc versions that everything has been brought up to date on. There is no reason why there can't be support for both solutions going forward. At least a lot of what I have done in the last few days has made it easier than before to plug and play different database/storage solutions.
-
As useful as Arctic is, I think that in the modern days of Parquet and Arrow its usefulness is limited, especially in the simple case of data co-located on the same compute node. Does anyone have thoughts on this? I think we could roll our own columnar data store without too much hassle: some object-oriented timeseries structure around a simple DataFrame.to_parquet() would be very performant, e.g. similar to this project https://github.com/ranaroussi/pystore The columnar store would be so simple it could be a sub-module to this project (see the sketch below).
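A hedged sketch of what such a sub-module might look like: a tiny object-oriented store keeping one parquet file per key, in the spirit of pystore. The directory layout and method names are invented for illustration:

```python
# Sketch only: a minimal columnar timeseries store over parquet files.
# One file per key under a root directory; all names are hypothetical.
from pathlib import Path
import pandas as pd

class ParquetTimeseriesStore:
    def __init__(self, root: str):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.root / f"{key}.parquet"

    def write(self, key: str, df: pd.DataFrame):
        df.to_parquet(self._path(key))

    def read(self, key: str) -> pd.DataFrame:
        return pd.read_parquet(self._path(key))

    def append(self, key: str, new_rows: pd.DataFrame):
        # Read-modify-write: fine for daily data, not safe for concurrent writers
        path = self._path(key)
        if path.exists():
            combined = pd.concat([pd.read_parquet(path), new_rows])
            combined = combined[~combined.index.duplicated(keep="last")].sort_index()
        else:
            combined = new_rows
        combined.to_parquet(path)
```

Usage would be something like `store = ParquetTimeseriesStore("/data/parquet"); store.append("GOLD", todays_prices)`; the read-modify-write append is the simplicity/concurrency trade-off discussed earlier in this thread.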