vegan::simper() with groups > 2 #484

hlydecker · 2021-02-07T22:26:52Z

Currently vegan::simper() allows the user to run the function with 2 or more groups.

While SIMPER can work with multiple groups, there are a couple of potential issues involved with groups > 2.

First, the computational cost of the calculations increases dramatically. I attempted several comparisons between three groups (3 groups x 700 samples/group x 450 species); in all cases this led to 100% CPU utilisation. I ran this on a very powerful MacBook Pro (2.3 GHz 8-core i9, 64 GB RAM) using the parallel options, and had to kill the process after ~20 minutes of maximum CPU load. Interestingly, RAM consumption was minimal.

Secondly, and perhaps more fundamentally, SIMPER is a comparison between two groups using Bray-Curtis dissimilarity. This dissimilarity metric is for making comparisons between two groups. While I suppose we can do multiple Bray-Curtis dissimilarity distance measures between each pair of groups, this seems to be beyond what I would expect from a standard SIMPER.

If it is still desirable to allow people to compare between > 2 groups, I propose adding in some communication to the user if the number of groups being compared is > 2. Something to warn the user that this comparison may be computationally expensive, and that it may be difficult to interpret the results (and a normal SIMPER is already hard to interpret!).

Otherwise, it might be a good idea to only allow 2 groups for the standard simper() and maybe add in a simper_n() for playing with groups >2.

P.S. I am not a mathematician, so if someone more familiar with the actual mechanics underlying SIMPER + Bray-Curtis can chime in with some more explanation, that would be great. I'm just used to these metrics only being calculated between 2 groups.
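
On the mechanics asked about here, a minimal sketch may help. It is written in plain Python for illustration and is not vegan's R implementation. It assumes the usual definition, in which Bray-Curtis dissimilarity is computed between two sampling units and the contribution of species i is |x_i - y_i| divided by the summed abundances of both units; SIMPER then averages these contributions over every cross-group pair of sampling units. The function names and the toy abundances are hypothetical.

```python
from itertools import product


def bray_curtis_contributions(x, y):
    """Per-species contributions |x_i - y_i| / sum(x + y) for two sampling units.

    The contributions sum to the Bray-Curtis dissimilarity of x and y.
    Raises ZeroDivisionError if both units are completely empty.
    """
    total = sum(x) + sum(y)
    return [abs(a - b) / total for a, b in zip(x, y)]


def simper_pair(group_a, group_b):
    """Average per-species contribution over all cross-group pairs of units."""
    n_species = len(group_a[0])
    sums = [0.0] * n_species
    n_pairs = 0
    for x, y in product(group_a, group_b):
        for i, contribution in enumerate(bray_curtis_contributions(x, y)):
            sums[i] += contribution
        n_pairs += 1
    return [s / n_pairs for s in sums]


# Made-up abundances: rows are sampling units, columns are species.
group_a = [[4, 0, 2], [3, 1, 2]]
group_b = [[0, 5, 2], [1, 4, 3]]
average = simper_pair(group_a, group_b)
```

Summing `average` gives the mean between-group Bray-Curtis dissimilarity, so each entry can be read as that species' share of the overall difference between the two groups.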

eduardszoecs · 2021-02-07T23:47:14Z

Thanks Henry for your suggestion.

> While I suppose we can do multiple Bray-Curtis dissimilarity distance measures between each pair of groups, this seems to be beyond what I would expect from a standard SIMPER.

In fact, this is what simper does; see the examples provided with the function (4 groups, yielding 6 pairwise comparisons). We could add a short sentence to the description of the groups parameter stating that, in the case of >2 levels, each pairwise comparison is computed.

If you want to compare only two groups, you can just supply a subset containing those two groups. This is just convenience functionality for users, who usually do pairwise comparisons of multiple groups.
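
The pairwise scheme described above can be sketched as follows. This is an illustrative Python sketch, not vegan code: the helper names and the group labels are hypothetical, and it only shows how k group levels expand into k(k-1)/2 comparisons and how a two-group subset would be selected.

```python
from itertools import combinations


def pairwise_comparisons(group_labels):
    """Return every unordered pair of distinct group levels, in sorted order."""
    levels = sorted(set(group_labels))
    return list(combinations(levels, 2))


def subset_two_groups(samples, group_labels, keep):
    """Keep only the samples whose group label is in `keep`."""
    kept = [(s, g) for s, g in zip(samples, group_labels) if g in keep]
    return [s for s, _ in kept], [g for _, g in kept]


# Hypothetical labels for six sampling units in four groups.
labels = ["A", "A", "B", "C", "C", "D"]
pairs = pairwise_comparisons(labels)  # 4 levels give 6 pairs

# Restrict to two groups before running a single comparison.
samples = list(range(len(labels)))
sub_samples, sub_labels = subset_two_groups(samples, labels, {"A", "C"})
```

With three levels the same function returns three pairs, which matches the three pairwise comparisons in the case discussed in this thread.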

jarioksa (Maintainer) · 2021-02-08T11:43:00Z

I can confirm that simper can be slow. Profiling shows that the internal function pfun takes about 99% of the running time. I could not spot any obvious way of speeding it up; it seems to be efficiently designed. There are simply a lot of calculations to do when all observations in one group are compared against all observations in another group. In your case this means 700 * 700 pairs of sampling units for each of the three pairwise comparisons of your groups, and if you run 999 permutations there will be 1468530000 comparisons across sampling units. For timing purposes I ran a small example without parallel processing, and on a 1.6 GHz laptop I could do 33600 comparisons per second; at that rate your example would take over 12 hours to complete. With a faster computer and parallel processing you can cut down this time, but it would still be slow. You have three pairwise comparisons for three groups, but that was not the crucial point: the group sizes were. If you want the simper results, you need to give it that time. However, I suggest you read the last paragraph of the Details section of the simper help page and then reconsider how much time you want to give your computer to run the analysis (although the computer can also work while you sleep).
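
The arithmetic in this comment can be reproduced with a few lines. This is a sketch in Python that only mirrors the numbers quoted above (group sizes, 999 permutations, 33600 comparisons per second); the function name is hypothetical and nothing here calls vegan.

```python
from itertools import combinations


def simper_comparisons(group_sizes, permutations):
    """Count cross-group pairs of sampling units, summed over all pairs of
    groups and multiplied by the number of permutations."""
    pairs_per_run = sum(a * b for a, b in combinations(group_sizes, 2))
    return pairs_per_run * permutations


total = simper_comparisons([700, 700, 700], permutations=999)

rate = 33600                    # comparisons per second, as measured above
hours = total / rate / 3600     # single-core estimate at that rate
```

Because the count grows with the product of the group sizes, halving each group to 350 units cuts the work to a quarter, whereas dropping one of the three groups only cuts it to a third.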

hlydecker (Author) · 2021-02-08T11:57:41Z

@jarioksa thanks for your comments; yeah I did the rough numbers and it is really a lot of calculations. I ended up just running the task on my high performance computing cluster.

Good to know it is all “working as intended”. At first I had thought perhaps the function was doing 700 x 700 x 700 with the three groups, or that maybe it just got stuck somewhere because it was trying to make an impossible number of comparisons. Pairwise comparisons for each pair of groups make a lot more sense.

hlydecker (Author) · 2021-02-08T21:47:21Z

In case you're interested: I allocated a node with 32 cores and 123 GB of memory to the task. It finished in a little over 2 hours (47 hours of CPU time).
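
As a rough check on these figures, the sketch below relates CPU time to wall-clock time. It is plain Python arithmetic with a hypothetical helper name, and it assumes a wall time of exactly 2.0 hours, whereas the run took "a little over" that, so the efficiency it gives is an upper bound.

```python
def parallel_efficiency(cpu_hours, cores, wall_hours):
    """Fraction of the allocated core-hours actually spent computing."""
    return cpu_hours / (cores * wall_hours)


cpu_hours = 47
cores = 32

# Shortest possible wall time if all 32 cores were busy the whole time.
ideal_wall_hours = cpu_hours / cores

# wall_hours=2.0 is an assumption standing in for "a little over 2 hours".
efficiency = parallel_efficiency(cpu_hours, cores, wall_hours=2.0)
```

Under that assumption the cores were busy for roughly three quarters of the allocation at most.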

jarioksa (Maintainer) · 2021-02-10T09:00:32Z

@hlydecker, which version of vegan did you use? I noticed that the GitHub version of simper (vegan_2.6-0) was completely redesigned and is much faster than the release (CRAN) version in vegan_2.5-x. In a couple of tests the GitHub version was about 10 times faster than the CRAN version. This is still slow, but gives some relief.

We haven't implemented parallel processing in this new function, though. That's the reason why it has not been ported to the release versions.

hlydecker (Author) · 2021-02-10T22:13:59Z

@jarioksa I'll check out the GitHub version next time I need to run this locally! As you guessed, I was using the CRAN release.

10x faster is a pretty massive performance increase; good job!

Personally I'll stick with using the CRAN build so that I can use parallel processing. I think any time I need to do a SIMPER with large sample sizes I'll just use my HPC cluster, and parallel processing gives huge gains.

Interestingly, there was some strange resource utilisation: on a local machine the function was CPU-bottlenecked and never ended up using much RAM, whereas on my HPC cluster CPU utilisation was poor and RAM consumption was excessive.
