Create h2o data files in a scalable manner #2

MrPowers · 2022-01-16T12:53:30Z

The h2o data creation R scripts work for small datasets, but aren't scalable. Here are the results on my machine:

Rscript groupby-datagen.R 1e7 1e2 0 0 # 7 seconds
Rscript groupby-datagen.R 1e8 1e1 5 1 # 3.5 minutes
Rscript groupby-datagen.R 1e9 1e1 5 1 # errors out, presumably due to a memory error

@ghuls created a sed script that should be more portable and easier to parallelize than the Rscript. I'm assuming we can run the sed script in parallel to generate multiple CSV files at once, so this will scale. For example, the equivalent of groupby-datagen.R 1e9 1e1 5 1 could output fifty 1GB files instead of a single 50GB file.

@ghuls - can you create a PR with your sed code? Any suggestions on how to parallelize it? I can probably figure out how to parallelize it with Dask if that's the best option. Thanks!
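To make the "many smaller files" idea concrete, here is a minimal sketch of the row split on its own. The helper name `plan_chunks` and its output format are assumptions for illustration, not part of any existing script. It only prints how many rows each chunk would get, spreading the remainder over the first chunks.

```shell
# Hypothetical helper: print "chunk_index row_count" for each of $2 chunks.
plan_chunks() {
    local total chunks base extra i
    # Accept scientific notation such as 1e9 by normalising through awk.
    total=$(awk -v n="$1" 'BEGIN { printf "%d", n + 0 }')
    chunks="$2"
    base=$(( total / chunks ))
    extra=$(( total % chunks ))
    i=0
    while [ "$i" -lt "$chunks" ]; do
        # The first `extra` chunks take one additional row each.
        echo "$i $(( base + (i < extra ? 1 : 0) ))"
        i=$(( i + 1 ))
    done
}
```

With something like `plan_chunks 1e9 50`, every chunk would be 2e7 rows, which is roughly the 1GB-per-file split described above.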


ghuls · 2022-01-16T13:17:03Z

Can you give the timings for: 1e8 1e1 0 0 (no NAs and no sorting)?

MrPowers · 2022-01-16T14:44:12Z

@ghuls - Rscript groupby-datagen.R 1e8 1e1 0 0 ran in 2 minutes. Let me know if you need anything else!

MrPowers · 2022-01-19T10:01:53Z

@ghuls - any chance you can send me your sed script to create these data files so I can try it out? I've never used sed before and I'm interested in learning more. Thanks!

ghuls · 2022-01-19T10:47:53Z

It is not a sed script, but an awk script.

I didn't have time to add support for NAs yet. Once it is there I can make a pull request.

groupby-datagen () {
    local N="${1:-1e7}";
    local K="${2:-1e2}";
    local NAs="${3:-0}";
    local sort="${4:-0}";

    frawk \
        -B cranelift \
        -v "N=${N}" \
        -v "K=${K}" \
        -v "NAs=${NAs}" \
        -v "sort=${sort}" \
        '
        function rand_int(x) {
            return 1 + int(rand() * x);
        }

        BEGIN {
           # Convert input variables to numbers (needed in case they are in scientific notation).
           N = int(N + 0);
           K = int(K + 0);
           NAs = int(NAs + 0);

           # Set fixed seed for random number generator.
           srand(123);

           # Print header.
           print "id1,id2,id3,id4,id5,id6,v1,v2,v3";

           if (sort != 1) {
               for (i=0; i<N; i++) {
                   printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100);
               }
           } else {
               for (i=0; i<N; i++) {
                   printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100) | "LC_COLLATE=C sort --parallel=2 -t, -k 1,1 -k2,2 -k3,3 -k4,4n -k 5,5n -k 6,6n";
               }
           }
        }
        '
}
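One way the script above might be parallelized is sketched below. This is an assumption-laden sketch, not the author's method: it uses plain `awk` instead of `frawk`, skips NAs and sorting, and the names `gen_chunk`, `gen_parallel` and `chunk_<idx>.csv` are made up for illustration. The main point is that a fixed `srand(123)` would make every chunk identical, so each chunk gets its own seed (here, base seed plus chunk index, which is also an assumption).

```shell
# Hypothetical helper: write one chunk of `rows` rows to $outdir/chunk_<idx>.csv.
gen_chunk() {
    local idx="$1" rows="$2" total="$3" k="$4" outdir="$5"
    awk -v rows="$rows" -v total="$total" -v K="$k" -v seed="$(( 123 + idx ))" '
        function rand_int(x) {
            return 1 + int(rand() * x);
        }
        BEGIN {
            rows = int(rows + 0); total = int(total + 0); K = int(K + 0);
            # Assumed seeding scheme: base seed plus chunk index, so chunks differ.
            srand(seed);
            # Each file gets its own header so it can be read independently.
            print "id1,id2,id3,id4,id5,id6,v1,v2,v3";
            for (i = 0; i < rows; i++) {
                # id3 and id6 use total/K so their value range matches a single-file run.
                printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.6f\n",
                       rand_int(K), rand_int(K), rand_int(total / K),
                       rand_int(K), rand_int(K), rand_int(total / K),
                       rand_int(5), rand_int(15), rand() * 100);
            }
        }
    ' > "$outdir/chunk_${idx}.csv"
}

# Hypothetical driver: split `total` rows evenly over `chunks` background jobs.
gen_parallel() {
    local total k chunks outdir rows idx
    # Normalise scientific notation such as 1e9 to an integer.
    total=$(awk -v n="$1" 'BEGIN { printf "%d", n + 0 }')
    k="$2"; chunks="$3"; outdir="$4"
    mkdir -p "$outdir"
    rows=$(( total / chunks ))   # assumes total is divisible by chunks
    idx=0
    while [ "$idx" -lt "$chunks" ]; do
        gen_chunk "$idx" "$rows" "$total" "$k" "$outdir" &
        idx=$(( idx + 1 ))
    done
    wait   # block until every chunk has been written
}
```

A call such as `gen_parallel 1e9 1e1 50 out` would start all 50 jobs at once, so on a machine with fewer cores it is probably worth limiting concurrency. The output is reproducible for a given chunk count, but it will not match the rows of a single-file run with seed 123.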
