Create h2o data files in a scalable manner #2

Open
MrPowers opened this issue Jan 16, 2022 · 4 comments

Comments

@MrPowers
Owner

The h2o data creation R scripts work for small datasets, but aren't scalable. Here are the results on my machine:

Rscript groupby-datagen.R 1e7 1e2 0 0 # 7 seconds
Rscript groupby-datagen.R 1e8 1e1 5 1 # 3.5 minutes
Rscript groupby-datagen.R 1e9 1e1 5 1 # errors out, presumably due to a memory error

@ghuls created a sed script that should be more portable and easier to parallelize than the Rscript. I'm assuming we can run it in parallel to generate multiple CSV files at once, so it will scale. For example, groupby-datagen.R 1e9 1e1 5 1 could output fifty 1 GB files instead of a single 50 GB file.

@ghuls - can you create a PR with your sed code? Any suggestions on how to parallelize it? I can probably figure out how to parallelize it with Dask if that's the best option. Thanks!
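
For illustration, here is a minimal sketch of the "many small files instead of one big file" idea. It is not the script mentioned above: it assumes plain POSIX awk and bash, uses a cut-down two-column schema, and the names gen_chunk and generate_parallel, the part-N.csv naming, and the one-seed-per-chunk scheme are all hypothetical. Row counts must be plain integers here, since bash arithmetic does not parse 1e9.

```shell
# Hypothetical helper: write one chunk of rows to a CSV file.
# $1 = rows in this chunk, $2 = RNG seed, $3 = output path
gen_chunk() {
    awk -v rows="$1" -v seed="$2" 'BEGIN {
        # A distinct seed per chunk keeps chunks from repeating each other.
        srand(seed);
        print "id1,v1";
        for (i = 0; i < rows; i++) {
            printf("id%03d,%d\n", 1 + int(rand() * 100), 1 + int(rand() * 5));
        }
    }' > "$3"
}

# Hypothetical driver: split the total row count across background jobs.
# $1 = total rows, $2 = number of chunks, $3 = output directory
generate_parallel() {
    local total="$1" chunks="$2" outdir="$3"
    local base=$(( total / chunks ))
    local extra=$(( total % chunks ))
    local c rows
    mkdir -p "$outdir"
    for (( c = 1; c <= chunks; c++ )); do
        # The first "extra" chunks take one additional row each.
        rows=$base
        if (( c <= extra )); then rows=$(( base + 1 )); fi
        gen_chunk "$rows" "$c" "$outdir/part-$c.csv" &
    done
    # Block until every background generator has finished.
    wait
}
```

Something like generate_parallel 1000000 8 out/ would then write out/part-1.csv through out/part-8.csv. Every chunk carries its own header, which suits readers that take a directory of CSV files. This version starts all chunks at once, so a real run would probably want to cap the number of concurrent jobs.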

@ghuls
Contributor

ghuls commented Jan 16, 2022

Can you give the timings for: 1e8 1e1 0 0 (no NAs and no sorting)?

@MrPowers
Owner Author

@ghuls - Rscript groupby-datagen.R 1e8 1e1 0 0 ran in 2 minutes. Let me know if you need anything else!

@MrPowers
Owner Author

@ghuls - any chance you can send me your sed script to create these data files so I can try it out? I've never used sed before and I'm interested in learning more. Thanks!

@ghuls
Contributor

ghuls commented Jan 19, 2022

It is not a sed script, but an awk script.

I didn't have time to add support for NAs yet. Once it is there I can make a pull request.

groupby-datagen () {
    local N="${1:-1e7}";
    local K="${2:-1e2}";
    local NAs="${3:-0}";
    local sort="${4:-0}";

    frawk \
        -B cranelift \
        -v "N=${N}" \
        -v "K=${K}" \
        -v "NAs=${NAs}" \
        -v "sort=${sort}" \
        '
        function rand_int(x) {
            return 1 + int(rand() * x);
        }

        BEGIN {
           # Convert input variables to numbers (needed in case they are in scientific notation).
           N = int(N + 0);
           K = int(K + 0);
           NAs = int(NAs + 0);

           # Set fixed seed for random number generator.
           srand(123);

           # Print header.
           print "id1,id2,id3,id4,id5,id6,v1,v2,v3";

           if (sort != 1) {
               for (i=0; i<N; i++) {
                   printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100);
               }
           } else {
               for (i=0; i<N; i++) {
                   # -t, makes sort split fields on commas; without it each line is a single field.
                   printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100) | "LC_COLLATE=C sort -t, --parallel=2 -k 1,1 -k2,2 -k3,3 -k4,4n -k 5,5n -k 6,6n";
               }
           }
        }
        '
}
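
One possible shape for the missing NA support, as a rough sketch only. It assumes the NAs argument is a percentage (0-100) of values to blank out and that an empty CSV field stands in for NA; both are assumptions, not something confirmed in this thread. It uses plain POSIX awk rather than frawk and a cut-down two-column schema, and gen_with_nas is a hypothetical name.

```shell
# Hypothetical sketch: blank out each value with probability NAs percent.
# $1 = number of rows, $2 = percentage of values to blank (0-100)
gen_with_nas() {
    awk -v N="$1" -v NAs="$2" 'BEGIN {
        # Convert inputs to numbers (handles scientific notation such as 1e3).
        N = int(N + 0);
        NAs = int(NAs + 0);
        srand(123);
        print "id1,v1";
        for (i = 0; i < N; i++) {
            id = sprintf("id%03d", 1 + int(rand() * 100));
            v1 = 1 + int(rand() * 5);
            # rand() is in [0, 1), so NAs=0 never blanks and NAs=100 always does.
            if (rand() * 100 < NAs) id = "";
            if (rand() * 100 < NAs) v1 = "";
            printf("%s,%s\n", id, v1);
        }
    }'
}
```

For example, gen_with_nas 1000 5 would blank roughly 5% of the values in each column. The same check could be applied per field in the full nine-column printf.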
