Create h2o data files in a scalable manner #2
Comments
Can you give the timings for:
@ghuls -
@ghuls - any chance you can send me your sed script to create these data files so I can try it out? I've never used sed before and I'm interested in learning more. Thanks!
It is not a sed script, but an awk script. I didn't have time to add support for NAs yet. Once it is there I can make a pull request.
groupby-datagen () {
    local N="${1:-1e7}";
    local K="${2:-1e2}";
    # NAs is accepted but not used yet.
    local NAs="${3:-0}";
    local sort="${4:-0}";
    frawk \
        -B cranelift \
        -v "N=${N}" \
        -v "K=${K}" \
        -v "NAs=${NAs}" \
        -v "sort=${sort}" \
        '
        function rand_int(x) {
            return 1 + int(rand() * x);
        }
        BEGIN {
            # Convert input variables to numbers (needed in case they are in scientific notation).
            N = int(N + 0);
            K = int(K + 0);
            NAs = int(NAs + 0);
            # Set fixed seed for random number generator.
            srand(123);
            # Print header.
            print "id1,id2,id3,id4,id5,id6,v1,v2,v3";
            if (sort != 1) {
                for (i=0; i<N; i++) {
                    printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100);
                }
            } else {
                # Pipe the rows through sort; -t, makes the -k keys refer to CSV columns.
                for (i=0; i<N; i++) {
                    printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100) | "LC_COLLATE=C sort -t, --parallel=2 -k 1,1 -k2,2 -k3,3 -k4,4n -k 5,5n -k 6,6n";
                }
            }
        }
        '
}
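A note on the sort keys used for the sorted output: sort splits fields on blanks unless a separator is passed with -t, so for CSV rows the -k4,4n style keys only refer to the CSV columns when -t, is given. The small check below illustrates this. The helper names (sample_rows, sort_without_sep, sort_with_sep) and the three sample rows are made up for illustration and are not part of the script.

```shell
# Hypothetical sample: three rows that differ only in the 4th column.
sample_rows () {
    printf '%s\n' \
        'id001,id001,id0000000001,10' \
        'id001,id001,id0000000001,9' \
        'id001,id001,id0000000001,2'
}

# No -t: the rows contain no blanks, so each whole line is field 1
# and the rows are compared as plain strings.
sort_without_sep () {
    LC_ALL=C sort -k1,1 -k2,2 -k3,3 -k4,4n
}

# With -t, the keys refer to the comma-separated columns,
# so column 4 is compared numerically.
sort_with_sep () {
    LC_ALL=C sort -t, -k1,1 -k2,2 -k3,3 -k4,4n
}

# Show only the 4th column of each sorted result.
sample_rows | sort_without_sep | cut -d, -f4
sample_rows | sort_with_sep | cut -d, -f4
```

The first pipeline orders the fourth column as 10, 2, 9 (string order), the second as 2, 9, 10 (numeric order).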
The h2o data creation R scripts work for small datasets, but aren't scalable. Here are the results on my machine:
@ghuls created a sed script that should be more portable and easier to parallelize than the R script. I'm assuming we can run the sed script several times in parallel, with each run writing its own CSV file, so this will scale. For example, have
groupby-datagen.R 1e9 1e1 5 1
output 50 files of 1 GB each instead of a single 50 GB file.
@ghuls - can you create a PR with your sed code? Any suggestions on how to parallelize it? I can probably figure out how to parallelize it with Dask if that's the best option. Thanks!
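One way the chunked output could work is sketched below. It is only a sketch under stated assumptions, none of which come from the thread: plain awk stands in for the generator, the function names datagen_chunk and datagen_parallel and the part_NNN.csv file naming are hypothetical, and each chunk gets its own seed (123 plus the chunk index) so that the chunks do not repeat each other. NAs and sorting are left out.

```shell
# Sketch only. Assumptions not taken from the thread: plain awk is used as the
# generator, the names datagen_chunk and datagen_parallel are hypothetical,
# and each chunk is seeded with 123 + its index so the chunks differ.

# Print a header plus `rows` random rows; N and K only size the id ranges.
datagen_chunk () {
    local rows="$1"
    local N="$2"
    local K="$3"
    local seed="$4"
    awk -v "rows=${rows}" -v "N=${N}" -v "K=${K}" -v "seed=${seed}" '
    function rand_int(x) {
        return 1 + int(rand() * x);
    }
    BEGIN {
        rows = int(rows + 0); N = int(N + 0); K = int(K + 0);
        srand(seed + 0);
        print "id1,id2,id3,id4,id5,id6,v1,v2,v3";
        for (i = 0; i < rows; i++) {
            printf("id%03d,id%03d,id%010d,%d,%d,%d,%d,%d,%.06f\n", rand_int(K), rand_int(K), rand_int(N/K), rand_int(K), rand_int(K), rand_int(N/K), rand_int(5), rand_int(15), rand() * 100);
        }
    }
    '
}

# Split N rows over `chunks` background jobs, one CSV file per job.
datagen_parallel () {
    local N="${1:-1e7}"
    local K="${2:-1e2}"
    local chunks="${3:-4}"
    local outdir="${4:-.}"
    local total i n
    # Expand scientific notation (e.g. 1e7) to a plain integer for the shell.
    total="$(awk -v "N=${N}" 'BEGIN { printf "%d", N + 0 }')"
    mkdir -p "${outdir}"
    i=0
    while [ "${i}" -lt "${chunks}" ]; do
        # Spread the remainder over the first (total % chunks) chunks.
        n=$(( total / chunks + (i < total % chunks ? 1 : 0) ))
        datagen_chunk "${n}" "${total}" "${K}" "$(( 123 + i ))" \
            > "${outdir}/part_$(printf '%03d' "${i}").csv" &
        i=$(( i + 1 ))
    done
    wait  # Block until every background job has finished.
}
```

With these hypothetical names, datagen_parallel 1e9 1e1 50 out would write 50 files into out, each with its own header line. All jobs start at once, so the chunk count should stay close to the number of available cores. Because every chunk uses a different seed, the rows will not match what a single run with srand(123) produces.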