Hi, we're currently using @dsnp/parquetjs to write parquet files in Node. But it's a fork of an old package and doesn't look well maintained.
So I came across this repo, which looks very active, but it's not clear to me whether we can do what we're doing now with parquet-wasm. Maybe you can help me understand.
What do we want to do?
We want to iterate over a huge PostgreSQL table with a cursor, so we get batches of rows that we iterate over and store in a parquet file.
So I was wondering if that's possible with parquet-wasm: handle streaming data and, at the end, save the file to disk.
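The cursor-based batching described above can be thought of as an async iterator that yields arrays of `BATCH_SIZE` rows. A minimal self-contained sketch, with no real PostgreSQL connection — `source` here is a stand-in for the database cursor, and the function name `batches` is illustrative, not part of any library:

```typescript
// Group a row stream into arrays of at most `batchSize` rows, the way a
// pg cursor hands back N rows per read. Works with sync or async sources.
async function* batches<T>(
  source: AsyncIterable<T> | Iterable<T>,
  batchSize: number,
): AsyncGenerator<T[]> {
  let batch: T[] = [];
  for await (const row of source) {
    batch.push(row);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // final partial batch
}
```

With a source of 10 rows and `batchSize` 4, this yields batches of 4, 4, and 2 rows — the last batch being the partial remainder.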
This is how we do it with @dsnp/parquetjs:
```typescript
const BATCH_SIZE = 4096
const SQL_QUERY = 'SELECT * FROM users'

async function writeParquet(): Promise<string> {
  return new Promise<string>((resolve) => {
    let url: string // This doesn't matter.
    let writer: ParquetWriter | undefined

    // Source batchQuery does a cursor pg iteration
    // and we receive N rows for each batch in the `onBatch` method
    OUR_POSTGREST_DB.batchQuery(SQL_QUERY, {
      batchSize: BATCH_SIZE,
      onBatch: async (batch) => {
        if (!writer) {
          const schema = this.buildParquetSchema(batch.fields)
          writer = await ParquetWriter.openFile(schema, '/path/to/file.parquet', {
            rowGroupSize: size > ROW_GROUP_SIZE ? size : ROW_GROUP_SIZE,
          })
        }
        for (const row of batch.rows) {
          // This does not write to parquet immediately, I think, but accumulates
          // as many rows as you define in `rowGroupSize`
          await writer.appendRow(row)
        }
        if (batch.lastBatch) {
          await writer.close()
          resolve(url)
        }
      },
    })
  })
}
```
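The `rowGroupSize` buffering that the comment above describes can be sketched independently of any parquet library. This is a hypothetical illustration of the accumulate-then-flush behavior, not the actual @dsnp/parquetjs or parquet-wasm implementation; `BufferedGroupWriter` and its `flushed` array (standing in for row groups written to disk) are made-up names:

```typescript
type Row = Record<string, unknown>;

// Rows accumulate in memory and are emitted as one unit ("row group")
// once `rowGroupSize` rows have been collected; `close()` flushes the
// trailing partial group.
class BufferedGroupWriter {
  private buffer: Row[] = [];
  public flushed: Row[][] = []; // stands in for row groups written to disk

  constructor(private rowGroupSize: number) {}

  appendRow(row: Row): void {
    this.buffer.push(row);
    if (this.buffer.length >= this.rowGroupSize) this.flush();
  }

  close(): void {
    this.flush();
  }

  private flush(): void {
    if (this.buffer.length > 0) {
      this.flushed.push(this.buffer);
      this.buffer = [];
    }
  }
}
```

Appending 7 rows with a group size of 3 and then closing produces three groups of 3, 3, and 1 rows, which is why `appendRow` alone does not guarantee anything has reached disk until the group fills or the writer is closed.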
Thanks for the help!
I haven't looked at that PR in a while. It looks like it needs a little work to be updated against the latest main branch, but aside from that it might work with a few changes. You can ask @H-Plus-Time if he's interested in working on that PR more.