[copy_from]: Arrow/Parquet Reader (#30958)
This PR implements an `ArrowReader` in the `mz_arrow_util` crate. It's
intended to be used with `COPY ... FROM ... (FORMAT PARQUET)`.

We don't currently support all Arrow types, but the supported set has been
sufficient for testing so far. Given how large the change already is, I
figured this was a good place to start.

### Motivation

Progress towards
MaterializeInc/database-issues#6575

### Checklist

- [x] This PR has adequate test coverage / QA involvement has been duly
considered. ([trigger-ci for additional test/nightly
runs](https://trigger-ci.dev.materialize.com/))
- [x] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [x] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [x] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] If this PR includes major [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note),
I have pinged the relevant PM to schedule a changelog post.

---------

Co-authored-by: Joseph Koshakow <[email protected]>
ParkMyCar and jkosh44 authored Jan 22, 2025
1 parent f3ee8dd commit 21b31a8
Showing 6 changed files with 999 additions and 4 deletions.
12 changes: 10 additions & 2 deletions Cargo.lock

5 changes: 5 additions & 0 deletions src/arrow-util/Cargo.toml
@@ -13,10 +13,15 @@ workspace = true
anyhow = "1.0.66"
arrow = { version = "53.3.0", default-features = false }
chrono = { version = "0.4.35", default-features = false, features = ["std"] }
dec = { version = "0.4.9", features = ["num-traits"] }
half = "2"
mz-repr = { path = "../repr" }
mz-ore = { path = "../ore" }
num-traits = "0.2"
ordered-float = { version = "4.2.0" }
serde = { version = "1.0.152" }
serde_json = "1.0.125"
uuid = "1.2.2"
workspace-hack = { version = "0.0.0", path = "../workspace-hack", optional = true }

[package.metadata.cargo-udeps.ignore]
22 changes: 22 additions & 0 deletions src/arrow-util/src/lib.rs
@@ -7,4 +7,26 @@
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0.

use std::sync::Arc;

use arrow::array::{make_array, ArrayRef};
use arrow::buffer::NullBuffer;

pub mod builder;
pub mod reader;

/// Merge the provided null buffer with the existing array's null buffer, if any.
pub fn mask_nulls(column: &ArrayRef, null_mask: Option<&NullBuffer>) -> ArrayRef {
    if null_mask.is_none() {
        Arc::clone(column)
    } else {
        let nulls = NullBuffer::union(null_mask, column.nulls());
        let data = column
            .to_data()
            .into_builder()
            .nulls(nulls)
            .build()
            .expect("changed only null mask");
        make_array(data)
    }
}
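
The merge that `mask_nulls` performs can be illustrated with a dependency-free sketch. This is not the crate's code: it models validity masks as `Vec<bool>` (`true` meaning non-null) rather than Arrow's bit-packed `NullBuffer`, and the names `Validity`, `union_validity`, and `mask_column` are hypothetical, introduced only for illustration. The assumption it encodes is that a slot stays valid only when both masks mark it valid, and that a missing mask leaves the column untouched.

```rust
/// Hypothetical stand-in for a validity mask: `true` means the slot holds a
/// value, `false` means the slot is null.
type Validity = Vec<bool>;

/// Combine two optional validity masks. A slot is valid in the result only if
/// it is valid in every mask that is present; if neither mask is present,
/// there is nothing to combine and the result is `None`.
fn union_validity(lhs: Option<&Validity>, rhs: Option<&Validity>) -> Option<Validity> {
    match (lhs, rhs) {
        (Some(l), Some(r)) => {
            assert_eq!(l.len(), r.len(), "masks must cover the same number of rows");
            Some(l.iter().zip(r.iter()).map(|(a, b)| *a && *b).collect())
        }
        (Some(m), None) | (None, Some(m)) => Some(m.clone()),
        (None, None) => None,
    }
}

/// Apply an optional extra mask to a column of optional values, mirroring the
/// shape of `mask_nulls`: no mask means the column is returned unchanged.
fn mask_column(column: &[Option<i64>], null_mask: Option<&Validity>) -> Vec<Option<i64>> {
    let Some(mask) = null_mask else {
        return column.to_vec();
    };
    // The column's own nulls play the role of `column.nulls()`.
    let existing: Validity = column.iter().map(Option::is_some).collect();
    let merged = union_validity(Some(mask), Some(&existing)).expect("both masks present");
    column
        .iter()
        .zip(merged)
        .map(|(value, valid)| if valid { *value } else { None })
        .collect()
}
```

For example, masking the column `[Some(1), None, Some(3)]` with `[true, true, false]` keeps the first value, leaves the second slot null because the column already marked it null, and nulls out the third because the extra mask does.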