
Implement Bulk Import, Regenerate core for 2024-10 API #79

Merged

Conversation


@austin-denoble austin-denoble commented Oct 10, 2024

Problem

We are releasing a new version of the API this month: 2024-10.

There are 3 primary new features that are included in this release:

  • Import
  • Inference
    • Embed
    • Rerank

This PR implements the operations to support import. Sorry about the size, but you can basically ignore all of the generated code under internal/gen unless you're curious about the new structure of the generated core files. Follow the codegen/build-clients.sh script for those details.

Solution

Since the import operations are technically part of the data plane but are only supported via REST, they are represented in the OpenAPI spec rather than our protos file. Because of this, we need to change the Client and IndexConnection structs to support these operations: traditionally, the code IndexConnection wraps targeted gRPC-only db data operations. We now need to generate REST code for the data plane as well so we can interact with imports.

  • Update the codegen/build-clients.sh script to handle building new modules for both internal/gen/db_data/grpc and internal/gen/db_data/rest.
  • Update Client struct and move NewClientBaseParams into a field that can be shared more easily when constructing the IndexConnection.
    • Add buildDataClientBaseOptions to handle constructing the necessary rest client options for the underlying dbDataClient.
    • Add an ensureHostHasHttps helper, as we need to make sure the scheme is present for the index Host that's passed, which was not necessary for gRPC.
    • Update the Index method to call buildDataClientBaseOptions and pass the new client into newIndexConnection.
  • Update IndexConnection to support both REST and gRPC interfaces under the hood (restClient, grpcClient).
    • Update newIndexConnection to support attaching the new restClient to the IndexConnection struct.
  • Update IndexConnection to support all import operations: StartImport, ListImports, DescribeImport, CancelImport.
  • Add an end-to-end integration test validating the import flow against serverless indexes.
  • Some nitpicky code cleanup, renaming of things around the new rest vs. grpc paradigm, etc.
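The ensureHostHasHttps helper described above can be sketched as follows. This is a minimal illustration of the idea, not the PR's exact implementation: gRPC targets are passed as bare host:port strings, but a REST client needs a full URL, so the index Host must carry a scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureHostHasHttps prepends "https://" when the host has no scheme.
// Hosts that already carry a scheme are returned unchanged.
func ensureHostHasHttps(host string) string {
	if strings.HasPrefix(host, "http://") || strings.HasPrefix(host, "https://") {
		return host
	}
	return "https://" + host
}

func main() {
	fmt.Println(ensureHostHasHttps("my-index-abc123.svc.pinecone.io"))
	fmt.Println(ensureHostHasHttps("https://already-prefixed.example.com"))
}
```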

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update
  • [ ] Infrastructure change (CI configs, etc)
  • [ ] Non-code change (docs, etc)
  • [ ] None of the above: (explain here)

Test Plan

`just test` - make sure CI passes

To see examples of how to use the new methods, check the doc comments.
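For orientation, the lifecycle the four import operations cover can be sketched with a toy in-memory stand-in. The method names follow the PR (StartImport, ListImports, DescribeImport, CancelImport), but the signatures and status strings here are illustrative assumptions, not the SDK's exact API:

```go
package main

import (
	"errors"
	"fmt"
)

// toyImportStore mimics the import lifecycle: start an import, list and
// describe known imports, and cancel one in flight.
type toyImportStore struct {
	next    int
	imports map[string]string // id -> status
}

func (s *toyImportStore) StartImport(uri string) (string, error) {
	if uri == "" {
		return "", errors.New("uri must not be empty")
	}
	s.next++
	id := fmt.Sprintf("import-%d", s.next)
	s.imports[id] = "InProgress"
	return id, nil
}

func (s *toyImportStore) ListImports() []string {
	ids := make([]string, 0, len(s.imports))
	for id := range s.imports {
		ids = append(ids, id)
	}
	return ids
}

func (s *toyImportStore) DescribeImport(id string) (string, error) {
	status, ok := s.imports[id]
	if !ok {
		return "", errors.New("unknown import id")
	}
	return status, nil
}

func (s *toyImportStore) CancelImport(id string) error {
	if _, ok := s.imports[id]; !ok {
		return errors.New("unknown import id")
	}
	s.imports[id] = "Cancelled"
	return nil
}

func main() {
	store := &toyImportStore{imports: map[string]string{}}
	id, _ := store.StartImport("s3://my-bucket/data/")
	status, _ := store.DescribeImport(id)
	fmt.Println(id, status)
	_ = store.CancelImport(id)
	status, _ = store.DescribeImport(id)
	fmt.Println(id, status)
}
```

The real methods live on IndexConnection and talk to the REST data plane; see the doc comments in the PR for the actual signatures.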


…ts.sh script to handle generating db_data from both oas and proto, update imports in client code
@austin-denoble austin-denoble changed the base branch from main to release-candidate/2024-10 October 10, 2024 17:24
@austin-denoble austin-denoble changed the title Regenerate code for 2024-10, Implement Bulk Import Implement Bulk Import, Regenerate core for 2024-10 API Oct 15, 2024
@austin-denoble austin-denoble marked this pull request as ready for review October 15, 2024 06:28
Contributor

@jhamon jhamon left a comment

Great work!

// - restClient: Optional underlying *http.Client object used to communicate with the Pinecone API,
// provided through NewClientParams.RestClient or NewClientBaseParams.RestClient. If not provided,
// a default client is created for you.
// - sourceTag: An optional string used to help Pinecone attribute API activity, provided through NewClientParams.SourceTag
// or NewClientBaseParams.SourceTag.
// - baseParams: A NewClientBaseParams object that holds the configuration for the Pinecone client.
Contributor

I think you might need to update the example code in this docstring to be clientParams := pinecone.NewClientBaseParams{} obj, no?

Contributor Author

You don't need to provide pinecone.NewClientBaseParams{}, you should be able to instantiate and interact with the client in the same way as before using pinecone.NewClientParams{}. I've just added a private field baseParams to bundle that up and reuse it when creating an index connection. There should be no visible impact to the user.
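The refactor being described, bundling shared configuration into a private field so it can be reused when constructing an index connection, can be illustrated with a stripped-down sketch. The type and field names here are simplified assumptions, not the SDK's real definitions:

```go
package main

import "fmt"

// clientBaseParams bundles shared configuration (API key, source tag, etc.)
// so it can be reused when building an index connection.
type clientBaseParams struct {
	apiKey    string
	sourceTag string
}

// client keeps baseParams private, so the public constructor surface
// (NewClientParams in the real SDK) is unchanged for users.
type client struct {
	baseParams clientBaseParams
}

type indexConnection struct {
	host   string
	params clientBaseParams
}

func newClient(apiKey string) *client {
	return &client{baseParams: clientBaseParams{apiKey: apiKey}}
}

// Index reuses the bundled base params when wiring up the connection.
func (c *client) Index(host string) *indexConnection {
	return &indexConnection{host: host, params: c.baseParams}
}

func main() {
	c := newClient("example-api-key")
	idx := c.Index("my-index.svc.pinecone.io")
	fmt.Println(idx.host, idx.params.apiKey != "")
}
```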

Contributor

Ahh okay thank you!

"net/url"
"strings"

"github.com/pinecone-io/go-pinecone/internal/gen/data"
db_data_grpc "github.com/pinecone-io/go-pinecone/internal/gen/db_data/grpc"
Contributor

@aulorbe aulorbe Oct 15, 2024

Question: why are some things snake-cased (db_data_rest) while others are camel-cased (dbDataClient)? Should they be consistent throughout the code base?

Contributor Author

The snake case values refer to specific packages, in this case go-pinecone/internal/gen/db_data/grpc. Because there are shared types across both grpc and rest, I needed a way to differentiate them.

Keeping the package names and references as snake case felt like it made it more obvious when we were interacting with a package versus another variable.

Contributor

ah cool !

}

// StartImport imports data from a storage provider into an index. The uri parameter must start with the
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to
Contributor

Nit: I'd add e.g. "s3://" as an example of the start of the uri param, just b/c some ppl might not know what that means


// StartImport imports data from a storage provider into an index. The uri parameter must start with the
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to
// separately configure a storage integration and pass the integration id.
Contributor

Do we have instructions in our docs for how to build a storage integration? If we're telling ppl they need to do an extra step, it might be nice for us to provide them resources about how to do that in the same sentence

Contributor Author

// Parameters:
// - ctx: A context.Context object controls the request's lifetime,
// allowing for the request to be canceled or to timeout according to the context's deadline.
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
Contributor

scheme >> schema

Contributor Author

I think it's supposed to be scheme, right? https://en.wikipedia.org/wiki/List_of_URI_schemes

Contributor

oh idk, you just say schema above, so I thought it should be schema :)

// allowing for the request to be canceled or to timeout according to the context's deadline.
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property.
// Pass nil if not required.
Contributor

Can we make it so they don't have to pass nil if auth is not provided? Seems like it'd be a nice UX helper

Contributor Author

I would need to add a new struct type that allows batching uri, integrationId, and errorMode. I could change it but I felt like since there are so few fields at the moment maybe it's more overhead to have to worry about creating the struct rather than just passing values. Do you feel strongly about this?
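The hypothetical batching struct discussed here might look like the sketch below. To be clear, this is not what the PR shipped (it kept plain parameters); it just shows why pointer fields would let callers omit integrationId and errorMode entirely:

```go
package main

import "fmt"

// StartImportRequest is a hypothetical batching struct. Pointer fields
// distinguish "not set" from the zero value, so callers can simply
// leave them out instead of passing nil positionally.
type StartImportRequest struct {
	URI           string
	IntegrationID *string // nil when the bucket is publicly readable
	ErrorMode     *string // nil defaults to "continue"
}

// startImport is a stand-in that shows the defaulting a struct-based
// API would apply; it is not the SDK's real method.
func startImport(req StartImportRequest) string {
	mode := "continue"
	if req.ErrorMode != nil {
		mode = *req.ErrorMode
	}
	return fmt.Sprintf("uri=%s mode=%s", req.URI, mode)
}

func main() {
	fmt.Println(startImport(StartImportRequest{URI: "s3://my-bucket/data/"}))
	abort := "abort"
	fmt.Println(startImport(StartImportRequest{URI: "s3://my-bucket/data/", ErrorMode: &abort}))
}
```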

Contributor

no def not, you do you!

// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property.
// Pass nil if not required.
// - errorMode: If set to "continue", the import operation will continue even if some records fail to import.
Contributor

Does this default to continue in startImport? I think that's how the other clients work

Contributor Author

Yes, the full doc for errorMode looks like this:

errorMode: If set to "continue", the import operation will continue even if some records fail to import.
Pass "abort" to stop the import operation if any records fail. Will default to "continue" if nil is passed.

// - ctx: A context.Context object controls the request's lifetime,
// allowing for the request to be canceled or to timeout according to the context's deadline.
// - id: The id of the import operation. This is returned when you call StartImport, or can be retrieved
// through the ListImports method.
Contributor

Can you link to other funcs in the docstrings of Go funcs? Might be nice to link to ListImports here

Contributor Author

Good call, updated to link to the explicit methods. I think we could do more of this elsewhere, I think we've just been using proper names in the doc comments until now. Added a chore ticket to go through these: https://app.asana.com/0/1203260648987893/1208561704179017/f
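For reference, since Go 1.19 doc comments support documentation links: an exported identifier in square brackets is rendered as a link by pkg.go.dev and gopls. A sketch with toy function bodies (the bodies are placeholders, only the comment syntax is the point):

```go
package main

import "fmt"

// DescribeImport returns the status of an import operation.
//
// The id is returned by [StartImport], or can be retrieved through
// [ListImports]. The bracketed names become links in rendered docs.
func DescribeImport(id string) string { return "status of " + id }

// StartImport begins an import and returns its id.
func StartImport(uri string) string { return "import-1" }

// ListImports returns the ids of known imports.
func ListImports() []string { return []string{"import-1"} }

func main() {
	id := StartImport("s3://my-bucket/data/")
	fmt.Println(DescribeImport(id))
}
```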

Contributor

Sweet sounds great, thanks!

@@ -269,8 +269,34 @@ func (ts *IntegrationTests) TestUpdateVectorSparseValues() error {
actualSparseValues := vector.Vectors[ts.vectorIds[0]].SparseValues.Values

assert.ElementsMatch(ts.T(), expectedSparseValues.Values, actualSparseValues, "Sparse values do not match")
}

func (ts *IntegrationTests) TestImportFlow() {
Contributor

Might want to do some negative examples (like confirming errors are thrown) too

Contributor Author

I confirmed that we're returning an error when trying to call StartImport with an empty URI.
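The negative case being described, an empty URI is rejected before any network call, can be sketched with a plain-Go validation stand-in (the real suite uses testify assertions against the SDK method; this is only the shape of the check):

```go
package main

import (
	"errors"
	"fmt"
)

// startImport validates its input up front; an empty URI is rejected
// before any request would be made.
func startImport(uri string) (string, error) {
	if uri == "" {
		return "", errors.New("uri must not be empty")
	}
	return "import-1", nil
}

func main() {
	// Negative path: the error must be returned, not swallowed.
	_, err := startImport("")
	fmt.Println(err != nil)
	// Positive path for contrast.
	_, err = startImport("s3://my-bucket/data/")
	fmt.Println(err == nil)
}
```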

@@ -87,7 +87,9 @@ func (ts *IntegrationTests) TearDownSuite() {
_, err = WaitUntilIndexReady(ts, ctx)
require.NoError(ts.T(), err)
err = ts.client.DeleteIndex(ctx, ts.idxName)
require.NoError(ts.T(), err)
if err != nil {
Contributor

Do you need to retry here? I know sometimes w/configure index calls and the like, we have to retry in the TS client

Contributor Author

I could; I didn't add one for now, opting to just let things finish. I'll look at it.

Contributor Author

Added a small retry mechanism here for cleanup.

Contributor

@aulorbe aulorbe left a comment

Some Qs!

@austin-denoble austin-denoble merged commit b715d26 into release-candidate/2024-10 Oct 16, 2024
4 checks passed
@austin-denoble austin-denoble deleted the adenoble/implement-bulk-import branch October 16, 2024 19:33
austin-denoble added a commit that referenced this pull request Oct 23, 2024
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1208325183834377
  - https://app.asana.com/0/0/1208541827330963