-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Bulk Import, Regenerate core for 2024-10
API
#79
Implement Bulk Import, Regenerate core for 2024-10
API
#79
Conversation
…ts.sh script to handle generating db_data from both oas and proto, update imports in client code
…tation for the db_data client as a member on the IndexConnection struct, for users this happens under the hood, regenerate and update submodule
…l, describe, and list, add integration test covering calling each step in the flow
… conversions between generated types and exposed stuff
2024-10
, Implement Bulk Import2024-10
API
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
// - restClient: Optional underlying *http.Client object used to communicate with the Pinecone API, | ||
// provided through NewClientParams.RestClient or NewClientBaseParams.RestClient. If not provided, | ||
// a default client is created for you. | ||
// - sourceTag: An optional string used to help Pinecone attribute API activity, provided through NewClientParams.SourceTag | ||
// or NewClientBaseParams.SourceTag. | ||
// - baseParams: A NewClientBaseParams object that holds the configuration for the Pinecone client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might need to update the example code in this docstring to be clientParams := pinecone.NewClientBaseParams{}
obj, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to provide pinecone.NewClientBaseParams{}
, you should be able to instantiate and interact with the client in the same way as before using pinecone.NewClientParams{}
. I've just added a private field baseParams
to bundle that up and reuse it when creating an index connection. There should be no visible impact to the user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh okay thank you!
"net/url" | ||
"strings" | ||
|
||
"github.com/pinecone-io/go-pinecone/internal/gen/data" | ||
db_data_grpc "github.com/pinecone-io/go-pinecone/internal/gen/db_data/grpc" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why are some things snake-cased (db_data_rest
) while others are camel-cased (dbDataClient
)? Should they be consistent throughout the code base?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The snake case values are referring to specific packages, so in this case go-pinecone/internal/gen/db_data/grpc
, because there are shared types across both grpc and rest I needed a way to differentiate.
Keeping the package names and references as snake case felt like it made it more obvious when we were interacting with a package versus another variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah cool !
pinecone/index_connection.go
Outdated
} | ||
|
||
// StartImport imports data from a storage provider into an index. The uri parameter must start with the | ||
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I'd add e.g. "s3://"
as an example of the start of the uri param, just b/c some ppl might not know what that means
pinecone/index_connection.go
Outdated
|
||
// StartImport imports data from a storage provider into an index. The uri parameter must start with the | ||
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to | ||
// separately configure a storage integration and pass the integration id. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have instructions in our docs for how to build a storage integration? If we're telling ppl they need to do an extra step, it might be nice for us to provide them resources about how to do that in the same sentence
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Parameters: | ||
// - ctx: A context.Context object controls the request's lifetime, | ||
// allowing for the request to be canceled or to timeout according to the context's deadline. | ||
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scheme
>> schema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's supposed to be scheme, right? https://en.wikipedia.org/wiki/List_of_URI_schemes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh idk, you just say schema
above, so I thought it should be schema
:)
// allowing for the request to be canceled or to timeout according to the context's deadline. | ||
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider. | ||
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property. | ||
// Pass nil if not required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it so they don't have to pass nil
if auth is not provided? Seems like it'd be a nice UX helper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would need to add a new struct type that allows batching uri
, integrationId
, and errorMode
. I could change it but I felt like since there are so few fields at the moment maybe it's more overhead to have to worry about creating the struct rather than just passing values. Do you feel strongly about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no def not, you do you!
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider. | ||
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property. | ||
// Pass nil if not required. | ||
// - errorMode: If set to "continue", the import operation will continue even if some records fail to import. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this default to continue
in startImport
? I think that's how the other clients work
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the full doc for errorMode
looks like this:
errorMode: If set to "continue", the import operation will continue even if some records fail to import.
Pass "abort" to stop the import operation if any records fail. Will default to "continue" if nil is passed.
pinecone/index_connection.go
Outdated
// - ctx: A context.Context object controls the request's lifetime, | ||
// allowing for the request to be canceled or to timeout according to the context's deadline. | ||
// - id: The id of the import operation. This is returned when you call StartImport, or can be retrieved | ||
// through the ListImports method. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you link to other funcs in the docstrings of Go funcs? Might be nice to link to ListImports
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call, updated to link to the explicit methods. I think we could do more of this elsewhere, I think we've just been using proper names in the doc comments until now. Added a chore ticket to go through these: https://app.asana.com/0/1203260648987893/1208561704179017/f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sweet sounds great, thanks!
pinecone/index_connection_test.go
Outdated
@@ -269,8 +269,34 @@ func (ts *IntegrationTests) TestUpdateVectorSparseValues() error { | |||
actualSparseValues := vector.Vectors[ts.vectorIds[0]].SparseValues.Values | |||
|
|||
assert.ElementsMatch(ts.T(), expectedSparseValues.Values, actualSparseValues, "Sparse values do not match") | |||
} | |||
|
|||
func (ts *IntegrationTests) TestImportFlow() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to do some negative examples (like confirming errors are thrown) too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I confirmed that we're returning an error when trying to call StartImport
with an empty URI.
@@ -87,7 +87,9 @@ func (ts *IntegrationTests) TearDownSuite() { | |||
_, err = WaitUntilIndexReady(ts, ctx) | |||
require.NoError(ts.T(), err) | |||
err = ts.client.DeleteIndex(ctx, ts.idxName) | |||
require.NoError(ts.T(), err) | |||
if err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to retry here? I know sometimes w/configure index calls and the like, we have to retry in the TS client
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could, I didn't add one for now opting to just let things finish. I'll look at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a small retry mechanism here for cleanup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some Qs!
## Problem We are releasing a new version of the API this month: `2024-10`. There are 3 primary new features that are included in this release: - Import - Inference - Embed - Rerank This PR implements the operations to support import. Sorry about the size, but you can basically ignore all of the generated code under `internal/gen` unless you're curious about the new structure of the generated core files. Follow the `codgen/build-clients.sh` script for those details. ## Solution Since the import operations are technically part of the data plane but only supported via REST, they are represented in the OpenAPI spec and not our protos file. Because of this, we need to change a few things to support these operations `Client` and `IndexConnection` structs to support these operations because traditionally the code `IndexConnection` wraps was targeting gRPC-only db data operations. We now need to generate rest code for the data plane as well so we can interact with imports. - Update the `codegen/build-clients.sh` script to handle building new modules for both `internal/gen/db_data/grpc` and `internal/gen/db_data/rest`. - Update `Client` struct and move `NewClientBaseParams` into a field that can be shared more easily when constructing the `IndexConnection`. - Add `buildDataClientBaseOptions` to handle constructing the necessary rest client options for the underlying `dbDataClient`. - Add an `ensureHostHasHttps` helper as we need to make sure this is present for the index `Host` that's passed, which was not necessary for grpc. - Update `Index` method to handle calling `buildDataClientBaseOptions` and passes the new client into `newIndexConnection`. - Update `IndexConnection` to support both REST and gRPC interfaces under the hood (`restClient`, `grpcClient`). - Update `newIndexConnection` to support attaching the new `restClient` to the `IndexConnection` struct. - Update `IndexConnection` to support all import operations: `StartImport`, `ListImports`, `DescribeImport`, `CancelImport`. - Add end-to-end integration test for validating the import flow against serverless indexes. - Some nitpicky code cleanup, renaming of things around the new rest vs. grpc paradigm, etc. ## Type of Change - [ ] Bug fix (non-breaking change which fixes an issue) - [X] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected) - [ ] This change requires a documentation update - [ ] Infrastructure change (CI configs, etc) - [ ] Non-code change (docs, etc) - [ ] None of the above: (explain here) ## Test Plan `just test` - make sure CI passes To see examples of how to use the new methods, check the doc comments. --- - To see the specific tasks where the Asana app for GitHub is being used, see below: - https://app.asana.com/0/0/1208325183834377 - https://app.asana.com/0/0/1208541827330963
Problem
We are releasing a new version of the API this month:
2024-10
.There are 3 primary new features that are included in this release:
This PR implements the operations to support import. Sorry about the size, but you can basically ignore all of the generated code under
internal/gen
unless you're curious about the new structure of the generated core files. Follow thecodgen/build-clients.sh
script for those details.Solution
Since the import operations are technically part of the data plane but only supported via REST, they are represented in the OpenAPI spec and not our protos file. Because of this, we need to change a few things to support these operations
Client
andIndexConnection
structs to support these operations because traditionally the codeIndexConnection
wraps was targeting gRPC-only db data operations. We now need to generate rest code for the data plane as well so we can interact with imports.codegen/build-clients.sh
script to handle building new modules for bothinternal/gen/db_data/grpc
andinternal/gen/db_data/rest
.Client
struct and moveNewClientBaseParams
into a field that can be shared more easily when constructing theIndexConnection
.buildDataClientBaseOptions
to handle constructing the necessary rest client options for the underlyingdbDataClient
.ensureHostHasHttps
helper as we need to make sure this is present for the indexHost
that's passed, which was not necessary for grpc.Index
method to handle callingbuildDataClientBaseOptions
and passes the new client intonewIndexConnection
.IndexConnection
to support both REST and gRPC interfaces under the hood (restClient
,grpcClient
).newIndexConnection
to support attaching the newrestClient
to theIndexConnection
struct.IndexConnection
to support all import operations:StartImport
,ListImports
,DescribeImport
,CancelImport
.Type of Change
Test Plan
just test
- make sure CI passesTo see examples of how to use the new methods, check the doc comments.