
Implement Bulk Import, Regenerate core for 2024-10 API #79

Merged

Conversation


@austin-denoble austin-denoble commented Oct 10, 2024

Problem

We are releasing a new version of the API this month: 2024-10.

There are 3 primary new features that are included in this release:

  • Import
  • Inference
    • Embed
    • Rerank

This PR implements the operations to support import. Sorry about the size, but you can basically ignore all of the generated code under internal/gen unless you're curious about the new structure of the generated core files. Follow the codegen/build-clients.sh script for those details.

Solution

Since the import operations are technically part of the data plane but are only supported via REST, they are represented in the OpenAPI spec rather than our protos file. Because of this, we need to change the Client and IndexConnection structs to support these operations: traditionally, the code IndexConnection wraps targeted gRPC-only db data operations. We now need to generate REST code for the data plane as well so we can interact with imports.

  • Update the codegen/build-clients.sh script to handle building new modules for both internal/gen/db_data/grpc and internal/gen/db_data/rest.
  • Update Client struct and move NewClientBaseParams into a field that can be shared more easily when constructing the IndexConnection.
    • Add buildDataClientBaseOptions to handle constructing the necessary rest client options for the underlying dbDataClient.
    • Add an ensureHostHasHttps helper, as we need to make sure the scheme is present for the index Host that's passed, which was not necessary for gRPC.
    • Update the Index method to call buildDataClientBaseOptions and pass the new client into newIndexConnection.
  • Update IndexConnection to support both REST and gRPC interfaces under the hood (restClient, grpcClient).
    • Update newIndexConnection to support attaching the new restClient to the IndexConnection struct.
  • Update IndexConnection to support all import operations: StartImport, ListImports, DescribeImport, CancelImport.
  • Add an end-to-end integration test validating the import flow against serverless indexes.
  • Some nitpicky code cleanup, renaming of things around the new rest vs. grpc paradigm, etc.
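The ensureHostHasHttps helper described above can be sketched as follows. This is a minimal illustration of the idea, not the PR's exact implementation: gRPC targets are passed as bare host:port strings, but a REST client needs a full URL, so the index Host must carry a scheme.

```go
package main

import (
	"fmt"
	"strings"
)

// ensureHostHasHttps prepends "https://" when the host has no scheme.
// Hosts that already carry a scheme are returned unchanged.
func ensureHostHasHttps(host string) string {
	if strings.HasPrefix(host, "http://") || strings.HasPrefix(host, "https://") {
		return host
	}
	return "https://" + host
}

func main() {
	fmt.Println(ensureHostHasHttps("my-index-abc123.svc.pinecone.io"))
	fmt.Println(ensureHostHasHttps("https://already-prefixed.example.com"))
}
```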

Type of Change

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [X] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update
  • [ ] Infrastructure change (CI configs, etc)
  • [ ] Non-code change (docs, etc)
  • [ ] None of the above: (explain here)

Test Plan

`just test` - make sure CI passes

To see examples of how to use the new methods, check the doc comments.
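For orientation, the lifecycle the four import operations cover can be sketched with a toy in-memory stand-in. The method names follow the PR (StartImport, ListImports, DescribeImport, CancelImport), but the signatures and status strings here are illustrative assumptions, not the SDK's exact API:

```go
package main

import (
	"errors"
	"fmt"
)

// toyImportStore mimics the import lifecycle: start an import, list and
// describe known imports, and cancel one in flight.
type toyImportStore struct {
	next    int
	imports map[string]string // id -> status
}

func (s *toyImportStore) StartImport(uri string) (string, error) {
	if uri == "" {
		return "", errors.New("uri must not be empty")
	}
	s.next++
	id := fmt.Sprintf("import-%d", s.next)
	s.imports[id] = "InProgress"
	return id, nil
}

func (s *toyImportStore) ListImports() []string {
	ids := make([]string, 0, len(s.imports))
	for id := range s.imports {
		ids = append(ids, id)
	}
	return ids
}

func (s *toyImportStore) DescribeImport(id string) (string, error) {
	status, ok := s.imports[id]
	if !ok {
		return "", errors.New("unknown import id")
	}
	return status, nil
}

func (s *toyImportStore) CancelImport(id string) error {
	if _, ok := s.imports[id]; !ok {
		return errors.New("unknown import id")
	}
	s.imports[id] = "Cancelled"
	return nil
}

func main() {
	store := &toyImportStore{imports: map[string]string{}}
	id, _ := store.StartImport("s3://my-bucket/data/")
	status, _ := store.DescribeImport(id)
	fmt.Println(id, status)
	_ = store.CancelImport(id)
	status, _ = store.DescribeImport(id)
	fmt.Println(id, status)
}
```

The real methods live on IndexConnection and talk to the REST data plane; see the doc comments in the PR for the actual signatures.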


…ts.sh script to handle generating db_data from both oas and proto, update imports in client code
@austin-denoble austin-denoble changed the base branch from main to release-candidate/2024-10 October 10, 2024 17:24
@austin-denoble austin-denoble changed the title Regenerate code for 2024-10, Implement Bulk Import Implement Bulk Import, Regenerate core for 2024-10 API Oct 15, 2024
@austin-denoble austin-denoble marked this pull request as ready for review October 15, 2024 06:28
Contributor

@jhamon jhamon left a comment

Great work!

// - restClient: Optional underlying *http.Client object used to communicate with the Pinecone API,
// provided through NewClientParams.RestClient or NewClientBaseParams.RestClient. If not provided,
// a default client is created for you.
// - sourceTag: An optional string used to help Pinecone attribute API activity, provided through NewClientParams.SourceTag
// or NewClientBaseParams.SourceTag.
// - baseParams: A NewClientBaseParams object that holds the configuration for the Pinecone client.
Contributor

I think you might need to update the example code in this docstring to be clientParams := pinecone.NewClientBaseParams{} obj, no?

Contributor Author

You don't need to provide pinecone.NewClientBaseParams{}, you should be able to instantiate and interact with the client in the same way as before using pinecone.NewClientParams{}. I've just added a private field baseParams to bundle that up and reuse it when creating an index connection. There should be no visible impact to the user.
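The refactor being described, bundling shared configuration into a private field so it can be reused when constructing an index connection, can be illustrated with a stripped-down sketch. The type and field names here are simplified assumptions, not the SDK's real definitions:

```go
package main

import "fmt"

// clientBaseParams bundles shared configuration (API key, source tag, etc.)
// so it can be reused when building an index connection.
type clientBaseParams struct {
	apiKey    string
	sourceTag string
}

// client keeps baseParams private, so the public constructor surface
// (NewClientParams in the real SDK) is unchanged for users.
type client struct {
	baseParams clientBaseParams
}

type indexConnection struct {
	host   string
	params clientBaseParams
}

func newClient(apiKey string) *client {
	return &client{baseParams: clientBaseParams{apiKey: apiKey}}
}

// Index reuses the bundled base params when wiring up the connection.
func (c *client) Index(host string) *indexConnection {
	return &indexConnection{host: host, params: c.baseParams}
}

func main() {
	c := newClient("example-api-key")
	idx := c.Index("my-index.svc.pinecone.io")
	fmt.Println(idx.host, idx.params.apiKey != "")
}
```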

Contributor

Ahh okay thank you!

"net/url"
"strings"

"github.com/pinecone-io/go-pinecone/internal/gen/data"
db_data_grpc "github.com/pinecone-io/go-pinecone/internal/gen/db_data/grpc"
Contributor

@aulorbe aulorbe Oct 15, 2024

Question: why are some things snake-cased (db_data_rest) while others are camel-cased (dbDataClient)? Should they be consistent throughout the code base?

Contributor Author

The snake case values refer to specific packages, in this case go-pinecone/internal/gen/db_data/grpc. Because there are shared types across both grpc and rest, I needed a way to differentiate them.

Keeping the package names and references as snake case felt like it made it more obvious when we were interacting with a package versus another variable.

Contributor

ah cool !

}

// StartImport imports data from a storage provider into an index. The uri parameter must start with the
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to
Contributor

Nit: I'd add e.g. "s3://" as an example of the start of the uri param, just b/c some ppl might not know what that means


// StartImport imports data from a storage provider into an index. The uri parameter must start with the
// schema of a supported storage provider. For buckets that are not publicly readable, you will also need to
// separately configure a storage integration and pass the integration id.
Contributor

Do we have instructions in our docs for how to build a storage integration? If we're telling ppl they need to do an extra step, it might be nice for us to provide them resources about how to do that in the same sentence

Contributor Author

// Parameters:
// - ctx: A context.Context object controls the request's lifetime,
// allowing for the request to be canceled or to timeout according to the context's deadline.
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
Contributor

scheme >> schema

Contributor Author

I think it's supposed to be scheme, right? https://en.wikipedia.org/wiki/List_of_URI_schemes

Contributor

oh idk, you just say schema above, so I thought it should be schema :)

// allowing for the request to be canceled or to timeout according to the context's deadline.
// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property.
// Pass nil if not required.
Contributor

Can we make it so they don't have to pass nil if auth is not provided? Seems like it'd be a nice UX helper

Contributor Author

I would need to add a new struct type that allows batching uri, integrationId, and errorMode. I could change it but I felt like since there are so few fields at the moment maybe it's more overhead to have to worry about creating the struct rather than just passing values. Do you feel strongly about this?
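The hypothetical batching struct discussed here might look like the sketch below. To be clear, this is not what the PR shipped (it kept plain parameters); it just shows why pointer fields would let callers omit integrationId and errorMode entirely:

```go
package main

import "fmt"

// StartImportRequest is a hypothetical batching struct. Pointer fields
// distinguish "not set" from the zero value, so callers can simply
// leave them out instead of passing nil positionally.
type StartImportRequest struct {
	URI           string
	IntegrationID *string // nil when the bucket is publicly readable
	ErrorMode     *string // nil defaults to "continue"
}

// startImport is a stand-in that shows the defaulting a struct-based
// API would apply; it is not the SDK's real method.
func startImport(req StartImportRequest) string {
	mode := "continue"
	if req.ErrorMode != nil {
		mode = *req.ErrorMode
	}
	return fmt.Sprintf("uri=%s mode=%s", req.URI, mode)
}

func main() {
	fmt.Println(startImport(StartImportRequest{URI: "s3://my-bucket/data/"}))
	abort := "abort"
	fmt.Println(startImport(StartImportRequest{URI: "s3://my-bucket/data/", ErrorMode: &abort}))
}
```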

Contributor

no def not, you do you!

// - uri: The URI of the data to import. The URI must start with the scheme of a supported storage provider.
// - integrationId: If your bucket requires authentication to access, you need to pass the id of your storage integration using this property.
// Pass nil if not required.
// - errorMode: If set to "continue", the import operation will continue even if some records fail to import.
Contributor

Does this default to continue in startImport? I think that's how the other clients work

Contributor Author

Yes, the full doc for errorMode looks like this:

errorMode: If set to "continue", the import operation will continue even if some records fail to import.
Pass "abort" to stop the import operation if any records fail. Will default to "continue" if nil is passed.

// - ctx: A context.Context object controls the request's lifetime,
// allowing for the request to be canceled or to timeout according to the context's deadline.
// - id: The id of the import operation. This is returned when you call StartImport, or can be retrieved
// through the ListImports method.
Contributor

Can you link to other funcs in the docstrings of Go funcs? Might be nice to link to ListImports here

Contributor Author

Good call, updated to link to the explicit methods. I think we could do more of this elsewhere, I think we've just been using proper names in the doc comments until now. Added a chore ticket to go through these: https://app.asana.com/0/1203260648987893/1208561704179017/f
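For reference, since Go 1.19 doc comments support documentation links: an exported identifier in square brackets is rendered as a link by pkg.go.dev and gopls. A sketch with toy function bodies (the bodies are placeholders, only the comment syntax is the point):

```go
package main

import "fmt"

// DescribeImport returns the status of an import operation.
//
// The id is returned by [StartImport], or can be retrieved through
// [ListImports]. The bracketed names become links in rendered docs.
func DescribeImport(id string) string { return "status of " + id }

// StartImport begins an import and returns its id.
func StartImport(uri string) string { return "import-1" }

// ListImports returns the ids of known imports.
func ListImports() []string { return []string{"import-1"} }

func main() {
	id := StartImport("s3://my-bucket/data/")
	fmt.Println(DescribeImport(id))
}
```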

Contributor

Sweet sounds great, thanks!

@@ -269,8 +269,34 @@ func (ts *IntegrationTests) TestUpdateVectorSparseValues() error {
actualSparseValues := vector.Vectors[ts.vectorIds[0]].SparseValues.Values

assert.ElementsMatch(ts.T(), expectedSparseValues.Values, actualSparseValues, "Sparse values do not match")
}

func (ts *IntegrationTests) TestImportFlow() {
Contributor

Might want to do some negative examples (like confirming errors are thrown) too

Contributor Author

I confirmed that we're returning an error when trying to call StartImport with an empty URI.
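The negative case being described, an empty URI is rejected before any network call, can be sketched with a plain-Go validation stand-in (the real suite uses testify assertions against the SDK method; this is only the shape of the check):

```go
package main

import (
	"errors"
	"fmt"
)

// startImport validates its input up front; an empty URI is rejected
// before any request would be made.
func startImport(uri string) (string, error) {
	if uri == "" {
		return "", errors.New("uri must not be empty")
	}
	return "import-1", nil
}

func main() {
	// Negative path: the error must be returned, not swallowed.
	_, err := startImport("")
	fmt.Println(err != nil)
	// Positive path for contrast.
	_, err = startImport("s3://my-bucket/data/")
	fmt.Println(err == nil)
}
```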

@@ -87,7 +87,9 @@ func (ts *IntegrationTests) TearDownSuite() {
_, err = WaitUntilIndexReady(ts, ctx)
require.NoError(ts.T(), err)
err = ts.client.DeleteIndex(ctx, ts.idxName)
require.NoError(ts.T(), err)
if err != nil {
Contributor

Do you need to retry here? I know sometimes w/configure index calls and the like, we have to retry in the TS client

Contributor Author

I could; I didn't add one for now, opting to just let things finish. I'll look at it.

Contributor Author

Added a small retry mechanism here for cleanup.

Contributor

@aulorbe aulorbe left a comment

Some Qs!

@austin-denoble austin-denoble merged commit b715d26 into release-candidate/2024-10 Oct 16, 2024
4 checks passed
@austin-denoble austin-denoble deleted the adenoble/implement-bulk-import branch October 16, 2024 19:33
austin-denoble added a commit that referenced this pull request Oct 23, 2024
- To see the specific tasks where the Asana app for GitHub is being
used, see below:
  - https://app.asana.com/0/0/1208325183834377
  - https://app.asana.com/0/0/1208541827330963