-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
.Net: Batch OpenAI embeddings request #3295
.Net: Batch OpenAI embeddings request #3295
Conversation
Non-breaking at compile time but may break at runtime for users that are passing more data then the model allows (combined 16 inputs and 8191 tokens for |
@@ -154,27 +154,31 @@ await foreach (StreamingChoice choice in streamingChatCompletions.GetChoicesStre | |||
IList<string> data, | |||
CancellationToken cancellationToken = default) | |||
{ | |||
var result = new List<ReadOnlyMemory<float>>(data.Count); | |||
foreach (string text in data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current approach allows to pass any number of strings, given that they are processed one at a time. The change requires to limit the list to the max allowed by OpenAI and Azure. It feels like a breaking change. For instance, clients passing hundreds of strings will have to refactor their code taking into account the limits of the model selected.
@anthonypuppo Thanks for your contribution. Based on the commets above it sounds like we don't need this change. I'm going to close this PR but if I'm mistaken please fix the merge conflict and re-open the PR. |
@markwallace-microsoft This change is absolutely necessary. The current implementation executes an embedding request per input. That's not necessarily wrong, but extremely inefficient as both OpenAI and Azure OpenAI support batching. It's "breaking" due to model limits (i.e. text-embedding-ada-002 only supports up to a max of 16 inputs and ~8k tokens per batch). Existing user implementations could be users passing a large amount of data so they would have to refactor to handle the input chunking according to their user case. FWIW the Hugging Face connector already does this (no loop, send all inputs at once): Lines 105 to 112 in 0772d01
Apologies for letting this branch get so out of sync. I'll have it fixed up shortly. |
### Motivation and Context Continuation of #3295 The OpenAI embeddings endpoint and `text-embedding-ada-002` supports either a single item or an array of items. Rather than send text inputs one by one, the request should send them all at once for improved performance. Note that this will require callers to manage their own batching strategy (added an example to demonstrate). I view this as a worthwhile tradeoff for improved performance out of the box. Fixes #3294. ### Description - Text data is sent as an array to the embeddings endpoint instead of an individual request per text. - Aligns embedding request functionality with how the HuggingFace connector works. - Remove unnecessary `ToArray()` call on the response embeddings. ### Contribution Checklist <!-- Before submitting this PR, please make sure: --> - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [ ] I didn't break anyone 😄 (implementations passing more than 16 inputs will get an error from the OpenAI endpoint and need to batch appropriately) --------- Co-authored-by: Dmytro Struk <[email protected]>
### Motivation and Context Continuation of microsoft#3295 The OpenAI embeddings endpoint and `text-embedding-ada-002` supports either a single item or an array of items. Rather than send text inputs one by one, the request should send them all at once for improved performance. Note that this will require callers to manage their own batching strategy (added an example to demonstrate). I view this as a worthwhile tradeoff for improved performance out of the box. Fixes microsoft#3294. ### Description - Text data is sent as an array to the embeddings endpoint instead of an individual request per text. - Aligns embedding request functionality with how the HuggingFace connector works. - Remove unnecessary `ToArray()` call on the response embeddings. ### Contribution Checklist <!-- Before submitting this PR, please make sure: --> - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [ ] I didn't break anyone 😄 (implementations passing more than 16 inputs will get an error from the OpenAI endpoint and need to batch appropriately) --------- Co-authored-by: Dmytro Struk <[email protected]>
Motivation and Context
The OpenAI embeddings endpoint and
text-embedding-ada-002
supports either a single item or an array of items. Rather than send text inputs one by one, the request should send them all at once for improved performance.From MS docs:
Fixes #3294.
Description
Text data is sent as an array to the embeddings endpoint instead of an individual request per text. This also aligns embedding request functionality with how the HuggingFace connector works.
Contribution Checklist