Workaround for dynamic indexing on CPU distribution #3802

matteo-grella · 2020-09-18T07:22:15Z

Hey,

Has any of you ever had to use the CPU distribution while needing to index progressively? Even though it is designed specifically for static datasets, is there any possible workaround (obviously at the expense of performance)?

Thanks,

Matteo

shiyu22 · 2020-09-18T08:04:49Z

Hi,

The data is stored in segments, and the size of each segment is set by the index_file_size parameter you specified when creating the collection. When the data reaches this size, Milvus indexes it automatically. If the data has not reached index_file_size and you want to index it dynamically, you can call the create_index function manually.
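
To make the segment behaviour described above easier to picture, here is a toy Python model. It is only a sketch and not the Milvus client: the class ToySegmentedCollection and its attributes are hypothetical, and it uses a plain vector count as the threshold as a simplification, instead of the real index_file_size setting.

```python
class ToySegmentedCollection:
    """Toy model of segments and indexing. Hypothetical, not the Milvus API."""

    def __init__(self, index_file_size):
        # Assumption: the threshold is a number of vectors per segment.
        self.index_file_size = index_file_size
        self.segments = [[]]    # each segment is a list of vectors
        self.indexed = [False]  # indexed[i] tells whether segment i has an index

    def insert(self, vector):
        # Open a new segment once the current one is full.
        if len(self.segments[-1]) >= self.index_file_size:
            self.segments.append([])
            self.indexed.append(False)
        self.segments[-1].append(vector)
        # A segment that reaches the threshold is indexed automatically.
        if len(self.segments[-1]) >= self.index_file_size:
            self.indexed[-1] = True

    def create_index(self):
        # Manual call: index every non-empty segment, including a partial one.
        for i, segment in enumerate(self.segments):
            if segment:
                self.indexed[i] = True


collection = ToySegmentedCollection(index_file_size=3)
for i in range(4):
    collection.insert([float(i)])

print(collection.indexed)  # first segment is full and indexed, second is not
collection.create_index()
print(collection.indexed)  # the partial segment is now indexed as well
```

In this model the fourth vector sits in a partially filled segment without an index until create_index is called by hand.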

matteo-grella · 2020-09-18T08:36:30Z

Thank you @shiyu22 for such a quick response!

To make sure that I understood you correctly before I try it, let me describe my flow of operations at a high level:

For each vector Vi of a dynamically populated V-queue:

1. Search() for vectors similar to Vi (already normalized), using the inner product metric;
2. If the most similar vector found has a score >= a certain threshold, go to point 3, otherwise go to point 4;
3. Do nothing (duplicate information);
4. Insert() the vector Vi and re-index by calling CreateIndex().

Is it right? If so, how does the CreateIndex function work? Does it add to the index only the current vector of each iteration (Vi), or does it re-index all the previously inserted vectors?

I hope I've been clear enough :)

shiyu22 Sep 18, 2020

The process you describe is clear 👍

But the fourth step is incorrect, because create_index() is usually called when there is a large amount of unindexed data, and does not need to be called after inserting a single vector.

One thing worth checking: did you create the index right after you created the collection? If you create the index at the beginning, Milvus will automatically index the data when it reaches index_file_size.

1. Prepare the parameters needed, such as index_file_size, and create the collection.
2. Create the index you need for the collection (see the index introduction for reference).
3. Follow steps 1 and 2 you mentioned earlier to insert the data.

Since we have already indexed the collection in the second step, the inserted data will be indexed automatically when the index_file_size is reached. When you have inserted a large amount of data, there will be multiple segments to store the vectors, which means there will be multiple index files.
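
The search-then-insert loop discussed above, without a per-vector create_index() call, could look roughly like the Python sketch below. Everything in it is an assumption made for illustration: InMemoryCollection is a hypothetical brute-force stand-in and not the Milvus client, and normalize and deduplicating_insert are made-up helper names. With a real collection, the search and insert calls would go to the server, and indexing would be left to the index created up front.

```python
import math


class InMemoryCollection:
    """Hypothetical stand-in for a vector collection. Not the Milvus API."""

    def __init__(self):
        self.vectors = []

    def search(self, query, top_k=1):
        # Brute-force inner product against every stored vector.
        scored = [
            (sum(a * b for a, b in zip(query, vector)), idx)
            for idx, vector in enumerate(self.vectors)
        ]
        scored.sort(reverse=True)
        return scored[:top_k]  # list of (score, id), best first

    def insert(self, vector):
        self.vectors.append(vector)
        return len(self.vectors) - 1  # id of the new vector


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def deduplicating_insert(collection, queue, threshold):
    """Insert each queued vector unless a similar enough one already exists."""
    inserted_ids = []
    for raw in queue:
        vi = normalize(raw)
        hits = collection.search(vi, top_k=1)
        if hits and hits[0][0] >= threshold:
            continue  # duplicate information: do nothing
        # No create_index() here: indexing is not triggered per vector.
        inserted_ids.append(collection.insert(vi))
    return inserted_ids


collection = InMemoryCollection()
queue = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(deduplicating_insert(collection, queue, threshold=0.9))
```

Here [2.0, 0.0] normalizes to the same direction as [1.0, 0.0], so it is skipped as a duplicate, while the other three vectors are inserted.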

matteo-grella · 2020-09-18T09:44:37Z

Thanks,

I think I have now enough info to start the implementation!

By the way, isn't it then a bit confusing to say that the CPU version can deal with static datasets only?

However, if the "loop" I mentioned earlier is performed in parallel, let's say by different pods in a k8s environment, I assume that in order to find the vectors they must be indexed (indexing for me equals storing, somehow). That's why I was thinking of indexing the vectors one by one.

shiyu22 · 2020-09-18T10:11:19Z

the CPU version can deal with static dataset only

That phrase is really confusing. Is it on the website? Please submit a doc issue and point out the place where we can change it.

matteo-grella Sep 18, 2020
Author

Here we go: Misleading description for the CPU-only version?
