[Bug]: TEXT_MATCH returning no results sometimes. #38644

Open

danielelongo14 opened this issue Dec 22, 2024 · 6 comments · May be fixed by #39070
Labels
kind/bug Issues or changes related a bug triage/needs-information Indicates an issue needs more information in order to work on it.

Comments


danielelongo14 commented Dec 22, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.5.0-gpu
- Deployment mode(standalone or cluster): Standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.5.0
- OS(Ubuntu or CentOS): MacOS and linux
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After uploading the documents to Milvus, I query with client.search and filter = f"TEXT_MATCH(content, '{important_words}')".
The first query works, but afterwards even the same query returns no results.
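For reference, a minimal sketch of how such a filter string is assembled. The helper name `text_match_filter` is hypothetical (not part of the report's code); with several space-separated terms, Milvus's TEXT_MATCH matches rows containing any of them.

```python
# Hypothetical helper showing how the TEXT_MATCH filter expression in the
# report is built from a field name and a string of search terms.
def text_match_filter(field: str, terms: str) -> str:
    return f"TEXT_MATCH({field}, '{terms}')"

print(text_match_filter("content", "revenue forecast"))
# TEXT_MATCH(content, 'revenue forecast')
```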

Expected Behavior

No response

Steps To Reproduce

[UPDATED]

Start Milvus on Docker

Create a collection with

    from pymilvus import (
        Collection,
        CollectionSchema,
        DataType,
        FieldSchema,
        connections,
    )
    import logging

    logger = logging.getLogger(__name__)

    # Connect before creating the collection (host/port as appropriate).
    connections.connect("default", host="localhost", port="19530")

    COLLECTION_NAME = "Documents"
    EMBEDDING_DIM = 768
    analyzer_params = {
        "type": "standard",
        "filter": ["lowercase"],
    }

    fields = [
        # max_length only applies to VARCHAR fields, so it is dropped for the INT64 primary key.
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=10000,
                    enable_analyzer=True, analyzer_params=analyzer_params, enable_match=True),
        FieldSchema(name="file_id", dtype=DataType.VARCHAR, max_length=100),
        FieldSchema(name="file_name", dtype=DataType.VARCHAR, max_length=255),
        FieldSchema(name="page_num", dtype=DataType.INT64),
        FieldSchema(name="para_num", dtype=DataType.INT64),
        FieldSchema(name="data_type", dtype=DataType.VARCHAR, max_length=50),
        FieldSchema(name="company_name", dtype=DataType.VARCHAR, max_length=255),
        FieldSchema(name="file_path", dtype=DataType.VARCHAR, max_length=100),
        FieldSchema(name="folder_name", dtype=DataType.VARCHAR, max_length=255),
        FieldSchema(name="created_date", dtype=DataType.VARCHAR, max_length=50),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=EMBEDDING_DIM),
        FieldSchema(name="sparse_embeddings", dtype=DataType.SPARSE_FLOAT_VECTOR),
    ]

    try:
        schema = CollectionSchema(fields, description="Document collection",
                                  auto_id=True, enable_dynamic_field=True)
        collection = Collection(COLLECTION_NAME, schema)

        try:
            # Create an index on the dense embedding field.
            dense_index_params = {
                "index_type": "IVF_FLAT",   # choose an appropriate index type
                "metric_type": "COSINE",    # cosine similarity
                "params": {"nlist": 128},
            }
            collection.create_index(field_name="embedding", index_params=dense_index_params)
        except Exception as e:
            print(e)

        try:
            # Create an index on the sparse embedding field.
            sparse_index_params = {
                "index_type": "SPARSE_INVERTED_INDEX",
                "metric_type": "IP",  # inner product is supported for sparse vectors
                "params": {"drop_ratio_build": 0.2},
            }
            collection.create_index(field_name="sparse_embeddings", index_params=sparse_index_params)
            logger.info("Indexes created successfully")
        except Exception as e:
            logger.error(f"Error while creating indexes: {e}")
    except Exception as e:
        logger.error(f"Error while creating schema: {e}")

The collection is created only once (in my app the creation code is guarded by a condition, so it does not run on every start).

Load documents into the collection and try the TEXT_MATCH filter. At this stage, it works normally.

Remove the container and rebuild it. Now, the collection and the documents are still in the volume, so it doesn't create a new one.

Try the TEXT_MATCH filter again: it finds no results, while normal semantic search still works.

The only solution I've found is to delete all the volumes, recreate the collection, and reinsert the documents.

Milvus Log

No response

Anything else?

No response

@danielelongo14 danielelongo14 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 22, 2024
Contributor

yanliang567 commented Dec 24, 2024

There is a limitation: TEXT_MATCH does not search the growing segment immediately, only after it has been synced to disk. I guess you hit this issue, so please try manually calling flush() on the collection after insertion. BTW, manual flushing is generally not recommended, but in this case you can try it.
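As a sketch of that workaround: the helper below and the stand-in class are illustrative only (not pymilvus API); against a real deployment you would call insert() and flush() on an actual pymilvus Collection.

```python
# Sketch of the suggested workaround: call flush() right after insert() so
# data leaves the growing segment and becomes visible to TEXT_MATCH.
def insert_then_flush(collection, rows):
    collection.insert(rows)
    collection.flush()  # force the growing segment to be sealed and synced

# Stand-in object recording the call order, for illustration only;
# a real pymilvus Collection would talk to the Milvus server here.
class FakeCollection:
    def __init__(self):
        self.calls = []

    def insert(self, rows):
        self.calls.append("insert")

    def flush(self):
        self.calls.append("flush")

c = FakeCollection()
insert_then_flush(c, [{"content": "quarterly revenue"}])
print(c.calls)  # ['insert', 'flush']
```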

/assign @danielelongo14
/unassign

@yanliang567 yanliang567 added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 24, 2024
@xiaofan-luan
Collaborator

TEXT_MATCH

I guess this is not the case.
Text match is guaranteed to be searchable within 200 ms; if not, it's an issue.

@xiaofan-luan
Collaborator

Remove the container and rebuild it. Now, the collection and the documents are still in the volume, so it doesn't create a new one.

Did you mean you stopped the container and brought up another one?

@yanliang567 my guess is the match is not being done in the same collection

@danielelongo14
Author

Yes, I usually do

docker compose up --build --no-cache and then docker compose down

Without deleting the volumes, I run docker compose up --build --no-cache again

Now text match won't work, while semantic search will


yiwen92 commented Jan 8, 2025

Thanks for your report @danielelongo14. Our dev team just found an internal bug that might cause this issue. A fix is in progress; we'll let you know which version it lands in.

@SpadeA-Tang SpadeA-Tang linked a pull request Jan 8, 2025 that will close this issue
@danielelongo14
Author

Thank you!

I also sometimes get LoadSegment: Error in GetObjectSize, [errcode:404, exception:, errmessage:No response body.,
Do you happen to know what can cause it while loading the collection?
