Match queries and field values only if both went through the same analyzer #2108

Open
k0ral opened this issue Nov 28, 2024 · 2 comments

k0ral commented Nov 28, 2024

(Full reproducible example at the end.)

I'm having trouble setting up bleve for the specific use case described below; it's not clear to me whether it's actually feasible, you tell me :)

I'm in a situation where I need to support:

  1. case-insensitive queries: match if some word is present regardless of text case
  2. exact match queries: match if, and only if, the original document (before analysis) has some word verbatim
  3. a mix of the above combined with boolean operators (e.g. in pseudo-code: case_insensitive("foo") AND exact("bar"))

I expect this is a matter of properly setting analyzers, but I'm not sure. I have created the following analyzers:

var LowercaseWords = MyAnalyzer{
	Name: "lowercase_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
		"token_filters": []string{
			lowercase.Name,
		},
	},
}

var ExactWords = MyAnalyzer{
	Name: "exact_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
	},
}

My understanding is that analyzers can be assigned to:

  • document fields
  • match queries

I can implement (1) alone by setting the lowercase_words analyzer on both the field and the match query.
Similarly, I can implement (2) alone by setting the exact_words analyzer on both the field and the match query.
However, I can't find a way to implement (3): if I set both analyzers on the same field, bleve will match the exact match query against tokens produced by the lowercase analyzer, which violates (2).

In other words: I'm trying to have bleve match queries and tokens only if they went through the same analyzer. And I haven't found how to express that requirement using the current API.

Can this use case be achieved with bleve? If so, how? If not, any hints on how to patch bleve to support it?

Full reproducible example

The expected values in the tests below show the behavior I'm trying to achieve.

package minimal_test

import (
	"testing"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/whitespace"
	"github.com/blevesearch/bleve/v2/mapping"
	"github.com/blevesearch/bleve/v2/search/query"
	"github.com/stretchr/testify/require"
)

type MyAnalyzer struct {
	Name     string
	Settings map[string]interface{}
}

var ExactWords = MyAnalyzer{
	Name: "exact_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
	},
}

var LowercaseWords = MyAnalyzer{
	Name: "lowercase_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
		"token_filters": []string{
			lowercase.Name,
		},
	},
}

const documentType = "documentType"

type MyDocument struct {
	MyField string `json:"my_field"`
}

// The receiver MUST NOT be a pointer for type resolution to work in bleve
func (d MyDocument) Type() string {
	return documentType
}

var exactFieldMapping = func() *mapping.FieldMapping {
	exactMapping := mapping.NewTextFieldMapping()
	exactMapping.Analyzer = ExactWords.Name

	return exactMapping
}()

var lowercaseFieldMapping = func() *mapping.FieldMapping {
	lowercaseMapping := mapping.NewTextFieldMapping()
	lowercaseMapping.Analyzer = LowercaseWords.Name

	return lowercaseMapping
}()

func newDocumentMapping(fieldMappings []*mapping.FieldMapping) *mapping.DocumentMapping {
	docMapping := mapping.NewDocumentMapping()
	docMapping.AddFieldMappingsAt("my_field", fieldMappings...)
	return docMapping
}

func newIndex(documentMapping *mapping.DocumentMapping) (bleve.Index, error) {
	indexMapping := bleve.NewIndexMapping()
	err := indexMapping.AddCustomAnalyzer(ExactWords.Name, ExactWords.Settings)
	if err != nil {
		return nil, err
	}

	err = indexMapping.AddCustomAnalyzer(LowercaseWords.Name, LowercaseWords.Settings)
	if err != nil {
		return nil, err
	}

	indexMapping.AddDocumentMapping(documentType, documentMapping)

	return bleve.NewMemOnly(indexMapping)
}

func TestExact(t *testing.T) {
	t.Parallel()

	cases := map[string]struct {
		FieldValue    string
		FieldMappings []*mapping.FieldMapping
		Query         query.Query
		ExpectedMatch bool
	}{
		"exact_mapping_exact_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true,
		},

		// Currently fails: bleve returns a match
		"lowercase_mapping_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: false,
		},

		// Currently fails: bleve returns a match
		"both_mappings_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: false,
		},

		// Currently fails: bleve returns a match
		"exact_mapping_lowercase_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: false,
		},

		"lowercase_mapping_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},

		"both_mappings_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},
	}

	for name, c := range cases {
		c := c // per-iteration copy; needed before Go 1.22 because the subtest runs in parallel
		t.Run(name, func(t *testing.T) {
			t.Parallel()
			index, err := newIndex(newDocumentMapping(c.FieldMappings))
			require.NoError(t, err)

			err = index.Index("document_id", MyDocument{MyField: c.FieldValue})
			require.NoError(t, err)

			search := bleve.NewSearchRequest(c.Query)
			search.Highlight = bleve.NewHighlight()
			search.IncludeLocations = true
			result, err := index.Search(search)
			require.NoError(t, err)

			require.Equal(t, 1, result.Status.Successful)
			if c.ExpectedMatch {
				require.Len(t, result.Hits, 1)
				require.Equal(t, "document_id", result.Hits[0].ID)
			} else {
				require.Empty(t, result.Hits)
			}
		})
	}
}
@abhinavdangeti
Member

@k0ral Your expectation for this to work is right.

  • When you apply both field mappings to your field, analyzed tokens for both the mapping rules will be indexed for the field.
  • These tokens will also be written to the composite _all field.
  • Now when you run your query, you're not setting the field for it - meaning it'll go look for the search tokens generated by the chosen analyzer within the _all composite field.
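
The behaviour described in these three points can be illustrated with a toy model that uses only the Go standard library. It does not call bleve; `sharedField`, `index`, and `match` are hypothetical names, and the sketch only assumes that a field stores tokens without recording which analyzer produced them.

```go
package main

import (
	"fmt"
	"strings"
)

// analyzer turns raw text into tokens. Both analyzers split on whitespace,
// mirroring the whitespace tokenizer used in the issue.
type analyzer func(text string) []string

func exactWords(text string) []string { return strings.Fields(text) }

func lowercaseWords(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// sharedField is a hypothetical stand-in for a single indexed field:
// a set of tokens with no record of which analyzer produced each one.
type sharedField map[string]bool

// index runs every analyzer over the text and stores all resulting tokens
// in the same set.
func (f sharedField) index(text string, analyzers ...analyzer) {
	for _, a := range analyzers {
		for _, tok := range a(text) {
			f[tok] = true
		}
	}
}

// match reports whether at least one token of the analyzed query is stored.
func (f sharedField) match(query string, a analyzer) bool {
	for _, tok := range a(query) {
		if f[tok] {
			return true
		}
	}
	return false
}

func main() {
	field := sharedField{}
	field.index("Foo FooBar Baz", exactWords, lowercaseWords)

	// The exact-analyzed query "foobar" hits the token written by the
	// lowercase analyzer, even though the document never contained it verbatim.
	fmt.Println(field.match("foobar", exactWords)) // true
	fmt.Println(field.match("FooBar", exactWords)) // true
}
```

In this model the first result is `true` for the same reason the `both_mappings_exact_query` case matches: once tokens share a field, the query side cannot tell them apart.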

When I try your code with the latest version of bleve, test cases 2, 3 and 4 fail because you've set ExpectedMatch to false. I changed those expectations and the tests pass:

	}{
		"exact_mapping_exact_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true,
		},

		"lowercase_mapping_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true, // not expected to fail because foobar is indexed by the lower case mapping
		},

		"both_mappings_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true, // tokens from both mappings should be available
		},

		"exact_mapping_lowercase_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true, // won't fail because foobar (after lower case analyzer was applied) is indexed as is by exact mapping
		},

		"lowercase_mapping_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},

		"both_mappings_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true, // tokens from both mappings should be available
		},
	}

Test output:

$ go test -run=TestExact -v
=== RUN   TestExact
=== PAUSE TestExact
=== CONT  TestExact
=== RUN   TestExact/exact_mapping_exact_query
=== PAUSE TestExact/exact_mapping_exact_query
=== RUN   TestExact/lowercase_mapping_exact_query
=== PAUSE TestExact/lowercase_mapping_exact_query
=== RUN   TestExact/both_mappings_exact_query
=== PAUSE TestExact/both_mappings_exact_query
=== RUN   TestExact/exact_mapping_lowercase_query
=== PAUSE TestExact/exact_mapping_lowercase_query
=== RUN   TestExact/lowercase_mapping_lowercase_query
=== PAUSE TestExact/lowercase_mapping_lowercase_query
=== RUN   TestExact/both_mappings_lowercase_query
=== PAUSE TestExact/both_mappings_lowercase_query
=== CONT  TestExact/exact_mapping_exact_query
=== CONT  TestExact/exact_mapping_lowercase_query
=== CONT  TestExact/both_mappings_lowercase_query
=== CONT  TestExact/lowercase_mapping_lowercase_query
=== CONT  TestExact/both_mappings_exact_query
=== CONT  TestExact/lowercase_mapping_exact_query
--- PASS: TestExact (0.00s)
    --- PASS: TestExact/lowercase_mapping_lowercase_query (0.00s)
    --- PASS: TestExact/lowercase_mapping_exact_query (0.00s)
    --- PASS: TestExact/both_mappings_lowercase_query (0.00s)
    --- PASS: TestExact/exact_mapping_lowercase_query (0.00s)
    --- PASS: TestExact/both_mappings_exact_query (0.00s)
    --- PASS: TestExact/exact_mapping_exact_query (0.00s)
PASS
ok  	github.com/couchbase/cbft	0.388s

Would you test this with the latest release version and confirm whether you still see the issue?

@k0ral
Author

k0ral commented Dec 10, 2024

When I try your code with the latest version of bleve I see test cases 2, 3 and 4 failing because - you've set ExpectedMatch to false. I changed that expectations and the test passes

Thank you for taking the time to answer. I guess my original post wasn't clear enough, sorry for that.

The values I used for ExpectedMatch are the behavior I'm trying to achieve. I'm asking how to set up bleve (index, mappings, etc.) to make those tests pass without changing ExpectedMatch 🙂

Also, I should mention I'm using version 2.4.3, which is the latest release at this time I believe.
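
One way to picture the requested semantics is a toy model in which each analyzer writes to its own field and a query only looks at the field of the analyzer it was run through. This is a sketch using only the Go standard library, not bleve code; `index`, `fieldFor`, `add`, and `match` are hypothetical names, and the `my_field_<analyzer>` naming scheme is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// analyzer pairs a name with a tokenizing function.
type analyzer struct {
	name    string
	analyze func(string) []string
}

var exactWords = analyzer{"exact_words", strings.Fields}

var lowercaseWords = analyzer{"lowercase_words", func(s string) []string {
	return strings.Fields(strings.ToLower(s))
}}

// index maps a field name to the set of tokens stored under it.
type index map[string]map[string]bool

// fieldFor derives a per-analyzer field name (hypothetical naming scheme).
func fieldFor(base string, a analyzer) string { return base + "_" + a.name }

// add stores the tokens of each analyzer under that analyzer's own field.
func (ix index) add(base, text string, analyzers ...analyzer) {
	for _, a := range analyzers {
		name := fieldFor(base, a)
		if ix[name] == nil {
			ix[name] = map[string]bool{}
		}
		for _, tok := range a.analyze(text) {
			ix[name][tok] = true
		}
	}
}

// match analyzes the query with a and searches only a's field.
func (ix index) match(base, query string, a analyzer) bool {
	tokens := ix[fieldFor(base, a)] // nil map if a was never used at index time
	for _, tok := range a.analyze(query) {
		if tokens[tok] {
			return true
		}
	}
	return false
}

func main() {
	ix := index{}
	ix.add("my_field", "Foo FooBar Baz", exactWords, lowercaseWords)

	fmt.Println(ix.match("my_field", "foobar", exactWords))     // false
	fmt.Println(ix.match("my_field", "FooBar", exactWords))     // true
	fmt.Println(ix.match("my_field", "FOOBAR", lowercaseWords)) // true

	// Requirement (3): case_insensitive("foo") AND exact("Baz").
	fmt.Println(ix.match("my_field", "foo", lowercaseWords) &&
		ix.match("my_field", "Baz", exactWords)) // true
}
```

In bleve terms this might correspond to giving each field mapping a distinct `Name`, setting the target field on each match query, and keeping those fields out of the composite _all field. That translation is an untested assumption and has not been confirmed in this thread.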
