Match queries and field values only if both went through the same analyzer #2108

k0ral · 2024-11-28T16:37:30Z

(Full reproducible example at the end.)

I have trouble setting up bleve to achieve the specific use case described below ; it's not clear to me whether it's actually feasible, you tell me :) .

I'm in a situation where I need to support:

case-insensitive queries: match if some word is present regardless of text case
exact match queries: match if, and only if, the original document (before analysis) has some word verbatim
a mix of the above combined with boolean operators (e.g. in pseudo-code: case_insensitive("foo") AND exact("bar"))

I expect this is a matter of properly setting analyzers, but I'm not sure. I have created the following analyzers:

var LowercaseWords = MyAnalyzer{
	Name: "lowercase_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
		"token_filters": []string{
			lowercase.Name,
		},
	},
}

var ExactWords = MyAnalyzer{
	Name: "exact_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
	},
}

My understanding is that analyzers can be assigned to:

document fields
match queries

I can implement (1) alone by setting the lowercase_words analyzer to both the field and the match query.
Similarly, I can implement (2) alone by setting the exact_words analyzer to both the field and the match query.
However, I can't find a way to implement (3) : indeed, if I set both analyzers on the same field, bleve will match the exact match query against tokens on which the lowercase analyzer has been applied, which violates (2).

In other words: I'm trying to have bleve match queries and tokens only if they went through the same analyzer. And I haven't found how to express that requirement using the current API.

Can this use case be achieved with bleve ? If so, how ? If not, any hints on how to patch bleve to support it ?

Full reproducible example

The expected values of below tests show the behavior I'm trying to achieve.

package minimal_test

import (
	"testing"

	"github.com/blevesearch/bleve/v2"
	"github.com/blevesearch/bleve/v2/analysis/analyzer/custom"
	"github.com/blevesearch/bleve/v2/analysis/token/lowercase"
	"github.com/blevesearch/bleve/v2/analysis/tokenizer/whitespace"
	"github.com/blevesearch/bleve/v2/mapping"
	"github.com/blevesearch/bleve/v2/search/query"
	"github.com/stretchr/testify/require"
)

type MyAnalyzer struct {
	Name     string
	Settings map[string]interface{}
}

var ExactWords = MyAnalyzer{
	Name: "exact_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
	},
}

var LowercaseWords = MyAnalyzer{
	Name: "lowercase_words",
	Settings: map[string]interface{}{
		"type":      custom.Name,
		"tokenizer": whitespace.Name,
		"token_filters": []string{
			lowercase.Name,
		},
	},
}

const documentType = "documentType"

type MyDocument struct {
	MyField string `json:"my_field"`
}

// The receiver MUST NOT be a pointer for type resolution to work in bleve
func (d MyDocument) Type() string {
	return documentType
}

var exactFieldMapping = func() *mapping.FieldMapping {
	exactMapping := mapping.NewTextFieldMapping()
	exactMapping.Analyzer = ExactWords.Name

	return exactMapping
}()

var lowercaseFieldMapping = func() *mapping.FieldMapping {
	lowercaseMapping := mapping.NewTextFieldMapping()
	lowercaseMapping.Analyzer = LowercaseWords.Name

	return lowercaseMapping
}()

func newDocumentMapping(fieldMappings []*mapping.FieldMapping) *mapping.DocumentMapping {
	docMapping := mapping.NewDocumentMapping()
	docMapping.AddFieldMappingsAt("my_field", fieldMappings...)
	return docMapping
}

func newIndex(documentMapping *mapping.DocumentMapping) (bleve.Index, error) {
	indexMapping := bleve.NewIndexMapping()
	err := indexMapping.AddCustomAnalyzer(ExactWords.Name, ExactWords.Settings)
	if err != nil {
		return nil, err
	}

	err = indexMapping.AddCustomAnalyzer(LowercaseWords.Name, LowercaseWords.Settings)
	if err != nil {
		return nil, err
	}

	indexMapping.AddDocumentMapping(documentType, documentMapping)

	return bleve.NewMemOnly(indexMapping)
}

func TestExact(t *testing.T) {
	t.Parallel()

	cases := map[string]struct {
		FieldValue    string
		FieldMappings []*mapping.FieldMapping
		Query         query.Query
		ExpectedMatch bool
	}{
		"exact_mapping_exact_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true,
		},

		// Fails
		"lowercase_mapping_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: false,
		},

		// Fails
		"both_mappings_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: false,
		},

		// Fails
		"exact_mapping_lowercase_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: false,
		},

		"lowercase_mapping_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},

		"both_mappings_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},
	}

	for name, c := range cases {
		t.Run(name, func(t *testing.T) {
			t.Parallel()
			index, err := newIndex(newDocumentMapping(c.FieldMappings))
			require.NoError(t, err)

			err = index.Index("document_id", MyDocument{MyField: c.FieldValue})
			require.NoError(t, err)

			search := bleve.NewSearchRequest(c.Query)
			search.Highlight = bleve.NewHighlight()
			search.IncludeLocations = true
			result, err := index.Search(search)
			require.NoError(t, err)

			require.Equal(t, 1, result.Status.Successful)
			if c.ExpectedMatch {
				require.Len(t, result.Hits, 1)
				require.Equal(t, "document_id", result.Hits[0].ID)
			} else {
				require.Empty(t, result.Hits)
			}
		})
	}
}

The text was updated successfully, but these errors were encountered:

abhinavdangeti · 2024-12-09T18:48:22Z

@k0ral Your expectation for this to work is right.

When you apply both field mappings to your field, analyzed tokens for both the mapping rules will be indexed for the field.
Also these tokens will be written the composite _all field.
Now when you run your query, you're not setting the field for it - meaning it'll go look for the search tokens generated by the chosen analyzer within the _all composite field.

When I try your code with the latest version of bleve I see test cases 2, 3 and 4 failing because - you've set ExpectedMatch to false. I changed that expectations and the test passes ..

	}{
		"exact_mapping_exact_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true,
		},

		"lowercase_mapping_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true, // not expected to fail because foobar is indexed by the lower case mapping
		},

		"both_mappings_exact_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "foobar", Analyzer: ExactWords.Name},
			ExpectedMatch: true, // tokens from both mappings should be available
		},

		"exact_mapping_lowercase_query": {
			FieldValue:    "foo foobar baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true, // won't fail because foobar (after lower case analyzer was applied) is indexed as is by exact mapping
		},

		"lowercase_mapping_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true,
		},

		"both_mappings_lowercase_query": {
			FieldValue:    "Foo FooBar Baz",
			FieldMappings: []*mapping.FieldMapping{exactFieldMapping, lowercaseFieldMapping},
			Query:         &query.MatchQuery{Match: "FooBar", Analyzer: LowercaseWords.Name},
			ExpectedMatch: true, // tokens from both mappings should be available
		},
	}

Test output ..

$ go test -run=TestExact -v
=== RUN   TestExact
=== PAUSE TestExact
=== CONT  TestExact
=== RUN   TestExact/exact_mapping_exact_query
=== PAUSE TestExact/exact_mapping_exact_query
=== RUN   TestExact/lowercase_mapping_exact_query
=== PAUSE TestExact/lowercase_mapping_exact_query
=== RUN   TestExact/both_mappings_exact_query
=== PAUSE TestExact/both_mappings_exact_query
=== RUN   TestExact/exact_mapping_lowercase_query
=== PAUSE TestExact/exact_mapping_lowercase_query
=== RUN   TestExact/lowercase_mapping_lowercase_query
=== PAUSE TestExact/lowercase_mapping_lowercase_query
=== RUN   TestExact/both_mappings_lowercase_query
=== PAUSE TestExact/both_mappings_lowercase_query
=== CONT  TestExact/exact_mapping_exact_query
=== CONT  TestExact/exact_mapping_lowercase_query
=== CONT  TestExact/both_mappings_lowercase_query
=== CONT  TestExact/lowercase_mapping_lowercase_query
=== CONT  TestExact/both_mappings_exact_query
=== CONT  TestExact/lowercase_mapping_exact_query
--- PASS: TestExact (0.00s)
    --- PASS: TestExact/lowercase_mapping_lowercase_query (0.00s)
    --- PASS: TestExact/lowercase_mapping_exact_query (0.00s)
    --- PASS: TestExact/both_mappings_lowercase_query (0.00s)
    --- PASS: TestExact/exact_mapping_lowercase_query (0.00s)
    --- PASS: TestExact/both_mappings_exact_query (0.00s)
    --- PASS: TestExact/exact_mapping_exact_query (0.00s)
PASS
ok  	github.com/couchbase/cbft	0.388s

Would you test this with the latest release version and confirm that you still see a situation?

k0ral · 2024-12-10T08:26:22Z

When I try your code with the latest version of bleve I see test cases 2, 3 and 4 failing because - you've set ExpectedMatch to false. I changed that expectations and the test passes

Thank you for taking the time to answer. I guess my original post wasn't clear enough, sorry for that.

The values I used for ExpectedMatch are the behavior I'm trying to achieve, I'm asking how to setup bleve (index, mappings, etc.) to make those tests pass without changing ExpectedMatch 🙂 .

Also, I should mention I'm using version 2.4.3, which is the latest release at this time I believe.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Match queries and field values only if both went through the same analyzer #2108

Match queries and field values only if both went through the same analyzer #2108

k0ral commented Nov 28, 2024

abhinavdangeti commented Dec 9, 2024

k0ral commented Dec 10, 2024 •

edited

Loading

Match queries and field values only if both went through the same analyzer #2108

Match queries and field values only if both went through the same analyzer #2108

Comments

k0ral commented Nov 28, 2024

Full reproducible example

abhinavdangeti commented Dec 9, 2024

k0ral commented Dec 10, 2024 • edited Loading

k0ral commented Dec 10, 2024 •

edited

Loading