[BUG] Using synonym filter after hunspell. #16530

Open
aswad1 opened this issue Oct 31, 2024 · 1 comment
Assignees
Labels
bug Something isn't working Other


aswad1 commented Oct 31, 2024

Describe the bug

When using a synonym filter after hunspell, I don't see the expected plural synonyms in the output. In the configuration below, I have added these synonyms:

  • stationary
  • stationery
  • stationaries
  • stationeries
PUT /test-index3
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_synonym_graph-replacement_filter": {
          "type": "synonym_graph",
          "synonyms": [
            "stationary, stationery, stationaries, stationeries"
          ]
        },
        "custom_hunspell_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "test_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_hunspell_stemmer",
            "custom_synonym_graph-replacement_filter"
          ]
        }
      }
    }
  }
}

While testing, I don't see stationaries and stationeries in the output.

POST /test-index3/_analyze
{
  "analyzer": "test_analyzer",
  "text": "stationary"
}

--
{
  "tokens": [
    {
      "token": "stationery",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationary",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationary",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}
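
To make the gap explicit, the response can be compared against the synonym list programmatically. The following is a minimal sketch, not part of the original report: `missing_synonyms` is a hypothetical helper name, and the response is a trimmed copy of the output above with offsets and positions left out.

```python
import json


def missing_synonyms(analyze_response: str, expected: list[str]) -> list[str]:
    """Return the expected terms that do not appear in an _analyze response body."""
    tokens = {t["token"] for t in json.loads(analyze_response)["tokens"]}
    return [term for term in expected if term not in tokens]


# Trimmed copy of the _analyze response shown above (assumption: only the
# "token" and "type" fields matter for this check).
response = json.dumps({"tokens": [
    {"token": "stationery", "type": "SYNONYM"},
    {"token": "stationary", "type": "SYNONYM"},
    {"token": "stationary", "type": "word"},
]})
expected = ["stationary", "stationery", "stationaries", "stationeries"]

print(missing_synonyms(response, expected))  # ['stationaries', 'stationeries']
```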

Here is the detailed analysis from OpenSearch:

POST /test-index3/_analyze
{
  "analyzer": "test_analyzer",
  "text": "stationary",
  "explain": true
}

------------------
{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "whitespace",
      "tokens": [
        {
          "token": "stationary",
          "start_offset": 0,
          "end_offset": 10,
          "type": "word",
          "position": 0,
          "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
          "positionLength": 1,
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "lowercase",
        "tokens": [
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "custom_hunspell_stemmer",
        "tokens": [
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      },
      {
        "name": "custom_synonym_graph-replacement_filter",
        "tokens": [
          {
            "token": "stationery",
            "start_offset": 0,
            "end_offset": 10,
            "type": "SYNONYM",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 65 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          },
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "SYNONYM",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          },
          {
            "token": "stationary",
            "start_offset": 0,
            "end_offset": 10,
            "type": "word",
            "position": 0,
            "bytes": "[73 74 61 74 69 6f 6e 61 72 79]",
            "keyword": false,
            "positionLength": 1,
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}
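
Because "stationary" and "stationery" differ by a single letter, the `bytes` field is a handy way to confirm which token each entry really is. A small sketch, assuming the field is always a bracketed list of space-separated hex bytes as in the output above; `decode_token_bytes` is a hypothetical helper name.

```python
def decode_token_bytes(field: str) -> str:
    """Decode an explain-style "bytes" value such as "[73 74 61]" into text."""
    hex_values = field.strip("[]").split()
    return bytes(int(h, 16) for h in hex_values).decode("utf-8")


# The two distinct byte sequences that appear in the explain output above.
print(decode_token_bytes("[73 74 61 74 69 6f 6e 61 72 79]"))  # stationary
print(decode_token_bytes("[73 74 61 74 69 6f 6e 65 72 79]"))  # stationery
```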

The hunspell rules and dictionary files are attached.
en-US.aff.txt
en-US.dic.txt

Related component

Other

To Reproduce

N/A

Expected behavior

The Solr analysis screenshot below highlights the synonym graph filter (SGF). All of the synonyms are displayed under SGF.

Solr-screenshot


@aswad1 aswad1 added bug Something isn't working untriaged labels Oct 31, 2024
@github-actions github-actions bot added the Other label Oct 31, 2024
@prudhvigodithi
Member

prudhvigodithi commented Oct 31, 2024

[Triage]

This comes from #16263. The proposed fix, which adds synonym_analyzer to synonym_graph (PR #16488), should solve this bug as well.

  • Download the attached .aff and .dic files and put them in the config/hunspell/en_US folder.
    Please note that the attached .aff file has an issue: the SFX Z header declares 8 rules but the block actually contains 14, so the header has to be updated to SFX Z Y 14. The block as attached:
SFX Z Y 8
SFX Z   0     rs         e
SFX Z   y     iers       [^aeiou]y
SFX Z   0     ers        [aeiou]y
SFX Z   0     ers        [^ey]
SFX Z   0     ners         [aiu]n
SFX Z   0     ers          [^e]an
SFX Z   e     ners         [aiu]ne
SFX Z   0     rly         e
SFX Z   y     ierly       [^aeiou]y
SFX Z   0     erly        [aeiou]y
SFX Z   0     erly        [^ey]
SFX Z   0     nerly       [aiu]n
SFX Z   0     erly        [^e]an
SFX Z   e     nerly       [aiu]ne
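
A mismatch like this can be caught before the dictionary is loaded. The following is a rough sketch, not part of Hunspell or OpenSearch: `check_affix_counts` is a hypothetical helper, and it assumes the common layout where a header reads `SFX <flag> <Y|N> <count>` and every later line with the same flag is one rule.

```python
from collections import Counter


def check_affix_counts(aff_text: str) -> dict[str, tuple[int, int]]:
    """Map each mismatched affix class to (declared, actual) rule counts."""
    declared: dict[tuple[str, str], int] = {}
    actual: Counter = Counter()
    for line in aff_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] not in ("SFX", "PFX"):
            continue
        key = (parts[0], parts[1])
        # Assumed header shape: SFX <flag> <Y|N> <count>, seen once per flag.
        is_header = (
            key not in declared
            and len(parts) == 4
            and parts[2] in ("Y", "N")
            and parts[3].isdigit()
        )
        if is_header:
            declared[key] = int(parts[3])
        else:
            actual[key] += 1
    return {
        f"{kind} {flag}": (count, actual[(kind, flag)])
        for (kind, flag), count in declared.items()
        if count != actual[(kind, flag)]
    }


# The SFX Z block exactly as quoted above.
sample = """SFX Z Y 8
SFX Z   0     rs         e
SFX Z   y     iers       [^aeiou]y
SFX Z   0     ers        [aeiou]y
SFX Z   0     ers        [^ey]
SFX Z   0     ners         [aiu]n
SFX Z   0     ers          [^e]an
SFX Z   e     ners         [aiu]ne
SFX Z   0     rly         e
SFX Z   y     ierly       [^aeiou]y
SFX Z   0     erly        [aeiou]y
SFX Z   0     erly        [^ey]
SFX Z   0     nerly       [aiu]n
SFX Z   0     erly        [^e]an
SFX Z   e     nerly       [aiu]ne
"""

print(check_affix_counts(sample))  # {'SFX Z': (8, 14)}
```

Running the same check after changing the header to `SFX Z Y 14` should return an empty dict.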
curl -X PUT "localhost:9200/test-index5" \
-H "Content-Type: application/json" \
-d '{
    "settings": {
        "analysis": {
            "filter": {
                "custom_synonym_graph-replacement_filter": {
                    "type": "synonym_graph",
                    "synonyms": [
                        "stationary, stationery, stationaries, stationeries"
                    ],
                    "synonym_analyzer": "standard"
                },
                "custom_hunspell_stemmer": {
                    "type": "hunspell",
                    "locale": "en_US"
                }
            },
            "analyzer": {
                "test_analyzer": {
                    "type": "custom",
                    "tokenizer": "whitespace",
                    "filter": [
                        "lowercase",
                        "custom_hunspell_stemmer",
                        "custom_synonym_graph-replacement_filter"
                    ]
                }
            }
        }
    }
}'
curl -X POST "localhost:9200/test-index5/_analyze" \
-H "Content-Type: application/json" \
-d '{
    "analyzer": "test_analyzer",
    "text": "stationary"
}'
  • Output: The output now contains stationaries and stationeries.
{
  "tokens": [
    {
      "token": "stationery",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationaries",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationeries",
      "start_offset": 0,
      "end_offset": 10,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "stationary",
      "start_offset": 0,
      "end_offset": 10,
      "type": "word",
      "position": 0
    }
  ]
}
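
One reading that is consistent with both outputs: without synonym_analyzer, the synonym rule itself is parsed with the filters that precede the synonym filter, so the stemmer collapses the plural entries before the rule is built. With synonym_analyzer set to standard, the rule is parsed without stemming and all four terms survive. The toy model below only illustrates that idea. It is not OpenSearch code, and `toy_stem` is a hypothetical stand-in that does not reproduce real hunspell behavior.

```python
def toy_stem(term: str) -> str:
    """Hypothetical stand-in for the hunspell stemmer: maps -ies plurals to -y."""
    return term[:-3] + "y" if term.endswith("ies") else term


def parse_rule(rule: str, analyze) -> list[str]:
    """Analyze each term of a comma-separated rule, dropping duplicates in order."""
    terms = [analyze(t.strip()) for t in rule.split(",")]
    return list(dict.fromkeys(terms))


rule = "stationary, stationery, stationaries, stationeries"

# Rule parsed by the preceding chain (lowercase + stemmer): plurals collapse.
stemmed = parse_rule(rule, lambda t: toy_stem(t.lower()))
# Rule parsed by a separate analyzer that does not stem: all terms survive.
unstemmed = parse_rule(rule, lambda t: t.lower())

print(stemmed)    # ['stationary', 'stationery']
print(unstemmed)  # ['stationary', 'stationery', 'stationaries', 'stationeries']
```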

Thank you
@msfroh @getsaurabh02 @nupurjaiswal @dblock @aswad1
