charset_table #16

Open
realdigger opened this issue May 31, 2020 · 8 comments

Comments

@realdigger

realdigger commented May 31, 2020

charset_table = 0..9, A..Z->a..z, _, a..z

It would be better to remove this line or make it optional, because it limits the index to Latin characters only.
The default value of charset_table covers Latin and Cyrillic characters.
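
For reference, here is a sketch of what that default looks like when written out explicitly. It follows the example given in the Sphinx documentation for charset_table (English and Russian letters); verify it against the documentation for the Sphinx version in use before relying on it.

```
# Latin plus Russian Cyrillic, with uppercase letters folded to lowercase
charset_table = 0..9, A..Z->a..z, _, a..z, \
	U+410..U+42F->U+430..U+44F, U+430..U+44F, U+401->U+451, U+451
```

Omitting the charset_table line entirely should have the same effect, since this is the value Sphinx falls back to.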

@Yariksat

+1
I ran into this myself today.
With the default config, searches do not find Russian words.

@jdarwood007
Member

It would be better if this charset_table detected whether you're using UTF-8 or not. If you are, it should build a charset table that is safe for UTF-8.

@jdarwood007
Member

@Yariksat and @realdigger

Can you guys try the following in your configs? See if this gets non-Latin support working.

charset_table = U+FF10..U+FF19->0..9, U+FF21..U+FF3A->a..z, \
		U+FF41..U+FF5A->a..z, 0..9, A..Z->a..z, _, a..z, \
\
U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6,\
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101,\
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109,\
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F,\
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117,\
U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D,\
U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135,\
U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C,\
U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144,\
U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B,\
U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153,\
U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159,\
U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161,\
U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167,\
U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F,\
U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175,\
U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C,\
U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F,\
U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9,\
U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF,\
U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F,\
U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E,\
U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39,\
U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39,\
U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60,\
U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805,\
U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5,\
U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9,\
U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5,\
U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9,\
U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1,\
U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9,\
U+0E01..U+0E3A, U+0E3F..U+0E46,\
U+0E47..U+0E4F, U+0E50..U+0E5B,\
U+A000..U+A48F,\
U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF,\
U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F,\
U+A490..U+A4CF, \
\
U+410..U+42F->U+430..U+44F, U+430..U+44F, \
\
U+621..U+63a, U+640..U+64a, U+66e..U+66f, \
U+671..U+6d3, U+6d5, U+6e5..U+6e6, U+6ee..U+6ef, \
U+6fa..U+6fc, U+6ff

ngram_len = 1
ngram_chars = U+4E00..U+9FBB, U+3400..U+4DB5, U+20000..U+2A6D6, \
	U+FA0E, U+FA0F, U+FA11, U+FA13, U+FA14, U+FA1F, U+FA21, \
	U+FA23, U+FA24, U+FA27, U+FA28, U+FA29, U+3105..U+312C, \
	U+31A0..U+31B7, U+3041, U+3043, U+3045, U+3047, U+3049, \
	U+304B, U+304D, U+304F, U+3051, U+3053, U+3055, U+3057, \
	U+3059, U+305B, U+305D, U+305F, U+3061, U+3063, U+3066, \
	U+3068, U+306A..U+306F, U+3072, U+3075, U+3078, U+307B, \
	U+307E..U+3083, U+3085, U+3087, U+3089..U+308E, \
	U+3090..U+3093, U+30A1, U+30A3, U+30A5, U+30A7, U+30A9, \
	U+30AD, U+30AF, U+30B3, U+30B5, U+30BB, U+30BD, U+30BF, \
	U+30C1, U+30C3, U+30C4, U+30C6, U+30CA, U+30CB, U+30CD, \
	U+30CE, U+30DE, U+30DF, U+30E1, U+30E2, U+30E3, U+30E5, \
	U+30E7, U+30EE, U+30F0..U+30F3, U+30F5, U+30F6, U+31F0, \
	U+31F1, U+31F2, U+31F3, U+31F4, U+31F5, U+31F6, U+31F7, \
	U+31F8, U+31F9, U+31FA, U+31FB, U+31FC, U+31FD, U+31FE, \
	U+31FF, U+AC00..U+D7A3, U+1100..U+1159, U+1161..U+11A2, \
	U+11A8..U+11F9, U+A000..U+A48C, U+A492..U+A4C6

@Yariksat

Thanks.
It works; I will keep an eye on it.

@iasdeoupxe
Contributor

It could be worth having a look at phpbb/phpbb#5815. Just note that the phpBB implementation uses the older SphinxAPI instead of SphinxQL.

@jdarwood007
Member

Just to clarify, they have a different licensing model, which means we can't borrow any code from them without being noncompliant with the GPL and the 3-Clause BSD license. However, the Sphinx configuration is not something that would be considered licensable. The difference between API and QL usage doesn't matter here, as this is more about how Sphinx is set up to process the data.

Looking at what they set, the charset looks very familiar, with some different options elsewhere that we are not setting. If we find that those options are needed or that they fix other issues, they should be added. I think the requested test of the updated charset should get things rolling properly.

@iCr

iCr commented Feb 25, 2021

@realdigger Here is a good config for creating sphinxsearch indexes:
https://github.com/anilibria/docs/blob/master/install/sphinx.md
https://adw0rd.com/2009/6/15/sphinxsearch/

@jdarwood007
Member

So after looking into this and thinking it over, I think the best solution is:

  • For Manticore, specify the preferred default 'non_cjk'. If Japanese, Chinese or Korean language packs are installed, add the 'cjk' package.
  • For Sphinx, leave this as is. There is just too much to change to get it to provide good defaults.

The config is supposed to provide hints to get things working, not be an out-of-the-box (OOB) solution.
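
As a rough illustration of the Manticore option above, the fragment below shows one possible reading of it. It assumes the built-in `non_cjk` and `cjk` aliases that Manticore Search documents for `charset_table` and `ngram_chars`, and it assumes that "adding the cjk package" means indexing CJK text as single-character n-grams, as the Sphinx config earlier in this thread does with explicit ranges. Neither assumption is confirmed in this thread, so check the alias names and behaviour against the Manticore version in use.

```
# Default for installs without CJK language packs
charset_table = non_cjk

# Assumption: add these two lines only when Japanese, Chinese or Korean
# language packs are installed, so CJK characters are indexed as 1-grams
ngram_len = 1
ngram_chars = cjk
```

The aliases replace the long hand-maintained range lists, which is the main reason this is practical for Manticore but not for Sphinx.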

http://sphinxsearch.com/docs/current/conf-charset-table.html
http://sphinxsearch.com/wiki/doku.php?id=charset_tables
https://manticoresearch.com/blog/manticore-search-3-years-after-forking-from-sphinx/
https://manticoresearch.com/blog/default-charset-tables-and-stopwords-files/
