Solr NGramFilterFactory, PatternReplaceFilterFactory and PatternReplaceCharFilterFactory

The configuration below, from schema.xml, uses solr.NGramFilterFactory, solr.PatternReplaceFilterFactory and solr.PatternReplaceCharFilterFactory.
NGramFilterFactory enables partial matching on the my_copied_id field.
PatternReplaceFilterFactory replaces every digit with the letter “a” in the index analyzer; in the query analyzer, it keeps only the first 8 characters of the query term and discards everything after the 8th character.
PatternReplaceCharFilterFactory replaces every digit in the query term with the letter “a”.

<field name="my_copied_id" type="my_id_type" indexed="true" stored="true" required="false"/>

<copyField source="id" dest="my_copied_id" />

<fieldType name="my_id_type" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="1000" />
        <filter class="solr.PatternReplaceFilterFactory" pattern="[0-9]" replacement="a" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[0-9]" replacement="a"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="(^.{8}).*" replacement="$1" replace="all" />
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

With the above configuration, if a document being indexed contains a field named id, Solr copies the value of id into a new field, my_copied_id.

When indexing the data:
First, after the whitespace tokenizer splits the value into tokens, NGramFilterFactory emits every substring of 3 or more characters from each token, something like this:
Tokenizer to Filter: “four”, “score”
Out: “fou”, “our”, “four”, “sco”, “cor”, “ore”, “scor”, “core”, “score”
Second, PatternReplaceFilterFactory replaces every digit in each n-gram with the letter “a”.
Third, LowerCaseFilterFactory converts all letters to lowercase.
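
The three index-time steps can be approximated outside Solr to see which terms end up in the index. The Python sketch below is only a simulation of the configured chain, not Solr code; the function names ngrams and index_terms are made up for illustration, and the order in which Solr emits the n-grams may differ.

```python
import re

def ngrams(token, min_size=3, max_size=1000):
    """Return every substring of token with length in [min_size, max_size]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_size, max_size + 1):
            if start + size > len(token):
                break  # no longer substring fits from this start position
            grams.append(token[start:start + size])
    return grams

def index_terms(value):
    """Simulate the index analyzer: whitespace split, n-grams, digits to 'a', lowercase."""
    terms = []
    for token in value.split():                     # WhitespaceTokenizerFactory
        for gram in ngrams(token):                  # NGramFilterFactory
            gram = re.sub(r"[0-9]", "a", gram)      # PatternReplaceFilterFactory
            terms.append(gram.lower())              # LowerCaseFilterFactory
    return terms

# Print the distinct simulated index terms for the id value SOLR1000.
print(sorted(set(index_terms("SOLR1000"))))
```

Because every digit becomes “a”, several different substrings of the original value collapse into the same indexed term.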

When querying the field my_copied_id:
First, PatternReplaceCharFilterFactory replaces every digit in the query text with the letter “a”. The tokenizer then keeps the whole query text as a single token.
Second, PatternReplaceFilterFactory keeps only the first 8 characters of the query term and discards everything after the 8th character.
Third, LowerCaseFilterFactory converts all letters to lowercase.
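
The query-time steps can be sketched the same way. This is a rough Python simulation of the configured query analyzer, not Solr's implementation; query_term is a hypothetical helper name. It reuses the regular expressions from the schema, with Python's \1 standing in for $1.

```python
import re

def query_term(raw):
    """Simulate the query analyzer for a single query string."""
    # PatternReplaceCharFilterFactory: runs on the raw text, before tokenizing
    text = re.sub(r"[0-9]", "a", raw)
    # KeywordTokenizerFactory: the whole text stays one token
    token = text
    # PatternReplaceFilterFactory: tokens longer than 8 characters are cut to 8;
    # shorter tokens do not match the pattern and pass through unchanged
    token = re.sub(r"(^.{8}).*", r"\1", token)
    # LowerCaseFilterFactory
    return token.lower()

for raw in ["solr123456dsfgadrhgshdh", "SOLR12", "lr1234"]:
    print(raw, "->", query_term(raw))
```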

For example, if a document has an id field with the value SOLR1000,
the queries below all find the document with id=SOLR1000:

    my_copied_id:solr1000
    my_copied_id:solr123456dsfgadrhgshdh
    my_copied_id:solraaa
    my_copied_id:solr12345
    my_copied_id:solr12
    my_copied_id:lr1000
    my_copied_id:lr1234
    my_copied_id:lraa
    my_copied_id:aaa
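
To see why all of these queries match, the two analyzer chains can be simulated together: a query matches when its analyzed term is one of the indexed terms. The sketch below is an approximation in Python, not Solr code, and index_terms and query_term are hypothetical helper names. It normalizes the token before cutting n-grams, which yields the same set of terms because the digit replacement and lowercasing work character by character.

```python
import re

def index_terms(value, min_size=3):
    """Simulated index side: normalized substrings of at least min_size characters."""
    terms = set()
    for token in value.split():
        normalized = re.sub(r"[0-9]", "a", token).lower()
        for start in range(len(normalized)):
            for end in range(start + min_size, len(normalized) + 1):
                terms.add(normalized[start:end])
    return terms

def query_term(raw):
    """Simulated query side: digits to 'a', keep first 8 characters, lowercase."""
    text = re.sub(r"[0-9]", "a", raw)
    return re.sub(r"(^.{8}).*", r"\1", text).lower()

indexed = index_terms("SOLR1000")
queries = [
    "solr1000", "solr123456dsfgadrhgshdh", "solraaa", "solr12345", "solr12",
    "lr1000", "lr1234", "lraa", "aaa",
]
for q in queries:
    # Each listed query analyzes to a term that exists in the simulated index.
    print(q, "->", query_term(q), query_term(q) in indexed)
```

In this simulation a two-character query such as so finds nothing, because minGramSize is 3 and no term shorter than 3 characters is indexed.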

The returned document shows the same value for id and my_copied_id, in this case SOLR1000. This is because the filters change only the terms written to the index; they do not change the original stored value. Faceting on the my_copied_id field shows the indexed values.
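
As a rough illustration of the facet check, the Python snippet below builds a select URL that facets on my_copied_id. The host, port and core name (localhost:8983, mycore) are assumptions and need to be replaced with the values of the actual Solr instance; the request parameters are Solr's standard facet parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr location; adjust host, port and core name as needed.
base = "http://localhost:8983/solr/mycore/select"

params = {
    "q": "*:*",                     # match all documents
    "rows": 0,                      # no documents needed, only facet counts
    "facet": "true",
    "facet.field": "my_copied_id",  # facet values are the indexed terms
    "facet.limit": -1,              # no cap on the number of terms returned
}

url = base + "?" + urlencode(params)
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should list the indexed terms of my_copied_id with their document counts.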
