Java Tutorial & more: Solr 6: Schemaless and Schema Mode

The Solr schema is where we tell Solr how it should build indexes from input documents. Since Solr 5, the default mode is schemaless, which lets users and developers build up a schema without manually editing the schema file.

In this post, I will show how to create a core in schemaless mode and add documents, and I will point out a few differences between schema mode and schemaless mode in a standalone Solr.

Creating a Core
This command starts Solr in the foreground (-f) with a 1 GB heap (-m 1g).
bin/solr start -f -m 1g

Now, let's create a core named 'schemaless'.

slkc:solr-6.6.0 jihwan$ pwd
/Users/jihwan/devTools/solr-6.6.0
slkc:solr-6.6.0 jihwan$ bin/solr create -c schemaless

Copying configuration to new core instance directory:
/Users/jihwan/devTools/solr-6.6.0/server/solr/schemaless

Creating new core 'schemaless' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=schemaless&instanceDir=schemaless

{
  "responseHeader":{
    "status":0,
    "QTime":1652},
  "core":"schemaless"}

slkc:solr-6.6.0 jihwan$

The 'schemaless' directory contains several configuration files under a 'conf' directory. By default, the core is in schemaless mode, and a managed-schema file and a solrconfig.xml file are created. The generated solrconfig.xml doesn't declare a <schemaFactory> element, so Solr uses ManagedIndexSchemaFactory by default and reads the schema from the managed-schema file.

Initial Fields in the managed-schema
Right after the core is created, the managed-schema file contains only four <field> elements and one <copyField> element, along with many other elements.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Only enabled in the "schemaless" data-driven example (assuming the client
     does not know what fields may be searched) because it's very expensive to index everything twice. -->
<copyField source="*" dest="_text_"/>

When a document uses other fields, Solr determines each field's type from the Java class of its value and adds the field to the schema.
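As a rough illustration of that guessing step, the sketch below mimics in plain Java the value-class-to-field-type mapping that, as far as I know, the data-driven solrconfig.xml in Solr 6.x configures for AddSchemaFieldsUpdateProcessorFactory. The class name FieldTypeGuess and the method guess are made up for this example; this is not Solr's own code, only a simplified model of the mapping.

```java
import java.util.Date;

public class FieldTypeGuess {
    // Hypothetical helper: returns the field type name a schemaless core
    // would probably pick for a value of the given Java class.
    // The mapping is an assumption based on the Solr 6.x data-driven config.
    public static String guess(Object value) {
        if (value instanceof Boolean) return "booleans";
        if (value instanceof Date) return "tdates";
        if (value instanceof Long || value instanceof Integer) return "tlongs";
        if (value instanceof Number) return "tdoubles";
        return "strings"; // the configured default field type
    }

    public static void main(String[] args) {
        System.out.println("birthYear -> " + guess(1985));
        System.out.println("firstName -> " + guess("John"));
    }
}
```

Note that the guessed types are multi-valued ('tlongs', 'strings'), because Solr cannot know from one document whether a field will ever hold more than one value.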

Adding Documents
Using a SolrJ client application (shown in the next post), I added three million randomly generated documents. Below are the server-side log entries from the beginning and the end of the run. It took about 2 minutes 9 seconds.

2017-07-24 18:21:22.847 INFO  (qtp1205044462-19) 
   [   x:schemaless] o.a.s.s.ManagedIndexSchema Upgraded to managed schema at /Users/jihwan/devTools/solr-6.6.0/server/solr/schemaless/conf/managed-schema
2017-07-24 18:21:23.536 INFO  (qtp1205044462-19) [   x:schemaless] o.a.s.u.p.LogUpdateProcessorFactory 
   [schemaless]  webapp=/solr path=/update params={wt=javabin&version=2}{add=[dcbb6e1e-3ecc-44c3-a02c-bbc9d0725a63 
    (1573829196224921600), e8392adf-6d50-4030-ac58-eda5b49e5f0b (1573829196279447552), 3807603f-e9ea-43b9-8aa3-4274d001aabf 
    (1573829196281544704), 960167b4-9741-4cea-ace0-ace8874b732b (1573829196282593280), be500d86-177a-4bb9-874d-1bc97c8069ba 
    (1573829196283641856), b14ed131-c088-4675-ac4d-dd40227400dc (1573829196285739008), cfacbdc9-e8d0-4509-993b-f8021c2a7b04 
    (1573829196286787584), 103df904-f61b-4b96-a5d7-1239762a7a12 (1573829196287836160), 18d96163-170b-41b8-85f8-72e36e8bfeef 
    (1573829196289933312), 7c5e491e-f1cf-4d84-a044-57cc1c0ccf6d (1573829196289933313), ... (2000 adds)]} 0 775

2017-07-24 18:23:31.431 INFO  (qtp1205044462-22) [   x:schemaless] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-07-24 18:23:31.431 INFO  (qtp1205044462-22) [   x:schemaless] o.a.s.u.p.LogUpdateProcessorFactory 
   [schemaless]  webapp=/solr path=/update params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=} 0 114
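The log above shows update requests of 2000 documents each, followed by one hard commit at the end. The actual SolrJ client is in the next post; the sketch below only illustrates that batching pattern in plain Java. The class BatchIndexSketch, the generated values, and the sink callback are all hypothetical. In a real SolrJ client, the sink would presumably wrap SolrClient.add(...) with a collection of SolrInputDocument objects, and a single commit() would follow the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.UUID;
import java.util.function.Consumer;

public class BatchIndexSketch {
    static final int BATCH_SIZE = 2000; // matches "(2000 adds)" in the log

    // Builds one randomly generated user document; all values are invented.
    public static Map<String, Object> randomUser(Random rnd) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", UUID.randomUUID().toString());
        doc.put("firstName", "first" + rnd.nextInt(1000));
        doc.put("lastName", "last" + rnd.nextInt(1000));
        doc.put("birthYear", 1950 + rnd.nextInt(50));
        doc.put("companyName", "company" + rnd.nextInt(100));
        doc.put("state", "state" + rnd.nextInt(50));
        doc.put("permission", Arrays.asList("read", "write"));
        return doc;
    }

    // Sends `total` documents to `sink` in batches of BATCH_SIZE and
    // returns the number of batches sent (the last one may be smaller).
    public static int indexAll(int total, Consumer<List<Map<String, Object>>> sink) {
        Random rnd = new Random(42);
        List<Map<String, Object>> batch = new ArrayList<>(BATCH_SIZE);
        int batches = 0;
        for (int i = 0; i < total; i++) {
            batch.add(randomUser(rnd));
            if (batch.size() == BATCH_SIZE) {
                sink.accept(batch);
                batch = new ArrayList<>(BATCH_SIZE);
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch);
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        int sent = indexAll(10_000, b -> System.out.println("batch of " + b.size()));
        System.out.println(sent + " batches sent");
    }
}
```

Sending documents in batches and committing once at the end avoids one HTTP round trip and one commit per document.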

This is a query result.

After adding the documents, notice that the managed-schema file under the 'schemaless/conf' directory was modified. Among several changes, we find new fields that correspond to the fields used in the documents. Solr determined each document's fields and their types and updated the managed-schema file.

<field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="birthYear" type="tlongs"/>
<field name="companyName" type="strings"/>
<field name="firstName" type="strings"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="lastName" type="strings"/>
<field name="permission" type="strings"/>
<field name="state" type="strings"/>
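To see which fields ended up in the schema without reading the whole file by hand, one could parse it with the JDK's XML parser. This is only a sketch: the class SchemaFieldLister is made up for illustration, and the inline XML is a trimmed-down stand-in for the real managed-schema file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SchemaFieldLister {
    // Returns field name -> field type for every <field> element
    // (dynamicField and copyField elements are not matched).
    public static Map<String, String> listFields(String schemaXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(schemaXml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("field");
            Map<String, String> fields = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                fields.put(e.getAttribute("name"), e.getAttribute("type"));
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema XML", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the content of conf/managed-schema.
        String xml = "<schema name=\"example-data-driven-schema\">"
                + "<field name=\"birthYear\" type=\"tlongs\"/>"
                + "<field name=\"firstName\" type=\"strings\"/>"
                + "<copyField source=\"*\" dest=\"_text_\"/>"
                + "</schema>";
        listFields(xml).forEach((name, type) -> System.out.println(name + " : " + type));
    }
}
```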

With this default configuration, the index directory 'data' is 1.26 GB in size and contains 274 (segment) files.

Schema Mode

Although schemaless is the default mode and supports newer features such as modifying the schema through an API without restarting the Solr server, the Solr Reference Guide says "Schemaless has limitations. It is a bit brute force, and if it guesses wrong, you can't change much about a field without having to reindex" and "the Solr community does not recommend going to production without a schema that you have defined yourself".
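For reference, modifying a managed schema through the API means POSTing a JSON command such as "add-field" to the core's /schema endpoint. The sketch below builds such a command and shows how it might be sent with the JDK's HttpURLConnection. The class SchemaApiSketch and the field name 'nickName' are hypothetical, and the core URL assumes the local 'schemaless' core created earlier.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class SchemaApiSketch {
    // Builds the JSON body of an "add-field" command.
    // Assumes name and type are plain identifiers that need no JSON escaping.
    public static String addFieldCommand(String name, String type, boolean stored) {
        return "{\"add-field\":{\"name\":\"" + name + "\",\"type\":\"" + type
                + "\",\"stored\":" + stored + "}}";
    }

    // POSTs the command to <coreUrl>/schema and returns the HTTP status code.
    public static int post(String coreUrl, String json) throws Exception {
        HttpURLConnection con = (HttpURLConnection) new URL(coreUrl + "/schema").openConnection();
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "application/json");
        con.setDoOutput(true);
        try (OutputStream out = con.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return con.getResponseCode();
    }

    public static void main(String[] args) throws Exception {
        String json = addFieldCommand("nickName", "string", true);
        System.out.println(json);
        // Needs a running Solr with a core named 'schemaless':
        // System.out.println(post("http://localhost:8983/solr/schemaless", json));
    }
}
```

This only works while the core uses the managed schema; with ClassicIndexSchemaFactory the schema is read-only through the API.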

Developers can instead define the schema explicitly and control each field's type.

Create a core: bin/solr create -c schema
The created core is in schemaless mode.
Stop the server.
Rename 'server/solr/schema/conf/managed-schema' to 'server/solr/schema/conf/schema.xml'.
Open the schema.xml file and replace the existing four fields with these fields.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="firstName" type="string" indexed="true" stored="true" docValues="false" />
<field name="lastName" type="string" indexed="true" stored="true" docValues="false" />
<field name="birthYear" type="int" indexed="true" stored="true" docValues="false" />
<field name="companyName" type="string" indexed="true" stored="true" docValues="false" />
<field name="state" type="string" indexed="true" stored="true" docValues="false" />
<field name="permission" type="string" indexed="false" stored="true" multiValued="true" />    

<!-- Only enabled in the "schemaless" data-driven example (assuming the client
     does not know what fields may be searched) because it's very expensive to index everything twice. -->
<copyField source="*" dest="_text_"/>

I also changed the schema name from 'example-data-driven-schema' to 'user_schema'.
Open solrconfig.xml and add the schemaFactory element. ClassicIndexSchemaFactory is used for schema mode.

<schemaFactory class="ClassicIndexSchemaFactory"/>

Remove the 'AddSchemaFieldsUpdateProcessorFactory' processor element under the <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> element. This processor dynamically adds fields to the schema when an input document contains one or more fields that don't match any field or dynamic field in the schema, so it must be removed in schema mode. The "add-unknown-fields-to-the-schema" chain is applied to update requests through the <initParams path="/update/**"> element defined in solrconfig.xml.
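I made this change by hand in an editor. If you wanted to script it, a sketch along the following lines could remove the processor with the JDK's DOM API. The class RemoveAddSchemaFields is made up for illustration, the inline XML is a trimmed-down stand-in for the real chain in solrconfig.xml, and the serialized output may differ in whitespace from the original file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RemoveAddSchemaFields {
    static final String TARGET = "AddSchemaFieldsUpdateProcessorFactory";

    // Returns the given XML with every <processor> whose class ends in TARGET removed.
    public static String strip(String solrconfigXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(solrconfigXml)));
            NodeList processors = doc.getElementsByTagName("processor");
            // Collect first: removing nodes while walking a live NodeList skips entries.
            List<Element> toRemove = new ArrayList<>();
            for (int i = 0; i < processors.getLength(); i++) {
                Element p = (Element) processors.item(i);
                if (p.getAttribute("class").endsWith(TARGET)) {
                    toRemove.add(p);
                }
            }
            for (Element p : toRemove) {
                p.getParentNode().removeChild(p);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite solrconfig XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<config><updateRequestProcessorChain name=\"add-unknown-fields-to-the-schema\">"
                + "<processor class=\"solr.AddSchemaFieldsUpdateProcessorFactory\"/>"
                + "<processor class=\"solr.RunUpdateProcessorFactory\"/>"
                + "</updateRequestProcessorChain></config>";
        System.out.println(strip(xml));
    }
}
```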

Adding Documents with Schema Mode
The same SolrJ application was used to add 3 million documents. Based on the server logs, it took 2 minutes 3 seconds. Compared to schemaless mode (2 minutes 9 seconds), the execution time is not much different. The data directory size was 1.12 GB (compared to 1.26 GB with schemaless). The size difference is likely caused by the different field definitions rather than by the mode itself.

This is a query result.

Schemaless with Manually Added Fields
Schemaless mode doesn't require you to define your own fields manually, but you can. Create a core with the default schemaless mode, then manually add fields to the managed-schema file before you insert any document that uses them.

I added six fields, which are used in my User document, under the existing fields in the managed-schema file.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<field name="firstName" type="string" indexed="true" stored="true" docValues="false" />
<field name="lastName" type="string" indexed="true" stored="true" docValues="false" />
<field name="birthYear" type="int" indexed="true" stored="true" docValues="false" />
<field name="companyName" type="string" indexed="true" stored="true" docValues="false" />
<field name="state" type="string" indexed="true" stored="true" docValues="false" />
<field name="permission" type="strings" indexed="true" stored="true" />

<!-- Only enabled in the "schemaless" data-driven example (assuming the client
     does not know what fields may be searched) because it's very expensive to index everything twice. -->
<copyField source="*" dest="_text_"/>

Then, I ran the same SolrJ application to index 3 million documents. Execution time was 2 minutes 3 seconds, and the 'data' directory was 1.18 GB. Note that the managed-schema file was not modified after indexing.
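To put the three runs side by side, the small sketch below derives the elapsed time of the schemaless run from the two log timestamps shown earlier and converts each run into an approximate documents-per-second figure. The class RunComparison is made up for illustration; the durations and sizes are the ones reported in this post, measured once each on my machine, so treat the numbers as rough indications only.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class RunComparison {
    // Timestamp layout used at the start of each Solr log line.
    static final DateTimeFormatter LOG_TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Elapsed time between two log timestamps.
    public static Duration elapsed(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, LOG_TS),
                                LocalDateTime.parse(end, LOG_TS));
    }

    // Average indexing throughput over the whole run.
    public static double docsPerSecond(long docs, Duration d) {
        return docs / (d.toMillis() / 1000.0);
    }

    public static void main(String[] args) {
        long docs = 3_000_000L;
        // First and last log timestamps of the schemaless run.
        Duration schemaless = elapsed("2017-07-24 18:21:22.847", "2017-07-24 18:23:31.431");
        // The other two runs are only reported as "2 minutes 3 seconds".
        Duration schemaMode = Duration.ofMinutes(2).plusSeconds(3);
        Duration manualFields = Duration.ofMinutes(2).plusSeconds(3);

        System.out.printf("schemaless:               %.0f docs/s, 1.26 GB%n",
                docsPerSecond(docs, schemaless));
        System.out.printf("schema mode:              %.0f docs/s, 1.12 GB%n",
                docsPerSecond(docs, schemaMode));
        System.out.printf("schemaless + own fields:  %.0f docs/s, 1.18 GB%n",
                docsPerSecond(docs, manualFields));
    }
}
```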

Monday, July 24, 2017
