
Monday, July 24, 2017

Solr 6: Schemaless and Schema Mode

The Solr schema is where we tell Solr how to build indexes from input documents.  The default mode since Solr 5 is schemaless, which lets users/developers construct a schema without having to manually edit a schema file.

In this post, I will show how to create a core in schemaless mode and add documents, and point out a few differences between schema mode and schemaless mode in a stand-alone Solr.

Creating a Core
This command starts Solr in the foreground with a 1 GB heap.
   bin/solr start -f -m 1g

Now, let's create a core named 'schemaless':
slkc:solr-6.6.0 jihwan$ pwd
/Users/jihwan/devTools/solr-6.6.0
slkc:solr-6.6.0 jihwan$ bin/solr create -c schemaless

Copying configuration to new core instance directory:
/Users/jihwan/devTools/solr-6.6.0/server/solr/schemaless

Creating new core 'schemaless' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=schemaless&instanceDir=schemaless

{
  "responseHeader":{
    "status":0,
    "QTime":1652},
  "core":"schemaless"}

slkc:solr-6.6.0 jihwan$ 

The 'schemaless' core directory contains several configuration files under a 'conf' directory.  By default, this core is in schemaless mode, and a managed-schema file and a solrconfig.xml file are created.  The generated solrconfig.xml doesn't declare a <schemaFactory> element, so ManagedIndexSchemaFactory is used by default for schemaless mode.  In this case, the managed-schema file is used.

Initial Fields in the managed-schema
Right after the core is created, the managed-schema file contains only four <field> elements and one <copyField> element, along with many other elements.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Only enabled in the "schemaless" data-driven example (assuming the client
     does not know what fields may be searched) because it's very expensive to index everything twice. -->
<copyField source="*" dest="_text_"/>

When a document contains other fields, their field types are guessed from the Java classes of the field values, and the fields are added to the schema.

Adding Documents
Using the SolrJ client application shown in the next post, three million randomly generated documents are added.  Below is the server-side log at the beginning of the addition and the last part of the log.  Indexing took about 2 minutes 9 seconds.
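As a rough idea of what such a client looks like, here is a minimal SolrJ sketch; the field values are invented for illustration, and the actual application, shown in the next post, may differ in detail:

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SchemalessLoader {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/schemaless").build()) {
            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 3_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", UUID.randomUUID().toString());
                doc.addField("firstName", "First" + i);
                doc.addField("lastName", "Last" + i);
                doc.addField("birthYear", 1950 + (i % 60));
                doc.addField("companyName", "Company" + (i % 1000));
                doc.addField("state", "CO");
                doc.addField("permission", "read");
                batch.add(doc);
                if (batch.size() == 2000) {   // 2000 docs per update request, as in the log below
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);            // send the remaining documents
            }
            client.commit();                  // one explicit hard commit at the end
        }
    }
}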

2017-07-24 18:21:22.847 INFO  (qtp1205044462-19) 
   [   x:schemaless] o.a.s.s.ManagedIndexSchema Upgraded to managed schema at /Users/jihwan/devTools/solr-6.6.0/server/solr/schemaless/conf/managed-schema
2017-07-24 18:21:23.536 INFO  (qtp1205044462-19) [   x:schemaless] o.a.s.u.p.LogUpdateProcessorFactory 
   [schemaless]  webapp=/solr path=/update params={wt=javabin&version=2}{add=[dcbb6e1e-3ecc-44c3-a02c-bbc9d0725a63 
    (1573829196224921600), e8392adf-6d50-4030-ac58-eda5b49e5f0b (1573829196279447552), 3807603f-e9ea-43b9-8aa3-4274d001aabf 
    (1573829196281544704), 960167b4-9741-4cea-ace0-ace8874b732b (1573829196282593280), be500d86-177a-4bb9-874d-1bc97c8069ba 
    (1573829196283641856), b14ed131-c088-4675-ac4d-dd40227400dc (1573829196285739008), cfacbdc9-e8d0-4509-993b-f8021c2a7b04 
    (1573829196286787584), 103df904-f61b-4b96-a5d7-1239762a7a12 (1573829196287836160), 18d96163-170b-41b8-85f8-72e36e8bfeef 
    (1573829196289933312), 7c5e491e-f1cf-4d84-a044-57cc1c0ccf6d (1573829196289933313), ... (2000 adds)]} 0 775

2017-07-24 18:23:31.431 INFO  (qtp1205044462-22) [   x:schemaless] o.a.s.u.DirectUpdateHandler2 end_commit_flush
2017-07-24 18:23:31.431 INFO  (qtp1205044462-22) [   x:schemaless] o.a.s.u.p.LogUpdateProcessorFactory 
   [schemaless]  webapp=/solr path=/update params={waitSearcher=true&commit=true&softCommit=false&wt=javabin&version=2}{commit=} 0 114

This is a query result.
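For comparison, the same kind of query can also be issued from SolrJ; a minimal sketch, assuming the 'schemaless' core created above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SchemalessQuery {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/schemaless").build()) {
            SolrQuery query = new SolrQuery("*:*");   // match all documents
            query.setRows(3);                         // only print a few rows
            QueryResponse response = client.query(query);
            System.out.println("numFound = " + response.getResults().getNumFound());
            response.getResults().forEach(System.out::println);
        }
    }
}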

After adding the documents, notice that the managed-schema file under the '/schemaless/conf' directory was modified.  Among several modifications, we find new fields corresponding to the fields used in our documents.  Solr determined each field's type and updated the managed-schema file.

<field name="_root_" type="string" docValues="false" indexed="true" stored="false"/>
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="birthYear" type="tlongs"/>
<field name="companyName" type="strings"/>
<field name="firstName" type="strings"/>
<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="lastName" type="strings"/>
<field name="permission" type="strings"/>
<field name="state" type="strings"/>

With this default configuration, the size of the index directory 'data' is 1.26 GB, with 274 (segment) files.

Schema Mode
Although schemaless is the default mode and supports features such as modifying the schema through an API without restarting the Solr server, the Solr Reference Guide in fact says: "Schemaless has limitations. It is a bit brute force, and if it guesses wrong, you can't change much about a field without having to reindex", and "the Solr community does not recommend going to production without a schema that you have defined yourself".
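As an illustration of that API, a new field can be added to a managed schema at runtime through the Schema API.  This is only a sketch using SolrJ's SchemaRequest helpers; the 'nickName' field is a made-up example:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.schema.SchemaRequest;

public class AddFieldExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/schemaless").build()) {
            // Field definition as a map of attribute name -> value
            Map<String, Object> fieldDef = new LinkedHashMap<>();
            fieldDef.put("name", "nickName");
            fieldDef.put("type", "string");
            fieldDef.put("indexed", true);
            fieldDef.put("stored", true);
            // POSTs to the core's /schema endpoint; works only with a mutable managed schema
            new SchemaRequest.AddField(fieldDef).process(client);
        }
    }
}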

Developers can explicitly define the schema and control each field type.
  1. Create a core :  bin/solr create -c schema
    The created core starts in schemaless mode.
  2. Stop the server.
  3. Rename 'server/solr/schema/conf/managed-schema' to 'server/solr/schema/conf/schema.xml'.
  4. Open the schema.xml file and replace the existing four fields with these fields:

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
    <field name="_version_" type="long" indexed="false" stored="false"/>
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <field name="firstName" type="string" indexed="true" stored="true" docValues="false" />
    <field name="lastName" type="string" indexed="true" stored="true" docValues="false" />
    <field name="birthYear" type="int" indexed="true" stored="true" docValues="false" />
    <field name="companyName" type="string" indexed="true" stored="true" docValues="false" />
    <field name="state" type="string" indexed="true" stored="true" docValues="false" />
    <field name="permission" type="string" indexed="false" stored="true" multiValued="true" />

    <!-- Only enabled in the "schemaless" data-driven example (assuming the client
         does not know what fields may be searched) because it's very expensive to index everything twice. -->
    <copyField source="*" dest="_text_"/>

  5. I also changed the schema name from 'example-data-driven-schema' to 'user_schema'.
  6. Open solrconfig.xml and add the schemaFactory element; ClassicIndexSchemaFactory is used for schema mode:

    <schemaFactory class="ClassicIndexSchemaFactory"/>

  7. Remove the 'AddSchemaFieldsUpdateProcessorFactory' processor element under the <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> element.  This processor dynamically adds fields to the schema when an input document contains one or more fields that don't match any field or dynamic field in the schema, so it must be removed in schema mode (see the sketch after this list).  The "add-unknown-fields-to-the-schema" chain is applied through the <initParams path="/update/**"> element defined in solrconfig.xml.
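Once ClassicIndexSchemaFactory is in place, a document containing an undeclared field should be rejected instead of silently extending the schema.  A hypothetical quick check in SolrJ (the 'nickName' field is made up and not declared in schema.xml):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class UnknownFieldCheck {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/schema").build()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "unknown-field-test");
            doc.addField("nickName", "Jay");   // not declared in schema.xml
            try {
                client.add(doc);
                client.commit();
            } catch (Exception e) {
                // Expected: an "unknown field" style error from the server
                System.out.println("Rejected: " + e.getMessage());
            }
        }
    }
}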
Adding Documents with Schema Mode
The same SolrJ application is used to add 3 million documents.  Based on the server logs, it took 2 minutes 3 seconds.  Compared to the schemaless mode (2 min 9 s), the execution time is not much different.  The data directory size was 1.12 GB (compared to 1.26 GB with schemaless); the size difference is caused mainly by the different field definitions.

This is a query result.

Schemaless with Manually Added Fields
Schemaless mode doesn't require you to define your own fields manually, but you can.  Create a core with the default schemaless mode, then manually add fields to the managed-schema file before inserting any document that uses them.

I added six fields, which are used in my User document, under the existing fields in the managed-schema file.

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<!-- doc values are enabled by default for primitive types such as long so we don't index the version field  -->
<field name="_version_" type="long" indexed="false" stored="false"/>
<field name="_root_" type="string" indexed="true" stored="false" docValues="false" />
<field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>

<field name="firstName" type="string" indexed="true" stored="true" docValues="false" />
<field name="lastName" type="string" indexed="true" stored="true" docValues="false" />
<field name="birthYear" type="int" indexed="true" stored="true" docValues="false" />
<field name="companyName" type="string" indexed="true" stored="true" docValues="false" />
<field name="state" type="string" indexed="true" stored="true" docValues="false" />
<field name="permission" type="strings" indexed="true" stored="true" />

<!-- Only enabled in the "schemaless" data-driven example (assuming the client
     does not know what fields may be searched) because it's very expensive to index everything twice. -->
<copyField source="*" dest="_text_"/>

Then, run the same SolrJ application to index 3 million documents.  Execution time was 2 minutes 3 seconds, with a 'data' directory size of 1.18 GB.  You will also notice that the managed-schema file was not modified during indexing.

Friday, August 7, 2015

Solr: Import Data from DB and Performance Comparison

In my previous post, I described the setup for importing data from a database and how to define <entity> elements over multiple tables: a hierarchical <entity> element per table.  Although this structure is explained in most Solr documentation, it is a very slow process.

Using all the settings explained in the previous post, let's run a full import from the database and create indexes on the Solr server using the Solr web UI.

1.  Open the Solr web UI (at http://localhost:8983/solr/#/, for example) --> Select the core "addressDB" --> Click Dataimport --> Make sure the "full-import" command is selected.


2.  Click the "Execute" button.  You may also click the "Refresh Status" button whenever you want to see an updated status.  This is the result of the data import using all configurations from the previous post.
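As a side note, the same full-import command can be triggered programmatically instead of through the UI; a minimal SolrJ sketch, assuming the /dataimport handler configured in the previous post (shown with SolrJ 6 syntax; the SolrJ 4.10-era client class was HttpSolrServer):

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerFullImport {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/addressDB").build()) {
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "full-import");
            QueryRequest request = new QueryRequest(params);
            request.setPath("/dataimport");   // send to the DIH handler, not /select
            System.out.println(client.request(request));
        }
    }
}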


All data was imported, but it took 27 m 45 s for 6,071,307 rows.  That is very slow, especially after we imported the same data within 410 seconds using a SolrJ client in a previous post.
The numbers of 'Requests' and 'Fetched' are very large, and the 'Fetched' count is double the 'Processed' count.  In fact, these DIH joins behave more like subqueries in a database query: the inner <entity> query is issued once for every row of the outer query.

- Different Structure of the data-config.xml and Performance Comparison
Instead of using a hierarchical <entity> structure, let's use a single <entity> with a join SQL as the value of its query attribute.
Every setting is the same as in the previous post except the following <entity> definition in the data-config.xml file.

<document>
  <entity name="address" transformer="script:getFullName" dataSource="addressDB"
          query="SELECT address_db_id, area_code, state, state_en, city,
                        city_en, sub_city, sub_city_en, street_name,
                        street_name_en, building_num, building_num_sub,
                        bulk_delivery_place_name, building_name,
                        legal_dong_name, ri_name, admin_dong_name,
                        ground_num, ground_num_sub, dong_seq, postal_code
                 FROM address_db, address_state
                 WHERE state_id = address_state_id">
    <field column="address_db_id" name="addressId" />
    <field column="area_code" name="areaCode" />
    <field column="postal_code" name="postalCode" />
    <field column="city" name="city" />
    <field column="city_en" name="city_en" />
    <field column="sub_city" name="subCity" />
    <field column="sub_city_en" name="subCity_en" />
    <field column="street_name" name="streetName" />
    <field column="street_name_en" name="streetName_en" />
    <field column="building_number" name="buildingNumber" />
    <field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName"/>
    <field column="building_name" name="buildingName"/>
    <field column="legal_dong_name" name="dongName"/>
    <field column="admin_dong_name" name="adminDongName"/>
    <field column="ri_name" name="riName"/>
    <field column="ground_number" name="groundNumber"/>
    <field column="dong_seq" name="dongSeq"/>
    <field column="state" name="state" />
    <field column="state_en" name="state_en" />
  </entity>
</document>

Then, run the full import again.  (To remove the existing data, simply delete the 'data' directory under the 'addressDB' directory, then restart Solr.)

This is the result of the full import: 4 m 43 s (283 seconds) for the same data, with only 1 request!

Thing to Consider:  Running one (joined) query in a single <entity> element generally performs much better, but we should consider the total amount of data processed on the database and transferred.  When the data set is very large, the join may take a lot of memory on the database server, and with a very large transfer it can take a long time for Solr to receive and process the first rows.

Comment: During data updates, including imports, creating/recreating a Searcher and the warm-up process are important concepts for Solr search.  Unless mentioned otherwise, I used the default configuration.  For example, we briefly talked about the 'commit' method using SolrJ in a previous post.  During a data import, the following configurations in solrconfig.xml play a role.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

The default value of solr.autoSoftCommit.maxTime is 3000 (ms), defined in the $SOLR_HOME/bin/solr.cmd file.  solr.autoCommit.maxTime is not defined there, so 15000 (ms) is used.
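For completeness, explicit commits can also be issued from SolrJ; a small sketch showing the difference between the two commit kinds (the three boolean arguments are waitFlush, waitSearcher, and softCommit; shown with SolrJ 6 syntax):

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class CommitExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/addressDB").build()) {
            // Soft commit: makes new documents visible to searchers quickly,
            // but does not flush segments to stable storage.
            client.commit(true, true, true);
            // Hard commit: flushes segments to disk and truncates the transaction log.
            client.commit(true, true, false);
        }
    }
}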

This web page is a good reference for the soft commit and hard commit.


Tuesday, August 4, 2015

Solr: Setting to Import Data from Database

We often have data in a relational database and want to use Solr's search capability over it.  In this case, we need to import the data from the database tables and build Solr indexes.  In this post, I will describe the basic procedure for using Solr's DataImportHandler.

I will show the settings needed for importing data from a database in this post, and discuss data-import performance in the next post.

Before you go further, I assume you have completed the Solr installation and created a core as shown in my previous post.


1.  To use the data import handler, we need the necessary jar files: the Solr data import jars and the JDBC driver for your database.  You can choose the directory where you store them.  I use a lib/dih directory under the core created in my previous post, so mine is the $SOLR_HOME/example/solr/addressDB/lib/dih directory.

The Solr jar files are located in the $SOLR_HOME/dist directory.  Let's copy the solr-dataimporthandler-4.10.2.jar and solr-dataimporthandler-extras-4.10.2.jar files to the lib/dih directory.

I am using PostgreSQL, so I also copied postgresql-9.3-1101.jdbc41.jar to the lib/dih directory.

2.  Add these jar files to Solr's classpath.  To do this, open solrconfig.xml and add the following line in front of the existing <lib .... /> list.

   <lib dir="./lib/dih" regex=".*\.jar" />

3.  Add data import handler capability to Solr by adding the following to the solrconfig.xml file.  It tells Solr that the necessary configuration is specified in the data-config.xml file shown in step 4.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="indent">false</str>
  </lst>
</requestHandler>

I added this after the </updateHandler> tag.

Before we talk about how to write the DB queries, let me describe the design of the address database tables.  These tables are not well designed; the purpose is simply to have more than one table, so I can show how performance differs depending on how the DB queries for the data import are written.

I assume all data has been added to the DB tables using the text data files in the 'data/addressDB' directory.

Table 'address_state' on PostgreSQL
   CREATE TABLE IF NOT EXISTS ADDRESS_STATE
   (
     address_state_id bigint NOT NULL 
                 DEFAULT nextval('ADDRESS_STATE_ID_SEQ'::regclass),
     state character varying(30) NOT NULL,
     state_en character varying(30),
     create_time timestamp with time zone NOT NULL 
                                      DEFAULT current_timestamp(0),
     update_time timestamp with time zone,
     CONSTRAINT PK_ADDRESS_STATE_ID PRIMARY KEY (address_state_id),
     CONSTRAINT UK_ADDRESS_STATE UNIQUE (state)
   )

Table 'address_db' on PostgreSQL

   CREATE TABLE IF NOT EXISTS ADDRESS_DB
   (
     address_db_id bigint NOT NULL DEFAULT 
                          nextval('ADDRESS_DB_ID_SEQ'::regclass),
     area_code character(5) NOT NULL,
     state_id bigint NOT NULL,
     city character varying(30),
     city_en character varying(30),
     sub_city character varying(30),
     sub_city_en character varying(30),
     street_code bigint,
     street_name character varying(30) NOT NULL,
     street_name_en character varying(50),
     is_basement boolean default false,
     building_num smallint,
     building_num_sub smallint,
     building_mgm_num character varying(50),
     bulk_delivery_place_name character varying(30),
     building_name character varying(30),
     legal_dong_code bigint,
     dong_name character varying(30),
     ri_name character varying(30),
     is_mountain boolean default false,
     ground_num smallint,
     dong_seq smallint,
     ground_num_sub smallint,
     postal_code bigint NOT NULL,
     postal_code_seq character varying(10),
     create_time timestamp with time zone NOT NULL 
                                  DEFAULT current_timestamp(0),
     update_time timestamp with time zone,
     CONSTRAINT PK_ADDRESS_DB_ID PRIMARY KEY (address_db_id),
     CONSTRAINT FK_ADDRESS_STATE_ID FOREIGN KEY (state_id)
                    REFERENCES ADDRESS_STATE (address_state_id)
   )


Basically, the address_db table has a foreign key 'state_id' referencing 'address_state_id' in the address_state table.


4.  In step 3, we specified a configuration file named 'data-config.xml', so we need this file under the 'conf' directory where the solrconfig.xml file is located.  The data-config.xml is shown below; let's go over a few important points.

   4-1.  <dataSource> tag:  defines the DB connection information.
   4-2.  <script> tag and transformer="script:getFullName" in the first <entity> element:  we can define a JavaScript function that is called for each row before the fields of a document are constructed.  This function manipulates each data row to build the extra fields we need.
   4-3.  <entity> tags:  the <entity> structure shown here is the one explained in many documents, including the Solr Reference Guide: an outer <entity> has a child <entity> inside it, and this is how people join two tables (see how the 'where' clause in the inner <entity> is used).  I follow that approach here, but this structure is unfortunately very slow.  I will describe a faster version in the next post.

This is all for the basic setup to run the data import handler.  All changes to solrconfig.xml shown in this post were added to the file in xmlFils/SolrConfig, but they are commented out; you need to uncomment them.

======================= data-config.xml file ==============================
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="addressDB" type="JdbcDataSource"
              driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5555/address"
              user="youruser" password="yourpass"/>
  <script><![CDATA[
    function getFullName(row) {
      var grdNum = row.get('ground_num');
      var grdNumSub = row.get('ground_num_sub');
      var bldNum = row.get('building_num');
      var bldNumSub = row.get('building_num_sub');
      var grdNumber = '';
      var bldNumber = '';
      if (grdNum) {
        grdNumber = grdNum;
        if (grdNumSub) {
          grdNumber = grdNumber + '-' + grdNumSub;
        }
      }
      if (bldNum) {
        bldNumber = bldNum;
        if (bldNumSub) {
          bldNumber = bldNumber + '-' + bldNumSub;
        }
      }
      row.put('building_number', bldNumber);
      row.put('ground_number', grdNumber);
      return row;
    }
  ]]></script>
  <document>
    <entity name="address" transformer="script:getFullName" dataSource="addressDB"
            query="SELECT address_db_new_id, area_code, state_id, city, city_en,
                          sub_city, sub_city_en, street_name, street_name_en,
                          building_num, building_num_sub, bulk_delivery_place_name,
                          building_name, legal_dong_name, ri_name, admin_dong_name,
                          ground_num, ground_num_sub, dong_seq, postal_code
                   FROM address_db_new">
      <field column="address_db_new_id" name="addressId" />
      <field column="area_code" name="areaCode" />
      <field column="postal_code" name="postalCode" />
      <field column="city" name="city" />
      <field column="city_en" name="city_en" />
      <field column="sub_city" name="subCity" />
      <field column="sub_city_en" name="subCity_en" />
      <field column="street_name" name="streetName" />
      <field column="street_name_en" name="streetName_en" />
      <field column="building_number" name="buildingNumber" />
      <field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName"/>
      <field column="building_name" name="buildingName"/>
      <field column="legal_dong_name" name="dongName"/>
      <field column="admin_dong_name" name="adminDongName"/>
      <field column="ri_name" name="riName"/>
      <field column="ground_number" name="groundNumber"/>
      <field column="dong_seq" name="dongSeq"/>
      <entity name="state" dataSource="addressDB"
              query="SELECT * FROM address_state_new
                     WHERE address_state_new_id = ${address.state_id}">
        <field column="state" name="state" />
        <field column="state_en" name="state_en" />
      </entity>
    </entity>
  </document>
</dataConfig>

