I will show the setting needed for data import from a database on this post, and talk about the performance of data import on a next post.
Before you go further, I assume you completed the Solr installation and created a core as shown on my previous post.
1. To use a data import handler, we need to get necessary jar files: Solr data import jar files and JDBC driver for your database. You can decide a directory where you want to store the jar files. I will have a lib/dih directory under the core we created on my previous post. So, mine is a $SOLR_HOME/example/solr/addressDB/lib/dih directory.
Solr jar files are located at a $SOLR_HOME/dist directory. Let's copy solr-dataimporthandler-4.10.2.jar and solr-dataimporthandler-extras-4.10.2.jar files to the lib/dih directory.
2. Add these jar files to a Solr's classpath. To do this, open the solrconfig.xml and add the following line in front of the existing <lib .... /> list.
<lib dir="./lib/dih" regex=".*\.jar" />
3. Add data import handler capability to the Solr by adding the following to the solrconfig.xml file. It tells that necessary configurations are specified in the data-config.xml shown on the step 4.<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler"> <lst name="defaults"> <str name="config">data-config.xml</str> <str name="indent">false</str> </lst> </requestHandler>
Before we talk about how to write DB queries, let me talk about design of address database tables. This table design is not well designed, but the purpose is simply having a more than one table. So, I can show you the performance difference depends on how to write DB queries for the data import.
I assume all data has been added to the DB tables using the text data files in the 'data/addressDB'
Table 'address_state' on PostgreSQL
CREATE TABLE IF NOT EXISTS ADDRESS_STATE ( address_state_id bigint NOT NULL DEFAULT nextval('ADDRESS_STATE_ID_SEQ'::regclass), state character varying(30) NOT NULL, state_en character varying(30), create_time timestamp with time zone NOT NULL DEFAULT current_timestamp(0), update_time timestamp with time zone, CONSTRAINT PK_ADDRESS_STATE_ID PRIMARY KEY (address_state_id), CONSTRAINT UK_ADDRESS_STATE UNIQUE (state) ) |
Table 'address_db' on PostgreSQL
CREATE TABLE IF NOT EXISTS ADDRESS_DB ( address_db_id bigint NOT NULL DEFAULT nextval('ADDRESS_DB_ID_SEQ'::regclass), area_code character(5) NOT NULL, state_id bigint NOT NULL, city character varying(30), city_en character varying(30), sub_city character varying(30), sub_city_en character varying(30), street_code bigint, street_name character varying(30) NOT NULL, street_name_en character varying(50), is_basement boolean default false, building_num smallint, building_num_sub smallint, building_mgm_num character varying(50), bulk_delivery_place_name character varying(30), building_name character varying(30), legal_dong_code bigint, dong_name character varying(30), ri_name character varying(30), is_mountain boolean default false, ground_num smallint, dong_seq smallint, ground_num_sub smallint, postal_code bigint NOT NULL, postal_code_seq character varying(10), create_time timestamp with time zone NOT NULL DEFAULT current_timestamp(0), update_time timestamp with time zone, CONSTRAINT PK_ADDRESS_DB_ID PRIMARY KEY (address_db_id), CONSTRAINT FK_ADDRESS_STATE_ID FOREIGN KEY (state_id) REFERENCES ADDRESS_STATE (address_state_id) ) |
Basically, address_db table has a foreign key 'state_id' referenced to an 'address_state_id' on the address_state table.
4. On the step 3, we just specified a configuration file named 'data-config.xml'. So, we need this file under the 'conf' directory where the solrconfig.xml file is located at. The data-config.xml is shown below and let's talk about a few important things.
4-1. <dataSource> tag: define DB connection information.
4-2. <script> tag and transformer="script:getFullName" in the first <entity> element: We can define a javascript function that can be called for each entity data before constructing each field of a data. This function is to manipulate each data row to make necessary fields we need.
4-3. <entity> tags: How to construct <entity> structures shown here is explained on many documents including Solr reference guide: Outer <entity> has another child <entity> inside the outer <entity> element, and this is how people join two tables (See how 'where' statement in the inner <entity> is used). I will follow this step here, but this structure is unfortunately very slow. I will describe a faster version on a next post.
This is all about the basic setting to run a data import handler. All changes to the solrconfig.xml shown on this post were added to the file in the xmlFils/SolrConfig but they are commented out. You need to uncomment them.
======================= data-config.xml file ==============================
<?xml version="1.0" encoding="UTF-8" ?> <dataConfig> <dataSource name="addressDB" type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5555/address" user="youruser" password="yourpass"/> <script><![CDATA[ function getFullName(row){ var grdNum = row.get('ground_num'); var grdNumSub = row.get('ground_num_sub'); var bldNum = row.get('building_num'); var bldNumSub = row.get('building_num_sub'); var grdNumber = ''; var bldNumber = ''; if(grdNum){ grdNumber = grdNum; if(grdNumSub){ grdNumber = grdNumber + '-' + grdNumSub; }
}
if(bldNum){
bldNumber = bldNum; if(bldNumSub){ bldNumber = bldNumber + '-' + bldNumSub; } } row.put('building_number', bldNumber); row.put('ground_number', grdNumber); return row; } ]]> </script> <document> <entity name="address" transformer="script:getFullName" dataSource="addressDB"
query="SELECT address_db_new_id, area_code, state_id, city, city_en,
sub_city, sub_city_en, street_name, street_name_en,
building_num, building_num_sub, bulk_delivery_place_name,
building_name, legal_dong_name, ri_name, admin_dong_name,
ground_num, ground_num_sub, dong_seq, postal_code
FROM address_db_new">
<field column="address_db_new_id" name="addressId" />
<field column="area_code" name="areaCode" />
<field column="postal_code" name="postalCode" />
<field column="city" name="city" />
<field column="city_en" name="city_en" />
<field column="sub_city" name="subCity" />
<field column="sub_city_en" name="subCity_en" />
<field column="street_name" name="streetName" />
<field column="street_name_en" name="streetName_en" />
<field column="building_number" name="buildingNumber" />
<field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName"/>
<field column="building_name" name="buildingName"/>
<field column="legal_dong_name" name="dongName"/>
<field column="admin_dong_name" name="adminDongName"/>
<field column="ri_name" name="riName"/>
<field column="ground_number" name="groundNumber"/>
<field column="dong_seq" name="dongSeq"/>
<entity name="state" dataSource="addressDB"
query="SELECT * FROM address_state_new
WHERE address_state_new_id = ${address.state_id}">
<field column="state" name="state" />
<field column="state_en" name="state_en" />
</entity>
</document> </dataConfig>
</document> </dataConfig>
No comments:
Post a Comment