Friday, August 7, 2015

Solr: Import Data from DB and Performance Comparison

In my previous post, I described the configuration for importing data from a database and how to define the <entity> elements for multiple tables: a hierarchical <entity> element per table. Although this structure is the one explained in most Solr documentation, it is a very slow process.

Using all the settings explained in the previous post, let's run a full import from the database and create the indexes on the Solr server using the Solr web UI.

1.  Open the Solr web UI, e.g. http://localhost:8983/solr/#/ --> Select the core "addressDB" --> Click "Dataimport" --> Make sure the "full-import" command is selected.


2.  Click the "Execute" button.  You may also click the "Refresh Status" button whenever you want to see the updated status.  This is the result of the data import using all the configurations from the previous post.
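If you prefer not to use the web UI, the same DataImportHandler commands can be issued directly over HTTP (assuming the core name and default handler path used above):

```
# trigger a full import
http://localhost:8983/solr/addressDB/dataimport?command=full-import

# check the import status (what "Refresh Status" calls behind the scenes)
http://localhost:8983/solr/addressDB/dataimport?command=status
```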


All of the data was imported, but it took 27 m 45 s for 6,071,307 documents.  That is very slow, especially considering that we imported the same data within 410 seconds using the SolrJ client in a previous post.
The numbers of 'Requests' and 'Fetched' are very large, and the number of fetched rows is double the 'Processed' count.  In fact, DIH sub-entities behave more like subqueries in a database query: the child entity's query is executed once for every row returned by the parent entity.
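For reference, the hierarchical configuration looks roughly like the sketch below (column lists are abbreviated and illustrative, based on the tables used later in this post).  The child <entity> query runs once per parent row, which explains the request count in the millions:

```xml
<document>
  <entity name="address" dataSource="addressDB"
          query="SELECT address_db_id, area_code, postal_code, state_id FROM address_db">
    <field column="area_code" name="areaCode" />
    <field column="postal_code" name="postalCode" />
    <!-- this sub-query is issued once per parent row: ~6 million extra requests -->
    <entity name="state" dataSource="addressDB"
            query="SELECT state, state_en FROM address_state
                   WHERE address_state_id = '${address.state_id}'">
      <field column="state" name="state" />
      <field column="state_en" name="state_en" />
    </entity>
  </entity>
</document>
```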

- Different structure of the data-config.xml and Performance Comparison
Instead of using a hierarchical <entity> structure, let's use a single <entity> with a join SQL statement as the value of the query attribute.
Every setting/configuration is the same as the one shown in the previous post except the following <entity> definition in the data-config.xml file.

<document>
  <entity name="address" transformer="script:getFullName" dataSource="addressDB"
          query="SELECT address_db_id, area_code, state, state_en, city,
                        city_en, sub_city, sub_city_en, street_name,
                        street_name_en, building_num, building_num_sub, bulk_delivery_place_name, building_name,
                        legal_dong_name, ri_name, admin_dong_name,
                        ground_num, ground_num_sub, dong_seq, postal_code
                 FROM address_db, address_state
                 WHERE state_id = address_state_id">
    <field column="address_db_new_id" name="addressId" />
    <field column="area_code" name="areaCode" />
    <field column="postal_code" name="postalCode" />
    <field column="city" name="city" />
    <field column="city_en" name="city_en" />
    <field column="sub_city" name="subCity" />
    <field column="sub_city_en" name="subCity_en" />
    <field column="street_name" name="streetName" />
    <field column="street_name_en" name="streetName_en" />
    <field column="building_number" name="buildingNumber" />
    <field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName" />
    <field column="building_name" name="buildingName" />
    <field column="legal_dong_name" name="dongName" />
    <field column="admin_dong_name" name="adminDongName" />
    <field column="ri_name" name="riName" />
    <field column="ground_number" name="groundNumber" />
    <field column="dong_seq" name="dongSeq" />
    <field column="state" name="state" />
    <field column="state_en" name="state_en" />
  </entity>
</document>

Then, run the full import again.  (To remove the existing data, simply delete the 'data' directory under the 'addressDB' directory, then restart Solr.)

This is the result of the full import: 4 m 43 s (283 seconds) for the same data, with only 1 request!

Thing to Consider:  Running one (joined) query in a single <entity> element generally performs much better, but we should consider the total data size to be processed on the database and during the data transfer.  When the data set is very large, the join can consume a lot of memory on the database server, and with a very large transfer it can take a long time before Solr receives and processes the first rows.
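One knob worth knowing on the client side is the batchSize attribute of JdbcDataSource in data-config.xml, which controls the JDBC fetch size (it helps with the import process's memory use, not the database server's).  For MySQL, batchSize="-1" makes the handler stream rows instead of buffering the whole result set.  The connection details below are illustrative:

```xml
<dataSource name="addressDB" type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/addressdb"
            user="solr" password="solr"
            batchSize="-1" />
```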

Comment: During data updates, including imports, creating/reopening a Searcher and the warm-up process are important concepts for Solr search.  Unless mentioned specifically, I used the default configuration throughout.  For example, we briefly talked about the 'commit' method using SolrJ in a previous post.  During data import, the following configurations in solrconfig.xml play a role.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

The default value of solr.autoSoftCommit.maxTime is 3000 (ms), defined in the $SOLR_HOME/bin/solr.cmd file.  solr.autoCommit.maxTime is not defined there, so the fallback of 15000 (ms) from solrconfig.xml is used.
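Since both values are plain system properties, they can be overridden at startup without editing solrconfig.xml, for example in solr.in.sh (or solr.in.cmd on Windows).  The values below are illustrative:

```
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoCommit.maxTime=60000"
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=3000"
```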

This web page is a good reference for soft commits and hard commits.

