Friday, September 25, 2015

Solr: Text Analysis, Searching, Default Field and Operators (Incomplete)

** This post is NOT completed **

In an earlier post, I showed a basic schema.xml file, and the later posts are based on that schema definition.  That schema definition was too simple to support all the features defined in solrconfig.xml.

You may notice the following error when you start the Solr server, although search queries still work as long as each field name is specified in the query.



Let's talk about this error first.

- What does "undefined field text" mean, and why does it occur?
It means that a field named 'text' is not defined in schema.xml.  Then why the specific name 'text'? Is it a required field name?
Let's open solrconfig.xml and search for 'text'.  You will find the following.  By the way, request handlers are the entry points for all requests to Solr, and the definition below handles '/select' queries.

  <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">text</str>
     </lst>
  </requestHandler>

You will also see several similar definitions in solrconfig.xml.  The field 'text' is the default field (the 'df' parameter), which is used when a query does not specify a field name.
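
To make the role of the default field concrete, here is a minimal sketch of how a term without a field name ends up being searched in the 'df' field. The function effectiveQuery is hypothetical and is not Solr's actual query parser; it only handles a single term that may or may not carry a "field:" prefix.

```javascript
// Simplified model of default-field resolution (NOT Solr's real parser).
// Assumption: the query is a single term, optionally prefixed with "field:".
function effectiveQuery(q, df) {
  // A leading identifier followed by ':' means the field is already given.
  const hasField = /^[A-Za-z_][A-Za-z0-9_]*:/.test(q);
  return hasField ? q : df + ':' + q;
}

console.log(effectiveQuery('Seoul', 'text'));          // text:Seoul
console.log(effectiveQuery('state_en:Seoul', 'text')); // state_en:Seoul
```

With df set to 'text' and no 'text' field in schema.xml, the first case is the one that points at an undefined field.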


Now, I will create a new field named 'text'.  'text' is not a required field name, but since I don't want to change every occurrence of 'text' in solrconfig.xml, I will just define a field with that name.

Based on the address dataset and the business logic in the data, this is my new field in schema.xml.

   <field name="text" type="text_general" indexed="true" 
      stored="true" multiValued="true"/>

Each record in the dataset contains a new and an old address in Korea.  Instead of (or in addition to) storing each part in a separate field, this field stores the whole address.  For example, "1600 Test Parkway Salt Lake City, UT. 12345" instead of "1600", "Test Parkway", "Salt Lake City", and so on separately.  Also, multiValued="true" is used so that the field can hold two addresses: the new one and the old one.

The type of this text field is 'text_general', and the following is my simple definition of this type for Korean in schema.xml.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
</fieldType>
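
To get a feel for what this analyzer produces, the following sketch imitates whitespace-only tokenization in plain JavaScript. whitespaceTokens is a hypothetical stand-in, not the Solr class itself; it assumes the tokenizer splits on runs of whitespace and does nothing else (no lowercasing, no punctuation removal), which matches the definition above since no filters are configured.

```javascript
// Rough stand-in for solr.WhitespaceTokenizerFactory:
// split on runs of whitespace only, keep everything else as-is.
function whitespaceTokens(text) {
  return text.trim().split(/\s+/).filter(function (t) { return t.length > 0; });
}

const tokens = whitespaceTokens('1600 Test Parkway Salt Lake City, UT. 12345');
console.log(tokens);
// [ '1600', 'Test', 'Parkway', 'Salt', 'Lake', 'City,', 'UT.', '12345' ]
```

Note that punctuation stays attached to the token ('City,' and 'UT.'), so with this field type a query term has to match the token exactly, including case.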

Now we need to define what the value of this text field should be.  Here, I will use the DataImportHandler; we have already used a data-config.xml to import data from a database.  In data-config.xml, I will compose the multiple values for the text field, and the composed values follow the way the new and old Korean addresses are written.

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="addressDB" type="JdbcDataSource" driver="org.postgresql.Driver" 
         url="jdbc:postgresql://SOME_URL" user="username" password="your_pass"/> 
    <script><![CDATA[
      function getModiData(row){
        var grdNum = row.get('ground_num');
        var grdNumSub = row.get('ground_num_sub');
        var bldNum = row.get('building_num');
        var bldNumSub = row.get('building_num_sub');

        var grdNumber = '';
        var bldNumber = '';
        var addressNew = '';
        var addressOld = '';
        
        if(grdNum){
           grdNumber = grdNum;
           if(grdNumSub){
              grdNumber = grdNumber + '-' + grdNumSub;
           }
        }
    
        if(bldNum){
           bldNumber = bldNum;
           if(bldNumSub){
              bldNumber = bldNumber + '-' + bldNumSub;
           }
        }

        addressNew = row.get('area_code') + ' ' + row.get('state') + 
                      ' ' + row.get('city');
        addressOld = row.get('postal_code') + ' ' + row.get('state') +
                     ' ' + row.get('city');
        if(row.get('sub_city')){
           addressNew = addressNew + ' ' + row.get('sub_city');
           addressOld = addressOld + ' ' + row.get('sub_city');
        }
        
        addressNew = addressNew + ' ' + row.get('street_name') + 
                      ' ' + row.get('building_num');
                    
        if(row.get('building_num_sub')){
           addressNew = addressNew + '-' + row.get('building_num_sub');
        }
        
        if(row.get('legal_dong_name')){
           addressOld = addressOld + ' ' + row.get('legal_dong_name');
           if(row.get('admin_dong_name')){
              addressOld = addressOld + ' (' + row.get('admin_dong_name') + ')';
           }
        }else if(row.get('admin_dong_name')){
           addressOld = addressOld + ' ' + row.get('admin_dong_name');
        }else if(row.get('ri_name')){
           addressOld = addressOld + ' ' + row.get('ri_name');
        }
        
        addressOld = addressOld + ' ' + grdNumber;
        
        if(row.get('building_name')){
           addressNew = addressNew + ' ' + row.get('building_name');
           addressOld = addressOld + ' ' + row.get('building_name');
        }else if(row.get('bulk_delivery_place_name')){
           addressNew = addressNew + '-' + row.get('bulk_delivery_place_name');
           addressOld = addressOld + '-' + row.get('bulk_delivery_place_name');
        }
        
        
        row.put('building_number', bldNumber);
        row.put('ground_number', grdNumber);
        row.put('address_new', addressNew);
        row.put('address_old', addressOld);
 
        return row;
      }
    ]]>
    </script>
    <document>
       <entity name="address" transformer="script:getModiData" dataSource="addressDB"
            query="SELECT address_db_id, area_code, state, state_en, city, city_en, 
                          sub_city, sub_city_en, street_name, street_name_en,
                          building_num, building_num_sub, bulk_delivery_place_name,
                          building_name, legal_dong_name, ri_name, admin_dong_name, 
                          ground_num, ground_num_sub, dong_seq, postal_code
                     FROM address_db, address_state
                    WHERE state_id = address_state_id">

          <!-- Other field definitions -->          
          <field column="address_new" name="text"/>
          <field column="address_old" name="text"/>
       </entity>
    </document>
</dataConfig>

The SQL query has no address_new or address_old columns; they are composed in the transformer, the JavaScript function 'getModiData'.  Both are then mapped to the field 'text'.
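
If you want to try part of the transformer logic outside Solr, a small mock can help. The sketch below is only an illustration: MockRow, joinNumber, and addNumbers are hypothetical names, and MockRow merely mimics the get/put calls that getModiData uses on the row object. It covers just the main/sub number composition, not the full address building, and unlike the original it always returns the number as a string.

```javascript
// Hypothetical stand-in for the row object passed to a script transformer.
// It only supports the get/put calls used in getModiData.
class MockRow {
  constructor(values) { this.values = new Map(Object.entries(values)); }
  get(key) { return this.values.has(key) ? this.values.get(key) : null; }
  put(key, value) { this.values.set(key, value); }
}

// Same main/sub number rule as in getModiData, factored into a helper:
// empty when there is no main number, "main-sub" when a sub number exists.
function joinNumber(main, sub) {
  if (!main) return '';
  return sub ? main + '-' + sub : '' + main;
}

function addNumbers(row) {
  row.put('building_number',
          joinNumber(row.get('building_num'), row.get('building_num_sub')));
  row.put('ground_number',
          joinNumber(row.get('ground_num'), row.get('ground_num_sub')));
  return row;
}

const row = addNumbers(new MockRow({ building_num: 12, building_num_sub: 3, ground_num: 45 }));
console.log(row.get('building_number')); // 12-3
console.log(row.get('ground_number'));   // 45
```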

That is it!
Now, let's start the Solr server again, and you should notice that the "undefined field text" error is gone. Open the Solr admin page and import all the data again, as shown in a previous post.

Searching
We can run a query from the Solr admin page.  Let's click the Query link and then click the "Execute Query" button without changing any parameters on the admin query page.
As you can see, there are 6,071,307 address records in total, and each record now has a text field with two address values: the new address and the old address.
On the admin page, there is a field 'q', which stands for query.  You may type "state_en:Seoul" (without the double quotation marks) in the 'q' text box.  It returns all records whose state_en field has the value Seoul.  When you run a Solr query, you may provide query parameters in the form "field_name:field_value".  The request URL for this example is   http://localhost:8983/solr/addressDB/select?q=state_en:Seoul&wt=json&indent=true
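
The same kind of request URL can be assembled in code. This is a minimal sketch, assuming the host, port, and core name from the example URL above; selectUrl is a hypothetical helper, and it does not escape characters that are special in Solr query syntax.

```javascript
// Base URL and core name are taken from the example URL in this post.
const base = 'http://localhost:8983/solr/addressDB/select';

// Build a /select URL for a "field_name:field_value" query.
function selectUrl(field, value) {
  const params = new URLSearchParams({ q: field + ':' + value, wt: 'json', indent: 'true' });
  return base + '?' + params.toString();
}

console.log(selectUrl('state_en', 'Seoul'));
// http://localhost:8983/solr/addressDB/select?q=state_en%3ASeoul&wt=json&indent=true
```

URLSearchParams percent-encodes the colon as %3A; the server decodes it back, so the query is the same as typing state_en:Seoul in the 'q' box.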

Default Field






