Friday, September 25, 2015

Solr: Text Analysis, Searching, Default Field and Operators (Incomplete)

** This post is NOT completed **

In an earlier post, I showed a basic schema.xml file, and the later posts are based on that schema definition.  That schema was too simple to support all the features defined in solrconfig.xml.

You may have noticed the following error when starting the Solr server, even though search queries worked as long as each query specified a field name.



Let's talk about this error first.

- What does "undefined field text" mean, and why does it happen?
It means that a field named 'text' is not defined in schema.xml.  Then why the specific name 'text'?  Is it a required field name?
Let's open solrconfig.xml and search for 'text'.  You will find the following.  By the way, request handlers are the entry points for all requests to Solr, and the definition below handles '/select' queries.

  <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">text</str>
     </lst>
  </requestHandler>

You will see several similar definitions in solrconfig.xml.  The 'df' parameter names the default field, which is searched when a query does not specify a field name; here it is 'text'.


Now, I will create a new field named 'text'.  The name 'text' is not required by Solr, but since I don't want to change every 'text' reference in solrconfig.xml, I will simply define a field with that name.

Based on the address dataset and its business logic, this is my new field in schema.xml.

   <field name="text" type="text_general" indexed="true" 
      stored="true" multiValued="true"/>

Each record in the dataset contains a new and an old address in Korea.  Instead of (or in addition to) storing each part in a separate field, this field stores a whole address: for example, "1600 Test Parkway Salt Lake City, UT. 12345" instead of "1600", "Test Parkway", "Salt Lake City", and so on.  multiValued="true" is used so the field can hold two addresses: the new one and the old one.

The type of this field is 'text_general'.  Here is my simple definition of this type for Korean in schema.xml.

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    </analyzer>
</fieldType>

Now we need to define what the value of this field should be.  I will use the DataImportHandler, as we have already used a data-config.xml to import data from a database.  In data-config.xml, I compose the multiple values for the text field, following the way the new and old Korean addresses are written.

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
<dataSource name="addressDB" type="JdbcDataSource" driver="org.postgresql.Driver" 
         url="jdbc:postgresql://SOME_URL" user="username" password="your_pass"/> 
    <script><![CDATA[
      function getModiData(row){
        var grdNum = row.get('ground_num');
        var grdNumSub = row.get('ground_num_sub');
        var bldNum = row.get('building_num');
        var bldNumSub = row.get('building_num_sub');

        var grdNumber = '';
        var bldNumber = '';
        var addressNew = '';
        var addressOld = '';
        
        if(grdNum){
           grdNumber = grdNum;
           if(grdNumSub){
              grdNumber = grdNumber + '-' + grdNumSub;
           }
        }
    
        if(bldNum){
           bldNumber = bldNum;
           if(bldNumSub){
              bldNumber = bldNumber + '-' + bldNumSub;
           }
        }

        addressNew = row.get('area_code') + ' ' + row.get('state') + 
                      ' ' + row.get('city');
        addressOld = row.get('postal_code') + ' ' + row.get('state') +
                     ' ' + row.get('city');
        if(row.get('sub_city')){
           addressNew = addressNew + ' ' + row.get('sub_city');
           addressOld = addressOld + ' ' + row.get('sub_city');
        }
        
        addressNew = addressNew + ' ' + row.get('street_name') + 
                      ' ' + row.get('building_num');
                    
        if(row.get('building_num_sub')){
           addressNew = addressNew + '-' + row.get('building_num_sub');
        }
        
        if(row.get('legal_dong_name')){
           addressOld = addressOld + ' ' + row.get('legal_dong_name');
           if(row.get('admin_dong_name')){
              addressOld = addressOld + ' (' + row.get('admin_dong_name') + ')';
           }
        }else if(row.get('admin_dong_name')){
           addressOld = addressOld + ' ' + row.get('admin_dong_name');
        }else if(row.get('ri_name')){
           addressOld = addressOld + ' ' + row.get('ri_name');
        }
        
        addressOld = addressOld + ' ' + grdNumber;
        
        if(row.get('building_name')){
           addressNew = addressNew + ' ' + row.get('building_name');
           addressOld = addressOld + ' ' + row.get('building_name');
        }else if(row.get('bulk_delivery_place_name')){
           addressNew = addressNew + '-' + row.get('bulk_delivery_place_name');
           addressOld = addressOld + '-' + row.get('bulk_delivery_place_name');
        }
        
        
        row.put('building_number', bldNumber);
        row.put('ground_number', grdNumber);
        row.put('address_new', addressNew);
        row.put('address_old', addressOld);
 
        return row;
      }
    ]]>
    </script>
    <document>
       <entity name="address" transformer="script:getModiData" dataSource="addressDB"
            query="SELECT address_db_id, area_code, state, state_en, city, city_en, 
                          sub_city, sub_city_en, street_name, street_name_en,
                          building_num, building_num_sub, bulk_delivery_place_name,
                          building_name, legal_dong_name, ri_name, admin_dong_name, 
                          ground_num, ground_num_sub, dong_seq, postal_code
                     FROM address_db, address_state
                    WHERE state_id = address_state_id">

          <!-- Other field definitions -->          
          <field column="address_new" name="text"/>
          <field column="address_old" name="text"/>
       </entity>
    </document>
</dataConfig>

The SQL query has no address_new or address_old columns; they are composed by the transformer, the JavaScript function 'getModiData'.  Both are then mapped to the field 'text'.
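To make the number composition in the transformer concrete, here is a minimal Java sketch of the "main" or "main-sub" rule that getModiData applies to the ground and building numbers.  The class name, the helper name, and the sample values are hypothetical, and treating null or empty strings as absent is a simplification of the JavaScript truthiness check.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AddressNumberSketch {
    // Mirrors the grdNumber / bldNumber logic in getModiData:
    // "main" alone, or "main-sub" when a sub number is present.
    static String joinNumber(String main, String sub) {
        if (main == null || main.isEmpty()) {
            return "";
        }
        if (sub == null || sub.isEmpty()) {
            return main;
        }
        return main + "-" + sub;
    }

    public static void main(String[] args) {
        // Hypothetical sample row; the real values come from the database.
        Map<String, String> row = new LinkedHashMap<String, String>();
        row.put("building_num", "12");
        row.put("building_num_sub", "3");
        row.put("ground_num", "456");

        // Building number with a sub number, ground number without one.
        System.out.println(joinNumber(row.get("building_num"), row.get("building_num_sub")));
        System.out.println(joinNumber(row.get("ground_num"), row.get("ground_num_sub")));
    }
}
```

With these sample values the two lines printed are 12-3 and 456.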

That is it!
Now, start the Solr server again; the "undefined field text" error should be gone.  Open the Solr admin page and import all the data again, as shown in a previous post.

Searching
We can run a query from the Solr admin page.  Let's click the Query link and then the "Execute Query" button without changing any parameter on the admin query page.
As you can see, there are 6,071,307 address records in total, and each record now has a text field with two values: the new address and the old address.
The admin page has a field 'q', which stands for query.  Type state_en:Seoul (without quotation marks) in the 'q' text box.  It returns all records whose state_en field has the value Seoul.  When you run a Solr query, you provide query parameters in the form "field_name:field_value".  The request URL for this example is http://localhost:8983/solr/addressDB/select?q=state_en:Seoul&wt=json&indent=true
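As a rough illustration of how such a request URL can be assembled in code, the sketch below builds the same '/select' URL with the query value URL-encoded.  The class and method names are hypothetical; the base URL is the one used in this post.  URLEncoder encodes the colon as %3A, so the result is not character-for-character identical to the URL above, although it should be an equivalent request.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class SelectUrlSketch {
    // Builds a /select URL for the given core URL and query string.
    static String selectUrl(String coreUrl, String query) {
        try {
            return coreUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&wt=json&indent=true";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String core = "http://localhost:8983/solr/addressDB";
        // A query with an explicit field name.
        System.out.println(selectUrl(core, "state_en:Seoul"));
        // A query without a field name; Solr would search the default field (df).
        System.out.println(selectUrl(core, "Seoul"));
    }
}
```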

Default Field







Saturday, August 29, 2015

hashCode, equals methods and Set collection in Java

In this post, let's talk about a basic but important part of programming.  We know the importance of an object's hash value, but it can still be hard to understand a bug caused by hashCode.

Now, let's look at the following code.  Pojo.java is a simple class with two member variables, and its hashCode() and equals() methods were auto-generated by Eclipse.  Looks good, doesn't it?

public class Pojo {
   private int number = 0;
   private String str = "";
   public int getNumber() {
      return number;
   }
   public void setNumber(int number) {
      this.number = number;
   }
   public String getStr() {
      return str;
   }
   public void setStr(String str) {
      this.str = str;
   }
   @Override
   public int hashCode() {
      final int prime = 31;
      int result = 1;
      result = prime * result + number;
      result = prime * result + ((str == null) ? 0 : str.hashCode());
      return result;
   }
   @Override
   public boolean equals(Object obj) {
      if (this == obj)
         return true;
      if (obj == null)
         return false;
      if (getClass() != obj.getClass())
         return false;
      
      Pojo other = (Pojo) obj;
      if (number != other.number)
         return false;
      if (str == null) {
         if (other.str != null)
            return false;
      } else if (!str.equals(other.str))
         return false;
      return true;
   } 
}


Let's look at some test code.  Can you tell what it prints?

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class Test {
   public static void main(String[] args) {
      Set<Pojo> pojoSet = new HashSet<Pojo>();
      for (int i = 1; i < 10 ; i++){
         Pojo pojo = new Pojo();
         pojo.setNumber(i);
         pojoSet.add(pojo);
  
         if(i == 5){
            pojo.setStr("String Value");
         }
      }
  
      Pojo[] data = pojoSet.toArray(new Pojo[pojoSet.size()]);
      Set<Pojo> anotherSet = new HashSet<Pojo>(Arrays.asList(data));

      for(Pojo pojo : pojoSet){
         if(pojo.getNumber() == 5){
            boolean isRemoved = anotherSet.remove(pojo);
            System.out.println("From anotherSet: " + isRemoved);
            isRemoved = pojoSet.remove(pojo);
            System.out.println("From pojoSet: " + isRemoved);
         }
      }
   }
}

The output is:
   From anotherSet: true
   From pojoSet: false

When an object is added to a HashSet, the object's hashCode method is called and the resulting hash value determines where the object is stored in the HashSet.  When the value of 'str' is changed later, the hash value recorded in the HashSet is not updated.
When this object is passed to a HashSet method (in our case, remove), the hash value of the passed object is calculated and compared with the hash values stored in the HashSet.

In our example, the hash value of the fifth object in pojoSet was calculated with an empty str value.  After that, the str value was changed.  When the pojo object in the second for loop is passed to the remove method, its hash value is calculated with the non-empty str value, which no longer matches the stored one.  Therefore, the second output statement prints 'false'.  anotherSet was built after the change, so its stored hash value matches and the first statement prints 'true'.
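One way around this, sketched below under the assumption that you control the code that changes the object, is to remove the object from the set before changing it and add it back afterwards, so the set records the new hash value.  The Item class is a hypothetical, trimmed-down stand-in for Pojo.

```java
import java.util.HashSet;
import java.util.Set;

class MutableKeyDemo {
    // A small stand-in for Pojo: 'str' takes part in hashCode and equals.
    static final class Item {
        private final int number;
        private String str = "";

        Item(int number) { this.number = number; }

        void setStr(String str) { this.str = str; }

        @Override
        public int hashCode() { return 31 * (31 + number) + str.hashCode(); }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Item)) return false;
            Item other = (Item) o;
            return number == other.number && str.equals(other.str);
        }
    }

    // Changing the item in place: the set still files it under the old hash.
    static boolean removeAfterInPlaceChange() {
        Set<Item> set = new HashSet<Item>();
        Item item = new Item(5);
        set.add(item);
        item.setStr("String Value");
        return set.remove(item);
    }

    // Remove first, change, then add again so the set stores the new hash.
    static boolean removeAfterReAdd() {
        Set<Item> set = new HashSet<Item>();
        Item item = new Item(5);
        set.add(item);
        set.remove(item);
        item.setStr("String Value");
        set.add(item);
        return set.remove(item);
    }

    public static void main(String[] args) {
        System.out.println("In-place change: " + removeAfterInPlaceChange());
        System.out.println("Remove, change, re-add: " + removeAfterReAdd());
    }
}
```

The first call reproduces the 'false' from the test above, and the second returns 'true'.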

In production code, we often have an entity object with a Set-typed property.  During a business process, we sometimes create a new entity that is a property of another.  In this case, we add the newly created Java object to the set and persist the parent entity to insert the new object.  During the insert, the DB returns an id for the object and the object is updated.  If the id (or any other updated field) is used in the entity's hashCode method, we can no longer find the newly created Java object in the Set (until the Set is reloaded).
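A possible alternative for the entity case, which is not taken from any particular persistence framework, is to base hashCode and equals on a value that is fixed when the object is created and never updated by the database.  In the sketch below the class names and the UUID-based key are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

class StableKeyEntityDemo {
    // Hypothetical child entity: 'key' is fixed at creation, 'id' arrives later.
    static final class Child {
        private final String key = UUID.randomUUID().toString();
        private Long id; // assigned by the database on insert

        void setId(Long id) { this.id = id; }

        Long getId() { return id; }

        @Override
        public int hashCode() { return key.hashCode(); }

        @Override
        public boolean equals(Object o) {
            return (o instanceof Child) && key.equals(((Child) o).key);
        }
    }

    // The database id changes after the add, but the hash value does not.
    static boolean stillFoundAfterIdAssigned() {
        Set<Child> children = new HashSet<Child>();
        Child child = new Child();
        children.add(child);
        child.setId(42L); // simulates the id returned by the insert
        return children.contains(child) && children.remove(child);
    }

    public static void main(String[] args) {
        System.out.println("Found after id assigned: " + stillFoundAfterIdAssigned());
    }
}
```

Because the id is not part of hashCode, updating it after the insert does not hide the object from the Set.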

Thursday, August 27, 2015

Creating Solr Index using a Hadoop MapReduce (data in files)

I have posted how to create a Solr index using SolrJ and the DataImportHandler.  Importing more than 6,070,000 address records in less than 5 minutes on my local machine with the DataImportHandler is not too bad.  Nevertheless, it is worth trying a distributed system, so I will use Hadoop's MapReduce in this post.

I assume that the Solr server is installed and running, and that you are able to run Hadoop on your machine as shown in my previous posts: Install Hadoop, Configuration and Running Hadoop, Running MapReduce.

Since Hadoop can feed each line of the 17 data files to a map method, it seemed easy to send HTTP requests to Solr for indexing, but I ran into several things to consider.

Things to Consider:
1.  How do we make third-party libraries, such as the SolrJ libraries, available to Hadoop?
I could certainly use a Java URL object (without the SolrJ libraries) and send an HTTP request to Solr after constructing the request parameter string, but using an existing library is usually beneficial.

2.  How do we generate a unique 'addressId' value across multiple map tasks, which may run on multiple JVMs?
In schema.xml, the addressId field is a unique field.  In the database, I used a database sequence to make sure this value was unique.  When I used a SolrJ client, the client ran on one JVM, so I kept the value unique by incrementing a counter by 1.

3.  A new design is needed to break up the process, taking into account how efficiently work moves between the map and the reduce phases.  During address indexing, we just process each line (without eliminating any) and post the data to Solr.  So we need to think about how many map tasks and reduce tasks to use.

Approaches: 
1.  We could include all the necessary jar files in a lib directory inside the jar file that contains the address MapReduce code.  In that case, the jar file becomes larger, and different job jars may each carry copies of the same dependencies.  Instead, we can put the necessary jar files on HDFS and share them when needed.

Hadoop has a command option '-libjars', followed by a comma-separated list of jar files (with no spaces between them):
-libjars jar1,jar2,jar3...

2.  Since map and reduce tasks can run on multiple JVMs, we cannot simply use a static counter variable.  One way to get a unique id for the example in this post is to use Solr's UUID techniques, as shown on the Solr Wiki page.
Another way, which I will use, is to use Hadoop's job and task ids.  This table is from the book "Hadoop: The Definitive Guide" by Tom White.

Property Name              Type     Description                           Example
mapreduce.job.id           String   The job id                            job_200811201130_0004
mapreduce.task.id          String   The task id                           task_200811201130_0004_m_000003
mapreduce.task.attempt.id  String   The task attempt id                   attempt_200811201130_0004_m_000003_0
mapreduce.task.partition   int      The index of the task within the job  3
mapreduce.task.ismap       boolean  Whether this task is a map task       true

When you need one of the ids in the table above in your Java code, the Java API is more convenient.  The Context parameter of the map/reduce method provides a way to retrieve them.
For example,
context.getTaskAttemptID().getJobID().getId() , context.getTaskAttemptID().getTaskID().getId()

3.  In our address indexing case, we don't need to drop or combine any lines of data.  Since each line is processed by a map function, each line can populate a document object in the map method, which is then sent to Solr.  Documents are sent in batches of a fixed size, followed by a commit, as we did in the simple SolrJ example.
Hadoop sorts the map output before the reduce phase.  When we process each line inside the map method and produce no output for the reducer, we avoid the sorting and copying of data for the reducer.  In fact, I don't need a reducer at all for this, but I do need a combiner to send the remaining documents and to call commit for the data processed on the same JVM.
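The batching idea in item 3 can be sketched in plain Java without Hadoop or SolrJ.  The BatchBuffer and Sink names below are hypothetical; in the actual job the role of the sink is played by the Solr server, and the final flush is the part the combiner is used for.

```java
import java.util.ArrayList;
import java.util.List;

class BatchBufferSketch {
    // Receives each full (or final partial) batch; stands in for the Solr server.
    interface Sink {
        void send(List<String> batch);
    }

    static final class BatchBuffer {
        private final int batchSize;
        private final Sink sink;
        private final List<String> pending = new ArrayList<String>();

        BatchBuffer(int batchSize, Sink sink) {
            this.batchSize = batchSize;
            this.sink = sink;
        }

        // Collects documents and sends them once a full batch is available.
        void add(String doc) {
            pending.add(doc);
            if (pending.size() >= batchSize) {
                flush();
            }
        }

        // Sends whatever is still pending, for example at the end of the input.
        void flush() {
            if (!pending.isEmpty()) {
                sink.send(new ArrayList<String>(pending));
                pending.clear();
            }
        }
    }

    // Returns the sizes of the batches sent for the given number of documents.
    static List<Integer> batchSizesFor(int docs, int batchSize) {
        final List<Integer> sizes = new ArrayList<Integer>();
        BatchBuffer buffer = new BatchBuffer(batchSize, new Sink() {
            public void send(List<String> batch) {
                sizes.add(batch.size());
            }
        });
        for (int i = 0; i < docs; i++) {
            buffer.add("doc-" + i);
        }
        buffer.flush(); // the remaining partial batch
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(batchSizesFor(2500, 1000)); // [1000, 1000, 500]
    }
}
```

Without the final flush, the last 500 documents in this example would never be sent, which is why the job needs some hook that runs after the last map call.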

Implementation:
1. Map

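import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddressMap2 extends Mapper<LongWritable, Text, Text, Text>{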
   static final SolrServer server = new HttpSolrServer(
                                "http://localhost:8983/solr/addressDB");
   static final List<SolrInputDocument> docList = 
                                     new ArrayList<SolrInputDocument>();
   static boolean isWritten = false;
   static int count = 1;
   static final int MAX_DOC_PER_JVM = 800000; 
   
   @Override
   public void map(LongWritable key, Text value, Context context)
                             throws IOException, InterruptedException {

      String[] lineTerms = value.toString().split("\\|");
      try{
         Integer.parseInt(lineTerms[0].trim());
      }catch(NumberFormatException ne){
         //All data starts with a number.  When we get here, this data is one
         //of the first two comment lines. 
         return;
      }

      int countNumber = (context.getTaskAttemptID().getTaskID().getId() * 
                                          MAX_DOC_PER_JVM) + count;
      AddressInFileParser parser = new AddressInFileParser();
      
      SolrInputDocument doc = parser.parse(value.toString(), countNumber);
      docList.add(doc);
         
      if(!docList.isEmpty() && (count%1000 == 0)){
         try {
            server.add(docList);
            docList.clear();
         } catch (SolrServerException e) {
            throw new InterruptedException(e.getMessage());
         }
      }
      count++;
      
      if(!isWritten){
         context.write(new Text("dummy"), new Text(Integer.toString(count)));
         isWritten = true;
      }      
   }
}

- MAX_DOC_PER_JVM is chosen from the total number of address records (about 6,070,000) and the approximate number of mappers, which is about 10.  With roughly 607,000 records per mapper, 800,000 leaves headroom so that the id ranges of different tasks do not overlap.
* Why 10?
The input address file created in step 6 below is 1226231976 bytes, and Hadoop splits the input file into pieces of the file block size, which is 128 MB (134217728 bytes) by default.  Therefore, 1226231976 / 134217728 = 9.14 --> 10
To confirm it, see the Result section below.

- We don't need any output from the mapper, but the combiner won't be called unless the mapper produces output.  So, I write one dummy output per JVM.  Note that Hadoop treats a combiner as an optimization and does not guarantee how many times it runs, so this approach relies on the behavior observed in this job.
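The two calculations above can be checked with a few lines of Java.  This is only a sketch: the class and method names are hypothetical, and the ceiling division is an approximation of how Hadoop computes splits rather than its exact algorithm.

```java
class SplitAndIdSketch {
    static final long BLOCK_SIZE = 134217728L;   // 128 MB, the default block size
    static final int MAX_DOC_PER_JVM = 800000;   // same constant as in AddressMap2

    // Approximate number of splits: file size divided by block size, rounded up.
    static long splitCount(long fileBytes) {
        return (fileBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Same formula as countNumber in the map method.
    static int addressId(int taskId, int count) {
        return taskId * MAX_DOC_PER_JVM + count;
    }

    public static void main(String[] args) {
        System.out.println(splitCount(1226231976L));            // 10
        // The last id of task 0 and the first id of task 1 do not collide,
        // as long as no task processes more than MAX_DOC_PER_JVM lines.
        System.out.println(addressId(0, MAX_DOC_PER_JVM));      // 800000
        System.out.println(addressId(1, 1));                    // 800001
    }
}
```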


2. AddressInFileParser used in the Map

import org.apache.solr.common.SolrInputDocument;

public class AddressInFileParser {
   /** 
    * @param line.  This line is expected to have the following format.
    *      area_code|state|state_en|city|city_en|sub_city|sub_city_en|
    *      street_code|street_name|street_name_en|is_basement|building_num|
    *      building_num_sub|building_mgm_num|bulk_delivery_place_name|
    *      building_name|legal_dong_code|legal_dong_name|ri_name|admin_dong_name|
    *      is_mountain|ground_num|dong_seq|ground_num_sub|postal_code|postal_code_seq
    *
    * @param addressId
    * @return
    */
   public SolrInputDocument parse(String line, int addressId){
      String[] lineTerms = line.split("\\|");
      int parseIndex = 0;
      
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("addressId", addressId);
      doc.addField("areaCode", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("state", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("state_en", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("city", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("city_en", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("subCity", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("subCity_en", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      parseIndex++; //Skip street_code
      
      doc.addField("streetName", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("streetName_en", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      parseIndex++; //Skip is_basement
      
      String bldNumber = StringUtil.nonNullTrim(lineTerms[parseIndex++]);
      if (!StringUtil.isEmpty(lineTerms[parseIndex])) {
         bldNumber = bldNumber + "-" + lineTerms[parseIndex].trim();
      }
      parseIndex++;
      doc.addField("buildingNumber", bldNumber);

      parseIndex++; //Skip building_mgm_num
      doc.addField("bulkDeliveryPlaceName", 
                          StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("buildingName", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      parseIndex++; //Skip legal_dong_code
      
      doc.addField("dongName", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("riName", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      doc.addField("adminDongName", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      parseIndex++; // skip is_mountain

      String grdNumber = StringUtil.nonNullTrim(lineTerms[parseIndex++]);
      doc.addField("dongSeq", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      
      if (!StringUtil.isEmpty(lineTerms[parseIndex])) {
         grdNumber = grdNumber + "-" + lineTerms[parseIndex].trim();
      }
      parseIndex++;
      doc.addField("groundNumber", grdNumber);
      
      doc.addField("postalCode", StringUtil.nonNullTrim(lineTerms[parseIndex++]));
      
      return doc;
   }
}
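The StringUtil helper used by the parser is not shown in this post.  A minimal sketch of what it might look like follows; the behavior here is an assumption inferred from the method names, in particular that isEmpty treats whitespace-only strings as empty.

```java
// Hypothetical stand-in for the StringUtil class used by AddressInFileParser.
class StringUtil {
    // Returns the trimmed string, or an empty string when the input is null.
    static String nonNullTrim(String s) {
        return s == null ? "" : s.trim();
    }

    // Assumption: null and whitespace-only strings both count as empty.
    static boolean isEmpty(String s) {
        return s == null || s.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("[" + nonNullTrim("  Seoul ") + "]");
        System.out.println("[" + nonNullTrim(null) + "]");
        System.out.println(isEmpty("   "));
    }
}
```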


3. Combiner

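import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrServerException;

// Used as a combiner, so the output types match the map output types (Text, Text).
public class AddressReduce2 extends 
                           Reducer<Text, Text, Text, Text>{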
   @Override
   public void reduce(Text key, Iterable<Text> values, Context context) 
         throws IOException{
      try {
         if(!AddressMap2.docList.isEmpty()){
            AddressMap2.server.add(AddressMap2.docList);
         }
         AddressMap2.server.commit();
      } catch (SolrServerException e) {
         e.printStackTrace();
      }
   }
}


4. Client

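import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AddressMRClient 
                   extends Configured implements Tool{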
   @Override
   public int run(String[] args) throws Exception {
      if (args.length != 2) {
         System.err.printf("Usage: %s [generic options] <input> <output>\n",
                      getClass().getSimpleName());
         ToolRunner.printGenericCommandUsage(System.err);
         return -1; 
      }
       
      Job job = Job.getInstance(getConf(), "Solr address");
      job.setJarByClass(getClass());

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
       
      job.setMapperClass(AddressMap2.class);
      job.setCombinerClass(AddressReduce2.class);
       
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(Text.class);
       
      return job.waitForCompletion(true)? 0:1;
   }

   public static void main(String[] args) throws Exception{
      int exitCode = ToolRunner.run(new AddressMRClient(), args);
      System.exit(exitCode);
   }
}


5.  pom.xml for the project

<project xmlns="http://maven.apache.org/POM/4.0.0" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jihwan.learn.solr</groupId>
  <artifactId>address-db-client</artifactId>
  <version>1.0</version>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hadoop.version>2.6.0</hadoop.version>
  </properties>
    
  <dependencies>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
      <version>4.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-dataimporthandler</artifactId>
      <version>4.10.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-dataimporthandler-extras</artifactId>
      <version>4.10.2</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.10</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>1.7.6</version>
    </dependency>
      <!-- Hadoop main client artifact -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>
    <!-- Unit test artifacts -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.mrunit</groupId>
      <artifactId>mrunit</artifactId>
      <version>1.1.0</version>
      <classifier>hadoop2</classifier>
      <scope>test</scope>
    </dependency>
    <!-- Hadoop test artifact for running mini clusters -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-minicluster</artifactId>
      <version>${hadoop.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-jar-plugin</artifactId>
        <version>2.5</version>
        <configuration>
          <outputDirectory>${basedir}</outputDirectory>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>


6.  Combining input files.
There are 17 address data files, one per region, ranging in size from 17.4 MB to 175.8 MB.  In Hadoop, it is better to have a small number of large files.  This is from the book "Hadoop: The Definitive Guide":

"Hadoop works better with a small number of large files than a large 
number of small files. One reason for this is that FileInputFormat 
generates splits in such a way that each split is all or part of a 
single file. If the file is very small (“small” means significantly
smaller than an HDFS block) and there are a lot of them, each map 
task will process very little input, and there will be a lot of them 
(one per file), each of which imposes extra bookkeeping overhead. 
Compare a 1 GB file broken into eight 128 MB blocks with 10,000 or so 
100 KB files. The 10,000 files use one map each, and the job time can 
be tens or hundreds of times slower than the equivalent one with a 
single input file and eight map tasks."

To combine the files, I ran the following command on my MacBook.
     $ sh -c 'cat *.txt > addressmerge.txt'


7.  Running the application.
-  Again, we need to start Hadoop, and may need to create the directories again since I am using the /tmp directory in this post.
   $ cd $HADOOP_PREFIX
   $ bin/hdfs namenode -format
   $ sbin/start-dfs.sh
   $ sbin/start-yarn.sh
   $ bin/hdfs dfs -mkdir /user
   $ bin/hdfs dfs -mkdir /user/<username>

Now, create a directory for the address data file.
   $ bin/hdfs dfs -mkdir /user/<username>/input
   $ bin/hdfs dfs -mkdir /user/<username>/input/addressMerge

- Copy the data file from the local file system to the Hadoop file system.

   $ cd DIR_OF_addressmerge.txt_YOU_CREATED_ON_STEP6
   $ hadoop fs -copyFromLocal ./addressmerge.txt input/addressMerge

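- Build the MapReduce code.  The 'package' phase is needed to create the jar file; the jar plugin in the pom.xml above writes address-db-client-1.0.jar to the project directory.

   $ cd DIR_OF_YOUR_PROJECT
   $ mvn package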

- Start the Solr server.  You may need to delete the 'data' directory under $SOLR_INSTALLED_DIR/example/solr/addressDB before starting it.

- Run the application.  All library dependencies are specified with the "-libjars" option.
** The command is wrapped here for readability, but it should be entered as a single line **

   $ hadoop jar address-db-client-1.0.jar com.jihwan.learn.solr.AddressMRClient 
-conf conf/hadoop-localhost.xml -libjars 
/Users/jihwan/.m2/repository/org/apache/solr/solr-solrj/4.10.2/solr-solrj-4.10.2.jar,
/Users/jihwan/.m2/repository/commons-io/commons-io/2.3/commons-io-2.3.jar,
/Users/jihwan/.m2/repository/org/apache/httpcomponents/httpclient/4.3.1/httpclient-4.3.1.jar,
/Users/jihwan/.m2/repository/org/apache/httpcomponents/httpcore/4.3/httpcore-4.3.jar,
/Users/jihwan/.m2/repository/org/apache/httpcomponents/httpmime/4.3.1/httpmime-4.3.1.jar,
/Users/jihwan/.m2/repository/org/noggit/noggit/0.5/noggit-0.5.jar,
/Users/jihwan/.m2/repository/org/slf4j/slf4j-api/1.7.6/slf4j-api-1.7.6.jar 
input/addressMerge addressOutMerge

8.  Result.
This Hadoop MapReduce application indexed all 6,071,307 address records in Solr, and it took a little more than 5 minutes.  This is slower than the data import from the database shown in a previous post, but we should also count the time needed to load the data files into the PostgreSQL DB, which was about 10 minutes on my laptop.

Let's study some of the output printed while the application was running.


- The application started at 20:49:35.
- Total input paths to process : 1  --> We have only one input file, 'addressmerge.txt'.
- number of splits:10  -->  Hadoop split the one input file into 10 pieces that fit the Hadoop block size, which is 128 MB by default.  Hadoop also runs 10 mappers.

- running in uber mode: false  --> This is not a small job that can be run sequentially on one node.



- The job was completed at 20:54:41.
- Launched map tasks=12  -->  Earlier I said Hadoop would run 10 mappers, but this indicates that 12 map tasks were launched.
- Killed map tasks=2  -->  Although 12 map tasks were launched, 2 of them were killed.  Hadoop often runs duplicates of the same task in parallel (speculative execution) and kills the duplicates once one of them completes.
- Launched reduce tasks=1 --> Even though we didn't specify a reducer, Hadoop still uses the default reducer, org.apache.hadoop.mapreduce.Reducer, which simply writes all its input to its output.  Since we didn't write anything from the combiner, the reducer had nothing to do.

- Map input records=6071341  --> We know we had 6071307 address records in total.  Before we merged the 17 address files, each file had two comment lines, which we dropped in the map method.  So, 6071341 - (17*2) = 6071307.
- Map output records=10 , Combine input records=10 -->  In the map method, we created only one output record per JVM, and we had 10 mappers on different JVMs.  So, the total number of output records from the map is 10, which is the same as the number of combine input records.
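The arithmetic behind two of the figures above (the record count and the elapsed time between the start and completion timestamps) can be double-checked with a small Java sketch.  The class and method names are hypothetical, and the time calculation assumes both timestamps fall on the same day.

```java
class JobOutputCheck {
    // Data records = map input records minus the comment lines of every file.
    static int dataRecords(int mapInputRecords, int files, int commentLinesPerFile) {
        return mapInputRecords - files * commentLinesPerFile;
    }

    // Converts "HH:mm:ss" to seconds since midnight.
    static int toSeconds(String hhmmss) {
        String[] parts = hhmmss.split(":");
        return Integer.parseInt(parts[0]) * 3600
                + Integer.parseInt(parts[1]) * 60
                + Integer.parseInt(parts[2]);
    }

    // Elapsed seconds between two timestamps on the same day.
    static int elapsedSeconds(String start, String end) {
        return toSeconds(end) - toSeconds(start);
    }

    public static void main(String[] args) {
        System.out.println(dataRecords(6071341, 17, 2));          // 6071307
        System.out.println(elapsedSeconds("20:49:35", "20:54:41")); // 306, i.e. 5 m 6 s
    }
}
```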

Warning:
In this post, I only described how to run MapReduce to build a Solr index, but we should also pay attention to what happens on the Solr side.  On each commit call, Solr makes a new searcher visible to clients, which triggers expensive work such as autowarming and recreating caches for the new searcher.  Therefore, developers need to be careful about calling commit from multiple servers, whether at intervals or not.




Friday, August 7, 2015

Solr: Import Data from DB and Performance Comparison

In my previous post, I described the settings for importing data from a database and how to define <entity> elements for multiple tables: a hierarchical <entity> element for each table.  Although this structure is explained in most Solr documentation, it is a very slow process.

Using all the settings explained in the previous post, let's run a full import from the database and create the index on the Solr server using the Solr web UI.

1.  Open the Solr web UI, for example at http://localhost:8983/solr/#/ --> Select the core "addressDB" --> Click Dataimport --> Make sure the "full-import" command is selected.


2.  Click the "Execute" button.  You may also click the "Refresh Status" button whenever you want to see an updated status.   This is the result of the data import using all the configurations from the previous post.


All the data was imported, but it took 27 m 45 s for 6,071,307 records.  That is very slow, especially since we imported the same data in 410 seconds using a SolrJ client in a previous post.
The numbers of 'Requests' and 'Fetched' are very large, and the number of fetched rows is double the number 'Processed'.  In fact, the joins Solr performs with nested entities are more like subqueries in a database query: the inner entity's query is run again for each row of the outer entity.

- A different data-config.xml structure and performance comparison
Instead of a hierarchical <entity> structure, let's use one <entity> with a join SQL statement as the value of the query attribute.
Every setting is the same as in the previous post except for the following <entity> definition in the data-config.xml file.

<document>
  <entity name="address" transformer="script:getFullName" dataSource="addressDB"
          query="SELECT address_db_id, area_code, state, state_en, city,
                 city_en, sub_city, sub_city_en, street_name,
                 street_name_en, building_num, building_num_sub, bulk_delivery_place_name, building_name,
                 legal_dong_name, ri_name, admin_dong_name,
                 ground_num, ground_num_sub, dong_seq, postal_code
                 FROM address_db, address_state
                 WHERE state_id = address_state_id">
    <field column="address_db_id" name="addressId" />
    <field column="area_code" name="areaCode" />
    <field column="postal_code" name="postalCode" />
    <field column="city" name="city" />
    <field column="city_en" name="city_en" />
    <field column="sub_city" name="subCity" />
    <field column="sub_city_en" name="subCity_en" />
    <field column="street_name" name="streetName" />
    <field column="street_name_en" name="streetName_en" />
    <field column="building_number" name="buildingNumber" />
    <field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName"/>
    <field column="building_name" name="buildingName"/>
    <field column="legal_dong_name" name="dongName"/>
    <field column="admin_dong_name" name="adminDongName"/>
    <field column="ri_name" name="riName"/>
    <field column="ground_number" name="groundNumber"/>
    <field column="dong_seq" name="dongSeq"/>
    <field column="state" name="state" />
    <field column="state_en" name="state_en" />
  </entity>
</document>

Then, run the full import again.  (To remove the existing data, simply delete the 'data' directory under the 'addressDB' directory and restart Solr.)

This is the result of the full import.  4 m 43 s (283 seconds) for the same data with only 1 request!
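As a quick check of these numbers, the short JavaScript calculation below converts the three timings mentioned in this post to seconds and to a rough records-per-second rate.  The variable and function names are mine; the record count and timings are the ones reported above.

```javascript
// Record count and timings reported in this post.
const totalRecords = 6071307;
const timingsInSeconds = {
  nestedEntities: 27 * 60 + 45, // 27 m 45 s with the hierarchical <entity> structure
  joinedQuery: 4 * 60 + 43,     // 4 m 43 s with one joined query
  solrjClient: 410,             // SolrJ client from the earlier post
};

// Average throughput; ignores warm-up, commits and network effects.
function recordsPerSecond(records, seconds) {
  return records / seconds;
}

for (const [name, seconds] of Object.entries(timingsInSeconds)) {
  const rate = Math.round(recordsPerSecond(totalRecords, seconds));
  console.log(name + ": " + seconds + " s, about " + rate + " records/s");
}

// How many times faster the joined query was than the nested entities.
const speedup = timingsInSeconds.nestedEntities / timingsInSeconds.joinedQuery;
console.log("joined vs nested speedup: " + speedup.toFixed(1) + "x");
```

So the joined query was close to six times faster than the nested entities on this dataset.  These are averages from single runs on my machine, so treat them as a rough comparison only.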

Things to Consider:  Running one (joined) query in a single <entity> element generally performs much better, but we should consider the total size of the data that is processed on the database and transferred to Solr.  When the data is very large, the query can take a lot of memory on the database server.  With a very large transfer, it can also take a long time before Solr receives and processes the first row.

Comment: During data updates, including imports, creating or recreating a searcher and the warming-up process are important concepts for Solr search.  Unless I mention otherwise, I used the default configuration.  For example, we briefly talked about the 'commit' method using SolrJ in a previous post.  For the data import, the following settings in solrconfig.xml play a role.
<autoCommit>
  <maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>

The default value of solr.autoSoftCommit.maxTime is 3000 (ms), which is defined in the $SOLR_HOME/bin/solr.cmd file.  solr.autoCommit.maxTime is not defined, so the fallback value of 15000 (ms) from the configuration is used.
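The ${name:default} notation in these settings can be read as "use the property if it is defined, otherwise use the value after the colon".  The JavaScript function below is only a small sketch that mimics this notation; it is not Solr's implementation, and the resolvePlaceholders name and the definedProperties object are assumptions made for the example.

```javascript
// Mimics the "${name:default}" notation used in solrconfig.xml.
// Illustration only; this is not how Solr itself is implemented.
function resolvePlaceholders(text, properties) {
  return text.replace(/\$\{([^}:]+)(?::([^}]*))?\}/g, (match, name, fallback) => {
    if (Object.prototype.hasOwnProperty.call(properties, name)) {
      return String(properties[name]); // a defined property wins
    }
    // Otherwise use the default after the colon, or leave the text untouched.
    return fallback !== undefined ? fallback : match;
  });
}

// Assumed state for this post: only the soft commit property is defined.
const definedProperties = { "solr.autoSoftCommit.maxTime": "3000" };

const hardCommitMaxTime = resolvePlaceholders("${solr.autoCommit.maxTime:15000}", definedProperties);
const softCommitMaxTime = resolvePlaceholders("${solr.autoSoftCommit.maxTime:-1}", definedProperties);
console.log("autoCommit maxTime:", hardCommitMaxTime);
console.log("autoSoftCommit maxTime:", softCommitMaxTime);
```

With only the soft commit property defined, the hard commit falls back to 15000 and the soft commit uses 3000 instead of its -1 default.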

This web page is a good reference for the soft commit and hard commit.


Tuesday, August 4, 2015

Solr: Setting to Import Data from Database

We often have data in a relational database and want to use Solr's search capability.  In that case, we need to import the data from the database tables and build a Solr index.  In this post, I will describe the basic procedure for using Solr's DataImportHandler.

This post shows the settings needed to import data from a database, and the next post talks about the performance of the import.

Before you go further, I assume you have completed the Solr installation and created a core as shown in my previous post.


1.  To use the data import handler, we need the necessary jar files: the Solr data import jar files and the JDBC driver for your database.  You can choose the directory where you want to store the jar files.  I will use a lib/dih directory under the core we created in my previous post, so mine is the $SOLR_HOME/example/solr/addressDB/lib/dih directory.

The Solr jar files are located in the $SOLR_HOME/dist directory.  Let's copy the solr-dataimporthandler-4.10.2.jar and solr-dataimporthandler-extras-4.10.2.jar files to the lib/dih directory.

I am using PostgreSQL, so I also copied postgresql-9.3-1101.jdbc41.jar to the lib/dih directory.

2.  Add these jar files to Solr's classpath.  To do this, open solrconfig.xml and add the following line in front of the existing <lib .... /> list.

   <lib dir="./lib/dih" regex=".*\.jar" />
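To see which files the regex attribute would pick up, here is a small JavaScript sketch.  It assumes the pattern has to match the whole file name, and the last two file names in the list are made up to show non-matching cases.

```javascript
// The same pattern as in the <lib> line, anchored so that it must match
// the whole file name (an assumption about how the filter is applied).
const jarPattern = new RegExp("^(?:.*\\.jar)$");

const filesInLibDih = [
  "solr-dataimporthandler-4.10.2.jar",
  "solr-dataimporthandler-extras-4.10.2.jar",
  "postgresql-9.3-1101.jdbc41.jar",
  "README.txt",    // hypothetical file, not from the post
  "notes.jar.bak", // hypothetical file, not from the post
];

const matchedJars = filesInLibDih.filter((name) => jarPattern.test(name));
console.log(matchedJars);
```

Only the three jar files copied in step 1 match, so other files left in the directory would be ignored.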

3.  Add the data import handler to Solr by adding the following to the solrconfig.xml file.  It says that the necessary configuration is specified in the data-config.xml file shown in step 4.

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="indent">false</str>
  </lst>
</requestHandler>

I added this after the </updateHandler> tag.

Before we talk about how to write the DB queries, let me describe the design of the address database tables.  The tables are not well designed; the purpose is simply to have more than one table, so that I can show how the performance of the data import depends on the way the DB queries are written.

I assume all the data has already been loaded into the DB tables from the text data files in 'data/addressDB'.

Table 'address_state' on PostgreSQL
   CREATE TABLE IF NOT EXISTS ADDRESS_STATE
   (
     address_state_id bigint NOT NULL 
                 DEFAULT nextval('ADDRESS_STATE_ID_SEQ'::regclass),
     state character varying(30) NOT NULL,
     state_en character varying(30),
     create_time timestamp with time zone NOT NULL 
                                      DEFAULT current_timestamp(0),
     update_time timestamp with time zone,
     CONSTRAINT PK_ADDRESS_STATE_ID PRIMARY KEY (address_state_id),
     CONSTRAINT UK_ADDRESS_STATE UNIQUE (state)
   )

Table 'address_db' on PostgreSQL

   CREATE TABLE IF NOT EXISTS ADDRESS_DB
   (
     address_db_id bigint NOT NULL DEFAULT 
                          nextval('ADDRESS_DB_ID_SEQ'::regclass),
     area_code character(5) NOT NULL,
     state_id bigint NOT NULL,
     city character varying(30),
     city_en character varying(30),
     sub_city character varying(30),
     sub_city_en character varying(30),
     street_code bigint,
     street_name character varying(30) NOT NULL,
     street_name_en character varying(50),
     is_basement boolean default false,
     building_num smallint,
     building_num_sub smallint,
     building_mgm_num character varying(50),
     bulk_delivery_place_name character varying(30),
     building_name character varying(30),
     legal_dong_code bigint,
     dong_name character varying(30),
     ri_name character varying(30),
     is_mountain boolean default false,
     ground_num smallint,
     dong_seq smallint,
     ground_num_sub smallint,
     postal_code bigint NOT NULL,
     postal_code_seq character varying(10),
     create_time timestamp with time zone NOT NULL 
                                  DEFAULT current_timestamp(0),
     update_time timestamp with time zone,
     CONSTRAINT PK_ADDRESS_DB_ID PRIMARY KEY (address_db_id),
     CONSTRAINT FK_ADDRESS_STATE_ID FOREIGN KEY (state_id)
                    REFERENCES ADDRESS_STATE (address_state_id)
   )


Basically, the address_db table has a foreign key 'state_id' that references 'address_state_id' in the address_state table.


4.  In step 3, we specified a configuration file named 'data-config.xml', so we need this file in the 'conf' directory where the solrconfig.xml file is located.  The data-config.xml file is shown below, but let's go over a few important points first.

   4-1.  <dataSource> tag:  defines the DB connection information.
   4-2.  <script> tag and transformer="script:getFullName" in the first <entity> element:  We can define a JavaScript function that is called for each row of the entity before its fields are constructed.  The function manipulates each row to produce the fields we need.
   4-3.  <entity> tags:  The <entity> structure shown here is explained in many documents, including the Solr reference guide: the outer <entity> contains a child <entity>, and this is how people join two tables (see how the 'where' clause in the inner <entity> is used).  I will follow this approach here, but this structure is unfortunately very slow.  I will describe a faster version in the next post.

That is all for the basic setup needed to run the data import handler.  All changes to solrconfig.xml shown in this post are included in the file in xmlFils/SolrConfig, but they are commented out, so you need to uncomment them.

======================= data-config.xml file ==============================
<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
  <dataSource name="addressDB" type="JdbcDataSource" driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost:5555/address" user="youruser" password="yourpass"/>
  <script><![CDATA[
    function getFullName(row){
      var grdNum = row.get('ground_num');
      var grdNumSub = row.get('ground_num_sub');
      var bldNum = row.get('building_num');
      var bldNumSub = row.get('building_num_sub');
      var grdNumber = '';
      var bldNumber = '';
      if(grdNum){
        grdNumber = grdNum;
        if(grdNumSub){
          grdNumber = grdNumber + '-' + grdNumSub;
        }
      }
      if(bldNum){
        bldNumber = bldNum;
        if(bldNumSub){
          bldNumber = bldNumber + '-' + bldNumSub;
        }
      }
      row.put('building_number', bldNumber);
      row.put('ground_number', grdNumber);
      return row;
    }
  ]]></script>
  <document>
    <entity name="address" transformer="script:getFullName" dataSource="addressDB"
            query="SELECT address_db_new_id, area_code, state_id, city, city_en,
                   sub_city, sub_city_en, street_name, street_name_en,
                   building_num, building_num_sub, bulk_delivery_place_name,
                   building_name, legal_dong_name, ri_name, admin_dong_name,
                   ground_num, ground_num_sub, dong_seq, postal_code FROM address_db_new">
      <field column="address_db_new_id" name="addressId" />
      <field column="area_code" name="areaCode" />
      <field column="postal_code" name="postalCode" />
      <field column="city" name="city" />
      <field column="city_en" name="city_en" />
      <field column="sub_city" name="subCity" />
      <field column="sub_city_en" name="subCity_en" />
      <field column="street_name" name="streetName" />
      <field column="street_name_en" name="streetName_en" />
      <field column="building_number" name="buildingNumber" />
      <field column="bulk_delivery_place_name" name="bulkDeliveryPlaceName"/>
      <field column="building_name" name="buildingName"/>
      <field column="legal_dong_name" name="dongName"/>
      <field column="admin_dong_name" name="adminDongName"/>
      <field column="ri_name" name="riName"/>
      <field column="ground_number" name="groundNumber"/>
      <field column="dong_seq" name="dongSeq"/>
      <entity name="state" dataSource="addressDB"
              query="SELECT * FROM address_state_new
                     WHERE address_state_new_id = ${address.state_id}">
        <field column="state" name="state" />
        <field column="state_en" name="state_en" />
      </entity>
    </entity>
  </document>
</dataConfig>
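The getFullName function from the listing above can also be tried outside Solr.  The sketch below repeats the function and passes it a small stand-in row object; makeRow is a hypothetical helper that only imitates the get/put calls the function uses, and the sample values are made up.  Inside Solr the row is a Java map, so value types may behave a little differently.

```javascript
// Hypothetical stand-in for the row that Solr passes to a script transformer.
// It only supports the get/put calls that getFullName uses.
function makeRow(values) {
  const data = new Map(Object.entries(values));
  return {
    get: (key) => data.get(key),
    put: (key, value) => { data.set(key, value); },
  };
}

// Same logic as the transformer in data-config.xml: build "main-sub" numbers.
function getFullName(row) {
  var grdNum = row.get('ground_num');
  var grdNumSub = row.get('ground_num_sub');
  var bldNum = row.get('building_num');
  var bldNumSub = row.get('building_num_sub');
  var grdNumber = '';
  var bldNumber = '';
  if (grdNum) {
    grdNumber = grdNum;
    if (grdNumSub) {
      grdNumber = grdNumber + '-' + grdNumSub;
    }
  }
  if (bldNum) {
    bldNumber = bldNum;
    if (bldNumSub) {
      bldNumber = bldNumber + '-' + bldNumSub;
    }
  }
  row.put('building_number', bldNumber);
  row.put('ground_number', grdNumber);
  return row;
}

// Made-up sample row: a ground number with a sub number, a building number without one.
const sampleRow = getFullName(makeRow({
  ground_num: 45, ground_num_sub: 2, building_num: 12, building_num_sub: null,
}));
console.log(sampleRow.get('ground_number'), sampleRow.get('building_number'));

// A row without any numbers ends up with two empty strings.
const emptyRow = getFullName(makeRow({}));
console.log(JSON.stringify(emptyRow.get('ground_number')));
```

Note that a main number of 0 counts as "missing" here because of the if(grdNum) check, so the combined field would be empty in that case.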

