Friday, July 24, 2015

Solr: Indexing Data using a SolrJ Client. Ver. 2

My first version of the client, which was very slow, followed the process below, as shown in the previous post.

   SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
   for(int i=1; i<=10000; i++){
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField(.....);
      doc.addField(.....);
      
      server.add(doc);
      if(i%200 == 0){
         server.commit();
      }
   }

Let's look at the API of the SolrServer object.
 - add(SolrInputDocument doc) : Adds a single document
 - commit() :  Performs an explicit commit, causing pending documents to be committed for indexing.

People may expect a batch process: keep 200 documents in the server object and send them all to the server when the commit method is called.  Unfortunately, this is not how it works.

If you look at solr.log in the $SOLR_HOME/example/logs directory, you will see a separate update request logged for each call of the add method.  You can also check the source code of the SolrServer object.  As shown below, an update request is processed on every add method call.

  public UpdateResponse add(SolrInputDocument doc ) throws 
                      SolrServerException, IOException {
    return add(doc, -1);
  }

  public UpdateResponse add(SolrInputDocument doc, int commitWithinMs) throws 
                      SolrServerException, IOException {
    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setCommitWithin(commitWithinMs);
    return req.process(this);
  }

The Solr server itself has a kind of batch behavior: changed/added documents are not visible to the current searcher, and a commit opens a new searcher with the updated index.  Data added to Solr is not searchable until a commit is made.  So, the description of the API explains what is happening on the Solr server.
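As the add(doc, commitWithinMs) overload above suggests, another option is to let Solr decide when to commit by passing a commitWithin value instead of calling commit() yourself.  A minimal sketch, reusing the placeholder URL and a field name from this example:

   SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("addressId", 1);
   // Ask Solr to make this document searchable within 10 seconds,
   // without an explicit commit() call from the client.
   server.add(doc, 10000);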

1.  Solr Client version 2.
This is the updated process.

   SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
   int dataCount = 0;
   List<SolrInputDocument> docList = 
                     new ArrayList<SolrInputDocument>();

   for(int i=1; i<=SOME_DATA_SIZE; i++){
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField(.....);
      doc.addField(.....);
      
      docList.add(doc);
      dataCount++;
      
      if(i%200 == 0){
         server.add(docList);
         server.commit();
         docList.clear();
      }
   }

   if(docList.size() > 0){
      server.add(docList);
      server.commit();
      docList.clear();
   }

This client added all 6,071,307 address records to the Solr server in about 410 seconds on my machine.  (Version 1: about 263,000 records processed in 600 seconds.)  In my full Java code, updated from version 1, I called the add method for every 1,000 documents and the commit method for every 2,000 documents.  Calling both methods at the same point is not necessary; the commit method can be called even less frequently than the add method.

To see all of the added data, open http://localhost:8983/solr/ in a browser.  Then select the core 'addressDB' --> click 'Query' --> click the 'Execute Query' button.  By default, it returns 10 documents.
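You can run the same check from SolrJ as well.  A minimal sketch (the core URL is the one used above; the row count of 20 is just an example value):

   // SolrQuery and QueryResponse come from the solr-solrj dependency
   SolrServer server = new HttpSolrServer("http://localhost:8983/solr/addressDB");
   SolrQuery query = new SolrQuery("*:*");   // match all documents
   query.setRows(20);                        // return 20 documents instead of the default 10
   QueryResponse response = server.query(query);
   System.out.println("Total found: " + response.getResults().getNumFound());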




Thursday, July 23, 2015

Solr: Indexing Data using a SolrJ Client. Ver. 1

I assume you have finished the Solr installation and created a core as shown in my previous post.  I will also use the address dataset explained in another post, so it would be helpful if you understand that dataset a little bit.


1.  Create a schema.xml under the $SOLR_HOME/example/solr/addressDB/conf directory.  From the data files, I will use 19 fields to make one document.  Each of these fields should be defined as a <field> element in this file.  This is the schema.xml I will use, and it is in this repository.  It is a very simple version of a schema file for this simple example.  To take advantage of Solr's search capabilities, we will need to update this file in a future post.



The field named '_version_' is used for Solr's optimistic concurrency control, and it is assigned automatically when an add is performed.  To update a document, you can retrieve the version along with the other fields, change the field(s), and pass the retrieved version back when the document is saved.  If the existing document has a different version number, Solr will return an error for the update.
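A minimal SolrJ sketch of that update flow (the field names follow this example's schema, and the query value is hypothetical):

   SolrServer server = new HttpSolrServer("http://localhost:8983/solr/addressDB");

   // Read the current document, including its _version_ value.
   SolrQuery query = new SolrQuery("addressId:1");
   SolrDocument current = server.query(query).getResults().get(0);
   Long version = (Long) current.getFieldValue("_version_");

   // Save the document again, passing the retrieved version back.
   SolrInputDocument updated = new SolrInputDocument();
   updated.addField("addressId", current.getFieldValue("addressId"));
   updated.addField("buildingName", "New building name");
   updated.addField("_version_", version);   // Solr rejects the add if the stored version differs
   server.add(updated);
   server.commit();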

When a field's type is string_ci, any space in the field value is removed both when the index is created and when a search query is run.

When a field's type is string_ng, any space and comma in the field value are removed both when the index is created and when a search query is run.  In addition, an EdgeNGram filter is applied during index creation.  For example, say maxGramSize=3 and a field value is 'my test'.  In this case, this field value will have three indexed terms associated with it: 'm', 'my', and 'myt' (after the space is eliminated by the specified solr.PatternReplaceFilterFactory).  This means the field value 'my test' will be one of the returned values when a search query passes 'my' (without quotes).


2.  In my previous post, the Solr server showed an error because the schema.xml file did not exist.  Now, you should see the Solr server with a core named addressDB.



Now, I will write a client application using SolrJ and add documents to the Solr server.

3.  Download the address dataset for our example.  See the previous post for details about the dataset.

4.  Posting data to the Solr server from a Java client using SolrJ is simple.  First of all, this is the pom.xml for the client project:

<project xmlns="http://maven.apache.org/POM/4.0.0" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jihwan.learn.solr</groupId>
  <artifactId>address-db-client</artifactId>
  <version>1.0</version>
  
  <dependencies>
    <dependency>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
      <version>4.10.2</version>
    </dependency>
    <dependency>
      <groupId>commons-codec</groupId>
      <artifactId>commons-codec</artifactId>
      <version>1.10</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>1.7.6</version>
    </dependency>
  </dependencies>
</project>


5.  This is the first version, the basic format of a client process that is shown on several tutorial pages.

   SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
   for(int i=1; i<=10000; i++){
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField(.....);
      doc.addField(.....);
      
      server.add(doc);
      if(i%200 == 0){
         server.commit();
      }
   }

(** In Solr 5+, the HttpSolrServer class is deprecated.  Use the HttpSolrClient class instead.)
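For reference, the SolrJ 5.x equivalent would look roughly like this (a sketch, assuming a SolrJ 5.x release where HttpSolrClient still offers a URL constructor):

   // SolrJ 5.x: SolrClient / HttpSolrClient replace SolrServer / HttpSolrServer
   SolrClient client = new HttpSolrClient("http://localhost:8983/solr/addressDB");
   SolrInputDocument doc = new SolrInputDocument();
   doc.addField("addressId", 1);
   client.add(doc);
   client.commit();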

Unfortunately, this is extremely slow for loading large data.  On my MacBook Pro, only about 263,000 address records were processed in 10 minutes.

In my next post, I will show the next version of the client code after a slight modification.

############################################################################
This is the first version of a Java client code.

package com.jihwan.learn.solr;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AddressClient {
  // The directory where the address files are located.
  private static final String FILE_DIR = "SOME_DIRECTORY/data/addressDB/";

  public static void main(String[] args) throws FileNotFoundException {

    SolrServer server = new HttpSolrServer("http://localhost:8983/solr/addressDB");
     
    File folder = new File(FILE_DIR);
    File[] listOfFiles = folder.listFiles();

    File txtfile = null;
    BufferedReader br = null;

    try {
      long startTime = System.currentTimeMillis();
      int dataCount = 0;

      for (File file : listOfFiles) {
        if (file.isFile()) {
          if (file.getName().endsWith(".txt")) {
            String line;
            System.out.println("Read a file " + file.getName());
                  
            txtfile = new File(FILE_DIR + file.getName());
            br = new BufferedReader(new InputStreamReader(new FileInputStream(txtfile)));
           
            // Dump headers
            line = br.readLine(); // English
            line = br.readLine(); // Korean

            while ((line = br.readLine()) != null) {
              SolrInputDocument doc = lineParser(line, ++dataCount); 
              server.add(doc);
                     
              if (dataCount % 1000 == 0) {
                server.commit();
              }
            }

            if (dataCount % 1000 != 0) {
              server.commit();
            }
            
            br.close();
          }
        }
      }
         
      long endTime = System.currentTimeMillis();
      System.out.println("Execution Time: " + (endTime-startTime));
         
    }catch (IOException ioE) {
      ioE.printStackTrace();
    }catch (SolrServerException e) {
      e.printStackTrace();
    }finally {
      try {
        // br can be null if no .txt file was found
        if (br != null) {
          br.close();
        }
      }catch (IOException e) {
        e.printStackTrace();
      }
    }
  }
  public static SolrInputDocument lineParser(String line, int id) {
     String[] lineTerms = line.split("\\|");
     int parseIndex = 0;

     SolrInputDocument doc = new SolrInputDocument();
      
     doc.addField("addressId", addressId);
     doc.addField("areaCode", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("state", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("state_en", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("city", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("city_en", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("subCity", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("subCity_en", nonNullTrim(lineTerms[parseIndex++]));
     parseIndex++; //Skip street_code
      
     doc.addField("streetName", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("streetName_en", nonNullTrim(lineTerms[parseIndex++]));
     parseIndex++; //Skip is_basement
      
     String bldNumber = nonNullTrim(lineTerms[parseIndex++]);
     if (!isEmpty(lineTerms[parseIndex])) {
        bldNumber = bldNumber + "-" + lineTerms[parseIndex].trim();
     }
     parseIndex++;
     doc.addField("buildingNumber", bldNumber);
     parseIndex++; //Skip building_mgm_num

     doc.addField("bulkDeliveryPlaceName", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("buildingName", nonNullTrim(lineTerms[parseIndex++]));
     parseIndex++; //Skip legal_dong_code
      
     doc.addField("dongName", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("riName", nonNullTrim(lineTerms[parseIndex++]));
     doc.addField("adminDongName", nonNullTrim(lineTerms[parseIndex++]));
     parseIndex++; // skip is_mountain

     String grdNumber = nonNullTrim(lineTerms[parseIndex++]);
     doc.addField("dongSeq", nonNullTrim(lineTerms[parseIndex++]));
      
     if (!isEmpty(lineTerms[parseIndex])) {
       grdNumber = grdNumber + "-" + lineTerms[parseIndex];
     }
     parseIndex++;
     doc.addField("groundNumber", grdNumber);
      
     doc.addField("postalCode", nonNullTrim(lineTerms[parseIndex++]));
      
     return doc;
  }

  public static boolean isEmpty(String s) {
    return s == null || s.trim().length() < 1;
  }

  public static String nonNullTrim(String value) {
    return (isEmpty(value) ? null : value.trim());
  }
}


Sunday, July 19, 2015

Apache Solr: Installation and Creating a Core

In this post, I will briefly show how to install Apache Solr.  Solr 4.10.2 is used for this post, and I assume you have Java installed on your machine.

Note: At first, I used Solr 5.2.1 and found that the generated solrconfig.xml uses a dynamic schema along with a mechanism for handling unknown fields.  To leverage my existing knowledge, I decided to use version 4.10.2.

I. Installation.
1.  Download Apache Solr and unzip it.  This is my initial directory structure of Solr:

2. Create a core for this address search.

     $ cd example/solr
     $ mkdir addressDB
     $ mkdir addressDB/conf

Now, create a core.properties file under the addressDB directory.  This file will have the single line shown below.  'addressDB' is the name of the core we will use.

    name=addressDB

3.  Create the necessary files under the addressDB/conf directory.  At this point, solrconfig.xml is a required file, and it can be copied from $SOLR_HOME/example/solr/collection1/conf/solrconfig.xml.

In solrconfig.xml, there is a configuration for the Query Elevation Component.  We will comment out the following elements in the file at this point.

  <searchComponent name="elevator" class="solr.QueryElevationComponent" >
    <!-- pick a fieldType to analyze queries -->
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

  <!-- A request handler for demonstrating the elevator component -->
  <requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <str name="echoParams">explicit</str>
      <str name="df">text</str>
    </lst>
    <arr name="last-components">
      <str>elevator</str>
    </arr>
  </requestHandler>

Also, copy all 'admin-*.html' files from $SOLR_HOME/example/solr/collection1/conf to $SOLR_HOME/example/solr/addressDB/conf
     
After this, the directory structure of the addressDB directory looks like this:
 


4.  Run the following command to start Solr.  The default port number is 8983.
    $ bin/solr start

You should be able to open the Solr admin web UI on localhost:8983/solr

Your Solr server started correctly, but the addressDB core has an initialization error because the schema.xml file does not exist yet.  This file will be created in the next post.

5. Run the following command to stop the server.
    $ bin/solr stop -all

6. By default, the command 'bin/solr start' uses a maximum of 512M heap memory.  To increase the heap memory, you can specify the size, e.g.  $ bin/solr start -m 1024M

7.  There is an example directory named 'collection1' under the $SOLR_HOME/example/solr directory.  At this point, just rename the core.properties file inside 'collection1' to 'core.propertiesTemp'.  This is not required, but otherwise the configuration in my later posts "may" conflict with this collection1.
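For example (assuming the standard directory layout used in the steps above):

     $ cd $SOLR_HOME/example/solr/collection1
     $ mv core.properties core.propertiesTemp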


II.  Files You Should Know

These files are located at this repository.
1.  solr.xml
This file is located in the $SOLR_HOME/example/solr directory.  It defines properties related to host, logging, sharding, and SolrCloud.  You may want to open this file and look at it.  We will not change this file at this point.

2.  solrconfig.xml
This file is located in the $SOLR_HOME/example/solr/addressDB/conf directory.

This file contains a lot of important configuration data.  You should at least open this file and look at its contents.  The generated file contains a well-documented description for each configuration.  The file name can be changed in core.properties, but we will keep the default name, solrconfig.xml.

3.  core.properties
This file is located in the $SOLR_HOME/example/solr/addressDB directory.  It can have several properties, but the one required property is the 'name' property.  In this file, you should have at least the one line shown below.

     name=addressDB

'addressDB' is the name of the core you will use.

If this file exists but there is no name value, the default core name is the name of the directory that contains this file.  When this file doesn't exist, the core will not be auto-detected.

4.  schema.xml
This file defines the structure of your index, including fields and their field types.  This file should also be located in the $SOLR_HOME/example/solr/addressDB/conf directory, but it is NOT generated automatically.  I will talk about this file more when I create and run a Solr index.


Specification of My Laptop & Address Dataset Used on My Posts

1.  Specification of my Macbook.

Processor Name:         Intel Core i7
Processor Speed:        2.3 GHz
Number of Processors:   1
Total Number of Cores:  4
L2 Cache (per Core):    256 KB
L3 Cache:               6 MB
Memory:                 16 GB


2.  Dataset: Address DB  
      (Updated on Aug. 6, 2015: Use new address files with a new 5 digit postal code)
In my posts, I will use Korean address data.  This data is published by the Korea Post Office.  The total number of records is 6,071,307, spread across 17 different files, so the dataset is reasonably large for my Solr and/or Hadoop examples.

The files are located at my public repository at https://bitbucket.org/jihwan11/openfiles 
The data is for Korean addresses, and its contents are written in Korean.  Nevertheless, some fields such as city_en, state_en, etc. are in English.  I think these fields will be enough for my examples.

Each line of each file describes one address, and each field is separated by a | (vertical bar).
This is the format of each line.  

area_code|state|state_en|city|city_en|sub_city|sub_city_en|street_code|street_name|street_name_en|is_basement|building_num|building_num_sub|building_mgm_num|bulk_delivery_place_name|building_name|legal_dong_code|legal_dong_name|ri_name|admin_dong_name|is_mountain|ground_num|dong_seq|ground_num_sub|postal_code|postal_code_seq

area_code is the new 5-digit postal code, and it may have a leading 0.  postal_code and postal_code_seq are the old postal code.

The data starts from the 3rd line of each file, and an example line is:
06309|서울특별시|Seoul|강남구|Gangnam-gu|||116804166060|개포로30길|Gaepo-ro 30-gil|0|15|0|1168010300102080012021276||LG전선|1168010300|개포동||개포4동|0|1208|01|12|135962|001
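If you want a quick feel for the format, here is a minimal Java sketch (class and variable names are just for illustration) that splits the example line above on the vertical bar:

   public class AddressLineDemo {
     public static void main(String[] args) {
       String line = "06309|서울특별시|Seoul|강남구|Gangnam-gu|||116804166060|개포로30길|Gaepo-ro 30-gil|0|15|0|1168010300102080012021276||LG전선|1168010300|개포동||개포4동|0|1208|01|12|135962|001";
       // The vertical bar must be escaped because split() takes a regular expression.
       String[] fields = line.split("\\|");
       System.out.println("area_code   = " + fields[0]);   // "06309" - keep it as a String to preserve the leading 0
       System.out.println("state_en    = " + fields[2]);   // "Seoul"
       System.out.println("postal_code = " + fields[24]);  // "135962" (old postal code)
     }
   }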

Thursday, July 2, 2015

How to Run Hadoop Code on Your Laptop.


In my previous post, I talked about how to install Hadoop.
Now, I will describe how to run a simple Hadoop program on your machine.

Since I am using Maven and its pom.xml, the Hadoop jar files should be available in your (local) Maven repository.

1.  If you want to install the Hadoop jar files from your local machine, use the command shown below for each of these jars:
     (If you want to pull the jar files from the central Maven repository instead, you can skip this step and define the jar information from the repository correctly in the pom.xml.)
  • hadoop-common-2.6.0.jar and hadoop-nfs-2.6.0.jar in the $HADOOP_PREFIX/share/hadoop/common directory
  • hadoop-mapreduce-client-common-2.6.0.jar and hadoop-mapreduce-client-core-2.6.0.jar in the $HADOOP_PREFIX/share/hadoop/mapreduce directory
  • slf4j-api-1.7.5.jar and slf4j-log4j12-1.7.5.jar in the $HADOOP_PREFIX/share/hadoop/common/lib directory (these may not be required)
Then,
      mvn install:install-file -Dfile=<path-to-file> -DgroupId=<group-id> -DartifactId=<artifact-id> -Dversion=<version> -Dpackaging=<packaging> -DgeneratePom=true

For example,
    $ cd $HADOOP_PREFIX/share/hadoop/common 
    $ mvn install:install-file -Dfile=hadoop-common-2.6.0.jar -DgroupId=org.apache.hadoop -DartifactId=hadoop-common -Dversion=2.6.0 -Dpackaging=jar -DgeneratePom=true

(If you prefer, you can use your own values for the groupId and the artifactId, but you need to use the same values on the pom.xml shown below.)


2. Create a maven project.
     This is my maven project in Eclipse.

  • pom.xml

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.jihwan.learn.hadoop</groupId>
      <artifactId>chapterThree</artifactId>
      <version>0.1</version>
      <name>Chapter three examples</name>
      <properties>
         <hadoop.version>2.6.0</hadoop.version>
      </properties>
      <dependencies>
         <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
         </dependency>
         <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-nfs</artifactId>
            <version>${hadoop.version}</version>
         </dependency>
         <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-common</artifactId>
            <version>${hadoop.version}</version>
         </dependency>
         <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
         </dependency>
      </dependencies>
    </project>
    
    
  • Java code: It is from the book "Hadoop: The Definitive Guide" by Tom White.

         package com.jihwan.learn.hadoop;
    
         import java.io.InputStream;
         import java.net.URL;
         import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
         import org.apache.hadoop.io.IOUtils;
    
         public class URLCat {
            static{
               URL.setURLStreamHandlerFactory(
                   new FsUrlStreamHandlerFactory());
            } 
    
            public static void main(String[] args) throws Exception{
               InputStream in = null;
               try{
                  in = new URL(args[0]).openStream();
                  IOUtils.copyBytes(in, System.out, 4096, false);
               }finally{
                  IOUtils.closeStream(in);
               }
            }
         }
    
    
  • quangle.txt: It is also from the book "Hadoop: The Definitive Guide" by Tom White.

         On the top of the Crumpetty Tree
         The Quangle Wangle sat,
         But his face you could not see,
         On account of his Beaver Hat.

3. Create a jar file using a maven command
    $ mvn package

4. It creates a chapterThree-0.1.jar file under the target directory

5. Before you run URLCat, make sure the local Hadoop server is running.  These prerequisite steps are shown in the previous post.
   $ cd $HADOOP_PREFIX
   $ bin/hdfs namenode -format
   $ sbin/start-dfs.sh
   $ bin/hdfs dfs -mkdir /user
   $ bin/hdfs dfs -mkdir /user/jihwan       #jihwan is my user id
 
    Then change the directory to your Maven project directory for the next step.

6.  After running start-dfs.sh, you should be able to open http://localhost:50070/, and it should look like this.
 


7.  Let's copy the local file 'quangle.txt' in your project home directory to the Hadoop server.
    $ hadoop fs -copyFromLocal quangle.txt  quangleCopy.txt

8.  Now, it is time to run the URLCat application.
    $ cd target
    $ export HADOOP_CLASSPATH=chapterThree-0.1.jar
    $ hadoop  com.jihwan.learn.hadoop.URLCat  hdfs://localhost:9000/user/jihwan/quangleCopy.txt



This is it. Have fun!!!
