Sunday, November 26, 2017

Java 9: Flow - Reactive Programming

The programming world has always changed quickly, and many programming and design paradigms have been introduced: object-oriented programming, domain-driven design, functional programming, reactive programming, and so on, although they do not always conflict with each other.  Java 8 introduced Streams and lambda expressions in support of functional programming.  Java 9 introduces the Flow API, which establishes a common interface for reactive programming.

Reactive Streams is a standard API for asynchronous stream processing with non-blocking back pressure.  According to the Reactive Manifesto, Reactive Systems are Responsive, Resilient, Elastic, and (Asynchronously) Message Driven.  Large systems are composed of smaller ones and therefore depend on the reactive properties of their constituents.

Java 9 Flow API
The java.util.concurrent.Flow class in Java 9 corresponds to the Reactive Streams API, which already has supporting implementations such as Akka, Ratpack, Reactor, etc.  Flow has four nested interfaces: Flow.Processor, Flow.Publisher, Flow.Subscriber, and Flow.Subscription.   Among these four interfaces, Java 9 provides only one implementing class, of the Publisher: java.util.concurrent.SubmissionPublisher.

@FunctionalInterface
public static interface Publisher<T> {
   public void subscribe(Subscriber<? super T> subscriber);
}
 
public static interface Subscriber<T> {
   public void onSubscribe(Subscription subscription);
   public void onNext(T item);
   public void onError(Throwable throwable);
   public void onComplete();
}
 
public static interface Subscription {
   public void request(long n);
   public void cancel();
}
 
public static interface Processor<T,R> extends Subscriber<T>, Publisher<R> {
}

The Publisher publishes a stream of data to which the Subscriber is asynchronously subscribed.  A Publisher usually defines its own Subscription implementation as an inner class, as the private BufferedSubscription class in the SubmissionPublisher class does, and it usually publishes items to the subscriber asynchronously using an Executor (with an ExecutorService or ForkJoinPool as the implementing class) for better work management.  The Subscription's two methods let a Subscriber request items or cancel the subscription based on the Subscriber's need and capacity.  This is the concept of 'back pressure', defined under the "Message Driven" property of Reactive Systems, and I will briefly talk about back pressure later in this article.
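To make the Publisher / inner-Subscription relationship concrete, here is a minimal sketch of a custom Publisher.  The class name OneShotPublisher and its synchronous delivery are my own simplifications for illustration; SubmissionPublisher itself buffers items and delivers them asynchronously.

```java
import java.util.List;
import java.util.concurrent.Flow.Publisher;
import java.util.concurrent.Flow.Subscriber;
import java.util.concurrent.Flow.Subscription;

// A minimal, synchronous Publisher sketch. Each subscriber gets its own
// Subscription instance, just as SubmissionPublisher creates a
// BufferedSubscription per subscriber (a real implementation would
// deliver items asynchronously via an Executor).
public class OneShotPublisher implements Publisher<String> {
   private final List<String> items;

   public OneShotPublisher(List<String> items) {
      this.items = items;
   }

   @Override
   public void subscribe(Subscriber<? super String> subscriber) {
      subscriber.onSubscribe(new Subscription() {
         private int index = 0;       // next item to deliver
         private boolean done = false;

         @Override
         public void request(long n) {
            // Deliver at most n items: this is where back pressure happens.
            for (long i = 0; i < n && index < items.size() && !done; i++) {
               subscriber.onNext(items.get(index++));
            }
            if (!done && index == items.size()) {
               done = true;
               subscriber.onComplete();
            }
         }

         @Override
         public void cancel() {
            done = true;
         }
      });
   }
}
```

A subscriber that calls subscription.request(1) against this publisher receives exactly one item until it asks again, which is the whole point of the Subscription indirection.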

The Subscriber arranges for items to be requested and processed.  Here is a short description of each method defined in the API.
 - onComplete() :  Method invoked when it is known that no additional Subscriber method invocations will occur for a Subscription that is not already terminated by error, after which no other Subscriber methods are invoked by the Subscription.

 - onError(Throwable throwable) :  Method invoked upon an unrecoverable error encountered by a Publisher or Subscription, after which no other Subscriber methods are invoked by the Subscription.

 - onNext(T item) :  Method invoked with a Subscription's next item.

 - onSubscribe(Subscription subscription) :  Method invoked prior to invoking any other Subscriber methods for the given Subscription.

Example Code
Let's now see an implementation example using a SubmissionPublisher provided in Java 9.

- Implementation of Subscriber

import java.util.concurrent.Flow.Subscriber;
import java.util.concurrent.Flow.Subscription;

public class MySubscriber<T> implements Subscriber<T> {
   private Subscription subscription;
   private String name;

   public MySubscriber(String name) {
      this.name = name;
   }

   @Override
   public void onComplete() {
      System.out.println(name + ": onComplete");
   }

   @Override
   public void onError(Throwable t) {
      System.out.println(name + ": onError");
      t.printStackTrace();
   }

   @Override
   public void onNext(T msg) {
      System.out.println(name + ": " + msg.toString() + " received in onNext");
      subscription.request(1);
   }

   @Override
   public void onSubscribe(Subscription subscription) {
      System.out.println(name + ": onSubscribe");
      this.subscription = subscription;
      subscription.request(1);
   }
}


- The Main using the SubmissionPublisher

import java.util.concurrent.SubmissionPublisher;

public class FlowMain {

   public static void main(String[] args) {
      SubmissionPublisher<String> publisher = new SubmissionPublisher<>();

      MySubscriber<String> subscriber = new MySubscriber<>("Mine");
      MySubscriber<String> subscriberYours = new MySubscriber<>("Yours");
      MySubscriber<String> subscriberHis = new MySubscriber<>("His");
      MySubscriber<String> subscriberHers = new MySubscriber<>("Hers");

      publisher.subscribe(subscriber);
      publisher.subscribe(subscriberYours);
      publisher.subscribe(subscriberHis);
      publisher.subscribe(subscriberHers);

      publisher.submit("One");
      publisher.submit("Two");
      publisher.submit("Three");
      publisher.submit("Four");
      publisher.submit("Five");

      try {
         Thread.sleep(1000);
      } catch (InterruptedException e) {
         e.printStackTrace();
      }
      publisher.close();
   }
}

Let's run the main application first and then talk about how this Flow application works.

- Output
This is one possible output; the exact interleaving depends on the parallel execution at runtime.

Mine: onSubscribe
Hers: onSubscribe
His: onSubscribe
Yours: onSubscribe
Yours: One received in onNext
His: One received in onNext
Mine: One received in onNext
Hers: One received in onNext
Mine: Two received in onNext
His: Two received in onNext
Yours: Two received in onNext
His: Three received in onNext
Mine: Three received in onNext
Hers: Two received in onNext
Mine: Four received in onNext
His: Four received in onNext
Yours: Three received in onNext
His: Five received in onNext
Mine: Five received in onNext
Hers: Three received in onNext
Yours: Four received in onNext
Hers: Four received in onNext
Yours: Five received in onNext
Hers: Five received in onNext
Mine: onComplete
Hers: onComplete
His: onComplete
Yours: onComplete

- Explanation of the Flow Application
It has one SubmissionPublisher and four subscribers that subscribe to the publisher.  Although it depends on the implementation, a Publisher is in general implemented asynchronously. The SubmissionPublisher uses the common pool instance of ForkJoinPool unless the environment cannot support parallelism.

I put a breakpoint in the onSubscribe method of MySubscriber.java and on the first submit call in FlowMain.java.  My debug window shows the ForkJoinPool.commonPool in use and multiple workers running.

During a call of the subscribe method of the SubmissionPublisher, the onSubscribe method of MySubscriber is called. The subscription.request(1) call in onSubscribe sets the number of items added to the current unfulfilled demand for this subscription.  When the publisher publishes an item / message with one of the submit calls in FlowMain.java, the onNext method of MySubscriber is called.

Without the subscription.request(1) call in onNext, each subscriber would not receive more than one item, because only one item was requested in onSubscribe.  This is how the subscriber controls the number of items to handle (again, back pressure). For an unbounded number of items, Long.MAX_VALUE can be requested instead:  subscription.request(Long.MAX_VALUE)

For each publication (via the submit method in FlowMain.java), all subscribers of the publisher get a notification with the item / message, as shown in the Output above.  You may notice that the order of the output is not the same as the order of subscription, because each worker in the ForkJoinPool runs in parallel.

When the publisher closes the publication, the onComplete method of the Subscriber is called and the subscriber can do any necessary processing.

Back Pressure
In this article, I mentioned 'back pressure'.  This concept falls under the Message Driven characteristic of Reactive Systems.  The Reactive Manifesto describes back pressure like this:

"When one component is struggling to keep-up, the system as a whole needs to respond in a sensible way. It is unacceptable for the component under stress to fail catastrophically or to drop messages in an uncontrolled fashion. Since it can’t cope and it can’t fail it should communicate the fact that it is under stress to upstream components and so get them to reduce the load. This back-pressure is an important feedback mechanism that allows systems to gracefully respond to load rather than collapse under it. The back-pressure may cascade all the way up to the user, at which point responsiveness may degrade, but this mechanism will ensure that the system is resilient under load, and will provide information that may allow the system itself to apply other resources to help distribute the load."
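As a sketch of how a subscriber can exercise this feedback mechanism through the Flow API, the hypothetical variant of MySubscriber below requests items in small batches instead of one at a time.  BatchSubscriber and BATCH_SIZE are my own names for illustration, not part of the API.

```java
import java.util.concurrent.Flow.Subscriber;
import java.util.concurrent.Flow.Subscription;

// Hypothetical subscriber applying back pressure by requesting items in
// small batches: upstream never has more than BATCH_SIZE items of
// unfulfilled demand for this subscription.
public class BatchSubscriber<T> implements Subscriber<T> {
   private static final int BATCH_SIZE = 4;
   private Subscription subscription;
   private int received = 0;

   @Override
   public void onSubscribe(Subscription subscription) {
      this.subscription = subscription;
      subscription.request(BATCH_SIZE);   // initial demand: at most 4 items
   }

   @Override
   public void onNext(T item) {
      received++;
      if (received % BATCH_SIZE == 0) {
         // Ask for more only after the current batch has been processed.
         subscription.request(BATCH_SIZE);
      }
   }

   @Override
   public void onError(Throwable t) {
      t.printStackTrace();
   }

   @Override
   public void onComplete() {
      System.out.println("done after " + received + " items");
   }
}
```

The batch size would be tuned to the subscriber's processing capacity; requesting Long.MAX_VALUE instead effectively disables back pressure.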

Summary
The paradigm of reactive programming is not new, and many products have already started to support it.  It is great that Java 9 adds the final class Flow and provides a standard way to implement reactive programming.  You may often need to implement your own Publisher class based on the interfaces defined in Flow.  During the implementation, developers need to focus on the core of reactive systems:  Responsive, Resilient, Elastic, and (Asynchronous) Message Driven.

To achieve this, concurrent programming also becomes more important. An understanding of Future / CompletableFuture, Executor / ExecutorService, ForkJoinPool, Callable, etc. is essential, and I may cover them in a later post.

Saturday, November 18, 2017

Lambda Expression / Functional Programming - Not Always Beautiful.

Functional programming has been gaining much attention in the past few years.  Java 8 started to support lambda expressions, and several new programming languages adopted functional programming.  While I have been learning it and watching many developers say how great functional programming is, I have also been remembering an old story I read before the age of 10: The Naked King (or The Emperor's New Clothes)  - everyone says "the most beautiful clothes" until a pure child says "the king is naked!"

I exaggerated.  Functional programming is very useful and deserves to be popular, but it definitely requires careful thought before blindly saying "it is very beautiful".  The blog post "Stop abusing lambda expression" also talks about what people need to be careful of, and I completely agree with it.

In this post, I will show you a few things we need to think about as I convert the simple algorithm shown in the previous post to a lambda expression in Java 8 and to functions in Scala.

- Minimum Difference of Each Two Numbers in a Sorted Array / List
The previous post "Algorithm, Performance, and Back to the Basic" showed one solution.  When the input array is already sorted, this is a simple imperative implementation.
[Code 1]
public int findMinAbsDifferenceSorted(int[] numbers) {
   int min = Integer.MAX_VALUE;
   for(int i=1; i<numbers.length; i++){
      int diff = numbers[i] - numbers[i-1];
      if(diff < min){
         min = diff;
      }
   }
   return min;
}

- Functional Way
I am not very good at functional programming yet, so I thought about how to convert the simple imperative method into a functional style.  Since functional programming is supposed to be simple, more maintainable, and less error prone, this is how I wrote it.  I forced myself to write the method this way, but I think creating another IntStream may not be a good functional approach. (Functional programming experts: please suggest better code.)
[Code 2]
public int findMinAbsDifferenceSortStream(int[] numbers) {
   return IntStream.range(0, numbers.length-1)
            .map(i -> numbers[i+1] - numbers[i])
            .min().getAsInt();
}

- Methods in Scala
We can convert the imperative method to functional programming in a few different ways, and I will show you two of them.
Without much thinking, this could be a simple way in Scala:
[Code 3]
{for(i <- 1 to numbers.length -1) yield numbers(i) - numbers(i-1)}.min

After some consideration, this could be another way.
[Code 4]
numbers.tail.zip(numbers).map(n => n._1 - n._2).min

As you see, Code 3 and Code 4 are much simpler than Code 2 or Code 1.  Great! Are we happy now?  If you want to say "Yes", please wait until you answer the following questions.

1. Which code is better between Code 3 and Code 4? (Yes, I am not even asking you to think about different languages)
2. What would be the Big O notation of Code 3 and Code 4?  Can you anticipate the growth rate?

It is known that functional programming focuses on "what to do" instead of "how to do", unlike imperative programming.  So Code 3 and Code 4 show us "what to do", but is that really enough for software engineers?

- Execution Time of Each Method
Running 5 times with 100,000 random numbers :

Code 3:  22958 ~ 50029 ms
Code 4:  6 ~ 33 ms

Running 5 times with 10,000,000 random numbers :

Code 1:  5 ~ 9 ms
Code 2:  9 ~ 13 ms
Code 3:  Didn't even run it
Code 4:  3262 ~ 11068 ms

I understand that there was some unfairness: the Scala code uses the Int object while the Java code used the primitive 'int' type.  So I also ran Code 1 and Code 2 with a List<Integer>.  The execution times with 10,000,000 random numbers are:

Code 1 with List<Integer>:  21 ~ 30 ms
Code 2 with List<Integer>:  26 ~ 38 ms

- Summary
I showed the execution time of each method, although I understand that execution time isn't everything.  Functional programming makes parallel processing simpler, so we can better utilize a multi-core processor.  Nevertheless, I am asking how we can evaluate the algorithms of Code 3 and Code 4 before running the code with a larger data set.  The much slower execution time might be a matter not only of the executing algorithm itself, but also of using and creating immutable objects.
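For illustration, parallelizing Code 2 is a one-line change (a sketch only; this parallel variant was not part of the measurements above, and for such a cheap per-element operation the scheduling overhead may well dominate any speedup):

```java
import java.util.stream.IntStream;

public class MinDiffParallel {
   // Parallel variant of Code 2: adding .parallel() spreads the map/min
   // work over the common ForkJoinPool without changing the logic.
   public static int findMinAbsDifferenceSortedParallel(int[] numbers) {
      return IntStream.range(0, numbers.length - 1)
               .parallel()
               .map(i -> numbers[i + 1] - numbers[i])
               .min().getAsInt();
   }
}
```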

As we can see from many other resources, functional programming has many strengths, but I just want to say that software engineers need to see more than "what to do".  Without seeing "how to do", we may have a hard time controlling the quality of software (although the definition of quality may differ from person to person).

Sunday, November 5, 2017

Algorithm, Performance and Back to the Basic

I have had some chances to interview people for a Sr. software engineer position.  Sometimes people don't remember what they learned in college after working in industry for more than 10 years.  I understand all the situations and excuses, because I may be the same.

So I chose an easier problem and wanted to see how candidates approach it within the first 10 minutes out of the 50 minutes given to me.  OK. I got this question from HackerRank.com, and it was a problem worth 15 points, which means an easy problem.


Basically, the question is about finding the minimum absolute difference between any two numbers in an array of size 'n'.  Surprisingly enough, I didn't have any interesting discussion with candidates.  I will talk about the very basic algorithm and its efficiency here, and continue with lambda expressions and functional programming in the next post.

- Brute-Force Approach
Most developers, regardless of their job title, can solve the problem with two nested for loops without much thinking.

public int findMinAbsDifferenceLoop(int[] numbers) {
   int min = Integer.MAX_VALUE;
   for(int i=0; i<numbers.length; i++){
      for(int j=i+1; j<numbers.length; j++){
         int diff = Math.abs((numbers[i] - numbers[j]));
         if(min > diff){
            min = diff; 
         }
      }
   }
   return min;
}

Is there any problem with this simple solution if your interest is simply getting an answer?

Then, what is the big O notation of this algorithm?  Simply O(n^2)!!  What does it mean? As the number of elements in the array grows, the execution time grows as fast as the graph of the square of x.
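If the Big O feels abstract, counting how many times the inner loop body of the solution above runs shows the quadratic growth directly: n elements require n*(n-1)/2 comparisons, so doubling n roughly quadruples the work.

```java
public class ComparisonCount {
   // Counts the comparisons performed by the nested-loop solution:
   // the inner loop body runs n*(n-1)/2 times in total.
   public static long comparisons(int n) {
      long count = 0;
      for (int i = 0; i < n; i++) {
         for (int j = i + 1; j < n; j++) {
            count++;
         }
      }
      return count;
   }

   public static void main(String[] args) {
      System.out.println(comparisons(10));   // 45
      System.out.println(comparisons(20));   // 190: ~4x the work for 2x the data
   }
}
```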

- Improvement after a Little Thinking.
Since we need to compare every pair of numbers, the previous approach with two loops might seem necessary.  But what if the numbers in the array were sorted? Would we still need to compare the first number with the second, third, and fourth to find the minimum difference? No! Right?

public int findMinAbsDifferenceSorted(int[] numbers) {
   int min = Integer.MAX_VALUE;
   for(int i=1; i<numbers.length; i++){
      int diff = numbers[i] - numbers[i-1];
      if(diff < min){
         min = diff;
      }
   }
   return min;
}

This logic? Simply O(n)!!
What about the expense of the sorting??  Yes, Yes, Yes...!! We should have talked about it even if you were not a software engineer!

There are many different sorting algorithms, and each has a best, a worst, and an average case, as shown on this wiki page.  Nevertheless, most people will not have a problem with saying an average of O(n log n).

public int findMinAbsDifferenceSorted(int[] numbers) {
   Arrays.sort(numbers);
   int min = Integer.MAX_VALUE;
   for(int i=1; i<numbers.length; i++){
      int diff = numbers[i] - numbers[i-1];
      if(diff < min){
         min = diff;
      }
   }
   return min;
}

So, this logic has O(n log n) + O(n) --> O(n log n)!

- Execution Time
I created random numbers, and this is the execution time on my MacBook Pro with a 2.3 GHz Intel Core i7 processor.

The X-axis indicates the number of elements in the array, from 10,000 to 100,000 in increments of 2,000; the Y-axis indicates the execution time in milliseconds.  The execution time for 100,000 elements with the double for loop was 12,627 milliseconds, but with the sorted data it was 6 milliseconds.  As you see, the execution time with the sorted data is almost flat on this graph.

I am not patient enough to run more data through the nested for loop.  This is the execution time for the sorted data, with sizes from 100,000 to 1,000,000 in increments of 20,000.  Although the execution time fluctuates a bit, it took only 78 milliseconds to calculate the minimum difference among 1 million numbers.


Summary
Some people try to solve this problem with an O(n) algorithm.  Great! But I will not talk about a better algorithm here, because the purpose of this post is to ask people to think about the basics, and to provide some refreshment before I talk about functional programming in the next post.


Sunday, October 22, 2017

Java 8: Stream

Stream was introduced in Java 8, and it provides very useful and compact ways of programming along with lambda expressions, also introduced in Java 8.

The Java API defines a stream as a sequence of elements supporting sequential and parallel aggregate operations.  A stream is not a data structure that stores elements; it conveys elements from a source through a pipeline of computational operations.   Collections and streams have different goals.  Collections are mainly for the efficient management of, and access to, their elements.  Streams do not provide a way of directly accessing the elements, but are more concerned with describing their source and the computational operations on the source. The source could be an array, a collection, a generator function, an I/O channel, and so on.

To perform a computation, stream operations are composed into a stream pipeline. A stream pipeline consists of a source, intermediate operations, and a terminal operation. Streams are lazy; computation on the source data is performed only when the terminal operation is initiated, and source elements are consumed only as needed.

Let's better understand the stream with examples.

[Code-1]
public static void main(String[] args) {
  List<Integer> numbers = new ArrayList<>();
      
  for(int i=30; i>=1; i--){
    numbers.add(i);
  }
      
  List<Integer> resultNumbers = numbers.stream()
            .filter(p -> p%2==0)
            .map(n -> n*2)
            .collect(Collectors.toList());
      
  resultNumbers.forEach(n-> System.out.print(n + " "));
}

Collection in Java 8 provides two methods that generate a Stream: stream() and parallelStream().  In this example, stream() returns a sequential Stream with 'numbers' as its source. The Stream interface provides several methods for the 'computation' such as map, filter, count, forEach, limit, and so on.  Some of these methods return a Stream after an operation and are used as intermediate operations; others return a final object or value and are used as terminal operations.
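A small sketch of that distinction (the values here are my own example):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OpKinds {
   public static void main(String[] args) {
      // Intermediate operations (filter, map, limit, ...) each return a
      // Stream and can be chained; a terminal operation (collect, count,
      // forEach, ...) produces a result and ends the pipeline.
      List<Integer> result = Stream.of(1, 2, 3, 4, 5)
            .filter(n -> n % 2 == 1)          // intermediate: still a Stream
            .map(n -> n * 10)                 // intermediate: still a Stream
            .collect(Collectors.toList());    // terminal: produces a List

      System.out.println(result);   // [10, 30, 50]
   }
}
```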

- filter(Predicate<? super T> predicate) 
It returns a stream of the elements that match the given predicate.  A Predicate is a boolean-valued function of one argument.  In the code, "p -> p%2==0" is the Predicate, "p" is its single argument, and "p%2==0" is the boolean-valued expression.  In this example, it returns a Stream of the even numbers in the source 'numbers'.

- map(Function<? super T, ? extends R> mapper)
It returns a stream consisting of the results of applying the given function to each element.  Here it doubles each number 'n' in the filtered stream produced by the preceding filter.

Output:

60 56 52 48 44 40 36 32 28 24 20 16 12 8 4 

- Execution order of methods in Stream
To see the order of execution, let's change the code and run it.  This time, the 'numbers' list initially contains 10 numbers.

[Code-2]
public static void main(String[] args) {
  List<Integer> numbers = new ArrayList<>();
      
  for(int i=10; i>=1; i--){
    numbers.add(i);
  }
      
  List<Integer> resultNumbers = numbers.stream()
            .filter( p -> {System.out.print(" # F " + p); return p%2==0;})
            .map(n -> {System.out.print(" # M " + n*2); return (n*2);})
            .collect(Collectors.toList());
      
  System.out.print("\nFinal: ");
  resultNumbers.forEach(n-> System.out.print(n + " "));
}

Output:

# F 10 # M 20 # F 9 # F 8 # M 16 # F 7 # F 6 # M 12 # F 5 # F 4 # M 8 # F 3 # F 2 # M 4 # F 1
Final: 20 16 12 8 4 

It doesn't complete one intermediate method on all elements before calling the second intermediate method.  It filters each element and, if the element meets the Predicate condition, passes it to the map before filtering the next element.

The 'lazy' characteristic of the stream is maintained. As said earlier, streams are lazy: computation on the source data is performed only when the terminal operation is initiated.

The following code also outputs the same result because the computation is not performed until the terminal operation 'collect' is initiated.

[Code-3]
 Stream<Integer> st = numbers.stream()
            .filter( p -> {System.out.print(" # F " + p); return p%2==0;});
      
 st = st.map(n -> {System.out.print(" # M " + n*2); return (n*2);});
 List<Integer> resultNumbers = st.collect(Collectors.toList());

If you put a breakpoint and run it in Debug mode, you will notice that there is no output until the line with 'collect' executes.

- Return object and the elements' order in the return object.
A 'new' List object is created in 'Collectors.toList()', and this is the method description in the API document:

 /**
  * Returns a {@code Collector} that accumulates the input elements into a
  * new {@code List}. There are no guarantees on the type, mutability,
  * serializability, or thread-safety of the {@code List} returned; if more
  * control over the returned {@code List} is required, use {@link #toCollection(Supplier)}.
  *
  * @param <T> the type of the input elements
  * @return a {@code Collector} which collects all the input elements into a
  * {@code List}, in encounter order
  */
 public static <T>  Collector<T, ?, List<T>> toList() {

It says the elements are collected "in encounter order".  This is the order in which the source makes its elements available, i.e. its iteration order.  Even when we use a parallel stream, the same encounter order applies.

- Parallel Stream
[Code-2] uses 'numbers.stream()'.  Collection in Java 8 also provides a 'parallelStream' method that returns a parallel stream.  Let's see how the output changes.

[Code-4]
 public static void main(String[] args) {
   List<Integer> numbers = new ArrayList<>();
   for (int i = 10; i >= 1; i--) {
     numbers.add(i);
   }

   List<Integer> resultNumbers = numbers.parallelStream().filter(p -> {
          System.out.print(" # F " + p);
          return p % 2 == 0;})
        .map(n -> {
          System.out.print(" # M " + n * 2);
          return (n * 2);})
        .collect(Collectors.toList());

   System.out.print("\nFinal: ");
   resultNumbers.forEach(n -> System.out.print(n + " ")); 
 }

Output: 

 # F 4 # F 6 # M 12 # F 8 # M 16 # F 1 # F 5 # F 2 # F 10 # M 20 # F 3 # F 9 # M 4 # F 7 # M 8
Final: 20 16 12 8 4 

Since the parallel stream runs the operations in parallel, the output differs on each run, and the display order is random except for one thing: each "# F" output comes before its corresponding "# M" output. (For example, "# F 2" comes before " # M 4".)

Parallelism and Joining Threads
With a parallel stream, an ordering constraint can degrade execution performance.  For example, the 'limit()' method with a parallel stream may require some buffering to ensure the order, undermining the benefit of parallelism.

Let's add the 'limit' method between the filter and map methods shown in [Code-4], with more elements in the source.

[Code-5]
18
 public static void main(String[] args) {
   List<Integer> numbers = new ArrayList<>();
   for (int i = 30; i >= 1; i--) {
     numbers.add(i);
   }

   List<Integer> resultNumbers = numbers.parallelStream().filter(p -> {
          System.out.print(" # F " + p);
          return p % 2 == 0;})
        .limit(10)
        .map(n -> {
          System.out.print(" # M " + n * 2);
          return (n * 2);})
        .collect(Collectors.toList());

   System.out.print("\nFinal: ");
   resultNumbers.forEach(n -> System.out.print(n + " ")); 
 }

Output:  (Some output from the filter is omitted to save space)

 # F 11 # F 17 # F 26 # F 29 # F 27 # F 21 # F 22 ... # F 10 # F 14 # F 13 # M 36 # M 28 # M 24 # M 48 # M 52 # M 44 # M 40 # M 56 # M 60 # M 32
Final: 60 56 52 48 44 40 36 32 28 24 

The order of the filter output is random because it runs in parallel, and the order of the map output is random too, but the two groups are separated.  The reason is that limit(10) joins all threads used for the filter method before it limits the elements to 10.  After that, the map method runs in parallel again.  (I said "joins all threads", but in fact ForkJoinTasks, which run within a ForkJoinPool, are used.  A ForkJoinTask is a thread-like entity that is much lighter than a normal thread.)

The final output is again in the encounter order used by Collectors.toList().

Friday, August 4, 2017

Docker: Optimal Image for Microservices with Multi-Stage Build (and TomEE)

In an earlier post, I showed how TomEE builds a single executable jar and how that makes deployment easier.  Now I would like to build a Docker image with a TomEE RESTful web service, but I noticed a problem with the size of the created Docker image.  So I spent some time finding a better way to reduce the image size.  In Docker, it is called a "multi-stage build", introduced in Docker 17.05.

Simple RESTful service

package com.jihwan.javaee.test.endpoint;

import javax.ws.rs.DefaultValue;
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.QueryParam;

@Path("/hello")
public class HelloRest {
   @GET
   public String sayHello(@DefaultValue("Friend") @QueryParam("name") String name){
      return "Hello " + name + "!!";
   }
}

Maven pom.xml
The same as before, but a simpler pom.xml without any Hibernate-related jar files.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

  <modelVersion>4.0.0</modelVersion>
  <groupId>com.jihwan.javaee</groupId>
  <artifactId>test</artifactId>
  <version>0.1-SNAPSHOT</version>
  <packaging>war</packaging>

  <properties>
    <tomee.version>7.0.2</tomee.version>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <failOnMissingWebXml>false</failOnMissingWebXml>
    <maven.compiler.target>1.8</maven.compiler.target>
    <maven.compiler.source>1.8</maven.compiler.source>
  </properties>
  
  <dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.tomee/tomee-embedded -->
    <dependency>
      <groupId>org.apache.tomee</groupId>
      <artifactId>tomee-embedded</artifactId>
      <version>${tomee.version}</version>
      <scope>provided</scope>
      <exclusions>
         <exclusion>
            <groupId>org.apache.openjpa</groupId>
            <artifactId>openjpa</artifactId>
         </exclusion>
      </exclusions>
    </dependency>
  </dependencies>

  <build>
    <finalName>${project.artifactId}</finalName>
    <plugins>
      <plugin>
        <groupId>org.apache.tomee.maven</groupId>
        <artifactId>tomee-maven-plugin</artifactId>
        <version>${tomee.version}</version>
        <configuration>
          <tomeeVersion>${tomee.version}</tomeeVersion>
          <tomeeClassifier>plus</tomeeClassifier>

          <webapps>
            <webapp>${project.groupId}:${project.artifactId}:${project.version}?name=${project.artifactId}</webapp> 
          </webapps>
      
          <synchronization>
            <extensions>
              <extension>.class</extension>
            </extensions>
          </synchronization>

          <reloadOnUpdate>true</reloadOnUpdate>
          <systemVariables>
            <tomee.serialization.class.blacklist>-</tomee.serialization.class.blacklist>
            <openejb.system.apps>true</openejb.system.apps>
          </systemVariables>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project> 


Dockerfile
This 'DockerfileFull' file follows the regular pattern of the Docker build process.

 FROM XXX-maven:3.5

 WORKDIR /app

 COPY . /app/

 RUN mvn clean package tomee:exec

 EXPOSE 8080

 CMD ["./start.sh"]

The XXX-maven:3.5 is the base image I used.  It was created by a private company and contains only an (Alpine) Linux OS, Java 8, and Maven.  Its size is 617MB.

TomEE also provides a Docker image, but using the TomEE Maven plugin is very simple, as shown here.

Building an Image
     docker build -f DockerfileFull -t hellofull .

It creates a Docker image and runs perfectly well, but the image has a few issues.  Let's look at the image size first.

i31596-is7:test jihwan.kim$ docker images
REPOSITORY      TAG             IMAGE ID          CREATED            SIZE
hellofull       latest          caf4ff63ac96      2 minutes ago      971MB

The size is 971 MB!  What happened? The size of the base image 'XXX-maven:3.5' is 617MB, and the size of the test-exec.jar (explained in a previous post) built with the pom.xml used here is only 50MB.  So why is this image so large?

When you ran the build command, you could notice that lots of jar files were downloaded during the 'RUN mvn clean package tomee:exec' step of the DockerfileFull.  During the Maven build, all necessary jar files were downloaded and saved in the Maven repository inside the image.

Then a target directory is created and holds lots of files along with the executable 'test-exec.jar'.  All of these files end up in the 'hellofull' image.

Another issue comes from the 'COPY . /app/' instruction in the DockerfileFull file. It copies every file, including all of your source code.  Since a Docker image is built in several layers, trying to remove any files after the RUN step will not actually remove them from the image.

By the way, this is the 'start.sh' script used in the Docker CMD.

#!/bin/bash
java -jar target/test-exec.jar

Multi-Stage Build
To handle these issues, you could build your application on a build server and copy only the necessary files into the image, without building the app inside Docker.  That may be a reasonable solution if your company and system are of manageable size, but it doesn't always work this way.

"With multi-stage builds, you use multiple FROM statements in your Dockerfile. Each 'FROM' instruction can use a different base, and each of them begins a new stage of the build. You can selectively copy artifacts from one stage to another, leaving behind everything you don’t want in the final image." (From Docker document)

Now, let's write another Dockerfile.  I named it simply 'Dockerfile'.

FROM XXX-maven:3.5 as builder

WORKDIR /app

COPY . /app/

RUN mvn clean package tomee:exec

FROM XXX-maven:3.5

COPY --from=builder /app/target/test-exec.jar /app/target/test-exec.jar

COPY --from=builder /app/start.sh /app/start.sh

EXPOSE 8080

CMD ["./start.sh"]

Then, build an image 'hello' :     docker build -t hello .

This is the image size: 667MB!  That is exactly the size of the base XXX-maven image (617MB) plus the test-exec.jar (50MB).

i31596-is7:test jihwan.kim$ docker images
REPOSITORY       TAG         IMAGE ID          CREATED            SIZE
hello            latest      96217cabf95d      3 minutes ago      667MB

Now, let's see the information of each layer of the image.  (Only the top part of the entire history is shown, to protect the history of the base image.)  Note that it doesn't contain the layers from the first stage.

i31596-is7:test jihwan.kim$ docker history 96217cabf95d
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
96217cabf95d        9 minutes ago       /bin/sh -c #(nop)  CMD ["./start.sh"]           0B                  
96bb88e1568d        9 minutes ago       /bin/sh -c #(nop)  EXPOSE 8080/tcp              0B                  
568b079ef6d5        9 minutes ago       /bin/sh -c #(nop) COPY file:f79c59694e5f31...   42B                 
3e7db3746768        9 minutes ago       /bin/sh -c #(nop) COPY file:3fb1ab0cf1e09e...   50.1MB              
d2d465d1a5db        4 days ago          /bin/sh -c #(nop)  CMD ["/bin/bash"]            0B                  
<missing>           4 days ago          /bin/sh -c #(nop) WORKDIR /app                  0B                  
<missing>           4 days ago          /bin/sh -c #(nop) COPY file:a9b17ab946a74d...   1.97kB              
<missing>           4 days ago          |2 BASE_URL=https://apache.osuosl.org/mave...   10.2MB              
<missing>           4 days ago          /bin/sh -c #(nop)  ENV MAVEN_CONFIG=/root/.m2   0B                  
<missing>           4 days ago          /bin/sh -c #(nop)  ENV MAVEN_HOME=/usr/sha...   0B                  
<missing>           4 days ago          /bin/sh -c #(nop)  ENV MAVEN_VERSION=3.5.0      0B                  
<missing>           4 days ago          /bin/sh -c #(nop)  ARG BASE_URL=https://ap...   0B 


Running the App

      docker run -p 9000:8080 hello



Monday, July 24, 2017

Solr 6: Adding Documents using a SolrJ Client Application

I created a SolrJ client application to populate the documents shown in a previous post about Solr schemaless.  This application includes all necessary jar files for SolrJ applications: solr-solrj-6.6.0.jar in the <SolrInstalled>/dist directory and all jar files under the <SolrInstalled>/dist/solrj-lib directory.
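If you manage the classpath with Maven instead of copying the jars from the Solr distribution, the equivalent dependency should look like this (a sketch; the version matches the solr-solrj-6.6.0.jar above, and Maven's transitive dependencies take the place of the solrj-lib directory):

```xml
<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>6.6.0</version>
</dependency>
```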

UserCreateMain.java

import java.util.ArrayList;
import java.util.List;

public class UserCreateMain {

  public static void main(String[] args) {    
    int totalDoc = 3_000_000;
    int docPerThread = 1_000_000;
    
    List<SolrDocPojo> docList = new ArrayList<>();
    for(int numOfDoc = 1; numOfDoc <= totalDoc; numOfDoc++){
      UserDoc user = new UserDoc();
      user.populateFields();
      docList.add(user);
    
      if(numOfDoc%docPerThread == 0){
        (new SolrImport(docList)).start();
        docList=new ArrayList<>();
      }
    }
    
    if(docList.size()>0){
      (new SolrImport(docList)).start();
    }
  }
}

SolrDocPojo.java
import org.apache.solr.common.SolrInputDocument;

public interface SolrDocPojo {
  public SolrInputDocument converToSolrDoc();
  public void populateFields();
}

UserDoc.java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
import java.util.UUID;

import org.apache.solr.common.SolrInputDocument;

public class UserDoc implements SolrDocPojo {
  private String id;
  private String firstName;
  private String lastName;
  private Integer birthYear;
  private String companyName;
  private String state;
  private Set<String> permission = new HashSet<>();
  
  private static final Random RANDOM = new Random();
  private static final String[] COMPANIES = {"Google", "FB", "Samsung", "Intel", "Netflex", 
      "Micro", "Zions", "OC Tanner", "GE", "Goldman", "Aegen", "GlaxoSmithKline", "Ford"};
  
  public UserDoc() {
    super();
    id = UUID.randomUUID().toString();
  }


  @Override
  public SolrInputDocument converToSolrDoc() {
    SolrInputDocument solrDoc = new SolrInputDocument();
    solrDoc.setField("id", id);
    solrDoc.setField("firstName", firstName);
    solrDoc.setField("lastName", lastName);
    solrDoc.setField("birthYear", birthYear);
    solrDoc.setField("companyName", companyName);
    solrDoc.setField("state", state);
    solrDoc.setField("permission", permission);
    return solrDoc;
  }

  @Override
  public void populateFields() {
    state = US.randomState().getANSIAbbreviation();
    birthYear = 1930 + RANDOM.nextInt(80);
    companyName = COMPANIES[RANDOM.nextInt(COMPANIES.length)];

    int firstSeparator = id.indexOf('-');
    firstName = "first" + id.substring(0, firstSeparator);
    lastName = "last" + id.substring(firstSeparator+1, id.indexOf('-', firstSeparator+1));
    
    int numOfPermission = RANDOM.nextInt(11); //0~10
    for(int i=0; i<numOfPermission; i++){ //max 10 permissions
      permission.add("permission"+RANDOM.nextInt(10));
    }
  }

  //Getters and Setters are omitted
}
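The first and last names above are derived from the random UUID assigned in the constructor: the text before the first '-' becomes the suffix of 'first', and the text between the first and second '-' becomes the suffix of 'last'.  A stand-alone sketch of that substring logic (the class and method names here are hypothetical, for illustration only):

```java
// Stand-alone sketch of the name-derivation logic in populateFields():
// the first two dash-separated segments of the UUID feed the names.
public class NameDerivationDemo {

    static String[] deriveNames(String id) {
        int firstSeparator = id.indexOf('-');
        // segment before the first '-' -> firstName suffix
        String firstName = "first" + id.substring(0, firstSeparator);
        // segment between the first and second '-' -> lastName suffix
        String lastName = "last"
            + id.substring(firstSeparator + 1, id.indexOf('-', firstSeparator + 1));
        return new String[] { firstName, lastName };
    }

    public static void main(String[] args) {
        // A fixed UUID-shaped string so the result is predictable
        String[] names = deriveNames("123e4567-e89b-12d3-a456-426614174000");
        System.out.println(names[0]); // first123e4567
        System.out.println(names[1]); // laste89b
    }
}
```

For a random UUID such as the one UserDoc generates, the same two segments are extracted each time the fields are populated.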

SolrImport.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;


public class SolrImport extends Thread {
  final int SOLR_BATCH_SIZE = 2000;

  List<SolrDocPojo> docList = null;

  public SolrImport(List<SolrDocPojo> docList) {
    super();
    this.docList = docList;
  }

  public void run(){
    List<SolrInputDocument> inputList =
        new ArrayList<SolrInputDocument>();

    boolean commit = false;
    for(SolrDocPojo doc: docList){
      inputList.add(doc.converToSolrDoc());
      if( inputList.size() % SOLR_BATCH_SIZE == 0){
        sendToSolr(inputList, commit);
        inputList.clear();
        commit = !commit;
        System.out.println("sendToSolr executed");
      }
    }

    if(inputList.size() > 0){
      sendToSolr(inputList, true);
      inputList.clear();
    }

    System.out.println("done");
  }

  private void sendToSolr(List<SolrInputDocument> docList, boolean commit) {
    try {
      SolrEndPoint.client.add(docList);
      if(commit)
        SolrEndPoint.client.commit();

    } catch (SolrServerException e) {
      e.printStackTrace();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}
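Note the commit cadence in run(): a batch is sent every SOLR_BATCH_SIZE documents, a commit is issued only on every other send (the `commit = !commit` toggle), and the final partial batch always commits.  A stand-alone sketch of that cadence, with counters in place of the real Solr calls (the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone sketch of SolrImport's batch/commit cadence: send every
// 2000 docs, commit on every other send, flush the remainder with a
// final commit. Counters replace the real SolrClient calls.
public class BatchCadenceDemo {
    static final int SOLR_BATCH_SIZE = 2000;

    // returns {sends, commits} for a given number of documents
    static int[] simulate(int totalDocs) {
        List<Integer> inputList = new ArrayList<>();
        int sends = 0, commits = 0;
        boolean commit = false;
        for (int i = 0; i < totalDocs; i++) {
            inputList.add(i);
            if (inputList.size() % SOLR_BATCH_SIZE == 0) {
                sends++;                      // sendToSolr(inputList, commit)
                if (commit) commits++;
                inputList.clear();
                commit = !commit;
            }
        }
        if (inputList.size() > 0) {           // final partial batch always commits
            sends++;
            commits++;
        }
        return new int[] { sends, commits };
    }

    public static void main(String[] args) {
        int[] r = simulate(1_000_000);
        System.out.println(r[0] + " sends, " + r[1] + " commits");
    }
}
```

For the 1,000,000 documents each SolrImport thread receives, this works out to 500 sends and 250 commits, halving the number of relatively expensive commit calls.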

US.java
This enum class defines all US state names. Except for the 'randomState' method, the code is from https://github.com/AustinC/UnitedStates/blob/master/src/main/java/unitedstates/US.java

import java.util.Arrays;
import java.util.List;
import java.util.Random;

public enum US {
  ALABAMA("Alabama","AL","US-AL"),
  ALASKA("Alaska","AK","US-AK"),
  ARIZONA("Arizona","AZ","US-AZ"),
  ARKANSAS("Arkansas","AR","US-AR"),
  CALIFORNIA("California","CA","US-CA"),
  COLORADO("Colorado","CO","US-CO"),
  CONNECTICUT("Connecticut","CT","US-CT"),

  //Omitted

  WYOMING("Wyoming","WY","US-WY"),
  PUERTO_RICO("Puerto Rico","PR","US-PR");

  private static final List<US> VALUES = Arrays.asList(values());
  private static final int SIZE = VALUES.size();
  private static final Random RANDOM = new Random();

  public static US randomState()  {
    return VALUES.get(RANDOM.nextInt(SIZE));
  }
  
  //Omitted
}

SolrEndPoint.java
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class SolrEndPoint {
  static final HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/schemaless").build();  
}

