SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
for (int i = 1; i <= 10000; i++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField(.....);
    doc.addField(.....);
    server.add(doc);
    if (i % 200 == 0) {
        server.commit();
    }
}
Let's look at the API of the SolrServer object.
- add(SolrInputDocument doc) : Adds a single document
- commit() : Performs an explicit commit, causing pending documents to be committed for indexing.
People may expect this to be a batch process: the server object holds 200 documents and sends all 200 to the server when the commit method is called. Unfortunately, this is not how it works.
If you open solr.log in the $SOLR_HOME/example/logs directory, you will see one update request logged for each call of the add method. You can also check the source code of the SolrServer object. As shown there, an update request is processed on every add method call.
public UpdateResponse add(SolrInputDocument doc) throws SolrServerException, IOException {
    return add(doc, -1);
}

public UpdateResponse add(SolrInputDocument doc, int commitWithinMs) throws SolrServerException, IOException {
    UpdateRequest req = new UpdateRequest();
    req.add(doc);
    req.setCommitWithin(commitWithinMs);
    return req.process(this);
}
The Solr server itself does have a kind of batch process: changed or added documents are not visible to the current searcher, and a commit opens a new searcher with the updated index. Data added to Solr is not searchable until a commit is made. So the description of the API explains what happens on the Solr server, not what the client holds back.
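To make the difference between "added" and "searchable" concrete, here is a toy model in plain Java. It is only an illustration of the behavior described above, not how Solr is implemented; the class ToyIndex and its methods are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Solr code. It mimics the idea that added
// documents stay invisible to searches until a commit happens.
final class ToyIndex {
    // Documents received since the last commit (not searchable yet).
    private final List<String> pending = new ArrayList<>();
    // Snapshot that searches run against (the "current searcher").
    private List<String> visible = new ArrayList<>();

    void add(String doc) {
        pending.add(doc);
    }

    void commit() {
        // Build a new snapshot that includes the pending documents,
        // then swap it in, similar in spirit to opening a new searcher.
        List<String> next = new ArrayList<>(visible);
        next.addAll(pending);
        pending.clear();
        visible = next;
    }

    int searchableCount() {
        return visible.size();
    }

    int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("doc-1");
        index.add("doc-2");
        System.out.println(index.searchableCount()); // 0, nothing committed yet
        index.commit();
        System.out.println(index.searchableCount()); // 2, visible after commit
    }
}
```

In this model, add only changes what is pending, and commit is the only step that changes what a search can see.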
1. Solr Client version 2.
This is the updated process.
SolrServer server = new HttpSolrServer("HOST_URL/solr/addressDB");
int dataCount = 0;
List<SolrInputDocument> docList = new ArrayList<SolrInputDocument>();
for (int i = 1; i <= SOME_DATA_SIZE; i++) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField(.....);
    doc.addField(.....);
    docList.add(doc);
    dataCount++;
    if (i % 200 == 0) {
        server.add(docList);
        server.commit();
        docList.clear();
    }
}
// Send whatever is left over after the last full batch.
if (docList.size() > 0) {
    server.add(docList);
    server.commit();
    docList.clear();
}
This client added all 6,071,307 address records to the Solr server in about 410 seconds on my machine. (Version 1 processed only 263,000 records in 600 seconds.) For this run, I called the add method every 1,000 documents and the commit method every 2,000 documents (the listing above uses 200 for both). The two methods do not have to be called at the same time; the commit method can be called less frequently than the add method.
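Here is a rough sketch of how the add and commit intervals could be separated. To keep it runnable without a Solr server, it does not use SolrJ at all: DocSink is a hypothetical stand-in for the two SolrServer calls used in this post, CountingSink just counts calls, and documents are plain strings. Treat it as one possible arrangement, not the exact code I ran.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchIndexer {
    // Hypothetical stand-in for SolrServer.add(Collection) and SolrServer.commit().
    interface DocSink {
        void add(List<String> docs);
        void commit();
    }

    // Counts calls instead of talking to a server.
    static final class CountingSink implements DocSink {
        int addCalls;
        int commitCalls;
        int docsReceived;

        public void add(List<String> docs) {
            addCalls++;
            docsReceived += docs.size();
        }

        public void commit() {
            commitCalls++;
        }
    }

    // Sends a batch every addEvery documents and commits every commitEvery documents.
    static void index(int total, int addEvery, int commitEvery, DocSink sink) {
        List<String> buffer = new ArrayList<>(addEvery);
        boolean dirty = false; // true when something was sent but not yet committed
        for (int i = 1; i <= total; i++) {
            buffer.add("doc-" + i);
            if (i % addEvery == 0) {
                sink.add(buffer);
                buffer.clear();
                dirty = true;
            }
            if (i % commitEvery == 0 && dirty) {
                sink.commit();
                dirty = false;
            }
        }
        // Flush the leftover documents and make sure everything is committed.
        if (!buffer.isEmpty()) {
            sink.add(buffer);
            buffer.clear();
            dirty = true;
        }
        if (dirty) {
            sink.commit();
        }
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink();
        index(5300, 1000, 2000, sink);
        // prints "6 adds, 3 commits, 5300 docs"
        System.out.println(sink.addCalls + " adds, " + sink.commitCalls
                + " commits, " + sink.docsReceived + " docs");
    }
}
```

With a real client, the sink would wrap server.add(docList) and server.commit(), and the buffer would hold SolrInputDocument objects instead of strings.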
To see the added data, open http://localhost:8983/solr/ in a browser. Then select the core 'addressDB' --> click 'Query' --> click the 'Execute Query' button. By default, the query returns 10 documents.
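The same check can be done without the admin page by requesting the core's select handler directly. The sketch below only builds the URL; the SelectUrl class is a made-up helper, and it assumes the core is served under /solr/addressDB with the standard /select handler and the usual q, rows and wt parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class SelectUrl {
    // Builds a select URL for the given core. The host, core name and
    // handler path are assumptions matching the setup in this post.
    static String build(String baseUrl, String core, String query, int rows) {
        try {
            return baseUrl + "/" + core + "/select?q="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&rows=" + rows
                    + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "*:*" matches every document; rows=10 mirrors the default page size.
        // prints http://localhost:8983/solr/addressDB/select?q=*%3A*&rows=10&wt=json
        System.out.println(build("http://localhost:8983/solr", "addressDB", "*:*", 10));
    }
}
```

Opening the printed URL in a browser should return the first 10 documents, and raising the rows value returns more per request.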