Thursday 7 May 2015

Experiment with String Split

Splitting string based on some token is very common operation in application, in this blog i will share some options that are mostly used and types of overhead involved in it.

String.split
This is the most common approach and it looks harmless unless you look at the code!
First it creates ArrayList for storing values and then this arraylist is converted to String array

This function produces too much of garbage and it provides only one way to extract values.

String Tokenizer 
This is much better than String.split because it does not need intermediate buffer like ArrayList/String Array to hold the values, but it creates String objects for each token which adds to garbage.

One of the good thing about StringTokenizer is that it does not force caller to use fixed data structure , so caller is free to decide on what to do with values, it is just like Streaming operation.

String.split vs StringTokenizer
Lets look at some performance numbers. In this test i take below sample string

String[] values = {
        "this,is,simple,test",
        "lets,see,how,it,works",
        "this,is,very,simple,test"};

Each line is split 10 million times






















So definitely StringTokenizer is winner in this case and it is because it produces less garbage as compared to String.split but it still produces string objects.

It is possible to avoid creating those String object also by using Recycle Charsequence which is just like String but gives lots of flexibility.

Lets look at another implementation using recycle charsequence approach.


It is very simple technique to avoid intermediate string object, lets look at performance number using this approach.






















Recycle charSequence shines in this test, it is around 3X times faster than String.split an 2X times faster than StringTokenizer.

Code available @ github

2 comments:

  1. Interesting that object reuse > algorithmic improvement here

    ReplyDelete
  2. Yes garbage produced is showing big impact on performance.

    ReplyDelete