
Saturday, 20 July 2013

ArrayList Using Memory Mapped File

Introduction
In-memory computing is picking up as hardware becomes more affordable. Most of the data is kept in RAM to meet latency and throughput goals, but keeping data in RAM creates Garbage Collector overhead, especially if you don't pre-allocate.
So effectively we need a garbage-less/free approach to avoid GC hiccups.

Garbage free/less data structure
There are a couple of options to achieve it:

 - Object Pool
The object pool pattern is a very good solution; I wrote about it in the Lock Less Object Pool blog.

 - Off Heap Objects
The JVM has very good support for creating off-heap objects. You can get rid of GC pauses if you take this highway, but the highway has its own risks!

 - Memory Mapped File
This is a mix of heap & off-heap, like the best of both worlds.

A memory mapped file allows you to map part of a file into memory, and that memory is managed by the OS, so it creates very little memory overhead in the JVM process that maps the file.
This helps in managing data in a garbage-free way while letting the JVM handle a large amount of data.
Memory mapped files can also be used to develop IPC; I wrote about that in the power-of-java-memorymapped-file blog.

In this blog I will create an ArrayList that is backed by a memory mapped file. This array list can store millions of objects with almost no GC overhead. It sounds crazy, but it is possible.

Let's get into action
In this test I use an Instrument object that has the below attributes:
 - int id
 - double price

So each object is 12 bytes.
This new array list holds 10 million objects, and I will try to measure writer/reader performance.
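
Before the numbers, here is a minimal sketch of the core idea (hypothetical names; the real code is on GitHub, linked in the conclusion). Each 12-byte record is written straight into a MappedByteBuffer, so the data lives outside the java heap:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMappedInstrumentList {

    private static final int RECORD_SIZE = 12; // 4 bytes id + 8 bytes price
    private final MappedByteBuffer buffer;
    private int size;

    public MemoryMappedInstrumentList(String file, int capacity) throws IOException {
        FileChannel channel = new RandomAccessFile(file, "rw").getChannel();
        // The mapped region is managed by the OS, not by the garbage collector.
        buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, (long) capacity * RECORD_SIZE);
    }

    public void add(int id, double price) {
        int offset = size * RECORD_SIZE;
        buffer.putInt(offset, id);
        buffer.putDouble(offset + 4, price);
        size++;
    }

    public int getId(int index) {
        return buffer.getInt(index * RECORD_SIZE);
    }

    public double getPrice(int index) {
        return buffer.getDouble(index * RECORD_SIZE + 4);
    }
}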

Writer Performance

X Axis - No of reading
Y Axis - Time taken to add 10 million entries, in ms
Adding 10 million elements takes around 70 ms, which is pretty fast.

Writer Throughput
Let's look at another aspect of performance, which is throughput.

X Axis - No of reading
Y Axis - Throughput/second, in millions
Writer throughput is very impressive; it ranges between 138 million and 142 million per second.

Reader Performance

X Axis - No of reading
Y Axis - Time taken to read 10 million entries, in ms
It takes around 44 ms to read 10 million entries, which is very fast. With this type of performance you can definitely challenge a database.

Reader Throughput

X Axis - No of reading
Y Axis - Throughput/second, in millions
Wow, the throughput is great: 220+ million per second.

It looks very promising with 138 million/sec writer throughput & 220 million/sec reader throughput.

Comparison With ArrayList
Let's compare the performance of BigArrayList with ArrayList.

Writer Throughput - BigArrayList Vs ArrayList

Throughput of BigArrayList is almost constant at around 138 million/sec, while ArrayList starts at 50 million and drops under 5 million.

ArrayList has a lot of hiccups due to
 - Array allocation
 - Array copy
 - Garbage Collection overhead

BigArrayList is the winner in this case; it is around 7X faster than ArrayList.

Reader Throughput - BigArrayList Vs ArrayList

ArrayList performs better than BigArrayList for reads; it is around 1X faster.

BigArrayList is slower in this case because
 - It has to keep mapping the file into memory as more data is requested
 - There is the cost of un-marshaling

Reader throughput for BigArrayList is still 220+ million/sec; that is very fast, and only a few applications need to process messages faster than that.
So for most use-cases this should work.

Reader performance can be improved by using the below techniques:
 - Read messages in batches from the mapped stream
 - Pre-fetch messages by using an index, like what the CPU does

With these changes we can improve performance by a few million, but I think for most cases the current performance is pretty good.

Conclusion
Memory mapped files are an interesting area for research; they can solve many performance problems.
Java is now being used for developing trading applications, and GC is one question that you have to answer from day one. You need to find a way to keep the GC happy, and memory mapped files are one thing the GC will love.

Code used for this blog is available @ GitHub; I ran the test with 2 GB memory.
The code doesn't handle some edge cases, but it is good enough to prove the point that memory mapped files can be a winner in many cases.

Sunday, 16 June 2013

Fast ReadWrite Lock



Locks are bad for performance, but they are still a good fit for some scenarios.

In this post I will review the java Read/Write lock and an alternate implementation of a fast Read/Write lock.

What is a Readers-Writer lock
Let's do a quick recap of the read/write lock.

As the name suggests, it is a shared lock, shared between readers and writers. You can have either reader(s) or a writer: multiple readers or one writer active at any given point of time. So it is a very good fit for cases where the number of read operations is greater than the number of writes.

What is the issue with the Java RW lock
By design, only reader(s) or a writer can be active at any given point of time, so it will cause starvation when contention is high.

An advanced RW lock will have additional features like
  • Priority - e.g. the writer has more priority than readers, or the other way round
  • Fairness - this can be based on waiting time
It is complex to build a lock with these sets of rules, and many things are in the control of the OS, e.g. scheduling of threads.
With all these issues, RW locks are still used.

How does the RW lock perform?
Let's try to measure RW lock performance under different scenarios.

Details of the problem that will be used for testing

Read/write to a HashMap: writer threads add 1 million entries to the map and reader threads try to read those values back. The write lock is used for write operations and the read lock for read operations.
I will try to test the java RW lock under no contention and gradually increase contention to see how it performs.
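
A minimal sketch of that locking pattern (illustrative names, not the exact benchmark code):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RWLockedMap {

    private final Map<Integer, Integer> map = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(int key, int value) {
        lock.writeLock().lock();  // exclusive: blocks readers and other writers
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer get(int key) {
        lock.readLock().lock();   // shared: multiple readers may hold it at once
        try {
            return map.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }
}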

Legend to read the graphs

1W - One writer
1W-1R - One writer & one reader
1W-2R - One writer & two readers

Writer Performance

X Axis - Writer/reader threads
Y Axis - Time to add 1 million entries (ms)
The "1 Writer & 1 Reader" case shows weird numbers on my system; I will ignore it for comparison.

When there is no contention the writer takes around 73 ms, and as we add more reader threads contention increases and it starts to slow down. In my example, when the total thread count reaches 4, performance drops by 10 times. That is a big drop for the "1 Writer - 3 Readers" scenario; just think what will happen if more readers are added.

Writer Throughput
Let's have a look at throughput.

X Axis - Writer/reader threads
Y Axis - Throughput/sec
I will ignore the "1W-1R" case.

Throughput also takes a big hit under contention; it falls off a cliff. For 1 writer, throughput is 14 million, and once the total thread count reaches 4, throughput drops to 1.4 million, which is 90% less.

The writer is just one part of the RW lock; let's look at the performance of readers.

Reader Performance

X Axis - Writer/reader threads
Y Axis - Time to read 1 million entries (ms)
The performance degradation trend continues for the readers as well.

Reader Throughput

Reader throughput also takes a hit under contention.
Readers without writer contention
This scenario is very realistic because

  • Mostly, writer threads populate data at application start-up time
  • Or data is updated incrementally based on some write event

90%+ of the time the application only reads data, and it is the performance of the reader threads that matters most at the end of the day. Let's have a look at performance when only readers are active.

In this test, data in the hashmap was pre-populated and only reader threads were executing.
The numbers now look real :-) .

It takes more time to read data as the number of readers increases, and the same thing is observed for throughput: it drops.

Alternate Reader/Writer Lock
Let's compare an alternate implementation of the RW lock.

Readers without writer contention

Fast Lock outperforms the java ReadWriteLock in both tests.

In the first test, the time taken to read is almost flat with the Fast RW Lock, while for the java based one it goes up like a stock price as we keep adding more reader threads.

In the throughput test, the java RW lock starts at 28 million and drops to 1.6 million; that is a 95% drop.
The fast RW lock starts at 58 million and drops to 34 million, which is around a 40% drop.

Under heavy contention (i.e. 4 readers), the java ReadWrite lock is around 20X slower; that looks like comparing apples to oranges, but the test results show it.

So for readers the new lock looks great; let's measure it with a writer.

Writer Performance - Java RW vs Fast RW

FastLock performs better than the java RW lock under contention.

Reader Performance - Java RW vs Fast RW



What makes the FastRW lock fast
Now let's understand why the new read/write lock is fast.

The RW lock that I used for testing is an implementation of a SeqLock. SeqLocks are very common in the linux world; they provide parallelism for readers & the writer. Readers never block the writer, and readers use an optimistic approach to read the data.

Readers in a seqlock never update any state in the ReadWrite lock and thus reduce CPU cache traffic; only load operations are involved in reading data.

So in a nutshell it provides

  • Less CPU cache traffic
  • Readers never block the writer
  • Readers never change the state of the lock, so no expensive store operation
  • Readers can slow down if the write frequency is very high
  • The writer always gets priority
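
To make this concrete, here is a minimal seqlock sketch (an assumed structure, not the exact code from the post). The writer makes the sequence odd while it writes; readers retry if the sequence was odd or changed during their read:

import java.util.concurrent.atomic.AtomicLong;

public class SeqLockPoint {

    private final AtomicLong seq = new AtomicLong();
    private volatile int x, y;

    // Single writer assumed; it is never blocked by readers.
    public void write(int newX, int newY) {
        seq.incrementAndGet(); // odd: write in progress
        x = newX;
        y = newY;
        seq.incrementAndGet(); // even: write complete
    }

    // Optimistic read: only loads, no store to any shared lock state.
    public int[] read() {
        while (true) {
            long s = seq.get();
            if ((s & 1) != 0) {
                continue; // writer active, retry
            }
            int rx = x;
            int ry = y;
            if (seq.get() == s) {
                return new int[]{rx, ry}; // sequence unchanged, read is consistent
            }
        }
    }
}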


With so many benefits, I am sure you will never use the java RW lock!

Conclusion
SeqLocks are a very good alternative to reader/writer locks, especially in the multicore world where we use new techniques to get the best out of the processor. Someday java will add support for seqlocks.
Code is available @ GitHub for your reference.

Sunday, 9 June 2013

CPU cache access pattern

Things to know about low latency

If you are developing a low latency application then you have to remove every bottleneck that can cause latency to increase. Some of the bottlenecks that come to my mind:

 - Input/Output - I/O has a big effect on latency; it will cause the CPU to stall. An SSD disk may be one option, but you still want to keep this out of the core processing logic.

 - Network - Same effect as I/O, so you have to reduce network reads/writes in your core logic; there are many software and hardware solutions to improve this.

 - Locks in code - Use non-blocking techniques.

 - Keeping data in RAM - I will explore this option in this blog.

So we can keep all data in RAM to get the best latency possible. RAM is very cheap, so let's put everything in RAM. In theory you should be able to keep the CPU busy if all the data that you need for computing is available in RAM, but in practice it doesn't happen that way, because of the underlying data structures used.
The CPU stalls for some time due to slow access to RAM.

Experiment
Let's try a simple experiment to prove that keeping all data in RAM is just not enough to achieve low latency.

Problem:
25 million instrument objects are stored in memory, and each object has 2 attributes:
 - Instrument Id
 - Price
A set of input prices is passed as input, and all the instruments matching a price are returned. It is a simple search based on the price attribute.

Structure Of Instrument Object
Two types of instrument object are created for this test.

 - As a simple java object
The Instrument object is a simple java class with 2 attributes.

class Instrument
{
    private int id;
    private int price;

    public Instrument(int id, int price) {
        this.id = id;
        this.price = price;
    }

    public int getId() {
        return id;
    }

    public int getPrice() {
        return price;
    }
}

 - Column type object
For each attribute an array is created. This structure is more CPU cache friendly; we will see why that is so later.
Sample ColumnInstrumentStore object:

class ColumnInstrumentStore
{
    private int[] ids;
    private int[] prices;

    public ColumnInstrumentStore(int size)
    {
        ids = new int[size];
        prices = new int[size];
    }

    public int getId(int index) {
        return ids[index];
    }

    public int getPrice(int index) {
        return prices[index];
    }

    public void setValue(int index, int id, int price) {
        ids[index] = id;
        prices[index] = price;
    }
}

Let's measure performance
Let's try to measure the search cost for the 2 types of instrument object.

 X axis - No of reading
 Y axis - Time taken in ms

A linear search is performed on 25 million instrument objects; a sketch of the two search loops is below.
The instrument object defined as columns is around 5% to 25% faster as compared to the normal java based instrument object; on average it is 15% faster.
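
For reference, the two search loops look roughly like this (a sketch using the classes defined above; countMatches is an illustrative name, not from the post):

// Object layout: one pointer chase per element; neighbouring prices are
// scattered across the heap.
static int countMatches(Instrument[] instruments, int price) {
    int count = 0;
    for (Instrument instrument : instruments) {
        if (instrument.getPrice() == price) count++;
    }
    return count;
}

// Column layout: prices are adjacent ints, so each 64-byte cache line brings
// in 16 prices at once and the hardware pre-fetcher can stream through them.
static int countMatches(ColumnInstrumentStore store, int size, int price) {
    int count = 0;
    for (int i = 0; i < size; i++) {
        if (store.getPrice(i) == price) count++;
    }
    return count;
}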

Let's look at another number: time taken per iteration. The time taken for 1 iteration is very small, so I have used nanoseconds to measure it.

 X axis - No of reading
 Y axis - Time taken in nanoseconds

This explains why search on column based objects is fast: time spent per iteration is around 15% less.

Why the column based object is fast
Let's look at the reasons why search performs better on the column layout type of object as compared to the normal object.

 - CPU pre-fetching - Today almost all processors support hardware pre-fetch. The CPU pre-fetches data that is required by the application, to reduce memory latency.

 - Better use of CPU cache - The CPU moves data from main memory to cache via cache lines. In the current generation of CPUs a cache line is 64 bytes, so if you ask for 1 byte of data, the processor will still bring 64 bytes from RAM to the CPU cache line. This is called spatial locality; definition from wikipedia:

Spatial locality: if a particular memory location is referenced at a particular time, then it is likely that nearby memory locations will be referenced in the near future. In this case it is common to attempt to guess the size and shape of the area around the current reference for which it is worthwhile to prepare faster access

This CPU behavior can be used to reduce round trips to RAM by co-locating your data.

Let's take the Instrument object to understand layout.

- Object Layout

- Column Layout

In case of the column based instrument, data is stored in integer arrays. If the program requests a single value of the array, it will get the next 15 elements of the array due to the cache line transfer, but in case of the normal object it will get only 7 instrument prices.
The normal object layout requires double the round trips to memory, and that is the reason why it is slow.
If you co-locate your objects then you can use the CPU cache much more efficiently.

Most real objects are more complex than the Instrument object that I have used, and you can imagine the number of round trips they require to RAM.
You can get more details of memory latency from latency-number-that-you-should-know.
Just to put those numbers in context:

L1 Cache : 0.5 ns
Main Memory : 100 ns (200x more than L1)


How does it perform in the multi-threaded world?
Nowadays it is impossible to get a machine with a single core; we live in a multi core world!
Each logical core has its own private cache, so if your data structure is cache friendly then it will scale linearly as you add more threads/cores to your processing.

Let's look at some numbers for multi threaded search.

Search latency

Loop iterations

Both searches & iterations per second improve as more threads are added; this will be even more effective on NUMA machines, because memory access is not uniform.

Just by having a CPU cache friendly layout you can improve application performance.
The column layout has more benefits; more in the next blog.

Sunday, 19 May 2013

Lock Less Java Object Pool

It has been a while since I wrote anything; I have been busy with my new job, which involves doing some interesting work in performance tuning. One of the challenges is to reduce object creation during the critical parts of the application.

Garbage Collection hiccups have been the main pain point in java for some time, although java has improved over time with GC algorithms. Azul is the market leader in developing pause-less GC, but Azul JVMs are not free as in speech!

Creating too many temporary/garbage objects doesn't work too well, because it creates work for the GC and that affects latency; too much garbage also doesn't work well with multi core systems, because it causes cache pollution.

So how should we fix this?

Garbage less coding
This is only possible if you know how many objects you need upfront and pre-allocate them, but in reality that is very difficult to find out. In case you still manage to do that, then you have to worry about other issues:

  • You might not have enough memory to hold all the objects you need
  • You have to handle concurrency

So what is the solution for the above problems?
There is the Object Pool design pattern, which addresses both of the above issues: it lets you specify the number of objects that you need in the pool and handles concurrent requests to serve objects.

Object pools have been the basis of many applications that have low latency requirements; a flavor of object pool is the Flyweight design pattern.

Both of the above patterns will help us avoid object creation. That is great, so now the GC's work is reduced, and in theory our application performance should improve; but in practice it doesn't happen that way, because the Object Pool/Flyweight has to handle concurrency, and whatever advantage you get by avoiding object creation is lost to concurrency issues.

What are the most common ways to handle concurrency?
An object pool is a typical producer/consumer problem and it can be solved by using the following techniques:

Synchronized - This was the only way to handle concurrency before JDK 1.5; apache has written a wonderful object pool API based on synchronized.

Locks - Java added excellent support for concurrent programming in JDK 1.5; there has been some work on using locks to develop object pools, e.g. furious-objectpool.

Lock Free - I could not find any implementation that is built using a fully lock free technique, but furious-objectpool uses a mix of ArrayBlockingQueue & ConcurrentLinkedQueue.

Let's measure performance
In this test I created a pool of 1 million objects, and those objects are accessed by the different pool implementations; objects are taken from the pool and returned back to it.

This test starts with 1 thread, and then the number of threads is increased to measure how the different pool implementations perform under contention.

X Axis - No of threads
Y Axis - Time in ms - lower is better

This test includes the pool from Apache, Furious Pool & an ArrayBlockingQueue based pool.

The Apache one is the worst, and as the number of threads increases performance degrades further; the reason is that the Apache pool is based on heavy use of "synchronized".

The other two, Furious & the ArrayBlockingQueue based pool, perform better, but both also slow down as contention increases.

The ArrayBlockingQueue based pool takes around 1000 ms for 1 million items when 12 threads are trying to access the pool; Furious pool, which internally uses ArrayBlockingQueue, takes around 1975 ms for the same thing.

I have to do some more detailed investigation to find out why Furious is taking double the time, because it is also based on ArrayBlockingQueue.

The performance of the ArrayBlockingQueue based pool is decent, but it is a lock based approach. What type of performance can we get if we implement a lock free pool?

Lock free pool
Implementing a lock free pool is not impossible, but it is a bit difficult because you have to handle multiple producers & consumers.

I will implement a hybrid pool, which will use a lock on the producer side & a non blocking technique on the consumer side.

Let's have a look at some numbers.

I performed the same test with the new implementation (FastPool), and it is almost 30% faster than the ArrayBlockingQueue based pool.

A 30% improvement is not bad; it can definitely help in meeting latency goals.

What makes Fast Pool fast!
I used a couple of techniques to make it fast:

  • Producers are lock based - multiple producers are managed using locks; this is the same as ArrayBlockingQueue, so nothing great about this.

  • Immediate publication of released items - the element is published before the lock is released, using a cheap memory barrier. This gives some gain.

  • Consumers are non blocking - CAS is used to achieve this; consumers are never blocked by producers. ArrayBlockingQueue blocks consumers because it uses the same lock for producers & consumers.

  • Thread Local to maintain value locality - a Thread Local is used to remember the last slot that was used; this reduces contention to a great extent (see the sketch after this list).
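
Putting these techniques together, a minimal sketch of the hybrid pool could look like this (an assumed structure, not the actual FastObjectPool code linked below):

import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

public class HybridPool<T> {

    private final AtomicReferenceArray<T> slots;
    private final ReentrantLock producerLock = new ReentrantLock();
    private final ThreadLocal<Integer> lastIndex = ThreadLocal.withInitial(() -> 0);
    private int releaseIndex;

    public HybridPool(int capacity) {
        slots = new AtomicReferenceArray<>(capacity);
    }

    // Producer side is lock based, like ArrayBlockingQueue; the volatile store
    // into the array publishes the item before the lock is released.
    public void release(T item) {
        producerLock.lock();
        try {
            // A free slot must exist as long as callers only release items
            // they acquired from this pool.
            while (slots.get(releaseIndex) != null) {
                releaseIndex = (releaseIndex + 1) % slots.length();
            }
            slots.set(releaseIndex, item);
            releaseIndex = (releaseIndex + 1) % slots.length();
        } finally {
            producerLock.unlock();
        }
    }

    // Consumer side is non blocking: CAS the slot to null to claim an item,
    // starting from the slot this thread used last to reduce contention.
    public T acquire() {
        int start = lastIndex.get();
        for (int n = 0; n < slots.length(); n++) {
            int i = (start + n) % slots.length();
            T item = slots.get(i);
            if (item != null && slots.compareAndSet(i, item, null)) {
                lastIndex.set(i);
                return item;
            }
        }
        return null; // pool exhausted, caller can retry
    }
}

The pool is seeded by calling release once for each pre-allocated object at start-up.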
If you are interested in having a look at the code then it is available @ FastObjectPool.java

Tuesday, 25 December 2012

Are you using correct logging framework ?

A logging framework is at the heart of every application; it helps in troubleshooting production issues, knowing how your application is being used, finding the bottlenecks in the application, and many more things.

Using the right logging framework is key; Wikipedia has a list of the famous ones in the java world.

So logging has a lot of benefits, but it also brings a lot of overhead, so you trade off overhead for the benefit that you get. It is interesting to measure the cost of that overhead, and we all know it is a big one because logging is I/O bound; so we come up with the best strategy on what to log, how much to log, which framework is best, etc.

One problem with current logging frameworks is that not much work has been done on performance improvement, especially on taking advantage of multi core architecture. Applications try to do more work to take advantage of multiple cores, and that may result in more log messages; since log frameworks are not there yet, they don't scale.

Cost Of Log Message

Let's measure the time spent in logging.
I wrote a java program that sums the elements of an array, in parallel. The array is divided into chunks, and each task sums a chunk and logs a message before & after it processes the chunk. I have taken a simple computation problem because we want to measure the cost of logging.

For benchmarking purposes I do the calculation with/without logging to check how much time we spend on logging, and the numbers are crazy.

The java JDK 1.4 logging framework (java.util.logging) is used for the benchmark. I only took the java one because all the others, like log4j, apache logging, SLF4J, etc., are almost the same when they log messages to a file.

Details of Machine
OS : Windows 7, 64 bit
Processor : Intel i5, 2.40 GHz
No Of Cores : 4

The numbers are really crazy: once you add logging, performance drops by 6X to 10X. At that cost, logging is hardly worth it.

More time is spent in logging than doing the real work, but the fact of life is that we still need logging; if it is not there, then it is a nightmare to troubleshoot production issues.

Why is it so slow?
To find out why it is super slow, we have to dive into the code, so let's have a look at the default handler (FileHandler) that the java logging framework provides.

FileHandler has a synchronized function that performs the core logging, and we know what type of performance degradation you can get in a multi threaded environment, so that is the number 1 reason for slowness.

public synchronized void publish(LogRecord record) {
    if (!isLoggable(record)) {
        return;
    }
    super.publish(record);
    flush();
    // ... rest of the method elided
}


The other reason for slowness is that an I/O operation is performed for each and every log call; there is no buffering. I am not sure why it is done like that, maybe to keep it simple, but all this simplicity adds to the performance cost.

All of these choices result in a lot of contention, and we see a big degradation in performance from harmless-looking code.

What can we do now?
We have to find some alternate framework that doesn't make our application 10X slower!
We have to do a couple of quick things:
 - Get rid of synchronized
 - Add some I/O buffering
 - Use async logging

So I added these things, created a simple logging util, and measured its cost.

In the chart, the light green line is the logger with some of these improvements, and it is amazing to see that we are back to the original performance: it does not add any significant overhead, it is almost as if there were no logging in the code.

So just by switching the logging framework we can get a 6X to 10X benefit.

What gives us this performance improvement?
  - A ConcurrentLinkedQueue is used to store all the log messages. ConcurrentLinkedQueue is a lock free queue, but it is not bounded; a bounded queue could be used to reduce the risk of the producer running faster than the consumer.

  - Buffering is added for the I/O operation; this reduces the number of I/O operations.

  - A pre-allocated ByteBuffer is used to hold all the messages that need to be written to the file.

  - A lock based wait strategy is used when there is no message in the queue; a better waiting strategy can be chosen based on requirements, but for logs I think lock based is fine (a sketch of these ideas follows).
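
A minimal sketch of such an async logger (an assumed structure, not the code linked below; messages are assumed to be smaller than the buffer, and a simple sleep stands in for the wait strategy):

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentLinkedQueue;

public class AsyncLogger {

    private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024); // pre-allocated
    private final FileChannel channel;

    public AsyncLogger(String file) throws IOException {
        channel = new FileOutputStream(file, true).getChannel();
        Thread writer = new Thread(this::drainLoop, "async-logger");
        writer.setDaemon(true);
        writer.start();
    }

    // Caller side: no lock, no I/O, just an enqueue on a lock free queue.
    public void log(String message) {
        queue.offer(message + System.lineSeparator());
    }

    private void drainLoop() {
        try {
            while (true) {
                String msg = queue.poll();
                if (msg == null) {          // queue empty: flush and back off
                    flushBuffer();
                    Thread.sleep(1);        // simple wait strategy for the sketch
                    continue;
                }
                byte[] bytes = msg.getBytes(StandardCharsets.UTF_8);
                if (buffer.remaining() < bytes.length) {
                    flushBuffer();          // buffered I/O: write in large chunks
                }
                buffer.put(bytes);
            }
        } catch (IOException | InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void flushBuffer() throws IOException {
        buffer.flip();
        while (buffer.hasRemaining()) {
            channel.write(buffer);
        }
        buffer.clear();
    }
}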

Conclusion
The time has come to re-think the logging framework you use; just by using the correct logging framework we can get a big performance boost. To make applications more responsive, we have to look for alternatives. This is very important in the low latency & high throughput space.

The example that I used is very basic; it doesn't have all the features, but it can be easily extended to add them.


Link To Code
Code available @ github

Friday, 30 November 2012

Lock Free bounded queue


JDK 1.5 was the first step towards adding serious concurrency support to java; it changed the way you write concurrent programs, and later versions (1.6, 1.7, etc.) added more concurrency support.

One of the important building blocks for concurrent programming is the queue, and java has a lot of them: ArrayBlockingQueue, ArrayDeque, ConcurrentLinkedQueue, DelayQueue, LinkedBlockingDeque, LinkedBlockingQueue, PriorityBlockingQueue, PriorityQueue, SynchronousQueue.
I am sure you will find one queue that fits your need!

Locks are used in most of the queues for managing concurrent access to the data structure, for the obvious reason of simplicity of use; but a lock based strategy makes it difficult to develop low latency systems, as it can result in priority inversion, context switching, deadlock, etc.
The alternate way is using a non blocking algorithm, which is a bit more difficult to develop; one class of non blocking algorithms is lock free algorithms.

Java added support for lock free via the java.util.concurrent.atomic package, so now it's time for some action with the lock free API.

Java 1.5+ has a lock free queue, ConcurrentLinkedQueue, which is very fast but unbounded, and such a queue can be very risky to use in a real application: most of the time a queue is not balanced, it is either full or empty, and if it is unbounded then the system can run out of resources.

I will implement a lock free bounded queue and measure its performance against a lock based bounded queue (ArrayBlockingQueue) and a lock free unbounded queue (ConcurrentLinkedQueue). I chose these queues because they are the fastest queues available in the JDK.

Number Game

  Test Environment Details
  OS : Windows 7, 64 bit
  Processor : Intel (R) Core(TM) i7-2820QM CPU @ 2.30 GHz
  Processor Arch : Sandy Bridge, 4 physical cores/8 threads

  For this test 10 million messages were added to the queue using a single producer, and the number of consumers ranges from 1 to 5; the non blocking calls (offer/poll) are used to produce/consume messages from the queue. Performance is best when there is 1 consumer, and as the number of consumers increases performance degrades, so it is evident that contention is the culprit.

  Cpu Usage

Cpu usage for LockFreeBoundedQueue

Cpu usage for ArrayBlockingQueue

From the CPU usage it is clear that contention is causing the drop in CPU usage, and with the lock free queue CPU usage stays in an acceptable range.

Test Code of LockFreeBoundedQueue

Link To Code
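
A minimal sketch of what such a queue could look like, assuming a single producer and multiple consumers and using only the non blocking offer/poll calls from the test (this is an illustration, not the exact code from the link above):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

public class LockFreeBoundedQueue<E> {

    private final AtomicReferenceArray<E> items;
    private final int capacity;
    private long tail;                                // written only by the single producer
    private final AtomicLong head = new AtomicLong(); // consumers CAS this to claim a slot

    public LockFreeBoundedQueue(int capacity) {
        this.capacity = capacity;
        this.items = new AtomicReferenceArray<>(capacity);
    }

    // Non blocking offer: fails if the queue is full.
    public boolean offer(E e) {
        int index = (int) (tail % capacity);
        if (items.get(index) != null) {
            return false; // slot not consumed yet, queue is full
        }
        items.set(index, e); // volatile store publishes the element
        tail++;
        return true;
    }

    // Non blocking poll: consumers race on head with CAS; no locks involved.
    public E poll() {
        while (true) {
            long h = head.get();
            int index = (int) (h % capacity);
            E e = items.get(index);
            if (e == null) {
                return null; // queue is empty
            }
            if (head.compareAndSet(h, h + 1)) {
                items.set(index, null); // free the slot for the producer to reuse
                return e;
            }
            // lost the race with another consumer, retry
        }
    }
}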

The above code is just an idea to demonstrate how powerful non blocking algorithms can be. The code can be enhanced to add more features, like multiple producers and support for the blocking calls put/take.
Before adding blocking support (put/take) to the queue we have to think about the blocking strategy; there are a couple of options, like spinning, hybrid spinning, etc.

  

Saturday, 3 November 2012

Latency number that you should know

Many of you work on low latency & high throughput systems. The key to developing such systems is understanding latency, be it of CPU cache, RAM, disk or network. I found some interesting latency numbers; understanding these numbers is very important, because they are ultimately bounded by physical limits like the speed of light, and we all know that we will never be able to go faster than light.

L1 cache reference ......................... 0.5 ns
Branch mispredict ............................ 5 ns
L2 cache reference ........................... 7 ns
Mutex lock/unlock ........................... 25 ns
Main memory reference ...................... 100 ns             
Compress 1K bytes with Zippy ............. 3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network ....... 20,000 ns  =  20 µs
SSD random read ........................ 150,000 ns  = 150 µs
Read 1 MB sequentially from memory ..... 250,000 ns  = 250 µs
Round trip within same datacenter ...... 500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD* ..... 1,000,000 ns  =   1 ms
Disk seek ........................... 10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA .... 150,000,000 ns  = 150 ms

CPU Cache latency cost

The L1 cache is nearest to the core, and as you move away you can see the latency taking a hit. If you are doing something a billion times, then it gets converted into a human-noticeable delay; here is what it looks like if we scale each nanosecond to a second.

Minute

L1 cache reference                  0.5 s         One heart beat (0.5 s)
Branch mispredict                   5 s           Yawn
L2 cache reference                  7 s           Long yawn
Mutex lock/unlock                   25 s          Taking a couple of deep breaths

Hour

Main memory reference               100 s         Answering nature's call

Week

Read 1 MB sequentially from memory  2.9 days      Welcome long weekend

So to get better performance you have to try to stay close to the L1/L2 cache, which is a really difficult thing to do. You have to develop cache-oblivious algorithms to get super performance.

I/O latency cost

Read 1 MB sequentially from disk .... 20,000,000 ns  =  20 ms
Read 1 MB sequentially from SSD (~1GB/sec) ..... 1,000,000 ns  =   1 ms

So a normal disk can on average read 50 MB/sec, and an SSD 1000 MB/sec.

50 MB/sec is pretty fast, and there are many techniques to push sequential reads higher, e.g. by adjusting the read buffer size. So before deciding on the type of disk to use, you should make sure you can process data at the rate it is read from a normal disk; if you can't process that fast, then there is no point in getting an SSD disk.
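
As an illustration of the buffer-size point (hypothetical code, not from the post): reading sequentially through a large buffer reduces the number of system calls, which is the simplest knob for raising sequential read throughput.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class SequentialRead {

    public static void main(String[] args) throws IOException {
        byte[] chunk = new byte[1024 * 1024]; // read in 1 MB chunks
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream(args[0]), 8 * 1024 * 1024)) {
            long total = 0;
            long start = System.currentTimeMillis();
            int n;
            while ((n = in.read(chunk)) != -1) {
                total += n;
            }
            long ms = System.currentTimeMillis() - start;
            System.out.printf("Read %d MB in %d ms%n", total / (1024 * 1024), ms);
        }
    }
}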


Friday, 2 November 2012

Power of Java MemoryMapped File


In JDK 1.4 an interesting feature, memory mapped files, was added to java; it allows mapping any file into OS memory for efficient reading. Memory mapped files can be used to develop an IPC type of solution. This article is an experiment with memory mapped files to create IPC.

Some details about Memory Mapped File, definition from WIKI


A memory-mapped file is a segment of virtual memory which has been assigned a direct byte-for-byte correlation with some portion of a file or file-like resource. This resource is typically a file that is physically present on-disk, but can also be a device, shared memory object, or other resource that the operating system can reference through a file descriptor. Once present, this correlation between the file and the memory space permits applications to treat the mapped portion as if it were primary memory.


Sample Program

There are two java programs: one is the writer and the other is the reader. The writer is the producer and writes to the memory mapped file; the reader is the consumer and reads messages from the memory mapped file. This is just a sample program to show the idea; it doesn't handle many edge cases, but it is good enough to build something on top of memory mapped files.

MemoryMapWriter


import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMapWriter {

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        File f = new File("c:/tmp/mapped.txt");
        f.delete();
        FileChannel fc = new RandomAccessFile(f, "rw").getChannel();
        long bufferSize = 8 * 1000;
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_WRITE, 0, bufferSize);
        int start = 0;
        long counter = 1;
        long HUNDREDK = 100000;
        long startT = System.currentTimeMillis();
        long noOfMessage = HUNDREDK * 10 * 10; // 10 million messages
        for (;;) {
            if (!mem.hasRemaining()) {
                // current window is exhausted: map the next chunk of the file
                start += mem.position();
                mem = fc.map(FileChannel.MapMode.READ_WRITE, start, bufferSize);
            }
            mem.putLong(counter); // write the message
            counter++;
            if (counter > noOfMessage) {
                break;
            }
        }
        long endT = System.currentTimeMillis();
        long tot = endT - startT;
        System.out.println(String.format("No Of Message %s , Time(ms) %s ", noOfMessage, tot));
    }
}
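
The matching reader maps the same file and walks it in chunks. A sketch of what the consumer side could look like, assuming the writer above (since the written counter values start at 1, a zero read means the writer has not reached that position yet):

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MemoryMapReader {

    public static void main(String[] args) throws FileNotFoundException, IOException, InterruptedException {
        FileChannel fc = new RandomAccessFile(new File("c:/tmp/mapped.txt"), "rw").getChannel();
        long bufferSize = 8 * 1000;
        int start = 0;
        MappedByteBuffer mem = fc.map(FileChannel.MapMode.READ_ONLY, start, bufferSize);
        long noOfMessage = 100000 * 10 * 10;
        long count = 0;
        for (;;) {
            if (!mem.hasRemaining()) {
                // current window is exhausted: map the next chunk of the file
                start += mem.position();
                mem = fc.map(FileChannel.MapMode.READ_ONLY, start, bufferSize);
            }
            long value = mem.getLong();
            if (value == 0) {
                // writer has not written this slot yet: rewind and wait
                mem.position(mem.position() - 8);
                Thread.sleep(1);
                continue;
            }
            count++;
            if (count >= noOfMessage) {
                break;
            }
        }
        System.out.println("Read " + count + " messages");
    }
}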