Wednesday 16 December 2020

Sparse Matrix

 Vectors are a very powerful data structure and it allows to write/read data using index. These vectors can be combined to create a matrix. Matrix are used in numerical analysis.   

One of the problems with matrix is that most of the real world matrix are sparse and the traditional way of allocating 2 Dimensional arrays to represent matrix can waste lots of space and also need lots of upfront memory.

In this post I will share some ways of storing sparse matrix.

In this example I will use Integer value as an example but these techniques can be used for any data type.

Dense Matrix

This is the most straightforward and default way. This is also most efficient for read and write access but needs a full matrix to be allocated before using it.  




public class IntDenseVectorValue {

final int[][] values;

public IntDenseVectorValue(int xSize, int ySize) {
this.values = new int[xSize][ySize];
}

public void set(int x, int y, int value) {
values[x][y] = value;
}


public int get(int x, int y) {
return values[x][y];
}
}

Many real world data sets are sparse, so lots of space gets wasted in such situations. 

Block Dense Matrix

This technique is based on creating small blocks of W width and each block is a small 2 dimension array.

During set/get X variables are used to identify block numbers.



This approach avoids upfront allocation and a very good fix for incremental data load. Indirection layer is added with help of Map actual block is located. Performance wise this is the same as a Dense vector but with less memory overhead. 

Code snippet for block dense matrix.

public class IntDenseBlockVectorValue {
final int blockSize;
final int width;
final int length;
final Map<Integer, int[][]> values = new HashMap<>();

public IntDenseBlockVectorValue(int x, int y, int blockSize) {
this.length = x;
this.width = y;
this.blockSize = blockSize;
}

public void set(int x, int y, int value) {
check(x, y);
int blockIndex = x / blockSize;
int[][] block = values.computeIfAbsent(blockIndex, index -> new int[blockSize][width]);
int rowIndex = x % blockSize;
block[rowIndex][y] = value;
}

private void check(int x, int y) {
if (x >= length || y >= width) {
throw new ArrayIndexOutOfBoundsException(String.format("Index [%s,%s] does not exists", x, y));
}
}

public int get(int x, int y) {
check(x, y);

int blockIndex = x / blockSize;
int[][] block = values.get(blockIndex);
int rowIndex = x % blockSize;
return block[rowIndex][y];
}

} 

Sparse Row Matrix

This technique takes block allocation to the next level by doing allocation at row level.

Map contains an entry for every row and a single dimension vector is used as a value. 


This helps in reducing wastage to a great extent. If some rows are not added then that row never gets allocated. 

Code snippet


public class IntSparseRowVectorValue {
private final Map<Integer, int[]> values = new HashMap<>();
private final int width;

public IntSparseRowVectorValue(int width) {
this.width = width;
}

public void set(int x, int y, int value) {
int[] row = values.computeIfAbsent(x, index -> new int[width]);
row[y] = value;
}

public int get(int x, int y) {
int[] row = values.get(x);
return row[y];
}
} 


Sparse Row/Column Matrix

This decomposed row vector to individual columns and gives ultimate memory gain with some trade off of read/write access.

This is modeled as Map of Map.



Code snippet

public class SparseRowColVectorValue {

private final Map<Integer, Map<Integer, Integer>> values = new HashMap<>();

public void set(int x, int y, int value) {
Map<Integer, Integer> row = values.computeIfAbsent(x, index -> new HashMap<>());
row.put(y, value);
}

public int get(int x, int y) {
Map<Integer, Integer> row = values.get(x);
return row.get(y);
}

} 

One thing to note in this approach is that the default map of JDK will need a wrapper of primitive type to store value and it can result in some performance overhead due to boxing/unboxing. This issue can be solved by having a primitive Map. 

 

Conclusion

Each of sparse vector representation is special purpose and should be used based on usecase. All of these types can be mixed by having abstraction on top of it that will dynamically identify correct matrix implementation based on underlying data.



 







No comments:

Post a Comment