Correctly applied, data chunking can reduce the number of seeks through the SDS data array needed to locate the data to be read or written, thereby improving I/O performance. Incorrectly applied, however, chunking can significantly degrade the performance of reading from or writing to an SDS. Avoiding situations where chunking works against the goal of I/O performance optimization requires knowledge of how chunked SDSs are created and accessed, together with application-specific knowledge of how data will be read from them.
The following figure illustrates the difference between a non-chunked SDS and a chunked SDS.
Specifically, there are issues that affect the process of reading from chunked SDSs and related issues that affect the process of writing to them; both are discussed in the paragraphs that follow.
To illustrate this, the case of subsetting a non-chunked SDS will be described first. Consider the example of a non-chunked, two-dimensional, 2,000 x 1,600 SDS array of integer data. The following figure shows how this array is filled with data in a row-wise fashion. (Each square in the array shown represents 100 x 100 integers.)
The efficiency of subsetting from chunked SDSs is partly determined by the size of the chunk: the smaller the chunk size, the more seeks will be necessary. Chunking can substantially improve I/O performance when data is read along the slowest-varying dimension. It can substantially degrade performance when data is read along the fastest-varying dimension.
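For reference, the following minimal sketch shows how such a chunked SDS could be created with SDsetchunk. The file and SDS names are illustrative, and the 100 x 100 chunk dimensions are chosen to match the squares in the figure above.

#include "mfhdf.h"

int main(void)
{
    int32 dims[2] = {2000, 1600};   /* SDS dimensions from the example */
    HDF_CHUNK_DEF c_def;            /* chunk definition union          */
    int32 sd_id, sds_id;

    /* Create the file and a two-dimensional integer SDS. */
    sd_id  = SDstart("chunked.hdf", DFACC_CREATE);
    sds_id = SDcreate(sd_id, "Data", DFNT_INT32, 2, dims);

    /* Define 100 x 100 chunks; each chunk then holds
       100 * 100 * 4 = 40,000 bytes of integer data. */
    c_def.chunk_lengths[0] = 100;
    c_def.chunk_lengths[1] = 100;
    SDsetchunk(sds_id, c_def, HDF_CHUNK);

    /* ... data would be written here with SDwritedata or SDwritechunk ... */

    SDendaccess(sds_id);
    SDend(sd_id);
    return 0;
}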
Consider the example of a tiled, two-dimensional SDS containing one million bytes of image data. Each tile of image data has been compressed as illustrated in the following figure.
NOTE: It is recommended that the SDwritechunk routine be used to write to a compressed and chunked SDS. SDwritechunk performs this operation more efficiently than the combination of SDsetcompress and SDwritedata, because the user supplies the chunk's location directly to SDwritechunk, whereas SDwritedata must retrieve the chunk information from the file, which involves more computational overhead.
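As a sketch of this approach (the SDS identifier and data buffer are assumed to come from the caller, and the chunk coordinates are illustrative), the chunk definition below requests GZIP-compressed 100 x 100 chunks and then writes one chunk directly by its chunk-origin coordinates:

#include "mfhdf.h"

/* Write one compressed chunk directly with SDwritechunk.  The SDS
   (sds_id) is assumed to be newly created, with dimensions that are
   multiples of the 100 x 100 chunk size. */
void write_one_chunk(int32 sds_id, int32 *chunk_buf)
{
    HDF_CHUNK_DEF c_def;
    int32 origin[2];

    /* Define 100 x 100 chunks compressed with GZIP (deflate). */
    c_def.comp.chunk_lengths[0] = 100;
    c_def.comp.chunk_lengths[1] = 100;
    c_def.comp.comp_type = COMP_CODE_DEFLATE;
    c_def.comp.cinfo.deflate.level = 6;
    SDsetchunk(sds_id, c_def, HDF_CHUNK | HDF_COMP);

    /* Write the chunk in the second row and third column of chunks.
       The origin is given in chunk coordinates, not element
       coordinates, so the library does not have to search the file
       to locate the chunk. */
    origin[0] = 1;
    origin[1] = 2;
    SDwritechunk(sds_id, origin, (VOIDP)chunk_buf);
}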
If the chunk size is substantially larger than the subsets being read or written, increased I/O overhead is incurred in reading each chunk, and, if the data is compressed, increased processing overhead is incurred in decompressing it. If the chunk size is substantially smaller, the HDF library incurs increased performance and memory/disk storage overhead in accessing and keeping track of the larger number of chunks, and there is the added danger of exceeding the maximum number of chunks per file (64K).
It is recommended that the chunk size be at least 8K bytes.
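For example, for 4-byte integer data, 50 x 50 element chunks occupy 10,000 bytes each, just above this threshold, while the 100 x 100 chunks used in the earlier sketch occupy 40,000 bytes each.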
There is a potential performance problem when subsets are read from chunked datasets and insufficient chunk cache space has been allocated. The problem arises because two separate levels of the library work together to read the subset into memory, and the two levels have different perspectives on how the data in the dataset is organized.
Specifically, higher-level routines like SDreaddata access the data in a strictly row-wise fashion, not according to the chunked layout. However, the lower-level code that directly performs the read operation accesses the data according to the chunked layout.
As an illustration of this, consider the 4 x 12 dataset depicted in the following figure.
Suppose a subset consisting of two rows of eight values each is to be read from this dataset. The higher-level code first requests the first row of the subset:

3 4 5 6 7 8 9 10

Then the second row:

23 24 25 26 27 28 29 30

This involves two read operations and two disk accesses, and sixteen values are read.
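In terms of the API, such a row-wise subset read might look like the following sketch. The start coordinates are illustrative, and sds_id is assumed to be an SDS identifier returned by SDselect.

#include "mfhdf.h"

/* Read a 2 x 8 subset of the 4 x 12 dataset.  SDreaddata always
   requests the subset row by row, regardless of how the data is
   actually laid out on disk. */
void read_subset(int32 sds_id, int32 *subset_buf)
{
    int32 start[2] = {1, 2};   /* starting row and column (0-based) */
    int32 edges[2] = {2, 8};   /* read 2 rows of 8 values each      */

    SDreaddata(sds_id, start, NULL, edges, (VOIDP)subset_buf);
}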
Now suppose that this dataset is tiled with 2 x 2 element tiles. On the disk, the data in this dataset is stored as twelve separate tiles, which for the purposes of this example will be labelled A through L.
A request is made to read the same subset of numbers into memory. As before, SDreaddata determines the order in which the numbers will be read; the routine has no information about the tiled layout. On the first read operation, the higher-level code again requests the values in the first rows of tiles B through E from the lower-level code. Suppose also that the chunk cache has only enough space to hold two tiles.
To access those numbers, the lower-level code must read in four tiles: B, C, D, and E. It reads in tiles B and C, retrieving the values 3, 4, 5, and 6. However, as the cache space is now completely filled, it must overwrite tile B in the cache to access the values 7 and 8, which are in tile D. It then has to overwrite tile C to access the values 9 and 10, which are in tile E. Note that, in each case, half of the values from the tiles that are read in are unused, even though those values will be needed later.
Next, the higher-level code requests the second row of the subset. The lower-level code must reread tile B to access the values 23 and 24. But tile B is no longer in the chunk cache. In order to access tile B, the lower-level code must overwrite tile D, and so on. By the time the subset read operation is complete, it has had to read in each of the tiles twice. Also, it has had to perform 8 disk accesses and has read 32 values.
Now consider a more practical example with the following parameters:
Suppose the dataset is tiled as follows:
When the first row has been read, the cache contains tiles D and E.
The second row is then read. To access its first values, the lower-level code must read tile A; because the cache is full, it must overwrite tile D to read tile A in. The following set of cache overwrites then occurs.
Essentially, five times as many disk accesses are performed, and 900 times as much data is read, as with the untiled 3,000 x 8,400 dataset. The severity of this performance degradation increases non-linearly as the size of the dataset increases.
From this example it should be apparent that, to prevent this kind of chunk cache "thrashing" from occurring, the size of the chunk cache should be made equal to, or greater than, the number of chunks along the fastest-varying dimension of the dataset. In this case, the chunk cache size should be set to 4.
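As a sketch (sds_id again denotes an open SDS identifier), the cache for this example could be enlarged with SDsetchunkcache before the subset is read:

/* Allow the chunk cache to hold 4 chunks -- one complete row of
   chunks in the example above.  The flag argument of
   SDsetchunkcache must currently be set to 0. */
SDsetchunkcache(sds_id, 4, 0);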
When a chunked SDS is opened for reading or writing, the default cache size is set to the number of chunks along the fastest-varying dimension of the SDS. This prevents cache thrashing in situations where the user does not set the size of the chunk cache. Users should exercise caution when altering this default chunk cache size.