
Commit 9ba2580

Add ZSTD support (#104)
Add support for compressing the data with ZSTD, a new fast and efficient compression algorithm. ZSTD is pulled in as a submodule. This bumps the version to 0.4.0.
Co-authored-by: Richard Shaw <richard@phas.ubc.ca>
1 parent 0aee87e commit 9ba2580

22 files changed, +578 -70 lines changed

.github/workflows/main.yml

+4 -1
@@ -45,8 +45,11 @@ jobs:
           pip install -r requirements.txt
           pip install pytest
 
+          # Pull in ZSTD repo
+          git submodule update --init
+
           # Installing the plugin to arbitrary directory to check the install script.
-          python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib
+          python setup.py install --h5plugin --h5plugin-dir ~/hdf5/lib --zstd
 
       - name: Run tests
         run: pytest .

.github/workflows/wheels.yml

+4 -3
@@ -26,14 +26,15 @@ jobs:
         run: python -m cibuildwheel --output-dir wheelhouse-hdf5-${{ matrix.hdf5 }}
         env:
           CIBW_ARCHS_LINUX: "x86_64"
-          CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }}
-          CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib"
+          CIBW_BEFORE_BUILD_LINUX: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh ${{ matrix.hdf5 }};
+            git submodule update --init
+          CIBW_ENVIRONMENT: "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib ENABLE_ZSTD=1"
           CIBW_TEST_REQUIRES: pytest
           # Install different version of HDF5 for unit tests to ensure the
           # wheels are independent of HDF5 installation
           CIBW_BEFORE_TEST: chmod +x .github/workflows/install_hdf5.sh; .github/workflows/install_hdf5.sh 1.8.11;
           # Run unit tests but disable test_h5plugin.py
-          CIBW_TEST_COMMAND: CI_BUILD_WHEEL=1 pytest {package}/tests
+          CIBW_TEST_COMMAND: pytest {package}/tests
 
       # Package wheels and host on CI
       - uses: actions/upload-artifact@v2

.gitmodules

+3
@@ -0,0 +1,3 @@
+[submodule "zstd"]
+	path = zstd
+	url = https://github.com/facebook/zstd

README.rst

+20 -9
@@ -21,12 +21,12 @@ is performed within blocks of data roughly 8kB long [1]_.
 
 This does not in itself compress data, only rearranges it for more efficient
 compression. To perform the actual compression you will need a compression
-library. Bitshuffle has been designed to be well matched Marc Lehmann's
-LZF_ as well as LZ4_. Note that because Bitshuffle modifies the data at the bit
+library. Bitshuffle has been designed to be well matched to Marc Lehmann's
+LZF_ as well as LZ4_ and ZSTD_. Note that because Bitshuffle modifies the data at the bit
 level, sophisticated entropy reducing compression libraries such as GZIP and
 BZIP are unlikely to achieve significantly better compression than simpler and
-faster duplicate-string-elimination algorithms such as LZF and LZ4. Bitshuffle
-thus includes routines (and HDF5 filter options) to apply LZ4 compression to
+faster duplicate-string-elimination algorithms such as LZF, LZ4 and ZSTD. Bitshuffle
+thus includes routines (and HDF5 filter options) to apply LZ4 and ZSTD compression to
 each block after shuffling [2]_.
 
 The Bitshuffle algorithm relies on neighbouring elements of a dataset being
@@ -50,7 +50,7 @@ used outside of python and in command line utilities such as ``h5dump``.
 .. [1] Chosen to fit comfortably within L1 cache as well as be well matched
        window of the LZF compression library.
 
-.. [2] Over applying bitshuffle to the full dataset then applying LZ4
+.. [2] Over applying bitshuffle to the full dataset then applying LZ4/ZSTD
        compression, this has the tremendous advantage that the block is
        already in the L1 cache.
 
@@ -62,6 +62,8 @@ used outside of python and in command line utilities such as ``h5dump``.
 
 .. _LZ4: https://code.google.com/p/lz4/
 
+.. _ZSTD: https://github.com/facebook/zstd
+
 
 Applications
 ------------
@@ -97,11 +99,14 @@ Installation for Python
 
 Installation requires python 2.7+ or 3.3+, HDF5 1.8.4 or later, HDF5 for python
 (h5py), Numpy and Cython. Bitshuffle is linked against HDF5. To use the dynamically
-loaded HDF5 filter requires HDF5 1.8.11 or later.
+loaded HDF5 filter requires HDF5 1.8.11 or later. If ZSTD support is enabled the ZSTD
+repo needs to be pulled into bitshuffle before installation with::
+
+    git submodule update --init
 
-To install::
+To install bitshuffle::
 
-    python setup.py install [--h5plugin [--h5plugin-dir=spam]]
+    python setup.py install [--h5plugin [--h5plugin-dir=spam] --zstd]
 
 To get finer control of installation options, including whether to compile
 with OpenMP multi-threading, copy the ``setup.cfg.example`` to ``setup.cfg``
@@ -112,6 +117,8 @@ Bitshuffle and LZF filters outside of python), set the environment variable
 ``HDF5_PLUGIN_PATH`` to the value of ``--h5plugin-dir`` or use HDF5's default
 search location of ``/usr/local/hdf5/lib/plugin``.
 
+ZSTD support is enabled with ``--zstd``.
+
 If you get an error about missing source files when building the extensions,
 try upgrading setuptools. There is a weird bug where setuptools prior to 0.7
 doesn't work properly with Cython in some cases.
@@ -133,9 +140,13 @@ the filter will be available only within python and only after importing
 The filter can be added to new datasets either through the `h5py` low level
 interface or through the convenience functions provided in
 `bitshuffle.h5`. See the docstrings and unit tests for examples. For `h5py`
-version 2.5.0 and later Bitshuffle can added to new datasets through the
+version 2.5.0 and later Bitshuffle can be added to new datasets through the
 high level interface, as in the example below.
 
+The compression algorithm can be configured using the `filter_opts` in
+`bitshuffle.h5.create_dataset()`. LZ4 is chosen with:
+`(BLOCK_SIZE, h5.H5_COMPRESS_LZ4)` and ZSTD with:
+`(BLOCK_SIZE, h5.H5_COMPRESS_ZSTD, COMP_LVL)`. See `test_h5filter.py` for an example.
 
 Example h5py
 ------------
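
The README text added in this diff describes two shapes for the filter options: a two-element tuple for LZ4 and a three-element tuple (with a compression level) for ZSTD. The following is a minimal sketch of building those tuples. The helper `build_filter_opts` and the `PLACEHOLDER_FLAGS` values are hypothetical and exist only so the snippet runs without bitshuffle installed; real code would pass `bitshuffle.h5.H5_COMPRESS_LZ4` and `bitshuffle.h5.H5_COMPRESS_ZSTD` instead.

```python
def build_filter_opts(algorithm, flags, block_size, comp_lvl=1):
    """Return the options tuple for one Bitshuffle filter entry.

    algorithm  : "lz4" or "zstd"
    flags      : mapping from algorithm name to its flag constant, e.g.
                 {"lz4": h5.H5_COMPRESS_LZ4, "zstd": h5.H5_COMPRESS_ZSTD}
    block_size : block size in number of elements
    comp_lvl   : ZSTD compression level (ignored for LZ4)
    """
    if algorithm == "lz4":
        # LZ4 takes no level: (BLOCK_SIZE, H5_COMPRESS_LZ4)
        return (block_size, flags["lz4"])
    if algorithm == "zstd":
        # ZSTD adds the level: (BLOCK_SIZE, H5_COMPRESS_ZSTD, COMP_LVL)
        return (block_size, flags["zstd"], comp_lvl)
    raise ValueError("unknown algorithm: %r" % (algorithm,))


# Placeholder values for illustration only (assumption, not taken from the diff).
PLACEHOLDER_FLAGS = {"lz4": 2, "zstd": 3}

lz4_opts = build_filter_opts("lz4", PLACEHOLDER_FLAGS, block_size=1024)
zstd_opts = build_filter_opts("zstd", PLACEHOLDER_FLAGS, block_size=1024, comp_lvl=5)
print(lz4_opts)
print(zstd_opts)
```

Keeping the flag constants as an argument rather than hard-coding them means the helper never has to guess the numeric values defined in `bshuf_h5filter.h`.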

bitshuffle/__init__.py

+15 -1
@@ -1,3 +1,4 @@
+# flake8: noqa
 """
 Filter for improving compression of typed binary data.
 
@@ -11,6 +12,8 @@
     bitunshuffle
     compress_lz4
     decompress_lz4
+    compress_zstd
+    decompress_zstd
 
 """
 
@@ -19,6 +22,7 @@
 
 from bitshuffle.ext import (
     __version__,
+    __zstd__,
     bitshuffle,
     bitunshuffle,
     using_NEON,
@@ -28,6 +32,16 @@
     decompress_lz4,
 )
 
+# Import ZSTD API if enabled
+zstd_api = []
+if __zstd__:
+    from bitshuffle.ext import (
+        compress_zstd,
+        decompress_zstd,
+    )
+
+    zstd_api += ["compress_zstd", "decompress_zstd"]
+
 __all__ = [
     "__version__",
     "bitshuffle",
@@ -37,4 +51,4 @@
     "using_AVX2",
     "compress_lz4",
     "decompress_lz4",
-]
+] + zstd_api
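
Because `compress_zstd` and `decompress_zstd` are only exported when the extension was built with ZSTD, calling code may want to check the `__zstd__` flag before using them. The sketch below shows one way to do that. The helpers `has_zstd` and `pick_codec` are hypothetical, and the `SimpleNamespace` objects are stand-ins for the `bitshuffle` module so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def has_zstd(module):
    """Return True if the module reports ZSTD support via ``__zstd__``."""
    # getattr with a default also covers releases that predate the flag.
    return bool(getattr(module, "__zstd__", False))


def pick_codec(module):
    """Return (name, compress, decompress), preferring ZSTD when available."""
    if has_zstd(module):
        return "zstd", module.compress_zstd, module.decompress_zstd
    return "lz4", module.compress_lz4, module.decompress_lz4


# Stand-ins for the real module; the string values mark which attribute was picked.
build_with_zstd = SimpleNamespace(
    __zstd__=True,
    compress_lz4="c_lz4", decompress_lz4="d_lz4",
    compress_zstd="c_zstd", decompress_zstd="d_zstd",
)
build_without_zstd = SimpleNamespace(
    __zstd__=False,
    compress_lz4="c_lz4", decompress_lz4="d_lz4",
)

print(pick_codec(build_with_zstd)[0])
print(pick_codec(build_without_zstd)[0])
```

With the real package, `pick_codec(bitshuffle)` would be the equivalent call after `import bitshuffle`.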

bitshuffle/ext.pyx

+119 -3
@@ -33,14 +33,23 @@ cdef extern from b"bitshuffle.h":
                            int block_size) nogil
     int bshuf_decompress_lz4(void *A, void *B, int size, int elem_size,
                              int block_size) nogil
+    IF ZSTD_SUPPORT:
+        int bshuf_compress_zstd_bound(int size, int elem_size, int block_size)
+        int bshuf_compress_zstd(void *A, void *B, int size, int elem_size,
+                                int block_size, const int comp_lvl) nogil
+        int bshuf_decompress_zstd(void *A, void *B, int size, int elem_size,
+                                  int block_size) nogil
     int BSHUF_VERSION_MAJOR
     int BSHUF_VERSION_MINOR
     int BSHUF_VERSION_POINT
 
+__version__ = "%d.%d.%d" % (BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
+                            BSHUF_VERSION_POINT)
 
-__version__ = str("%d.%d.%d").format(BSHUF_VERSION_MAJOR, BSHUF_VERSION_MINOR,
-                                     BSHUF_VERSION_POINT)
-
+IF ZSTD_SUPPORT:
+    __zstd__ = True
+ELSE:
+    __zstd__ = False
 
 # Prototypes from bitshuffle.c
 cdef extern int bshuf_copy(void *A, void *B, int size, int elem_size)
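
The `__version__` change in this hunk swaps `str.format` for the `%` operator. In plain Python the two are not interchangeable for a `%d` template: `str.format` only substitutes `{}` fields, so it leaves the `%d` placeholders untouched and ignores the extra positional arguments. A short illustration, using the 0.4.0 version number from the commit message as sample input:

```python
# Sample version components; 0.4.0 is the version named in the commit message.
version_parts = (0, 4, 0)

# Removed form: str.format() looks for {} fields, finds none, and returns
# the template unchanged.
old_style = str("%d.%d.%d").format(*version_parts)

# Added form: the % operator fills in the printf-style placeholders.
new_style = "%d.%d.%d" % version_parts

print(repr(old_style))
print(repr(new_style))
```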
@@ -451,3 +460,110 @@ def decompress_lz4(np.ndarray arr not None, shape, dtype, int block_size=0):
     return out
 
 
+IF ZSTD_SUPPORT:
+    @cython.boundscheck(False)
+    @cython.wraparound(False)
+    def compress_zstd(np.ndarray arr not None, int block_size=0, int comp_lvl=1):
+        """Bitshuffle then compress an array using ZSTD.
+
+        Parameters
+        ----------
+        arr : numpy array
+            Data to be processed.
+        block_size : positive integer
+            Block size in number of elements. By default, block size is chosen
+            automatically.
+        comp_lvl : positive integer
+            Compression level applied by ZSTD
+
+        Returns
+        -------
+        out : array with np.uint8 data type
+            Buffer holding compressed data.
+
+        """
+
+        cdef int ii, size, itemsize, count=0
+        shape = (arr.shape[i] for i in range(arr.ndim))
+        if not arr.flags['C_CONTIGUOUS']:
+            msg = "Input array must be C-contiguous."
+            raise ValueError(msg)
+        size = arr.size
+        dtype = arr.dtype
+        itemsize = dtype.itemsize
+
+        max_out_size = bshuf_compress_zstd_bound(size, itemsize, block_size)
+
+        cdef np.ndarray out
+        out = np.empty(max_out_size, dtype=np.uint8)
+
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
+        arr_flat = arr.view(np.uint8).ravel()
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
+        out_flat = out.view(np.uint8).ravel()
+        cdef void* arr_ptr = <void*> &arr_flat[0]
+        cdef void* out_ptr = <void*> &out_flat[0]
+        with nogil:
+            for ii in range(REPEATC):
+                count = bshuf_compress_zstd(arr_ptr, out_ptr, size, itemsize, block_size, comp_lvl)
+        if count < 0:
+            msg = "Failed. Error code %d."
+            excp = RuntimeError(msg % count, count)
+            raise excp
+        return out[:count]
+
+    @cython.boundscheck(False)
+    @cython.wraparound(False)
+    def decompress_zstd(np.ndarray arr not None, shape, dtype, int block_size=0):
+        """Decompress a buffer using ZSTD then bitunshuffle it yielding an array.
+
+        Parameters
+        ----------
+        arr : numpy array
+            Input data to be decompressed.
+        shape : tuple of integers
+            Shape of the output (decompressed array). Must match the shape of the
+            original data array before compression.
+        dtype : numpy dtype
+            Datatype of the output array. Must match the data type of the original
+            data array before compression.
+        block_size : positive integer
+            Block size in number of elements. Must match value used for
+            compression.
+
+        Returns
+        -------
+        out : numpy array with shape *shape* and data type *dtype*
+            Decompressed data.
+
+        """
+
+        cdef int ii, size, itemsize, count=0
+        if not arr.flags['C_CONTIGUOUS']:
+            msg = "Input array must be C-contiguous."
+            raise ValueError(msg)
+        size = np.prod(shape)
+        itemsize = dtype.itemsize
+
+        cdef np.ndarray out
+        out = np.empty(tuple(shape), dtype=dtype)
+
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] arr_flat
+        arr_flat = arr.view(np.uint8).ravel()
+        cdef np.ndarray[dtype=np.uint8_t, ndim=1, mode="c"] out_flat
+        out_flat = out.view(np.uint8).ravel()
+        cdef void* arr_ptr = <void*> &arr_flat[0]
+        cdef void* out_ptr = <void*> &out_flat[0]
+        with nogil:
+            for ii in range(REPEATC):
+                count = bshuf_decompress_zstd(arr_ptr, out_ptr, size, itemsize,
+                                              block_size)
+        if count < 0:
+            msg = "Failed. Error code %d."
+            excp = RuntimeError(msg % count, count)
+            raise excp
+        if count != arr.size:
+            msg = "Decompressed different number of bytes than input buffer size."
+            msg += "Input buffer %d, decompressed %d." % (arr.size, count)
+            raise RuntimeError(msg, count)
+        return out
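
The two functions added above are meant to be used as a pair: `decompress_zstd` needs the original shape, dtype and block size, since the compressed buffer does not carry them. The sketch below shows that calling pattern. The helpers `roundtrip` and `demo_zstd_roundtrip` are hypothetical names, and `demo_zstd_roundtrip` assumes numpy plus a bitshuffle build with ZSTD enabled, so it is defined but not called here.

```python
def roundtrip(arr, compress, decompress, block_size=0, **compress_kwargs):
    """Compress ``arr`` and decompress it again with matching parameters.

    The same block_size is passed to both calls, and the original shape and
    dtype are handed to the decompressor, as the docstrings above require.
    """
    packed = compress(arr, block_size, **compress_kwargs)
    return decompress(packed, arr.shape, arr.dtype, block_size)


def demo_zstd_roundtrip(comp_lvl=3):
    """Round-trip a small array through bitshuffle's ZSTD functions.

    Assumes numpy is installed and bitshuffle was built with ZSTD support;
    otherwise the import or attribute lookup fails.
    """
    import numpy as np
    import bitshuffle

    data = np.arange(4096, dtype=np.uint32)
    restored = roundtrip(
        data,
        bitshuffle.compress_zstd,
        bitshuffle.decompress_zstd,
        comp_lvl=comp_lvl,
    )
    return bool(np.array_equal(data, restored))
```

On a ZSTD-enabled installation, `demo_zstd_roundtrip()` should return `True` if the data survives the round trip.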

bitshuffle/h5.pyx

+3
@@ -14,6 +14,7 @@ Constants
 
     H5FILTER : The Bitshuffle HDF5 filter integer identifier.
     H5_COMPRESS_LZ4 : Filter option flag for LZ4 compression.
+    H5_COMPRESS_ZSTD : Filter option flag for ZSTD compression.
 
 Functions
 =========
@@ -54,13 +55,15 @@ cdef extern from b"bshuf_h5filter.h":
     int bshuf_register_h5filter()
     int BSHUF_H5FILTER
     int BSHUF_H5_COMPRESS_LZ4
+    int BSHUF_H5_COMPRESS_ZSTD
 
 cdef extern int init_filter(const char* libname)
 
 cdef int LZF_FILTER = 32000
 
 H5FILTER = BSHUF_H5FILTER
 H5_COMPRESS_LZ4 = BSHUF_H5_COMPRESS_LZ4
+H5_COMPRESS_ZSTD = BSHUF_H5_COMPRESS_ZSTD
 
 # Init HDF5 dynamic loading with HDF5 library used by h5py
 if not sys.platform.startswith('win'):
