python - Efficiently save to disk (heterogeneous) graph of lists, tuples, and NumPy arrays
I am regularly dealing with large amounts of data (on the order of several GB), stored in memory in NumPy arrays. Often, I am dealing with nested lists/tuples of such NumPy arrays. How should I store these to disk? I want to preserve the list/tuple structure of my data, the data has to be compressed to conserve disk space, and saving/loading needs to be fast.
(The particular use case I'm facing right now is a 4000-element long list of 2-tuples x, where x[0].shape = (201,) and x[1].shape = (201,1000).)
I have tried several options, but all of them have downsides:
- pickle storage into a gzip archive. This works well and results in acceptable disk space usage, but it is extremely slow and consumes a lot of memory while saving.
- numpy.savez_compressed. Much faster than pickle, but unfortunately it only allows either a sequence of NumPy arrays (not nested tuples/lists as I have) or a dictionary-style way of specifying the arguments.
- Storing into HDF5 through h5py. This seems too cumbersome for my relatively simple needs. More importantly, I looked a lot into this, and there does not seem to be a straightforward way to store heterogeneous (nested) lists.
- hickle seems to do exactly what I want, but unfortunately it's incompatible with Python 3 at the moment (which is what I'm using).
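For reference, the first option (pickle into a gzip stream) can be sketched as below. This is only a minimal illustration, not necessarily how it was done in the question: the helper names save_pickle_gz and load_pickle_gz, the file name, and the small stand-in data are all assumptions. Passing the highest pickle protocol and a moderate compresslevel may help with speed, but nothing here has been benchmarked.

```python
import gzip
import os
import pickle
import tempfile

import numpy as np


def save_pickle_gz(obj, path, compresslevel=6):
    """Pickle obj directly into a gzip stream (hypothetical helper)."""
    with gzip.open(path, "wb", compresslevel=compresslevel) as f:
        # A binary protocol avoids the text encoding used by protocol 0.
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_pickle_gz(path):
    """Inverse of save_pickle_gz (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Small stand-in for the 4000-element list of 2-tuples from the question.
data = [(np.arange(201.0), np.zeros((201, 10))) for _ in range(5)]

path = os.path.join(tempfile.mkdtemp(), "data.pkl.gz")
save_pickle_gz(data, path)
restored = load_pickle_gz(path)
```

The list/tuple nesting survives the round trip because pickle records it alongside the arrays.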
I was thinking of writing a wrapper around numpy.savez_compressed, which would determine the nested structure of the data, store this structure in some variable nest_structure, flatten the full graph, and store both nest_structure and all the flattened data using numpy.savez_compressed. Then, the corresponding wrapper around numpy.load would understand the nest_structure variable, re-create the graph, and return it. However, I was hoping there is something like this already out there.
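The wrapper idea described above could look roughly like the following sketch. It assumes the data contains only lists, tuples, and array-like leaves. The function names (save_nested, load_nested, _flatten, _rebuild), the "leaf_N" key naming, and the choice to encode nest_structure as a JSON string are all assumptions made for illustration.

```python
import json
import os
import tempfile

import numpy as np


def _flatten(obj, leaves):
    """Describe obj as nested JSON-friendly lists; collect arrays in leaves."""
    if isinstance(obj, (list, tuple)):
        kind = "list" if isinstance(obj, list) else "tuple"
        return [kind, [_flatten(item, leaves) for item in obj]]
    leaves.append(np.asarray(obj))  # non-array leaves come back as arrays
    return ["array", len(leaves) - 1]


def _rebuild(node, archive):
    """Re-create the nested object from its description and the archive."""
    kind, payload = node
    if kind == "array":
        return archive["leaf_%d" % payload]
    items = [_rebuild(child, archive) for child in payload]
    return items if kind == "list" else tuple(items)


def save_nested(path, obj):
    leaves = []
    nest_structure = json.dumps(_flatten(obj, leaves))
    arrays = {"leaf_%d" % i: leaf for i, leaf in enumerate(leaves)}
    # The structure is stored as a 0-d string array next to the data.
    np.savez_compressed(path, nest_structure=np.array(nest_structure), **arrays)


def load_nested(path):
    with np.load(path) as archive:
        nest_structure = json.loads(archive["nest_structure"].item())
        return _rebuild(nest_structure, archive)


# Small stand-in for the list of 2-tuples from the question.
nested = [
    (np.arange(201.0), np.ones((201, 4))),
    (np.zeros(201), np.full((201, 4), 2.0)),
]
npz_path = os.path.join(tempfile.mkdtemp(), "nested.npz")
save_nested(npz_path, nested)
restored = load_nested(npz_path)
```

Storing the structure as a string keeps the archive loadable without enabling pickle in numpy.load. Every leaf becomes its own entry in the .npz file, so the 4000-element use case would produce 8000 entries.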
You may like the shelve package. It wraps heterogeneous pickled objects in a convenient file. shelve is oriented more toward "persistent storage" than the classic save-to-file model.
The main benefit of using shelve is that you can conveniently save most kinds of structured data. The main disadvantage of using shelve is that it is Python-specific. Unlike HDF5, saved MATLAB files, or even simple CSV files, it isn't so easy to use other tools with your data.
Example of saving (out of habit, I created the objects and then copied them into df, but you don't need to do this; you could save directly to items in df):
    import shelve
    import numpy as np

    a = np.arange(0, 1000, 12)
    b = "this is a string"

    class C(object):
        alpha = 1.0
        beta = [3, 4]

    c = C()

    # 'c' opens the shelf for reading and writing, creating it if needed
    df = shelve.open('test.shelve', 'c')
    df['a'] = a
    df['b'] = b
    df['c'] = c
    df.sync()   # flush pending writes to disk
    df.close()
Following the above example, recovering the data:
    import shelve
    import numpy as np

    # The class must be defined again so that df['c'] can be unpickled
    class C():
        alpha = 1.0
        beta = [3, 4]

    df = shelve.open('test.shelve')
    print(df['a'])
    print(df['b'])
    print(df['c'].alpha)
    df.close()
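Applied to the data layout from the question, the same approach might look like the sketch below. The file name, the key "data", and the small stand-in arrays are assumptions made for illustration. Note that shelve pickles its values and does not compress them on its own, so this addresses the structure-preservation requirement but not the disk-space one.

```python
import os
import shelve
import tempfile

import numpy as np

# Small stand-in for the 4000-element list of 2-tuples from the question.
data = [(np.linspace(0.0, 1.0, 201), np.zeros((201, 10))) for _ in range(4)]

shelf_path = os.path.join(tempfile.mkdtemp(), "nested.shelve")

# Store the whole nested object under a single (hypothetical) key.
with shelve.open(shelf_path, "c") as db:
    db["data"] = data

# Re-open read-only and pull the object back out.
with shelve.open(shelf_path, "r") as db:
    restored = db["data"]
```

The list of tuples comes back with the same nesting, since each value is stored as a pickle.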