python - Efficiently save to disk a (heterogeneous) graph of lists, tuples, and NumPy arrays


I am regularly dealing with large amounts of data (on the order of several GB), stored in memory in NumPy arrays. Often, I am dealing with nested lists/tuples of such NumPy arrays. How should I store these on disk? I want to preserve the list/tuple structure of my data, the data has to be compressed to conserve disk space, and saving/loading needs to be fast.

(The particular use case I'm facing right now is a 4000-element long list of 2-tuples x, where x[0].shape = (201,) and x[1].shape = (201,1000).)

I have tried several options, but all of them have downsides:

  • pickle storage into a gzip archive. This works well and results in acceptable disk space usage, but it is extremely slow and consumes a lot of memory while saving.

  • numpy.savez_compressed. Faster than pickle, but unfortunately it only allows either a sequence of NumPy arrays (not the nested tuples/lists that I have) or a dictionary-style way of specifying the arguments.

  • Storing into HDF5 through h5py. This seems too cumbersome for my relatively simple needs. More importantly, I looked into this a lot, and there does not seem to be a straightforward way to store heterogeneous (nested) lists.

  • hickle seems to do what I want, but unfortunately it's incompatible with Python 3 at the moment (which is what I'm using).

I was thinking of writing a wrapper around numpy.savez_compressed that would determine the nested structure of the data, store this structure in some variable nest_structure, flatten the full graph, and store both nest_structure and all the flattened data using numpy.savez_compressed. Then, a corresponding wrapper around numpy.load would understand the nest_structure variable, re-create the graph, and return it. However, I was hoping there is something like this already out there.
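For reference, the wrapper idea described above could look roughly like the sketch below. The names save_nested, load_nested, and the arr_<i> key scheme are hypothetical, and storing nest_structure as a JSON string is just one possible encoding. It only handles lists, tuples, and NumPy arrays, and it has not been benchmarked on multi-GB data (thousands of small arrays in one .npz archive may carry noticeable per-entry overhead).

```python
import io
import json

import numpy as np


def _flatten(obj, arrays):
    """Describe obj as JSON-serializable nodes; collect its arrays in `arrays`."""
    if isinstance(obj, np.ndarray):
        arrays.append(obj)
        return {"kind": "array", "index": len(arrays) - 1}
    if isinstance(obj, (list, tuple)):
        kind = "list" if isinstance(obj, list) else "tuple"
        return {"kind": kind, "items": [_flatten(x, arrays) for x in obj]}
    raise TypeError("unsupported type: %s" % type(obj).__name__)


def save_nested(file, obj):
    """Hypothetical wrapper: flatten obj and store it with savez_compressed."""
    arrays = []
    structure = _flatten(obj, arrays)
    named = {"arr_%d" % i: a for i, a in enumerate(arrays)}
    # The structure is stored as a 0-d unicode array, so no pickling is needed.
    np.savez_compressed(file, nest_structure=np.array(json.dumps(structure)), **named)


def _rebuild(node, npz):
    """Recreate the nested lists/tuples from the stored description."""
    if node["kind"] == "array":
        return npz["arr_%d" % node["index"]]
    items = [_rebuild(child, npz) for child in node["items"]]
    return items if node["kind"] == "list" else tuple(items)


def load_nested(file):
    """Hypothetical wrapper around numpy.load that understands nest_structure."""
    with np.load(file) as npz:
        structure = json.loads(npz["nest_structure"].item())
        return _rebuild(structure, npz)


# Round trip through an in-memory buffer; a file name works the same way.
data = [(np.arange(3), np.ones((3, 2))), [np.zeros(2)]]
buf = io.BytesIO()
save_nested(buf, data)
buf.seek(0)
restored = load_nested(buf)
```

Keeping the structure description separate from the array payload means the arrays themselves are written by NumPy's own compressed format, and only the small description goes through JSON.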

You may want to try the shelve package. It wraps heterogeneous pickled objects in a convenient file. shelve is oriented more toward "persistent storage" than the classic save-to-file model.

The main benefit of using shelve is that you can conveniently save most kinds of structured data. The main disadvantage of using shelve is that it is Python-specific. Unlike HDF5, saved MATLAB files, or simple CSV files, it isn't easy to use other tools with your data.

Example of saving (out of habit, I created objects and then copied them into df, but you don't need to do this. You could save directly to items in df):

    import shelve
    import numpy as np

    a = np.arange(0, 1000, 12)
    b = "this string"

    class C(object):
        alpha = 1.0
        beta = [3, 4]

    c = C()

    # 'c' opens the shelf for reading and writing, creating it if needed
    df = shelve.open('test.shelve', 'c')
    df['a'] = a
    df['b'] = b
    df['c'] = c
    df.sync()   # flush pending writes to disk
    df.close()

Following the above example, recovering the data:

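    import shelve
    import numpy as np

    # The class must be defined again under the same name so that the
    # pickled instance stored under 'c' can be restored
    class C(object):
        alpha = 1.0
        beta = [3, 4]

    df = shelve.open('test.shelve')
    print(df['a'])
    print(df['b'])
    print(df['c'].alpha)
    df.close()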
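Applied to the kind of data in the question, one possible layout is to store each list element under its own key, so that items are pickled one at a time rather than all at once. This is only a sketch: the key scheme (the stringified index plus an "n_items" entry) and the small stand-in data are assumptions. Note also that shelve does not compress what it stores, so this does not address the disk-space requirement by itself.

```python
import os
import shelve
import tempfile

import numpy as np

# Small stand-in for the question's data: a list of 2-tuples of arrays.
data = [(np.arange(5), np.ones((5, 3)) * i) for i in range(4)]

# Write into a fresh temporary directory so the example leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "nested.shelve")

# shelve keys must be strings, so each list index is stored as str(i).
with shelve.open(path, "c") as db:
    db["n_items"] = len(data)
    for i, item in enumerate(data):
        db[str(i)] = item  # each tuple is pickled separately

# Rebuild the list in its original order; the tuples come back as tuples.
with shelve.open(path) as db:
    restored = [db[str(i)] for i in range(db["n_items"])]
```

Because every element has its own key, a single item can also be read back later without loading the rest of the list.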
