python - Efficiently processing a very large unicode string into csv

August 15, 2015

Usually I'm able to find answers to my dilemmas pretty quickly on this site, but perhaps this problem requires a more specific touch.

I have a ~50 million character long unicode string that I download from a Tektronix oscilloscope. Just getting it assigned is a pain in the a** for memory (sys.getsizeof() reports ~100 MB).

The problem is that I need to turn this into a CSV so that I can grab 10,000 of the 10 million comma-separated values at a time (this count is fixed).

1) I have tried the split(",") method. Using this, RAM usage of the Python kernel shows a 300 MB spike, but the process is efficient (except when I loop it ~100 times in one routine: somewhere between iterations 40 and 50, the kernel spits out a memory error).

2) I wrote my own script that, after downloading the absurdly long string, scans the number of commas until it sees 10,000, then stops, turning the values between the commas into floats and populating a np array. This is pretty efficient from a memory usage perspective (from before importing the file to after running the script, memory usage changes by 150 MB). However, it is slower, and it results in a kernel crash shortly after completion of the 100x loops.

Below is the code used to process this file. If you PM me, I can send you a copy of the string to experiment with (however, I'm sure it may be easier to generate one).
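For anyone who wants to experiment without the scope, a stand-in string can be generated along these lines. This is only a sketch in Python 3: the helper name make_fake_trace is hypothetical, and it assumes the instrument returns plain integer samples separated by commas, which the question does not confirm.

```python
import random


def make_fake_trace(n_values, seed=0):
    """Build a comma-separated string of n_values pseudo-random integers.

    Assumption: integer samples in the range -128..127, chosen only for
    illustration; the real instrument output may look different.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same string
    return ",".join(str(rng.randint(-128, 127)) for _ in range(n_values))


if __name__ == "__main__":
    trace = make_fake_trace(100000)
    print(len(trace), "characters,", trace.count(",") + 1, "values")
```

Raising n_values to 10 million gives a string of roughly the size described above.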

Code 1 (using the split() method):

import numpy as np

ppstrace = ppsinst.query('curv?')
ppstrace = ppstrace.split(',')
ppsvals = []
for iii in range(len(ppstrace)):  # does some algebra on the values
    ppstrace[iii] = ((float(ppstrace[iii]))-yoff)*ymult+yzero

maxes = np.empty(shape=(0,0))
iters = int(samples/1000)
for i in range(1000):  # looks for the max value in 10,000-sample increments, adds it to "maxes"
    print i
    maxes = np.append(maxes, max(ppstrace[i*iters:(i+1)*iters]))
pps = 100*np.std(maxes)/np.mean(maxes)
print pps, " % pps noise"

Code 2 (self-written script):

import numpy as np

ppstrace = ppsinst.query('curv?')
walkerr = 1
walkerl = 0
length = len(ppstrace)
maxes = np.empty(shape=(0,0))
iters = int(samples/1000)  # samples is 10 million, iters is 10000

for i in range(1000):
    sample = []  # initialize the 10k sample list
    commas = 0   # commas found so far
    while commas < iters:  # if the number of commas found is less than 10,000, keep adding values to sample
        while ppstrace[walkerr] != unicode(","):  # indexes commas for value extraction
            walkerr += 1
            if walkerr == length:
                break
        sample.append((float(str(ppstrace[walkerl:walkerr]))-yoff)*ymult+yzero)  # add the value between commas to the sample list
        walkerl = walkerr+1
        walkerr += 1
        commas += 1
    maxes = np.append(maxes, max(sample))
pps = 100*np.std(maxes)/np.mean(maxes)
print pps, "% pps noise"

I also tried a pandas DataFrame with StringIO for the CSV conversion. That thing gets a memory error just trying to read the string into a frame.

I'm thinking the solution would be to load this into a SQL table and then pull the CSV in 10,000-sample chunks (which is the intended purpose of the script). But I would love to not have to do this!

thanks guys!

Have you tried the class cStringIO? It works just like file IO, but uses a string as the buffer instead of a specified file. Frankly, I expect that you're suffering from a chronic speed problem. Your self-written script should be the right approach. You might get some speed-up if you read a block at a time, and then parse that block while the next one is being read.
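A rough sketch of the block-at-a-time idea, under some assumptions: it is written for Python 3 (the question's code is Python 2), so io.StringIO stands in for cStringIO; the names iter_value_chunks and pps_noise_percent and the block_chars parameter are made up for illustration; yoff, ymult and yzero are the scaling constants from the question. It uses the standard library only, where statistics.pstdev is the population standard deviation, the same quantity np.std computes by default. The blocks are read sequentially here; overlapping the reading and the parsing is not attempted.

```python
import io
import statistics


def iter_value_chunks(text, values_per_chunk, block_chars=1 << 16):
    """Yield lists of floats, values_per_chunk at a time, from a
    comma-separated string that is read in blocks of block_chars."""
    buf = io.StringIO(text)  # file-like view; may hold its own copy of text
    leftover = ""            # tail of the previous block, possibly cut mid-number
    pending = []             # complete values not yet handed out
    while True:
        block = buf.read(block_chars)
        if not block:
            break
        pieces = (leftover + block).split(",")
        leftover = pieces.pop()  # last piece may be incomplete, keep it for later
        pending.extend(float(p) for p in pieces)
        while len(pending) >= values_per_chunk:
            yield pending[:values_per_chunk]
            del pending[:values_per_chunk]
    if leftover.strip():
        pending.append(float(leftover))  # final value has no trailing comma
    while pending:
        yield pending[:values_per_chunk]
        del pending[:values_per_chunk]


def pps_noise_percent(text, values_per_chunk, yoff=0.0, ymult=1.0, yzero=0.0):
    """Max of each scaled chunk, then 100 * std / mean of those maxima."""
    maxes = [max((v - yoff) * ymult + yzero for v in chunk)
             for chunk in iter_value_chunks(text, values_per_chunk)]
    return 100 * statistics.pstdev(maxes) / statistics.fmean(maxes)


if __name__ == "__main__":
    demo = ",".join(str(i % 50) for i in range(1000))
    print(pps_noise_percent(demo, 100), "% pps noise")
```

Only one block of split pieces and at most one chunk of floats are alive at a time, so the peak memory on top of the original string should stay small compared with splitting the whole string at once.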

For parallel processing, use the multiprocessing package. See the official documentation or this tutorial for details and examples.

Briefly, you create a function that embodies the process you want to run in parallel. You then create a Process with that function as the target parameter, and start the process. When you want to merge it back into the main program, use join.
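The create/start/join steps might look like the sketch below (Python 3). The helper names split_on_commas, scaled_max, block_max and parallel_block_maxes are hypothetical, and the blocks are cut by character count at comma boundaries rather than at exactly 10,000 values, which is a simplification for illustration. A multiprocessing.Queue carries the results back, since a Process target cannot return a value directly.

```python
import multiprocessing as mp


def split_on_commas(text, n_parts):
    """Cut text into roughly equal blocks, each ending at a comma boundary."""
    parts, start = [], 0
    step = max(1, len(text) // n_parts)
    while start < len(text):
        end = text.find(",", start + step)
        if end == -1:
            end = len(text)
        parts.append(text[start:end])
        start = end + 1  # skip the comma that closed this block
    return parts


def scaled_max(block, yoff, ymult, yzero):
    """Scale every value in one block and return the largest."""
    return max((float(v) - yoff) * ymult + yzero
               for v in block.split(",") if v.strip())


def block_max(index, block, yoff, ymult, yzero, results):
    """Process target: put (block index, block maximum) on the queue."""
    results.put((index, scaled_max(block, yoff, ymult, yzero)))


def parallel_block_maxes(text, n_workers=4, yoff=0.0, ymult=1.0, yzero=0.0):
    results = mp.Queue()
    blocks = split_on_commas(text, n_workers)
    procs = [mp.Process(target=block_max,
                        args=(i, b, yoff, ymult, yzero, results))
             for i, b in enumerate(blocks)]
    for p in procs:
        p.start()
    # Drain the queue before joining; the timeout avoids waiting forever
    # if a worker dies without reporting a result.
    collected = dict(results.get(timeout=60) for _ in procs)
    for p in procs:
        p.join()
    return [collected[i] for i in range(len(blocks))]


if __name__ == "__main__":
    demo = ",".join(str(i) for i in range(100))
    print(parallel_block_maxes(demo, n_workers=4))
```

The `if __name__ == "__main__":` guard matters on platforms where child processes re-import the main module. Each block is also pickled and sent to its worker, so for a 50-million-character string the copying cost may eat into whatever the parallel parsing gains.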
