Python - Efficiently processing a very large unicode string into CSV
Usually I'm able to find answers to my dilemmas pretty quickly on this site, but perhaps this problem requires a more specific touch.

I have a ~50 million character long unicode string that I download from a Tektronix oscilloscope. Just getting this assigned is a pain in the a** for memory (sys.getsizeof() reports ~100 MB).

The problem lies in that I need to turn this into a CSV so that I can grab 10,000 of the 10 million comma-separated values at a time (this is fixed).

1) I have tried the split(",") method. Using this, RAM usage on the Python kernel spikes by 300 MB, but the process is efficient (except when I loop this ~100 times in one routine: somewhere between iterations 40-50, the kernel spits back a memory error).

2) I wrote my own script that, after downloading this absurdly long string, scans the number of commas until it sees 10,000, then stops, turning the values between the commas into floats and populating an np array. This is pretty efficient from a memory usage perspective (from before importing the file to after running the script, memory usage only changes by 150 MB). However, it is much slower, and it results in a kernel crash shortly after completion of the 100x loops.

Below is the code used to process this file, and if you PM me, I can send you a copy of the string for experimenting (however I'm sure it may be easier to generate one).
Code 1 (using the split() method):

    ppstrace = ppsinst.query('curv?')
    ppstrace = ppstrace.split(',')
    ppsvals = []
    for iii in range(len(ppstrace)):  # does some algebra to the values
        ppstrace[iii] = ((float(ppstrace[iii]))-yoff)*ymult+yzero

    maxes = np.empty(shape=(0,0))
    iters = int(samples/1000)
    for i in range(1000):  # looks for the max value in 10,000 sample increments, adds it to "maxes"
        maxes = np.append(maxes, max(ppstrace[i*iters:(i+1)*iters]))
    pps = 100*np.std(maxes)/np.mean(maxes)
    print pps, " % pps noise"

Code 2 (self-written script):
    ppstrace = ppsinst.query('curv?')
    walkerr = 1
    walkerl = 0
    length = len(ppstrace)
    maxes = np.empty(shape=(0,0))
    iters = int(samples/1000)  # samples is 10 million, iters is 10000

    for i in range(1000):
        sample = []  # initialize the 10k sample list
        commas = 0   # commas found so far
        while commas < iters:  # if the number of commas found is less than 10,000, keep adding values to sample
            while ppstrace[walkerr] != unicode(","):  # indexes the commas for value extraction
                walkerr += 1
                if walkerr == length:
                    break
            sample.append((float(str(ppstrace[walkerl:walkerr]))-yoff)*ymult+yzero)  # add the value between commas to the sample list
            walkerl = walkerr+1
            walkerr += 1
            commas += 1
        maxes = np.append(maxes, max(sample))
    pps = 100*np.std(maxes)/np.mean(maxes)
    print pps, "% pps noise"

I also tried a pandas DataFrame with StringIO for the CSV conversion. That thing gets a memory error just trying to read it into a frame.
I am thinking the solution would be to load this into a SQL table and then pull the CSV in 10,000 sample chunks (which is the intended purpose of the script). But I would love to not have to do this!

Thanks, guys!
Have you tried the class cStringIO? It's just like file IO, but it uses a string as the buffer instead of a specified file. Frankly, I expect that you're suffering from a chronic speed problem. Your self-generated script should be the right approach. You might get some speed-up if you read a block at a time, and then parse that while the next block is reading.

For parallel processing, use the multiprocessing package. See the official documentation or this tutorial for details and examples.

Briefly, you create a function that embodies the process you want to run in parallel. You then create a Process with that function as the target parameter. Start the process. When you want to merge its thread back with the main program, use join.
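As a rough illustration of the block-at-a-time idea, here is a minimal sketch in Python 3, where io.StringIO plays the role that cStringIO.StringIO plays in Python 2. The names iter_values, chunk_maxes and block_size are made up for this example, and the yoff/ymult/yzero scaling from the question is left out for brevity. It only shows the buffered parsing part; it does not overlap reading with parsing.

```python
import io


def iter_values(text, block_size=1 << 16):
    """Yield floats from a comma-separated string, one block at a time."""
    buf = io.StringIO(text)  # file-like view of the string
    leftover = ""            # partial number cut off at a block boundary
    while True:
        block = buf.read(block_size)
        if not block:
            break
        parts = (leftover + block).split(",")
        leftover = parts.pop()  # the last piece may be incomplete
        for token in parts:
            yield float(token)
    if leftover.strip():
        yield float(leftover)   # final value has no trailing comma


def chunk_maxes(text, chunk_len, block_size=1 << 16):
    """Return the maximum of each consecutive group of chunk_len values."""
    maxes = []
    current_max = None
    count = 0
    for value in iter_values(text, block_size):
        current_max = value if current_max is None else max(current_max, value)
        count += 1
        if count == chunk_len:
            maxes.append(current_max)
            current_max, count = None, 0
    if count:
        maxes.append(current_max)  # trailing partial chunk, if any
    return maxes
```

With the question's numbers this would be called as chunk_maxes(ppstrace, 10000). Only one block's worth of tokens exists as Python objects at any time, instead of the 10 million strings that a single split(",") creates, although the StringIO buffer itself still holds a copy of the text.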
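The steps above can be sketched roughly as follows (Python 3). The function names block_max, _worker and parallel_block_maxes are hypothetical, and the defaults for yoff, ymult and yzero are placeholders for the scope settings used in the question. Starting one process per block is only sensible for a handful of blocks; for 1,000 chunks a multiprocessing.Pool would likely be a better fit.

```python
import multiprocessing as mp


def block_max(text_block, yoff=0.0, ymult=1.0, yzero=0.0):
    """Parse one comma-separated block and return its scaled maximum."""
    return max((float(tok) - yoff) * ymult + yzero
               for tok in text_block.split(","))


def _worker(index, text_block, results):
    # Runs in the child process: send (index, max) back to the parent.
    results.put((index, block_max(text_block)))


def parallel_block_maxes(text_blocks):
    """Compute block_max for each block in its own process."""
    results = mp.Queue()
    procs = [mp.Process(target=_worker, args=(i, block, results))
             for i, block in enumerate(text_blocks)]
    for proc in procs:
        proc.start()
    # Collect results before joining so no child blocks on a full queue.
    collected = dict(results.get() for _ in procs)
    for proc in procs:
        proc.join()
    # Restore the original block order.
    return [collected[i] for i in range(len(text_blocks))]
```

When run as a script, the call to parallel_block_maxes should sit under an `if __name__ == "__main__":` guard, since some platforms re-import the main module in each child process.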