python - Performance issue with Pandas merge function
I am merging multiple CSV files using pandas' merge function, and the process takes more than 3 hours. Note that I have more than 1,000 files to merge, and the columns can differ from file to file, though a "security" column is common to every file and its values are unique. Any tips to increase performance? Am I doing something wrong or inefficient? Thank you very much!
import csv
import math
import pandas

def consolidate_data_files(file_names, thread_count):
    output_file_names = []
    joined_frame = None
    file_count = 0
    # Merge every input file into one frame, one outer merge at a time.
    for file in file_names:
        data_frame = pandas.read_csv(str(file), quoting=csv.QUOTE_NONE, dtype=str)
        if file_count == 0:
            joined_frame = data_frame
        else:
            joined_frame = data_frame.merge(joined_frame, how='outer')
        file_count += 1
    # Split the merged result into one output file per thread.
    total_row_count = len(joined_frame.index)
    row_per_file = math.ceil(total_row_count / thread_count)
    merged_file_count = int(math.ceil(total_row_count / row_per_file))
    for i in range(merged_file_count):
        file = "merged_file_" + str(i) + ".csv"
        output_file_names.append(file)
        row_start = int(i * row_per_file)
        row_end = int(row_start + row_per_file)
        joined_frame[row_start:row_end].to_csv(path_or_buf=file, index=False, quoting=csv.QUOTE_NONE)
    del joined_frame
    return output_file_names
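For context, one approach often suggested for this pattern: since each pairwise merge rebuilds the whole accumulated result, the loop gets slower with every file. If the only column the files share is "security" (an assumption based on the description above), the frames can instead be keyed on that column and combined in a single pass with pandas.concat. The function name consolidate_with_concat below is hypothetical; this is a minimal sketch, not the asker's code.

import csv
import pandas

def consolidate_with_concat(file_names):
    # Read each file once and key it on the shared "security" column.
    frames = [
        pandas.read_csv(str(f), quoting=csv.QUOTE_NONE, dtype=str).set_index("security")
        for f in file_names
    ]
    # One outer concat along the columns axis aligns all frames on the
    # "security" index at once, instead of ~1,000 incremental merges.
    joined_frame = pandas.concat(frames, axis=1, join="outer")
    return joined_frame.reset_index()

Note that concat along axis=1 does not deduplicate column names, so this only matches the merge-based behaviour when "security" is the sole overlapping column between files.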