python - Performance issue with Pandas merge function
I am merging multiple CSV files using pandas' merge function, and the process takes more than 3 hours. Note that I have more than 1,000 files to merge, and the columns can differ from file to file, though a "security" column is common to every file and its values are unique. Any tips to increase performance? Am I doing something wrong or inefficient? Thank you very much!
import csv
import math
import pandas

def consolidate_data_files(file_names, thread_count):
    output_file_names = []
    joined_frame = None
    file_count = 0
    # Merge every input file into one frame, one outer merge at a time.
    for file in file_names:
        data_frame = pandas.read_csv(str(file), quoting=csv.QUOTE_NONE, dtype=str)
        if file_count == 0:
            joined_frame = data_frame
        else:
            joined_frame = data_frame.merge(joined_frame, how='outer')
        file_count += 1
    # Split the merged result into one output file per thread.
    total_row_count = len(joined_frame.index)
    row_per_file = math.ceil(total_row_count / thread_count)
    merged_file_count = int(math.ceil(total_row_count / row_per_file))
    for i in range(merged_file_count):
        file = "merged_file_" + str(i) + ".csv"
        output_file_names.append(file)
        row_start = int(i * row_per_file)
        row_end = int(row_start + row_per_file)
        joined_frame[row_start:row_end].to_csv(path_or_buf=file, index=False, quoting=csv.QUOTE_NONE)
    del joined_frame
    return output_file_names
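For context, one approach often suggested for this pattern: since each pairwise merge rebuilds the whole accumulated result, the loop gets slower with every file. If the only column the files share is "security" (an assumption based on the description above), the frames can instead be keyed on that column and combined in a single pass with pandas.concat. The function name consolidate_with_concat below is hypothetical; this is a minimal sketch, not the asker's code.

import csv
import pandas

def consolidate_with_concat(file_names):
    # Read each file once and key it on the shared "security" column.
    frames = [
        pandas.read_csv(str(f), quoting=csv.QUOTE_NONE, dtype=str).set_index("security")
        for f in file_names
    ]
    # One outer concat along the columns axis aligns all frames on the
    # "security" index at once, instead of ~1,000 incremental merges.
    joined_frame = pandas.concat(frames, axis=1, join="outer")
    return joined_frame.reset_index()

Note that concat along axis=1 does not deduplicate column names, so this only matches the merge-based behaviour when "security" is the sole overlapping column between files.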