python - Conditionally create an "Other" category in categorical column
I have a DataFrame df with one column, category, created with the code below:
    import pandas as pd
    import random as rand
    from string import ascii_uppercase

    rand.seed(1010)

    df = pd.DataFrame()
    values = list()
    for i in range(0, 1000):
        # one random uppercase letter per row
        category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
        values.append(category)

    df['category'] = values
The frequency counts for each value are:
    df['category'].value_counts()
    Out[95]:
    P    54
    B    50
    T    48
    V    46
         46
    R    45
    F    43
    K    43
    U    41
    C    40
    W    39
    E    39
    J    39
    X    37
    M    37
    Q    35
    Y    35
    Z    34
    O    33
    D    33
    H    32
    G    32
    L    31
    N    31
    S    29
I would like to create a new value in the df['category'] column called "other", and assign it to every value of df['category'] that has a value_counts() frequency of less than 35.
Can anyone help me out with this? Let me know if you need anything more from me.
Edit: trying @EdChum's proposed solution
    import pandas as pd
    import random as rand
    from string import ascii_uppercase

    rand.seed(1010)

    df = pd.DataFrame()
    values = list()
    for i in range(0, 1000):
        category = (''.join(rand.choice(ascii_uppercase) for i in range(1)))
        values.append(category)

    df['category'] = values
    df['category'].value_counts()
    df.loc[df['category'].isin((df['category'].value_counts([df['category'].value_counts() < 35]).index), 'category'] = 'other'
      File "<stdin>", line 1
        df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
        ^
    SyntaxError: invalid syntax
Note: I am using Python 2.7 in the Spyder IDE (I tried the proposed solution in both the IPython and Python console windows).
You can use value_counts to generate a boolean mask that selects the rare values, and then set these to 'other' using loc:
    In [71]:
    df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
    df

    Out[71]:
        category
    0      other
    1      other
    2
    3          V
    4          U
    5          D
    6          T
    7          G
    8          S
    9          H
    10     other
    11     other
    12     other
    13     other
    14         S
    15         D
    16         B
    17         P
    18         B
    19     other
    20     other
    21         F
    22         H
    23         G
    24         P
    25     other
    26         M
    27         V
    28         T
    29
    ..       ...
    970        E
    971        D
    972    other
    973        P
    974        V
    975        S
    976        E
    977    other
    978        H
    979        V
    980        O
    981    other
    982        O
    983        Z
    984    other
    985        P
    986        P
    987    other
    988        O
    989    other
    990        P
    991        X
    992        E
    993        V
    994        B
    995        P
    996        B
    997        P
    998        Q
    999        X

    [1000 rows x 1 columns]
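As a small, self-contained check of the same idea, the sketch below wraps the one-liner in a helper and runs it on toy data. The function name lump_rare, its parameters, and the toy values are illustrative assumptions and are not part of the original answer.

```python
import pandas as pd


def lump_rare(series, min_count, other_label='other'):
    """Return a copy of `series` with rare values replaced by `other_label`.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    counts = series.value_counts()
    # Labels whose frequency falls below the threshold
    rare_labels = counts[counts < min_count].index
    result = series.copy()
    # The boolean mask from isin selects the rows holding a rare label
    result.loc[result.isin(rare_labels)] = other_label
    return result


toy = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'C', 'D']})
toy['category'] = lump_rare(toy['category'], min_count=2)
print(toy['category'].tolist())  # ['A', 'A', 'A', 'B', 'B', 'other', 'other']
```

Computing value_counts() once and reusing it avoids counting the column twice, which the one-liner does.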
Breaking the above down:
    In [74]:
    df['category'].value_counts() < 35

    Out[74]:
    W    False
    B    False
    C    False
    V    False
    H    False
    P    False
    T    False
    R    False
    U    False
    K    False
    E    False
    Y    False
    M    False
    F    False
    O    False
         False
    D    False
    Q    False
    N     True
    J     True
    S     True
    G     True
    Z     True
          True
    X     True
    L     True
    Name: category, dtype: bool

    In [76]:
    df['category'].value_counts()[df['category'].value_counts() < 35]

    Out[76]:
    N    34
    J    33
    S    33
    G    33
    Z    32
         31
    X    31
    L    30
    Name: category, dtype: int64
We can then use isin against the .index values of that filtered result, and set the matching rows to 'other'.
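For comparison, here is an alternative sketch that is not taken from the answer above: instead of isin against the index, it uses Series.map to look up each row's own frequency and compares that against the threshold directly. The variable names, the toy data, and the threshold of 3 are assumptions chosen for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'category': ['P', 'P', 'P', 'S', 'N', 'N', 'L']})

# Map every row to the frequency of its own value, giving one count per row
row_counts = demo['category'].map(demo['category'].value_counts())

# Rows whose value occurs fewer than 3 times become 'other'
demo.loc[row_counts < 3, 'category'] = 'other'
print(demo['category'].tolist())
# ['P', 'P', 'P', 'other', 'other', 'other', 'other']
```

Both approaches should produce the same result; this one builds a row-aligned mask in one step instead of going through the index of the filtered counts.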