python - Conditionally create an "Other" category in categorical column


I have a DataFrame df with one column, category, created with the code below:

    import pandas as pd
    import random as rand
    from string import ascii_uppercase

    rand.seed(1010)

    df = pd.DataFrame()
    values = list()
    for i in range(0, 1000):
        # pick one random uppercase letter per row
        category = ''.join(rand.choice(ascii_uppercase) for i in range(1))
        values.append(category)

    df['category'] = values

The frequency counts for each value are:

    df['category'].value_counts()
    Out[95]:
    P    54
    B    50
    T    48
    V    46
         46
    R    45
    F    43
    K    43
    U    41
    C    40
    W    39
    E    39
    J    39
    X    37
    M    37
    Q    35
    Y    35
    Z    34
    O    33
    D    33
    H    32
    G    32
    L    31
    N    31
    S    29

I want to make a new value in the df['category'] column called "other" and assign it to every value of df['category'] whose value_counts is less than 35.

Can anyone help me out with this?

Let me know if you need anything more from me.

Edit: trying @EdChum's proposed solution

    import pandas as pd
    import random as rand
    from string import ascii_uppercase

    rand.seed(1010)

    df = pd.DataFrame()
    values = list()
    for i in range(0, 1000):
        category = ''.join(rand.choice(ascii_uppercase) for i in range(1))
        values.append(category)

    df['category'] = values
    df['category'].value_counts()

    df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
      File "<stdin>", line 1
        df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
                                                                                       ^
    SyntaxError: invalid syntax

Note: I am using Python 2.7 in the Spyder IDE (I tried the proposed solution in both the IPython and Python console windows).

You can use value_counts to generate a boolean mask of the infrequent values, then set the matching rows to 'other' using loc:

    In [71]:
    df.loc[df['category'].isin((df['category'].value_counts()[df['category'].value_counts() < 35]).index), 'category'] = 'other'
    df

    Out[71]:
        category
    0      other
    1      other
    2
    3          V
    4          U
    5          D
    6          T
    7          G
    8          S
    9          H
    10     other
    11     other
    12     other
    13     other
    14         S
    15         D
    16         B
    17         P
    18         B
    19     other
    20     other
    21         F
    22         H
    23         G
    24         P
    25     other
    26         M
    27         V
    28         T
    29
    ..       ...
    970        E
    971        D
    972    other
    973        P
    974        V
    975        S
    976        E
    977    other
    978        H
    979        V
    980        O
    981    other
    982        O
    983        Z
    984    other
    985        P
    986        P
    987    other
    988        O
    989    other
    990        P
    991        X
    992        E
    993        V
    994        B
    995        P
    996        B
    997        P
    998        Q
    999        X

    [1000 rows x 1 columns]

Breaking the above down:

    In [74]:
    df['category'].value_counts() < 35

    Out[74]:
    W    False
    B    False
    C    False
    V    False
    H    False
    P    False
    T    False
    R    False
    U    False
    K    False
    E    False
    Y    False
    M    False
    F    False
    O    False
         False
    D    False
    Q    False
    N     True
    J     True
    S     True
    G     True
    Z     True
          True
    X     True
    L     True
    Name: category, dtype: bool

    In [76]:
    df['category'].value_counts()[df['category'].value_counts() < 35]

    Out[76]:
    N    34
    J    33
    S    33
    G    33
    Z    32
         31
    X    31
    L    30
    Name: category, dtype: int64
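
The same steps can be followed on a much smaller frame, where the result is easy to check by eye. This is only a sketch: the ten-row toy data, the cutoff of 3, and the intermediate names counts, is_rare and rare_labels are assumptions made for illustration and are not part of the original post.

```python
import pandas as pd

# Toy data (hypothetical): A appears 4 times, B 3 times, C twice, D once.
df = pd.DataFrame({"category": list("AAAABBBCCD")})
threshold = 3  # assumed cutoff for the toy data; the post uses 35

counts = df["category"].value_counts()   # frequency of each label
is_rare = counts < threshold             # boolean Series indexed by label
rare_labels = counts[is_rare].index      # labels under the cutoff (C and D)

# Overwrite, in place, every row whose label is in the rare set.
df.loc[df["category"].isin(rare_labels), "category"] = "other"

print(df["category"].value_counts())
```

Naming the intermediates also avoids calling value_counts twice, which the one-liner does.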

We can then use isin against the .index values and set those rows to 'other'.

As for the SyntaxError in the question's edit: the line as pasted there most likely carried invisible zero-width characters (U+200C and U+200B) inside the second value_counts, which is where the caret in the traceback points. Retyping the line by hand, rather than copying it, should avoid the error.
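
If this is needed in more than one place, the idea can be wrapped in a small helper. The version below is a sketch of an alternative, not the method shown above: it maps each row to the frequency of its own label and then uses Series.where instead of isin and loc. The function name lump_rare and its parameters are hypothetical.

```python
import pandas as pd

def lump_rare(series, min_count, other_label="other"):
    """Return a copy of `series` with labels seen fewer than `min_count` times replaced."""
    # For every row, look up how often that row's label occurs overall.
    row_counts = series.map(series.value_counts())
    # Keep labels that are frequent enough; replace the rest.
    return series.where(row_counts >= min_count, other_label)

# Toy data (hypothetical): A x4, B x3, C x2, D x1.
s = pd.Series(list("AAAABBBCCD"), name="category")
result = lump_rare(s, min_count=3)
print(result.value_counts())
```

For the frame in the question this would be used as df['category'] = lump_rare(df['category'], 35). Unlike the loc assignment, it returns a new Series and leaves the input untouched. Note that missing values get a NaN count, fail the comparison, and so are also replaced by other_label.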

