Awk: What is wrong with CJK characters? #Korean


Given .txt files of space-separated words such as:

but esope holly bastard 생 지 옥 이 군 지 옥 이 지 옥 지 我 是 你 的 爸 爸 ! 爸 爸 ! ! ! 你 不 會 的 ! 

and the awk pipeline:

cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2" "$1}' 

I get the following output in the console. The counts are invalid for the Korean words (but valid for the English and Chinese space-separated words):

생 16 bastard 1 2 esope 1 holly 1 2 1 2 不 1 你 2 我 1 是 1 會 1 爸 4 的 2 

How can I make this work for the Korean words? Note: I have 300,000 lines and nearly 2 million words.


Edit: I used the answer below:

$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt 

A single awk script can handle this, and it is far more efficient than your current pipeline:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file

옥 3 bastard 1 ! 5 爸 4 군 1 지 4 2 會 1 你 2 1 是 1 不 1 이 2 esope 1 的 2 holly 1 2 생 1 我 1 2 
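As a rough sketch, the one-liner can be wrapped in a small shell function with each step commented. The name count_words is hypothetical, and the `$1 != ""` guard is an addition that assumes the entries with an empty key come from repeated separators. A regular-expression RS is assumed to be supported, as it is in gawk and mawk.

```shell
# count_words FILE (hypothetical helper, not from the original post):
# print "word count" for every space- or newline-separated word in FILE.
count_words() {
  awk '
    # RS splits the input on spaces and newlines, so every record is one word.
    # Assumption: skip empty records, which appear when separators repeat.
    $1 != "" { count[$1]++ }
    # After the last record, print each distinct word with its tally.
    # The iteration order of "for (w in count)" is unspecified.
    END { for (w in count) print w, count[w] }
  ' RS=' |\n' "$1"
}
```

Because the output order is unspecified, pipe the result through sort if a stable order is needed, for example `count_words file | sort`.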

If you want to store the results in a file, you can use redirection like:

$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile 
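One plausible explanation for the merged Korean counts, which the post itself does not confirm, is locale-dependent collation: in some UTF-8 locales, sort and uniq can treat distinct characters as equal. Assuming that is the cause, forcing byte-wise comparison with LC_ALL=C should let the original pipeline count each word separately. The function name count_words_pipeline is hypothetical.

```shell
# count_words_pipeline FILE (hypothetical helper, not from the original post):
# the original tr | sort | uniq -c pipeline, with byte-wise comparison forced.
count_words_pipeline() {
  # LC_ALL=C makes sort and uniq compare raw bytes instead of using the
  # collation rules of the current locale (assumed cause of the merging).
  tr ' ' '\n' < "$1" | LC_ALL=C sort | LC_ALL=C uniq -c | awk '{print $2" "$1}'
}
```

This keeps the shape of the original command, but for 300,000 lines the single awk script avoids the sort step and is likely to be faster.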
