Awk: What's wrong with CJK characters? #Korean
Given .txt files of space-separated words such as:
but esope holly bastard 생 지 옥 이 군 지 옥 이 지 옥 지 我 是 你 的 爸 爸 ! 爸 爸 ! ! ! 你 不 會 的 !
and the following pipeline:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2" "$1}'
I get the following output in the console. The counts are wrong for the Korean words, but correct for the English and Chinese space-separated words:
생 16 bastard 1 2 esope 1 holly 1 2 1 2 不 1 你 2 我 1 是 1 會 1 爸 4 的 2
How can I make this work for Korean words? Note: I have 300,000 lines, nearly 2 million words.
Edit: the command I ended up using, based on the answer:
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt
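Because the count comes first on each line in the command above, a plain `sort` compares the counts as text, so `10` sorts before `2`. A minimal sketch of ranking by count instead, assuming a numeric, descending order is what is wanted. The sample data is made up for illustration, and the field loop is a swapped-in portable form of the counting step, not the exact command from the post:

```shell
# Made-up sample: 爸 appears 10 times, 지 twice, 옥 once.
sample=$(mktemp)
printf '%s\n' '爸 爸 爸 爸 爸 爸 爸 爸 爸 爸' '지 옥 지' > "$sample"

# Count every field, print "count word", then sort numerically, largest first.
ranked=$(awk '{ for (i = 1; i <= NF; i++) a[$i]++ }
              END { for (k in a) print a[k], k }' "$sample" | sort -rn)

printf '%s\n' "$ranked"
rm -f "$sample"
```

With real data, words that share a count may come out in any order relative to each other unless a second sort key is added.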
A single awk script can handle this, and it is far more efficient than your current pipeline:
$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file
옥 3 bastard 1 ! 5 爸 4 군 1 지 4 2 會 1 你 2 1 是 1 不 1 이 2 esope 1 的 2 holly 1 2 생 1 我 1 2
If you want to store the results in a file, you can use redirection like:
$ awk '{a[$1]++}END{for(k in a)print k,a[k]}' RS=" |\n" file > outfile
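The commands above rely on a multi-character `RS` being treated as a regular expression, which gawk and mawk do but POSIX leaves unspecified. Below is a hedged sketch of an equivalent that loops over the fields of each line instead. The function name `count_words` and the sample lines are illustrative assumptions, not part of the original answer. Since awk array subscripts are plain strings, each distinct word gets its own counter without any `sort | uniq` step:

```shell
# Hypothetical helper: count every whitespace-separated word in the given file(s).
count_words() {
    awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
         END { for (w in count) print w, count[w] }' "$@"
}

# Small sample built from words that appear in the question.
sample=$(mktemp)
printf '%s\n' 'but esope 지 옥 이 군 지 옥' '我 是 你 的 爸 爸 ! 爸 爸 !' > "$sample"

# "for (w in count)" visits keys in an unspecified order, so sort the output;
# LC_ALL=C sorts by byte value, which is the same under every locale.
result=$(count_words "$sample" | LC_ALL=C sort)

printf '%s\n' "$result"
rm -f "$sample"
```

For the real data, `count_words myfile.txt > myfileout.txt` would write the counts to a file in the same way as the redirection shown above.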