preparing-corpus

  • Copy the first 5 pages of every PDF to a file.

  • then remove ……

  • Then move Ia away from its position at the beginning of lines. Include lower-case words at the beginning of lines.

  • Find the number of occurrences of N.

  • If the horoscope article is present in the first 5 pages, then the next 2 pages are copied as well.

  • Problem: there are too few occurrences of N. Hence words starting with this character were borrowed from the U Dienshonhi Dictionary, and these words ( ) were inserted into the training file.

  • I inserted 127 N words.

  • Insert a TAB on every 52nd line throughout the files to mark EOL.

  • Remove all the empty lines from the corpus file containing 20,000 lines. Command: sed '/^$/d' kha01-copy.training_text > output.txt

  • Insert a TAB-only line after every 51 lines, so that every 52nd line of the output is the TAB line. Commands:
    awk '1;!(NR%51){print "\t";}' output1.txt > kha.train.training_text
    awk '1;!(NR%51){print "\t";}' output-eval.txt > kha.eval.training_text
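
A quick sanity check for the two cleanup steps above can be sketched as follows. It assumes the awk step leaves a line holding a single TAB at every 52nd position of the output, and that no truly empty lines should survive the sed step. The function names count_empty_lines and check_tab_lines are hypothetical helpers, not part of the original workflow.

```shell
#!/bin/sh
# Hypothetical helpers for checking the cleaned corpus files.

# Print the number of completely empty lines in a file.
# After the sed step this should print 0 (TAB-only lines are not counted).
count_empty_lines() {
  grep -c '^$' "$1" || true   # grep exits 1 when the count is 0, so ignore that
}

# Succeed only if every 52nd line of the file is exactly one TAB character.
check_tab_lines() {
  awk -v n=52 '
    NR % n == 0 && $0 != "\t" { bad++ }   # a 52nd line that is not TAB-only
    END { exit (bad > 0) }                # exit status 1 if any was found
  ' "$1"
}

# Example use (file names as in the notes above):
#   count_empty_lines output.txt
#   check_tab_lines kha.train.training_text && echo "TAB lines OK"
```

Both helpers only read the files, so they can be run on the train and eval files at any point without changing them.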

training text = 16,318 lines, eval_text = 3,736 lines

I at the beginning of lines caused a hallucination effect. Hence words starting with I were placed at random positions within sentences. The capital letter N occurs very rarely (~1%). Its occurrence was increased to 8% by inserting words starting with N at random positions in the training text file.
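
The measuring and inserting described above could be done roughly like this. This is only a sketch under assumptions: the notes do not say how the percentage was computed, so initial_share assumes it means the share of words that begin with the given letter, and insert_words assumes a word list with one word per line. Both function names and the seed argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; not the commands actually used for the corpus.

# initial_share LETTER FILE
# Print the percentage of whitespace-separated words that start with LETTER.
initial_share() {
  awk -v c="$1" '
    { for (k = 1; k <= NF; k++) { total++; if (index($k, c) == 1) hits++ } }
    END { if (total) printf "%.2f\n", 100 * hits / total; else print "0.00" }
  ' "$2"
}

# insert_words WORDLIST TEXT SEED
# For each word in WORDLIST, pick a random line of TEXT and insert the word
# after a randomly chosen existing word, so it never lands at the line start.
# Lines without words (empty or TAB-only) are left alone; a word drawn for
# such a line is dropped. Spacing inside a changed line becomes single spaces.
insert_words() {
  awk -v seed="$3" '
    NR == FNR { words[++nw] = $0; next }   # first file: one word per line
    { lines[++nl] = $0 }                   # second file: the training text
    END {
      srand(seed)
      for (w = 1; w <= nw; w++) {
        i = int(rand() * nl) + 1           # random target line
        n = split(lines[i], tok, " ")
        if (n < 1) continue                # no words on this line
        p = int(rand() * n) + 1            # insert after word number p
        out = ""
        for (k = 1; k <= n; k++) {
          out = out (k > 1 ? " " : "") tok[k]
          if (k == p) out = out " " words[w]
        }
        lines[i] = out
      }
      for (i = 1; i <= nl; i++) print lines[i]
    }
  ' "$1" "$2"
}

# Example use (n-words.txt is a hypothetical word list):
#   initial_share N kha.train.training_text
#   insert_words n-words.txt output.txt 42 > output1.txt
```

Running initial_share before and after insert_words shows how far the share has moved, so the word list can be extended until the target is reached.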

#awk