-
copy first 5 pages of every PDF to a file.
-
then remove ……
-
Then displace Ia from its place. include lower case at the beginnning of the line.
-
find the numebr of N.
-
if the horoscope article is present in frist 5 pages then next 2 pages are copied.
-
Problem: there are too less occurences of N. Hence words starting this character from U Dienshonhi Dictionary were borrowed and then words( ) were inserted to training file.
-
I inserted 127 N words.
-
insert TAB space in 52nd line throughout files for EOL
-
remove all the empty lines from the corpus file containing 20,000 lines. Command: sed ‘/^$/d’ kha01-copy.training_text > output.txt
-
insert 2 blank lines for every 52 lines. Command: #awk ‘1;!(NR%51){print “\t”;}’ output1.txt > kha.train.training_text #awk ‘1;!(NR%51){print “\t”;}’ output-eval.txt > kha.eval.training_text
training text = 16,318 lines, eval_text = 3,736 lines
I at the begining of lines caused hallucination effect. Hence words starting with I were placed randomly in a sentence. The capital letter on n- N occurs very less ~1%. The occurence of this capital letter was increased to 8% by inserting words starting with N randomly in the training text file.