Ticket #366 (new task)

Opened 15 years ago

Review Processing of Prompt Files for non-English languages

Reported by: kmaclean
Owned by: kmaclean
Priority: major
Milestone: Acoustic Model 0.1.2
Component: Scripts
Version: 0.1-alpha
Keywords:
Cc:


I strip many characters from the original prompt file. From the Prompts.pm script (note: anything with a leading "#" is a comment):

            # cleanup prompts
            $linescalar = join(" ", @line);
            $linescalar =~ tr/a-z/A-Z/; # change to uppercase (ASCII letters only)
            $linescalar =~ s/,//g; # remove commas
            $linescalar =~ s/\.//g; # remove periods
            # dealing with quotes
            # $linescalar =~ s/\'//g; # remove single quotes; but need words like "don't" - need to research this more ...
            # $linescalar =~ s/\'\b(.*)\b\'/$1/g; # remove single quotes from quoted text; single quote must be at start of a word, and at end of a word - does not work if there are two words with single quotes in them in the same sentence ...
            $linescalar =~ s/\'EM//g; # remove 'EM (contraction of "them") entirely, including its leading single quote
            $linescalar =~ s/\"//g; # remove double quotes
            $linescalar =~ s/://g; # remove colon
            # $linescalar =~ s/-//g; # compound word dash
            $linescalar =~ s/--//g; # double dash
            $linescalar =~ s/ - / /g; # space dash space punctuation
            $linescalar =~ s/ -/ /g; # space dash punctuation
            $linescalar =~ s/;//g; # semi-colon
            $linescalar =~ s/!//g; # exclamation mark
            $linescalar =~ s/\?//g; # question mark
            # Other cleanup !!!!!! need to change the prompts files directly rather than doing this!!! or add to dictionary!!!
            $linescalar =~ s/&/AND/g;
            $linescalar =~ s/2000/TWO THOUSAND/g;

So I don't remove single quotes. I'm not sure what to do about them; maybe we need a separate set of rules for each language.
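One possible shape for per-language rules, as a rough sketch only: a table of substitutions keyed by language code, applied on top of a shared default set. Everything here is an assumption for illustration and is not taken from Prompts.pm: the %cleanup_rules table, the clean_prompt name, the "fr" and "es" entries, and the choice of uc() instead of tr/a-z/A-Z/ (which only touches ASCII letters). It also assumes the prompt text has already been decoded to Perl character strings. Apostrophes are left alone in every language, as in the current script.

```perl
use strict;
use warnings;
use utf8;                        # this source file contains UTF-8 literals
use feature 'unicode_strings';   # make uc() use Unicode rules (Perl 5.12+)

# Hypothetical per-language rule table; the language codes and the rules
# themselves are illustrative assumptions, not taken from Prompts.pm.
my %cleanup_rules = (
    default => [
        [ qr/[,.:;!?"]/ => ''  ],   # sentence punctuation and double quotes
        [ qr/--/        => ''  ],   # double dash
        [ qr/ - ?/      => ' ' ],   # free-standing dash used as punctuation
    ],
    fr => [
        [ qr/[«»]/      => ''  ],   # guillemets
    ],
    es => [
        [ qr/[¿¡]/      => ''  ],   # inverted question and exclamation marks
    ],
);

# Uppercase a prompt, then apply the default rules followed by any rules
# registered for $lang. Single quotes (apostrophes) are never removed.
sub clean_prompt {
    my ($lang, @words) = @_;
    my $text = uc join(' ', @words);   # also uppercases accented letters
    for my $rule (@{ $cleanup_rules{default} }, @{ $cleanup_rules{$lang} // [] }) {
        my ($pattern, $replacement) = @$rule;
        $text =~ s/$pattern/$replacement/g;
    }
    $text =~ s/\s+/ /g;        # collapse whitespace left behind by removals
    $text =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
    return $text;
}

# Example calls:
#   clean_prompt('en', "Don't", 'stop', '-', 'now!')   # returns "DON'T STOP NOW"
#   clean_prompt('es', '¿Dónde', 'está?')              # returns "DÓNDE ESTÁ"
```

A language with no entry in the table (such as 'en' above) simply gets the default rules, so adding a language means adding one key rather than editing the cleanup loop.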
