Ticket #498 (new enhancement)

Opened 9 years ago

Librivox contributions and dates/numbers

Reported by: kmaclean Owned by: kmaclean
Priority: major Milestone:
Component: Prompts Version: 0.1-alpha
Keywords: Cc:


see Librivox contributions and dates/numbers

While reviewing a candidate audio file, I came across many dates in one section (1800, 1839, and so on).

This raises the question of whether, in a prompt context, it is better to handle these numbers at the text2prompts stage (ensuring that 1800 becomes "eighteen hundred", for example) or to include '1800' as a separate word in the lexicon.

The downside of the latter is that you could end up with a large number of numeric entries in the lexicon, eventually more numbers than words. Pre-treatment at the text2prompts stage seems more efficient.
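
To make the pre-treatment option concrete, here is a minimal Python sketch of how year-like numbers could be spelled out before the prompts are generated. It is not the actual text2prompts code: the function names (two_digits, year_to_words, normalize_years) are hypothetical, and it assumes that any standalone four-digit number from 1000 to 2999 is a year, read in the common English "eighteen thirty nine" style.

```python
import re

# Spoken forms for 0..19 and for the tens (20, 30, ...).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_words(year):
    """Spell out a four-digit year the way it is usually read aloud.

    Assumption: years are read as two pairs of digits ("eighteen thirty
    nine"), with "hundred", "thousand" and "oh" for the special cases.
    """
    century, rest = divmod(year, 100)
    if rest == 0:
        if century % 10 == 0:
            return f"{ONES[century // 10]} thousand"   # 2000
        return f"{two_digits(century)} hundred"        # 1800
    if rest < 10:
        return f"{two_digits(century)} oh {ONES[rest]}"  # 1905
    return f"{two_digits(century)} {two_digits(rest)}"   # 1839


def normalize_years(text):
    """Replace standalone numbers 1000..2999 with their spoken form.

    Assumption: every such number is a year; a real pipeline would need
    extra rules for quantities, page numbers, and so on.
    """
    return re.sub(r"\b[12]\d{3}\b",
                  lambda m: year_to_words(int(m.group())),
                  text)


print(normalize_years("Between 1800 and 1839 the town grew."))
# Between eighteen hundred and eighteen thirty nine the town grew.
```

With this approach the lexicon only needs the few dozen number words in ONES and TENS (plus "hundred", "thousand" and "oh"), rather than one entry per distinct number found in the text.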

Is there an industry standard, or even a VoxForge preference, for this?
