Ticket #132 (closed task: wontfix)
Reconciling Approaches to Collecting Audio for a Speech Corpus
| Reported by: | kmaclean | Owned by: | kmaclean |
|---|---|---|---|
| Priority: | major | Milestone: | WebSite 0.2.1 |
| Component: | Audio | Version: | 0.1-alpha |
| Keywords: | | Cc: | |
Description
There seem to be two broad approaches to collecting audio for a speech corpus:
1. collect speech with as little external noise as possible
The presumption here is that any speech to be recognized will be noise-filtered (i.e. echo, hardware noise, external sounds, etc. removed) before it is sent to the speech recognition engine. Julius includes parameters to aid in this noise removal, using spectral subtraction ('ssload' and 'ssalpha'); see the sketch after this list.
2. collect speech in its natural environment, warts and all
The approach here is to ensure that we get recordings covering as many different hardware configurations as possible (different microphone types, computers with dedicated audio cards and with on-board audio, noisy and quiet cooling fans/hard drives) and different recording environments (rooms with and without echo). The presumption here is that there would be no noise pre-filtering of the speech to be recognized, because any such noise removal algorithm introduces noise of its own.
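For reference, spectral subtraction works by estimating the noise's average magnitude spectrum from a noise-only segment and subtracting a scaled copy of it from each frame of the speech. The following is a minimal toy sketch in Python/NumPy, not Julius's actual implementation; the frame size, over-subtraction factor, and spectral floor are all assumptions (the 'alpha' parameter here is similar in spirit to Julius's 'ssalpha').

```python
import numpy as np

def spectral_subtract(signal, noise, frame_len=512, hop=256, alpha=2.0, floor=0.02):
    """Toy spectral subtraction: subtract an averaged noise magnitude
    spectrum from each frame of `signal`.

    `noise` is a noise-only recording at least one frame long (e.g. a
    second of "silence" captured before the utterance).
    """
    window = np.hanning(frame_len)

    def frames(x):
        n = (len(x) - frame_len) // hop + 1
        return np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n)])

    # Average magnitude spectrum of the noise-only segment.
    noise_mag = np.abs(np.fft.rfft(frames(noise), axis=1)).mean(axis=0)

    out = np.zeros(len(signal))
    for i, frame in enumerate(frames(signal)):
        spec = np.fft.rfft(frame)
        mag = np.abs(spec) - alpha * noise_mag       # over-subtract the noise estimate
        mag = np.maximum(mag, floor * np.abs(spec))  # spectral floor to limit "musical noise"
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame_len)
        out[i * hop:i * hop + frame_len] += clean    # overlap-add resynthesis
    return out
```

Note the floor step: crude subtraction leaves residual "musical noise" artifacts, which is exactly the concern raised by approach 2 above.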
Both approaches require that we get good monophone and triphone coverage, and recordings from as many different people, dialects, and accents as possible; a rough way to check that coverage is sketched below.
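Coverage can be checked mechanically once a pronunciation dictionary is available. Below is a rough sketch, assuming a hypothetical lexicon mapping each word to its phone sequence; the dictionary format and phone set are placeholders, not VoxForge's actual resources.

```python
from collections import Counter

def phone_coverage(prompts, lexicon):
    """Count monophones and word-internal triphones across a prompt list.

    `lexicon` maps a word to its phone sequence,
    e.g. {"hello": ["hh", "ah", "l", "ow"]}.
    """
    mono, tri = Counter(), Counter()
    for prompt in prompts:
        for word in prompt.lower().split():
            phones = lexicon.get(word)
            if phones is None:
                continue  # out-of-vocabulary word; skip for this rough count
            mono.update(phones)
            tri.update(zip(phones, phones[1:], phones[2:]))
    return mono, tri

# Example usage with a tiny made-up lexicon:
lexicon = {"hello": ["hh", "ah", "l", "ow"], "world": ["w", "er", "l", "d"]}
mono, tri = phone_coverage(["hello world"], lexicon)
print(mono.most_common(3), tri.most_common(3))
```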
Need to follow up on David Gelbert's recommendation to ask the comp.speech.research newsgroup for advice...