Speech corpus

For broader coverage of this topic, see Corpus linguistics.

A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models (which can then be used with a speech recognition engine). In linguistics, spoken corpora are used for research in phonetics, conversation analysis, dialectology and other fields.

A corpus is a single such database; corpora, the plural of corpus, refers to several of them.
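The pairing of audio files with transcriptions described above can be illustrated with a minimal sketch. It assumes a hypothetical directory layout in which each recording (for example utt001.wav) sits next to a same-named text file (utt001.txt) holding its transcription. The names Utterance and load_corpus are invented for this illustration; real corpora each define their own layout, file formats and metadata.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class Utterance:
    """One corpus entry: an audio file and its text transcription."""
    audio_path: Path
    transcript: str


def load_corpus(root: Path) -> List[Utterance]:
    """Pair every .wav file in root with a same-named .txt transcript.

    The flat "utt001.wav next to utt001.txt" layout is an assumption
    made for this sketch, not a standard shared by real corpora.
    """
    corpus = []
    for audio_path in sorted(root.glob("*.wav")):
        transcript_path = audio_path.with_suffix(".txt")
        if not transcript_path.exists():
            continue  # skip recordings that have no transcription
        text = transcript_path.read_text(encoding="utf-8").strip()
        corpus.append(Utterance(audio_path, text))
    return corpus
```

Calling load_corpus(Path("my_corpus")) on such a directory would return one Utterance per transcribed recording, which is roughly the form of input that acoustic model training tools consume.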

There are two types of speech corpora:

Read speech, which includes:
- Book excerpts
- Broadcast news
- Lists of words
- Sequences of numbers

Spontaneous speech, which includes:
- Dialogs: conversations between two or more people (including meetings)
- Narratives: a person telling a story (one such corpus is the Buckeye Corpus)
- Map tasks: one person explains a route on a map to another
- Appointment tasks: two people try to find a common meeting time based on their individual schedules

A special kind of speech corpus is the non-native speech database, which contains speech with a foreign accent.



This article is issued from Wikipedia - version of the 11/8/2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply for the media files.