This database comprises recordings from 306 speakers in 600 different sessions. Speech signals were recorded in a car and simultaneously transmitted over GSM and recorded on a fixed platform connected to an ISDN line.

 

The SpeechDat Car Spanish Database was recorded within the scope of the SpeechDat Car project (LE4-8334) which was sponsored by the European Commission and the Spanish Government.

 

Collection was performed at the Department of Signal Theory and Communications of the Technical University of Catalonia (UPC), Spain, with the collaboration of SEAT and Volkswagen. The owner of the database is UPC.

 

 

Definition of the database content

 

The following table shows the contents and corpus codes of the SpeechDat Car Spanish Database. All items are read, unless marked as spontaneous.

 

(Table: database contents and corpus codes.)

Speakers

 

Spain has a population of 38 million people. The official language is Spanish (Castilian), and some regions also have other official languages, such as Catalan, Galician and Basque. Due to the limited number of speakers to be recorded, the number of regions is small. The dialectal regions were defined taking into account phonetic differences among regions, and four groups are defined:
 

 
 

 

Region        Description
NORTH WEST    Galicia, Asturias
CENTER        Aragón, Cantabria, Castilla-La Mancha, Castilla-León, La Rioja, Madrid, Extremadura (North), País Vasco, Navarra
SOUTH         Andalucía, Canarias, Extremadura (South), Murcia
EAST          Cataluña, Valencia, Baleares

 
 
 

The distribution of recorded sessions as a function of the speakers' accent region is shown in the next table.
 

 

 

Number   Accent region   Speakers   Sessions   Sessions (%)
1        NORTHWEST          53        106         17.6
2        CENTER             78        154         26.0
3        SOUTH              54        105         17.1
4        EAST              121        235         39.3
TOTAL                      306        600        100

 
 
 

The total number of different speakers is 306: 149 are female and 157 are male. The next table shows the number of sessions spoken by female and male speakers, broken down by age group.
 

 

 

             Male speakers        Female speakers      Percentage of total
Age group    Number   Sessions    Number   Sessions    Speakers   Sessions
18-30          84       165         76       150         52         52.1
31-45          41        80         39        75         26.1       26.1
46-60          30        59         35        69         21.6       21.5
over 60         1         2          0         0          0.3        0.3
TOTAL         156       306        150       294        100        100

 
 

Recording Platforms

 

Two types of recordings compose the database. First, wideband recordings (60-7000 Hz) were made for systems installed and operating in the car itself; second, narrowband recordings (300-3400 Hz) were made for systems that operate centrally outside the car and receive their spoken input from the driver over the cellular telephone network. Two recording platforms were used:

  • A ‘mobile’ recording platform (PltM) installed inside the car, recording multi-channel speech utterances in a high bandwidth mode (16 kHz sample frequency).
  • A ‘fixed’ recording platform (PltF) located at the far-end fixed side of the GSM connection, simultaneously recording the speech utterances coming from the car (8 kHz sample frequency, A-law encoding).

Multi-channel recordings were performed simultaneously in the car and through the GSM network.
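
The two platforms therefore deliver different signal formats: linear PCM at 16 kHz from the in-car recordings and 8-bit A-law at 8 kHz from the GSM side. The Python sketch below decodes G.711 A-law bytes to linear PCM and reads both channel types; the raw, headerless file layout and the file names are assumptions made for illustration, not part of the database specification.

    import struct

    def alaw_byte_to_linear(a_val):
        """Decode one 8-bit G.711 A-law sample to a linear PCM value."""
        a_val ^= 0x55                       # undo the even-bit inversion of A-law
        t = (a_val & 0x0F) << 4             # mantissa
        seg = (a_val & 0x70) >> 4           # segment (exponent)
        if seg == 0:
            t += 8
        elif seg == 1:
            t += 0x108
        else:
            t = (t + 0x108) << (seg - 1)
        return t if (a_val & 0x80) else -t  # sign bit set means positive in G.711

    def read_alaw_file(path):
        """Read a raw, headerless A-law file (assumed layout) into linear samples."""
        with open(path, "rb") as f:
            return [alaw_byte_to_linear(b) for b in f.read()]

    def read_pcm16le_file(path):
        """Read a raw, headerless 16-bit little-endian PCM file (assumed layout)."""
        with open(path, "rb") as f:
            raw = f.read()
        raw = raw[: len(raw) // 2 * 2]      # drop a trailing odd byte, if any
        return list(struct.unpack("<%dh" % (len(raw) // 2), raw))

    # Hypothetical file names; the real naming scheme is defined in the
    # design documentation shipped with the database.
    # car_16k = read_pcm16le_file("mobile_channel0.raw")  # in-car, 16 kHz wideband
    # gsm_8k  = read_alaw_file("fixed_gsm.raw")           # GSM side, 8 kHz A-law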

 

Recording conditions

 

 

Seven environment conditions are defined:

 
  • car stopped with motor running, CEQ: no restrictions
  • car in town traffic, CEQ: everything set to off or closed
  • car in town traffic, CEQ: noisy conditions
  • car moving at low speed on a rough road, CEQ: everything set to off or closed
  • car moving at low speed on a rough road, CEQ: noisy conditions
  • car moving at high speed on a good road, CEQ: no restrictions
  • car moving at high speed on a good road with the audio equipment on, no further restrictions

In addition, some information was collected during the recordings:

  • Weather conditions: rain, sunshine, wind, …
  • Accessories used during recordings: windscreen wipers, ventilation, fan, radio …
  • Level of fan: on/off

Transcription

 

The transcription included in this database is an orthographic, lexical transcription with a few details that represent audible acoustic events (speech and non-speech) present in the corresponding waveform files. The extra marks contained in the transcription aid in interpreting the text form of the utterance. Transcriptions were made in two passes: one pass in which the words were transcribed, and a second pass in which the additional details were added. Transcriptions are CASE INSENSITIVE.

 

Non-speech acoustic events have been arranged into 5 categories and transcribed. Events are transcribed only if they are clearly distinguishable; very low-level, non-intrusive events are ignored. Each event is transcribed at its place of occurrence, using the defined symbols in square brackets. For noise events that extend over a span of one or more words, the transcription indicates the beginning of the noise, just before the first word it affects.

 

The first two categories of acoustic events originate from the speaker, and the other three categories originate from another source. The 5 categories are:

 

[fil]: Filled pause. These sounds can be modelled with a filled-pause model in speech recognisers. Examples of filled pauses: uh, um, er, ah, mm.

 

[spk]: Speaker noise. All kinds of sounds and noises made by the speaker that are not part of the prompted text, e.g. lip smack, cough, grunt, throat clear, tongue click, loud breath, laugh, loud sigh.

 

[sta]: Stationary noise. This category contains background noise that is not intermittent and has a more or less stable amplitude spectrum. Examples: voice babble (cocktail-party noise), sirens, wind, rain, cobblestones.

 

[int]: Intermittent noise. This category contains noises of an intermittent nature. These noises typically occur only once (like a door slam), or have pauses between them (like phone ringing), or change their color over time (like music). Examples: music, background speech, baby crying, phone ringing, door slam, door bell, paper rustle, cross talk, ticks by the direction indicator.

 

[dit]: DTMF and prompt tone. In fact this is a special case of [int], but since this sound can be expected to be present in nearly every speech file, a special symbol was defined.

 

Only signals from microphone 0 have been transcribed. All the signals contain the prompt beep.
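
For many uses of the transcriptions (lexicon building, language modelling, scoring) the event markers have to be separated from the words. The following sketch shows that filtering step for the five markers listed above; the example transcription line is invented for illustration.

    # The five non-speech event markers used in the transcriptions.
    EVENT_MARKERS = {"[fil]", "[spk]", "[sta]", "[int]", "[dit]"}

    def words_only(transcription):
        """Return the word tokens of a transcription line, event markers removed.

        Transcriptions are case insensitive, so tokens are lower-cased for
        consistent downstream matching.
        """
        tokens = transcription.lower().split()
        return [tok for tok in tokens if tok not in EVENT_MARKERS]

    # Hypothetical transcription line, for illustration only:
    print(words_only("[sta] abre la ventanilla [spk] por favor"))
    # -> ['abre', 'la', 'ventanilla', 'por', 'favor']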

 

The Lexicon

 

The database includes a lexicon. The lexicon file is an alphabetically ordered list of distinct lexical items (essentially words in our case) which occur in the corpus with the corresponding pronunciation information. Each distinct word has a separate entry. As the lexicon is derived from the corpus it uses the same alphabetic encoding for special and accented characters as used in the transcriptions (ISO-8859). We have included a frequency count for each entry in the lexicon e.g. to help indicate rare words whose transcriptions are perhaps less important or reliable.

 

The pronunciation lexicon was produced after the transcription phase; it contains, alphabetically sorted, all words found in the transcription (one entry per word), their number of occurrences and the list of their phonemic representations. The words appear in the lexicon exactly as they appear in the transcription. All the component words have been identified and alphabetically sorted; all fragments, mispronunciations and non-speech events have been removed, and only one occurrence of each word has been retained.

 

A software tool developed at UPC (SAGA: Spanish Automatic Graphemes to Allophones Transcriber) has been used to translate the transcribed words to phonemic strings by using the SAMPA phonemic notation. The complete lexicon was manually supervised.
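
The exact column layout of the lexicon file is not specified above, so the reader below is only a sketch: it assumes one entry per line with tab-separated word, occurrence count and one or more SAMPA transcriptions, encoded in ISO-8859 as for the transcriptions. The actual field order should be checked against the documentation delivered with the database.

    # Minimal lexicon reader.  Assumed layout (to be verified against the
    # documentation shipped with the database): one entry per line,
    #   word <TAB> frequency <TAB> SAMPA transcription [<TAB> alternative ...]
    # encoded in ISO-8859-1, matching the transcription encoding.

    def load_lexicon(path):
        lexicon = {}
        with open(path, encoding="iso-8859-1") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 3:
                    continue                     # skip malformed or empty lines
                word, freq, *pronunciations = fields
                lexicon[word] = {
                    "frequency": int(freq),
                    "sampa": pronunciations,     # one or more phonemic variants
                }
        return lexicon

    # Example use (hypothetical file name and entry):
    # lex = load_lexicon("lexicon.tbl")
    # lex.get("coche")  ->  {'frequency': 42, 'sampa': ['" k o tS e']}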

 
 

Availability

 

This database is commercially available.

 

Information

 
 

asuncion.moreno@upc.edu
