Optimization of speech recognition systems

August 4, 2014

An increasing number of intelligent systems, ranging from smartphones, vehicle infotainment systems and tablet and smartphone applications to household devices and building services technology, are controlled via voice input.
However, many voice control systems are very error-prone. The human factor has often been disregarded during programming. Human beings do not always apply the same logic; they express themselves differently according to their language skills, nationality, social environment and educational background. As soon as the spoken command deviates from the command programmed into the system, whether in the selection of words, the sequence of words or the pronunciation, the user is not understood and the command is not carried out. Aborted attempts and repeated speech input are time-consuming for the user and in some situations distracting and dangerous, for example while driving.

To optimally adjust voice control systems to the behavior and pronunciation of the users, behavior patterns of many different people have to be determined and taken into consideration. How do different individuals proceed when they operate the systems? Which commands do they enter via speech recognition to call up specific information and which words do they select to do so, and in which order? How are the individual words pronounced? Crowdsourcing by clickworker provides an ideal and efficient data collection tool to obtain valid data quickly.

Detailed information about our service “Audio datasets for speech recognition training”

Ask our international crowd of 1.5 million Clickworkers how they proceed when they operate their system and which speech commands they would give to call up specific information. By using our crowdsourcing services you will receive the data you need from potential users of your system in valid quantities, broken down by nationality and regional language differences. Furthermore, the results can be classified by other demographic data of our Clickworkers, for example age group or gender.

This newsletter presents a case study that demonstrates how crowdsourcing can be used to train speech recognition systems to react to human behavior and make them more intelligent.

Optimization of speech recognition systems with crowdsourcing

Thousands of Clickworkers record their speech input to control a car infotainment system and supply the manufacturer with these important data for the programming and optimization of the system.

Challenge

Voice control systems are often only as good as their speech recognition. The challenge is to train and optimize these systems so that they react to the different ways users enter speech. Programming without the “human understanding” and “human behavior” factors cannot yield an optimal speech recognition system. Often, the users’ speech entries are not recognized, or they are misunderstood. The users must often enter their commands several times before the system reacts to the entry correctly and displays the desired information. This is time-consuming for the user and is often distracting while driving. In order to optimize the system’s range and enable it to recognize the individual speech entry options of potential users, speech recordings from thousands of different people with individual commands and pronunciations are needed.

Solution

Thousands of our Clickworkers in the German-speaking world record how they would issue a command to call up the predefined reaction “x” or information “y” via the infotainment system. Every speech recording differs in the selection of words, the sequence of words and the pronunciation of the individual Clickworker. The recordings help to train the system’s speech recognition and to optimize the infotainment system for the different ways individual users handle it.

Exemplary workflow

1. The customer determines the targets and the respective number of speech entries. Depending on the requirements, the customer can determine what technical specifications the Clickworkers will have to fulfill and draft a briefing.
2. clickworker sets up the project according to the customer’s wishes. The project is divided into individual microtasks and tested. A microtask corresponds to one speech recording.
3. The microtasks are offered only to qualified Clickworkers (taking the respective target regions and fulfillment of the technical requirements into account) and put online for processing.
4. The Clickworkers read the job instructions and record their individual speech command to accomplish the predefined target. For example: Find a 4-star hotel in Stuttgart (scenario: You are driving to Italy, want to spend a night in Stuttgart and are looking for a hotel with the help of your navigation system).
5. Samples of the speech recordings are checked by clickworker on a daily basis, and all the recordings are uploaded via a Cloud provider from where the customer can download them.

Project specifications

Clickworker qualifications: German native speakers from Germany, Switzerland and Austria
Equipment needed by the Clickworker: PC or laptop with a microphone and loudspeakers
Data transfer: Audio files via Cloud
Typical number of daily speech recordings: 500 to 600 recordings
Quality assurance: Injected testing
Project volume: 40,000 speech recordings (40 tasks/targets with 1,000 speech recordings each)
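
The project figures above can be cross-checked with a short calculation. The sketch below multiplies the number of tasks by the recordings per task and then estimates how many days the collection might take at the typical daily rate. The duration is only an illustration: it assumes the daily rate stays constant over the whole project, which the case study does not state, and the function names are made up for this example.

```python
import math

# Figures taken from the project specifications.
TASKS = 40                       # distinct tasks/targets
RECORDINGS_PER_TASK = 1_000      # speech recordings collected per task
DAILY_MIN, DAILY_MAX = 500, 600  # typical recordings per day

def total_recordings(tasks: int, per_task: int) -> int:
    """Project volume: every task receives the same number of recordings."""
    return tasks * per_task

def days_needed(total: int, per_day: int) -> int:
    """Whole days needed at a constant daily rate (assumption, not from the source)."""
    return math.ceil(total / per_day)

total = total_recordings(TASKS, RECORDINGS_PER_TASK)
fastest = days_needed(total, DAILY_MAX)
slowest = days_needed(total, DAILY_MIN)

print(f"{total} recordings, roughly {fastest} to {slowest} days of collection")
```

With the stated figures the total matches the project volume of 40,000 recordings, and a constant rate of 500 to 600 recordings per day would correspond to roughly 67 to 80 days.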

If you have further questions about our services or would like to request an offer, please send us an e-mail at: request@clickworker.com
or give us a call at: +49 201 959718-0.

Ines Maione