System Evaluation

Unlike other software applications, the functionality of speech technology applications cannot be verified by simple testing. In most cases, quality assurance for applications involving speech recognition, dialogue handling or speech synthesis requires a usability test with a representative group of test users.

What is system evaluation?

The evaluation of a speech application consists of a series of controlled tests performed by a selected group of test users, tests of the application in an artificial test environment ('test bed'), or both. Usually all tests are monitored (recorded) and later analysed against pre-defined test criteria. The test results are interpreted and summarized in an evaluation report, which may serve as the basis for strategic decisions by the customer (e.g. whether or not to use speech recognition in a product). An objective and independent system evaluation is therefore of paramount importance.

System Evaluation Standards?

Since the field of speech-driven applications is rather young, and such applications differ widely in the techniques and procedures they use, there are no recognized evaluation standards for speech applications. Each case must be analysed very carefully to determine the test criteria that really enable the customer to reach a sound judgement of the new technology.
Walker et al. (1997) proposed the PARADISE framework for evaluating spoken dialogue systems. Unfortunately, we found in several evaluations that PARADISE is applicable only to a very simple and specialized form of dialogue system.
For multi-modal speech applications, the BAS has developed another evaluation scheme called PROMISE (Beringer et al. 2002), which was successfully applied in the evaluation of the SMARTKOM systems.
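
To make the PARADISE approach mentioned above more concrete: Walker et al. (1997) model system performance as a weighted combination of task success (a kappa coefficient) and dialogue costs, with every measure normalized to a z-score. The following Python sketch illustrates that performance function only. The function names, the example cost measure ("turns"), the use of the population standard deviation and all weight values are assumptions chosen for illustration. In PARADISE the weights are estimated by regression against user satisfaction ratings, a step that is not shown here.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Z-score normalization N(x) = (x - mean) / std over all dialogues.

    Assumption: the population standard deviation is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All dialogues have the same value, so the measure carries no information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def paradise_performance(kappa, costs, alpha, weights):
    """Per-dialogue performance: alpha * N(kappa) - sum_i w_i * N(c_i).

    kappa   -- list of task success values, one per dialogue
    costs   -- dict mapping a cost name to a list of values, one per dialogue
    alpha   -- weight of task success (hypothetical value in this sketch)
    weights -- dict mapping each cost name to its weight (hypothetical values)
    """
    n_kappa = z_normalize(kappa)
    n_costs = {name: z_normalize(vals) for name, vals in costs.items()}
    return [
        alpha * n_kappa[d]
        - sum(weights[name] * n_costs[name][d] for name in costs)
        for d in range(len(kappa))
    ]


if __name__ == "__main__":
    # Three hypothetical test dialogues: task success and number of turns.
    scores = paradise_performance(
        kappa=[0.2, 0.6, 1.0],
        costs={"turns": [30, 20, 10]},
        alpha=0.5,
        weights={"turns": 0.3},
    )
    for index, score in enumerate(scores):
        print(f"dialogue {index}: performance {score:.3f}")
```

Because every measure is normalized across the tested dialogues, the resulting scores are only comparable within one evaluation, not across different systems or test series.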

What can BASSS do for you?

Members of BASSS have extensive experience with system evaluations (VERBMOBIL, SMARTKOM) and are therefore ideal partners for a pragmatic, objective and independent evaluation of all kinds of speech applications. We distinguish between holistic (or black-box) evaluations, in which the overall performance of the complete system is measured in usability tests, and analytic evaluations, in which selected components of a speech application undergo objective tests on suitable reference language resources (LRs) and/or psycho-physical tests.

