MUSCLE

Artificial Intelligence & Information Analysis

Project’s Rationale

MUSCLE is a pan-European Network of Excellence that fosters close collaboration between research groups in multimedia data mining on the one hand and machine learning on the other, in order to make breakthrough progress towards the following objectives:

  • Harnessing the full potential of machine learning and cross-modal interaction for the (semi-)automatic generation of metadata with high semantic content for multimedia documents;
  • Applying machine learning to the creation of expressive, context-aware, self-learning, and human-centered interfaces that can effectively assist users in the exploration of complex and rich multimedia content;
  • Improving the interoperability and exchangeability of heterogeneous and distributed (meta)data by enabling data descriptions of high semantic content (e.g. ontologies, MPEG-7, and XML schemata) and inference schemes that can reason about these at the appropriate levels;
  • Contributing, through dissemination, training, and industrial liaison, to the distribution and uptake of the technology by relevant end-users such as industry, education, and the service sector; in particular, close interactions with other IPs and NoEs in this and related activity fields are planned;
  • Through accomplishing the above, facilitating broad and democratic access to information and knowledge (i.e. obviating the need for special expertise) for all European citizens (e.g. e-Education, enriched cultural heritage).

 

Partners:

  • Advanced Computer Vision
  • ARMINES
  • Aristotle University of Thessaloniki
  • Bilkent University
  • Commissariat à l’Energie Atomique, France
  • Consiglio Nazionale delle Ricerche
  • Centre National de la Recherche Scientifique
  • Centre for Mathematics and Computer Science
  • Ecole Nationale Supérieure de l’Electronique et de ses Applications
  • European Research Consortium for Informatics and Mathematics
  • Foundation for Research and Technology Hellas
  • France Telecom SA
  • Ecole Nationale Supérieure des Telecommunications
  • Institut für Bildverarbeitung und angewandte Informatik e.V.
  • National Technical University of Athens
  • INRIA-Ariana
  • INRIA-Imedia
  • INRIA-Parole
  • INRIA-Texmex
  • INRIA-Vista
  • Royal Institute of Technology
  • LTU Technologies
  • Computer and Automation Research Institute of the Hungarian Academy of Sciences
  • Austrian Research Centers, Seibersdorf Research GmbH
  • Tel Aviv University
  • Trinity College Dublin, Ireland
  • Israel Institute of Technology
  • Technical University of Crete, Greece
  • Graz University of Technology
  • Vienna University of Technology
  • Cambridge University
  • University College London
  • Albert-Ludwigs-Universitaet Freiburg
  • University of Surrey
  • Universitat Politecnica de Catalunya
  • Institute of Information Theory and Automation
  • University of Ulster
  • University of Amsterdam
  • Technical Research Centre of Finland

Our Research Objectives

The research performed by AUTH within the framework of the MUSCLE NoE includes the following objectives:

  • Recognition of human emotions in video sequences using facial expressions
  • Emotional classification of speech
  • Language modelling
  • Indexing and fingerprinting of videos using semantic information
  • Medical diagnosis from the analysis of voice characteristics

Contributions of AUTH

Emotion Recognition from Speech Based on Gender Information

Emotional speech recognition aims to automatically classify speech units (e.g., utterances) into emotional states, such as anger, happiness, neutral, sadness, and surprise. The major contribution of this work is to rate the discriminating capability of a set of features for emotional speech recognition when gender information is taken into consideration. A total of 87 features have been calculated over 500 utterances of the Danish Emotional Speech database. The class pdfs of the mean value of the pitch contour for the five emotions under study are plotted below. Note that the pdf curves are splines fitted to the discrete pdf of each class.
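As an illustration of how such pdf curves can be produced, the following Python sketch estimates the discrete pdf of one feature per class with a histogram and smooths it with a fitted spline. The variable names and the synthetic pitch values are placeholders, not the actual Danish Emotional Speech data.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def class_pdf_spline(values, n_bins=20):
        """Discrete (histogram) pdf of one feature for one class, spline-smoothed."""
        hist, edges = np.histogram(values, bins=n_bins, density=True)
        centers = 0.5 * (edges[:-1] + edges[1:])
        return UnivariateSpline(centers, hist, s=len(centers))

    # Synthetic "mean pitch" values (Hz) for two of the five emotions
    rng = np.random.default_rng(0)
    pitch_anger = rng.normal(220.0, 30.0, 100)
    pitch_sadness = rng.normal(160.0, 25.0, 100)

    grid = np.linspace(80.0, 340.0, 200)
    pdf_anger = class_pdf_spline(pitch_anger)(grid)
    pdf_sadness = class_pdf_spline(pitch_sadness)(grid)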

In order to study the classification ability of each feature, a rating method has been implemented. Each feature is evaluated by the ratio of the between-class variance σB² to the within-class variance σW². The between-class variance measures the distance between the class means, whereas the within-class variance measures the dispersion within each class. The best features should be characterized by a large σB² and a small σW². The 15 features with the highest ratio σB²/σW² are shown below, where both σB² and σW² are depicted.
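A minimal sketch of this rating, assuming a (hypothetical) feature matrix X of shape (n_samples, n_features) and a label vector y:

    import numpy as np

    def variance_ratio(X, y):
        """Rate each feature by between-class over within-class variance."""
        classes = np.unique(y)
        grand_mean = X.mean(axis=0)
        sigma2_b = np.zeros(X.shape[1])  # between-class variance per feature
        sigma2_w = np.zeros(X.shape[1])  # within-class variance per feature
        for c in classes:
            Xc = X[y == c]
            # between-class: weighted squared distance of class mean from grand mean
            sigma2_b += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
            # within-class: dispersion of samples around their own class mean
            sigma2_w += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        return sigma2_b / sigma2_w

    # The 15 best features would then be: np.argsort(variance_ratio(X, y))[::-1][:15]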

The Sequential Forward Selection (SFS) method has been used to find the 5 to 10 features that best classify the samples for each gender. The criterion used in SFS is the cross-validated correct classification rate of a Bayes classifier whose class probability density functions (pdfs) are approximated via Parzen windows or modeled as Gaussians.
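A sketch of SFS under these assumptions follows; scikit-learn's GaussianNB stands in for the Bayes classifier with Gaussian class pdfs (the actual work may have used full-covariance Gaussians):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB

    def sfs(X, y, n_features=10, cv=5):
        """Greedy Sequential Forward Selection: at each step, add the feature
        that maximizes the cross-validated correct classification rate."""
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_features and remaining:
            scores = [(cross_val_score(GaussianNB(),
                                       X[:, selected + [j]], y, cv=cv).mean(), j)
                      for j in remaining]
            best_score, best_j = max(scores)
            selected.append(best_j)
            remaining.remove(best_j)
        return selected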

When a Bayes classifier with Gaussian pdfs is employed, a correct classification rate of 61.1% is obtained for male subjects and a corresponding rate of 57.1% for female ones. In the same experiment, a random classification would result in a correct classification rate of 20%. When gender information is not considered, a correct classification score of 50.6% is obtained. The partial correct classification rate for each class is shown in the following figure.
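For the Parzen-window variant mentioned above, the class-conditional pdfs can be estimated by kernel density estimation and plugged into the Bayes rule. A minimal sketch, with scikit-learn's KernelDensity playing the role of the Parzen estimator and all names being illustrative:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    class ParzenBayes:
        """Bayes classifier whose class pdfs are Parzen-window estimates."""

        def __init__(self, bandwidth=1.0):
            self.bandwidth = bandwidth

        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.kdes_ = [KernelDensity(bandwidth=self.bandwidth).fit(X[y == c])
                          for c in self.classes_]
            self.log_priors_ = [np.log(np.mean(y == c)) for c in self.classes_]
            return self

        def predict(self, X):
            # log posterior = log prior + log Parzen estimate of the class pdf
            log_post = np.column_stack([kde.score_samples(X) + lp
                                        for kde, lp in zip(self.kdes_,
                                                           self.log_priors_)])
            return self.classes_[np.argmax(log_post, axis=1)]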

The rates reported above can be further improved by decomposing the five-class problem into two-class problems. The features that separate two classes well could be different from those that separate all five classes. By designing proper decision fusion algorithms, we may combine several two-class classifiers into an overall system that outperforms the five-class classifiers; one possible scheme is sketched below.
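One simple instance of such a scheme, assuming scikit-learn: train one two-class Gaussian Bayes classifier per pair of emotions and fuse their decisions by voting. Each pairwise classifier could also be given its own SFS-selected feature subset.

    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.naive_bayes import GaussianNB

    # 5 emotion classes -> 10 pairwise classifiers, fused by majority voting
    pairwise = OneVsOneClassifier(GaussianNB())
    # pairwise.fit(X_train, y_train); pairwise.predict(X_test)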

Automatic Detection of Vocal Fold Paralysis and Edema

In this work, we propose a combined scheme of linear prediction analysis for feature extraction, linear projection methods for feature reduction, and well-known pattern recognition methods for the purpose of discriminating between normal and pathological voice samples. Two different cases of speech under vocal fold pathology are examined: vocal fold paralysis and vocal fold edema. Three well-known classifiers are tested and compared in both cases, namely the Fisher linear discriminant, the K-nearest neighbor classifier, and the nearest mean classifier. The performance of each classifier is evaluated in terms of the probabilities of false alarm and detection, or the receiver operating characteristic. The datasets used are part of a database of disordered speech developed by the Massachusetts Eye and Ear Infirmary. The experimental results indicate that vocal fold paralysis and edema can easily be detected by any of the aforementioned classifiers.
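The evaluation protocol can be sketched as follows, with scikit-learn estimators standing in for the three classifiers and hypothetical array names (y: 0 = normal, 1 = pathological):

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_predict
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

    def compare_detectors(X, y):
        """Cross-validated detection and false-alarm probabilities."""
        classifiers = {
            "Fisher linear discriminant": LinearDiscriminantAnalysis(),
            "K-nearest neighbor (K=3)": KNeighborsClassifier(n_neighbors=3),
            "nearest mean": NearestCentroid(),
        }
        for name, clf in classifiers.items():
            y_hat = cross_val_predict(clf, X, y, cv=5)
            p_d = np.mean(y_hat[y == 1] == 1)   # probability of detection
            p_fa = np.mean(y_hat[y == 0] == 1)  # probability of false alarm
            print(f"{name}: P_D = {p_d:.2f}, P_FA = {p_fa:.2f}")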

In the first experiment, the dataset contains recordings from 21 males aged 26 to 60 years who were medically diagnosed as normal and 21 males aged 20 to 75 years who were medically diagnosed with vocal fold paralysis. In the second experiment, 21 females aged 22 to 52 years who were medically diagnosed as normal and 21 females aged 18 to 57 years who were medically diagnosed with vocal fold edema served as subjects. The subjects might suffer from other diseases too, such as hyperfunction, ventricular compression, atrophy, etc. Two different kinds of recordings were made in each session: in the first recording the patients were asked to articulate the sustained vowel Ah (/a/), and in the second one to read the Rainbow Passage. The former recording is the one used in the present work. Therefore, all procedures were applied to voiced speech frames far away from transition periods.

The feature vector extraction is performed via short-term linear prediction (LP) of order 14. An LP model of order 14 is regarded as a good choice, since it has been reported that using more than 14 LP coefficients does not significantly improve the discrimination of laryngeal diseases. The dimensionality of the feature space is then reduced by principal component analysis.
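A sketch of this feature pipeline, assuming voiced frames are already available as NumPy arrays; the LP coefficients are obtained by solving the Yule-Walker equations, and the function names are illustrative:

    import numpy as np
    from scipy.linalg import solve_toeplitz
    from sklearn.decomposition import PCA

    def lpc(frame, order=14):
        """LP coefficients of one voiced frame via the autocorrelation method."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:
                                                    len(frame) + order]
        # Yule-Walker: solve the symmetric Toeplitz system R a = r
        return solve_toeplitz(r[:order], r[1:order + 1])

    def lpc_pca_features(frames, order=14, n_components=2):
        """Order-14 LPC vector per frame, reduced to a 2-D feature space by PCA."""
        X = np.array([lpc(f, order) for f in frames])
        return PCA(n_components=n_components).fit_transform(X)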

Figure: the whole 2-D feature space for (a) the first experiment, concerning vocal fold paralysis, and (b) the second experiment, concerning vocal fold edema. (Each normal feature vector is represented by an 'o', while each pathological feature vector is represented by a '*'.)

Figure: projection onto the first principal component.

Experiments have demonstrated that efficient detection of vocal fold paralysis can be achieved by Fisher's linear discriminant, the K-nearest neighbor classifier, and the nearest mean classifier. Slightly worse results have been obtained for vocal fold edema detection. The spectral characteristics extracted by linear prediction analysis of order 14, combined with principal component analysis retaining two components for feature reduction, have proved very effective for the aforementioned classification tasks.

Related Group Publications

  1. I. Kotsia and I. Pitas, "Real time facial expression recognition from video sequences using Support Vector Machines", in Proc. of Visual Communications and Image Processing (VCIP 2005), Beijing, China, 12-15 July 2005.
  2. C. Cotsaces, N. Nikolaidis, and I. Pitas, "The use of face indicator functions for video indexing and fingerprinting", in Proc. of International Workshop on Content-Based Multimedia Indexing (CBMI 2005), Riga, Latvia, 21-23 June 2005.
  3. D. Ververidis and C. Kotropoulos, "Sequential Forward Feature Selection with Low Computational Cost", in Proc. of European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 4-8 September 2005.
  4. D. Ververidis and C. Kotropoulos, "Automatic speech classification to five emotional states based on gender information", in Proc. of 12th European Signal Processing Conference (EUSIPCO 2004), pp. 341-344, Vienna, Austria, September 2004.
  5. M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, "Automatic detection of vocal fold paralysis and edema", in Proc. of 8th Int. Conf. Spoken Language Processing (INTERSPEECH 2004), Jeju, Korea, October 2004.
  6. N. Bassiou and C. Kotropoulos, "Interpolated Distanced Bigram Language Models for Robust Word Clustering", in Proc. of IEEE International Workshop on Nonlinear Signal and Image Processing (NSIP 2005), Sapporo, Japan, 18-20 May 2005.