🎭 Automatic Speech Emotion Recognition using TinyML (AERSUT)
A TinyML-powered system for real-time emotion recognition from speech, optimized for edge devices

License: MIT Python 3.8+

Overview

AERSUT (Automatic Emotion Recognition System Using TinyML) detects and analyzes emotions in speech signals with TinyML. It classifies speech into eight emotion categories: surprised, neutral, disgusted, fearful, sad, calm, happy, and angry. The implementation focuses on running efficiently on resource-constrained devices while maintaining high accuracy.

🔑 Key Features

  • Multi-emotion Classification: Detects 8 distinct emotional states from speech
  • Advanced Feature Extraction: Utilizes MFCCs, Mel-spectrograms, ZCR, and RMS Energy
  • Data Augmentation: Implements noise injection, time stretching, pitch shifting, and time shifting
  • TinyML Integration: Optimized for deployment on resource-constrained devices
  • Dual-Model Approach: Implements both CNN and CNN-LSTM architectures
  • High Accuracy: Achieves up to 72% test accuracy on combined datasets

🏗️ System Architecture

1. Feature Extraction Pipeline

A. Mel Spectrograms

  • Visual representation of speech’s temporal and spectral changes
  • Captures rich emotional features
  • Provides 2D feature maps for CNN processing
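A minimal sketch of log-Mel spectrogram extraction with librosa; the file path, sampling rate, and frame parameters below are placeholders, not necessarily the exact settings used in this project:

```python
import numpy as np
import librosa

# Load a short speech clip (path and duration are placeholders)
y, sr = librosa.load("speech_sample.wav", sr=22050, duration=3.0)

# Mel spectrogram: 2D map of energy over (mel bands x frames), suitable as CNN input
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)   # log scale to better match perceived loudness
print(log_mel.shape)                             # e.g. (128, ~130) for a 3 s clip
```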

B. Mel Frequency Cepstral Coefficients (MFCCs)

  • 13 coefficients extracted per frame
  • Mel-scale transformation for human-like frequency perception
  • DCT for decorrelation of filter bank energies
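A matching sketch for the MFCC path; librosa.feature.mfcc applies the Mel filter bank, log, and DCT internally, and the padding to 130 frames is an assumption made so every clip maps onto the 13 × 130 model input described below:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050, duration=3.0)  # placeholder path

# 13 MFCCs per frame (DCT of log Mel filter-bank energies)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)                 # shape: (13, n_frames)

# Pad or truncate to a fixed 130 frames -> (13, 130) input for the models
target_frames = 130
if mfcc.shape[1] < target_frames:
    mfcc = np.pad(mfcc, ((0, 0), (0, target_frames - mfcc.shape[1])))
else:
    mfcc = mfcc[:, :target_frames]
```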

C. Additional Features

  • Zero Crossing Rate: Measures signal noisiness and periodicity
  • Root Mean Squared Energy: Represents signal power over time
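A short sketch of these two frame-level features with librosa (the frame and hop sizes are assumptions):

```python
import librosa

y, sr = librosa.load("speech_sample.wav", sr=22050, duration=3.0)   # placeholder path

# Zero Crossing Rate: fraction of sign changes per frame (noisiness/periodicity cue)
zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048, hop_length=512)  # (1, n_frames)

# Root Mean Squared Energy: signal power per frame
rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=512)               # (1, n_frames)
```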

2. Model Architectures

A. CNN Model

  • Input: MFCC features (13 coefficients × 130 frames)
  • Convolutional Blocks: Multiple Conv1D layers with BatchNorm and ReLU
  • Classification Head: Dense layers with Dropout for regularization
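A minimal Keras sketch of a 1D CNN along these lines; the number of blocks, filter widths, kernel sizes, and dropout rate are assumptions rather than the exact published architecture, and the MFCC input is arranged time-major as (130 frames, 13 coefficients):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(130, 13), num_classes=8):
    """Conv1D + BatchNorm + ReLU blocks over MFCC frames, followed by a dense head."""
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for filters in (64, 128, 256):                     # three convolutional blocks (widths assumed)
        x = layers.Conv1D(filters, kernel_size=5, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.4)(x)                         # regularization in the classification head
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

cnn = build_cnn()
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```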

B. CNN-LSTM Model

  • Combines CNN for feature extraction with LSTM for temporal modeling
  • Training: 120 epochs with learning rate 0.00001
  • Performance: 96% training accuracy, 72% test accuracy
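A corresponding CNN-LSTM sketch that reuses the training settings stated above (120 epochs, learning rate 1e-5); the layer sizes themselves are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(130, 13), num_classes=8):
    """Conv1D blocks extract local features; an LSTM models their temporal evolution."""
    inputs = tf.keras.Input(shape=input_shape)
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(128, 5, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(2)(x)
    x = layers.LSTM(128)(x)                            # temporal summary of the conv features
    x = layers.Dropout(0.4)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

cnn_lstm = build_cnn_lstm()
cnn_lstm.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),   # learning rate from the text
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# cnn_lstm.fit(X_train, y_train, epochs=120, validation_data=(X_val, y_val))  # 120 epochs per the text
```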

📊 Performance Comparison

Model      Training Accuracy   Test Accuracy
CNN        99%                 67%
CNN-LSTM   96%                 72%

🛠️ Technical Implementation

Data Augmentation

  • Noise Injection: Adding Gaussian noise to audio signals
  • Time Stretching: ±20% variation in speech rate
  • Pitch Shifting: Modulating pitch by ±3 semitones
  • Time Shifting: Randomly shifting audio in time
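A sketch of the four augmentations with NumPy and librosa; the stretch and pitch ranges follow the figures above, while the noise level and time-shift bound are assumptions:

```python
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise scaled by noise_factor (value assumed)."""
    return y + noise_factor * np.random.randn(len(y))

def stretch(y, rate=None):
    """Speed the speech up or slow it down by up to ±20%."""
    rate = np.random.uniform(0.8, 1.2) if rate is None else rate
    return librosa.effects.time_stretch(y, rate=rate)

def shift_pitch(y, sr, n_steps=None):
    """Shift the pitch by up to ±3 semitones."""
    n_steps = np.random.uniform(-3.0, 3.0) if n_steps is None else n_steps
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

def shift_time(y, sr, max_shift_s=0.2):
    """Roll the signal in time by a random offset (bound assumed)."""
    shift = np.random.randint(-int(max_shift_s * sr), int(max_shift_s * sr))
    return np.roll(y, shift)
```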

TinyML Deployment

  • Optimized for edge deployment using TensorFlow Lite
  • Real-time emotion classification on resource-constrained devices
  • Efficient inference with minimal memory footprint
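A minimal sketch of the TensorFlow Lite conversion and on-device style inference; the dynamic-range quantization flag and the small stand-in model below are assumptions about the deployment recipe:

```python
import numpy as np
import tensorflow as tf

# Stand-in for the trained Keras emotion model (in practice, reuse the model trained above)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(130, 13)),
    tf.keras.layers.Conv1D(32, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(8, activation="softmax"),
])

# Convert with post-training dynamic-range quantization to shrink the model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open("emotion_model.tflite", "wb") as f:
    f.write(tflite_model)

# Inference with the TFLite interpreter, as it would run on the device
interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
features = np.zeros(inp["shape"], dtype=np.float32)        # placeholder MFCC batch
interpreter.set_tensor(inp["index"], features)
interpreter.invoke()
probs = interpreter.get_tensor(out["index"])               # 8-way emotion probabilities
```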

📚 Dataset

The system uses a combination of two benchmark datasets:

1. RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)

  • 1,440 speech samples
  • 8 emotional states
  • 24 professional actors

2. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)

  • 7,442 audio clips
  • 91 actors with diverse demographics
  • 6 emotional states
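One straightforward way to label the combined corpus is to parse the two datasets' filename conventions; the sketch below reflects the standard RAVDESS and CREMA-D naming schemes and is an assumption about how this project derives its labels:

```python
from pathlib import Path

# RAVDESS: emotion is the third hyphen-separated field,
# e.g. "03-01-06-01-02-01-12.wav" -> "06" (fearful)
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# CREMA-D: emotion is the third underscore-separated field,
# e.g. "1001_DFA_ANG_XX.wav" -> "ANG"
CREMA_EMOTIONS = {
    "ANG": "angry", "DIS": "disgust", "FEA": "fearful",
    "HAP": "happy", "NEU": "neutral", "SAD": "sad",
}

def label_from_filename(path: str) -> str:
    """Map a RAVDESS or CREMA-D filename to its emotion label."""
    name = Path(path).stem
    if "-" in name:                                   # RAVDESS-style name
        return RAVDESS_EMOTIONS[name.split("-")[2]]
    return CREMA_EMOTIONS[name.split("_")[2]]         # CREMA-D-style name
```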

Challenges & Solutions

  1. Challenge: Model size was too large for edge devices
    • Solution: Implemented model quantization and pruning techniques
  2. Challenge: Limited training data for certain emotion classes
    • Solution: Used data augmentation and class weighting (see the sketch after this list)
  3. Challenge: Real-time performance on edge devices
    • Solution: Optimized the model architecture and used TensorFlow Lite delegates
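For the class-imbalance point above, a minimal sketch of class weighting with scikit-learn, using a hypothetical integer label array `y_train`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0, 0, 0, 1, 2, 2, 3])             # hypothetical integer emotion labels

# Weight each class inversely to its frequency so under-represented emotions count more in the loss
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = {int(c): w for c, w in zip(classes, weights)}

# cnn_lstm.fit(X_train, y_train, epochs=120, class_weight=class_weight)
```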

What I Learned

  • Techniques for optimizing deep learning models for edge devices
  • The importance of model quantization and pruning in TinyML
  • How to handle class imbalance in emotion recognition datasets
  • Best practices for deploying ML models on resource-constrained devices

Future Improvements

  • Extend to multimodal emotion recognition by combining speech with facial expression cues
  • Add support for more nuanced emotion categories
  • Optimize for lower-power microcontrollers (e.g., ESP32)
  • Create a web interface for remote monitoring
  • Implement continuous learning to improve model accuracy over time

Last modified on 2022-10-13