🎭 Automatic Speech Emotion Recognition using TinyML (AERSUT)
A TinyML-powered system for real-time emotion recognition from speech, optimized for edge devices
Overview
AERSUT (Automatic Emotion Recognition System Using TinyML) detects and classifies emotions from speech signals using TinyML. The system classifies speech into eight emotion categories: surprise, neutral, disgust, fear, sadness, calm, happiness, and anger. The implementation focuses on running efficiently on resource-constrained edge devices while maintaining high accuracy.
🔑 Key Features
- Multi-emotion Classification: Detects 8 distinct emotional states from speech
- Advanced Feature Extraction: Utilizes MFCCs, Mel-spectrograms, ZCR, and RMS Energy
- Data Augmentation: Implements noise injection, time stretching, pitch shifting, and time shifting
- TinyML Integration: Optimized for deployment on resource-constrained devices
- Dual-Model Approach: Implements both CNN and CNN-LSTM architectures
- High Accuracy: Achieves up to 72% test accuracy on combined datasets
🏗️ System Architecture
1. Feature Extraction Pipeline
A. Mel Spectrograms
- Visual representation of speech’s temporal and spectral changes
- Captures rich emotional features
- Provides 2D feature maps for CNN processing
B. Mel Frequency Cepstral Coefficients (MFCCs)
- 13 coefficients extracted per frame
- Mel-scale transformation for human-like frequency perception
- DCT for decorrelation of filter bank energies
C. Additional Features
- Zero Crossing Rate: Measures signal noisiness and periodicity
- Root Mean Squared (RMS) Energy: Represents signal power over time (a feature-extraction sketch follows this list)
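A minimal sketch of extracting these features with librosa is shown below. The sample rate, clip duration, and the 13 × 130 MFCC shape used for padding are illustrative assumptions, not confirmed project settings.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=13, max_frames=130):
    """Extract MFCC, Mel-spectrogram, ZCR and RMS features from one clip.

    Sample rate, duration and frame count are illustrative choices.
    """
    y, sr = librosa.load(path, sr=sr, duration=3.0)

    # 13 MFCC coefficients per frame (2D feature map for the CNN input)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Mel spectrogram in dB, capturing temporal and spectral changes
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Frame-level Zero Crossing Rate and RMS energy
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)

    # Pad / truncate MFCCs to a fixed 13 x 130 shape for batching
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    return mfcc, mel_db, zcr, rms
```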
2. Model Architectures
A. CNN Model
- Input: MFCC features (13 coefficients × 130 frames)
- Convolutional Blocks: Multiple Conv1D layers with BatchNorm and ReLU
- Classification Head: Dense layers with Dropout for regularization (a minimal sketch follows)
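A minimal Keras sketch of a Conv1D model along these lines is shown below. The filter counts, kernel sizes, and dropout rate are illustrative assumptions rather than the project's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(130, 13), num_classes=8):
    """Conv1D blocks (Conv -> BatchNorm -> ReLU) plus a dense head with Dropout."""
    model = models.Sequential([
        layers.Input(shape=input_shape),          # 130 frames x 13 MFCCs
        layers.Conv1D(64, 5, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.GlobalAveragePooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```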
B. CNN-LSTM Model
- Combines CNN layers for feature extraction with LSTM layers for temporal modeling (see the sketch after this list)
- Training: 120 epochs with learning rate 0.00001
- Performance: 96% training accuracy, 72% test accuracy
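A sketch of one way to combine a Conv1D front end with an LSTM is shown below. The layer sizes are assumptions; the Adam learning rate of 1e-5 and the 120-epoch schedule follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(130, 13), num_classes=8):
    """Conv1D feature extractor followed by an LSTM over the frame axis."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.LSTM(64),                          # temporal modeling over frames
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Learning rate and epoch count taken from the description above
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, epochs=120, validation_data=(x_val, y_val))
```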
📊 Performance Comparison
| Model    | Training Accuracy | Test Accuracy |
|----------|-------------------|---------------|
| CNN      | 99%               | 67%           |
| CNN-LSTM | 96%               | 72%           |
🛠️ Technical Implementation
Data Augmentation
- Noise Injection: Adding Gaussian noise to audio signals
- Time Stretching: ±20% variation in speech rate
- Pitch Shifting: Modulating pitch by ±3 semitones
- Time Shifting: Randomly shifting the audio in time (an augmentation sketch follows this list)
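A minimal sketch of these augmentations with NumPy and librosa is shown below. The ±20% stretch range and ±3 semitone pitch range follow the list above; the noise level and shift fraction are illustrative assumptions.

```python
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    """Inject Gaussian noise into the signal (noise level is illustrative)."""
    return y + noise_level * np.random.randn(len(y))

def time_stretch(y, low=0.8, high=1.2):
    """Vary the speech rate by up to +/-20%."""
    rate = np.random.uniform(low, high)
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr, max_steps=3):
    """Modulate pitch by up to +/-3 semitones."""
    steps = np.random.uniform(-max_steps, max_steps)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)

def time_shift(y, max_fraction=0.2):
    """Randomly roll the signal in time."""
    max_shift = int(len(y) * max_fraction)
    shift = np.random.randint(-max_shift, max_shift)
    return np.roll(y, shift)
```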
TinyML Deployment
- Optimized for edge deployment using TensorFlow Lite
- Real-time emotion classification on resource-constrained devices
- Efficient inference with a minimal memory footprint (a conversion sketch follows this list)
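A sketch of converting a trained Keras model to a size-optimized TensorFlow Lite model is shown below; post-training quantization with `Optimize.DEFAULT` is one plausible setup, not necessarily the exact conversion used here.

```python
import tensorflow as tf

def convert_to_tflite(model, out_path="emotion_model.tflite"):
    """Convert a trained Keras model to a size-optimized TFLite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Post-training quantization shrinks the model for edge deployment
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path

# On-device inference then uses the TFLite interpreter:
# interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
# interpreter.allocate_tensors()
```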
📚 Dataset
The system uses a combination of two benchmark datasets (a label-parsing sketch follows the list):
1. RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
- 1,440 speech samples
- 8 emotional states
- 24 professional actors
2. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
- 7,442 audio clips
- 91 actors with diverse demographics
- 6 emotional states
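As an illustration of working with RAVDESS labels, the sketch below maps the emotion code embedded in each filename (the third dash-separated field) to a label string. The helper function is hypothetical.

```python
import os

# RAVDESS filenames encode metadata as dash-separated fields; the third
# field is the emotion code (e.g. "03-01-05-01-02-01-12.wav" -> code "05").
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path):
    """Return the emotion label encoded in a RAVDESS filename."""
    code = os.path.basename(path).split("-")[2]
    return RAVDESS_EMOTIONS[code]
```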
Challenges & Solutions
- Challenge: Model size was too large for edge devices
  - Solution: Implemented model quantization and pruning techniques (a pruning sketch follows this list)
- Challenge: Limited training data for certain emotion classes
  - Solution: Used data augmentation and class weighting
- Challenge: Real-time performance on edge devices
  - Solution: Optimized the model architecture and used TensorFlow Lite delegates
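A sketch of how magnitude pruning could be applied with the TensorFlow Model Optimization toolkit is shown below; the polynomial sparsity schedule and 50% final sparsity are illustrative assumptions, not the project's exact configuration.

```python
import tensorflow_model_optimization as tfmot

def prune_model(model, end_step):
    """Wrap a trained Keras model with low-magnitude pruning."""
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=end_step)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=schedule)
    pruned.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
    return pruned

# Fine-tune with the pruning callback, then strip wrappers before export:
# callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# final = tfmot.sparsity.keras.strip_pruning(pruned)
```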
What I Learned
- Techniques for optimizing deep learning models for edge devices
- The importance of model quantization and pruning in TinyML
- How to handle class imbalance in emotion recognition datasets
- Best practices for deploying ML models on resource-constrained devices
Future Improvements
- Implement face detection to focus only on facial regions
- Add support for more nuanced emotion categories
- Optimize for lower-power microcontrollers (e.g., ESP32)
- Create a web interface for remote monitoring
- Implement continuous learning to improve model accuracy over time
Last modified on 2022-10-13