🎭 Automatic Speech Emotion Recognition using TinyML (AERSUT)
A TinyML-powered system for real-time emotion recognition from speech, optimized for edge devices
Overview
AERSUT (Automatic Emotion Recognition System Using TinyML) detects and classifies emotions from speech signals using TinyML. The system classifies speech into eight emotion categories: surprise, neutral, disgust, fear, sadness, calm, happiness, and anger. The implementation focuses on running efficiently on resource-constrained edge devices while maintaining high accuracy.
🔑 Key Features
- Multi-emotion Classification: Detects 8 distinct emotional states from speech
- Advanced Feature Extraction: Utilizes MFCCs, Mel-spectrograms, ZCR, and RMS Energy
- Data Augmentation: Implements noise injection, time stretching, pitch shifting, and time shifting
- TinyML Integration: Optimized for deployment on resource-constrained devices
- Dual-Model Approach: Implements both CNN and CNN-LSTM architectures
- High Accuracy: Achieves up to 72% test accuracy on combined datasets
🏗️ System Architecture
1. Feature Extraction Pipeline
A. Mel Spectrograms
- Visual representation of speech’s temporal and spectral changes
- Captures rich emotional features
- Provides 2D feature maps for CNN processing
B. Mel Frequency Cepstral Coefficients (MFCCs)
- 13 coefficients extracted per frame
- Mel-scale transformation for human-like frequency perception
- DCT for decorrelation of filter bank energies
C. Additional Features
- Zero Crossing Rate: Measures signal noisiness and periodicity
- Root Mean Squared (RMS) Energy: Represents signal power over time (a feature-extraction sketch follows this list)
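A minimal sketch of extracting these features with librosa is shown below. The sample rate, clip duration, and the 13 × 130 MFCC shape used for padding are illustrative assumptions, not confirmed project settings.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=13, max_frames=130):
    """Extract MFCC, Mel-spectrogram, ZCR and RMS features from one clip.

    Sample rate, duration and frame count are illustrative choices.
    """
    y, sr = librosa.load(path, sr=sr, duration=3.0)

    # 13 MFCC coefficients per frame (2D feature map for the CNN input)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Mel spectrogram in dB, capturing temporal and spectral changes
    mel = librosa.feature.melspectrogram(y=y, sr=sr)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # Frame-level Zero Crossing Rate and RMS energy
    zcr = librosa.feature.zero_crossing_rate(y)
    rms = librosa.feature.rms(y=y)

    # Pad / truncate MFCCs to a fixed 13 x 130 shape for batching
    mfcc = librosa.util.fix_length(mfcc, size=max_frames, axis=1)
    return mfcc, mel_db, zcr, rms
```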
2. Model Architectures
A. CNN Model
- Input: MFCC features (13 coefficients × 130 frames)
- Convolutional Blocks: Multiple Conv1D layers with BatchNorm and ReLU
- Classification Head: Dense layers with Dropout for regularization (a minimal sketch follows)
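A minimal Keras sketch of a Conv1D model along these lines is shown below. The filter counts, kernel sizes, and dropout rate are illustrative assumptions rather than the project's exact architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(130, 13), num_classes=8):
    """Conv1D blocks (Conv -> BatchNorm -> ReLU) plus a dense head with Dropout."""
    model = models.Sequential([
        layers.Input(shape=input_shape),          # 130 frames x 13 MFCCs
        layers.Conv1D(64, 5, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same"),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.GlobalAveragePooling1D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```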
B. CNN-LSTM Model
- Combines CNN layers for feature extraction with LSTM layers for temporal modeling (see the sketch after this list)
- Training: 120 epochs with learning rate 0.00001
- Performance: 96% training accuracy, 72% test accuracy
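A sketch of one way to combine a Conv1D front end with an LSTM is shown below. The layer sizes are assumptions; the Adam learning rate of 1e-5 and the 120-epoch schedule follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(130, 13), num_classes=8):
    """Conv1D feature extractor followed by an LSTM over the frame axis."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv1D(64, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(2),
        layers.Conv1D(128, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.LSTM(64),                          # temporal modeling over frames
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    # Learning rate and epoch count taken from the description above
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model.fit(x_train, y_train, epochs=120, validation_data=(x_val, y_val))
```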
📊 Performance Comparison
| Model    | Training Accuracy | Test Accuracy |
|----------|-------------------|---------------|
| CNN      | 99%               | 67%           |
| CNN-LSTM | 96%               | 72%           |
🛠️ Technical Implementation
Data Augmentation
- Noise Injection: Adding Gaussian noise to audio signals
- Time Stretching: ±20% variation in speech rate
- Pitch Shifting: Modulating pitch by ±3 semitones
- Time Shifting: Randomly shifting the audio in time (an augmentation sketch follows this list)
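A minimal sketch of these augmentations with NumPy and librosa is shown below. The ±20% stretch range and ±3 semitone pitch range follow the list above; the noise level and shift fraction are illustrative assumptions.

```python
import numpy as np
import librosa

def add_noise(y, noise_level=0.005):
    """Inject Gaussian noise into the signal (noise level is illustrative)."""
    return y + noise_level * np.random.randn(len(y))

def time_stretch(y, low=0.8, high=1.2):
    """Vary the speech rate by up to +/-20%."""
    rate = np.random.uniform(low, high)
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr, max_steps=3):
    """Modulate pitch by up to +/-3 semitones."""
    steps = np.random.uniform(-max_steps, max_steps)
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=steps)

def time_shift(y, max_fraction=0.2):
    """Randomly roll the signal in time."""
    max_shift = int(len(y) * max_fraction)
    shift = np.random.randint(-max_shift, max_shift)
    return np.roll(y, shift)
```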
TinyML Deployment
- Optimized for edge deployment using TensorFlow Lite
- Real-time emotion classification on resource-constrained devices
- Efficient inference with a minimal memory footprint (a conversion sketch follows this list)
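A sketch of converting a trained Keras model to a size-optimized TensorFlow Lite model is shown below; post-training quantization with `Optimize.DEFAULT` is one plausible setup, not necessarily the exact conversion used here.

```python
import tensorflow as tf

def convert_to_tflite(model, out_path="emotion_model.tflite"):
    """Convert a trained Keras model to a size-optimized TFLite flatbuffer."""
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # Post-training quantization shrinks the model for edge deployment
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path

# On-device inference then uses the TFLite interpreter:
# interpreter = tf.lite.Interpreter(model_path="emotion_model.tflite")
# interpreter.allocate_tensors()
```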
📚 Dataset
The system uses a combination of two benchmark datasets (a label-parsing sketch follows the list):
1. RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
- 1,440 speech samples
- 8 emotional states
- 24 professional actors
2. CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
- 7,442 audio clips
- 91 actors with diverse demographics
- 6 emotional states
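As an illustration of working with RAVDESS labels, the sketch below maps the emotion code embedded in each filename (the third dash-separated field) to a label string. The helper function is hypothetical.

```python
import os

# RAVDESS filenames encode metadata as dash-separated fields; the third
# field is the emotion code (e.g. "03-01-05-01-02-01-12.wav" -> code "05").
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path):
    """Return the emotion label encoded in a RAVDESS filename."""
    code = os.path.basename(path).split("-")[2]
    return RAVDESS_EMOTIONS[code]
```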
Challenges & Solutions
- Challenge: Model size was too large for edge devices
  - Solution: Implemented model quantization and pruning techniques (a pruning sketch follows this list)
- Challenge: Limited training data for certain emotion classes
  - Solution: Used data augmentation and class weighting
- Challenge: Real-time performance on edge devices
  - Solution: Optimized the model architecture and used TensorFlow Lite delegates
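A sketch of how magnitude pruning could be applied with the TensorFlow Model Optimization toolkit is shown below; the polynomial sparsity schedule and 50% final sparsity are illustrative assumptions, not the project's exact configuration.

```python
import tensorflow_model_optimization as tfmot

def prune_model(model, end_step):
    """Wrap a trained Keras model with low-magnitude pruning."""
    schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5,
        begin_step=0, end_step=end_step)
    pruned = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=schedule)
    pruned.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
    return pruned

# Fine-tune with the pruning callback, then strip wrappers before export:
# callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]
# final = tfmot.sparsity.keras.strip_pruning(pruned)
```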
What I Learned
- Techniques for optimizing deep learning models for edge devices
- The importance of model quantization and pruning in TinyML
- How to handle class imbalance in emotion recognition datasets
- Best practices for deploying ML models on resource-constrained devices
Future Improvements
- Implement face detection to focus only on facial regions
- Add support for more nuanced emotion categories
- Optimize for lower-power microcontrollers (e.g., ESP32)
- Create a web interface for remote monitoring
- Implement continuous learning to improve model accuracy over time
Last modified on 2022-10-13