VORTEX
Voice Identification Project
Version 1.0.0
The voice identification process uses a custom-built Gaussian Mixture Model (GMM) pipeline: features are extracted from a real-time voice sample provided by the user and scored against GMM models trained on audio samples of the enrolled speakers. An additional face recognition step is combined with this to confirm the identity of the live speaker.
This project was built using Python 3.6 and the following Python libraries:
python_speech_features - to extract Mel Frequency Cepstral Coefficients (MFCC), Filterbank Energies (fbank), Log Filterbank Energies (logfbank), and Spectral Subband Centroids (SSC); see the sketch after this list.
sklearn.mixture - to build and train the GMM models.
pickle - to serialize a Python object into a byte stream so it can be stored in a file/database.
tkinter - to design a simple UI.
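For example, extracting MFCC and log filterbank features from a WAV file with python_speech_features looks roughly like this (a minimal sketch; the file name and window parameters are illustrative defaults, not necessarily the project's actual settings):

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

# Read a mono recording (hypothetical file name).
rate, signal = wav.read("speaker_sample.wav")

# 13 MFCCs per 25 ms window, hopped every 10 ms (library defaults).
mfcc_features = mfcc(signal, samplerate=rate, winlen=0.025,
                     winstep=0.01, numcep=13)

# Log filterbank energies from the same signal.
logfbank_features = logfbank(signal, samplerate=rate)

print(mfcc_features.shape)  # (num_frames, 13)
```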
A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centers of the latent Gaussians.
The GaussianMixture object implements the expectation-maximization (EM) algorithm for fitting mixture-of-Gaussians models. It can also draw confidence ellipsoids for multivariate models and compute the Bayesian Information Criterion (BIC) to assess the number of clusters in the data. A GaussianMixture.fit() method is provided that learns a Gaussian Mixture Model from training data. Given test data, it can assign to each sample the Gaussian it most probably belongs to using the GaussianMixture.predict() method.
The GaussianMixture comes with different options to constrain the covariance of the components estimated: spherical, diagonal, tied, or full covariance.
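In practice, fitting and scoring one such model with sklearn follows the pattern below (a hedged sketch: the component count, covariance type, and random stand-in data are illustrative assumptions, not the project's actual configuration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-in for a speaker's MFCC matrix, shaped (num_frames, num_coeffs).
train_features = np.random.rand(500, 13)

# covariance_type can be 'spherical', 'diag', 'tied', or 'full'.
gmm = GaussianMixture(n_components=16, covariance_type='diag',
                      max_iter=200, random_state=0)
gmm.fit(train_features)

# score() returns the average per-sample log-likelihood of new data
# under this model; higher means a better match to this speaker.
test_features = np.random.rand(100, 13)
print(gmm.score(test_features))

# BIC can guide the choice of n_components.
print(gmm.bic(train_features))
```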
After testing one model with each covariance constraint at a time, I noticed that the accuracy of each individual model was approximately 45%. That is why I combined all four covariance constraints into one model, which raised the accuracy to approximately 75-80%. The trade-off is a slower response, since each input sample is evaluated by four differently constrained models instead of one.
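A minimal sketch of how such a combination might look, assuming one GMM per covariance type per speaker, scores averaged across the four models, and pickle used to persist the trained models (the names, component count, and averaging rule are illustrative assumptions, not the project's exact code):

```python
import pickle
import numpy as np
from sklearn.mixture import GaussianMixture

COV_TYPES = ('spherical', 'diag', 'tied', 'full')

def train_speaker(features, n_components=16):
    """Fit one GMM per covariance type on a speaker's features."""
    models = []
    for cov in COV_TYPES:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type=cov,
                              max_iter=200, random_state=0)
        gmm.fit(features)
        models.append(gmm)
    return models

def identify(features, speaker_models):
    """Return the speaker whose four models give the highest
    average log-likelihood on the input features."""
    best_name, best_score = None, -np.inf
    for name, models in speaker_models.items():
        score = np.mean([gmm.score(features) for gmm in models])
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Train on stand-in data and persist with pickle (hypothetical path).
speaker_models = {"alice": train_speaker(np.random.rand(500, 13))}
with open("speaker_models.gmm", "wb") as f:
    pickle.dump(speaker_models, f)
```

Evaluating four models per speaker multiplies the scoring cost roughly fourfold, which accounts for the slower response noted above.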
Currently, this project requires at least a 10-second voice sample from each speaker to train the model. My next target is to maximize accuracy with the minimum amount of training data.
For a better understanding of the project's control flow, refer to the video below.
Project Video
A walk-through of my project