About the Data Science

The technical goal of TeacherPrints is to evaluate student and teacher talk time, without any consideration of the meaning of what is being said. Therefore, our approach focused on the machine learning subfield of Automatic Speech Recognition (ASR) without any further Natural Language Processing (NLP). In fact, our use case was fairly straightforward: we needed a model that could ingest an audio recording from a classroom and simply output predictions of when a teacher was speaking versus when a student was speaking.


Technical Approach

There are a couple of issues we recognized from the outset:

First, the fundamental frequencies of adult male and adult female voices are very different, so it made sense for our model to learn these two separately and combine the predictions into a single “Teacher” class in post-processing.

Second, the fundamental frequency of children’s voices decreases rapidly with age. Consequently, sometime around puberty, it becomes difficult to distinguish children’s voices from those of adults. This is especially problematic with adult female voices, whose fundamental frequencies are similar even to those of pre-pubescent children. This limits the current utility of our product to primary school classrooms, especially when the teachers are female.
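
To make the fundamental-frequency reasoning above concrete, the sketch below estimates F0 for three synthetic tones at pitches roughly typical of adult male, adult female, and child speech. The specific frequencies and the use of librosa are illustrative assumptions, not measurements from our data:

```python
import numpy as np
import librosa

# Estimate the fundamental frequency (F0) of three synthetic tones whose pitches
# are roughly typical of adult male, adult female, and child speech. The exact
# frequencies here are illustrative, not measurements from classroom audio.
sr = 16000
for speaker, pitch_hz in [("adult male", 120), ("adult female", 220), ("child", 300)]:
    y = librosa.tone(pitch_hz, sr=sr, duration=1.0)            # 1-second pure tone
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=500, sr=sr)       # frame-level F0 estimates
    print(f"{speaker}: estimated F0 ~ {np.nanmean(f0):.0f} Hz")
```

The small gap between the second and third tones hints at why adult female and child voices are the hardest pair to separate.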


Marvin: A Voice-Type Classification Model

We began our early research seeking existing open-source, neural-network-based models, given that deep learning techniques have dominated the field in recent years and delivered state-of-the-art performance. This research led us to Lavechin et al.’s voice type classifier for child-centered daylong recordings (which we affectionately refer to as “Marvin” after Lavechin’s first name), a pre-trained model that uses an open-source neural network to classify audio segments into vocalizations produced by children, male adults, and female adults. To train the model, its authors gathered BabyTrain, a diverse collection of child-centered corpora comprising 260 hours of recordings in 10 languages. The model was intended to be used by others to produce input for downstream tasks such as estimating the number of words produced by adult speakers, estimating the number of utterances produced by children, or - as in the case of TeacherPrints - identifying and visualizing speech patterns. According to the authors, it significantly outperforms the prior state-of-the-art system, the proprietary Language ENvironment Analysis (LENA), which has been used in numerous child language studies.

Marvin’s architecture combines SincNet filters with a stack of recurrent neural network (LSTM) layers. SincNet is a neural architecture for processing raw audio samples: a Convolutional Neural Network (CNN) designed to encourage the first convolutional layer to discover more meaningful filters. In contrast to standard CNNs, which learn all elements of each filter, SincNet learns only the low and high cutoff frequencies of each band-pass filter directly from the data. This offers a very compact and efficient way to derive a customized filter bank specifically tuned for the desired application. Marvin’s code mainly relies on pyannote-audio, an open-source Python toolkit for speaker diarization.
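
As a rough illustration of that idea, the sketch below implements a single learnable band-pass filter in the SincNet style, where only the two cutoff parameters are trainable. It is a simplified reading of the approach, not Marvin’s actual first layer (which lives in pyannote-audio), and the kernel size, sample rate, and initial cutoffs are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincBandpass(nn.Module):
    """A single SincNet-style band-pass filter: only the low cutoff and the
    bandwidth are learned; the filter shape is fixed by the windowed-sinc formula.
    Simplified sketch, not Marvin's actual implementation."""

    def __init__(self, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # learnable cutoff parameters in Hz (initial values are arbitrary)
        self.low_hz = nn.Parameter(torch.tensor(100.0))
        self.band_hz = nn.Parameter(torch.tensor(400.0))
        # fixed time axis and window for the impulse response
        half = kernel_size // 2
        self.register_buffer("t", torch.arange(-half, half + 1, dtype=torch.float32) / sample_rate)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, waveform):                      # waveform: (batch, 1, samples)
        low = torch.abs(self.low_hz)
        high = low + torch.abs(self.band_hz)

        def lowpass(fc):
            # ideal low-pass impulse response with cutoff fc (normalized sinc)
            return 2 * fc * torch.sinc(2 * fc * self.t)

        # band-pass = difference of two low-pass filters, tapered by a Hamming window
        kernel = (lowpass(high) - lowpass(low)) * self.window
        kernel = kernel / kernel.abs().max()          # simple normalization
        return F.conv1d(waveform, kernel.view(1, 1, -1), padding=self.kernel_size // 2)

# toy usage: filter one second of random "audio"
x = torch.randn(1, 1, 16000)
print(SincBandpass()(x).shape)                        # torch.Size([1, 1, 16000])
```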


Marvin's Limitations

At face value, Marvin was well-aligned with TeacherPrints’ purpose. However, we identified at least two concerns that required adapting the model to our use case:

The first was the data used to train Marvin. Out of the box, Marvin is trained on recordings of toddlers and pre-school-aged children in home settings. For the model to perform well on recordings of schoolchildren in classrooms, we needed it to distinguish speakers effectively in noisier environments and to properly recognize the voices of older children. To that end, we retrained Marvin on a large, limited-use dataset of recordings from public school kindergarten classrooms in Michigan, preserving the Marvin architecture and initializing our weights to its pre-trained values.
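
The sketch below shows the shape of that retraining strategy in plain PyTorch. The network, checkpoint path, and training batch are placeholder stand-ins, not the actual Marvin code, which is built on pyannote-audio:

```python
import torch
import torch.nn as nn

# Stand-in for the voice-type classifier: the real model combines SincNet with
# stacked LSTMs; here a small LSTM over precomputed features keeps the sketch short.
class VoiceTypeNet(nn.Module):
    def __init__(self, n_features=60, n_classes=3):   # child / male adult / female adult
        super().__init__()
        self.lstm = nn.LSTM(n_features, 128, num_layers=2, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, n_classes)

    def forward(self, x):                             # x: (batch, frames, features)
        out, _ = self.lstm(x)
        return self.head(out)                         # per-frame logits

model = VoiceTypeNet()

# 1) keep the architecture, initialize the weights from the pre-trained model
#    (this checkpoint is a stand-in written on the spot, not a real artifact)
torch.save(model.state_dict(), "marvin_pretrained.pt")
model.load_state_dict(torch.load("marvin_pretrained.pt", map_location="cpu"))

# 2) continue training on the kindergarten classroom recordings
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()                    # multi-label: speakers can overlap

features = torch.randn(8, 200, 60)                    # placeholder batch of feature frames
labels = torch.randint(0, 2, (8, 200, 3)).float()     # placeholder frame-level labels

logits = model(features)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```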

The second issue emerged during our EDA and initial application of Marvin to classroom data, where we discovered that Marvin sometimes confused music with speech. Because music is sometimes used in primary school classrooms, we needed a way to teach Marvin the difference between speech, music, and other non-speech sounds. Our solution was to add a second model, a Speech Activity Detection (SAD) model from pyannote-audio, trained to distinguish speech from other noise. It turned out that this model was also somewhat confused by music, so we retrained it, too, using a movie-based dataset containing various combinations of music, speech, and noise (“AVA”).
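
For orientation, this is roughly how an off-the-shelf speech activity detection pipeline is loaded and run in recent pyannote-audio releases. The pipeline name and token handling depend on the library version, and this is not the exact SAD setup we retrained on AVA:

```python
from pyannote.audio import Pipeline

# Load a pretrained voice activity detection pipeline from the Hugging Face Hub.
# "YOUR_HF_TOKEN" and the audio filename are placeholders.
sad = Pipeline.from_pretrained("pyannote/voice-activity-detection",
                               use_auth_token="YOUR_HF_TOKEN")

speech_regions = sad("classroom_recording.wav")       # returns a pyannote Annotation

for segment in speech_regions.get_timeline():         # speech-only time spans
    print(f"speech from {segment.start:.1f}s to {segment.end:.1f}s")
```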


The TeacherPrints Model

So this brings us to our current architecture: two speech processing models, working in parallel - a voice-type classifier (VTC) that distinguishes adult male, adult female, and children’s voices, and a speech activity detector (SAD) that distinguishes human speech from other sounds. We use the output of the speech activity detector as a filter on the voice-type classifier to arrive at our final model output.
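
A minimal sketch of that filtering step, using made-up frame-level scores (the arrays, frame count, and 0.5 threshold are illustrative, not project constants):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 10

vtc_scores = rng.random((n_frames, 3))        # per-frame scores: child, male adult, female adult
sad_scores = rng.random(n_frames)             # per-frame speech probability from the SAD model

speech_mask = sad_scores >= 0.5               # frames the SAD model considers human speech
filtered = vtc_scores * speech_mask[:, None]  # zero out voice-type scores on non-speech frames

# final per-frame label: most likely voice type, or -1 ("no speech") where SAD disagrees
labels = np.where(speech_mask, filtered.argmax(axis=1), -1)
print(labels)
```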


A Note About Weak Annotation

One additional modeling challenge has to do with the precision of the human annotations - essentially, who said what when. Many datasets, including our kindergarten classroom dataset, have annotation files that do not mark the pauses between child and adult speech, a practice called “weak annotation”. This means that our labels, our “ground truth”, suggest there is no silence in our data. This imprecision in the labels has consequences both for the performance of a model trained on such data and for the evaluation metrics we use to measure that performance. We implemented preprocessing steps to correct the weak annotation of our kindergarten data, and we focused on precision as the metric for evaluating model performance.
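
The toy example below illustrates why precision is the safer choice under weak annotation: when a labeled span includes an unmarked pause, a model that correctly stays silent during the pause loses recall but not precision. The frame values are invented for illustration:

```python
import numpy as np

# Frame-level view of one weakly annotated span (1 = "teacher speaking").
# The weak labels cover the whole span, including the pause at frames 2-3.
weak_labels = np.array([1, 1, 1, 1, 1, 1, 0, 0])
predictions = np.array([1, 1, 0, 0, 1, 1, 0, 0])   # model correctly skips the pause

tp = np.sum((predictions == 1) & (weak_labels == 1))
fp = np.sum((predictions == 1) & (weak_labels == 0))
fn = np.sum((predictions == 0) & (weak_labels == 1))

precision = tp / (tp + fp)   # 1.00 - unaffected by the mislabeled pause
recall = tp / (tp + fn)      # 0.67 - penalized for (correctly) predicting silence
print(precision, recall)
```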


Inference Output

The raw output of our inference conforms to the industry-standard RTTM format, which provides segments indexed by utterance, each with a start time, duration, and class label. In post-processing, we transform this raw output into a time-indexed data frame with one-hot encoding of the applicable labels. Reshaping the data this way enables us to apply window functions, identify pauses and overlaps more easily, and ultimately lets users zoom in or out on segments of special interest.
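
As an illustration of that post-processing step, the sketch below turns two invented RTTM lines into a time-indexed frame table with one-hot label columns. The class labels, segment times, and 100 ms frame step are made up for the example:

```python
import numpy as np
import pandas as pd

# Two invented RTTM segments: fields[3] is the onset, fields[4] the duration,
# and fields[7] the class label.
rttm_lines = [
    "SPEAKER class01 1 0.00 1.20 <NA> <NA> FEM <NA> <NA>",
    "SPEAKER class01 1 1.50 0.80 <NA> <NA> KCHI <NA> <NA>",
]

records = []
for line in rttm_lines:
    fields = line.split()
    onset, duration, label = float(fields[3]), float(fields[4]), fields[7]
    records.append({"onset": onset, "offset": onset + duration, "label": label})
segments = pd.DataFrame(records)

# Resample the segments onto a fixed 100 ms grid with one column per label.
step = 0.1
frames = pd.DataFrame({"time": np.arange(0.0, segments["offset"].max(), step)})
for label in segments["label"].unique():
    spans = segments[segments["label"] == label]
    frames[label] = frames["time"].apply(
        lambda t: int(((spans["onset"] <= t) & (t < spans["offset"])).any())
    )

print(frames.head())
```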