Continuously annotating user data is a challenge when deploying NLU techniques at scale in commercial applications. Models must be retrained and updated to keep performance at an optimal level, but the process is expensive, labour-intensive, and time-consuming. Furthermore, with rising concerns around privacy, the manual review of user data that annotation requires is far from ideal.
Researchers at Amazon and the University of Massachusetts Lowell have proposed a generative model that produces labelled synthetic data. The idea is to improve model robustness and performance by generating synthetic utterances and augmenting the original training data with them.
Synthetic augmentation with GIT
The Generative Insertion Transformer (GIT) is based on a non-autoregressive insertion transformer and extends that idea to solve the inverse NLU problem: given an annotation template, it produces a valid labelled utterance that matches the annotation.
In this generative model, the decoder generates a sequence by inserting tokens between previously generated tokens: carrier tokens are inserted between the labels in the template iteratively. The insertion at each position in the utterance is independent of every other position, and generation stops when the EOS token has been produced at all positions, yielding a fully annotated synthetic utterance that can be combined directly with real data for model building.
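The iterative insertion loop can be sketched as follows. This is a minimal illustration, not the actual GIT decoder: `predict_insertions` is a hypothetical stand-in for the trained model, using a toy rule (insert "the" before uppercase slot-value tokens) so the control flow is runnable.

```python
EOS = "<eos>"

def predict_insertions(sequence):
    """Hypothetical stand-in for the GIT decoder: for each slot between
    adjacent tokens, propose one token to insert, or EOS to stop there.
    Toy rule: insert 'the' before an uppercase slot-value token."""
    preds = []
    for left, right in zip(sequence, sequence[1:]):
        if right.isupper() and left != "the":
            preds.append("the")
        else:
            preds.append(EOS)
    return preds

def insertion_decode(template, max_steps=10):
    """Iteratively insert carrier tokens between template labels until
    every insertion slot emits EOS (all positions decided in parallel)."""
    seq = list(template)
    for _ in range(max_steps):
        preds = predict_insertions(seq)
        if all(p == EOS for p in preds):
            break  # every slot emitted EOS: the utterance is complete
        new_seq = [seq[0]]
        for pred, right in zip(preds, seq[1:]):
            if pred != EOS:
                new_seq.append(pred)  # insert carrier token into this slot
            new_seq.append(right)
        seq = new_seq
    return seq
```

With the template `["play", "SONG", "by", "ARTIST"]`, the loop inserts carrier tokens to produce `["play", "the", "SONG", "by", "the", "ARTIST"]` and then halts once all slots predict EOS.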
The process can be divided into three sections:
Pretraining: GIT is pre-trained with a BERT encoder and the KERMIT objective on an unsupervised LM task: given a sentence with masked tokens, GIT is trained to insert the missing tokens. Two pre-training configurations are tested:
- Pre-training using only English Wikipedia
- Pre-training using an internal corpus of 800M unlabeled utterances randomly sampled from de-identified Alexa requests, using English Wikipedia pre-trained models as initialization.
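The unsupervised insertion objective above can be sketched by showing how a training example is built: some tokens are dropped from a sentence, and the target records which slot of the remaining sequence each dropped token must be re-inserted into. This is an illustrative reconstruction of the task setup, not Amazon's pre-training code; the function name and fixed seed are assumptions for a reproducible sketch.

```python
import random

def make_insertion_example(tokens, drop_prob=0.3, rng=None):
    """Build one insertion-LM training example: randomly drop tokens
    and record (slot_index, token) targets for re-insertion.
    A fixed default seed keeps the sketch deterministic."""
    rng = rng or random.Random(0)
    kept, targets = [], []
    for tok in tokens:
        if rng.random() < drop_prob:
            # This token was dropped; it must be inserted into the slot
            # that follows the tokens kept so far.
            targets.append((len(kept), tok))
        else:
            kept.append(tok)
    return kept, targets
```

For example, with seed 0 the sentence `["turn", "on", "the", "lights"]` yields the visible input `["turn", "on", "the"]` and the single target `(3, "lights")`, i.e. insert "lights" after position 3.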
Fine-tuning: The pre-trained GIT model is then fine-tuned for each domain using annotated real data. For each utterance, the template is provided as model input and the complete utterance as the target output. During training, each insertion slot has multiple candidate tokens from the ground truth, unlike autoregressive generation, which entails a single token per generation step. The ground-truth distribution sets non-candidate token probabilities to 0 and weights all candidate token probabilities uniformly.
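The ground-truth distribution described above is simple to state concretely: uniform mass over the candidate tokens for a slot, zero everywhere else. A minimal sketch, with a toy vocabulary (the function name is an assumption):

```python
def target_distribution(candidates, vocab):
    """Ground-truth distribution for one insertion slot: uniform over
    the candidate tokens from the reference utterance, zero for all
    other vocabulary tokens."""
    cand = set(candidates)
    p = 1.0 / len(cand)
    return {tok: (p if tok in cand else 0.0) for tok in vocab}
```

For a slot whose reference allows "the" or "a", each candidate gets probability 0.5 and every other token gets 0, so the training loss never rewards non-candidate insertions.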
Generation: To generate synthetic data for NLU, a template is constructed that contains the desired intent, slot types, and slot values for the synthetic example. This priming sequence is provided as an input to the decoder, which inserts carrier tokens in an iterative manner to form a coherent utterance. The generation process addresses both the label projection and entity control challenges. Templates used in inference are constructed from the reduced real data.
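The priming template described above can be illustrated as follows. The exact template format Amazon uses is not specified in the article, so the `IntentName` plus `SlotType(value)` layout here is an assumption for illustration:

```python
def build_template(intent, annotated_tokens):
    """Construct a priming template from an annotation: keep the intent
    tag plus slot-value tokens, dropping carrier tokens (label 'O').
    The template format is an illustrative assumption."""
    slots = [f"{label}({tok})" for tok, label in annotated_tokens if label != "O"]
    return [intent] + slots
```

Given the annotated utterance "play hello by adele" with slots SongName("hello") and ArtistName("adele"), this produces the template `["PlayMusic", "SongName(hello)", "ArtistName(adele)"]`, which the decoder then fleshes out with carrier tokens into a coherent, fully labelled utterance.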
To study the effectiveness of synthetically generated data, NLU model performance was evaluated in a reduced-data regime. For each domain, multiple intent classification and named entity recognition (IC-NER) models are built using all real data, a reduced set of real data, and a combination of reduced real and synthetic data. All models within a domain share the same training hyper-parameters, including architecture and encoder; they differ only in training-data composition.
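The three training-data conditions can be sketched as below. The 33% fraction follows the reduced-data regime reported in the results; the function name, seed, and sampling method are assumptions for illustration:

```python
import random

def make_compositions(real_data, synthetic_data, fraction=0.33, seed=0):
    """Build the three training conditions compared in the evaluation:
    all real data, a reduced real subset, and reduced real + synthetic."""
    rng = random.Random(seed)
    reduced = rng.sample(real_data, max(1, int(len(real_data) * fraction)))
    return {
        "full_real": list(real_data),
        "reduced_real": reduced,
        "reduced_plus_synthetic": reduced + list(synthetic_data),
    }
```

Because every model in a domain shares its architecture, encoder, and hyper-parameters, any performance gap between the three conditions is attributable to the data composition alone.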
The researchers demonstrated data augmentation with GIT as a feasible generation technique to mitigate reduced annotation volumes for IC and NER tasks. NLU models trained on 33% of the real data plus synthetic data performed on par with models trained on the full real data. Further, on the domains with the highest semantic error rate (SemER) regressions, the quality of synthetic data was improved by filtering generated utterances with model confidence scores. Among domains that benefit from synthetic data, appropriate carrier-token insertion enhanced the utterances' semantics and their value as training samples. Future work targets data generation with entities replaced through knowledge-base sampling; such finer control over entities supports new feature expansion and enhances customer privacy.
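The confidence-based filtering mentioned above reduces to a simple threshold over model scores. A minimal sketch: `score_fn` stands in for the NLU model's confidence score on a synthetic utterance, and the threshold value is illustrative, not a figure reported by the researchers.

```python
def filter_by_confidence(synthetic_utterances, score_fn, threshold=0.9):
    """Keep only synthetic utterances that the NLU model labels with
    high confidence; low-confidence generations are discarded before
    augmentation. `score_fn` and `threshold` are illustrative stand-ins."""
    return [u for u in synthetic_utterances if score_fn(u) >= threshold]
```

Filtering this way trades synthetic-data volume for quality, which is why it helped most on the domains showing the largest SemER regressions.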