TL;DR:

In this paper, we described our submission to the BabyLM Challenge and investigated sample-efficient pretraining strategies. On the data side, we focused on improving data utilization, specifically through different batching strategies for training. Our findings indicated that the formatting of the input data can significantly improve downstream task performance. On the modelling side, we proposed part-of-speech augmentation to enrich the training signals derived from the datasets, and we showed that inducing structural biases in the model through part-of-speech trees yields modest benefits.
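As a rough illustration of what the part-of-speech augmentation mentioned above could look like in practice, here is a minimal sketch that interleaves each token with its POS tag before the text is passed on for pretraining. The tagger (NLTK), the token/TAG output format, and the function name are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: annotate pretraining text with POS tags.
# Requires NLTK plus its tokenizer/tagger resources, e.g.
#   pip install nltk
#   python -m nltk.downloader punkt averaged_perceptron_tagger
# (resource names can differ slightly across NLTK versions).
import nltk


def augment_with_pos(sentence: str) -> str:
    """Interleave each token with its POS tag, e.g. 'The/DT cat/NN sat/VBD'."""
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)  # [(token, tag), ...]
    return " ".join(f"{tok}/{tag}" for tok, tag in tagged)


if __name__ == "__main__":
    print(augment_with_pos("The child saw the dog in the park."))
    # Expected output along the lines of:
    # The/DT child/NN saw/VBD the/DT dog/NN in/IN the/DT park/NN ./.
```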

Poster: