SciBERT: A Pretrained Language Model for Scientific Text

– NLP is important for extracting knowledge from scientific publications.
– Training deep neural models requires large amounts of labeled data.
– Annotated data in scientific domains is difficult and expensive to collect.
– Unsupervised pretraining of language models improves performance on NLP tasks.
– SciBERT is a pretrained language model based on BERT and trained on scientific text.

SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. It demonstrates statistically significant improvements over BERT on these tasks.
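
As a rough illustration of the unsupervised pretraining that SciBERT inherits from BERT, the sketch below shows masked-language-model input corruption. It assumes the scheme described for BERT: select about 15% of tokens, then replace 80% of those with [MASK], replace 10% with a random vocabulary token, and leave 10% unchanged. The function name `mask_tokens`, its parameters, and the toy vocabulary are hypothetical and are not taken from the SciBERT codebase. Whitespace tokenization stands in for the WordPiece tokenizer a real setup would use.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token sequence for masked-language-model training.

    Returns (corrupted, labels). labels[i] holds the original token at
    every position the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Select each position independently with probability mask_prob.
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK                # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


# Toy example: a whitespace-tokenized sentence and a tiny illustrative vocabulary.
sentence = "the protein binds to the receptor in vitro".split()
toy_vocab = ["protein", "receptor", "cell", "binds", "gene", "in", "the"]
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

During pretraining the model sees `corrupted` as input and is trained to predict the original token at each position where `labels` is not None. No annotated data is needed, which is why this objective suits large unlabeled scientific corpora.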

– SciBERT outperforms BERT-Base and achieves new state-of-the-art results on several tasks.
– Future work includes releasing a version of SciBERT analogous to BERT-Large.

