– NLP is important for extracting knowledge from scientific publications.
– Training deep neural models requires large amounts of labeled data.
– Annotated data in scientific domains is difficult and expensive to collect.
– Unsupervised pretraining of language models improves performance on NLP tasks.
– SciBERT is a pretrained language model based on BERT, trained on scientific text.
SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks, demonstrating statistically significant improvements over BERT.
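As a rough illustration of using such a pretrained model on scientific text (not part of the summarized paper), a minimal sketch with the Hugging Face transformers library is shown below; the checkpoint name allenai/scibert_scivocab_uncased and the mean-pooling step are assumptions made for this example.

```python
# Minimal sketch: load a pretrained SciBERT checkpoint and embed a sentence.
# The model identifier below is the publicly released checkpoint on the
# Hugging Face hub; it is used here purely for illustration.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentence = "The glucose transporter GLUT1 is upregulated in tumor cells."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into a single sentence embedding.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, 768])
```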
– SciBERT outperforms BERT-Base and achieves new state-of-the-art results on several tasks.
– Future work includes releasing a version of SciBERT analogous to BERT-Large.