BERT

In this article, we give a short account of how we solved a financial sentiment classification problem using Bidirectional Encoder Representations from Transformers (BERT), improving on the previous state of the art by 15 percentage points. We also share our production setup, source code, dataset, and step-by-step guidelines, so that you can train and apply our financial sentiment classifier yourself.

Our motivation comes from Prosus’ role as a leading global technology investor. Every day we navigate an enormous volume of information, most of it unstructured text, about the sectors and companies we are interested in. In this constant stream of data, financial sentiment analysis is a vital tool for directing our analysts’ attention.

We quickly found that simple approaches such as bag-of-words fall short, because they ignore the context in which words appear. Our challenge went deeper: to understand the nuances that separate positive from negative sentiment from a financial standpoint.
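
To see why word counts alone are not enough, consider the toy example below: the two sentences carry opposite financial sentiment yet produce identical bag-of-words vectors. This is our own illustration (not from the original analysis), using scikit-learn’s CountVectorizer.

```python
# Two sentences with opposite financial sentiment but the same multiset of
# words: a bag-of-words model cannot tell them apart, because it discards
# word order and context.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "profits rose sharply while costs fell",
    "costs rose sharply while profits fell",
]

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())
print(vectors[0])                          # counts for sentence 1...
print(vectors[1])                          # ...are identical for sentence 2
print((vectors[0] == vectors[1]).all())    # True
```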

A New Chapter in Natural Language Processing

Had we faced this challenge around 2017, we would have had to design a custom model and train it from scratch, and we would have needed large amounts of accurately labelled data to reach even moderately satisfactory performance. Sourcing high-quality labelled data is no mean feat, particularly in a specialized domain such as finance, where labelling requires expertise that is not easily attainable.

Thankfully, NLP had its “ImageNet moment” in 2018. The breakthrough started with ULMFit, where researchers cracked the code on efficient transfer learning for NLP problems. The idea is elegantly simple: first, gather abundant textual data, for example from Wikipedia. Then, train a language model on this data by having it predict the next word in a sentence. Finally, fine-tune this language model for your task, possibly adding one or more task-specific layers. The beauty of this approach is that fine-tuning needs far less data, because the model already picks up the intricacies of the language during the language-model training phase. The laborious groundwork is essentially already done!
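
To make the recipe concrete, here is a minimal PyTorch sketch of the pretrain-then-fine-tune idea. It is not ULMFit’s actual AWD-LSTM implementation; the class names, layer sizes, and vocabulary size are illustrative assumptions.

```python
# Step 1: pre-train a language model on next-word prediction over a large
# unlabelled corpus. Step 2: reuse its embedding and encoder with a small
# classification head and fine-tune on a much smaller labelled dataset.
import torch.nn as nn

class LanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.next_word = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                    # (batch, seq_len)
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.next_word(hidden)                # logits over the vocabulary

class SentimentClassifier(nn.Module):
    def __init__(self, pretrained_lm, num_classes=3):
        super().__init__()
        # Reuse the pre-trained embedding and encoder; only the head is new.
        self.embed = pretrained_lm.embed
        self.encoder = pretrained_lm.encoder
        self.head = nn.Linear(self.encoder.hidden_size, num_classes)

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.head(hidden[:, -1, :])           # classify from the last state

# Pre-training minimises cross-entropy on the next token; fine-tuning
# minimises cross-entropy on the sentiment label, with far less data.
lm = LanguageModel(vocab_size=30000)
classifier = SentimentClassifier(lm)
```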

Many Language Models Followed ULMFit but Diverged in Their Training Approach, Most Notably BERT

Many subsequent language models followed in ULMFit’s footsteps while diverging in their training methodology, and BERT is the standout example. BERT is the model that brought pre-training and fine-tuning into the spotlight, and it introduced two pivotal innovations to language modelling:

  1. Transformer Architecture: BERT adopted the transformer architecture (the ‘T’ in BERT) from machine translation. Transformers outperform RNN-based architectures at capturing long-term dependencies, as showcased in this comprehensive overview.
  2. Masked Language Modeling (MLM): BERT introduced Masked Language Modeling as a training task. 15% of randomly selected tokens in a text are masked, and the model is tasked with predicting them. Because the model sees context on both sides of each mask, it becomes truly bidirectional (the ‘B’ in BERT) and builds a more complete understanding of the text. For intuitive explanations of transformers and BERT, refer to this resource; a short sketch of the masking objective follows this list.
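
Here is the promised sketch of the masking objective, using the Hugging Face transformers library purely for illustration (the original BERT pre-training code was released in TensorFlow). The example sentence and the bert-base-uncased checkpoint are our own choices, not anything prescribed by BERT itself.

```python
# Masked language modelling in two views: how 15% of tokens get masked during
# pre-training, and how a pre-trained BERT fills a mask at inference time.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling, pipeline

# During pre-training, a collator like this randomly masks 15% of the tokens;
# the model is trained to recover them from both left and right context.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# At inference time we can probe the same objective with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The company reported a sharp [MASK] in quarterly revenue."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```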

BERT quickly established itself as the front-runner, achieving unparalleled performance on various downstream tasks, including text classification and question answering. Its significance extended beyond task performance, though: it played a pivotal role in democratizing NLP. The arduous, compute-heavy pre-training (Google dedicated 16 TPUs over 4 days to pre-train BERT) only has to be done once, and that flung the door open: anyone with modest computational resources can now train a highly accurate NLP model for their niche task, building on the foundation laid by pre-trained language models.
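
As an illustration of how little is needed on top of the pre-trained model, here is a hedged sketch of fine-tuning BERT for sentiment classification with the Hugging Face Trainer. The Financial PhraseBank dataset, the bert-base-uncased checkpoint, and the hyperparameters are illustrative stand-ins, not our exact production configuration.

```python
# Fine-tune a pre-trained BERT for three-class financial sentiment
# (negative / neutral / positive) on a public dataset.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Financial PhraseBank: sentences labelled by finance professionals.
data = load_dataset("financial_phrasebank", "sentences_allagree", split="train")
data = data.train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

args = TrainingArguments(
    output_dir="bert-financial-sentiment",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,   # lets the Trainer pad batches dynamically
)
trainer.train()
```

A fine-tuning run like this typically finishes in minutes rather than days on a single GPU, which is exactly the kind of accessibility the pre-training/fine-tuning split made possible.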