There is a lively debate about reducing the carbon footprint of software development, especially when building Machine Learning (ML) models. The question, however, goes far beyond mere optimization, as many other factors (social, economic, and environmental) should be taken into consideration. Moreover, there are some less obvious scenarios, such as using NLP for air quality prediction. In this article we investigate the problem from an NLP researcher’s perspective and see how we, as engineers, can contribute to building a greener internet through small but consistent efforts. Up we go!
Computationally hungry NLP
If we look at the history of Natural Language Processing (NLP), we can see how computational complexity has increased over the past few decades. Obviously, one may ask why the complexity should grow when we still speak the same language. We will cover this question in the following section, but first, a little bit of theory.
The main goal of any NLP project is to extract useful insights from textual data. We all speak and we all listen, so language is one of the most important parts of our everyday lives. Not surprisingly, business owners want to get the most out of textual information, as a great deal of data comes in verbal form (reviews, comments, etc.). Moreover, even graphical assets may be converted into text. So, what’s the problem with NLP? In any machine learning problem, one of the most crucial tasks is representing your data in a machine-readable format. For instance, if you are building a scoring model for the credit department of a bank, you may get such user parameters as yearly income, marital status, number of children, number of debts, etc. Almost all of these parameters are numeric, and the few that are not can easily be converted into numbers. For example, marital status may be single, married, or divorced, so we create three columns and put 1s or 0s in the corresponding cells. And, as our ML model is a function that accepts a numeric vector as input, we can simply feed a batch of vectorized user profiles into the engine and let the algorithm do the magic.
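To make the marital status example concrete, here is a minimal sketch of turning one applicant profile into a numeric vector. The field names, the category order, and the sample values are assumptions made up for illustration; a real scoring pipeline would likely use a library encoder instead.

```python
# Hypothetical category list taken from the example in the text.
MARITAL_STATUSES = ["single", "married", "divorced"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def vectorize_profile(profile):
    """Turn one applicant profile (a dict) into a flat numeric vector."""
    numeric = [profile["yearly_income"], profile["children"], profile["debts"]]
    return numeric + one_hot(profile["marital_status"], MARITAL_STATUSES)


# Invented sample applicant.
applicant = {
    "yearly_income": 42000,
    "children": 2,
    "debts": 1,
    "marital_status": "married",
}
print(vectorize_profile(applicant))  # [42000, 2, 1, 0, 1, 0]
```

The three trailing columns correspond to single, married, and divorced, so exactly one of them is set to 1 for each applicant.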
However, with NLP the task is not that trivial. How can we convert a text into a set of numbers? What is really fascinating about NLP is that there is no single correct answer. Feature extraction has always been something of an art, and the vectorization technique varies from task to task, making the process more craft than science. Summarizing all the methods, we can define three large categories.
Until 2013, vectorization techniques were based on pure statistics and could loosely be described as “one-hot encoding”. These methods include n-grams, tf-idf, Bag-of-Words, hashing algorithms, etc. The features were defined by the corpus itself and, depending on the algorithm, the words were converted to a numerical vector based on their frequency and importance. Suppose you have a set of 160k tweets, and you use bigrams and the MurmurHash3 hashing algorithm to vectorize your data. This generates a collection of 40k features, but to avoid the curse of dimensionality you apply a feature reduction technique (for instance, one based on the chi-squared test) to keep only the 5k most important features, which are then fed into the model (for example, an SVM classifier).
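The tweet pipeline described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a production recipe: `zlib.crc32` stands in for MurmurHash3 (which is not in the standard library), the bucket count and the four sample tweets are invented, the chi-squared score is computed on a simple present/absent table, and the final SVM step is left out.

```python
import zlib
from collections import Counter

N_BUCKETS = 1024  # hypothetical size; the example in the text yields 40k features


def bigrams(text):
    """Split on whitespace and return adjacent word pairs."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def hash_vectorize(text, n_buckets=N_BUCKETS):
    """Hashing trick: map each bigram to a bucket index and count occurrences.

    zlib.crc32 is a deterministic stand-in for MurmurHash3 here.
    """
    counts = Counter()
    for gram in bigrams(text):
        counts[zlib.crc32(gram.encode("utf-8")) % n_buckets] += 1
    return counts  # sparse vector: {bucket index: count}


def chi2_score(vectors, labels, feature):
    """Chi-squared statistic of the 2x2 table: feature present/absent vs. label 1/0."""
    a = sum(1 for v, y in zip(vectors, labels) if feature in v and y == 1)
    b = sum(1 for v, y in zip(vectors, labels) if feature in v and y == 0)
    c = sum(1 for v, y in zip(vectors, labels) if feature not in v and y == 1)
    d = sum(1 for v, y in zip(vectors, labels) if feature not in v and y == 0)
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_top_k(vectors, labels, k):
    """Keep the k bucket indices with the highest chi-squared score."""
    features = {f for v in vectors for f in v}
    ranked = sorted(features, key=lambda f: chi2_score(vectors, labels, f), reverse=True)
    return set(ranked[:k])


# Invented toy data: 1 = positive, 0 = negative.
tweets = ["i love this phone", "i love this song", "i hate this phone", "i hate this weather"]
labels = [1, 1, 0, 0]

vectors = [hash_vectorize(t) for t in tweets]
kept = select_top_k(vectors, labels, k=2)
reduced = [{f: c for f, c in v.items() if f in kept} for v in vectors]
print(reduced)  # reduced sparse vectors, ready for a classifier
```

Because the hash maps any bigram to a fixed index range, no vocabulary has to be stored, which is part of what makes this family of methods so cheap to run.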
In 2013, a group of researchers at Google, led by Tomas Mikolov, proposed a novel technique called word2vec. In their paper they introduced the word embeddings approach, which converts text into dense vectors. Suppose you are training a sentiment classifier for online reviews. You have two classes, positive and negative, and each review is composed of words. For instance, the word “love” is close to the positive class, so you assign 0.9 to this token. The word “disappointed” is closer to the negative class, so you assign -0.8 to this term. The term “alright” is somewhere between positive and negative, so you could set its value to 0.2. But how can we obtain these values? The embeddings are trained beforehand and downloaded as a single file that is later used for feature extraction. As these embeddings do not change during vectorization, you want them to cover as many different words as possible. That’s why huge datasets, such as Google News, Wikipedia, books, and scientific papers, are used to train good embeddings. Not surprisingly, training such embeddings requires time and computational power, and it has to be done for each language separately. However, the advantage of this approach is better classification, as the embeddings capture semantic relationships between words and reduce the impact of function words (to, this, it, of, etc.), which are used more often than meaningful terms. Besides word2vec (with its CBOW and skip-gram variants), such techniques include GloVe.
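As a rough illustration of how a pre-trained embeddings file is used at feature extraction time, the sketch below averages the vectors of the known words in a review. The lookup table is a toy stand-in: the first coordinate reuses the values from the example above, while the second coordinate and the sample sentence are invented. Real embeddings typically have hundreds of dimensions.

```python
# Toy lookup table standing in for a downloaded embeddings file.
# First coordinate: values from the text; second coordinate: invented.
EMBEDDINGS = {
    "love": [0.9, 0.1],
    "disappointed": [-0.8, 0.3],
    "alright": [0.2, 0.0],
}


def embed_review(text, table=EMBEDDINGS):
    """Average the vectors of known words; words missing from the table are skipped."""
    vectors = [table[word] for word in text.lower().split() if word in table]
    dims = len(next(iter(table.values())))
    if not vectors:
        return [0.0] * dims  # nothing recognized: fall back to a zero vector
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)]


print(embed_review("I love it but the ending was alright"))
```

Words that are missing from the table contribute nothing to the result, which is why broad vocabulary coverage matters so much for static embeddings.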
Word embeddings dramatically improved the quality of text classification, but as there is always something to improve, in 2018 the Google research team introduced a new approach based on the transformer architecture, called BERT (Bidirectional Encoder Representations from Transformers), which builds contextual representations. Unlike word2vec, BERT analyzes the context in both directions (left and right), which results in more meaningful representations. Consequently, in the two phrases “I run a marathon” and “I run a company”, BERT will assign different values to the word “run”, as its meaning changes depending on the context. Besides improving the quality of classification, this approach set a trend towards pre-trained language representations with task-agnostic architectures. This paradigm has led to substantial progress on many NLP tasks and has inspired new architectures and algorithms with multiple layers of representation and contextual state. Naturally, these solutions required serious computational power, so each new algorithm was a compromise between efficiency and the desired quality. For instance, even in 2018 there were models more powerful than BERT, but their cost made them commercially inefficient. BERT, in turn, achieved better results at an affordable cost. According to Devlin et al. (2019), a base BERT model with 110M parameters required 96 hours of training on 16 TPU chips. Whereas previous approaches mainly addressed basic problems like classification or topic modeling, researchers now focused on such challenging tasks as reading comprehension, question answering, and textual entailment.
In 2020, thanks to advances in hardware optimization, engineers succeeded in training enormous models with greater capacity. The first breakthrough was Microsoft’s Turing-NLG model, which had 17 billion (!) parameters and was able to perform such sophisticated tasks as text summarization and question answering.
Today, researchers try to create language models that come as close to the human brain as possible. A huge advantage of BERT was the ability to fine-tune a model for a specific task by providing thousands of examples (instead of millions), so-called “transfer learning”. Now we try to imitate the human brain, which does not require large supervised datasets to learn language tasks, just a brief directive in natural language (“tell me if the sentence describes something joyful or sad”, etc.). This concept is called “meta-learning” and is more attractive for engineers, as it doesn’t require any training data for fine-tuning. To achieve this goal, researchers from OpenAI trained a 175-billion-parameter autoregressive language model called GPT-3. This model demonstrated promising results in the zero-shot and one-shot settings (i.e., with no examples or one example provided). Training such a monstrous model took 3640 petaflop/s-days on V100 GPUs that were part of a high-bandwidth cluster provided by Microsoft.
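To get a feeling for what 3640 petaflop/s-days means, we can convert it into a raw count of floating-point operations. One petaflop/s-day is a rate of 10^15 operations per second sustained for one day; the function name below is made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
PETAFLOP_PER_SECOND = 1e15       # floating-point operations per second


def petaflop_days_to_operations(pf_days):
    """Convert petaflop/s-days into a total number of floating-point operations."""
    return pf_days * PETAFLOP_PER_SECOND * SECONDS_PER_DAY


total = petaflop_days_to_operations(3640)
print(f"{total:.2e} floating-point operations")  # 3.14e+23
```

In other words, a machine sustaining one petaflop per second would need roughly ten years (3640 days) to perform the same amount of computation.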
Efficiency vs Quality
As you may have noticed, NLP models are getting more sophisticated and closer to human performance. This is due to scientific breakthroughs and more memory-efficient algorithms, but also to more powerful hardware. Training such models consumes a lot of energy, and this energy does not always come from carbon-neutral sources. Consequently, it results in higher CO2 emissions and significant computational cost. There is an interesting study by Strubell et al. (2019) at the University of Massachusetts on the environmental effect of running NLP pipelines, in terms of CO2 emissions and estimated cost. The authors estimated the energy in kilowatt-hours required to train a variety of state-of-the-art NLP models, and then converted this value to approximate carbon emissions and electricity costs.
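The conversion can be sketched in a few lines. The structure follows the approach described above (power draw times training time, scaled by data-center overhead, then multiplied by an emission factor). The two constants are the values I recall from Strubell et al. (a power usage effectiveness of 1.58 and 0.954 lb of CO2 per kWh) and should be treated as assumptions to verify against the paper; the hardware wattages, the electricity price, and the training time are purely hypothetical.

```python
PUE = 1.58               # assumed data-center power usage effectiveness
LBS_CO2_PER_KWH = 0.954  # assumed average emission factor (lb CO2 per kWh)


def total_energy_kwh(hours, cpu_watts, ram_watts, gpu_watts, n_gpus, pue=PUE):
    """Energy drawn during training, including data-center overhead, in kWh."""
    watts = cpu_watts + ram_watts + n_gpus * gpu_watts
    return pue * hours * watts / 1000.0


def co2_lbs(energy_kwh, factor=LBS_CO2_PER_KWH):
    """Approximate CO2 emissions in pounds for a given energy consumption."""
    return factor * energy_kwh


def electricity_cost(energy_kwh, usd_per_kwh):
    """Electricity bill for the run; the price per kWh is supplied by the caller."""
    return energy_kwh * usd_per_kwh


# Hypothetical training run: 100 hours on 8 GPUs drawing 250 W each.
energy = total_energy_kwh(hours=100, cpu_watts=100, ram_watts=50, gpu_watts=250, n_gpus=8)
print(f"energy: {energy:.1f} kWh")
print(f"CO2:    {co2_lbs(energy):.1f} lb")
print(f"cost:   {electricity_cost(energy, usd_per_kwh=0.12):.2f} USD")
```

Even this simplified version shows that emissions scale linearly with training time and with the number of accelerators, so every avoided training run translates directly into saved energy.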
As can be seen from the table, the more parameters a model has, the higher its computational cost. Naturally, one may suggest that we stop increasing the complexity of NLP models, as the damage to the environment outweighs the positive impact of a better language model. But let us first study this issue from different points of view before rushing to any conclusions.
Equitable access for everyone
Running complicated models is not feasible for all researchers because of their high cost and specific technical requirements. This promotes inequality among laboratories, as those with better financing have more chances to achieve better results. To avoid the problematic “rich get richer” cycle, there exist multiple initiatives, for instance Microsoft for Startups. It is a program that helps startups grow, create, and expand their network by supporting them through a technical and business partnership, including free Azure credits. Thus, engineers can build, run, and manage applications across multiple clouds with the tools and frameworks of their choice.
Moreover, when relying on cloud infrastructure, it may be useful to apply frameworks for environmentally friendly development, such as Green Ops.
Green Ops is a methodology for optimizing the ecological impact of companies. It gives companies a method and indicators for reducing the ecological footprint of their cloud computing consumption. The main goals of this methodology are to:
Devise a Cloud Best Practices Framework to reduce the environmental impact of cloud infrastructures while respecting the transformation objectives of companies
Develop an audit tool to identify cloud assets that require change to reduce the impact of the existing infrastructure
Deliver an eco-score to encourage the company in its approach
Thus, the purpose of the framework is to systematize the sustainable approach by integrating it into project ceremonies.
However, for non-profit educational institutions, relying on cloud compute services such as AWS, Google Cloud, or Microsoft Azure may not be an option, or may be less cost-effective than building their own compute centers. Consequently, other approaches to compute optimization may be applied.
‘Green’ Mathematics
As a Data Scientist, I find it more interesting to extract relevant, task-specific features than to rely on huge models. For instance, if you are working on a scenario involving a limited vocabulary, like product descriptions, simple hashing-based encoders such as MurmurHash3 may give surprisingly good results while requiring far fewer computational resources. Another example is Deceptive Opinion Spam, where contextual encoders like BERT or ELMo do not provide the needed results, because the task concerns reviews that were intentionally written to sound authentic. Instead of applying such state-of-the-art models, engineers define their own features, like Linguistic Inquiry and Word Count (LIWC) or stylometric features, and obtain better results at a lower computational cost.
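To show how lightweight such hand-crafted features can be, here is a minimal sketch of a stylometric feature extractor. The chosen features, the pronoun list, and the sample review are assumptions picked for illustration; they are not the LIWC dictionary and are not taken from any particular deception detection study.

```python
import re

# Hypothetical pronoun list; real feature sets are far richer.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}


def stylometric_features(text):
    """Compute a few cheap style indicators for one review."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words) or 1  # avoid division by zero on empty input
    return {
        "n_words": len(words),
        "avg_word_len": sum(len(w) for w in words) / n,
        "type_token_ratio": len(set(words)) / n,
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n,
        "exclamations": text.count("!"),
    }


# Invented sample review.
feats = stylometric_features("My husband and I loved it! We loved the room!")
print(feats)
```

Computing these values takes a single pass over the text, with no pre-trained model and no GPU, and the resulting dictionary can be fed into any classical classifier.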
There are also hyperparameter tuning techniques, like random or Bayesian hyperparameter search, which improve computational efficiency compared to brute-force grid search. Even if the computational gain seems insignificant for a single run, in the long term such techniques may considerably reduce energy consumption.
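The difference in compute budget can be illustrated with a small sketch. The loss function below is a hypothetical stand-in for "train a model and measure validation loss", and the search ranges, trial counts, and seed are invented; the point is only that random search fixes the number of training runs up front, whereas a grid grows multiplicatively with every added hyperparameter.

```python
import itertools
import random


def validation_loss(lr, reg):
    """Hypothetical stand-in for training a model and measuring validation loss."""
    return (lr - 0.03) ** 2 + 0.1 * (reg - 0.5) ** 2


def grid_search(lrs, regs):
    """Evaluate every combination; cost is len(lrs) * len(regs) training runs."""
    trials = [((lr, reg), validation_loss(lr, reg)) for lr, reg in itertools.product(lrs, regs)]
    return min(trials, key=lambda t: t[1]), len(trials)


def random_search(n_trials, seed=0):
    """Sample a fixed budget of configurations; cost is exactly n_trials runs."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-3, -1)  # log-uniform learning rate in [0.001, 0.1]
        reg = rng.uniform(0.0, 1.0)
        trials.append(((lr, reg), validation_loss(lr, reg)))
    return min(trials, key=lambda t: t[1]), len(trials)


grid_best, grid_runs = grid_search([0.001, 0.01, 0.1], [0.0, 0.25, 0.5, 0.75, 1.0])
rand_best, rand_runs = random_search(n_trials=8)
print(f"grid search:   {grid_runs} runs, best config {grid_best[0]}")
print(f"random search: {rand_runs} runs, best config {rand_best[0]}")
```

Whether the smaller budget finds an equally good configuration depends on the problem, so the trial count is itself something to choose with the energy cost in mind.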
From this point of view, it is important to have a common framework that defines standards for integrating computationally efficient algorithms into the most popular ML libraries. For instance, there are many packages implementing Bayesian hyperparameter search, but Data Scientists often prefer not to use them for tuning their NLP models because they are hard to integrate with popular libraries such as PyTorch or TensorFlow. Such a framework could improve the situation by facilitating interoperability between packages.
NLP for good
It is commonly believed that NLP is mostly used in marketing and e-commerce. This is partially true, as the majority of use cases involve using textual data to increase product sales. However, some NLP projects may directly benefit the environment, like predicting air quality with social media (Jiang et al.). In this study, the researchers address the issue of air quality measures, such as the concentration of PM2.5, and the problem of monitoring changes in air quality conditions. To solve this problem, they suggest exploiting social media and NLP by treating users as social sensors with their findings and locations. Their extensive experiments showed very promising results: with the help of NLP, the authors outperformed two baseline methods that used only historical measurements, as their engine was able to extract essential knowledge about air quality from myriad posts on social media.
This use case shows that the most advanced NLP techniques may also help in constructing a more sustainable future. Undoubtedly, software and hardware optimization should still be applied whenever possible.
Conclusion
In this article we have briefly studied the most popular NLP pipelines from the point of view of environmental impact, economic efficiency, and equitable access.
Having analyzed these aspects, we can conclude that the first step would be to create a common framework, like Green Ops, centralizing all the best practices for using computationally efficient hardware and algorithms. Huge cloud providers like AWS or Azure could apply this framework to build valuable, flexible, and environmentally friendly compute resources. For non-profit educational institutions, the most suitable option may be to build centralized servers instead of relying on cloud computing, and to implement computationally efficient algorithms. Besides being cost-effective, this would also provide equitable access to all researchers, even those with limited financial resources, as the server capacity can be shared across many different projects. Finally, as NLP can contribute to sustainable development in many valuable ways, like improving air quality monitoring, further research in the domain should be encouraged. Having a set of best practices may also be beneficial for individual researchers, as it may facilitate the integration of compute-optimization packages into popular ML libraries like TensorFlow and PyTorch. Thus, as we have seen above, through small but consistent efforts we can build a more sustainable internet.
“The secret of success is constancy of purpose.”
(Benjamin Disraeli)
References
Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. arXiv preprint arXiv:1906.02243.
Rhonda Ascierto. 2018. Uptime Institute Global Data Center Survey. Technical report, Uptime Institute.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations (ICLR), San Diego, California, USA.
James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13(Feb):281–305.
Bruno Burger. 2019. Net Public Electricity Generation in Germany in 2018. Technical report, Fraunhofer Institute for Solar Energy Systems ISE.