Chapter 5 Conclusion

Author: Nadja Sauter

Supervisor: Matthias Aßenmacher

It is very impressive how multimodal architectures have developed, especially over the course of the last two years. In particular, methods that generate pictures from text prompts, like DALL-E, became incredibly good at their “job”. A lot of people are fascinated by the stunning results, and a huge hype around these AI-generated images evolved on the internet, especially on Twitter. In this way, the models were investigated not only by researchers but also by the online community (e.g. Katherine Crowson alias Rivers Have Wings). Even in the art scene these methods attracted a lot of attention, as shown in our use case “Generative Arts” (subsection 4.4). Apart from that, it is possible to deploy these methods commercially, for instance in film production or the gaming industry (e.g. creating characters for games). However, this might also result in problems of copyright, an issue which has not been resolved yet.

It is also impressive how realistic and precise the outputs of such architectures are. On the other hand, these methods can also be abused to spread misleading information, as it is often very difficult to distinguish between a fake and a real picture by only looking at it. This can be used systematically to manipulate public opinion by spreading AI-manipulated media, also called deep fakes. That is why researchers like Joshi, Walambe, and Kotecha (2021) demand automated tools which are capable of detecting these fabrications. Apart from that, like most deep learning models, multimodal architectures are not free from bias, which also needs to be investigated further (Esser, Rombach, and Ommer 2020). Besides, the algorithms are very complex, which is why they are often called “black-box” models, meaning that one cannot directly retrace how the model came to a certain solution or decision. This may limit their social acceptance and usability, as the underlying process is not credible and transparent enough (Joshi, Walambe, and Kotecha 2021). For instance, in medical applications such as predicting the presence or absence of cancer, not only the decision of the AI but also its reasoning and certainty are highly relevant for doctors and patients.

Furthermore, there is a clear trend in recent years towards building more and more complex architectures in order to achieve higher performance. For instance, OpenAI’s language model GPT-2 had about 1.5 billion parameters (Radford, Wu, et al. 2019a), whereas its successor GPT-3 has about 175 billion parameters (Brown et al. 2020). Increasing the number of parameters often helps to improve model performance, but all of these parameters need to be trained and stored, which takes a lot of time, enormous computational power, and storage. For example, training GPT-2 took about one week (168 hours) on 32 TPUv3 chips (Strubell, Ganesh, and McCallum 2019c). Strubell, Ganesh, and McCallum (2019c) estimated that the cloud compute costs for training GPT-2 added up to about $12,902–$43,008. Apart from the enormous expenses, this also contributes to our environmental burden, as the process is very energy-intensive. Due to missing power draw data on GPT-2’s training hardware, the researchers were not able to calculate its CO2 emissions. However, for the popular BERT architecture with 110M parameters, they calculated cloud compute costs of $3,751–$12,571, an energy consumption of 1,507 kWh, and a carbon footprint of 1,438 lbs of CO2. In comparison, flying from New York to San Francisco produces a footprint of about 1,984 lbs of CO2 per passenger. In conclusion, training BERT once results in almost the same footprint as this long-haul flight. On top of this, these numbers refer to only one training run; developing a new model or adapting it often takes several fitting and tuning phases.
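To make the comparison transparent, the short sketch below reproduces the back-of-envelope arithmetic behind these numbers, following the approach of Strubell, Ganesh, and McCallum (2019c), who convert total energy consumption into CO2 emissions using an average U.S. grid emission factor of roughly 0.954 lbs of CO2 per kWh. This is only an illustrative calculation under that assumption; the emission factor, the 1,507 kWh for BERT, and the flight footprint are taken from the figures cited above, not measured here.

```python
# Rough CO2 estimate in the spirit of Strubell et al. (2019):
#   CO2 (lbs) ≈ emission factor (lbs CO2 per kWh) * energy consumption (kWh)
# All constants are assumptions taken from the cited figures, not exact measurements.

US_EMISSION_FACTOR_LBS_PER_KWH = 0.954   # average U.S. grid emissions per kWh
BERT_TRAINING_ENERGY_KWH = 1_507         # reported energy for one BERT training run
NY_SF_FLIGHT_LBS_CO2 = 1_984             # one passenger, New York -> San Francisco

bert_co2_lbs = US_EMISSION_FACTOR_LBS_PER_KWH * BERT_TRAINING_ENERGY_KWH
print(f"Estimated CO2 for one BERT training run: {bert_co2_lbs:.0f} lbs")        # ~1,438 lbs
print(f"Share of one NY-SF flight per passenger: {bert_co2_lbs / NY_SF_FLIGHT_LBS_CO2:.0%}")  # ~72%
```

Since the estimate scales linearly with energy use, several fitting and tuning runs multiply the footprint accordingly, which is exactly why the single-run numbers above understate the true cost of model development.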

Moreover, the computational power as well as the necessary hardware, technology, and financial means to run these models can oftentimes only be provided by big technology companies such as Google, Facebook, or OpenAI. This results in disparate access between researchers in academia and in industry. Furthermore, companies sometimes tend not to publish their (best) models, as these are their “product” and contribute to the company’s intellectual property. In this way it is not possible to reproduce their work and findings independently. Besides, from an economic point of view, this may lay the foundation of a monopoly, which might be dangerous for economic competition and holds the possibility of abuse.

References

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.

Esser, Patrick, Robin Rombach, and Björn Ommer. 2020. “A Note on Data Biases in Generative Models.” arXiv. https://doi.org/10.48550/ARXIV.2012.02516.

Joshi, Gargi, Rahee Walambe, and Ketan Kotecha. 2021. “A Review on Explainability in Multimodal Deep Neural Nets.” IEEE Access 9: 59800–59821. https://doi.org/10.1109/ACCESS.2021.3070212.

Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog.

Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019c. “Energy and Policy Considerations for Deep Learning in NLP.” arXiv. https://doi.org/10.48550/ARXIV.1906.02243.