References

Agirre, Eneko, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. “A Study on Similarity and Relatedness Using Distributional and WordNet-Based Approaches.”

Ailem, Melissa, Bowen Zhang, Aurelien Bellet, Pascal Denis, and Fei Sha. 2018. “A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images.”

Akbari, Hassan, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. “VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, edited by Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, 24206–21. https://proceedings.neurips.cc/paper/2021/hash/cb3213ada48302953cb0f166464ab356-Abstract.html.

Alayrac, Jean-Baptiste, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, et al. 2022. “Flamingo: A Visual Language Model for Few-Shot Learning.” arXiv Preprint arXiv:2204.14198.

Alford, A. 2021. “Google Announces 800M Parameter Vision-Language AI Model ALIGN.” 2021. https://www.infoq.com/news/2021/07/google-vision-language-ai/.

Anderson, Peter, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. “SPICE: Semantic Propositional Image Caption Evaluation.” In Computer Vision – ECCV 2016, edited by Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, 382–98. Cham: Springer International Publishing.

Anderson, Peter, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. “Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering.” In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6077–86. https://doi.org/10.1109/CVPR.2018.00636.

Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. “VQA: Visual Question Answering.” In Proceedings of the IEEE International Conference on Computer Vision, 2425–33.

Komatsuzaki, Aran. 2021. “When You Generate Images with VQGAN CLIP, the Image Quality Dramatically Improves If You Add "Unreal Engine" to Your Prompt. People Are Now Calling This "Unreal Engine Trick".” Twitter. https://twitter.com/arankomatsuzaki/status/1399471244760649729.

Bachmann, Roman, David Mizrahi, Andrei Atanov, and Amir Zamir. 2022. “MultiMAE: Multi-Modal Multi-Task Masked Autoencoders.” arXiv Preprint arXiv:2204.01678.

Baevski, Alexei, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. “Data2vec: A General Framework for Self-Supervised Learning in Speech, Vision and Language.” arXiv Preprint arXiv:2202.03555.

Baevski, Alexei, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” Advances in Neural Information Processing Systems 33: 12449–60.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. “Neural Machine Translation by Jointly Learning to Align and Translate.” http://arxiv.org/abs/1409.0473.

Baltrušaitis, Tadas, Chaitanya Ahuja, and Louis-Philippe Morency. 2017. “Multimodal Machine Learning: A Survey and Taxonomy.” arXiv Preprint arXiv:1705.09406.

Bandy, Jack, and Nicholas Vincent. 2021. “Addressing ‘Documentation Debt’ in Machine Learning Research: A Retrospective Datasheet for BookCorpus.” arXiv Preprint arXiv:2105.05241.

Banerjee, Satanjeev, and Alon Lavie. 2005. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72. Ann Arbor, Michigan: Association for Computational Linguistics. https://aclanthology.org/W05-0909.

Bao, Hangbo, Li Dong, and Furu Wei. 2021. “BEiT: BERT Pre-Training of Image Transformers.” arXiv Preprint arXiv:2106.08254.

Barham, Paul, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael Isard, et al. 2022. “Pathways: Asynchronous Distributed Dataflow for ML.” arXiv. https://doi.org/10.48550/ARXIV.2203.12533.

Bäck, Thomas, and Hans-Paul Schwefel. 1993. “An Overview of Evolutionary Algorithms for Parameter Optimization.” Evolutionary Computation 1 (1): 1–23. https://doi.org/10.1162/evco.1993.1.1.1.

Bellemare, Marc G., Yavar Naddaf, Joel Veness, and Michael Bowling. 2013. “The Arcade Learning Environment: An Evaluation Platform for General Agents.” J. Artif. Int. Res. 47 (1): 253–79.

Bender, Emily M., Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. Virtual Event, Canada: Association for Computing Machinery. https://doi.org/10.1145/3442188.3445922.

Bengio, Yoshua, Aaron C. Courville, and Pascal Vincent. 2013. “Representation Learning: A Review and New Perspectives.” IEEE Trans. Pattern Anal. Mach. Intell. 35 (8): 1798–1828. https://doi.org/10.1109/TPAMI.2013.50.

Beyer, Lucas, Olivier J Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. 2020. “Are We Done with ImageNet?” arXiv Preprint arXiv:2006.07159.

Birhane, Abeba, Vinay Uday Prabhu, and Emmanuel Kahembwe. 2021. “Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes.” arXiv Preprint arXiv:2110.01963.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. “Enriching Word Vectors with Subword Information.” http://arxiv.org/abs/1607.04606.

Bommasani, Rishi, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, et al. 2021. “On the Opportunities and Risks of Foundation Models.” arXiv Preprint arXiv:2108.07258.

Bordes, Patrick, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, and Patrick Gallinari. 2020. “Incorporating Visual Semantics into Sentence Representations Within a Grounded Space.” arXiv Preprint arXiv:2002.02734.

Dayma, Boris. 2022. “DALL·E Mini.” https://huggingface.co/spaces/dalle-mini/dalle-mini.

Borji, Ali. 2018. “Pros and Cons of GAN Evaluation Measures.” CoRR. http://arxiv.org/abs/1802.03446.

Bosch, Anna, Andrew Zisserman, and Xavier Munoz. 2007. “Image Classification Using Random Forests and Ferns.” In 2007 IEEE 11th International Conference on Computer Vision, 1–8. IEEE.

Bowman, Samuel R, and George E Dahl. 2021. “What Will It Take to Fix Benchmarking in Natural Language Understanding?” arXiv Preprint arXiv:2104.02145.

Bromley, Jane, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. “Signature Verification Using a ‘Siamese’ Time Delay Neural Network.” Advances in Neural Information Processing Systems 6.

Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. “Language Models Are Few-Shot Learners.” Advances in Neural Information Processing Systems 33: 1877–1901.

Bruni, Elia, Nam-Khanh Tran, and Marco Baroni. 2014. “Multimodal Distributional Semantics.” Journal of Artificial Intelligence Research 49: 1–47.

Brysbaert, Marc, Amy Beth Warriner, and Victor Kuperman. 2014. “Concreteness Ratings for 40 Thousand Generally Known English Word Lemmas.” Behavior Research Methods 46 (3): 904–11.

Carion, Nicolas, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. “End-to-End Object Detection with Transformers.” CoRR. https://arxiv.org/abs/2005.12872.

Caron, Mathilde, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments.” CoRR. https://arxiv.org/abs/2006.09882.

Caron, Mathilde, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. “Emerging Properties in Self-Supervised Vision Transformers.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–60.

Carreira, Joao, Skanda Koppula, Daniel Zoran, Adria Recasens, Catalin Ionescu, Olivier Henaff, Evan Shelhamer, et al. 2022. “Hierarchical Perceiver.” arXiv Preprint arXiv:2202.10890.

Caruana, Rich. 1997. “Multitask Learning.” Machine Learning 28 (1): 41–75.

Cheerla, Anika, and Olivier Gevaert. 2019. “Deep Learning with Multimodal Representation for Pancancer Prognosis Prediction.” Bioinformatics 35 (14): i446–i454.

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020a. “A Simple Framework for Contrastive Learning of Visual Representations.” arXiv. https://doi.org/10.48550/ARXIV.2002.05709.

Chen, Ting, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020b. “A Simple Framework for Contrastive Learning of Visual Representations.” In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 119:1597–1607. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v119/chen20j.html.

Chen, Xinlei, Saining Xie, and Kaiming He. 2021. “An Empirical Study of Training Self-Supervised Vision Transformers.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9640–9.

Cheng, Heng-Tze, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, et al. 2016. “Wide & Deep Learning for Recommender Systems.” In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 7–10.

Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. “Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation.” http://arxiv.org/abs/1406.1078.

Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, et al. 2022. “PaLM: Scaling Language Modeling with Pathways.” arXiv Preprint arXiv:2204.02311.

Clark, Kevin, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. “What Does BERT Look at? An Analysis of BERT’s Attention.” http://arxiv.org/abs/1906.04341.

Collell, Guillem, Ted Zhang, and Marie-Francine Moens. 2017. “Imagined Visual Representations as Multimodal Embeddings.” In Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. 1.

Cornia, Marcella, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2019. “Meshed-Memory Transformer for Image Captioning.” arXiv. https://doi.org/10.48550/ARXIV.1912.08226.

———. 2020. “Meshed-Memory Transformer for Image Captioning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Crawshaw, Michael. 2020. “Multi-Task Learning with Deep Neural Networks: A Survey.” arXiv. https://doi.org/10.48550/ARXIV.2009.09796.

Crowson, Katherine, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. 2022. “VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance.” arXiv. https://doi.org/10.48550/ARXIV.2204.08583.

Da, Jeff, and Jungo Kasai. 2019. “Cracking the Contextual Commonsense Code: Understanding Commonsense Reasoning Aptitude of Deep Contextual Representations.” http://arxiv.org/abs/1910.01157.

Das, Abhishek, Harsh Agrawal, Larry Zitnick, Devi Parikh, and Dhruv Batra. 2017. “Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?” Computer Vision and Image Understanding 163: 90–100.

Dean, Jeff. 2021. “Introducing Pathways: A Next-Generation AI Architecture.” 2021. https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/.

Dean, Jeffrey. 2020. “1.1 The Deep Learning Revolution and Its Implications for Computer Architecture and Chip Design.” In 2020 IEEE International Solid-State Circuits Conference (ISSCC), 8–14. https://doi.org/10.1109/ISSCC19947.2020.9063049.

Dehouche, Nassim. 2021. “Plagiarism in the Age of Massive Generative Pre-Trained Transformers (GPT-3).” Ethics in Science and Environmental Politics 21: 17–23.

Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–55. IEEE.

Devereux, Barry J, Lorraine K Tyler, Jeroen Geertzen, and Billi Randall. 2014. “The Centre for Speech, Language and the Brain (CSLB) Concept Property Norms.” Behavior Research Methods 46 (4): 1119–27.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.1810.04805.

———. 2018b. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” http://arxiv.org/abs/1810.04805.

———. 2018c. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805.

———. 2019. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–86. Minneapolis, Minnesota: Association for Computational Linguistics. https://doi.org/10.18653/v1/N19-1423.

Dhamala, Jwala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang, and Rahul Gupta. 2021. “BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 862–72.

Dhariwal, Prafulla, and Alex Nichol. 2021. “Diffusion Models Beat GANs on Image Synthesis.” CoRR. https://arxiv.org/abs/2105.05233.

Ding, Ming, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, et al. 2021. “CogView: Mastering Text-to-Image Generation via Transformers.” CoRR. https://arxiv.org/abs/2105.13290.

Doerr, Benjamin, and Frank Neumann. 2021. “A Survey on Recent Progress in the Theory of Evolutionary Algorithms for Discrete Optimization.” ACM Trans. Evol. Learn. Optim. 1 (4). https://doi.org/10.1145/3472304.

Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. 2020. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv Preprint arXiv:2010.11929.

———. 2020a. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” CoRR. https://arxiv.org/abs/2010.11929.

———. 2020b. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” CoRR abs/2010.11929. https://arxiv.org/abs/2010.11929.

———. 2021. “An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale.” In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=YicbFdNTTy.

Dwibedi, Debidatta, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. 2021. “With a Little Help from My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations.” CoRR abs/2104.14548. https://arxiv.org/abs/2104.14548.

IBM Cloud Education. 2020a. “What Is Supervised Learning?” www.ibm.com.

———. 2020b. “What Is Unsupervised Learning?” www.ibm.com.

Esser, Patrick, Robin Rombach, and Bjorn Ommer. 2021. “Taming Transformers for High-Resolution Image Synthesis.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12873–83.

Esser, Patrick, Robin Rombach, and Björn Ommer. 2020. “A Note on Data Biases in Generative Models.” arXiv. https://doi.org/10.48550/ARXIV.2012.02516.

Ettinger, Allyson. 2019. “What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models.” http://arxiv.org/abs/1907.13528.

Everingham, Mark, Luc van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. 2010. “The Pascal Visual Object Classes (VOC) Challenge.” International Journal of Computer Vision 88 (2): 303–38. https://doi.org/10.1007/s11263-009-0275-4.

Fellbaum, Christiane. 2010. “WordNet.” In Theory and Applications of Ontology: Computer Applications, 231–43. Springer.

Fellbaum, Christiane D. 2000. “WordNet: An Electronic Lexical Database.” Language 76: 706.

Fernando, Chrisantha, Dylan Banarse, Charles Blundell, Yori Zwols, David Ha, Andrei A. Rusu, Alexander Pritzel, and Daan Wierstra. 2017. “PathNet: Evolution Channels Gradient Descent in Super Neural Networks.” arXiv. https://arxiv.org/abs/1701.08734.

Forbes, Maxwell, Ari Holtzman, and Yejin Choi. 2019. “Do Neural Language Representations Learn Physical Commonsense?” http://arxiv.org/abs/1908.02899.

Gafni, Oran, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. 2022. “Make-a-Scene: Scene-Based Text-to-Image Generation with Human Priors.” arXiv. https://doi.org/10.48550/ARXIV.2203.13131.

Galanter, Philip. 2016. “Generative Art Theory.” A Companion to Digital Art 1: 631.

Gao, Jiyang, Zhen Li, Ram Nevatia, and others. 2017. “Knowledge Concentration: Learning 100k Object Classifiers in a Single CNN.” arXiv Preprint arXiv:1711.07607.

Gao, Leo, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of Diverse Text for Language Modeling.” arXiv Preprint arXiv:2101.00027.

Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. 2016. “A Neural Algorithm of Artistic Style.” arXiv. https://doi.org/10.48550/ARXIV.1508.06576.

Gebru, Timnit, Jonathan Krause, Yilun Wang, Duyun Chen, Jia Deng, Erez Aiden, and Li Fei-Fei. 2017. “Using Deep Learning and Google Street View to Estimate the Demographic Makeup of Neighborhoods Across the United States.” Proceedings of the National Academy of Sciences 114: 201700035. https://doi.org/10.1073/pnas.1700035114.

Gesmundo, Andrea, and Jeff Dean. 2022. “MuNet: Evolving Pretrained Deep Neural Networks into Scalable Auto-Tuning Multitask Systems.” arXiv. https://doi.org/10.48550/ARXIV.2205.10937.

Gokaslan, Aaron, and Vanya Cohen. 2019. “OpenWebText Corpus.”

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014a. “Generative Adversarial Networks.” arXiv. https://doi.org/10.48550/ARXIV.1406.2661.

———. 2014b. “Generative Adversarial Networks.” arXiv. https://doi.org/10.48550/ARXIV.1406.2661.

Goodfellow, Ian J, Jonathon Shlens, and Christian Szegedy. 2014. “Explaining and Harnessing Adversarial Examples.” arXiv Preprint arXiv:1412.6572.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems, edited by Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger. Vol. 27. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.

Google. 2022. “Embeddings: Translating to a Lower-Dimensional Space.”

Goyal, Yash, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. “Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6904–13.

Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, et al. 2020. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” NeurIPS.

Grill, Jean-Bastien, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, et al. 2020. “Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.” arXiv. https://doi.org/10.48550/ARXIV.2006.07733.

Guo, Yandong, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. 2016. “MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition.” In European Conference on Computer Vision, 87–102. Springer.

Harnad, Stevan. 1990. “The Symbol Grounding Problem.” Physica D: Nonlinear Phenomena 42 (1-3): 335–46.

Harris, Z, and others. 1954. “Distributional Structure.” Word 10 (2-3): 146–62.

Hart, B., and T. R. Risley. 1995. “Meaningful Differences in the Everyday Experience of Young American Children.” Baltimore, MD: Paul H. Brookes Publishing Company.

He, Kaiming, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. “Masked Autoencoders Are Scalable Vision Learners.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16000–16009.

He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. “Deep Residual Learning for Image Recognition.” CoRR. http://arxiv.org/abs/1512.03385.

Henderson, Peter, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky, and Joelle Pineau. 2020. “Towards the Systematic Reporting of the Energy and Carbon Footprints of Machine Learning.” Journal of Machine Learning Research 21 (248): 1–43.

Herdade, Simao, Armin Kappeler, Kofi Boakye, and Joao Soares. 2019. “Image Captioning: Transforming Objects into Words.” In Advances in Neural Information Processing Systems 32, 11135–45. http://papers.nips.cc/paper/9293-image-captioning-transforming-objects-into-words.

Heusel, Martin, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/8a1d694707eb0fefe65871369074926d-Paper.pdf.

Hill, Felix, and Anna Korhonen. 2014. “Learning Abstract Concept Embeddings from Multi-Modal Data: Since You Probably Can’t See What I Mean.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 255–65.

Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. “SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation.” Computational Linguistics 41 (4): 665–95.

Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv. https://doi.org/10.48550/ARXIV.1503.02531.

Hinton, Geoffrey, Oriol Vinyals, Jeff Dean, and others. 2015. “Distilling the Knowledge in a Neural Network.” arXiv Preprint arXiv:1503.02531.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020a. “Denoising Diffusion Probabilistic Models.” CoRR. https://arxiv.org/abs/2006.11239.

———. 2020b. “Denoising Diffusion Probabilistic Models.” In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, edited by Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin. https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html.

Hochreiter, Sepp, and Jürgen Schmidhuber. 1997. “Long Short-Term Memory.” Neural Computation 9 (8): 1735–80.

Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv Preprint arXiv:2203.15556.

Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications.” CoRR abs/1704.04861. http://arxiv.org/abs/1704.04861.

Hu, Ronghang, and Amanpreet Singh. 2021a. “UniT: Multimodal Multitask Learning with a Unified Transformer.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1439–49.

———. 2021b. “UniT: Multimodal Multitask Learning with a Unified Transformer.” In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 1419–29. https://doi.org/10.1109/ICCV48922.2021.00147.

Huang, Lun, Wenmin Wang, Jie Chen, and Xiao-Yong Wei. 2019. “Attention on Attention for Image Captioning.” In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 4633–42. https://doi.org/10.1109/ICCV.2019.00473.

Huang, Shih-Cheng, Anuj Pareek, Saeed Seyyedi, Imon Banerjee, and Matthew P Lungren. 2020. “Fusion of Medical Imaging and Electronic Health Records Using Deep Learning: A Systematic Review and Implementation Guidelines.” NPJ Digital Medicine 3 (1): 1–9.

Huang, Yanping, Yonglong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. 2018. “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism.” CoRR abs/1811.06965. http://arxiv.org/abs/1811.06965.

Hudson, Drew A, and Christopher D Manning. 2019. “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6700–6709.

Sleeman, William C., IV, Rishabh Kapoor, and Preetam Ghosh. 2021. “Multimodal Classification: Current Landscape, Taxonomy and Future Directions.” arXiv Preprint arXiv:2109.09020.

Ive, Julia, Pranava Madhyastha, and Lucia Specia. 2019. “Distilling Translations with Visual Awareness.” arXiv Preprint arXiv:1906.07701.

Jacobs, Robert A., Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. 1991. “Adaptive Mixtures of Local Experts.” Neural Computation 3 (1): 79–87. https://doi.org/10.1162/neco.1991.3.1.79.

Jaegle, Andrew, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021a. “Perceiver: General Perception with Iterative Attention.” In International Conference on Machine Learning, 4651–64. PMLR.

Jaegle, Andrew, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and João Carreira. 2021b. “Perceiver: General Perception with Iterative Attention.” In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, edited by Marina Meila and Tong Zhang, 139:4651–64. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v139/jaegle21a.html.

Jaspreet. 2019. “A Concise History of Neural Networks.” Towards Data Science. towardsdatascience.com.

Jean, Neal, Marshall Burke, Michael Xie, W. Matthew Davis, David B. Lobell, and Stefano Ermon. 2016. “Combining Satellite Imagery and Machine Learning to Predict Poverty.” Science 353 (6301): 790–94. https://doi.org/10.1126/science.aaf7894.

Jia, Chao, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021a. “Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision.” In International Conference on Machine Learning, 4904–16. PMLR.

Jia, Chao, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021b. “Scaling up Visual and Vision-Language Representation Learning with Noisy Text Supervision.” CoRR. https://arxiv.org/abs/2102.05918.

Jordan, Michael I., and Robert A. Jacobs. 1994. “Hierarchical Mixtures of Experts and the EM Algorithm.” Neural Computation 6 (2): 181–214. https://doi.org/10.1162/neco.1994.6.2.181.

Joseph, K. J., Salman H. Khan, Fahad Shahbaz Khan, and Vineeth N. Balasubramanian. 2021. “Towards Open World Object Detection.” CoRR abs/2103.02603. https://arxiv.org/abs/2103.02603.

Joshi, Gargi, Rahee Walambe, and Ketan Kotecha. 2021. “A Review on Explainability in Multimodal Deep Neural Nets.” IEEE Access 9: 59800–59821. https://doi.org/10.1109/ACCESS.2021.3070212.

Jumper, John, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, et al. 2021. “Highly Accurate Protein Structure Prediction with AlphaFold.” Nature 596 (7873): 583–89. https://doi.org/10.1038/s41586-021-03819-2.

Kahatapitiya, Kumara, and Michael S. Ryoo. 2021. “SWAT: Spatial Structure Within and Among Tokens.” arXiv Preprint arXiv:2111.13677.

Kaiser, Lukasz, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. 2017. “One Model to Learn Them All.” arXiv. https://arxiv.org/pdf/1706.05137.pdf.

Karpathy, Andrej, and Li Fei-Fei. 2014. “Deep Visual-Semantic Alignments for Generating Image Descriptions.” arXiv. https://doi.org/10.48550/ARXIV.1412.2306.

Karras, Tero, Samuli Laine, and Timo Aila. 2019. “A Style-Based Generator Architecture for Generative Adversarial Networks.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4401–10.

Katzman, Jared L, Uri Shaham, Alexander Cloninger, Jonathan Bates, Tingting Jiang, and Yuval Kluger. 2018. “DeepSurv: Personalized Treatment Recommender System Using a Cox Proportional Hazards Deep Neural Network.” BMC Medical Research Methodology 18 (1): 1–12.

Kiela, Douwe, and Léon Bottou. 2014. “Learning Image Embeddings Using Convolutional Neural Networks for Improved Multi-Modal Semantics.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 36–45.

Kiela, Douwe, Alexis Conneau, Allan Jabri, and Maximilian Nickel. 2017. “Learning Visually Grounded Sentence Representations.” arXiv Preprint arXiv:1707.06320.

Kingma, Diederik P., and Max Welling. 2019. “An Introduction to Variational Autoencoders.” CoRR. http://arxiv.org/abs/1906.02691.

Kingma, Diederik P, and Max Welling. 2013. “Auto-Encoding Variational Bayes.” arXiv. https://doi.org/10.48550/ARXIV.1312.6114.

Kiros, Jamie, William Chan, and Geoffrey Hinton. 2018. “Illustrative Language Understanding: Large-Scale Visual Grounding with Image Search.” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 922–33.

Koehn, Philipp. 2005. “Europarl: A Parallel Corpus for Statistical Machine Translation.” In Proceedings of Machine Translation Summit X: Papers, 79–86.

Kolesnikov, Alexander, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. 2019. “Large Scale Learning of General Visual Representations for Transfer.” arXiv Preprint arXiv:1912.11370.

Kopper, Philipp, Simon Wiegrebe, Bernd Bischl, Andreas Bender, and David Rügamer. 2022. “DeepPAMM: Deep Piecewise Exponential Additive Mixed Models for Complex Hazard Structures in Survival Analysis.” In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 249–61. Springer.

Kottur, Satwik, Ramakrishna Vedantam, José MF Moura, and Devi Parikh. 2016. “Visual Word2vec (Vis-W2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4985–94.

Krishna, Ranjay, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, et al. 2016. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.” arXiv. https://arxiv.org/abs/1602.07332.

———. 2017. “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations.” International Journal of Computer Vision 123 (1): 32–73.

Krizhevsky, Alex, Geoffrey Hinton, and others. 2009. “Learning Multiple Layers of Features from Tiny Images.”

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012a. “ImageNet Classification with Deep Convolutional Neural Networks.” In Advances in Neural Information Processing Systems, edited by F. Pereira, C. J. Burges, L. Bottou, and K. Q. Weinberger. Vol. 25. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.

———. 2012b. “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems 25.

Kudo, Taku, and John Richardson. 2018. “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing.” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 66–71. Brussels, Belgium: Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-2012.

Kuznetsova, Alina, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, et al. 2020. “The Open Images Dataset V4.” International Journal of Computer Vision 128 (7): 1956–81.

Kynkäänniemi, Tuomas, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2019. “Improved Precision and Recall Metric for Assessing Generative Models.” arXiv. https://doi.org/10.48550/ARXIV.1904.06991.

Law, Stephen, Brooks Paige, and Chris Russell. 2019. “Take a Look Around.” ACM Transactions on Intelligent Systems and Technology 10 (5): 1–19. https://doi.org/10.1145/3342240.

Lazaridou, Angeliki, Nghia The Pham, and Marco Baroni. 2015. “Combining Language and Vision with a Multimodal Skip-Gram Model.” arXiv Preprint arXiv:1501.02598.

LeCun, Yann. 2022. “A Path Towards Autonomous Machine Intelligence Version 0.9.2, 2022-06-27.”

Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. “BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7871–80. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703.

Lewkowycz, Aitor, Anders Andreassen, David Martin Dohan, Ethan S Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, et al. 2022. “Solving Quantitative Reasoning Problems with Language Models.” Technical report. https://arxiv.org/abs/2206.14858.

Lialin, Vladislav, Kevin Zhao, Namrata Shivagunde, and Anna Rumshisky. 2022. “Life After BERT: What Do Other Muppets Understand About Language?” http://arxiv.org/abs/2205.10696.

Lin, Chin-Yew. 2004. “ROUGE: A Package for Automatic Evaluation of Summaries.” In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics. https://aclanthology.org/W04-1013.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. 2014. “Microsoft COCO: Common Objects in Context.” arXiv. https://doi.org/10.48550/ARXIV.1405.0312.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014a. “Microsoft COCO: Common Objects in Context.” In European Conference on Computer Vision, 740–55. Springer.

Lin, Tsung-Yi, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014b. “Microsoft COCO: Common Objects in Context.” In Computer Vision – ECCV 2014, 740–55. Springer International Publishing.

Lin, Yongjie, Yi Chern Tan, and Robert Frank. 2019. “Open Sesame: Getting Inside BERT’s Linguistic Knowledge.” In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics. https://doi.org/10.18653/v1/w19-4825.

Liu, Nelson F., Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. 2019. “Linguistic Knowledge and Transferability of Contextual Representations.” http://arxiv.org/abs/1903.08855.

Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” arXiv Preprint arXiv:1907.11692.

Lottick, Kadan, Silvia Susai, Sorelle A Friedler, and Jonathan P Wilson. 2019. “Energy Usage Reports: Environmental Awareness as Part of Algorithmic Accountability.” arXiv Preprint arXiv:1911.08354.

Lu, Jiasen, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019a. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” arXiv. https://doi.org/10.48550/ARXIV.1908.02265.

———. 2019b. “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.” Advances in Neural Information Processing Systems 32.

Lu, Yujie, Wanrong Zhu, Xin Eric Wang, Miguel Eckstein, and William Yang Wang. 2022. “Imagination-Augmented Natural Language Understanding.” arXiv Preprint arXiv:2204.08535.

Mahajan, Dhruv, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. 2018. “Exploring the Limits of Weakly Supervised Pretraining.” In Proceedings of the European Conference on Computer Vision (Eccv), 181–96.

Manning, Chris, Anna Goldie, and John Hewitt. 2022. “Stanford CS224N: Natural Language Processing with Deep Learning.”

Mayer, Thomas, and Michael Cysouw. 2014. “Creating a Massively Parallel Bible Corpus.” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014).

McCormack, Jon, and Camilo Cruz Gambardella. 2022. “Growing and Evolving 3-D Prints.” IEEE Transactions on Evolutionary Computation 26 (1): 88–99. https://doi.org/10.1109/TEVC.2021.3095156.

Barthel, Michael, Jesse Holcomb, Galen Stocking, and Amy Mitchell. 2016. “Reddit News Users More Likely to Be Male, Young and Digital in Their News Preferences.” 2016. https://www.pewresearch.org/journalism/2016/02/25/reddit-news-users-more-likely-to-be-male-young-and-digital-in-their-news-preferences/.

Midjourney. 2022. “Midjourney.” https://www.midjourney.com/.

Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

———. 2013b. “Efficient Estimation of Word Representations in Vector Space.” arXiv Preprint arXiv:1301.3781.

Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. 2013. “Exploiting Similarities Among Languages for Machine Translation.” http://arxiv.org/abs/1309.4168.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” http://arxiv.org/abs/1310.4546.

Mineault, Patrick. 2021. “Unsupervised Models of the Brain.” 2021. https://xcorr.net/2021/12/31/2021-in-review-unsupervised-brain-models/.

Microsoft. 2019. “Evaluate: Detection.” 2019. https://cocodataset.org/#detection-eval.

Mishkin, Pamela, Lama Ahmad, Miles Brundage, Gretchen Krueger, and Girish Sastry. 2022. “DALL·E 2 Preview - Risks and Limitations.” https://github.com/openai/dalle-2-preview/blob/main/system-card.md.

Mordvintsev, Alexander. 2015. “Inceptionism: Going Deeper into Neural Networks.” Google AI Blog. Google. https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.

Mustafa, Basil, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, and Neil Houlsby. 2022. “Multimodal Contrastive Learning with LIMoE: The Language-Image Mixture of Experts.” arXiv. https://doi.org/10.48550/ARXIV.2206.02770.

Nagrani, Arsha, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, and Chen Sun. 2021. “Attention Bottlenecks for Multimodal Fusion.” In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, Virtual, edited by Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, 14200–14213. https://proceedings.neurips.cc/paper/2021/hash/76ba9f564ebbc35b1014ac498fafadd0-Abstract.html.

“Neural Networks - History.” 2022. cs.stanford.edu.

Nichol, Alex, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021a. “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.” arXiv. https://doi.org/10.48550/ARXIV.2112.10741.

———. 2021b. “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models.” CoRR. https://arxiv.org/abs/2112.10741.

Oord, Aäron van den, Oriol Vinyals, and Koray Kavukcuoglu. 2017. “Neural Discrete Representation Learning.” CoRR. http://arxiv.org/abs/1711.00937.

OpenAI. 2021. “DALL-E.” https://github.com/openai/DALL-E.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. “Bleu: A Method for Automatic Evaluation of Machine Translation.” In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–18. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135.

Parcalabescu, Letitia, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. 2022. “VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena.” In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8253–80. Association for Computational Linguistics. https://aclanthology.org/2022.acl-long.567.

Patashnik, Or, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. “StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery.” CoRR. https://arxiv.org/abs/2103.17249.

Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.

Perez, Ethan, Douwe Kiela, and Kyunghyun Cho. 2021a. “True Few-Shot Learning with Language Models.” http://arxiv.org/abs/2105.11447.

———. 2021b. “True Few-Shot Learning with Language Models.” Advances in Neural Information Processing Systems 34: 11054–70.

Pezzelle, Sandro, Ece Takmaz, and Raquel Fernández. 2021. “Word Representation Learning in Multimodal Pre-Trained Transformers: An Intrinsic Evaluation.” Transactions of the Association for Computational Linguistics 9: 1563–79.

Pilehvar, Mohammad Taher, and Jose Camacho-Collados. 2021. Embeddings in Natural Language Processing. Springer International Publishing. https://doi.org/10.1007/978-3-031-02177-0.

Pont-Tuset, Jordi, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari. 2020. “Connecting Vision and Language with Localized Narratives.” In European Conference on Computer Vision, 647–64. Springer.

Pölsterl, Sebastian, Ignacio Sarasua, Benjamín Gutiérrez-Becker, and Christian Wachinger. 2019. “A Wide and Deep Neural Network for Survival Analysis from Anatomical Shape and Tabular Clinical Data.” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 453–64. Springer.

Prabhu, Vinay Uday, and Abeba Birhane. 2020. “Large Image Datasets: A Pyrrhic Win for Computer Vision?” arXiv Preprint arXiv:2006.16923.

Qiao, Han, Vivian Liu, and Lydia Chilton. 2022. “Initial Images: Using Image Prompts to Improve Subject Representation in Multimodal AI Generated Art.” In Creativity and Cognition, 15–28.

Qiao, Tingting, Jing Zhang, Duanqing Xu, and Dacheng Tao. 2019. “MirrorGAN: Learning Text-to-Image Generation by Redescription.” CoRR. http://arxiv.org/abs/1903.05854.

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021a. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv. https://doi.org/10.48550/ARXIV.2103.00020.

———. 2021b. “Learning Transferable Visual Models from Natural Language Supervision.” In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, edited by Marina Meila and Tong Zhang, 139:8748–63. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v139/radford21a.html.

———. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” In International Conference on Machine Learning, 8748–63. PMLR.

Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. “Improving Language Understanding by Generative Pre-Training.”

Radford, Alec, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. “Language Models Are Unsupervised Multitask Learners.”

Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, and others. 2019b. “Language Models Are Unsupervised Multitask Learners.” OpenAI Blog 1 (8): 9.

Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019a. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” http://arxiv.org/abs/1910.10683.

———. 2019b. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.” CoRR. http://arxiv.org/abs/1910.10683.

Raghu, Maithra, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. 2016. “On the Expressive Power of Deep Neural Networks.” arXiv. https://doi.org/10.48550/ARXIV.1606.05336.

Rajpurkar, Pranav, Robin Jia, and Percy Liang. 2018. “Know What You Don’t Know: Unanswerable Questions for SQuAD.” arXiv Preprint arXiv:1806.03822.

Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” arXiv Preprint arXiv:1606.05250.

Ramachandran, Prajit, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. “Stand-Alone Self-Attention in Vision Models.” CoRR abs/1906.05909. http://arxiv.org/abs/1906.05909.

Ramesh, Aditya, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022a. “Hierarchical Text-Conditional Image Generation with CLIP Latents.” arXiv. https://doi.org/10.48550/ARXIV.2204.06125.

———. 2022b. “Hierarchical Text-Conditional Image Generation with CLIP Latents.” arXiv Preprint arXiv:2204.06125.

Ramesh, Aditya, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021a. “Zero-Shot Text-to-Image Generation.” arXiv. https://doi.org/10.48550/ARXIV.2102.12092.

———. 2021b. “Zero-Shot Text-to-Image Generation.” CoRR. https://arxiv.org/abs/2102.12092.

———. 2021c. “Zero-Shot Text-to-Image Generation.” In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, edited by Marina Meila and Tong Zhang, 139:8821–31. Proceedings of Machine Learning Research. PMLR. http://proceedings.mlr.press/v139/ramesh21a.html.

———. 2021d. “Zero-Shot Text-to-Image Generation.” In Proceedings of the 38th International Conference on Machine Learning, edited by Marina Meila and Tong Zhang, 139:8821–31. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v139/ramesh21a.html.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rebuffi, Sylvestre-Alvise, Hakan Bilen, and Andrea Vedaldi. 2017. “Learning Multiple Visual Domains with Residual Adapters.” In Advances in Neural Information Processing Systems, edited by I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/e7b24b112a44fdd9ee93bdf998c6ca0e-Paper.pdf.

Recht, Benjamin, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. 2019. “Do ImageNet Classifiers Generalize to ImageNet?” In International Conference on Machine Learning, 5389–5400. PMLR.

Reed, Scott E., Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. 2016. “Learning What and Where to Draw.” CoRR. http://arxiv.org/abs/1610.02454.

Reed, Scott E., Zeynep Akata, Bernt Schiele, and Honglak Lee. 2016. “Learning Deep Representations of Fine-Grained Visual Descriptions.” CoRR. http://arxiv.org/abs/1605.05395.

Reed, Scott E., Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. “Generative Adversarial Text to Image Synthesis.” CoRR. http://arxiv.org/abs/1605.05396.

Reed, Scott, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, et al. 2022. “A Generalist Agent.” arXiv. https://doi.org/10.48550/ARXIV.2205.06175.

Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. 2015. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.” Advances in Neural Information Processing Systems 28.

Rennie, Steven J., Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. “Self-Critical Sequence Training for Image Captioning.” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1179–95. https://doi.org/10.1109/CVPR.2017.131.

Ribeiro, Marco Tulio, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList.” arXiv Preprint arXiv:2005.04118.

Riquelme, Carlos, Joan Puigcerver, Basil Mustafa, Maxim Neumann, Rodolphe Jenatton, André Susano Pinto, Daniel Keysers, and Neil Houlsby. 2021. “Scaling Vision with Sparse Mixture of Experts.” In Advances in Neural Information Processing Systems, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. Wortman Vaughan, 34:8583–95. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2021/file/48237d9f2dea8c74c2a72126cf63d933-Paper.pdf.

Ritchie, Hannah, Max Roser, and Pablo Rosado. 2020. “CO₂ and Greenhouse Gas Emissions.” Our World in Data.

Rombach, Robin, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2021. “High-Resolution Image Synthesis with Latent Diffusion Models.” CoRR. https://arxiv.org/abs/2112.10752.

———. 2022. “StableDiffusion.” https://github.com/CompVis/stable-diffusion.

Rosset, Corby. 2020. “Turing-NLG: A 17-Billion-Parameter Language Model by Microsoft.” Microsoft Blog 1 (2).

Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, et al. 2015. “ImageNet Large Scale Visual Recognition Challenge.” Int. J. Comput. Vision 115 (3): 211–52. https://doi.org/10.1007/s11263-015-0816-y.

Rügamer, David, Chris Kolb, and Nadja Klein. 2020. “Semi-Structured Deep Distributional Regression: Combining Structured Additive Models and Deep Learning.” arXiv Preprint arXiv:2002.05777.

Saharia, Chitwan, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, et al. 2022a. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.” arXiv Preprint arXiv:2205.11487.

———. 2022b. “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding.” arXiv. https://doi.org/10.48550/ARXIV.2205.11487.

Saifee, Moiz. 2020. “GPT-3: The New Mighty Language Model from OpenAI.”

Sajjadi, Mehdi S. M., Olivier Bachem, Mario Lucic, Olivier Bousquet, and Sylvain Gelly. 2018. “Assessing Generative Models via Precision and Recall.” arXiv. https://doi.org/10.48550/ARXIV.1806.00035.

Salimans, Tim, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. “Improved Techniques for Training GANs.” CoRR. http://arxiv.org/abs/1606.03498.

Schick, Timo, and Hinrich Schütze. 2020. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” http://arxiv.org/abs/2001.07676.

Schuhmann, C. 2022. “LAION-400-Million Open Dataset.” 2022. https://laion.ai/blog/laion-400-open-dataset/.

Schuhmann, Christoph, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021a. “LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs.” arXiv Preprint arXiv:2111.02114.

———. 2021b. “LAION-400M: Open Dataset of Clip-Filtered 400 Million Image-Text Pairs.” CoRR. https://arxiv.org/abs/2111.02114.

Sejnowski, Terrence J. 2020. “The Unreasonable Effectiveness of Deep Learning in Artificial Intelligence.” Proceedings of the National Academy of Sciences. https://doi.org/10.1073/pnas.1907373117.

Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2015a. “Neural Machine Translation of Rare Words with Subword Units.” CoRR. http://arxiv.org/abs/1508.07909.

———. 2015b. “Neural Machine Translation of Rare Words with Subword Units.” arXiv Preprint arXiv:1508.07909.

———. 2016. “Neural Machine Translation of Rare Words with Subword Units.” In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1715–25. Berlin, Germany: Association for Computational Linguistics. https://doi.org/10.18653/v1/P16-1162.

Shah, Deval. 2022. “Self-Supervised Learning and Its Applications - Neptune.ai.” neptune.ai.

Shao, Shuai, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. 2019. “Objects365: A Large-Scale, High-Quality Dataset for Object Detection.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8430–9.

Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” In 5th International Conference on Learning Representations, ICLR 2017. OpenReview.net. https://openreview.net/pdf?id=B1ckMDqlg.

Shekhar, Ravi, Sandro Pezzelle, Yauhen Klimovich, Aurélie Herbelot, Moin Nabi, Enver Sangineto, and Raffaella Bernardi. 2017. “Foil It! Find One Mismatch Between Image and Language Caption.” arXiv Preprint arXiv:1705.01359.

Shen, Sheng, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, and Kurt Keutzer. 2021. “How Much Can CLIP Benefit Vision-and-Language Tasks?” arXiv Preprint arXiv:2107.06383.

Sheng, Emily, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. “The Woman Worked as a Babysitter: On Biases in Language Generation.” arXiv Preprint arXiv:1909.01326.

Shonenkov, Alex. 2021. “RuDALL-E.” https://github.com/ai-forever/ru-dalle.

Shvetsova, Nina, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, and Hilde Kuehne. 2021. “Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval.” arXiv Preprint arXiv:2112.04446.

Sikarwar, Ankur, and Gabriel Kreiman. 2022. “On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering.” arXiv Preprint arXiv:2201.03965.

Silberer, Carina, and Mirella Lapata. 2014. “Learning Grounded Meaning Representations with Autoencoders.” In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 721–32.

Simonyan, Karen, and Andrew Zisserman. 2014. “Very Deep Convolutional Networks for Large-Scale Image Recognition.” arXiv Preprint arXiv:1409.1556.

Singh, Amanpreet, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. 2022. “FLAVA: A Foundational Language and Vision Alignment Model.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15638–50.

Sirko, Wojciech, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Eddine Bouchareb, Yann N. Dauphin, Daniel Keysers, Maxim Neumann, Moustapha Cissé, and John Quinn. 2021. “Continental-Scale Building Detection from High Resolution Satellite Imagery.” CoRR. https://arxiv.org/abs/2107.12283.

Snell, Charlie. 2021. “Understanding VQ-VAE.” https://ml.berkeley.edu/blog/posts/vq-vae/.

Socher, Richard, and Li Fei-Fei. 2010. “Connecting Modalities: Semi-Supervised Segmentation and Annotation of Images Using Unaligned Text Corpora.” In IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

Soderlund, Jacob, and Alan Blair. 2018. “Adversarial Image Generation Using Evolution and Deep Learning.” In 2018 IEEE Congress on Evolutionary Computation (CEC), 1–8. https://doi.org/10.1109/CEC.2018.8477754.

Sohl-Dickstein, Jascha, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. 2015. “Deep Unsupervised Learning Using Nonequilibrium Thermodynamics.” CoRR. http://arxiv.org/abs/1503.03585.

Srinivasan, Krishna, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. 2021. “WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning.” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2443–9.

Srinivasan, Ramya, and Kanji Uchino. 2021. “Biases in Generative Art: A Causal Look from the Lens of Art History.” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 41–51.

Srivastava, Aarohi, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, et al. 2022. “Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models.” arXiv Preprint arXiv:2206.04615.

Steiner, Andreas, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. 2021. “How to Train Your ViT? Data, Augmentation, and Regularization in Vision Transformers.” https://doi.org/10.48550/ARXIV.2106.10270.

Strubell, Emma, Ananya Ganesh, and Andrew McCallum. 2019. “Energy and Policy Considerations for Deep Learning in NLP.” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645–50. Florence, Italy: Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1355.

Sulubacak, Umut, Ozan Caglayan, Stig-Arne Grönroos, Aku Rouhe, Desmond Elliott, Lucia Specia, and Jörg Tiedemann. 2020. “Multimodal Machine Translation Through Visuals and Speech.” Machine Translation 34 (2-3): 97–147. https://doi.org/10.1007/s10590-020-09250-0.

Sun, Chen, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. 2017. “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era.” In Proceedings of the IEEE International Conference on Computer Vision, 843–52.

Sun, Qingfeng, Yujing Wang, Can Xu, Kai Zheng, Yaming Yang, Huang Hu, Fei Xu, Jessica Zhang, Xiubo Geng, and Daxin Jiang. 2021. “Multimodal Dialogue Response Generation.” arXiv Preprint arXiv:2110.08515.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. “Sequence to Sequence Learning with Neural Networks.” http://arxiv.org/abs/1409.3215.

Sutton, Richard S. 2019. “The Bitter Lesson.” 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html.

Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2015. “Rethinking the Inception Architecture for Computer Vision.” CoRR. http://arxiv.org/abs/1512.00567.

Tan, Hao, and Mohit Bansal. 2020. “Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision.” arXiv Preprint arXiv:2010.06775.

Tan, Mingxing, and Quoc V. Le. 2019. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” CoRR abs/1905.11946. http://arxiv.org/abs/1905.11946.

Tao, Ming, Hao Tang, Songsong Wu, Nicu Sebe, Fei Wu, and Xiao-Yuan Jing. 2020. “DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis.” CoRR. https://arxiv.org/abs/2008.05865.

Techslang. 2020. “What Is Self-Supervised Learning? — Definition by Techslang.” www.techslang.com.

Theis, Lucas, Aäron van den Oord, and Matthias Bethge. 2015. “A Note on the Evaluation of Generative Models.” arXiv. https://doi.org/10.48550/ARXIV.1511.01844.

Tian, Yonglong, Dilip Krishnan, and Phillip Isola. 2020. “Contrastive Multiview Coding.” In European Conference on Computer Vision, 776–94. Springer.

Tiu, Ekin. 2021. “Understanding Contrastive Learning.” Towards Data Science. towardsdatascience.com.

Tong, Chao, Jun Li, Chao Lang, Fanxin Kong, Jianwei Niu, and Joel JPC Rodrigues. 2018. “An Efficient Deep Model for Day-Ahead Electricity Load Forecasting with Stacked Denoising Auto-Encoders.” Journal of Parallel and Distributed Computing 117: 267–73.

Torralba, Antonio, and Alexei A Efros. 2011. “Unbiased Look at Dataset Bias.” In CVPR 2011, 1521–28. IEEE.

Uppal, Shagun, Sarthak Bhagat, Devamanyu Hazarika, Navonil Majumder, Soujanya Poria, Roger Zimmermann, and Amir Zadeh. 2022. “Multimodal Research in Vision and Language: A Review of Current and Emerging Trends.” Information Fusion 77: 149–71.

Vale-Silva, Luís A, and Karl Rohr. 2021. “Long-Term Cancer Survival Prediction Using Multimodal Deep Learning.” Scientific Reports 11 (1): 1–12.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, 5998–6008. https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.

Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. 2015. “CIDEr: Consensus-Based Image Description Evaluation.” In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4566–75. https://doi.org/10.1109/CVPR.2015.7299087.

Vinyals, Oriol, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. “Show and Tell: A Neural Image Caption Generator.” In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3156–64. https://doi.org/10.1109/CVPR.2015.7298935.

Voita, Elena, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. 2019. “Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned.” http://arxiv.org/abs/1905.09418.

Wang, Alex, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” arXiv Preprint arXiv:1804.07461.

Wang, Jun, and Shengchen Li. 2018. “Self-Attention Mechanism Based System for DCASE2018 Challenge Task1 and Task4.” Detection and Classification of Acoustic Scenes and Events 2018 (DCASE2018), August. https://doi.org/10.13140/RG.2.2.28317.13281.

Wang, Peng, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. “OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework.” In Proceedings of the 39th International Conference on Machine Learning, edited by Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, 162:23318–40. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v162/wang22al.html.

Wang, Qin, Rujia Li, Qi Wang, and Shiping Chen. 2021. “Non-Fungible Token (NFT): Overview, Evaluation, Opportunities and Challenges.” arXiv. https://doi.org/10.48550/ARXIV.2105.07447.

Website. 2020. “Localized Narratives Data and Visualization.” 2020. https://google.github.io/localized-narratives.

Wei, Chen, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. 2022. “Masked Feature Prediction for Self-Supervised Visual Pre-Training.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14668–78.

Weng, Lilian. 2018. “From Autoencoder to Beta-VAE.” Lilianweng.github.io. https://lilianweng.github.io/posts/2018-08-12-vae/.

———. 2021. “What Are Diffusion Models?” Lilianweng.github.io. https://lilianweng.github.io/posts/2021-07-11-diffusion-models/.

Wenzek, Guillaume, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2019. “CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data.” arXiv Preprint arXiv:1911.00359.

Wu, Chenfei, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. 2021. “NÜWA: Visual Synthesis Pre-Training for Neural visUal World creAtion.” arXiv Preprint arXiv:2111.12417.

WZRD. 2020. “WZRD.” https://wzrd.ai/.

Xiao, Jianxiong, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. 2010. “SUN Database: Large-Scale Scene Recognition from Abbey to Zoo.” In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 3485–92. https://doi.org/10.1109/CVPR.2010.5539970.

Xie, Qizhe, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. “Self-Training with Noisy Student Improves Imagenet Classification.” CoRR abs/1911.04252. http://arxiv.org/abs/1911.04252.

Xu, Kelvin, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. 2015. “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.” arXiv. https://doi.org/10.48550/ARXIV.1502.03044.

Xu, Tao, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2017. “AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks.” CoRR. http://arxiv.org/abs/1711.10485.

Xue, Linting, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2020. “mT5: A Massively Multilingual Pre-Trained Text-to-Text Transformer.” arXiv Preprint arXiv:2010.11934.

Yang, Xu, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. 2019. “Auto-Encoding Scene Graphs for Image Captioning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

LeCun, Yann, and Ishan Misra. 2021. “Self-Supervised Learning: The Dark Matter of Intelligence.” 2021. https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/.

Yao, Benjamin Z., Xiong Yang, Liang Lin, Mun Wai Lee, and Song-Chun Zhu. 2010. “I2T: Image Parsing to Text Description.” Proceedings of the IEEE 98 (8): 1485–1508. https://doi.org/10.1109/JPROC.2010.2050411.

Yao, Jiawen, Xinliang Zhu, Feiyun Zhu, and Junzhou Huang. 2017. “Deep Correlational Learning for Survival Prediction from Multi-Modality Data.” In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2017, 406–14. Springer International Publishing.

Yao, Ting, Yingwei Pan, Yehao Li, and Tao Mei. 2018. “Exploring Visual Relationship for Image Captioning.” arXiv. https://doi.org/10.48550/ARXIV.1809.07041.

You, Jiaxuan, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. 2017. “Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data.” In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 4559–65. AAAI’17. San Francisco, California, USA: AAAI Press.

Young, Peter, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. “From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions.” Transactions of the Association for Computational Linguistics 2: 67–78.

Yu, Jiahui, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. 2021. “Vector-Quantized Image Modeling with Improved VQGAN.” CoRR. https://arxiv.org/abs/2110.04627.

Yu, Jiahui, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, et al. 2022. “Scaling Autoregressive Models for Content-Rich Text-to-Image Generation.” arXiv. https://doi.org/10.48550/arXiv.2206.10789.

Yuan, Lu, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, et al. 2021. “Florence: A New Foundation Model for Computer Vision.” arXiv Preprint arXiv:2111.11432.

Yuan, Sha, Zhao Shuai, Leng Jiahong, Xue Zhao, Zhao Hanyu, and Tang Jie. 2022. “WuDaoMM: A Large-Scale Multi-Modal Dataset for Pre-Training Models.” arXiv Preprint arXiv:2203.11480.

Zagoruyko, Sergey, and Nikos Komodakis. 2016. “Wide Residual Networks.” CoRR abs/1605.07146. http://arxiv.org/abs/1605.07146.

Zellers, Rowan, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. “From Recognition to Cognition: Visual Commonsense Reasoning.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6720–31.

Zellers, Rowan, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. “SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference.” arXiv Preprint arXiv:1808.05326.

Zeng, Andy, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, et al. 2022. “Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language.” arXiv Preprint arXiv:2204.00598.

Zhang, Chao, Zichao Yang, Xiaodong He, and Li Deng. 2020. “Multimodal Intelligence: Representation Learning, Information Fusion, and Applications.” IEEE Journal of Selected Topics in Signal Processing 14 (3): 478–93. https://doi.org/10.1109/JSTSP.2020.2987728.

Zhang, Han, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaolei Huang, Xiaogang Wang, and Dimitris N. Metaxas. 2016. “StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks.” CoRR. http://arxiv.org/abs/1612.03242.

Zhang, Peng, Yash Goyal, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2016. “Yin and Yang: Balancing and Answering Binary Visual Questions.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5014–22.

Zhang, Yuhao, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. 2020. “Contrastive Learning of Medical Visual Representations from Paired Images and Text.” arXiv Preprint arXiv:2010.00747.

Zhou, Bolei, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. 2017. “Scene Parsing Through ADE20K Dataset.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 633–41.

Zhou, Yanqi, Sudip Roy, Amirali Abdolrashidi, Daniel Wong, Peter Ma, Qiumin Xu, Hanxiao Liu, et al. 2020. “Transferable Graph Optimizers for ML Compilers.” NeurIPS 2020. http://arxiv.org/abs/2010.12438.

Zhou, Yufan, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. 2021. “LAFITE: Towards Language-Free Training for Text-to-Image Generation.” CoRR. https://arxiv.org/abs/2111.13792.

Zhu, Minfeng, Pingbo Pan, Wei Chen, and Yi Yang. 2019. “DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis.” CoRR. http://arxiv.org/abs/1904.01310.

Zhu, Xinliang, Jiawen Yao, and Junzhou Huang. 2016. “Deep Convolutional Neural Network for Survival Analysis with Pathological Images.” In 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 544–47. IEEE.

Zhu, Yukun, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. “Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books.” In Proceedings of the IEEE International Conference on Computer Vision, 19–27.

Zhuang, Chengxu, Siming Yan, Aran Nayebi, Martin Schrimpf, Michael C Frank, James J DiCarlo, and Daniel LK Yamins. 2021. “Unsupervised Neural Network Models of the Ventral Visual Stream.” Proceedings of the National Academy of Sciences 118 (3): e2014196118.