My research focuses on identifying, designing, and building efficient, scalable, sustainable, and affordable abstractions and infrastructure for generative modeling. I also work at the intersection of data, law, and AI policy. I helped create some of the first embodied AI robotics foundation models and frontier-level open-source language models, contributed to datasets including OpenWebText, LAION, BLOOM, OpenThoughts, DCLM, and The Pile, and co-authored the RAIL AI License, the second most widely used AI software license on Hugging Face.
My work has been recognized with oral presentations and invited talks at top conferences including NeurIPS, ECCV, ICML, and CVPR. I am a Mozilla RISE25 2024 honoree, and I have received awards for my open-source contributions from the Linux Foundation and Mozilla.
I maintain pybind11, PyTorch, and other popular open-source libraries. I released OpenGPT2, one of the first autoregressive large language models with more than 1 billion parameters. I have contributed to the development of popular open-source generative AI artifacts including OpenWebText, OpenGPT2, CommonCanvas, CommonCatalog, CommonPile, Habitat-Matterport3D, DataComp-LM, and BLOOM. Collectively, these works have been downloaded millions of times. I also helped create the Responsible AI License (RAIL) as a co-chair of the BLOOM Workshop. Additionally, I serve on the advisory boards of EncodeJustice and Fidutam.
My work has been featured by WIRED, CNN, TechCrunch, and others.
I am currently on the academic and industry job market. I will also be attending NeurIPS 2024 in Vancouver. Please see my Google Scholar profile for my most up-to-date publication list.
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images.
Aaron Gokaslan, A. Feder Cooper, Jasmine Collins, Landan Seguin, Austin Jacobson, Mihir Patel, Jonathan Frankle, Cory Stephenson, Volodymyr Kuleshov
CVPR, 2024
Accepted to CVPR 2024
Presented at NeurIPS 2023 Diffusion and Content Creativity Workshops
Data Governance in the Age of Large-Scale Data-Driven Language Technology.
Yacine Jernite, Huu Nguyen, Stella Biderman, Anna Rogers, Maraim Masoud, Valentin Danchev, Samson Tan, Alexandra Sasha Luccioni, Nishant Subramani, Isaac Johnson, Gerard Dupont, Jesse Dodge, Kyle Lo, Zeerak Talat, Dragomir R. Radev, Aaron Gokaslan, Somaieh Nikpoor, Peter Henderson, Rishi Bommasani, Margaret Mitchell
FAccT, 2022
Habitat 2.0: Training Home Assistants to Rearrange their Habitat.
Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, Aaron Gokaslan, Vladimir Vondrus, Sameer Dharur, Franziska Meier, Wojciech Galuba, Angel X. Chang, Zsolt Kira, Vladlen Koltun, Jitendra Malik, Manolis Savva, Dhruv Batra
NeurIPS, 2021
Spotlight: Top 3% of papers