In this graduate-level course, we will investigate large generative AI models for scientific and engineering problems: machine learning models that can generate outputs, such as hypotheses, designs, or simulations, based on patterns learned from scientific and engineering data. In more detail:
This course is directly relevant to the AuroraGPT project, which is developing a trillion-parameter generative AI model to be trained on Argonne’s new 64,000-GPU Aurora supercomputer. It also connects to the work of the Trillion Parameter Consortium, which engages researchers worldwide seeking to apply generative AI to scientific problems.
The course will be offered as CMSC 35200-1: Deep Learning Systems in the fall quarter of 2023 at the University of Chicago, on Tuesdays and Thursdays, 3:30-4:50 pm. For more information, please contact Profs. Ian Foster and Rick Stevens.
We will study theoretical underpinnings of such models, their training paradigms, and applications. We will explore how these models can generate new data that are statistically similar to their training data, including text, images, and potentially more abstract representations, and how this capacity can be harnessed for scientific discovery and engineering solutions. Key topics include:
By the end of the course, students will have an understanding of how to implement and use generative AI models, how to apply them to problems in science and engineering, and how to navigate the ethical considerations that arise with the use of AI in these fields.
We will spend much of our time reading and discussing key papers in this area. In addition, the course will have a strong practical component, with students training models, applying them to science and engineering problems, and evaluating their performance. Initial ideas of things to cover:
(A work in progress!)
Scientific Data Acquisition and Organization: Robust data lies at the heart of any sophisticated model. Thus we must first curate large-scale scientific datasets, designing approximately 20 specialized “bundles” across domains like biology/biochemistry, materials/chemistry, physics/cosmology, and climate/environment. By focusing on the gaps that existing large language models (LLMs) show on intricate scientific challenges, our data collection aims to be highly targeted and efficient, enhancing the overall capabilities of our models.
Model Evaluation Suite Development: With the curated data in place, the second phase centers on constructing expansive model evaluation suites. These suites, tailored to specific dataset collections and subdomains, will validate data and lay the groundwork for model testing. We plan to use current LLMs to help formulate problems for AuroraGPT to solve, targeting around 1,000 problems for each scientific subdomain. Across roughly 20 subdomains, this yields an infrastructure ready to evaluate models on about 20,000 problems.
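To make the scale of the evaluation suite concrete, here is a minimal Python sketch of how problems might be organized and scored per subdomain. Only the two planning figures (about 20 subdomains, about 1,000 problems each) come from the text above; the `Problem` record, the `evaluate` function, and the exact-match scoring rule are hypothetical choices made for illustration, not the project's actual harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Planning figures taken from the text; everything else is hypothetical.
NUM_SUBDOMAINS = 20            # approximate number of scientific subdomains
PROBLEMS_PER_SUBDOMAIN = 1000  # target number of problems per subdomain


@dataclass(frozen=True)
class Problem:
    subdomain: str   # e.g. "materials/chemistry"
    prompt: str      # question posed to the model
    reference: str   # expected answer used for scoring


def total_problems(num_subdomains: int = NUM_SUBDOMAINS,
                   per_subdomain: int = PROBLEMS_PER_SUBDOMAIN) -> int:
    """Total number of problems the suite is sized for."""
    return num_subdomains * per_subdomain


def evaluate(model: Callable[[str], str],
             problems: List[Problem]) -> Dict[str, float]:
    """Return exact-match accuracy per subdomain (a deliberately simple metric)."""
    seen: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for p in problems:
        seen[p.subdomain] = seen.get(p.subdomain, 0) + 1
        if model(p.prompt).strip().lower() == p.reference.strip().lower():
            correct[p.subdomain] = correct.get(p.subdomain, 0) + 1
    return {name: correct.get(name, 0) / n for name, n in seen.items()}


if __name__ == "__main__":
    # Toy stand-in for a model: it always answers "H2O".
    toy_problems = [
        Problem("materials/chemistry", "Formula of water?", "H2O"),
        Problem("materials/chemistry", "Formula of table salt?", "NaCl"),
        Problem("physics/cosmology", "Symbol for the speed of light?", "c"),
    ]
    print(total_problems())
    print(evaluate(lambda prompt: "H2O", toy_problems))
```

Exact-match scoring is used here only because it is easy to verify; open-ended scientific questions would likely need a more forgiving metric.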
Model Construction and Performance Analysis: This pillar is about putting our data to work through model building. We aim to construct models across diverse scales, from 7B to 1,000B parameters, trained on general text, code, and niche scientific data. Testing and performance analysis will be conducted on the Polaris and Aurora supercomputers. The setup will use technologies like Megatron and DeepSpeed to determine the best parallelism strategies and to tune hardware choices.
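As a rough illustration of what choosing a parallelism strategy involves, the following Python sketch enumerates ways to split a fixed number of GPUs into tensor, pipeline, and data parallel degrees and keeps those whose model state fits in GPU memory. It rests on stated assumptions rather than on Megatron or DeepSpeed internals: about 16 bytes of model state per parameter (a common estimate for mixed-precision Adam training), no ZeRO-style partitioning across data-parallel replicas, and no accounting for activations or buffers. The function names, the `max_tensor` cap, and the GPU count and memory size in the usage example are hypothetical.

```python
from typing import Iterator, List, Tuple

# Assumption: mixed-precision Adam keeps fp16 weights and gradients plus
# fp32 weights, momentum, and variance, about 16 bytes per parameter.
BYTES_PER_PARAM = 16


def model_state_gib(num_params: float, tensor: int, pipeline: int) -> float:
    """Model-state memory per GPU in GiB when split tensor * pipeline ways.

    Ignores activations and communication buffers, and assumes each
    data-parallel replica holds a full copy of its shard.
    """
    return num_params * BYTES_PER_PARAM / (tensor * pipeline) / 2**30


def layouts(num_gpus: int, max_tensor: int = 8) -> Iterator[Tuple[int, int, int]]:
    """Yield (tensor, pipeline, data) degrees whose product equals num_gpus."""
    for tensor in range(1, max_tensor + 1):
        if num_gpus % tensor:
            continue
        rest = num_gpus // tensor
        for pipeline in range(1, rest + 1):
            if rest % pipeline == 0:
                yield tensor, pipeline, rest // pipeline


def feasible(num_params: float, num_gpus: int,
             gpu_mem_gib: float) -> List[Tuple[int, int, int]]:
    """Layouts whose per-GPU model state fits within gpu_mem_gib."""
    return [
        (t, p, d)
        for t, p, d in layouts(num_gpus)
        if model_state_gib(num_params, t, p) <= gpu_mem_gib
    ]


if __name__ == "__main__":
    # Hypothetical setup: a 7B-parameter model on 8 GPUs with 40 GiB each.
    for t, p, d in feasible(7e9, num_gpus=8, gpu_mem_gib=40.0):
        print(f"tensor={t} pipeline={p} data={d} "
              f"state={model_state_gib(7e9, t, p):.1f} GiB/GPU")
```

Under these assumptions, even the 7B model needs its state split at least four ways to fit in 40 GiB, which suggests why parallelism and memory-partitioning choices dominate planning at the 1,000B scale.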
Model Refinement and Deployment: The final phase ensures that our models are not only built but also usable in real-world applications. Refinement will apply post-training steps such as “instruct” (instruction tuning), “RLHF” (reinforcement learning from human feedback), and “Chat” tuning, and might employ pipelines like DeepSpeed RLHF or Alpaca. Automation will be a focus, especially for post-processing and safety checks. Finally, we plan to launch a Web and API platform for internal testing of AuroraGPT at Argonne before its broader release.
This site is accessible at https://tpc-ai.github.io/genAI-SE/.