Research Working Group

Storage Working Group

Mission

Define and develop the MLPerf™ Storage benchmarks to characterize performance of storage systems that support machine learning workloads.

Purpose

Storing and processing training data is a crucial part of the machine learning (ML) pipeline. The way we ingest, store, and serve data into ML frameworks can significantly impact the performance of training and inference, as well as resource costs. However, even though data management can be a significant bottleneck, it has received comparatively little attention and specialization for ML.

The main goal of this working group is to create a benchmark that evaluates performance for the most important storage aspects in ML workloads, including data ingestion, training, and inference. Our end goal is to create a storage benchmark for the full ML pipeline which is compatible with diverse software frameworks and hardware accelerators. The benchmark will not require any specific hardware for performing computation.

Creating this benchmark will establish best practices in measuring storage performance in ML, contribute to the design of next generation systems for ML, and help system engineers find the right sizing of storage relative to compute in ML clusters.

Deliverables

- Storage access traces for representative ML applications, from the applications’ perspective. Our initial targets are Vision, NLP, and Recommenders. (Short-term goal)
- Storage benchmark rules for:
  - Data ingestion phase (Medium-term goal)
  - Training phase (Short-term goal)
  - Inference phase (Long-term goal)
  - Full ML pipeline (Long-term goal)
- Flexible generator of datasets:
  - Synthetic workload generator based on analysis of I/O in real ML traces, which is aware of compute think-time. (Short-term goal)
  - Trace replayer that scales the workload size. (Long-term goal)
- User-friendly testing harness that is easy to deploy with different storage systems. (Medium-term goal)
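
To make the idea of a think-time-aware synthetic workload concrete, here is a minimal Python sketch. It is not the working group's generator. The function names (`emulate_training_io`, `make_synthetic_dataset`, `run_demo`), the file layout, and every parameter value are assumptions chosen for illustration. The sketch reads fixed-size batches from a synthetic file and sleeps between reads to stand in for accelerator compute, which is one way a benchmark can avoid requiring specific compute hardware.

```python
import os
import tempfile
import time


def make_synthetic_dataset(directory, num_bytes):
    """Write a file of random bytes to act as a stand-in dataset (hypothetical layout)."""
    path = os.path.join(directory, "synthetic.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(num_bytes))
    return path


def emulate_training_io(path, num_batches, batch_bytes, think_time_s):
    """Read up to `num_batches` batches, sleeping after each one.

    The sleep emulates per-batch compute ("think time"), so no accelerator
    is needed. Returns (bytes_read, io_seconds, total_seconds).
    """
    bytes_read = 0
    io_seconds = 0.0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(num_batches):
            t0 = time.perf_counter()
            data = f.read(batch_bytes)
            io_seconds += time.perf_counter() - t0
            if not data:  # reached end of file early
                break
            bytes_read += len(data)
            time.sleep(think_time_s)  # emulated compute between reads
    total_seconds = time.perf_counter() - start
    return bytes_read, io_seconds, total_seconds


def run_demo():
    """Run the emulation against a small temporary dataset (illustrative sizes)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = make_synthetic_dataset(tmp, 4 * 1024)
        return emulate_training_io(
            path, num_batches=4, batch_bytes=1024, think_time_s=0.001
        )


if __name__ == "__main__":
    n, io_s, total_s = run_demo()
    print(f"read {n} bytes; I/O time {io_s:.6f}s of {total_s:.6f}s total")
```

Comparing the I/O time with the total time gives a rough sense of whether storage or the emulated compute dominates. A freshly written file this small will almost certainly be served from the operating system's page cache, so the sketch only illustrates the structure of such a workload, not a realistic measurement.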

Meeting Schedule

Weekly on Friday from 8:00-9:00AM Pacific.

How to Join

Use this link to request to join the group/mailing list, and receive the meeting invite:
Storage Google Group.
Requests are manually reviewed, so please be patient.

Working Group Resources

Shared documents and meeting minutes:

Associate a Google account with your e-mail address.
Ask to join our Public Google Group.
Once approved, go to the Storage folder in our Public Google Drive.

Working Group Chairs

Oana Balmau (oana.balmau@cs.mcgill.ca)

Curtis Anderson (canderson@panasas.com)

Johnu George (johnu.george@nutanix.com)

Vice Chairs

Huihuo Zheng (huihuo.zheng@anl.gov)

Working Group Chair Bios

Oana is an Assistant Professor in the School of Computer Science at McGill University. Her research focuses on storage systems and data management systems, with an emphasis on large-scale data management for machine learning, data science, and edge computing. She completed her PhD at the University of Sydney, advised by Prof. Willy Zwaenepoel. Before her PhD, Oana earned her Bachelor's and Master's degrees in Computer Science from EPFL.

Curtis is a filesystem developer at heart, having spent the last 36 of his 45 years of programming experience working on filesystems and nearly every type of storage-related technology. He’s currently working at Panasas, helping steer PanFS toward a more commercial view of the HPC market. He also enjoys watching the business side of the house do its thing; it is foreign to tech but has its own internal logic and “architecture”.

Johnu George is a staff engineer at Nutanix with a wealth of experience in building production-grade cloud-native platforms and large-scale hybrid data pipelines. His research interests include machine learning system design, distributed learning infrastructure improvements, and ML workload characterization. He is an active open source contributor and has steered several industry collaborations on projects like Kubeflow, Apache Mnemonic, and Knative. He is an Apache PMC member and currently chairs the Kubeflow Training and AutoML working groups.

Huihuo Zheng is a computer scientist at Argonne National Laboratory. His research interests include data management and parallel I/O for deep learning applications, as well as large-scale distributed training on HPC supercomputers. He also applies HPC and deep learning to solve challenging domain science problems in physics, chemistry, and materials science. Huihuo received his PhD in Physics from the University of Illinois at Urbana-Champaign in 2016.