从底层突破AI系统算力瓶颈 Breaking Through the Compute Bottlenecks of AI Systems from the Bottom Up

个人简介

我于2023年5月加入上海交通大学电子信息与电气工程学院微纳电子学系,任长聘教轨助理教授。此前,我于2017年博士毕业于北京大学,主攻人工智能加速器研究,期间于2015-2016年赴美国加州大学洛杉矶分校(UCLA)学术访问。毕业后加入微软研究院,从事"云-边-端"人工智能系统及其异构加速研究,任职至主管研究员(Senior Researcher)。2021年初,我加入阿里巴巴平头哥半导体有限公司并行计算团队,作为核心芯片架构师,负责分布式人工智能算法加速及集群芯片互联相关功能设计。

Biography

I joined the Department of Micro-Nano Electronics at the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, as a tenure-track Assistant Professor in May 2023. Prior to this, I obtained my Ph.D. from Peking University in 2017, specializing in research on artificial intelligence accelerators; during my doctoral studies, I was a visiting scholar at the University of California, Los Angeles (UCLA) in 2015-2016. After graduation, I joined Microsoft Research, where I worked on cloud-edge-end artificial intelligence systems and their heterogeneous acceleration, progressing to the position of Senior Researcher. In early 2021, I joined the parallel computing team of Alibaba T-Head Semiconductor Co., Ltd. as a core chip architect, responsible for distributed artificial intelligence algorithm acceleration and cluster chip interconnect-related functional design.

学术工作

自2012年起,我从事AI芯片设计的相关研究,经历了以AI芯片为代表的新一代并行处理器架构从萌芽到壮大的过程。十余年来,我专注于面向人工智能系统的体系结构和设计自动化研究,通过"算法-系统-硬件"的跨层次垂直整合和完整的软硬件架构协同优化,不断突破AI芯片的性能瓶颈,支撑高速增长的算力需求。围绕上述研究方向,我在体系结构、EDA和人工智能领域的重要国际会议和期刊发表论文21篇(其中CCF-A类10篇),最高单篇论文被引用超过2000次,并获得多项奖励和荣誉,包括:FPGA-15最佳论文提名,FPGA会议历史最高被引论文,MICRO-2023 Top Picks年度最佳论文提名,Donald O. Pederson 2017-2019年最佳论文奖(中国大陆首位获奖者),2021/2022/2023年AI 2000人工智能芯片技术领域世界最有影响力学者第二名,2019/2021年Elsevier全球前2%高被引科学家(计算机硬件与体系结构领域),ACM ChinaSys新星奖,以及微软研究院院长特别奖等。

我致力于同时具有学术贡献和产业影响力的研究。博士期间,我所开发的FPGA人工智能加速器性能建模与自动化设计方法被欧美几乎所有拥有AI硬件产品的公司参考使用,包括Google、Microsoft、NVIDIA、Xilinx、AMD等。在UCLA访问期间,我基于FPGA高层次综合(HLS)工具开发的自动化人工智能加速框架被上万人使用,并在创业公司Falcon Computing落地(该公司于2020年被美国Xilinx公司收购)。在微软,我完成了多项重要科研成果转化,所研发的异构硬件加速技术在微软"云-边-端"各级人工智能系统中获得部署和应用;由于突出贡献,我获得微软研究院院长签发的院长特别奖。我提出的稀疏神经网络加速方法在国际GPU芯片巨头NVIDIA新一代Ampere架构和Hopper架构的稀疏张量运算单元中得到应用。2021年3月至2023年4月,我在阿里巴巴平头哥半导体有限公司并行计算团队工作。作为AI芯片核心架构团队成员之一,我完整参与了平头哥AI并行计算芯片从零到一的设计、验证、生产流片全过程,并负责完成了多项架构设计的研究和落地,包括高性能跨芯片互联架构、分布式通信软硬件接口、内存模型等。作为平头哥并行计算团队唯一的学术合作联系人,我致力于将国际先进研究成果引入工程团队,包括新一代稀疏张量运算单元、Transformer加速器等,并主持和促进了多项研究团队与平头哥的技术研究合作。

Research Summary

Since 2012, I have been engaged in research on AI chip design, witnessing the emergence and growth of the new generation of parallel processor architectures represented by AI chips. Over the past decade, I have focused on computer architecture and design automation for AI systems, achieving holistic optimization of software and hardware through cross-layer vertical integration of "algorithm-system-hardware." This approach has continuously pushed the performance boundaries of AI chips, supporting the rapidly growing demand for computing power. In these research areas, I have published 21 papers in leading international conferences and journals on computer architecture, EDA, and artificial intelligence, including 10 CCF-A papers; my most-cited single paper has received over 2,000 citations. I have received several awards and honors, including the FPGA-15 Best Paper Nomination, first place among the most highly cited papers in the history of the FPGA conference, a MICRO-2023 Top Picks Honorable Mention, the Donald O. Pederson Best Paper Award for 2017-2019 (the first winner from mainland China), recognition as an AI 2000 Most Influential Scholar in AI Chip Technology (second place) in 2021/2022/2023, inclusion in Elsevier's list of the world's top 2% highly cited scientists in Computer Hardware and Architecture in 2019/2021, the ACM ChinaSys Rising Star Award, and the Dean's Special Award of Microsoft Research, among others.

I am committed to conducting research that combines academic contributions with industrial impact. During my Ph.D. studies, the FPGA AI-accelerator performance modeling and automated design methods I developed were referenced and used by nearly all companies with AI hardware products in the US and Europe, including Google, Microsoft, NVIDIA, Xilinx, and AMD. During my visit to UCLA, the automated AI acceleration framework I built on FPGA high-level synthesis (HLS) tools was adopted by tens of thousands of users and commercialized in the startup Falcon Computing (acquired by Xilinx in 2020). At Microsoft, I delivered several significant technology transfers: the heterogeneous hardware acceleration technologies I developed have been deployed and applied across all levels of Microsoft's cloud-edge-end AI systems, and for these outstanding contributions I received the Dean's Special Award from Microsoft Research. The sparse neural network acceleration methods I proposed have been adopted in the sparse tensor units of NVIDIA's new-generation Ampere and Hopper GPU architectures. From March 2021 to April 2023, I worked in the parallel computing team of Alibaba T-Head Semiconductor Co., Ltd. As a member of the core AI chip architecture team, I participated in the complete, from-scratch development of T-Head's AI parallel computing chip, spanning design, verification, and production tape-out, and I was responsible for the research and implementation of multiple architectural features, including the high-performance inter-chip interconnect architecture, distributed communication software-hardware interfaces, and the memory model. As the sole academic-collaboration liaison of the T-Head parallel computing team, I am dedicated to bringing advanced international research into the engineering team, including new-generation sparse tensor units and Transformer accelerators, and I have led and facilitated multiple technical research collaborations between academic research teams and T-Head.