
Jin Xin and Liu Zhenzhe's research group at the School of Computer Science makes important progress in software-defined methods and systems

Date: 2022-05-14

The research group of Associate Professor Jin Xin and Researcher Liu Zhenzhe at the Institute of Software, School of Computer Science, has continued its progress in software-defined methods and systems for ubiquitous computing. The group recently had two papers accepted at ACM SIGCOMM 2022 and one paper accepted at ACM MobiSys 2022, both top international conferences.


The Internet and its extensions are driving the continued convergence of cyberspace, the physical world, and human society. The result is a ubiquitous computing environment, and the form of computing systems is going through a new round of change. To build a new digital information infrastructure that is "high-speed and ubiquitous, space-ground integrated, cloud-network converged, intelligent and agile, green and low-carbon, secure and controllable", basic system software must support efficient, on-demand, and intelligent sharing and scheduling of distributed, heterogeneous resources. Since the success of software-defined networking in network resource management, "software-defined" has been regarded as one of the mainstream approaches to building the next generation of ubiquitous computing systems, and research in this area has been very active.


In recent years, Associate Professor Jin Xin and Researcher Liu Zhenzhe of the School of Computer Science, Peking University, have built their work on the basic software-defined principle of "ubiquitous resource virtualization + programmable management tasks". On this basis they have studied the fundamental nature, structure, construction methods, and operating mechanisms of system software for ubiquitous computing, and have produced a series of important results.


The foundation of software definition is a reliable programmable plane. In a ubiquitous computing environment, domain-specific hardware (ASICs), hypervisors, and compilers all inevitably introduce errors, and these errors are intertwined. The paper "Meissa: Scalable Network Testing for Programmable Data Planes" addresses the reliability of programmable data planes in the presence of complex errors from multiple sources. It proposes a highly scalable automated testing technique based on domain-specific code summaries, which shrinks the program's control flow graph without losing coverage and achieves 100% path coverage on production-scale programmable data plane programs. The paper was published at ACM SIGCOMM 2022. Its first author is Zheng Naiqian, an undergraduate computer science major who enrolled in 2018, and it is the first SIGCOMM paper from China with an undergraduate student as the sole first author.
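To illustrate why shrinking the control flow graph matters for path coverage, the following is a minimal sketch of how path counts grow in a pipeline of branching stages. It is a toy model, not Meissa's algorithm: the functions `pipeline_cfg` and `count_paths`, the node names, and the assumption that every stage fans out and then rejoins are all hypothetical choices made for illustration.

```python
from functools import lru_cache


def pipeline_cfg(branches_per_stage):
    """Build a toy control flow graph (assumed shape, for illustration only).

    Each stage fans out into `n` alternative actions and then merges into a
    join node, which is the entry of the next stage.
    Returns (graph, entry_node, exit_node).
    """
    cfg = {}
    current = "start"
    for i, n in enumerate(branches_per_stage):
        join = f"join{i}"
        actions = [f"s{i}_a{j}" for j in range(n)]
        cfg[current] = tuple(actions)
        for action in actions:
            cfg[action] = (join,)
        current = join
    return cfg, "start", current


def count_paths(cfg, entry, exit_node):
    """Count entry-to-exit paths in an acyclic graph using memoization."""

    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == exit_node:
            return 1
        return sum(paths_from(nxt) for nxt in cfg.get(node, ()))

    return paths_from(entry)


if __name__ == "__main__":
    cfg, entry, exit_node = pipeline_cfg([3, 4, 5])
    print(count_paths(cfg, entry, exit_node))  # prints 60
```

In this toy model the number of paths is the product of the per-stage branch counts, so every branch that can be removed from one stage divides the number of paths a tester has to cover.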


Software definition also requires the programmable plane to schedule underlying resources on demand, according to the workload and the application scenario. As machine learning has become an important workload, the research group has proposed new resource management and scheduling methods for model training in two settings: distributed machine learning on clusters and on-device machine learning.


Model training for distributed machine learning is an active research topic in academia. However, most existing methods assume that a training job has exclusive use of its GPUs and do not consider sharing across multiple resource types, which leads to low resource utilization and limits how quickly jobs complete. The paper "Multi-Resource Interleaving for Deep Learning Training" proposes Muri, a method that jointly schedules CPU, GPU, storage, and network resources. Exploiting the staged, iterative nature of deep learning training, it designs a fine-grained multi-resource interleaving mechanism that lets deep learning jobs share resources. It also proposes a scheduling algorithm based on the Blossom algorithm to maximize interleaving efficiency, which significantly improves resource utilization and shortens job completion time. Experiments on a real cluster and with traces from production environments show that the method improves average job completion time (JCT) by a factor of 3.6 and makespan by a factor of 1.6. The paper was published at ACM SIGCOMM 2022. Its first author is Yihao Zhao, a direct-entry PhD student who enrolled in 2021.
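The idea of grouping jobs so that their resource stages interleave can be sketched as a matching problem. The example below is a minimal sketch under strong assumptions, not the paper's formulation: each job is reduced to two stages per iteration (a hypothetical CPU time and GPU time), a pair of jobs overlaps one job's CPU stage with the other's GPU stage, and the best pairing is found by exhaustive search over an even number of jobs. Exhaustive search stands in here for a maximum-weight matching algorithm such as Blossom, which solves the same pairing problem efficiently on larger inputs. All function names and numbers are illustrative.

```python
def interleaved_period(job_a, job_b):
    """Time for one iteration of both jobs when their stages are overlapped.

    Each job is a (cpu_time, gpu_time) tuple. In the first slot job A uses the
    CPU while job B uses the GPU; in the second slot they swap. Each slot lasts
    as long as its slower stage.
    """
    cpu_a, gpu_a = job_a
    cpu_b, gpu_b = job_b
    return max(cpu_a, gpu_b) + max(gpu_a, cpu_b)


def time_saved(job_a, job_b):
    """Time saved per iteration compared with running the two jobs back to back."""
    return sum(job_a) + sum(job_b) - interleaved_period(job_a, job_b)


def best_pairing(jobs):
    """Exhaustively find the pairing of jobs that maximizes total time saved.

    `jobs` maps a job name to (cpu_time, gpu_time) and must hold an even
    number of jobs. Returns (total_time_saved, list_of_pairs).
    """
    names = sorted(jobs)
    if len(names) < 2:
        return 0, []
    first, rest = names[0], names[1:]
    best_score, best_pairs = float("-inf"), []
    for partner in rest:
        remaining = {n: jobs[n] for n in rest if n != partner}
        sub_score, sub_pairs = best_pairing(remaining)
        score = time_saved(jobs[first], jobs[partner]) + sub_score
        if score > best_score:
            best_score, best_pairs = score, [(first, partner)] + sub_pairs
    return best_score, best_pairs


if __name__ == "__main__":
    # Hypothetical per-iteration stage times (cpu, gpu), in arbitrary units.
    jobs = {"a": (4, 1), "b": (1, 4), "c": (3, 3), "d": (2, 2)}
    print(best_pairing(jobs))  # prints (9, [('a', 'b'), ('c', 'd')])
```

In this toy input, the CPU-heavy job "a" and the GPU-heavy job "b" complement each other, so pairing them saves the most time. Jobs with similar profiles gain less from sharing.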


On-device machine learning has unique advantages in scenarios such as privacy-preserving computing and harsh environments (for example, with no network connection), but the limited compute, storage, and power of end devices also raises additional challenges. The paper "Melon: Breaking the Memory Wall for Resource-Efficient On-Device Machine Learning" proposes Melon, a memory-adaptive optimization framework, together with three techniques: memory allocation tailored to deep learning, progressive recomputation strategy generation, and dynamic memory budget adjustment. Together they bring deep software definition to memory management on end devices. Experiments show that, compared with the baseline method, Melon increases the training batch size on end devices by up to 4 times, which greatly shortens the convergence time of on-device training in federated learning, and greatly reduces both context-switching overhead under dynamic memory conditions and energy consumption. The paper was published at ACM MobiSys 2022. Its first author is Wang Qipeng, a direct-entry PhD student who enrolled in 2020. This is the group's latest result on software-defined methods for on-device machine learning systems, following DeepCache (MobiCom 2018), ELSA (WWW 2019, the first WWW Best Paper Award won by Chinese scholars), Elf (MobiSys 2020), and FLASH (WWW 2021).
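The link between recomputation and batch size can be illustrated with a back-of-the-envelope memory model. The sketch below is generic activation checkpointing arithmetic under simplifying assumptions, not Melon's actual strategy: every layer is assumed to produce an activation of the same size per sample, one checkpoint is kept per segment of layers, and the activations inside a segment are recomputed during the backward pass. The function names and all numbers are hypothetical.

```python
import math


def peak_activation_units(num_layers, segment):
    """Number of layer activations held in memory at the same time.

    With segment <= 1, no recomputation is used and every layer's activation
    is kept. Otherwise one checkpoint per segment is kept, plus the
    activations of the single segment currently being recomputed.
    """
    if segment <= 1:
        return num_layers
    checkpoints = math.ceil(num_layers / segment)
    return checkpoints + segment


def max_batch_size(budget_mb, per_sample_layer_mb, num_layers, segment=1):
    """Largest batch whose activations fit in the given memory budget."""
    units = peak_activation_units(num_layers, segment)
    return int(budget_mb // (units * per_sample_layer_mb))


if __name__ == "__main__":
    # Hypothetical device: 512 MB for activations, 16 layers, 2 MB per layer per sample.
    print(max_batch_size(512, 2, 16))             # prints 16 (no recomputation)
    print(max_batch_size(512, 2, 16, segment=4))  # prints 32 (checkpoint every 4 layers)
```

Under these assumptions, trading extra computation for memory doubles the batch size that fits in the same budget. A real framework also has to account for weights, optimizer state, and a budget that changes at run time.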


