[12-7] "Frontiers of High-Performance Scientific Computing" Invited Talk Series, No. 3 -- Reliable Matrix Computations via Algorithm-Based Fault Tolerance

Published: 2016-12-07

  

  Frontiers of High-Performance Scientific Computing Invited Talk Series, No. 3

   Title: Reliable Matrix Computations via Algorithm-Based Fault Tolerance 

  Speaker:   Prof. Zizhong Chen 

                    Department of Computer Science and Engineering      

                    University of California, Riverside 

  

  Time: 13:30, Wednesday, Dec. 7, 2016

  Venue: Mid Conference Room, Level 4, Building 5,

  Institute of Software, Chinese Academy of Sciences.

  Abstract: 

  Errors are common in today's computer systems. If the affected application keeps running after an error occurs, we call it a fail-continue error; otherwise, we call it a fail-stop error. In this talk, I will discuss our recent work on algorithm-based fault tolerance for reliable matrix computations. We have developed highly efficient error correction techniques for selected, widely used matrix computation algorithms that tolerate both fail-continue and fail-stop errors. By leveraging the specific algorithmic characteristics of these algorithms, the proposed techniques achieve much higher efficiency than the traditional general techniques (i.e., Triple Modular Redundancy for fail-continue errors and checkpoint/restart for fail-stop errors).
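
  For readers new to the topic, the textbook checksum idea behind algorithm-based fault tolerance for matrix multiplication can be sketched as follows. This is an illustrative sketch only and is not taken from the talk: the function names (matmul, with_column_checksum, with_row_checksum, locate_and_correct) and the small example matrices are assumptions made for this example. It uses integer entries so that checksums can be compared exactly; floating-point data would need a tolerance instead.

```python
def matmul(a, b):
    """Plain product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def with_column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def locate_and_correct(cf):
    """Detect and fix one corrupted data element in a full-checksum product.

    cf has n+1 rows and m+1 columns; the last row and column are checksums.
    Returns the (row, col) that was corrected, or None if cf is consistent.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:m]) != cf[i][m]]
    bad_cols = [j for j in range(m)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("not a single corrupted data element")
    i, j = bad_rows[0], bad_cols[0]
    # Rebuild the element from its row checksum and the rest of the row.
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)
    return (i, j)


# Hypothetical example data, chosen only for illustration.
a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
cf = matmul(with_column_checksum(a), with_row_checksum(b))
cf[0][1] += 100  # simulate a silent (fail-continue) error in the result
fixed_at = locate_and_correct(cf)
c = [row[:-1] for row in cf[:-1]]  # strip checksums to get the product
```

  The product of the checksum-extended inputs carries its own row and column checksums, so a single corrupted entry shows up as exactly one inconsistent row and one inconsistent column, and can be repaired without recomputing the product or running it three times.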

    

  Short Bio: 

  Zizhong Chen is an Associate Professor in the Department of Computer Science and Engineering at the University of California, Riverside. He specializes in reliable and high-performance scientific computing, numerical algorithms and software, and algorithm-based fault tolerance. He has published over 80 papers, many of them in highly competitive conferences and journals such as HPDC, PPoPP, SC, ICS, IPDPS, TPDS, TC, JPDC, PARCO, SIMAX, SISC, and IBMRD. He has received a CAREER Award from the U.S. National Science Foundation and a Best Paper Award from the International Supercomputing Conference. Dr. Chen is a Senior Member of the IEEE and a Life Member of the ACM. He currently serves as a Subject Area Editor for the Elsevier journal Parallel Computing and as an Associate Editor for IEEE Transactions on Parallel and Distributed Systems.

  All are welcome!