Overview
Grids hold great promise for researchers who need to connect to remote computers, databases and other resources. However, they can be complex and unreliable, and even low-level operations can require a huge time investment. The Virtual Grid Application Development Software (VGrADS) project was created to keep these problems from impeding the potential of grids and distributed resources.
Building on the earlier GrADS project, VGrADS collaborators developed software tools that simplify and accelerate the development of grid applications and services while delivering high levels of performance and resource efficiency. RENCI’s role in the VGrADS project focused on using temporal-based reasoning to qualitatively assess, diagnose and adapt long-running applications. RENCI also investigated qualitative metrics for a multi-level fault-tolerance strategy, especially for workflows with strict deadlines, such as weather forecasting, and implemented and evaluated different fault-tolerance and recovery mechanisms for such workflows.
The VGrADS project completed its work in September 2009.
Funding
National Science Foundation Cooperative Agreement No. CCR-0331645, issued to Rice University, with a subagreement to the University of North Carolina at Chapel Hill.
Co-Principal Investigators
Fran Berman, University of California at San Diego
Henri Casanova, University of Hawaii
Keith Cooper, Chuck Koelbel, Richard Tapia, Linda Torczon, Rice University
Jack Dongarra, University of Tennessee at Knoxville
Lennart Johnsson, University of Houston
Carl Kesselman, University of Southern California Information Sciences Institute
Richard Wolski, University of California at Santa Barbara
Project Team
Daniel Reed (co-PI until December 2007)
Anirban Mandal (co-PI, December 2007 to September 2009)
Gopi Kandaswamy
Emma Buneci (Student)
Publications
L. Ramakrishnan, D. Nurmi, A. Mandal, C. Koelbel, D. Gannon, T. M. Huang, Y. S. Kee, G. Obertelli, K. Thyagaraja, R. Wolski, A. YarKhan and D. Zagorodnov, “VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance,” in Proceedings of the IEEE/ACM International Conference for High Performance Computing, Networking, Storage and Analysis (SC), November 2009.
R. Zhang, A. Mandal, C. Koelbel and K. Cooper, “Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids,” in Proceedings of the IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid), pp. 244-251, May 2009.
G. Kandaswamy, A. Mandal and D. A. Reed, “Fault Tolerance and Recovery of Scientific Workflows on Computational Grids,” in Proceedings of the IEEE International Symposium on Cluster Computing and the Grid (CCGrid), pp. 777-782, May 2008.
F. Berman, H. Casanova, A. Chien, K. Cooper, H. Dail, A. Dasgupta, W. Deng, J. Dongarra, L. Johnsson, K. Kennedy, C. Koelbel, B. Liu, X. Liu, A. Mandal, G. Marin, M. Mazina, J. Mellor-Crummey, C. Mendes, A. Olugbile, M. Patel, D. Reed, Z. Shi, O. Sievert, H. Xia and A. YarKhan, “New Grid Scheduling and Rescheduling Methods in the GrADS Project,” in International Journal of Parallel Programming (IJPP), vol. 33(2-3), pp. 209-229, 2005.
Presentations
“VGrADS: Enabling e-Science Workflows on Grids and Clouds with Fault Tolerance” – Presentation (remote) at the RENCI booth at Supercomputing (SC 2009), November 2009
Anirban Mandal, “Fault Tolerance and Recovery of Scientific Workflows on Computational Grids” – Presentation at International Symposium on Cluster Computing and the Grid (CCGrid 2008), May 2008
Anirban Mandal, “Fault-tolerance on Slots” – Presentation at VGrADS All Hands meeting, April 2008
Anirban Mandal, “Virtual Grid Execution System: Fault Tolerance Planning and Run-time Rescheduling of Scientific Workflows” – Presentation at the RENCI booth at Supercomputing (SC 2007), November 2007
Anirban Mandal, Gopi Kandaswamy and Daniel Reed, “Fault tolerance and Recovery for Grid Workflow Systems” – Presentation at VGrADS All Hands meeting, April 2007
Daniel A. Reed, presentation on Fault Tolerance at VGrADS All Hands Meeting, September 2005
Daniel A. Reed, presentation at the VGrADS Site Visit, April 2005
Partners
Rice University
University of California at San Diego
University of California at Santa Barbara
University of Houston
University of North Carolina at Chapel Hill
University of Southern California Information Sciences Institute
University of Tennessee at Knoxville
RENCI
Links
VGrADS Project Website


















