Performance Oriented Partial Checkpoint and Migration of LAM/MPI Applications

dc.contributor.author: Singh, Rajendra
dc.contributor.examiningcommittee: Stacey, Deborah (University of Guelph); McLeod, Robert (Electrical and Computer Engineering); Thulasiraman, Parimala (Computer Science) [en]
dc.contributor.supervisor: Graham, Peter (Computer Science) [en]
dc.date.accessioned: 2011-01-21T15:31:53Z
dc.date.available: 2011-01-21T15:31:53Z
dc.date.issued: 2011-01-21T15:31:53Z
dc.degree.discipline: Computer Science [en_US]
dc.degree.level: Doctor of Philosophy (Ph.D.) [en_US]
dc.description.abstract: In parallel computing, MPI is heavily used due to its support of popular cluster based parallel machines and the Single Program Multiple Data (SPMD) model. Normally cluster nodes are dedicated to a single parallel job/application but MPI could also be used with nodes that are concurrently shared by multiple users. In this case, nodes could become overloaded with work from other users. Even a few overloaded nodes can result in application slowdown. Thus, it is desirable to relocate affected processes in a running application to lightly loaded nodes by partial checkpointing and migrating of those processes. In some MPI applications, groups of processes communicate frequently with one another. Such groups must be near one another to ensure communication efficiency. Thus, if any member of a group is to be checkpointed and migrated, all should be. It must therefore be possible to identify such groups. I have built a prototype, using LAM/MPI, that supports partial checkpoint, migration and restart of MPI processes. To identify process groups for checkpoint and migration, I adapted TEIRESIAS (an algorithm for pattern discovery from bioinformatics) to identify frequent, recurring patterns of communication using data gathered by LAM/MPI. I then created predictors that use the discovered patterns to predict groups of communicating processes that should be checkpointed and migrated together. I have assessed the effectiveness of my technique using synthetic and real communication data (for a small set of representative applications) to show that my predictors can accurately predict process groups for those applications. Additionally, I have created a simple simulation system to allow me to explore scenarios related to network characteristics and overload conditions under which my system might provide useful speedup. Not all MPI applications will benefit from my approach (e.g. those with unpredictable communication patterns or large groups of frequently communicating processes). However, my experimental and simulation results suggest that my technique should be effective for a number of common application types, network characteristics and overload conditions. Using partial checkpoint and migration should therefore allow many long running applications to finish faster than if a subset of their processes was left running on overloaded nodes. [en]
dc.description.note: February 2011 [en]
dc.format.extent: 4798942 bytes
dc.format.mimetype: application/pdf
dc.identifier.citation: Rajendra Singh and Peter Graham. Grouping MPI Processes for Partial Checkpoint and co-Migration. Euro-Par 2009, 15th International Euro-Par Conference, Proceedings, Delft, The Netherlands, August 25-28, 2009 [en]
dc.identifier.citation: Rajendra Singh and Peter Graham. Performance Driven Partial Checkpoint/Migrate for LAM/MPI. In 22nd International Symposium of High Performance Computing Systems and Applications (HPCS 2008), Quebec City, Canada, June 9-11 2008 [en]
dc.identifier.uri: http://hdl.handle.net/1993/4397
dc.language.iso: eng [en_US]
dc.rights: open access [en_US]
dc.subject: Checkpoint [en]
dc.subject: Migration [en]
dc.subject: Partial [en]
dc.subject: LAM/MPI [en]
dc.title: Performance Oriented Partial Checkpoint and Migration of LAM/MPI Applications [en]
dc.type: doctoral thesis [en_US]
Files
Original bundle
Name: Singh_Rajendra.pdf
Size: 4.58 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.34 KB
Format: Item-specific license agreed to upon submission