Performance Oriented Partial Checkpoint and Migration of LAM/MPI Applications
In parallel computing, MPI is heavily used because it supports popular cluster-based parallel machines and the Single Program Multiple Data (SPMD) model. Normally, cluster nodes are dedicated to a single parallel job or application, but MPI can also be used on nodes that are shared concurrently by multiple users. In this case, nodes can become overloaded with work from other users, and even a few overloaded nodes can slow down an application. It is therefore desirable to relocate the affected processes of a running application to lightly loaded nodes by partially checkpointing and migrating those processes. In some MPI applications, groups of processes communicate frequently with one another. Such groups must be placed near one another to ensure communication efficiency, so if any member of a group is to be checkpointed and migrated, all members should be. It must therefore be possible to identify such groups. I have built a prototype, using LAM/MPI, that supports partial checkpoint, migration and restart of MPI processes. To identify process groups for checkpoint and migration, I adapted TEIRESIAS (a pattern discovery algorithm from bioinformatics) to identify frequent, recurring patterns of communication using data gathered by LAM/MPI. I then created predictors that use the discovered patterns to predict groups of communicating processes that should be checkpointed and migrated together. I have assessed the effectiveness of my technique using synthetic and real communication data (for a small set of representative applications) to show that my predictors can accurately predict process groups for those applications. Additionally, I have created a simple simulation system that allows me to explore the network characteristics and overload conditions under which my system might provide useful speedup. Not all MPI applications will benefit from my approach (e.g., those with unpredictable communication patterns or large groups of frequently communicating processes). However, my experimental and simulation results suggest that my technique should be effective for a number of common application types, network characteristics and overload conditions. Using partial checkpoint and migration should therefore allow many long-running applications to finish faster than if a subset of their processes was left running on overloaded nodes.
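The group-based migration rule described above (if any member of a frequently communicating group must move, the whole group moves) can be illustrated with a minimal sketch. This is not the TEIRESIAS-based predictor used in this work: it substitutes a much simpler stand-in that treats two ranks as grouped when they exchange at least a threshold number of messages. The function names, the `(source, destination)` trace format and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def communication_groups(messages, min_count):
    """Group ranks linked by at least `min_count` messages.

    `messages` is a hypothetical trace of (source_rank, dest_rank) pairs.
    Simplified stand-in for pattern discovery: simple frequency thresholding
    followed by union-find, not TEIRESIAS.
    """
    counts = defaultdict(int)
    for src, dst in messages:
        counts[frozenset((src, dst))] += 1  # treat traffic as undirected

    parent = {}

    def find(rank):
        parent.setdefault(rank, rank)
        while parent[rank] != rank:
            parent[rank] = parent[parent[rank]]  # path halving
            rank = parent[rank]
        return rank

    # Register every rank so that isolated ranks form singleton groups.
    for src, dst in messages:
        find(src)
        find(dst)

    # Merge ranks whose pairwise traffic reaches the threshold.
    for pair, n in counts.items():
        if n >= min_count and len(pair) == 2:
            a, b = pair
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for rank in list(parent):
        groups[find(rank)].add(rank)
    return list(groups.values())


def ranks_to_migrate(groups, rank_to_node, overloaded_nodes):
    """Select whole groups that have at least one member on an overloaded node."""
    selected = set()
    for group in groups:
        if any(rank_to_node[r] in overloaded_nodes for r in group):
            selected |= group
    return selected


# Hypothetical trace: ranks 0-1-2 talk often, 2-3 rarely, 4-5 often.
trace = [(0, 1)] * 5 + [(1, 2)] * 4 + [(2, 3)] + [(4, 5)] * 6
placement = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}

groups = communication_groups(trace, min_count=3)
print(sorted(ranks_to_migrate(groups, placement, {"a"})))  # [0, 1, 2]
```

In this example only node `a` is overloaded, yet rank 2 on node `b` is also selected because it belongs to the same communicating group as ranks 0 and 1. Rank 3 communicates with the group too rarely to be pulled along.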