Comparisons of different propensity score methods in a multilevel framework: implications for cluster-based program evaluation
Background: Propensity score (PS) methods have been used to minimize bias when inferring causal effects from observational or experimental studies in which participants are not randomly assigned to treatment conditions. Conventional PS methods were developed for independently sampled, non-nested data. However, in health, psychology, organizational sciences, and education, data often have a multilevel or hierarchical structure. In cluster-based intervention programs where clusters are the unit of assignment, each cluster has its own probability (PS) of being assigned to the treatment group, and this probability is associated with factors at both the individual and cluster levels. Both methodological and empirical research on the use of PS methods to estimate treatment effects with multilevel data from cluster-based programs is lacking. Objectives: The objectives of this study are (i) to compare the performance of PS models and PS conditioning methods in reproducing treatment effect estimates with multilevel data from cluster-based programs; and (ii) to examine the impact of different PS methods on the evaluation of a school-based mental health prevention program and investigate the implications of these methods for program evaluation. Methods: Using Monte Carlo simulations, we examined the appropriateness of using PS methods to reproduce treatment effect estimates in cluster-based programs. The data simulations incorporated a clustered observational study (COS) design with treatment assignment at the cluster level. The design factors in the simulation study were cluster size, number of clusters, intra-class correlation (ICC), and treatment effect size. Specifically, this study compared two PS models and four PS conditioning methods across simulation scenarios defined by these design factors.
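The simulation design described above can be sketched roughly as follows. This is a minimal illustration, not the data-generating model used in the study: the function name `simulate_cos`, the covariates `x` and `w`, and all coefficient values are hypothetical, and the ICC here is the residual (conditional) ICC of a random-intercept outcome model.

```python
import numpy as np

def simulate_cos(n_clusters=50, cluster_size=20, icc=0.1, effect=0.5, seed=0):
    """Simulate one clustered observational study with cluster-level treatment.

    All coefficients below are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    w = rng.normal(size=n_clusters)                  # cluster-level covariate
    x = rng.normal(size=n_clusters * cluster_size)   # individual-level covariate
    x_bar = np.bincount(cluster, weights=x) / cluster_size
    # Whole clusters are assigned to treatment; the assignment probability
    # depends on factors at both the cluster and individual levels.
    p_treat = 1.0 / (1.0 + np.exp(-(0.5 * w + 0.5 * x_bar)))
    z_cluster = rng.binomial(1, p_treat)
    z = z_cluster[cluster]
    # Random-intercept outcome: residual variance is fixed at 1, so the
    # between-cluster share of the residual variance equals the ICC.
    u = rng.normal(scale=np.sqrt(icc), size=n_clusters)
    e = rng.normal(scale=np.sqrt(1.0 - icc), size=x.size)
    y = effect * z + 0.3 * x + 0.3 * w[cluster] + u[cluster] + e
    return {"cluster": cluster, "x": x, "w": w[cluster], "z": z, "y": y}
```

Varying `n_clusters`, `cluster_size`, `icc`, and `effect` over a grid and repeating the draw with different seeds would give one Monte Carlo scenario per combination of design factors.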
The first PS model disaggregates cluster-level covariates to the individual level and uses a logistic regression at the individual level to estimate PSs for individuals; the second PS model aggregates lower-level covariates to the cluster level and performs a logistic regression at the cluster level. Four conditioning techniques (covariate adjustment, stratification, weighting, and matching) were combined with each of the two PS models to estimate the average treatment effect (ATE) or the average treatment effect on the treated (ATT). The performance of these PS methods was assessed using relative bias, mean squared error (MSE), and 95% confidence interval (CI) coverage across the simulated scenarios. We also applied the different PS methods to the evaluation of a real mental health prevention program, the PAX Good Behavior Game (PAX). The impact of the different methods on the PAX evaluation was illustrated using three-level multilevel regression combined with PS methods. Results: The results of our simulation study suggest that the performance of PS analyses depends on the PS estimation model (i.e., individual-level vs. cluster-level PS model) and the conditioning strategy (i.e., matching, stratification, covariate adjustment, or weighting), as well as on other factors including the number of clusters and the ICC. Overall, the individual-level PS model performed better than the cluster-level PS model when combined with the same conditioning method, and PS-based methods generated less biased and more stable estimates when the number of clusters was large. Among the conditioning methods, covariate adjustment (adjusting for the PS) and weighting produced less biased and more stable estimates than stratification when estimating the ATE, and weighting and stratification produced more reliable estimates than matching when estimating the ATT. When the number of clusters (e.g., schools) is large, the differences among PS methods in the estimated program effect size are minimal.
This was revealed by applying the PS methods to the PAX program data. However, using the PS methods reduced the covariate imbalance at both the individual and cluster levels. Conclusions and significance: In evaluating cluster-based programs with treatment assigned at the cluster level, it is important to consider the potential bias due to imbalance across treatment arms at both the individual and cluster levels. PS-based methods have the potential to reduce this imbalance and produce more accurate estimates of treatment effects. Overall, the individual-level PS models fared slightly better than the cluster-level PS models. The impact of different PS conditioning techniques may depend on many factors, such as the ICC, the sample sizes at each level, and the available covariate information. Our results provide guidance for practitioners who implement group-based interventions.
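To make the two PS models concrete, the sketch below contrasts an individual-level model (cluster covariates disaggregated to individuals) with a cluster-level model (individual covariates aggregated to cluster means), and then applies one of the four conditioning methods, weighting, to estimate the ATE. It is only an illustration under stated assumptions: the helper names (`fit_logistic`, `ps_individual`, `ps_cluster`, `ate_ipw`), the single covariate at each level, and the plain Newton-Raphson fit with a tiny ridge penalty are all hypothetical choices, not the study's implementation, and the weighted difference in means ignores the multilevel outcome model used in the study.

```python
import numpy as np

def fit_logistic(X, z, n_iter=25, ridge=1e-6):
    """Logistic regression via Newton-Raphson; returns fitted probabilities."""
    X = np.column_stack([np.ones(len(X)), X])        # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        grad = X.T @ (z - p) - ridge * beta
        hess = (X * (p * (1 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))

def ps_individual(x, w, z):
    """Individual-level PS model: w is the cluster covariate copied to each row."""
    return fit_logistic(np.column_stack([x, w]), z)

def ps_cluster(x, w, z, cluster):
    """Cluster-level PS model: aggregate x to cluster means, fit on clusters."""
    _, first, inv = np.unique(cluster, return_index=True, return_inverse=True)
    x_bar = np.bincount(inv, weights=x) / np.bincount(inv)
    ps_j = fit_logistic(np.column_stack([x_bar, w[first]]), z[first])
    return ps_j[inv]                                  # one PS per cluster, per row

def ate_ipw(y, z, ps):
    """Normalized inverse-probability-weighted difference in means (ATE)."""
    w1 = z / ps
    w0 = (1 - z) / (1 - ps)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)
```

With arrays `x`, `w`, `z`, `y`, and `cluster` of equal length, `ate_ipw(y, z, ps_individual(x, w, z))` and `ate_ipw(y, z, ps_cluster(x, w, z, cluster))` give the two weighted estimates that could then be compared on relative bias, MSE, and CI coverage across simulated replications.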