This paper presents the results of a multi-model comparison designed to determine the outcome deviations that result from differences between power system models. We apply eight temporally and spatially resolved models to 16 stylized test cases that differ in their renewable energy supply share, technology scope, and optimization scope. We focus on technologies for balancing the variability of power generation, such as controllable power plants, energy storage, power transmission, and flexibility related to sector coupling. All models use harmonized input data so that model-related outcome deviations can be separated from data-related ones. We find that this approach allows us to isolate and quantify model-related outcome deviations, to identify robust effects concerning system operation and investment decisions, and to attribute the deviations to the identified model differences. Our results show that trends in the use of individual flexibility options are robust across most models. Moreover, our analysis reveals that differences in the general modeling approach and in the modeling of specific technologies lead to comparatively small deviations, whereas a heterogeneous model scope can cause substantially larger ones. Owing to the large number of models and scenarios, our analysis indicates which investment and operation decisions are robust to the choice of model and which modeling approaches have an exceptionally high impact on results. Our findings may guide both modelers and decision-makers in properly evaluating the results of similarly designed power system models.