Tests of model skill using different metrics. Tests of model skill compare predicted (modeled) variables against field observations; common metrics are goodness of fit (R²) and root-mean-square error (RMSE). “Messy” (high-frequency) data (a) are often smoothed using averages (b), which are then used to assess model skill. In this example, two metrics of skill are presented, one based on monthly averages (b) and one on monthly maxima (c). The two tests give widely divergent estimates of model performance, as indicated by the test statistics, but actual model skill depends on which of these metrics (average or extreme) most accurately represents the driver affecting the organisms being modeled. To have confidence in a model’s ability to predict biological response, the model must be tested against metrics that have been shown to be biologically meaningful, not simply against covariates of those metrics. In many cases, commonly used parameters such as annual means may have little biological relevance, because the degree to which they correlate with biologically meaningful values may not hold under future, novel climatic conditions.
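The divergence described above can be sketched numerically. The following is a minimal illustration, not the authors' analysis: the daily series, the seasonal model, and all variable names are hypothetical. It generates noisy "observations" with occasional extreme spikes and a model that tracks the seasonal mean but misses the extremes, then scores the model with RMSE and R² against monthly averages versus monthly maxima.

```python
# Hypothetical sketch: the same model scored against monthly means vs. monthly
# maxima of "messy" daily data yields very different skill estimates.
import math
import random

random.seed(0)

MONTHS = 12
DAYS = 30  # simplified 30-day months

# Synthetic daily observations: seasonal cycle + noise + occasional spikes
obs, pred = [], []
for m in range(MONTHS):
    base = 20 + 8 * math.sin(2 * math.pi * m / MONTHS)
    for _ in range(DAYS):
        spike = 3 * random.expovariate(1.0)  # occasional extreme events
        obs.append(base + random.gauss(0, 1) + spike)
        # Hypothetical model: captures the seasonal mean (including the
        # average spike contribution of ~3) but never the extremes
        pred.append(base + 3)

def monthly(series, agg):
    """Aggregate a daily series into one value per month."""
    return [agg(series[m * DAYS:(m + 1) * DAYS]) for m in range(MONTHS)]

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def r2(o, p):
    mean_o = sum(o) / len(o)
    ss_res = sum((x - y) ** 2 for x, y in zip(o, p))
    ss_tot = sum((x - mean_o) ** 2 for x in o)
    return 1 - ss_res / ss_tot

obs_mean, pred_mean = monthly(obs, lambda s: sum(s) / len(s)), monthly(pred, lambda s: sum(s) / len(s))
obs_max, pred_max = monthly(obs, max), monthly(pred, max)

rmse_mean, rmse_max = rmse(obs_mean, pred_mean), rmse(obs_max, pred_max)
r2_mean, r2_max = r2(obs_mean, pred_mean), r2(obs_max, pred_max)

print(f"vs monthly means:  RMSE={rmse_mean:.2f}, R²={r2_mean:.2f}")
print(f"vs monthly maxima: RMSE={rmse_max:.2f}, R²={r2_max:.2f}")
```

The model scores well against monthly means but poorly against monthly maxima; which test statistic matters depends on whether average or extreme conditions drive the biological response.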