I have been meaning to post on model verification for a while and cumulative interactions with customers have led to this proposal. See the Twitter postings alongside for relevant material from the FDA and recent additions to DynoChem Resources. To automatically keep abreast of these, I encourage you to follow DynoChem on Twitter.

Proposed approach to model verification:

- Model development and parameter fitting should be based on a 'training set' of experimental data, i.e. a subset of all the data available.
- Verification should in general be completed against a separate set of experimental data, probably testing the limits of the model (e.g. points at the corners of the anticipated region of validity).
- Those data do not need to be from 'designed' or perfect experiments; in fact, inclusion of spiked or otherwise perturbed experiments can be highly valuable and informative.
- Verification should be described, presented and qualified as being 'to within E%' or 'within E response units'. E may vary within a factor space.
- E is not arbitrary, but equal to the prediction band width for that response, with confidence level 1-alpha, where alpha may be 5% (95% confidence) or perhaps 1% (99% confidence).
- The limits of applicability of the 'to within E%' statement above (i.e. the region of factor space that is covered) should be defined.
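The train/verify split above can be sketched as follows. The first-order rate model, the data and the 10% tolerance are all illustrative assumptions, not a prescription:

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative first-order kinetic model; a real DynoChem model
# would be mechanistic and typically multi-response.
def model(t, k):
    return 100.0 * np.exp(-k * t)

# 'Training set': fit parameters on a subset of the available data.
t_train = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
y_train = np.array([100.0, 61.0, 37.5, 13.4, 1.9])
(k_fit,), _ = curve_fit(model, t_train, y_train, p0=[0.5])

# 'Verification set': separate experiments, here near the edges
# of the anticipated region of validity.
t_verify = np.array([0.5, 6.0])
y_verify = np.array([78.0, 5.2])

E_pct = 10.0  # illustrative verification tolerance, 'to within E%'
pred = model(t_verify, k_fit)
within = np.abs(pred - y_verify) <= (E_pct / 100.0) * np.abs(y_verify)
print(f"fitted k = {k_fit:.3f}; verified to within {E_pct}%: {bool(within.all())}")
```

In practice E would come from the prediction band width rather than being chosen by hand, as discussed below.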

Usage of prediction band widths (or 'prediction intervals') in this way allows a statistically sound statement to be made about the level of verification of any model in which parameters have been fitted. During model development, E decreases as the model improves, i.e. the fit improves and uncertainty is reduced. When the model is mechanistic, there is often little risk of 'overfitting' (too many fitted degrees of freedom) and the quality of mechanistic understanding, together with collection of good data, are the main factors that improve the fit. Bear in mind also that in a mechanistic model, a single set of parameters fits all responses (not separate models for each response) and the fit is judged against multiple samples, not just end-points.
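For a model that is linear in its parameters, the prediction band half-width (E in response units) at confidence level 1-alpha can be computed from the fit residuals. This is a minimal sketch with placeholder data; nonlinear mechanistic models need a linearized or simulation-based equivalent:

```python
import numpy as np
from scipy import stats

# Placeholder data; in practice these come from the training experiments.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 11.8])

# Ordinary least-squares fit of y = a + b*x.
A = np.vstack([np.ones_like(x), x]).T
coef, res_ss, *_ = np.linalg.lstsq(A, y, rcond=None)
n, p = len(x), 2
s2 = res_ss[0] / (n - p)  # residual variance (captures lack of fit + noise)

# Prediction interval half-width at a new point x0, confidence 1 - alpha.
alpha = 0.05
x0 = 3.5
h = np.array([1.0, x0]) @ np.linalg.inv(A.T @ A) @ np.array([1.0, x0])
t_crit = stats.t.ppf(1 - alpha / 2, n - p)
half_width = t_crit * np.sqrt(s2 * (1.0 + h))  # this plays the role of E
y_pred = coef @ np.array([1.0, x0])
print(f"prediction: {y_pred:.2f} +/- {half_width:.2f} (95% prediction band)")
```

Note the `1.0 + h` term: a prediction band covers a new observation, not just the fitted mean, which is why it is wider than the confidence band and why it is the right quantity for verification against fresh experiments.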

E needs to be small if users are going to operate near the CQA limit (or another important limit), but can be relatively larger if not. So the verification level required for a model to be useful has an element of fitness for purpose.

Prediction bands take account of 'lack of fit' and are correspondingly wider for responses that fit poorly compared to those that fit well. For a CQA upper limit (e.g. typical for an impurity), mathematically one could therefore say that a model is verified and fit for QbD purposes if:

average response × (1 + E/100) is comfortably less than the CQA limit (with E expressed in percent)

The above expression is also equivalent to evaluating the probability that CQA will be less than its limit; that probability increases if E is low and/or the average response is well below the CQA upper limit. So that probability is itself an indicator of the degree of model verification achieved.
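Under an approximate-normality assumption, that probability can be evaluated directly from the predicted mean and the prediction standard error. The numbers below are illustrative only:

```python
from scipy import stats

# Illustrative numbers: predicted mean impurity level, its prediction
# standard error, and the CQA upper limit (all in the same response units).
y_pred = 0.30      # model-predicted average impurity, %w/w
se_pred = 0.05     # prediction standard error (sets the band half-width E)
cqa_limit = 0.50   # CQA upper specification limit, %w/w

# Probability that a new observation falls below the CQA limit,
# assuming prediction errors are approximately normal.
p_below = stats.norm.cdf((cqa_limit - y_pred) / se_pred)
print(f"P(response < CQA limit) = {p_below:.4f}")
```

As the prose says, this probability rises either when E (via `se_pred`) shrinks or when the average response moves further below the limit, so it serves as a single indicator of the degree of verification achieved.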

Of course, in a good mechanistic model, E will be small for all responses, not just a CQA. Focusing on reducing E will improve process understanding of the whole system, and both prediction and confidence band response surfaces may be drawn to guide experimentation toward this goal; see previous posts.