An Introduction to Cognitive Behaviour Therapy, 2nd edition


Evaluating CBT Practice

What is evaluation and why should we do it?

By evaluating practice, we mean gathering data with the aim of determining how well therapy is working or whether one form of therapy is better than another. We believe that CBT practitioners should attempt to evaluate the effectiveness of their therapy for several reasons:


  1. It places us in the great tradition of ‘scientist-practitioners’ (Committee on Training in Clinical Psychology, 1947; Raimy, 1950), aiming to expand knowledge through ‘real world’ research by practitioners. (See also Salkovskis, 1995, 2002; Margison et al., 2000.) The idea behind these approaches is that although traditional, university-based, controlled research is essential to progress, some questions are best answered through research based in clinical practice and carried out by ordinary clinicians.
  2. It allows us to give both clients and purchasers more accurate information about what kind of outcomes clients can expect. Such evaluation is, therefore, an important part of accountability to our commissioners and of informed consent for our clients. It also allows both our clients and ourselves to see whether we are doing as well as expected and, therefore, whether there are areas that we need to improve.
  3. It gives us a baseline of data against which we can compare changes we introduce in running our services. For instance, if we introduce a change hoping to reduce the proportion of people who drop out of therapy, then it is helpful to know what the original proportion was; if we do some training, hoping to improve outcomes with depression, then we need to know what our outcomes were before the training. This kind of routine data can be an enormously useful support for clinical audits.

Thus, some system for routinely evaluating therapy is important, and while one short chapter can only cover a fraction of the issues of research design that arise in this area, we hope we can give you some useful pointers.

Types of evaluation

There are two main foci for evaluation:


  • individual clinical case outcome (including evaluating a single group);
  • whole clinical service outcome (whether provided by one clinician or 100).

We shall look at each of these in turn.

Evaluating individual clinical cases

The major purposes of evaluating individual outcomes are (a) to allow you and your client to see what, if any, changes have occurred in therapy; and (b) in some cases to look more closely at the effects of a clinical intervention, perhaps using what has been called single-case experimental designs.

The first of these is fairly straightforward: we take some relevant measures, perhaps at the beginning and end of therapy, and see whether and by how much they change. Used at this level, such evaluation is straightforward good clinical practice. It gives both therapist and client a clear view of how much difference therapy has made to target problems.

Specific single-case research designs are probably less familiar to many readers, and we shall briefly introduce some of the ideas behind these approaches, although we cannot do more than scratch the surface. The interested reader is directed to classic texts such as Barlow, Andrasik and Hersen (2006) and Kazdin (2010).

The aim of these designs is to allow us to be more confident about evaluating the impact of treatment or some component of treatment. The most common approaches to single-case design rely on regularly repeated measures. The basic logic is that we establish some measure of the problem in which we are interested and then repeat that measure sufficiently often to establish a trend – the so-called baseline – against which we can compare subsequent changes when we introduce an intervention. The baseline gives us some protection against the likelihood that changes we observe are actually due to chance or some other factor, rather than our intervention. If we take just one measurement before therapy and one after, with only one individual, then it is impossible to rule out the chance that something external to therapy – for example, that our client won the lottery, or fell in love, or got a wonderful new job – caused any changes we see. If we have larger numbers of measurements, it becomes much less plausible that an external change happened to occur at exactly the time when we changed our intervention.

Figure 18.1    ‘Before and After’ versus ‘Repeated measurements’

Figure 18.1 illustrates this logic. Imagine that the vertical axis here represents some relevant measure: score on a depression questionnaire, or number of obsessional thoughts in a day, or ratings of fear in a particular situation. In the left-hand part of this figure, with a single measurement before and after treatment, there is nothing to assure us that the reduction in score is not due to some external cause unrelated to therapy. We have only two measurements – anything could have happened in the intervening time and had an impact on whatever the measure is. In the right-hand chart, however, the frequent repeated measures give us greater reason to believe that the treatment has caused the change because it is less likely that a sequence of repeated measurements should happen to respond to such an event just at the specific time that the treatment is introduced.

The basic logic of many single-case designs follows this principle. We look at the pattern of measurements to see whether changes coincide with changes of treatment: if they do, that gives us some reason to believe that the treatment was responsible for the change (but we can still not be sure that some coincidental event has not caused the change).

The simple design on the right-hand side of Figure 18.1, consisting of a baseline before treatment and a continuation over the course of treatment, is often known as an A–B design: the baseline is Condition A and treatment is Condition B. If the treatment is one that we would expect not to have a lasting effect but only to work whilst it is being implemented (e.g. perhaps a sleep hygiene programme), then there is scope for extending the A–B design to variations such as A–B–A, in which we first introduce the treatment and then withdraw it; see Figure 18.2 for an illustration of this.

Figure 18.2    A–B–A design

Figure 18.3    Alternating treatments design

The basic logic is strengthened here by the measure’s responding not just to the introduction of the treatment but also to its withdrawal. The likelihood that such opposite responses should coincide with treatment changes just by chance is even smaller, and thus our conviction that the treatment caused a change is stronger. Of course, if the treatment is one that we would expect to have a persisting effect – e.g. CBT for depression leading to improved mood – then this A–B–A model is not usable: we do not expect the client’s mood to drop as soon as the treatment is withdrawn.
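To make this phase logic concrete, here is a minimal sketch with invented nightly ratings for a sleep hygiene programme of the kind mentioned above. The data, the 0–10 rating scale and the phase lengths are all made up for illustration; in practice the "analysis" would be visual inspection of the plotted series, but simple phase means convey the same idea.

```python
# Hypothetical nightly sleep-quality ratings (0-10) across an A-B-A design:
# baseline (A), sleep hygiene programme (B), then withdrawal (A again).
phases = {
    "A (baseline)":   [3, 4, 3, 3, 4, 3],
    "B (treatment)":  [6, 7, 7, 8, 7, 8],
    "A (withdrawal)": [4, 3, 4, 3, 3, 4],
}

def mean(scores):
    """Average rating within one phase."""
    return sum(scores) / len(scores)

for label, scores in phases.items():
    print(f"{label}: mean = {mean(scores):.1f}")

# Visual inspection looks for the measure rising when the treatment is
# introduced and falling again when it is withdrawn -- a pattern unlikely
# to coincide with both phase changes by chance alone.
```

If the treatment-phase mean clearly stands apart from both baseline and withdrawal phases, that is the pattern the A–B–A logic relies on.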

We shall briefly describe two further common designs. The first, the alternating treatments design, is a way of determining in a single case which of two treatments is more effective (though it requires that the treatment’s effects show up quickly enough to be measured within each segment). During each segment (e.g. a treatment session, or some other unit of time), one of the two treatments is chosen randomly, and the measure is repeated for each segment. If the measure shows a clear separation of the two conditions, as in Figure 18.3, then we have some evidence that one treatment is more effective than the other. For example, suppose we wanted to test the hypothesis that talking about a particular topic makes our client anxious. We could then agree with the client to decide randomly, session by session, whether or not to talk about the topic, and to take ratings of anxiety in every session. In Figure 18.3, if A marks the ‘avoiding’ sessions and B marks the ‘talking’ sessions, then the pattern suggests that avoiding leads to lower scores on our measure than talking does.

Figure 18.4    Multiple baseline across behaviours

This design can also usefully be adapted for patients’ behavioural experiments (Chapter 9), for example to help an obsessional patient decide whether repeated checking of the front door actually causes more or less anxiety than doing one quick check and walking away.
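As a sketch of the alternating treatments logic using the talking/avoiding example above: each session is randomly assigned to one condition, an anxiety rating is taken, and the two conditions are compared. The ratings here are generated artificially (with 'talking' sessions set to run higher) purely to illustrate the bookkeeping; real data would of course come from the client.

```python
import random

random.seed(0)

# Six sessions of each condition, in a random order:
# A = avoid the topic, B = talk about it.
conditions = list("AB" * 6)
random.shuffle(conditions)

def rating(condition):
    # Illustrative generator: invented 0-100 anxiety scores,
    # with 'talking' sessions running about 30 points higher.
    base = 30 if condition == "A" else 60
    return base + random.randint(-5, 5)

ratings = [rating(c) for c in conditions]

mean_a = sum(r for r, c in zip(ratings, conditions) if c == "A") / 6
mean_b = sum(r for r, c in zip(ratings, conditions) if c == "B") / 6
print(f"Avoiding (A) mean anxiety: {mean_a:.1f}")
print(f"Talking (B) mean anxiety:  {mean_b:.1f}")
```

A clear separation between the two condition means, maintained across the randomised sequence, is the pattern Figure 18.3 depicts.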

Finally, there is the multiple baseline design, where we look at several different measurements at the same time, hence the name. There are several variations: multiple baseline across behaviours, across settings or across subjects. Consider this simple example of multiple baselines across behaviours. A client has two different obsessional rituals, both of which we monitor regularly during the baseline period (see Figure 18.4, where the triangles represent the frequency of one ritual and the squares the other ritual). Then we introduce the treatment for one behaviour only (one ritual in this case). After a delay, we introduce the treatment for another behaviour (the second ritual in this case). If we get a pattern like Figure 18.4, where each behaviour shows a change just at the time treatment was introduced for that behaviour, then this gives us some reason to believe it was the treatment that caused the change (see Salkovskis & Westbrook, 1989, for an example of this design’s being used to evaluate treatment for obsessional thoughts).

The same principles apply to multiple baseline designs across subjects or settings: of course, the number of different baselines does not have to be two, as in our example above, but can be any number. In the example in Figure 18.4, each set of data represents one behaviour (a ritual in our example); in the case of multiple baseline across subjects, each set of data represents a person, to whom we introduce the treatment at different times after baseline; in the case of multiple baseline across settings, each data set represents one situation (for example, a programme for disruptive behaviour that is introduced first in the school setting and then later at home). Note that this design can only work when we would expect some independence between the behaviours, subjects or settings: if the treatment is likely to generalise from one of these to the others, then the synchronised change we are looking for will not happen.
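The staggered-start logic of the two-ritual example can be sketched as follows. The daily frequency counts are invented, as are the treatment start days (day 7 for the first ritual, day 14 for the second); the point is that each behaviour changes only when its own treatment begins, while the other baseline continues undisturbed.

```python
# Hypothetical daily frequency counts for two rituals over 20 days.
# Treatment for ritual 1 starts on day 7; for ritual 2 on day 14.
ritual_1 = [9, 10, 9, 10, 9, 10, 4, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]
ritual_2 = [8, 8, 7, 8, 8, 7, 8, 8, 7, 8, 8, 7, 7, 3, 2, 2, 1, 1, 1, 1]

def mean(xs):
    """Average daily frequency over a span of days."""
    return sum(xs) / len(xs)

# Each behaviour should change only when its own treatment begins.
print("Ritual 1 before/after day 7: ",
      round(mean(ritual_1[:6]), 1), round(mean(ritual_1[6:]), 1))
print("Ritual 2 before/after day 14:",
      round(mean(ritual_2[:13]), 1), round(mean(ritual_2[13:]), 1))
```

Note that ritual 2 stays at its baseline level throughout days 7–13, while ritual 1 is being treated: that continued stability is what rules out a general external cause for the first change.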

Finally, note that we have described here the common approach of analysing the results of such single-case designs by visual inspection – i.e. by looking at the pattern of results and seeing what they seem to show. Over the past 20 years, there have also been developments in the statistical analysis of single-case designs, but such statistics are not yet straightforward enough for most ordinary clinicians to use.

Evaluating services

The other common form of evaluation is the collection of data about whole services and, therefore, larger numbers of clients. The main purposes of such evaluations are:


  • to describe the client population (e.g. age, sex, chronicity of problems etc.);
  • to describe the nature of the service (e.g. drop-out rates, average number of treatment sessions, etc.);
  • to establish the effectiveness of the service’s treatments, using outcome measures;
  • to use routinely collected data as a baseline against which changes of service can be evaluated (e.g. does this change result in better outcomes, or greater client satisfaction?).

It is impossible to specify what kind of data should be collected, as that depends on your own service’s interests and goals, but most services collect various forms of data, including:


  • client outcome data (e.g. mental-health questionnaire measures, administered before and after treatment – see below);
  • client demographic data (e.g. age, sex, duration of problems, employment status, etc.);
  • service parameters, such as dates of referral, etc. (from which waiting times can be calculated);
  • service outcomes such as dropping out of treatment or not attending appointments.

Several years ago the service in which all the authors then worked decided to implement a limit of 10 sessions on treatment, in an attempt to reduce waiting lists. Because this change naturally aroused some worries, it was agreed that its effects should be evaluated. Several different aspects of the new procedure were included in the evaluation:


  1. Did the limit have an effect on client outcomes? The service had been collecting routine outcome data for many years, so those existing data could be used as ‘historical controls’ against which to compare the outcomes obtained under the new regime.
  2. Did it change client satisfaction? Again we had previous data using the Client Satisfaction Questionnaire (Larsen, Attkisson, Hargreaves & Nguyen, 1979) that were used as a comparison.
  3. How did therapists respond to the limits? We used ad hoc rating scales to evaluate whether they found it easier or harder, how it affected therapy, etc.

The results were that broadly the 10-session limit did not result in different outcomes; clients were just as satisfied; and therapists had a ‘swings and roundabouts’ response, in that they found some things harder but some easier. The exception to the finding of broadly similar outcomes was that there was some evidence that clients with ‘personality disorders’ did less well with the brief treatment, so this was investigated further.

Some frequently used questionnaires

Which outcome measures to use is again a matter for each service to decide according to its needs, but the following are suitable for routine clinical use in that they (a) do not take too long for a client to complete; (b) are widely used, so that comparisons can be made with other services and/or research trials; and (c) assess aspects of mental health that are common in most populations.


  • The Beck Depression Inventory (BDI: Beck et al., 1961) is probably the best-known measure of depression. The latest revision is the BDI II (Beck, Brown & Steer, 1996), although the original version is still sometimes used in research in order to retain comparability with earlier work.
  • The Beck Anxiety Inventory (BAI: Beck et al., 1988) is a similar measure of anxiety.
  • The Clinical Outcomes in Routine Evaluation – Outcome Measure (CORE–OM: Evans et al., 2002; Barkham, Mellor-Clark, Connell & Cahill, 2006; Mullin, Barkham, Mothersole, Bewick & Kinder, 2006) is an increasingly popular general measure of mental health in the UK, especially in primary care settings. Mullin et al. (2006) provide some useful national benchmarking standards, by giving mean CORE scores for a sample of over 10,000 clients from many different services across the UK.
  • The Hospital Anxiety and Depression Scale (HADS: Zigmond & Snaith, 1983), which, despite the name, is suitable for community settings. The name arose because it was originally designed for use in general hospital settings and, therefore, aimed to avoid confounding mental-health problems with physical-health problems. This characteristic makes it particularly useful for settings where one might expect a significant proportion of clients to have physical-health problems as well as mental-health problems.

A literature search will quickly turn up other measures suitable for almost any specific mental-health problem.

Other measures

Standardised questionnaires are often supplemented by other measures, such as individual problem ratings, belief ratings for particular cognitions, problem frequency counts or duration timings, and so on (see Chapter 5).

Clinical significance statistics

Service-evaluation data can be analysed using any of the standard statistical approaches. However, an approach known as ‘clinical significance’ analysis is particularly suited for clinical services, and especially an approach developed by Jacobson (Jacobson & Revenstorf, 1988; Jacobson, Roberts, Berns & McGlinchey, 1999). The aim of clinical significance analysis is to deal with the problem in conventional statistical testing that almost any change in average scores, even a tiny one, will emerge as significant if the number of participants in the study is large enough. Conventional testing tells us that such a change is ‘significant’ in the sense that it is unlikely to have emerged by chance but does not tell us that it is significant in the sense of being important. Thus, given large enough numbers, a change in patients’ mean BDI score of a couple of points from start to end of treatment might be statistically significant – and rightly so, in the sense of being ‘not due to chance’. But clinicians would not regard such a change of score as clinically significant, in the sense that their clients would not be happy if this was the kind of benefit they could expect.

Jacobson’s approach to testing for clinical significance looks at each participant in a study individually and asks two questions:


  1. Did this person’s score on a particular measure change sufficiently for it to be unlikely to be due to chance? A ‘reliable change’ index, dependent on the reliability of the measure and its natural variation in the population, is calculated. If a patient’s change score is greater than the calculated criterion, then that patient may be described as reliably improved (or deteriorated) on the measure.
  2. If the patient has reliably changed, has the change also taken them across a cut-off point into the normal range for this measure? If so, we may consider the person not just improved but also ‘recovered’. Jacobson et al. set out different possible ways of setting this ‘normal cut-off’ criterion, e.g. by calculating the point beyond which a patient is statistically more likely to belong to a normal population than a dysfunctional population.

Figure 18.5    Classification of change scores for clinical significance

Figure 18.5 shows the possible outcomes resulting from this analysis for each client. Depending on the above two calculations, every client is classified as: reliably deteriorated, no reliable change, reliably improved (but not recovered) or recovered. The results of the analysis are reported as the proportion of clients falling into each of these categories.
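As a sketch of how the two Jacobson criteria combine into the four categories, the calculation can be laid out as follows. The reliability and the population means and standard deviations used here are invented for illustration, not the norms of any published scale; only the formulas (the reliable change index and the cut-off between populations) follow the Jacobson approach described above.

```python
import math

# Illustrative (invented) norms for some depression questionnaire.
RELIABILITY = 0.9              # assumed test-retest reliability
MEAN_DYS, SD_DYS = 25.0, 8.0   # assumed dysfunctional-population norms
MEAN_NORM, SD_NORM = 8.0, 6.0  # assumed normal-population norms

# Criterion 1: reliable change. The change score must exceed what
# measurement error alone could plausibly produce (95% criterion).
se_measurement = SD_DYS * math.sqrt(1 - RELIABILITY)
s_diff = math.sqrt(2) * se_measurement   # standard error of a difference
reliable_change = 1.96 * s_diff

# Criterion 2: the cut-off beyond which a score is statistically more
# likely to come from the normal than the dysfunctional population.
cutoff = (SD_DYS * MEAN_NORM + SD_NORM * MEAN_DYS) / (SD_DYS + SD_NORM)

def classify(pre, post):
    """Assign one of the four outcome categories in Figure 18.5."""
    change = pre - post                  # positive change = improvement
    if change >= reliable_change and post < cutoff:
        return "recovered"
    if change >= reliable_change:
        return "reliably improved"
    if change <= -reliable_change:
        return "reliably deteriorated"
    return "no reliable change"

print(classify(pre=26, post=7))    # large drop, ends below the cut-off
print(classify(pre=26, post=24))   # too small to exceed measurement error
```

A service would apply `classify` to every client's pre/post scores and report the proportion falling into each category.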

The advantages of this approach are:

a  that it gives us more meaningful statistics to report: most clinicians would agree that a client who meets both of the Jacobson criteria has truly made clinically significant progress;

b  that the resulting figures are more comprehensible to clients and/or service commissioners: it is much easier for most people to understand ‘On average 56% of clients recover’ than ‘On average, clients’ scores on the BDI move from 17.3 to 11.2’ – Westbrook and Kirk (2005) give an example of this kind of analysis for routine clinical data.

Incidentally, it is worth noting that although such ‘bench-marking’ strategies (see also Wade, Treat & Stuart, 1998; Merrill et al., 2003) typically find that CBT is an effective treatment in clinical practice as well as in research trials, clinical significance analysis is sobering for anyone who believes that CBT (or any other kind of psychological therapy) is a panacea that can help all clients: most such analyses find that only around a third to a half of clients achieve recovery by these criteria.

Difficulties in evaluation

Keep it simple

There is always a temptation to gather more data. It is easy to think ‘Whilst we’re at it, let’s find out about this … and this … and this …’. The result can be an unwieldy mass of data that overburdens the client, is too time-consuming to collect reliably and is even more time-consuming to analyse. In general, it is better to have a small number of data items which can be collected and analysed reasonably economically.

Repeating measures

Sometimes clients become over-familiar with regularly used measures and begin to complete them on ‘automatic pilot’. Always spend a minute or two discussing questionnaire results with your client so that you can assess how valid the responses are.

Keep it going

Most routine data collection starts enthusiastically, but that enthusiasm is hard to sustain. We suggest two factors are important in keeping data collection going. First, having a ‘champion’ at a reasonably senior level – someone who will support data collection and analysis and make sure that people are prompted if they forget about collecting data. Second, it is crucial that clinicians collecting data see that something is done with it and that results are fed back to them periodically. Data that are never analysed are useless anyway, and the chances are low that people will continue to collect data when no results appear.

Research design

Clinical service evaluation usually cannot reach the highest standards of research design, such as RCTs. All research designs involve some compromise between (a) the tightly controlled research that eliminates as much uncertainty as possible but, in doing so, may end up not resembling real clinical practice; and (b) the more ‘real-world’ research that is very close to clinical practice but, as a result, leaves room for ambiguity about causal factors. Service evaluation therefore often works on the principle that some evidence is better than nothing and accepts some lack of rigour for the sake of being able to describe everyday outcomes. Robson’s (2002) book on ‘real-world research’ is a useful resource to look further at these issues.

Summary




  • One of CBT’s strengths is its commitment to empiricism, i.e. to evaluating whether there is good evidence to support its theories and the effectiveness of its treatments. This commitment is not just for academics but can and should be incorporated into clinical services.
  • One common form of evaluation looks at individual clinical cases in order to tell, more reliably than mere subjective opinion, whether therapy (or some component of therapy) is effective. So-called single-case designs are particularly useful here, and can be implemented without a great deal of change to ordinary CBT practice.
  • The other common form of evaluation takes a broader view and aims to evaluate whether a clinical service as a whole is in some relevant sense ‘doing a good job’: obtaining good outcomes, doing as well as some relevant comparison service, obtaining outcomes better than it used to, or whatever. There are many measurement tools available to assess outcomes, some of which enable comparisons with other services or with research trials.
  • Clinical significance analysis can be a useful tool for summarising outcomes in a way that is meaningful and understandable, both to clinicians and to clients.

Learning exercises

Review and reflection:


  • Does your service currently do any form of routine evaluation? If so, how well does it work? What could be improved? If not, what might be the pros and cons of doing some? How might you persuade your colleagues and/or managers that it would be a good idea?
  • What about your own individual clients? Could you do more to evaluate their progress in therapy? How might that be useful, for you or for them? What challenges would arise if you were to do more?

Taking it forward:


  • Many interesting ideas for research and evaluation arise from thinking about questions that come up in clinical practice: ‘It seems to me that treatment technique X works better than Y for this problem’, or ‘Clients seem to be less likely to drop out of treatment if I do Z’, or whatever. Perhaps you could keep a note of thoughts like that, and see whether there is any way to gather some relevant evidence.
  • If your service does not currently collect routine data, could you talk to colleagues about whether it would be useful, what data to collect, and so on?
  • If you have data but have not analysed it or collated it, maybe you could book some time out some way ahead in order to do so.

Further reading

Robson, C. (2002). Real world research (2nd ed.). Oxford: Blackwell. [3rd edition due in 2011]

As the title suggests, this is an excellent and comprehensive introduction to doing research in the ‘real world’, i.e. outside academic settings.

Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.

Statistics is always going to be intimidating for many of us, but Field does as good a job as possible in making it interesting and practical, with many detailed examples of how to do statistical tests using the popular statistics software package, SPSS.

Westbrook, D. (2010). Research and evaluation. Chapter 18 in M. Mueller, H. Kennerley, F. McManus, & D. Westbrook (Eds.), The Oxford guide to surviving as a CBT therapist. Oxford: OUP.

A brief introduction to some of the issues around doing evaluation research in clinical practice.