What is evaluation and why should we do it?
By evaluating practice, we mean gathering data with the aim of determining how well therapy is working or whether one form of therapy is better than another. We believe that CBT practitioners should attempt to evaluate the effectiveness of their therapy for several reasons:
Thus, some system for routinely evaluating therapy is important, and while one short chapter can only cover a fraction of the issues of research design that arise in this area, we hope we can give you some useful pointers.
Types of evaluation
There are two main foci for evaluation:
We shall look at each of these in turn.
Evaluating individual clinical cases
The major purposes of evaluating individual outcomes are (a) to allow you and your client to see what, if any, changes have occurred in therapy; and (b) in some cases to look more closely at the effects of a clinical intervention, perhaps using what has been called single-case experimental designs.
The first of these is fairly straightforward: we take some relevant measures, perhaps at the beginning and end of therapy, and see whether and by how much they change. Used at this level, such evaluation is straightforward good clinical practice. It gives both therapist and client a clear view of how much difference therapy has made to target problems.
Specific single-case research designs are probably less familiar to many readers, and we shall briefly introduce some of the ideas behind these approaches, although we cannot do more than scratch the surface. The interested reader is directed to classic texts such as Barlow, Andrasik and Hersen (2006) and Kazdin (2010).
The aim of these designs is to allow us to be more confident about evaluating the impact of treatment or some component of treatment. The most common approaches to single-case design rely on regularly repeated measures. The basic logic is that we establish some measure of the problem in which we are interested and then repeat that measure sufficiently often to establish a trend – the so-called baseline – against which we can compare subsequent changes when we introduce an intervention. The baseline gives us some protection against the possibility that changes we observe are actually due to chance or some other factor, rather than our intervention. If we take just one measurement before therapy and one after, with only one individual, then it is impossible to rule out the chance that something external to therapy – for example, that our client won the lottery, or fell in love, or got a wonderful new job – caused any changes we see. If we have larger numbers of measurements, it becomes much less plausible that an external change happened to occur at exactly the time when we changed our intervention.
Figure 18.1 ‘Before and After’ versus ‘Repeated measurements’
Figure 18.1 illustrates this logic. Imagine that the vertical axis here represents some relevant measure: score on a depression questionnaire, or number of obsessional thoughts in a day, or ratings of fear in a particular situation. In the left-hand part of this figure, with a single measurement before and after treatment, there is nothing to assure us that the reduction in score is not due to some external cause unrelated to therapy. We have only two measurements – anything could have happened in the intervening time and had an impact on whatever the measure is. In the right-hand chart, however, the frequent repeated measures give us greater reason to believe that the treatment has caused the change, because it is much less plausible that some external event would happen to shift a whole sequence of repeated measurements at exactly the point when the treatment was introduced.
The basic logic of many single-case designs follows this principle. We look at the pattern of measurements to see whether changes coincide with changes of treatment: if they do, that gives us some reason to believe that the treatment was responsible for the change (but we can still not be sure that some coincidental event has not caused the change).
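This repeated-measures logic can be sketched in code. The scores below are invented for illustration, and the function name and the two-standard-deviation band are our own illustrative choices rather than a standard from the single-case literature; the sketch simply asks whether post-intervention scores fall outside the range of ordinary baseline fluctuation:

```python
# Hypothetical sketch: with only a single pre/post pair we cannot separate
# treatment effects from coincidence, but a repeated baseline lets us ask
# whether treatment-phase scores fall outside normal baseline variation.
from statistics import mean, stdev

def outside_baseline_band(baseline, treatment_phase, k=2.0):
    """Flag treatment-phase points more than k SDs below the baseline mean
    (assuming lower scores mean improvement)."""
    m, s = mean(baseline), stdev(baseline)
    return [score < m - k * s for score in treatment_phase]

# Weekly depression-questionnaire scores (made up for illustration)
baseline = [24, 26, 23, 25, 24, 26]   # stable trend before treatment
treatment = [23, 19, 15, 12, 10]      # scores once treatment begins

print(outside_baseline_band(baseline, treatment))
# → [False, True, True, True, True]
```

In practice, as the text notes, such data are usually judged by visual inspection; a calculation like this merely makes explicit what "clearly below the baseline trend" means.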
The simple design on the right-hand side of Figure 18.1, consisting of a baseline before treatment and a continuation over the course of treatment, is often known as an A–B design: the baseline is Condition A and treatment is Condition B. If the treatment is one that we would expect not to have a lasting effect but only to work whilst it is being implemented (e.g. perhaps a sleep hygiene programme), then there is scope for extending the A–B design to variations such as A–B–A, in which we first introduce the treatment and then withdraw it; see Figure 18.2 for an illustration of this.
Figure 18.2 A–B–A design
Figure 18.3 Alternating treatments design
The basic logic is strengthened here by the measure’s responding not just to the introduction of the treatment but also to its withdrawal. The likelihood that such opposite responses should coincide with treatment changes just by chance is even smaller, and thus our conviction that the treatment caused a change is stronger. Of course, if the treatment is one that we would expect to have a persisting effect – e.g. CBT for depression leading to improved mood – then this A–B–A model is not usable: we do not expect the client’s mood to drop as soon as the treatment is withdrawn.
We shall briefly describe two further common designs. The first, the alternating treatments design, is a way of determining in a single case which of two treatments is more effective (though it requires that the treatment’s effects show up rapidly in the measure). During each segment (e.g. a treatment session, or some other unit of time), one of the two treatments is chosen randomly, and the measure is repeated for each segment. If the measure shows a clear separation of the two conditions, as in Figure 18.3, then we have some evidence that one treatment is more effective than the other. For example, suppose we wanted to test the hypothesis that talking about a particular topic makes our client anxious. We could then agree with the client to talk about the topic in some randomly chosen sessions and to avoid it in others, taking anxiety ratings in every session. In Figure 18.3, if A marks the ‘avoiding’ sessions and B marks the ‘talking’ sessions, then the pattern suggests that avoiding leads to lower scores on our measure than talking does.
Figure 18.4 Multiple baseline across behaviours
This design can also usefully be adapted for patients’ behavioural experiments (Chapter 9), for example to help an obsessional patient decide whether repeated checking of the front door actually causes more or less anxiety than doing one quick check and walking away.
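The mechanics of an alternating treatments design can be sketched as follows. All numbers, condition labels and function names here are hypothetical illustrations of the ‘talking versus avoiding’ example above, not data from the book:

```python
# Hypothetical sketch of an alternating treatments design: each session is
# randomly assigned to one of two conditions, the measure (e.g. an anxiety
# rating) is taken every session, and the conditions are then compared.
import random
from statistics import mean

def assign_conditions(n_sessions, seed=0):
    """Randomly assign each session to one of two conditions."""
    rng = random.Random(seed)
    return [rng.choice(["avoid", "talk"]) for _ in range(n_sessions)]

def condition_means(conditions, ratings):
    """Average the repeated measure separately for each condition."""
    by_cond = {}
    for cond, rating in zip(conditions, ratings):
        by_cond.setdefault(cond, []).append(rating)
    return {cond: mean(vals) for cond, vals in by_cond.items()}

conditions = ["avoid", "talk", "talk", "avoid", "talk", "avoid"]
ratings =    [30,      70,     65,     35,      75,     25]   # 0-100 anxiety
print(condition_means(conditions, ratings))
# → {'avoid': 30, 'talk': 70}
```

A clear separation of the two condition means, as here, corresponds to the separated data series in Figure 18.3.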
Finally, there is the multiple baseline design, where we look at several different measurements at the same time, hence the name. There are several variations: multiple baseline across behaviours, across settings or across subjects. Consider this simple example of multiple baselines across behaviours. A client has two different obsessional rituals, both of which we monitor regularly during the baseline period (see Figure 18.4, where the triangles represent the frequency of one ritual and the squares the other ritual). Then we introduce the treatment for one behaviour only (one ritual in this case). After a delay, we introduce the treatment for another behaviour (the second ritual in this case). If we get a pattern like Figure 18.4, where each behaviour shows a change just at the time treatment was introduced for that behaviour, then this gives us some reason to believe it was the treatment that caused the change (see Salkovskis & Westbrook, 1989, for an example of this design’s being used to evaluate treatment for obsessional thoughts).
The same principles apply to multiple baseline designs across subjects or settings: of course, the number of different baselines does not have to be two, as in our example above, but can be any number. In the example in Figure 18.4, each set of data represents one behaviour (a ritual in our example); in the case of multiple baseline across subjects, each set of data represents a person, to whom we introduce the treatment at a different time after baseline; in the case of multiple baseline across settings, each data set represents one situation (for example, a programme for disruptive behaviour that is introduced first in the school setting and then later at home). Note that this design can only work when we would expect some independence between the behaviours, subjects or settings: if the treatment is likely to generalise from one of these to the others, then the synchronised change we are looking for will not happen.
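The multiple-baseline logic can also be sketched briefly. The weekly ritual frequencies and intervention weeks below are invented for illustration; the point is simply that each behaviour should change only once treatment starts for that behaviour:

```python
# Minimal sketch of the multiple-baseline logic (hypothetical data): for each
# behaviour, compare the mean before and after its own staggered intervention
# point, expecting a drop only once treatment starts for that behaviour.
from statistics import mean

def change_at_intervention(series, start):
    """Mean difference between the post- and pre-intervention phases."""
    return mean(series[start:]) - mean(series[:start])

# Weekly ritual frequencies; treatment for ritual 1 starts at week 4,
# for ritual 2 at week 8 (all numbers invented for illustration).
ritual_1 = [12, 13, 12, 14, 6, 5, 4, 3, 3, 2, 2, 2]
ritual_2 = [9, 10, 9, 9, 10, 9, 10, 9, 4, 3, 3, 2]

print(change_at_intervention(ritual_1, 4))  # drop coincides with its own start
print(change_at_intervention(ritual_2, 8))  # stable until week 8, then drops
```

Because each series changes just when its own treatment begins, rather than both changing together, a generalised external cause becomes a less plausible explanation.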
Finally, note that we have described here the common approach of analysing the results of such single-case designs by visual inspection – i.e. by looking at the pattern of results and seeing what they seem to show. Over the past 20 years, there have also been developments in the statistical analysis of single-case designs, but such statistics are not yet straightforward enough for most ordinary clinicians to use.
Evaluating a service
The other common form of evaluation is the collection of data about whole services and, therefore, larger numbers of clients. The main purposes of such evaluations are:
It is impossible to specify what kind of data should be collected, as that depends on your own service’s interests and goals, but most services collect various forms of data, including:
Several years ago the service in which all the authors then worked decided to implement a limit of 10 sessions on treatment, in an attempt to reduce waiting lists. Because this change naturally aroused some worries, it was agreed that its effects should be evaluated. Several different aspects of the new procedure were included in the evaluation:
The results were that, broadly, the 10-session limit did not lead to different outcomes; clients were just as satisfied; and therapists had a ‘swings and roundabouts’ response, finding some things harder but others easier. The exception to the finding of broadly similar outcomes was some evidence that clients with ‘personality disorders’ did less well with the brief treatment, so this was investigated further.
Some frequently used questionnaires
Which outcome measures to use is again a matter for each service to decide according to its needs, but the following are suitable for routine clinical use in that they (a) do not take too long for a client to complete; (b) are widely used, so that comparisons can be made with other services and/or research trials; and (c) assess aspects of mental health that are common in most populations.
A literature search will quickly turn up other measures suitable for almost any specific mental-health problem.
Standardised questionnaires are often supplemented by other measures, such as individual problem ratings, belief ratings for particular cognitions, problem frequency counts or duration timings, and so on (see Chapter 5).
Clinical significance statistics
Service-evaluation data can be analysed using any of the standard statistical approaches. However, an approach known as ‘clinical significance’ analysis is particularly suited to clinical services, especially the version developed by Jacobson and his colleagues (Jacobson & Revenstorf, 1988; Jacobson, Roberts, Berns & McGlinchey, 1999). The aim of clinical significance analysis is to deal with a problem in conventional statistical testing: almost any change in average scores, even a tiny one, will emerge as significant if the number of participants in the study is large enough. Conventional testing tells us that such a change is ‘significant’ in the sense that it is unlikely to have emerged by chance but does not tell us that it is significant in the sense of being important. Thus, given large enough numbers, a change in patients’ mean BDI score of a couple of points from start to end of treatment might be statistically significant – and rightly so, in the sense of being ‘not due to chance’. But clinicians would not regard such a change of score as clinically significant, in the sense that their clients would not be happy if this was the kind of benefit they could expect.
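A little arithmetic makes the point. The figures below are invented, and the simple large-sample z statistic is our own illustrative choice of test; it shows how the same small mean change is wildly ‘significant’ with a large sample but not with a small one:

```python
# Rough illustration (invented numbers) of why statistical significance is
# not clinical significance: with enough participants, even a 2-point mean
# drop on a questionnaire yields an enormous test statistic.
import math

def paired_z(mean_change, sd_change, n):
    """z statistic for a mean change score (large-sample approximation)."""
    return mean_change / (sd_change / math.sqrt(n))

print(round(paired_z(2.0, 10.0, 10_000), 1))  # → 20.0 : hugely 'significant'
print(round(paired_z(2.0, 10.0, 25), 1))      # → 1.0 : same change, small n
```

The test statistic grows with the square root of the sample size while the clinical meaning of a 2-point change stays exactly the same.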
Jacobson’s approach to testing for clinical significance looks at each participant in a study individually and asks two questions: (a) is the change in the client’s score large enough to be reliable, i.e. unlikely to be accounted for by measurement error alone; and (b) has the client’s end-of-treatment score crossed a cutoff point, so that it is more typical of the ‘normal’ population than of the clinical population?
Figure 18.5 Classification of change scores for clinical significance
Figure 18.5 shows the possible outcomes resulting from this analysis for each client. Depending on the above two calculations, every client is classified as: reliably deteriorated, no reliable change, reliably improved (but not recovered) or recovered. The results of the analysis are reported as the proportion of clients falling into each of these categories.
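The two calculations and the resulting classification can be sketched as follows. The standard deviation, reliability and cutoff values below are illustrative, not normative, and the sketch assumes a measure on which lower scores are better:

```python
# Hedged sketch of Jacobson-style classification: a client counts as
# 'recovered' only if the change both exceeds the reliable change criterion
# and crosses a clinical cutoff. All parameter values here are illustrative.
import math

def reliable_change_criterion(sd_pre, reliability):
    """Smallest change unlikely (p < .05) to be due to measurement error."""
    se_measurement = sd_pre * math.sqrt(1 - reliability)
    se_difference = math.sqrt(2) * se_measurement
    return 1.96 * se_difference

def classify(pre, post, rc_criterion, clinical_cutoff):
    """Assign one of the four Figure 18.5 categories to a single client."""
    change = pre - post                      # positive = improvement
    if change <= -rc_criterion:
        return "reliably deteriorated"
    if change < rc_criterion:
        return "no reliable change"
    if post < clinical_cutoff:
        return "recovered"
    return "reliably improved"

# Illustrative values (not official BDI norms)
rc = reliable_change_criterion(sd_pre=10.0, reliability=0.9)   # ≈ 8.77
print(classify(30, 8, rc, clinical_cutoff=14))   # → recovered
print(classify(30, 18, rc, clinical_cutoff=14))  # → reliably improved
print(classify(30, 28, rc, clinical_cutoff=14))  # → no reliable change
```

Running every client through a classification like this, and reporting the proportion in each category, yields exactly the kind of summary described in the text.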
The advantages of this approach are:
a that it gives us more meaningful statistics to report: most clinicians would agree that a client who meets both of the Jacobson criteria has truly made clinically significant progress;
b that the resulting figures are more comprehensible to clients and/or service commissioners: it is much easier for most people to understand ‘On average 56% of clients recover’ than ‘On average, clients’ scores on the BDI move from 17.3 to 11.2’ – Westbrook and Kirk (2005) give an example of this kind of analysis for routine clinical data.
Incidentally, it is worth noting that although such ‘bench-marking’ strategies (see also Wade, Treat & Stuart, 1998; Merrill et al., 2003) typically find that CBT is an effective treatment in clinical practice as well as in research trials, clinical significance analysis is sobering for anyone who believes that CBT (or any other kind of psychological therapy) is a panacea that can help all clients: most such analyses find that only around a third to a half of clients achieve recovery by these criteria.
Difficulties in evaluation
Keep it simple
There is always a temptation to gather more data. It is easy to think ‘Whilst we’re at it, let’s find out about this … and this … and this …’. The result can be an unwieldy mass of data that overburdens the client, is too time-consuming to collect reliably and is even more time-consuming to analyse. In general, it is better to have a small number of data items which can be collected and analysed reasonably economically.
Sometimes clients become over-familiar with regularly used measures and begin to complete them on ‘automatic pilot’. Always spend a minute or two discussing questionnaire results with your client so that you can assess how valid the responses are.
Keep it going
Most routine data collection starts enthusiastically, but enthusiasm alone cannot be sustained. We suggest two factors are important in keeping data collection going. First, having a ‘champion’ at a reasonably senior level – someone who will support data collection and analysis and make sure that people are prompted if they forget to collect data. Second, it is crucial that clinicians collecting data see that something is done with it and that results are fed back to them periodically. Data that are never analysed are useless anyway, and the chances are low that people will continue to collect data when no results ever appear.
Clinical service evaluation usually cannot reach the highest standards of research design, such as RCTs. All research designs involve some compromise between (a) the tightly controlled research that eliminates as much uncertainty as possible but, in doing so, may end up not resembling real clinical practice; and (b) the more ‘real-world’ research that is very close to clinical practice but, as a result, leaves room for ambiguity about causal factors. Service evaluation therefore often works on the principle that some evidence is better than nothing and accepts some lack of rigour for the sake of being able to describe everyday outcomes. Robson’s (2002) book on ‘real-world research’ is a useful resource to look further at these issues.
Review and reflection:
Taking it forward:
Robson, C. (2002). Real world research (2nd ed.). Oxford: Blackwell. [3rd edition due in 2011]
As the title suggests, this is an excellent and comprehensive introduction to doing research in the ‘real world’, i.e. outside academic settings.
Field, A. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Statistics is always going to be intimidating for many of us, but Field does as good a job as possible in making it interesting and practical, with many detailed examples of how to do statistical tests using the popular statistics software package, SPSS.
Westbrook, D. (2010). Research and evaluation. Chapter 18 in M. Mueller, H. Kennerley, F. McManus, & D. Westbrook (Eds.), The Oxford guide to surviving as a CBT therapist. Oxford: OUP.
A brief introduction to some of the issues around doing evaluation research in clinical practice.