Beyond the happy sheet: measuring leadership development that matters

Satisfaction scores and completion rates tell you a programme ran. They tell you nothing about whether capability changed. The fix is not more measurement. It is better measurement design.

CapabilityFX Editorial Team · Editorial Team · 4 March 2026 · 8 min read

The happy sheet measures the wrong thing well

At the end of most leadership programmes, participants fill in a form. They rate the facilitator, the venue, the relevance of the content, and how likely they are to recommend the experience to a colleague. The scores come back high. They almost always do. The report goes to the sponsor, the budget is justified for another year, and everyone moves on. This document is known, not affectionately, as the happy sheet. It is a precise instrument. It measures satisfaction accurately. The trouble is that satisfaction is not the thing anyone actually bought.

If you are responsible for leadership development in your organisation, the gap between what you measure and what you are trying to produce is the most expensive blind spot you carry. It is not that the numbers are wrong. It is that they answer a question nobody important is asking.

Activity metrics feel like evidence and are not

The default measures of leadership development fall into two families, and both describe activity rather than change.

The first family is reaction data: satisfaction scores, net promoter ratings, "I will apply this back at work" self-reports collected in the room. The second is completion data: attendance, modules finished, hours logged, certificates issued. Together they make up the bulk of what most L&D functions report upward, because they are cheap to collect, available immediately, and reliably flattering.

The problem is well documented. Donald Kirkpatrick set out his four levels of training evaluation in the 1950s, and the framework is still taught: reaction, learning, behaviour, results. The uncomfortable finding, confirmed repeatedly since, is that the levels barely correlate. People who enjoy a programme do not reliably learn more from it. People who pass the knowledge check do not reliably behave differently afterward. A leader can rate a programme nine out of ten, recall the model perfectly six weeks later, and run their team in exactly the way they did before. Reaction and recall are real signals. They are simply signals about the experience, not about the leader.

This is why activity metrics are seductive and misleading at the same time. They go up. They go up whether or not anything changed in the work. An organisation can improve its satisfaction scores year on year, raise completion rates to 98 percent, and produce no measurable difference in how its leaders behave under pressure. The dashboard looks healthier every quarter while the thing it claims to track stays flat. We have argued before that most leadership programmes do not build capability at all. Measuring them with happy sheets is how that failure stays invisible.

Design the measurement before you design the programme

The instinct, once an organisation sees the gap, is to measure more. Add a six-month follow-up survey. Add a 360. Add a knowledge assessment. This usually makes things worse, not better, because it produces more numbers without producing more clarity. The discipline that matters is not measuring more. It is choosing fewer measures, deliberately, and choosing them before the programme is designed rather than after it has run.

Good measurement design rests on three distinctions that most dashboards blur.

Leading and lagging indicators

A lagging indicator tells you what already happened: retention of high-potential leaders, internal promotion rates, engagement scores in teams whose managers went through development, time to fill senior roles from within. These are the outcomes the organisation actually cares about. They are also slow, noisy, and influenced by a hundred things other than your programme. If you wait only for lagging indicators, you will learn whether the investment worked roughly two years on, long after you could have done anything about it.

A leading indicator is an earlier, observable signal that the change is underway: are leaders holding the difficult conversations they previously deferred, are they surfacing problems sooner, are their teams escalating issues earlier rather than later. Leading indicators are less authoritative on their own, but they are timely, and they let you correct course while the engagement is still live. The design principle is to pair them: at least one leading indicator you can read within months, anchored to at least one lagging outcome you genuinely care about. A leading indicator with no lagging anchor measures motion. A lagging indicator with no leading signal arrives too late to steer.
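
The pairing rule can be written down as a simple check on a measurement design. What follows is a minimal sketch, not a prescribed tool: the `Indicator` type, the `design_gaps` function and the example indicator names are all hypothetical, chosen only to illustrate the rule that a design needs at least one leading and one lagging measure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """One measure in a design. All fields are illustrative assumptions."""
    name: str
    kind: str  # either "leading" or "lagging"


def design_gaps(indicators):
    """Return a list of pairing problems; an empty list means the design is paired."""
    kinds = {indicator.kind for indicator in indicators}
    gaps = []
    if "leading" not in kinds:
        gaps.append("no leading indicator: the verdict arrives too late to steer")
    if "lagging" not in kinds:
        gaps.append("no lagging anchor: the design measures motion only")
    return gaps


# Hypothetical example designs.
paired = [
    Indicator("problems escalated while still small", "leading"),
    Indicator("internal promotion rate at two years", "lagging"),
]
lagging_only = [Indicator("retention of high-potential leaders", "lagging")]

print(design_gaps(paired))
print(design_gaps(lagging_only))
```

The check is deliberately crude. It says nothing about whether the chosen indicators are good ones, only whether the design has both a signal to steer by and an outcome to anchor to.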

Behaviour measured in the work

The level Kirkpatrick called behaviour is where the real evidence sits, and it is the level almost everyone skips, because it is the hardest to capture. Behaviour change cannot be read from a survey filled in by the person who changed. It has to be observed in the work, by people positioned to see it, over time.

This does not require a research department. It requires deciding, in advance, what observable behaviours would count as evidence, and who is placed to notice them. A direct report can tell you whether their manager now asks what they think before deciding. A peer can tell you whether a colleague who used to dominate meetings now makes room. These are specific, watchable things. We have written separately about the qualitative markers a coach learns to watch for; the point for measurement design is that those markers can be turned into a small, structured observation that a few well-placed colleagues complete, not a 40-item competency rating that everyone games.

Outcomes that connect to the business

Finally, the measurement has to reach the business question that justified the spend. Not "did leaders enjoy it" but "did the thing we needed leaders to do better actually get better." If the case for the programme was that senior decisions were bottlenecking at three desks, the outcome metric is whether those decisions now distribute. If the case was that the organisation could not promote from within, the outcome is the internal promotion rate two years out. Naming this metric before the programme starts is what stops the evaluation drifting back to the happy sheet by default.

What it looks like in practice

The difference between activity measurement and capability measurement is clearest in specific cases.

The L&D director who replaced the happy sheet with three observed behaviours. An L&D director at a South African retail group had a flagship leadership programme with a 9.1 average satisfaction score and a completion rate she could be proud of. Her chief executive asked a plain question: what has changed because of it? She could not answer. For the next cohort she did something different. Working with the line managers of the participants, she defined three observable behaviours the programme was meant to produce, expressed as things a colleague could watch happen: the leader raises a difficult issue in a meeting rather than handling it privately afterward; the leader asks a direct report for their view before stating their own; the leader names a mistake plainly without either deflecting or over-apologising. At the start, midpoint, and four months after the programme, each participant's manager and two direct reports completed a short observation against those three behaviours. The satisfaction score still got collected, but it stopped being the headline. The headline became a movement she could show: on the first behaviour, raising difficult issues, the proportion of participants observed doing it consistently rose across the cohort. The number was modest. It was also real, and for the first time it described the leaders rather than the experience.
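
The headline number in a design like this is simple arithmetic. The sketch below shows one way it could be computed, under assumptions the case does not specify: the data is invented for illustration, and a participant counts as "observed doing it consistently" when a majority of their three observers (manager plus two direct reports) say so. The names `RATINGS` and `share_observed` are hypothetical.

```python
# Invented observations for ONE behaviour (raising difficult issues).
# Each list holds three yes/no judgements: manager, direct report, direct report.
RATINGS = {
    "p1": {"start": [False, False, True], "midpoint": [True, False, True], "followup": [True, True, True]},
    "p2": {"start": [False, False, False], "midpoint": [False, False, True], "followup": [True, True, False]},
    "p3": {"start": [True, True, False], "midpoint": [True, True, True], "followup": [True, True, True]},
    "p4": {"start": [False, False, False], "midpoint": [False, False, False], "followup": [False, True, False]},
}


def share_observed(ratings, wave):
    """Proportion of participants whom a majority of observers saw doing the behaviour.

    The majority rule is an assumption; stricter rules (for example, all
    observers agreeing) would be equally defensible.
    """
    flags = []
    for waves in ratings.values():
        votes = waves.get(wave)
        if votes:  # skip participants with no observations in this wave
            flags.append(sum(votes) * 2 > len(votes))
    return sum(flags) / len(flags) if flags else None


for wave in ("start", "midpoint", "followup"):
    print(wave, share_observed(RATINGS, wave))
```

Whatever rule is chosen, it needs to be fixed before the first wave of observations, so that the movement between waves reflects the leaders rather than a change in how they were counted.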

The HR business partner who paired a leading signal with a lagging outcome. An HR business partner in a manufacturing business was under pressure to prove that a costly development engagement was working before the next budget cycle, well before any promotion or retention data could mature. Rather than wait for the lagging numbers, she chose one leading indicator she could read within four months: the rate at which problems were being escalated to senior leaders early, while they were still small, versus surfacing late as crises. She already had a rough proxy in the incident log. As the engagement progressed, the pattern shifted: more issues arrived at the executive table while they were still manageable, fewer arrived as crises that had already done damage. That did not prove the programme had worked. It did show the behaviour the programme targeted was moving, in the work, in time to report. She paired it explicitly with the lagging outcome the business actually wanted, internal promotion into senior operational roles, and set the expectation that the real verdict would come from that number two years out. The leading signal bought the engagement the patience the lagging signal required.
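
A leading indicator of this kind can often be read from records the organisation already keeps. As a rough sketch only, the code below assumes an incident log in which each entry has been tagged with a period and with whether the issue reached senior leaders "early" or "late". The log format, the tags and the figures are invented, and the case does not say how the proxy was actually calculated.

```python
from collections import Counter

# Invented incident log: (period, stage at which the issue reached senior leaders).
INCIDENTS = [
    ("Q1", "late"), ("Q1", "late"), ("Q1", "early"), ("Q1", "late"),
    ("Q2", "early"), ("Q2", "late"), ("Q2", "early"), ("Q2", "late"),
    ("Q3", "early"), ("Q3", "early"), ("Q3", "late"), ("Q3", "early"),
]


def early_share_by_period(incidents):
    """Share of incidents in each period that were escalated early."""
    totals = Counter()
    early = Counter()
    for period, stage in incidents:
        totals[period] += 1
        if stage == "early":
            early[period] += 1
    return {period: early[period] / totals[period] for period in sorted(totals)}


print(early_share_by_period(INCIDENTS))
```

A rising share is a signal that the targeted behaviour is moving, not proof that the programme caused it. That is why the figure is reported alongside the lagging outcome rather than in place of it.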

What both cases share is restraint. Neither practitioner added a dozen new metrics. Each chose a few, defined them before measuring, and accepted that a modest, honest number beats an impressive, meaningless one. That restraint is the discipline. A dashboard with 30 indicators is usually a sign that nobody decided which three mattered.

How the work itself produces the measure

There is a reason capability is hard to measure with conventional tools, and it is the same reason it is worth building. Real development happens at the level of who a leader is, not only what they can do, and identity-level change shows up in behaviour under pressure rather than in a self-report the next morning. The measurement problem and the development problem are the same problem viewed from two sides.

This is where the structure of the work helps. The DUAL model, which organises how CapabilityFX approaches development, moves a leader through Discover, Understand, Accept and Lead, and each stage produces observable evidence if you know to look for it. A leader who has genuinely reached Accept stops retreating from uncomfortable realities, and that is watchable in how they run a hard conversation. Our 4D method is built around an arc of practice, reflection and challenge across months rather than a compressed sprint, which means there is time for behaviour to move and to be observed at intervals rather than guessed at from a single end-of-course form.

Validated assessment has a place here too, used honestly. CapabilityFX uses Ennea International's Five Lens Development Platform and Tomorrows Compass's future-readiness assessment as part of how it establishes a baseline and tracks shift on dimensions that predict performance under pressure. These are licensed instruments owned by those companies, not ours, and they are not a substitute for watching behaviour in the work. They are one input in a measurement design, not the whole of it. Our assessments page sets out what each instrument does and where it fits.

The reader's next step

If you want to test whether your current measurement is doing any real work, three questions are usually enough.

What would change if every satisfaction score went up by a point? If the honest answer is "nothing important," that metric is decorative. Keep it if you like, but stop treating it as evidence.

Can you name one observable behaviour your last programme was meant to produce, and say whether it moved? If you cannot name the behaviour, the programme had no measurable target. If you named it but cannot say whether it moved, you measured the wrong thing.

Which of your indicators is leading, and which is lagging? If they are all lagging, you will only ever learn the verdict too late to act. If they are all leading, you have no anchor to the outcome the business actually wanted. The design is the pairing.

These questions do not require a new system. They require deciding what you are actually trying to produce, and choosing a small number of measures that would honestly tell you whether you are producing it. That decision is cheap. The absence of it is what makes leadership spend so hard to defend.

Measure the change, not the activity

The happy sheet survives because it is easy and because it flatters. It will keep surviving until someone decides that the question worth answering is not whether leaders enjoyed the programme but whether they lead differently because of it. That decision changes what you measure, and changing what you measure eventually changes what you build, because you can no longer hide a programme that produces no behaviour behind a programme that produces high scores.

If you want to think through how to measure development that actually changes behaviour in your organisation, start a conversation with us. It is the conversation most leadership investment never has, and it is the one that decides whether the investment was worth making.

The leaders described here are representative composites drawn from patterns we observe in practice, not identifiable individuals.

CapabilityFX Editorial Team · Editorial Team

The CapabilityFX editorial team writes on leadership capability, future-readiness, assessment, and the research behind how leaders actually change. Our pieces are grounded in Dr Eric Albertini’s doctoral research and the firm’s work with leadership teams, and are reviewed for evidence and accuracy before publication.