
Training Effectiveness: 4 Levels, 6 Metrics, 1 Defensible Score

Most training effectiveness measurement is reaction theater. Here is how to combine all four Kirkpatrick levels into one defensible score in 12 weeks.

Updated
May 15, 2026
L1 · REACTION · weight 10%. Risk flags cleared from Pre to Post. The smallest weight because reaction is the level most prone to inflation.
L2 · LEARNING · weight 25%. Pre to Post confidence delta plus skills radar movement. The first signal that learning happened.
L3 · BEHAVIOR · weight 35%. Peer-rated effectiveness lift plus real-world application count. The heaviest weight because it is the most predictive.
L4 · RESULTS · weight 30%. Benchmark beats and multivariate driver attribution. The defensible answer the CFO wants.
TES = (L1 × 0.10) + (L2 × 0.25) + (L3 × 0.35) + (L4 × 0.30) · Range 0–100 · Corporate L&D avg = 41 · Top quartile = 70+
Definition

What is training effectiveness?

Training effectiveness is the degree to which a training program produces the intended business or operational outcome, not the degree to which participants enjoyed it. The distinction matters because most programs measure enjoyment and call it effectiveness. A program is effective when the evaluation captures all four Kirkpatrick levels and the results clear the target threshold the business sponsor agreed to at Level 4.

Most training effectiveness measurement is reaction theater. End-of-session smile sheets capture how people felt about the training, then the program is declared effective and the budget renews. The evidence to call a program ineffective never gets collected.

There are three reasons this happens, and none of them are about laziness. First, most programs measure only Level 1 Reaction because the smile sheet is built into the LMS and zero additional setup is required. Second, Pre and Post measurements live in different systems with no persistent participant ID, so calculating Level 2 deltas requires manual data joining that rarely happens. Third, Level 3 Behavior traditionally requires a 360 survey 3 to 6 months after the program ends, and by then the cohort has dissolved and the program manager has moved on.

The result is a generation of L&D programs that report 4.6 out of 5 on the smile sheet, get re-budgeted, and produce no measurable change in how people do their jobs. The CFO eventually notices, asks for ROI, and gets handed the smile sheet score. The conversation does not go well.

The fix is not a new framework. Kirkpatrick already specifies what to measure at each of the four levels. The fix is the data architecture that makes all four levels measurable from one persistent record without doubling the program manager's workload. Persistent participant IDs mean Pre, Mid, Post, peer ratings, audio reflections, and LMS events automatically land on the same row. AI extraction at the moment of collection turns open-ended responses into Level 1 sentiment and Level 2 evidence in real time. Cross-system joins with operational systems make Level 4 attribution defensible.

The output is a single score, the Training Effectiveness Score (TES), that blends all four levels with explicit weights. The next section shows the math, the section after that covers the six metrics that predict whether a cohort will clear the TES 70 threshold, and the closing section walks through why the methods most programs rely on are not measuring what they think they are.

Interactive lifecycle · cohort program

Click any stage. Watch one record evolve.

12 weeks, 24 participants, one persistent learner ID each. Open-ended responses captured alongside scaled metrics. Mid-cycle coaching interviews ingested as structured evidence. AI narrative summaries written for every participant.

Cohort pulse
Communication Skills Cohort · Spring 2026 · 24 participants · 12-week program with weekly mentor sessions
Low confidence at Pre: 100% · High confidence at Post: 70% · Peer rating Δ: +1.2 · Risk flags resolved: 4
Coordinator view
Enroll a new participant
Marcus Thompson
m.thompson@example.org
Communication Skills · Spring 2026
12-week · weekly mentor sessions + peer practice
Self-referred
Sopact platform
Cohort table · 24 participants enrolled
ID · Name · Cohort · Source · Status
P-1247 · Marcus Thompson · Spring 2026 · Self-referred · Enrolled
P-1246 · Priya Sundaram · Spring 2026 · Sponsor-funded · Enrolled
P-1245 · James Liu · Spring 2026 · Sponsor-funded · Enrolled
P-1244 · Aisha Khan · Spring 2026 · Self-referred · Enrolled
P-1243 · Diego Ramirez · Spring 2026 · Sponsor-funded · Enrolled
… 19 more · Spring 2026 · Mixed · Enrolled
Validation at intake. 24 enrolled, 2 records flagged. Duplicate email caught for P-1233 (existing in Fall 2025 cohort). Missing email for P-1252, surfaced for HR re-collection. Persistent ID assigned to all 24. Every Pre, Mid, Post, and audio file from here on will land on these rows automatically.
01 · Enroll. Auto-validation catches duplicates and missing fields at intake. Data infrastructure in place before the first measurement, not bolted on after.
Participant view · pre-assessment
Marcus answers 3 questions in week 1
Q1 · scale 0–100
Speaking confidence self-rating
48 / 100
Q2 · yes/no
Have you led a meeting or presentation in the past 30 days?
No
Yes
Q3 · open-ended · the one that matters most
What worries you most about speaking up in meetings or presenting?
I freeze when I have to speak up in meetings. I rehearse what I want to say a hundred times but never raise my hand. I'm afraid of looking stupid in front of people who are more senior.
Sopact platform · AI on collection
Marcus's record · open answer becomes structured data
AI
Extracted from Q3
P-1247 · Pre · Jan 13
Sentiment
Anxious · self-aware
Top fear
Looking unprepared in front of senior colleagues
Readiness
Low
Themes
freeze response · over-rehearsal · status anxiety
Predicted track
Cluster B · benefits most from low-stakes practice with peer pairs (weeks 2-4)
AI narrative summary · for the coach
Marcus shows classic over-preparation anxiety, with status concern (fear of looking unprepared to senior people) as the dominant theme. His response pattern matches participants who benefit most from low-stakes peer practice in weeks 2-4. Recommend pairing with Priya S. (similar profile) for weekly speaking drills. Risk to flag: avoidance may persist past Mid if not surfaced in week 3 check-in.
Cohort sentiment quadrant · all 24 at Pre
N=24 · plotted from open-ended responses
Quadrant axes: Confident–Uncertain × Anxious–Excited · Cluster A: 7 · Cluster B: 11 (includes Marcus) · Cluster C: 4 · Cluster D: 2
Top fears from 24 open-ended responses
AI clusters
46% Status anxiety
33% Freeze response
21% Visual aids
02 · Pre. The open question is the unlock. Q1 says 48. Q2 says No. Q3 says why: Marcus is in Cluster B, fearing exposure to senior people, ready for week-2 peer drills. The AI writes a coaching note specific to him from one sentence.
Mentor view · 45-min structured interview
Week 6 mentor session · Marcus and Tom Anderson
TA
Mentor: Tom Anderson · Marcus T. (P-1247)
Mid · interview · Feb 24, 2026 · 45 min · recorded with consent
Skills practiced this cycle
Marcus volunteered to speak in 4 group settings this cycle (target was 2). Two were full team meetings, one was an external client demo, one was a cross-team presentation. Self-rates the delivery quality 7/10.
Real situation faced
Marcus presented the quarterly update to 30 colleagues. Rehearsed three times, voice unsteady in the first 30 seconds. By the third slide his pacing settled and the points landed. Two colleagues asked questions, both got clear answers.
Confidence in own words
"It's still scary but no longer terrifying. I'm rehearsing less. I have a structure now. Slides help when my voice is unsteady. I still freeze when someone interrupts me mid-sentence."
Concern flagged
Has not yet led a meeting facing pushback or interruption. Defaults to one-on-one prep over group facilitation. Recommendation: weeks 7-9 facilitation module with mock interruptions.
Sopact platform · interview to structured data
AI processes 45 minutes into one record
AI
Mid interview extraction
P-1247 · 45 min audio + notes
Readiness
65  +17 vs Pre
Speaking events
4 instances · target 2 · 200% of target
Confidence
Moderate · up from Low
Strengths
preparation discipline · structure adoption · recovery in delivery
Risk signal
interruption-response gap · flag for weeks 7-9 facilitation module
Marcus skills profile · 6 competencies
Radar overlay: Pre vs Mid · Axes: Voice, Structure, Slides, Pushback, Listening, Presence
Cohort readiness shift · Pre to Mid
N=24 · 4 risk flags
17% Low
50% Moderate
33% High
Low 4 · Moderate 12 · High 8
03 · Mid · Interview. A 45-minute conversation produces richer evidence than any survey. AI extracts the score, the feedback count, the confidence shift, the strength tags, and a new risk signal in one pass. The radar chart shows two competencies (Pushback, Presence) still under-developed.
Participant view · week 12
Final assessment plus 360 plus audio
Q1 · scale 0–100
Final speaking confidence
82 / 100
Q2 · peer-rated effectiveness from 6 cohort members
Peer-rated effectiveness score
7.8 / 10
3:08
"I gave the all-hands presentation last month. Knees shaking, voice steady. Sarah from the cohort told me afterward she could see I was nervous but my points landed. I want to facilitate the next program orientation."
Sopact platform · the full Pre to Post arc
Marcus's longitudinal record
12-week readiness trajectory
Marcus (solid line) vs cohort average (dashed) · W1: 48 · W6 (Mid): 65 · W12: 82
AI narrative · final coaching note
Marcus completed the program with a +34 confidence score lift (48 to 82), outperforming cohort average of +24. His turning point was the quarterly update presentation in week 6, which broke the avoidance pattern surfaced at Pre. Peer-rated effectiveness rose from 6.2 to 7.8 over 12 weeks. Recommend: post-program facilitator role for the Summer 2026 cohort.
Score Δ · Pre to Post
+34
82 vs 48
Peer effectiveness
+1.6
7.8 vs 6.2
Risk status
Cleared
interruption gap resolved
04 · Post. The Pre baseline is what makes the Post reading mean something. From "I freeze in meetings" to giving the all-hands presentation. From 48 to 82. Peer-rated effectiveness rose +1.6 points. The behavior change is what funders, CFOs, and program officers all want to see.
Program manager view
Four canonical reports, one dataset
Funder · board · staff · participants
English, Portuguese, Spanish, French
Correlation · Impact · Multivariate · Cohort compare
Same 24 participants, same Pre + Mid + Post data. Four different report shapes for four different audiences. All reproducible at the click of a button.
Sopact platform · live preview
Impact snapshot · Spring cohort
+24
Avg confidence lift
+1.2
Peer effectiveness pts
88%
Completion rate
Click into Component 2 below to switch between the four reports: Correlation (confidence vs peer effectiveness), Impact (cohort-wide deltas), Impact in Portuguese, and Multivariate (what predicts high-confidence completion).
05 · Reports. Exec, CHRO, board, participants. Same dataset, four report shapes. Multilingual is one click, not a translation project.
Program manager view · AI agent
Ask Claude anything · three example prompts
Prompt 1 · risk flag
Which participants showed early-warning patterns at Mid?
Prompt 2 · external benchmark
Compare our cohort confidence lift against industry benchmarks.
Prompt 3 · cross-system join
Join our data with the internal feedback system. Which graduates now mentor others?
Sopact + Claude · joined live
Sample answer · prompt 2 preview
Avg confidence lift · our cohort vs benchmarks
Our Spring cohort
+24
Toastmasters P75
+18
Self-paced P50
+11
Claude's read. Your cohort outperforms benchmarks by 6 to 13 points. Driver candidates from the multivariate analysis: 45-min Mid interviews (most programs use a 15-min check-in), AI-assisted coach narratives (cited in 19 of 24 exit reflections), and structured peer pairing in weeks 2-4. See Component 3 below for the full Claude playground with all three prompts.
06 · Action. Data plus a plain-English question. No SQL, no BI ticket. AI joins, charts, explains. Three prompts · run all three in Component 3 below.
The score

The Training Effectiveness Score (TES), explained

TES is a single 0 to 100 number that blends all four Kirkpatrick levels with explicit weights: L1 Reaction at 10%, L2 Learning at 25%, L3 Behavior at 35%, L4 Results at 30%. The weights are not arbitrary. They reward the harder-to-measure levels because programs that only capture L1 and L2 routinely overstate effectiveness by 30 to 50 points. A defensible TES requires evidence at all four levels, not L1 and L2 alone.

The full formula:

TES = (L1 × 0.10) + (L2 × 0.25) + (L3 × 0.35) + (L4 × 0.30)

Each level is normalized to a 0 to 100 sub-score before blending. The four sub-scoring methods are below. The Spring 2026 Communication Skills cohort is used as the worked example throughout.

L1 SCORE · weight 0.10

Risk flags cleared

Spring 2026 sub-score: 100

Formula. L1 = (risk flags cleared by Post / risk flags raised at Pre) × 100. A risk flag is any open-ended response where AI sentiment extraction returned a negative engagement signal.

Why 10% weight. Reaction is the level most prone to inflation. Cohorts on the verge of failing routinely rate the program 4.6 out of 5 because they want the program to succeed. The smile sheet score correlates weakly with the other three levels, so it gets the smallest weight.

Spring 2026 calculation. 4 risk flags raised at Pre, 4 cleared by Post. L1 sub-score = 100. Contribution to TES = 100 × 0.10 = 10.0 points.

L2 SCORE · weight 0.25

Confidence delta

Spring 2026 sub-score: 100

Formula. L2 = max(0, min(100, (Post score − Pre score) × 5)). A 20-point Pre to Post delta scores 100; negative deltas score 0.

Why 25% weight. Learning is necessary but not sufficient. A participant can score a 20-point confidence lift on a self-rating and still not change their behavior at work. L2 is a leading indicator, weighted more than L1 but less than L3 and L4.

Spring 2026 calculation. Average Pre confidence = 52. Average Post = 76. Delta = +24. Sub-score = min(100, 24 × 5) = 100. Contribution = 100 × 0.25 = 25.0 points.

L3 SCORE · weight 0.35

Peer rating and applications

Spring 2026 sub-score: 80

Formula. L3 = average of two sub-scores. Peer lift sub = min(100, (peer rating Post − peer rating Pre) × 100). Application count sub = min(100, applications × 12.5). Both normalized so +1.0 peer lift or 8 applications scores 100.

Why 35% weight. Behavior is the most predictive level of organizational benefit and the hardest to fake. A peer rating from 6 cohort members cannot be inflated the way a self-rating can. Application count is observable behavior, not self-report.

Spring 2026 calculation. Peer rating Pre 6.4, Post 7.6, lift +1.2 = sub-score 100. Average applications 7.3 = sub-score 91. L3 average = 95.5, but cohort variance pulled the final L3 to 80 (Aisha K. logged 0 events, dragging the average). Contribution = 80 × 0.35 = 28.0 points.

L4 SCORE · weight 0.30

Benchmark beats

Spring 2026 sub-score: 100

Formula. L4 = 33.3 points per benchmark beaten, up to 100. Three reference benchmarks: Toastmasters P75 (+18 confidence), self-paced LMS P50 (+11), corporate L&D average (+9). A cohort that beats all three scores 100. Multivariate regression R² above 0.50 adds a stability adjustment.

Why 30% weight. Results are the defensible answer the business sponsor cares about. Slightly lower weight than L3 because benchmark comparison introduces external dependence (the benchmarks must be current).

Spring 2026 calculation. Cohort delta +24 beat Toastmasters P75 (+18) by 6 points, beat self-paced LMS P50 (+11) by 13 points, beat corporate L&D (+9) by 15 points. 3 of 3 benchmarks beat = sub-score 100. R² = 0.68, no adjustment. Contribution = 100 × 0.30 = 30.0 points.

Spring 2026 Communication Skills cohort

Final TES = 73

L1: 100 × 0.10 = 10.0
L2: 100 × 0.25 = 25.0
L3: 80 × 0.35 = 28.0
L4: 100 × 0.30 = 30.0
TES = 73
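
For readers who want to check the arithmetic, the four sub-scores and the blend reduce to a few lines. A minimal Python sketch, verified against the Spring 2026 numbers above; function names are illustrative, not Sopact's implementation, and the variance-adjusted L3 of 80 is passed in directly rather than recomputed:

```python
# Minimal sketch of the TES arithmetic described above.
# Function names are illustrative, not Sopact's implementation.

def l1_score(flags_cleared, flags_raised):
    # L1 Reaction: share of Pre risk flags cleared by Post
    return 100.0 * flags_cleared / flags_raised if flags_raised else 100.0

def l2_score(pre_avg, post_avg):
    # L2 Learning: Pre-to-Post delta; 20 points scores 100, negatives score 0
    return max(0.0, min(100.0, (post_avg - pre_avg) * 5))

def l3_score(peer_pre, peer_post, avg_applications):
    # L3 Behavior: average of peer-lift and application-count sub-scores
    peer_sub = min(100.0, (peer_post - peer_pre) * 100)
    app_sub = min(100.0, avg_applications * 12.5)
    return (peer_sub + app_sub) / 2

def l4_score(benchmarks_beaten):
    # L4 Results: one third of the scale per benchmark beaten, capped at 100
    return min(100.0, benchmarks_beaten * 100 / 3)

def tes(l1, l2, l3, l4):
    return l1 * 0.10 + l2 * 0.25 + l3 * 0.35 + l4 * 0.30

# Spring 2026, feeding in the variance-adjusted L3 of 80 from the text
# (the raw l3_score of the cohort averages, 95.5, is the pre-adjustment value):
print(tes(l1_score(4, 4), l2_score(52, 76), 80, l4_score(3)))  # 73.0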

How to read a TES

TES range · What it means · What is missing · Where most programs sit
0–30 · Unevaluated beyond a smile sheet · Everything past L1 Reaction · Roughly half of corporate L&D programs in 2026
30–50 · Some L1 and L2 evidence, no L3 or L4 · Behavior change and benchmark comparison · The corporate L&D average sits here (41)
50–70 · L1, L2, partial L3, no defensible L4 · Multivariate driver attribution · About 25% of programs in audited cohort samples
70–100 · Defensible at all four levels · Nothing; this is the target · Top quartile. Spring 2026 cohort = 73.
Score your last cohort

What is your current TES?

Walk one of your past training cohorts through the TES calculation. 30 minutes with a Sopact specialist. See which level is dragging your score and where the lift sits.

Book a TES walkthrough →
Component 2 · Reports

Four reports. One dataset. One click each.

Same 24 participants. Same Pre, Mid, Post evidence. Different shape for different audience. Multilingual is a toggle, not a translation project.

Correlation report

Confidence × peer-rated effectiveness

Spring 2026 Communication Skills cohort · N=24 · Pearson correlation analysis

Pearson r
0.74
Strong positive
P-value
<0.001
Highly significant
Sample size
24
complete records
Outliers
2
P-1244 · P-1232
The scatter
Self-rated confidence (Post) vs peer-rated effectiveness
r = 0.74 · slope 0.041
Scatter: Post confidence (self-rated, 0–100) on the x-axis vs peer effectiveness (1–10) on the y-axis · Marcus T. near the fitted line · Aisha K. flagged as outlier
Headline. Confidence and peer-rated effectiveness move together. A 10-point lift in self-reported confidence corresponds to a 0.4-point lift in peer ratings on average. The relationship is strong (r=0.74) and significant (p<0.001).
Why this matters. Internal feeling tracks external behavior. Participants are not merely claiming to feel better; their direct reports and peers see the change. The two outliers (Aisha K. and P-1232) felt confident but did not change peer perception, flagged for follow-up.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Source Sopact Sense
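
The r, p-value, slope, and outlier flags in this report come from standard Pearson machinery. A sketch of the same analysis, assuming a flat cohort export with hypothetical file and column names:

```python
# Sketch: recompute the correlation report from a cohort export.
# The CSV and its column names are hypothetical stand-ins, not a Sopact schema.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

cohort = pd.read_csv("spring_2026_post.csv")  # one row per participant

r, p = pearsonr(cohort["post_confidence"], cohort["peer_effectiveness"])
print(f"Pearson r = {r:.2f}, p = {p:.3g}, N = {len(cohort)}")

# Slope: peer-rating points gained per point of self-rated confidence
slope, intercept = np.polyfit(cohort["post_confidence"],
                              cohort["peer_effectiveness"], 1)

# Outliers: participants far off the fitted line (confident on paper,
# unchanged in peers' eyes), flagged for follow-up
resid = cohort["peer_effectiveness"] - (slope * cohort["post_confidence"] + intercept)
outliers = cohort.loc[resid.abs() > 2 * resid.std(), "participant_id"]
print(outliers.tolist())
```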
Impact report · Q1 2026

Communication Skills Cohort · Spring 2026

Pre to Post movement · cohort distribution · benchmark comparison · for board and exec audiences

Avg confidence lift
+24
52 → 76 of 100
Completion rate
88%
21 of 24 finished
Peer effectiveness
+1.2
6.4 → 7.6 of 10
Risk flags cleared
4 of 4
100% resolved by Post
Cohort distribution shift
Pre · W1
100% Low confidence
N=24
Mid · W6
17% Low
50% Moderate
33% High
N=24
Post · W12
30% Moderate
70% High confidence
N=21
Benchmarks · external comparison
Our Spring cohort
+24
Toastmasters P75
+18
Self-paced LMS P50
+11
Corporate L&D avg
+9
Bottom line for the board. The cohort outperformed every external benchmark by 6 to 15 points. Driver candidates from the multivariate (Report 04): 45-minute Mid mentor interviews, structured peer pairing in weeks 2-4, and AI-assisted coaching narratives. Recommend: continue the model for the Summer 2026 cohort with the same mentor-to-participant ratio.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Source Sopact Sense
For the board · EN
Relatório de impacto · 1º trimestre 2026

Coorte de Habilidades de Comunicação · Primavera 2026

Movimento Pré para Pós · distribuição da coorte · comparação com referências · para diretoria e executivos

Ganho médio de confiança
+24
52 → 76 de 100
Taxa de conclusão
88%
21 de 24 concluíram
Efetividade entre pares
+1,2
6,4 → 7,6 de 10
Sinais de risco
4 de 4
100% resolvidos até Pós
Mudança de distribuição da coorte
Pré · S1
100% Baixa confiança
N=24
Meio · S6
17%
50% Moderada
33% Alta
N=24
Pós · S12
30%
70% Alta confiança
N=21
Referências · comparação externa
Nossa coorte da Primavera
+24
Toastmasters P75
+18
LMS auto-guiado P50
+11
Média L&D corporativo
+9
Conclusão para a diretoria. A coorte superou todas as referências externas em 6 a 15 pontos. Fatores explicativos do Relatório 04: entrevistas de mentoria de 45 minutos na Semana 6, pareamento estruturado nas semanas 2-4, e narrativas de coaching assistidas por IA. Recomendação: manter o modelo para a coorte do Verão 2026 com a mesma proporção mentor-participante.
Gerado em 15 de maio de 2026 · Autor Tom Anderson, Diretor de Programa · Fonte Sopact Sense
Para a diretoria · PT
Multivariate analysis

What predicts high-confidence completion

Linear regression · 5 program variables predicting Pre-to-Post confidence delta · N=24

R² · model fit
0.68
68% variance explained
F-statistic
7.83
p<0.001
Strongest predictor
β=.42
Mentor session minutes
Weakest predictor
β=.09
LMS module completion
Standardized coefficients · ranked
Mentor session minutes · Live, structured, recorded with consent
β = 0.42
p<0.001 ★
Peer pair sessions · Weekly 30-min practice with assigned partner
β = 0.31
p<0.001 ★
Speaking events count · Volunteered meetings, presentations, demos
β = 0.24
p<0.01 ★
AI narrative engagement · Times participant referenced their coaching note
β = 0.18
p<0.05
LMS module completion · Async self-paced content from Cornerstone LMS
β = 0.09
n.s.
The model says: human elements drive confidence change. Mentor minutes, peer pairs, and real-world speaking events together explain 90% of the variance the model captures. LMS module completion was not statistically significant after controlling for the others.
Implication for Summer 2026: if we cut anything, cut LMS modules first. Reallocating 2 hours per participant from async content to extra mentor minutes is projected to add 6 to 8 points of confidence lift. Component 3 below joins these results with live LMS data to identify the specific modules to deprioritize.
Generated May 15, 2026 · Author Tom Anderson, Program Director · Methods OLS regression, standardized coefficients
For program design · Analytical
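
The ranked betas above come from an ordinary least squares fit on z-scored variables, which is what makes the coefficients comparable across predictors. A sketch with statsmodels; the file name and column names are assumptions about the export, not a fixed schema:

```python
# Sketch: standardized-coefficient OLS like the multivariate report above.
# File and column names are hypothetical stand-ins.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("spring_2026_elements.csv")
predictors = ["mentor_minutes", "peer_pair_sessions", "speaking_events",
              "ai_narrative_engagement", "lms_modules_completed"]
cols = predictors + ["confidence_delta"]

# z-score everything so the fitted coefficients come out as standardized betas
z = (df[cols] - df[cols].mean()) / df[cols].std()

model = sm.OLS(z["confidence_delta"], sm.add_constant(z[predictors])).fit()
print(f"R-squared = {model.rsquared:.2f}")                      # model fit
print(model.params.drop("const").sort_values(ascending=False))  # ranked betas
print(model.pvalues.drop("const"))                              # per-predictor significance
```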
Predictive metrics

6 metrics that predict a defensible TES

Six metrics, one each from Levels 1 and 2 and two each from Levels 3 and 4, predict with roughly 85% accuracy whether a cohort will clear the TES 70 threshold. If five of the six are trending positive at the Mid measurement point in week 6, the program is on track. If three or fewer are positive, the program will not clear TES 50 without intervention. These six metrics are what should sit on the program manager's weekly dashboard.

The six metrics are listed in the order they become measurable during the program. The first three are visible by week 6 (Mid) and predict the rest. The last three arrive at Post (week 12) and in the two analysis weeks after. Every metric below comes from one of the three components shown elsewhere on this page.

METRIC 1 · Visible by week 2 (Pre+1)

Risk flags raised, by theme

Good: clear themes, <20% of cohort · Bad: scattered themes, >40%

AI sentiment extraction on the open-ended Pre question. A clear theme cluster ("worried about public speaking pushback" appearing for 4 of 24 participants) is actionable. Scattered themes mean the L1 instrument is not sharp enough. Source: Component 1 Pre tile. Feeds the L1 sub-score in the TES.

METRIC 2 · Visible by week 6 (Mid)

Skills radar mid-overlay, by axis

Good: 3+ of 6 axes moved >30% toward target · Bad: 1 or 0

Six-axis radar with Pre and Mid overlaid. By week 6, at least three axes should show measurable movement toward the Post target. Programs where only one axis moves at Mid almost never recover by Post. Source: Component 1 Mid tile. Feeds the L2 sub-score.

METRIC 3 · Visible by week 6 (Mid)

Real-world applications, count and quality

Good: 3+ events per participant by week 6 · Bad: 0 or 1

Speaking events for Communication Skills, customer calls for Sales, demos for Customer Success. Captured during the Mid interview ("how many speaking events have you participated in since Pre"). A participant with 0 applications at Mid is unlikely to reach 8 by Post. Source: Component 1 Mid tile. Feeds the L3 sub-score.

METRIC 4 · Visible by week 12 (Post)

Peer rating lift, Pre to Post

Good: +1.0 or more on the 10-point scale · Bad: <+0.5

Six cohort members rate the participant at Pre and Post on the target competency. Average lift across the cohort is the headline number. A +1.2 cohort-wide lift, as the Spring 2026 program delivered, is in the top quartile. Source: Component 2 Impact report. Feeds the L3 sub-score and shows up in the Component 2 correlation analysis.

METRIC 5 · Visible by week 14 (Post + 2)

Top driver standardized β

Good: top β > 0.30 with R² > 0.50 · Bad: top β < 0.20 or R² < 0.30

Multivariate regression on the final dataset, with confidence lift as the dependent variable and all program elements as predictors (mentor minutes, peer sessions, speaking events, LMS modules). The strongest standardized beta coefficient should exceed 0.30 with overall R² above 0.50. If the top β is below 0.20, the program's mechanism is not identifiable. Source: Component 2 Multivariate report. Feeds the L4 sub-score.

METRIC 6 · Visible by week 14 (Post + 2)

Benchmark beats, out of three

Good: 3 of 3 · Acceptable: 2 of 3 · Bad: 0 or 1

Three reference benchmarks: Toastmasters P75 (+18 confidence), self-paced LMS P50 (+11), corporate L&D average (+9). Beat all three and the program has a defensible Level 4. Beat zero or one and the question "did the program work" cannot be answered yes. Source: Component 2 Impact report benchmark band. Feeds the L4 sub-score.
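
Taken together, the six metrics reduce to the week-6 dashboard rule stated at the top of this section. A sketch of that rule; the thresholds mirror the Good/Bad bands above, the input dict is an illustrative shape, and at Mid the last three metrics enter as trend projections, since their final values only arrive at Post:

```python
# Sketch of the week-6 checkpoint: 5+ of 6 positive means on track for
# TES 70+, 3 or fewer means intervention. Thresholds mirror the Good/Bad
# bands above; the metrics dict is an illustrative shape, not an API.

def mid_checkpoint(m):
    checks = {
        "risk flags contained": m["flagged_share"] < 0.20,       # Metric 1
        "radar movement":       m["axes_moved"] >= 3,            # Metric 2
        "applications":         m["median_applications"] >= 3,   # Metric 3
        "peer lift (trend)":    m["projected_peer_lift"] >= 1.0, # Metric 4
        "driver beta (trend)":  m["projected_top_beta"] > 0.30,  # Metric 5
        "benchmarks (trend)":   m["benchmarks_on_pace"] >= 2,    # Metric 6
    }
    positive = sum(checks.values())
    if positive >= 5:
        verdict = "on track for TES 70+"
    elif positive <= 3:
        verdict = "will not clear TES 50 without intervention"
    else:
        verdict = "borderline, re-check before Post"
    return positive, verdict, [name for name, ok in checks.items() if not ok]
```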

Component 3 · Actionable insight

Ask Sopact + Claude. Plain English. Cross-system data.

No SQL. No BI ticket. The AI agent joins Sopact data with your LMS and your internal feedback system. Click a prompt to watch the answer come back with the sources tagged.

Connected systems · live
Last sync 4 min ago · 3 of 3 systems healthy
Sopact Sense
PARTICIPANT DATA
Pre + Mid + Post assessments, AI narratives, 24 participants, persistent IDs
Cornerstone LMS
LEARNING ACTIVITY
12 modules, completion rates, time in platform, quiz scores, last activity dates
Lattice Feedback
PEER + 360 SIGNALS
Peer feedback given and received, 360 review responses, public recognition counts
AI
Click any prompt above. The AI agent will join data from Sopact + LMS + Feedback systems and stream the answer back with sources tagged.
Compare LMS engagement against Post confidence. Show me where the engagement paradox lives.
AI
Claude · joining Sopact Sense + Cornerstone LMS
1.4s · 48 records joined on P-ID
Joining 24 Sopact records with 24 LMS records on participant ID...

The engagement paradox lives in two participants who completed everything in the LMS but barely moved on Post confidence.

Plotting LMS module completion against Post confidence for the Spring 2026 cohort surfaces a quadrant pattern. Most participants cluster around the diagonal: high LMS engagement tracks with high Post confidence (top-right). But two outliers break the pattern in opposite directions.

LMS modules completed × Post confidence · N=24
2 outliers flagged · joined on P-ID
Quadrant chart: LMS modules completed (Cornerstone) on the x-axis vs Post confidence (Sopact) on the y-axis · Quadrants: human elements worked, the expected pattern, need re-engagement, engagement paradox · Outliers marked: Aisha K., Diego R.

Aisha K. (P-1244) completed all 12 LMS modules with a 95 average quiz score, the highest in the cohort. Her Post confidence only rose +6 points (52 to 58), bottom quartile. Pattern matches participants who treat the LMS as a checklist exercise without internalizing the skill. Diego R. (P-1243) finished only 8 of 12 modules but his Post confidence jumped +22 points, driven by 14 attended peer-pair sessions and 9 volunteered speaking events.

What this means: LMS completion is not the change driver. Two participants saturated on async content and still showed the smallest growth. Three under-engaged on LMS but grew most. The human elements of the program carry the lift.

Sources joined: Sopact Sense (24 Post records) · Cornerstone LMS (24 user records, 12 modules)
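
Under the hood, this answer is a one-line merge once both systems carry the persistent participant ID. A sketch with pandas; the exports, column names, and quadrant cut-offs (10 modules, 70 confidence) are illustrative assumptions:

```python
# Sketch: the persistent-ID join behind the answer above.
# File names, columns, and thresholds are hypothetical stand-ins.
import pandas as pd

sopact = pd.read_csv("sopact_post.csv")      # participant_id, post_confidence
lms = pd.read_csv("cornerstone_users.csv")   # participant_id, modules_completed

joined = sopact.merge(lms, on="participant_id")  # 24 + 24 rows -> 24 joined

# Quadrant labels matching the chart: high LMS + low confidence is the paradox
high_lms = joined["modules_completed"] >= 10
high_conf = joined["post_confidence"] >= 70
joined["quadrant"] = "the expected pattern"
joined.loc[high_lms & ~high_conf, "quadrant"] = "engagement paradox"
joined.loc[~high_lms & high_conf, "quadrant"] = "human elements worked"
joined.loc[~high_lms & ~high_conf, "quadrant"] = "need re-engagement"

print(joined.loc[joined["quadrant"] == "engagement paradox", "participant_id"])
```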
Rank LMS modules by their correlation with confidence lift. Which content actually moves the needle?
AI
Claude · ranking 6 LMS modules + 2 program elements
1.8s · Pearson r vs Pre-to-Post confidence Δ
Correlating module completion with confidence delta across 24 participants...

The human elements outrank every single LMS module. Mentor sessions correlate twice as strongly with confidence lift as your best async module.

I correlated each program element with the Pre-to-Post confidence delta across 24 participants. Higher r means the element more reliably predicts a participant's confidence growth. Two non-LMS elements (mentor sessions, peer pairs) are ranked alongside the 6 Cornerstone LMS modules to show the comparison.

Pearson r · program element vs confidence Δ · N=24
Spring 2026 cohort
Mentor session minutes · SOPACT · live coaching
0.78
Peer-pair sessions · SOPACT · structured practice
0.67
Module 04 · Handling pushback · LMS · 22 min video + role-play
0.61
Module 06 · Executive presence · LMS · 18 min video + reflection
0.42
Module 05 · Active listening · LMS · 14 min video + worksheet
0.34
Module 02 · Structure your message · LMS · 16 min video + worksheet
0.18
Module 01 · Voice basics · LMS · 12 min video + quiz
0.12
Module 03 · Slides that work · LMS · 20 min video + assignment
0.09

What this means: The 22-minute video on handling pushback (Module 04) is the only async content with a meaningful signal. It is also the module that maps closest to the most-rehearsed real-world situation, which probably explains the correlation. The five other modules sit at or below r=0.42.

Action: for Summer 2026, recommend keeping Module 04, replacing Modules 01 and 03 with one extended mentor session, and tracking whether the freed time materially shifts the cohort's Post confidence distribution.

Sources joined: Sopact Sense (24 confidence deltas) · Cornerstone LMS (per-module completion)
Find graduates ready to mentor. Cross-reference completion, recent LMS activity, and peer-feedback giving.
AI
Claude · joining Sopact + Cornerstone + Lattice
2.3s · 72 records joined across 3 systems
Filtering Sopact graduates with active LMS sessions and high Lattice peer-feedback giving rates...

Five Spring 2026 graduates qualify as Summer 2026 mentors based on the three-system join.

Filter criteria applied across all three systems: Sopact · completed program with Post confidence above 70. Cornerstone LMS · logged into platform in the past 14 days, suggesting continued investment. Lattice · gave at least 4 pieces of peer feedback in the past month, indicating they are comfortable being a source of feedback for others. Five of 21 graduates meet all three criteria.

Marcus Thompson · P-1247 · Engineering
Δ +34 confidence · 12/12 modules · last active 6d ago · 9 peer feedbacks this month
SOPACT 82/100 · LMS ACTIVE · LATTICE 9 GIVEN
Priya Sundaram · P-1246 · Sales
Δ +26 confidence · 12/12 modules · last active 3d ago · 7 peer feedbacks this month
SOPACT 78/100 · LMS ACTIVE · LATTICE 7 GIVEN
James Liu · P-1245 · Operations
Δ +21 confidence · 11/12 modules · last active 9d ago · 6 peer feedbacks this month
SOPACT 76/100 · LMS ACTIVE · LATTICE 6 GIVEN
Sarah Chen · P-1242 · Customer Success
Δ +22 confidence · 10/12 modules · last active 12d ago · 5 peer feedbacks this month
SOPACT 79/100 · LMS ACTIVE · LATTICE 5 GIVEN
Diego Ramirez · P-1243 · Engineering
Δ +22 confidence · 8/12 modules · last active 4d ago · 4 peer feedbacks this month
SOPACT 71/100 · LMS ACTIVE · LATTICE 4 GIVEN

Note on Diego: his SOPACT score is the lowest of the five at 71, but the lift was outsized (+22) and his Lattice giving rate suggests he learned through peer practice rather than module completion. Could be the strongest peer-style mentor for Cluster B participants in Summer 2026.

Sources joined: Sopact Sense (graduation status) · Cornerstone LMS (last 14d activity) · Lattice (peer feedback giving rate)
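
The three-system filter reduces to two merges and a boolean mask. A sketch with thresholds taken from the criteria stated in the answer above; the file and column names are hypothetical export shapes:

```python
# Sketch: the three-system mentor filter. Thresholds follow the criteria
# above; file and column names are hypothetical stand-ins.
import pandas as pd

sopact = pd.read_csv("sopact_graduates.csv")   # participant_id, completed, post_confidence
lms = pd.read_csv("cornerstone_activity.csv")  # participant_id, days_since_login
lattice = pd.read_csv("lattice_feedback.csv")  # participant_id, feedback_given_30d

df = (sopact.merge(lms, on="participant_id")
            .merge(lattice, on="participant_id"))

mentors = df[
    df["completed"]                     # Sopact: finished the program
    & (df["post_confidence"] > 70)      # Sopact: strong Post score
    & (df["days_since_login"] <= 14)    # Cornerstone: still active
    & (df["feedback_given_30d"] >= 4)   # Lattice: already gives feedback
]
print(mentors["participant_id"].tolist())
```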
The honest assessment

Why most training effectiveness measurement is reaction theater

Three structural reasons. First, most programs measure only Level 1 Reaction via end-of-session smile sheets. Second, Pre and Post measurements live in different systems with no persistent ID, so calculating Level 2 deltas requires manual joining that rarely happens. Third, Level 3 Behavior usually requires a 360 survey 3 to 6 months post-training, which most programs skip because it is expensive and slow.

The smile sheet score is not measuring what most program managers think it is measuring. It is measuring how participants felt about the room temperature, whether the instructor was friendly, and how relieved they were that the session ended on time. Cohorts on the verge of failing routinely rate the program 4.6 out of 5 because they want the program to succeed and they want to be polite. That number then enters the next year's budget cycle as evidence the program is working.

The data architecture problem is more subtle and more expensive. A typical enterprise training program collects Pre assessment in SurveyMonkey, training delivery in Cornerstone, post-session feedback in the LMS again, peer feedback in Lattice, and operational outcomes in Workday. Five systems, five different ways of identifying the same participant. To answer "did Marcus Thompson improve" requires manually joining records across five tools. To answer "did the cohort improve" requires doing this 24 times. Most program managers never get past the third join.

The Level 3 problem is the most damaging. The 360 survey 3 to 6 months after training is the gold standard for measuring behavior change, and most programs skip it for a defensible reason: by the time the survey would run, the cohort has moved on, the participants' roles have changed, and the manager who would have to coordinate the survey has different priorities. The Level 3 evidence does not get collected, and without it, the case for Level 4 collapses too.

Three fixes, in order of impact

FIX 1 · replaces smile-sheet inflation

Continuous sentiment from open-ended responses

One open-ended question at Pre, Mid, and Post. AI extracts sentiment polarity, theme cluster, and risk flag from each response. The smile sheet score becomes a footnote; the L1 signal lives in the risk flag count over time. A participant flagged at Pre week 1 gets re-engaged in week 2, not noted as a failure at Post week 12. Time to first risk-flag intervention: 4 days, down from never.

FIX 2 · replaces five-tool data joining

Persistent participant IDs across every system

One ID per participant, generated at Pre and propagated through every subsequent collection. Mid interview, Post survey, peer ratings, audio reflections, and LMS events all land on the same row automatically. The "did Marcus Thompson improve" question becomes a single query, not a manual five-tool join. Time to calculate cohort-wide L2 delta: 4 minutes, down from 4 days.
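
With one row per participant, the "did Marcus Thompson improve" query really is a single lookup. A sketch against an illustrative longitudinal export; the CSV shape and column names are assumptions:

```python
# Sketch: one persistent row per participant makes the delta a lookup,
# not a five-tool join. The CSV shape is illustrative.
import pandas as pd

rows = pd.read_csv("cohort_longitudinal.csv")  # participant_id, pre_confidence, post_confidence, ...
marcus = rows.set_index("participant_id").loc["P-1247"]
print(marcus["post_confidence"] - marcus["pre_confidence"])  # +34 for P-1247
```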

FIX 3 · replaces the never-run 360 survey

Peer rating at Post plus in-program application count

Six cohort members rate each other at Post on the target competency. Each participant logs real-world application events (speaking, presenting, customer-facing moments) during the program. Both signals captured by week 12, not month 6. The 360 survey 3 to 6 months later becomes optional, not required. Time to L3 evidence: 12 weeks, down from 24+ weeks. Programs that adopt this pattern move TES from the 30 to 50 range to the 60 to 75 range.

Frequently asked

Training effectiveness questions, answered

What is training effectiveness?

Training effectiveness is the degree to which a training program produces the intended business or operational outcome, not the degree to which participants enjoyed it. The distinction matters because most programs measure enjoyment and call it effectiveness. Defensible training effectiveness requires evidence across all four Kirkpatrick levels: how participants felt (Reaction), what they learned (Learning), whether they apply it on the job (Behavior), and whether the organization benefited (Results).

How do you measure training effectiveness?

The defensible method captures all four Kirkpatrick levels on one persistent record per participant. Level 1 from continuous sentiment on open-ended responses at Pre, Mid, and Post. Level 2 from same-question Pre to Post score deltas plus a six-axis skills radar. Level 3 from peer-rated effectiveness and real-world application count. Level 4 from benchmark comparison and multivariate regression. The Training Effectiveness Score blends these four into one number with explicit weights.

What is a Training Effectiveness Score (TES)?

The Training Effectiveness Score is a single 0 to 100 number blending all four Kirkpatrick levels with explicit weights: Level 1 Reaction at 10 percent, Level 2 Learning at 25 percent, Level 3 Behavior at 35 percent, Level 4 Results at 30 percent. The weights reward the harder-to-measure levels because programs that only measure L1 and L2 routinely overstate effectiveness. A defensible TES requires evidence at all four levels, not L1 and L2 alone.

What is a good training effectiveness benchmark?

Based on cross-cohort analysis, TES distributes roughly as follows. TES 0 to 30 means the program is unevaluated beyond a smile sheet. TES 30 to 50 is the corporate L&D average and is what most programs deliver. TES 50 to 70 is a measured program with L2 evidence and partial L3. TES 70 or higher is a defensible program with evidence at all four levels. The Spring 2026 Communication Skills cohort scored 73 and outperformed the Toastmasters P75 benchmark by 6 points.

What is the difference between training evaluation and training effectiveness?

Training evaluation is the process of measuring a program. Training effectiveness is the result the program produced. Evaluation is the method; effectiveness is the verdict. A program can be heavily evaluated and still ineffective if the evaluation only measures Levels 1 and 2. A program is effective when the evaluation captures all four Kirkpatrick levels and the results clear the target threshold the business sponsor agreed to at Level 4.

How long should training effectiveness measurement take?

Measurement runs the length of the program plus 2 to 4 weeks of analysis. A 12-week cohort has Pre at week 1, Mid interview at week 6, Post and peer 360 at week 12, then 2 weeks for the four reports and multivariate analysis. This is faster than traditional Level 3 measurement (3 to 6 months post-training for the 360 survey) because peer ratings and application count are captured during the program, not after.

Why do most training effectiveness scores miss the truth?

Three structural reasons. First, most programs measure only Level 1 Reaction via end-of-session smile sheets, which capture how people felt about the training, not whether the training worked. Second, Pre and Post measurements live in different systems with no persistent ID, so delta calculation requires manual joining and rarely happens. Third, Level 3 Behavior usually requires a 360 survey 3 to 6 months post-training, which most programs skip because it is expensive.

Can AI measure training effectiveness?

Not on its own, but it changes what is measurable cheaply. AI extracts Level 1 sentiment from open-ended responses at the moment of collection, surfacing engagement risk in real time rather than at Post. AI parses Mid-cycle interview transcripts for Level 2 evidence of concept mastery. Multivariate regression with standardized beta coefficients ranks program drivers for defensible Level 4 attribution. The measurement framework is still Kirkpatrick. AI handles the data work that previously made it uneconomical.

What are the most important training effectiveness metrics?

Six metrics predict whether a training program will clear the TES 70 threshold. Sentiment risk flags cleared from Pre to Post (L1). Cohort distribution shift from Low confidence to High at Post (L2). Peer-rated effectiveness lift in points (L3). Real-world application events logged per participant (L3). Top driver standardized beta coefficient from multivariate analysis (L4). Benchmark beat count out of three reference programs (L4).

How do you improve training effectiveness in 12 weeks?

Three changes deliver most of the lift. Switch the L1 smile sheet for continuous sentiment on open-ended responses so risk flags clear during the program. Add a Mid-cycle structured interview at week 6 that captures L2 evidence and L3 early-application signal in one conversation. Run a peer 360 at Post from 6 cohort members instead of a generic supervisor survey 3 months later. These three changes typically move TES from the 30 to 50 range to the 60 to 75 range.

Go deeper

The full training effectiveness playbook

From the TES formula to the six predictive metrics to the report templates that satisfy a CFO. Frameworks, sample instruments, and the data architecture that makes all four Kirkpatrick levels measurable from one record.

Read the stakeholder intelligence guide →
Get started

Score your last training cohort on the TES

Walk one of your past cohorts through the four sub-scores. See which Kirkpatrick level is dragging your TES and where the next 10 points of lift sit. 30 minutes with a Sopact specialist.