Frequently asked questions.
Common questions about OKR quality, the anti-patterns the rubric checks for, and how OKR Orca works.
What does the score actually mean?
The score is a 0-100 normalisation of a 7-criterion rubric. Each criterion scores 0, 1, or 2. KR-level criteria apply per Key Result, so a set with three KRs has more points in play than a set with one. The raw points are then normalised against the maximum available for that set, so sets of different sizes land on the same 0-100 scale.
Score ranges map to five action-oriented tiers: 0-20 Rewrite, 21-40 Reframe, 41-60 Refine, 61-80 Solid, 81-100 Ship. See the methodology for how each criterion is scored.
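As a rough illustration of the arithmetic, here is a minimal Python sketch of one plausible reading of the normalisation and tier lookup. It is not OKR Orca's actual implementation. The function names are hypothetical, and the split of the seven criteria into 3 Objective-level and 4 KR-level criteria used in the usage note is an assumption made for the example; the methodology is the authority on the real split and on rounding.

```python
# Tier bands as stated in the FAQ: (inclusive upper bound, tier name).
TIERS = [(20, "Rewrite"), (40, "Reframe"), (60, "Refine"), (80, "Solid"), (100, "Ship")]


def max_points(objective_criteria: int, kr_criteria: int, kr_count: int) -> int:
    """Maximum raw points for a set.

    Each criterion is worth up to 2 points. KR-level criteria count once
    per Key Result, so more KRs means more points in play.
    The criteria split passed in is an assumption, not taken from the source.
    """
    return 2 * (objective_criteria + kr_criteria * kr_count)


def normalise(raw_points: int, maximum: int) -> int:
    """Scale raw points to 0-100 (assumed: simple ratio, rounded to an integer)."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    if not 0 <= raw_points <= maximum:
        raise ValueError("raw_points must be between 0 and maximum")
    return round(100 * raw_points / maximum)


def tier(score: int) -> str:
    """Map a 0-100 score to its action tier."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for upper, name in TIERS:
        if score <= upper:
            return name
    raise AssertionError("unreachable: TIERS covers 0-100")
```

Under the assumed 3 + 4 split, a set with three KRs has max_points(3, 4, 3) = 30 points in play against 14 for a single-KR set. A raw score of 19 out of 30 normalises to 63, which falls in the Solid tier.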
What if the LLM gives a bad rewrite?
Rewrites are starting points, not final answers. The rubric is the source of truth: if a suggested rewrite still contains an output verb or lacks a baseline, it is still broken regardless of how polished it sounds. Re-run the diagnosis on any rewrite by copying it into the input field. The rule engine scores it locally in under a second with no key required.
LLMs occasionally produce fluent but structurally weak OKRs, so apply the rubric to the rewrite. A rewrite that scores below 61, the bottom of the Solid tier, needs another pass.
How much does Coach mode cost per session?
Approximately $0.05-$0.15 for a 6-10 turn session using GPT-4o. Claude Sonnet 4.6 costs slightly more, roughly $0.08-$0.20 for the same length. The rule-engine pre-score in Diagnose mode is free regardless of key. Only the LLM analysis step consumes API credit, at $0.005-$0.015 per analysis.
Coach mode cost is not fixed: it sends your full conversation history on each turn, so each additional turn costs more than the one before it. A 15-turn session costs noticeably more than one that reaches a draft in 6.
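To see why session length matters so much, here is a small Python sketch of how input tokens accumulate when the whole history is resent each turn. It assumes, purely for illustration, that every turn adds the same number of tokens and it ignores any system prompt and output tokens. The function name and the 300-token figure are hypothetical, and no prices are modelled.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens sent over a session that resends full history.

    On turn k the request carries k * tokens_per_turn tokens, so the total
    grows roughly with the square of the number of turns.
    Equal-sized turns are a simplifying assumption.
    """
    if turns < 0 or tokens_per_turn < 0:
        raise ValueError("turns and tokens_per_turn must be non-negative")
    return sum(turn * tokens_per_turn for turn in range(1, turns + 1))
```

With a hypothetical 300 tokens per turn, a 6-turn session sends 6,300 input tokens in total and a 15-turn session sends 36,000. That is 2.5 times the turns but nearly six times the input.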
What makes an OKR bad?
The most reliable signal is Key Results that describe work instead of results. "Launch the onboarding redesign," "Migrate to the new platform," "Complete user research": those are tasks. They belong in a sprint backlog. A KR should describe what changes for a real person after the work is done, not the work itself.
The second most common failure is vagueness that passes for ambition. "Improve customer satisfaction" sounds like a goal. It is not. Without a baseline, a target, and a data source, it is a field name. A rubric check catches both problems inside 60 seconds.
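The two checks described above can be approximated with a few lines of Python. This is a simplified sketch, not the rule engine OKR Orca runs: the verb list contains only the three verbs named in the examples above, the "from X to Y" pattern is one assumed way to spot a baseline and target, and the data source is not checked at all. All names are hypothetical.

```python
import re

# Illustrative only: the three output verbs used in the examples above.
OUTPUT_VERBS = {"launch", "migrate", "complete"}

# Assumed convention: a baseline and target appear as "from <number> to <number>".
BASELINE_TO_TARGET = re.compile(r"\bfrom\s+\S*\d\S*\s+to\s+\S*\d\S*", re.IGNORECASE)


def check_key_result(text: str) -> list[str]:
    """Return a list of issue labels for a single Key Result."""
    issues = []
    words = text.strip().split()
    first_word = words[0].lower().strip(".,:;\"'") if words else ""
    if first_word in OUTPUT_VERBS:
        # The KR opens with a task verb, so it describes work, not a result.
        issues.append("output-as-kr")
    if not BASELINE_TO_TARGET.search(text):
        # No "from X to Y" pair, so there is nothing to measure against.
        issues.append("no-baseline-or-target")
    return issues
```

Run against the examples in this FAQ, "Launch the onboarding redesign" is flagged on both checks, "Improve customer satisfaction" is flagged for the missing baseline and target, and "blog readers who start a free trial, from 1.4% to 3.2%" passes both.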
Can an Objective be measurable?
Yes, and that is a sign of a well-written one. The OKR framework reserves numbers for Key Results, but there is no rule against an Objective that includes a specific, observable condition. "Cut median PR cycle time so teams can ship daily by end of Q3" is directional and measurable.
The distinction that matters between an Objective and its KRs is level of abstraction. The Objective states the desired state. The KRs prove it was reached.
What is wrong with "launch X" as a Key Result?
"Launch X" is Output-as-KR. It describes an action your team takes, not a change that happens in the world because you took it. The test: if you complete the KR and nothing changes for any real user, it is not a KR.
Write the outcome instead. If you are launching a self-service billing portal, the outcome might be "customers who change their own plan without contacting support, from 12% to 45%." Now you can tell at week 8 whether the launch worked.
One question: what will be different for users after this ships? Write that.
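An outcome KR with a baseline and a target also makes mid-cycle progress a simple calculation. The Python sketch below shows one common way to express it, as the fraction of the baseline-to-target distance covered so far. The function name is hypothetical and the week-8 reading of 30% in the usage note is invented for illustration.

```python
def kr_progress(baseline: float, target: float, current: float) -> float:
    """Fraction of the distance from baseline to target covered so far.

    0.0 means no movement, 1.0 means the target was reached. The same
    formula works for metrics that are meant to go down.
    """
    if target == baseline:
        raise ValueError("target must differ from baseline")
    return (current - baseline) / (target - baseline)
```

For the billing portal KR above (12% to 45%), a hypothetical week-8 reading of 30% gives kr_progress(12, 45, 30), which is 18/33 or roughly 0.55. "Launch X" has no equivalent calculation, which is the practical reason it fails as a KR.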
How do you avoid vanity metrics?
Name the actor and the specific action. "Increase engagement by 25%" fails because engagement is undefined. Engagement of what, by whom, on which surface, compared to when?
Replacement test: can you imagine a plausible scenario where this metric goes up and the business gets worse? If yes, it is a vanity metric. Pageviews go up when you publish low-quality content. Email open rates go up when you send panic-inducing subject lines. Swap for the behaviour change it was supposed to proxy: "blog readers who start a free trial, from 1.4% to 3.2%."
Why does ambition matter for an OKR?
OKRs that are guaranteed to succeed are planning theatre. If the target is set at a level the team would reach anyway, the OKR is not driving anything.
Ambition forces a conversation about what would have to be true. "Triple the activation rate" requires rethinking the onboarding funnel. "Grow activation by 5%" permits incremental tweaking. The calibration question: if we hit 70% of this, would we be satisfied? If not, the target is probably too low, because it is acting as a floor the team must reach rather than a stretch.
Can a team have too many OKRs?
Yes, and most teams do. Three to five Key Results per Objective is the practical ceiling. Beyond that, the set stops being a prioritisation mechanism and becomes a commitment catalogue. If everything is a KR, nothing is.
The useful question: if this KR turns red, does the team drop everything to investigate? If the honest answer is "we would notice but carry on," the KR is not important enough to be in the set.
How often should we score OKRs?
Weekly check-ins on KR progress, formal scoring at mid-point and end of cycle. The weekly check is a signal check, not a scoring exercise. The mid-point score decides whether to adjust scope, targets, or approach.
The failure mode: teams that skip the mid-point review and discover at end of quarter that two KRs were never measurable because instrumentation was never built. A mid-cycle check forces that conversation at week 6 rather than week 13.
Is "score 7 out of 10" a good Key Result?
No. "Score 7 out of 10" is almost always a Vanity Metric or Placeholder in disguise. First question: 7 out of 10 on what? If it is an NPS survey or CSAT, that might be a real metric. If it is a manager's subjective rating of a process, it is not a metric at all.
Point-scale scores on non-standardised instruments are gameable and not actionable. If you score 6.8 instead of 7, what do you do differently? The test: could two team members independently verify this score using the same data?
Paste your OKR, get a rubric score in 60 seconds.