interview-design
Use this skill when designing structured interviews, creating rubrics, building coding challenges, or assessing culture fit. Triggers on interview design, rubrics, scoring criteria, coding challenges, behavioral interviews, system design interviews, culture fit assessment, and any task requiring interview process design or evaluation criteria.
interview-design is a production-ready AI agent skill for claude-code, gemini-cli, and openai-codex. It supports designing structured interviews, creating rubrics, building coding challenges, and assessing culture fit.
Quick Facts
| Field | Value |
|---|---|
| Category | operations |
| Version | 0.1.0 |
| Platforms | claude-code, gemini-cli, openai-codex |
| License | MIT |
How to Install
- Make sure you have Node.js installed on your machine.
- Run the following command in your terminal:
npx skills add AbsolutelySkilled/AbsolutelySkilled --skill interview-design
- The interview-design skill is now available in your AI coding agent (Claude Code, Gemini CLI, OpenAI Codex, etc.).
Overview
Structured interview design is the discipline of building hiring processes that produce consistent, defensible, and predictive hiring decisions. The core insight is that unstructured conversations are notoriously unreliable predictors of job performance - structured processes with explicit rubrics dramatically improve both accuracy and fairness. This skill covers the full lifecycle: scoping the interview loop, writing rubrics, building coding challenges, calibrating interviewers, and running debriefs that lead to confident decisions.
Tags
interviews rubrics coding-challenges hiring assessment
Platforms
- claude-code
- gemini-cli
- openai-codex
Frequently Asked Questions
What is interview-design?
Use this skill when designing structured interviews, creating rubrics, building coding challenges, or assessing culture fit. Triggers on interview design, rubrics, scoring criteria, coding challenges, behavioral interviews, system design interviews, culture fit assessment, and any task requiring interview process design or evaluation criteria.
How do I install interview-design?
Run npx skills add AbsolutelySkilled/AbsolutelySkilled --skill interview-design in your terminal. The skill will be immediately available in your AI coding agent.
What AI agents support interview-design?
This skill works with claude-code, gemini-cli, openai-codex. Install it once and use it across any supported AI coding agent.
Maintainers
Generated from AbsolutelySkilled
SKILL.md
Interview Design
When to use this skill
Trigger this skill when the user:
- Needs to design an interview loop or process for a role
- Wants to create scoring rubrics or evaluation criteria
- Asks how to build a coding challenge or take-home assignment
- Needs help writing behavioral interview questions
- Wants to design a system design interview round
- Is trying to assess culture fit in a structured, defensible way
- Needs to run calibration sessions with a panel
- Asks how to run an effective debrief meeting
Do NOT trigger this skill for:
- Preparing as a candidate to pass interviews (different audience, different goal)
- Compensation benchmarking or offer negotiation (use a compensation skill instead)
Key principles
Structured beats unstructured - Consistent questions asked in the same order with pre-defined scoring criteria outperform free-form conversations every time. Interviewers who "go with their gut" introduce bias, not signal.
Score independently before debrief - Every interviewer must submit a written score and evidence summary before the panel debrief. Verbal-only debrief allows the first strong opinion to anchor everyone else. Written scores first.
Test for the actual job - Every interview exercise should map to a real task the candidate will perform in the role. If a backend engineer will never sort arrays on the job, don't test array sorting in isolation. Use job-relevant problems.
Rubrics prevent drift - Without a rubric, two interviewers evaluating the same candidate will produce wildly different scores. A rubric aligns the panel on what "strong" and "weak" look like before the first candidate walks in.
Debrief is where decisions happen - The debrief meeting is not a vote-counting exercise. It is a structured discussion to surface new evidence, resolve disagreements, and reach a confident collective judgment. The hiring manager owns the final call.
Core concepts
Interview types map to different evaluation needs. Coding interviews assess problem-solving and technical mechanics. System design interviews assess architectural thinking at scale. Behavioral interviews (using STAR) assess past behavior as a proxy for future behavior. Values/culture interviews assess alignment with how the team operates. Take-homes assess real-world execution and follow-through. Most loops include 3-5 rounds covering different dimensions so no single round carries all the weight.
Rubric design is the practice of defining expected performance at multiple levels (typically 1-4 or Strong No / No / Yes / Strong Yes) before interviews begin. A good rubric specifies concrete behaviors, not adjectives. "Breaks problem into subproblems, names variables clearly, asks clarifying questions before coding" is a rubric. "Good technical skills" is not. See references/rubric-templates.md for ready-to-use rubric templates.
Signal vs noise distinguishes real predictors of job performance from irrelevant factors. Signal: how a candidate structures ambiguity, responds to hints, explains trade-offs. Noise: how polished their communication style is, whether they went to a brand-name school, how quickly they reached the solution. Train interviewers to write down evidence (what the candidate said/did) rather than impressions ("seemed smart").
Calibration is the practice of running mock interviews with known candidates (or invented personas) so interviewers practice applying the rubric consistently before live interviews begin. A calibration session where two interviewers score the same response and then compare notes surfaces misalignment early.
Common tasks
Design a structured interview loop
Start by mapping the role's core competencies - typically 4-6 dimensions that predict success. Common dimensions for engineering roles:
| Dimension | Who covers it |
|---|---|
| Technical fundamentals | Coding round 1 |
| System design / architecture | System design round |
| Problem-solving approach | Coding round 2 |
| Collaboration / communication | Bar raiser or cross-functional |
| Values and culture | Hiring manager or peer |
| Past impact and trajectory | Behavioral / resume deep-dive |
Rules for a well-designed loop:
- Every dimension is covered by exactly one round (no redundancy)
- No interviewer covers more than one dimension (keeps each fresh)
- The loop can be completed in one business day on-site or two days virtual
- Assign a "bar raiser" - someone outside the immediate team with veto power
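The "every dimension covered by exactly one round" rule is easy to check mechanically. A minimal sketch, assuming an illustrative dimension-to-round map (the names below are examples, not prescriptions):

```python
from collections import Counter

# Illustrative coverage map: dimension -> interview round
coverage = {
    "Technical fundamentals": "Coding round 1",
    "System design / architecture": "System design round",
    "Problem-solving approach": "Coding round 2",
    "Collaboration / communication": "Bar raiser round",
    "Values and culture": "Hiring manager round",
    "Past impact and trajectory": "Behavioral deep-dive",
}

def overloaded_rounds(coverage):
    """Return rounds assigned more than one dimension (redundant coverage)."""
    counts = Counter(coverage.values())
    return sorted(r for r, n in counts.items() if n > 1)

assert overloaded_rounds(coverage) == []  # each dimension maps to exactly one round
```

Run a check like this (or do the same by hand on a whiteboard) before the loop is scheduled, not after the first candidate exposes the overlap.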
Create scoring rubrics - template
Use a 4-level rubric for each dimension. The key is defining the middle levels precisely - candidates cluster there, and those are the hard decisions.
Dimension: [Name, e.g., "Problem Decomposition"]
Weight: [High / Medium / Low]
4 - Strong Yes
Candidate independently breaks problem into clean subproblems. Names
intermediate data structures without prompting. Explains trade-offs of
multiple approaches before choosing. Handles edge cases proactively.
3 - Yes
Candidate breaks problem into subproblems with minor prompting. Solves
the core problem correctly. Handles most edge cases when prompted.
Explains the primary trade-off.
2 - No
Candidate solves simple version but struggles to generalize. Requires
significant prompting to identify subproblems. Misses important edge
cases. Does not discuss trade-offs unless directly asked.
1 - Strong No
Candidate cannot decompose the problem independently. Solution is
incorrect or incomplete. Does not respond to hints. Cannot explain
what their own code does.
See references/rubric-templates.md for complete rubrics for coding,
system design, behavioral, and culture fit rounds.
Build a take-home coding challenge
Take-homes reveal real-world execution that 45-minute whiteboard problems cannot. Design one that:
- Scopes to 2-3 hours max - Respect candidate time. If it takes a senior engineer 2 hours, calibrate down. State the expected time in the instructions.
- Uses a realistic problem - "Build a rate limiter for our API" beats "implement a binary search tree." Domain-adjacent problems reveal how candidates think about the actual work.
- Provides a starter repo - Give candidates a repo with the scaffolding, CI, and test runner already wired. Evaluating candidates on setup skills is noise.
- Defines evaluation criteria upfront - Include an EVALUATION.md in the repo that lists exactly what reviewers will look for: correctness, test coverage, code clarity, README quality.
- Has a follow-up interview - Schedule a 30-minute code walkthrough. This prevents candidates from submitting work that isn't their own and surfaces how they think about their own decisions.
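For illustration, a minimal EVALUATION.md along these lines might look like the following (the criteria and wording are assumptions to adapt to your rubric):

```markdown
# EVALUATION.md

Reviewers score each area using the shared rubric:

- **Correctness** - solves the stated problem; edge cases handled
- **Tests** - meaningful coverage, not just happy-path assertions
- **Code clarity** - readable without a walkthrough
- **README** - explains design decisions and known trade-offs

We expect this to take 2-3 hours. Stop there and note what you would do next.
```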
Evaluation checklist for reviewers:
- Does the solution solve the stated problem?
- Are edge cases handled?
- Is the code readable without explanation?
- Are there tests, and are they meaningful?
- Does the README explain design decisions?
- Are there obvious improvements the candidate noted themselves?
Design behavioral interview questions - STAR format
Behavioral questions follow the pattern: "Tell me about a time when..." The STAR framework (Situation, Task, Action, Result) gives candidates a structure and gives interviewers a rubric for what a complete answer looks like.
Writing strong behavioral questions:
- Anchor to a specific competency (e.g., "conflict resolution" or "driving alignment without authority")
- Phrase as past behavior, not hypothetical: "Tell me about a time you disagreed with your manager" not "What would you do if..."
- Prepare follow-up probes in advance
| Competency | Primary question | Follow-up probe |
|---|---|---|
| Handling ambiguity | Tell me about a project where the requirements were unclear. How did you proceed? | What would you do differently? |
| Driving impact | Tell me about the highest-impact project you've worked on. What made it high-impact? | How did you measure that impact? |
| Conflict resolution | Tell me about a time you had a serious technical disagreement with a peer. | How was it resolved? |
| Prioritization | Tell me about a time you had more work than you could finish. | What did you drop, and how did you decide? |
| Ownership | Tell me about something that went wrong on a project you led. | What did you change afterward? |
Scoring STAR responses:
- Situation/Task - Is the context clear and relevant to the role?
- Action - Did the candidate describe their specific actions (not "we")?
- Result - Is there a concrete, quantified outcome?
- Learning - Does the candidate show reflection and growth?
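One way to keep STAR scoring evidence-based is to record which components the interviewer actually observed and derive the level from that, rather than scoring an overall impression. A sketch under that assumption (the mapping to the 1-4 scale is illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class StarObservation:
    """What the interviewer actually observed (evidence, not impressions)."""
    situation_clear: bool   # context is clear and relevant to the role
    action_personal: bool   # "I did X", not "we did X"
    result_concrete: bool   # quantified or concretely described outcome
    learning_shown: bool    # reflection and growth

def star_completeness(obs: StarObservation) -> int:
    """Illustrative mapping from observed components to the 1-4 scale."""
    hits = sum([obs.situation_clear, obs.action_personal,
                obs.result_concrete, obs.learning_shown])
    return max(1, hits)  # all four components -> 4, none -> 1

assert star_completeness(StarObservation(True, True, True, True)) == 4
```

The point of the structure is that a disputed score can be traced back to a specific missing component instead of "I just didn't buy it."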
Design system design interviews
System design interviews assess whether a candidate can architect solutions for real-world scale and ambiguity. The structure matters as much as the content.
Interview structure (45-60 minutes):
- Requirements clarification (5-10 min) - Candidate should ask scoping questions: scale, read/write ratio, latency requirements, consistency model. Award signal for good questions, not just correct answers.
- High-level design (10-15 min) - Candidate draws the major components and data flows. Watch for separation of concerns and component boundaries.
- Deep dive (15-20 min) - Interviewer picks one or two components to explore in depth: database schema, caching strategy, failure modes.
- Trade-offs and bottlenecks (5-10 min) - Candidate explains what they would improve with more time, where the system might break, and why they made specific choices.
Rubric signals to watch:
- Does the candidate ask clarifying questions before whiteboarding?
- Can they estimate load and justify their component choices with numbers?
- Do they proactively identify single points of failure?
- Can they explain the trade-off between consistency and availability?
- Do they adjust the design when the interviewer changes a constraint?
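"Justify choices with numbers" means back-of-envelope math of roughly this shape. Every number below is an illustrative assumption, not a real requirement; the signal is whether the candidate can produce and defend estimates like these unprompted:

```python
# Back-of-envelope capacity estimate (all inputs are assumed example values)
dau = 10_000_000          # daily active users
reqs_per_user = 20        # requests per user per day
seconds_per_day = 86_400

avg_qps = dau * reqs_per_user / seconds_per_day
peak_qps = avg_qps * 3                      # assume a 3x peak-to-average factor
per_server_qps = 500                        # assumed capacity of one app server
servers = -(-peak_qps // per_server_qps)    # ceiling division

print(f"~{avg_qps:.0f} QPS average, ~{peak_qps:.0f} peak -> {servers:.0f} app servers")
```

A strong candidate also sanity-checks the output (does ~2,300 average QPS sound right for 10M DAU?) before building on it.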
Run calibration sessions
Calibration prevents rubric drift before it happens. Run one calibration session per new interviewer and one per quarter for existing panelists.
Calibration session format (60 minutes):
- Distribute a transcript or video of a mock interview (use a fabricated candidate, never a real one without consent)
- Each interviewer scores independently using the rubric - no discussion yet
- Reveal all scores simultaneously (prevents anchoring)
- Discuss every dimension where scores diverge by 2+ points
- Reach consensus on the "correct" score and the reasoning
- Document the calibrated examples as reference cases for future interviewers
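The "diverge by 2+ points" step is mechanical once scores are collected; a minimal sketch with invented example scores:

```python
def divergent_dimensions(scores: dict[str, list[int]], threshold: int = 2):
    """Return dimensions whose score spread meets the discussion threshold."""
    return [dim for dim, vals in scores.items()
            if max(vals) - min(vals) >= threshold]

# Illustrative calibration round: three interviewers, four dimensions
scores = {
    "Problem Decomposition": [4, 3, 4],
    "Code Quality": [2, 4, 3],        # spread of 2 -> needs discussion
    "Complexity Analysis": [3, 3, 3],
    "Communication": [4, 4, 3],
}
assert divergent_dimensions(scores) == ["Code Quality"]
```

Discuss only the flagged dimensions; rehashing dimensions where the panel already agrees wastes the hour.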
Red flags indicating calibration is needed:
- Two interviewers gave the same candidate a 4 and a 1 on the same dimension
- An interviewer cannot cite specific evidence for their score
- Scores correlate with candidate demographics, not candidate performance
- The team has not hired anyone in 6 months despite many interviews
Conduct effective debriefs
The debrief is the most consequential 30-60 minutes in the hiring process. Run it badly and you amplify bias. Run it well and you surface the truth.
Before debrief:
- All interviewers submit written scorecards independently (hiring manager cannot see scores until all are submitted)
- Schedule the debrief within 48 hours of the last interview, while evidence is fresh
Debrief agenda:
- Hiring manager reads all scorecards silently (5 min)
- Each interviewer speaks to their dimension only - what evidence they saw, what level they scored, why (2 min per interviewer, no interruption)
- Open discussion on dimensions with significant disagreement
- Hiring manager asks: "Is there anything about this candidate we have not yet discussed that is relevant?"
- Hiring manager states the decision and the primary evidence that drove it
Decision framework:
- Any Strong No from a bar raiser or domain expert is a block unless directly rebutted with evidence (not "they seemed nervous")
- "Probably yes" is a No - only hire on conviction
- Document the stated rationale in the ATS for every decision, hire or no-hire
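One possible encoding of these rules, to make the veto and conviction logic concrete. The role names and thresholds are illustrative, and in practice a Strong No block can still be rebutted with evidence in the room:

```python
def debrief_decision(scorecards):
    """scorecards: list of (role, score), scores 1=Strong No .. 4=Strong Yes."""
    for role, score in scorecards:
        if score == 1 and role in ("bar_raiser", "domain_expert"):
            return "no-hire"        # Strong No from bar raiser/expert blocks
    scores = [s for _, s in scorecards]
    if min(scores) >= 3 and 4 in scores:
        return "hire"               # hire only on conviction
    return "no-hire"                # "probably yes" is a No

assert debrief_decision([("peer", 4), ("bar_raiser", 3), ("manager", 4)]) == "hire"
assert debrief_decision([("peer", 4), ("bar_raiser", 1)]) == "no-hire"
```

A script like this is a sanity check on the written rules, not a replacement for the hiring manager's judgment in the debrief.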
Anti-patterns
| Anti-pattern | Why it fails | What to do instead |
|---|---|---|
| Gut-feel interviews | Interviewers cannot separate "I like them" from "they can do the job." Correlates with affinity bias, not job performance | Use structured questions and rubrics; require evidence-based scorecards |
| Brainteaser questions | "How many golf balls fit in a school bus?" measures nothing relevant to engineering work. Banned at most major tech companies | Use problems derived from real work the candidate will actually do |
| Group debrief without written scores | First speaker anchors the group. Quieter interviewers defer. The decision reflects seniority, not evidence | Require independent written scorecards before any verbal discussion |
| Hiring bar creep | Interviewers gradually raise standards over months until no one is hireable, stalling team growth | Tie rubric levels to job requirements, not to the best candidate ever interviewed |
| Same-style duplication | Two rounds both test the same coding dimension because neither interviewer was briefed on coverage | Map each dimension to exactly one round before the loop starts |
| Culture fit as veto | "Not a culture fit" used as a catch-all rejection with no supporting evidence - often a proxy for bias | Define culture/values criteria explicitly in the rubric; require behavioral evidence |
Gotchas
Rubrics calibrated only on strong candidates produce grade inflation - If interviewers see 10 qualified candidates in a row, their mental model of "a 3" drifts upward over time. Run calibration sessions quarterly using the same reference examples to anchor scores. Grade inflation causes you to reject good candidates because they score "only" a 3 when the bar has silently moved to 3.5.
Take-home exercises without a time cap select for availability, not skill - Without an explicit time cap, candidates who are between jobs or have no outside commitments will spend 12 hours on a "2-3 hour" challenge, and their submissions will look objectively better. State "we expect this to take 2-3 hours" in the instructions and calibrate your evaluation rubric to what a strong engineer can produce in that time.
Allowing the hiring manager to see scores before all interviewers submit creates anchor bias - If one interviewer submits a Strong Yes early and the hiring manager shares it in Slack, every subsequent scorer is subconsciously anchored. Enforce blind independent scoring: no interviewer sees anyone else's score or written feedback until all scorecards are submitted.
Interview loops covering the same dimension twice produce inflated confidence - Two coding rounds both asking algorithm questions don't double the signal - they create redundant data while leaving system design, communication, or values completely unevaluated. Map each round to a unique dimension before the first candidate interviews, and stick to the map.
Debrief dominated by the most senior person in the room is not a debrief - If an engineering director speaks first with a strong opinion, junior interviewers defer and the decision reflects the director's prior, not the collective evidence. The facilitator must ask every interviewer to present their evidence before any discussion begins, starting with the most junior voice.
References
For detailed content on specific topics, read the relevant file from references/:
- references/rubric-templates.md - Ready-to-use scoring rubrics for coding, system design, behavioral, and culture fit rounds
Only load a references file if the current task requires deep detail on that topic.
References
rubric-templates.md
Rubric Templates
Rubrics work only when every interviewer applies them before the interview, not after. Print or share these with the panel before the first candidate. Calibrate on a practice response together to align on what each level looks like.
How to use these templates
- Copy the relevant template for each interview round
- Fill in the role-specific context under each dimension
- Run a calibration session - interviewers score the same practice response independently, then compare and discuss
- Treat any dimension with >1 point spread in calibration as "needs discussion"
Score key used throughout:
| Score | Label | Meaning |
|---|---|---|
| 4 | Strong Yes | Clear signal; would raise the bar on the team |
| 3 | Yes | Solid hire; meets the bar for this level |
| 2 | No | Does not yet meet the bar; significant gaps |
| 1 | Strong No | Critical gaps; declined regardless of other scores |
Coding Interview Rubric
Round purpose: Assess technical problem-solving, code quality, and communication under time pressure.
Dimension 1: Problem Decomposition
| Score | Behavioral indicators |
|---|---|
| 4 | Independently breaks problem into clean subproblems before writing any code. Names data structures without prompting. Explains why they chose a particular decomposition. Proactively identifies edge cases (empty input, overflow, duplicate entries) and handles them before the interviewer asks. |
| 3 | Breaks problem into subproblems with light prompting ("what would you tackle first?"). Solves the core case correctly. Handles most edge cases when prompted. Can explain their decomposition after the fact. |
| 2 | Jumps straight into coding without a plan. Solution handles the happy path but misses edge cases. Requires significant hints to identify what is missing. Difficulty explaining why code is structured as it is. |
| 1 | Cannot break the problem down even with hints. Solution is incorrect or incomplete. Does not know where to start without a near-full solution provided. |
Dimension 2: Code Quality and Clarity
| Score | Behavioral indicators |
|---|---|
| 4 | Variable and function names clearly communicate intent. No need to ask "what does this variable do?" Functions are small with a single responsibility. Would pass code review with minimal comments. |
| 3 | Names are reasonable; code is mostly readable. Minor issues (e.g., one vague variable name) that do not impede understanding. Reviewer would approve with small suggestions. |
| 2 | Names are generic (temp, data, flag). Functions mix multiple responsibilities. Reviewer would request significant changes before approving. |
| 1 | Code is incomprehensible without extensive explanation from the candidate. |
Dimension 3: Complexity Analysis
| Score | Behavioral indicators |
|---|---|
| 4 | States time and space complexity unprompted and correctly. Can reason about trade-offs between solutions with different complexity profiles. Identifies when a more complex algorithm is not worth it for small inputs. |
| 3 | States complexity correctly when asked. Understands that nested loops are typically O(n^2). May not spontaneously discuss trade-offs. |
| 2 | Incorrect complexity analysis or significant prompting required. Confuses time and space complexity. |
| 1 | Cannot reason about complexity even when asked directly. |
Dimension 4: Communication and Collaboration
| Score | Behavioral indicators |
|---|---|
| 4 | Asks clarifying questions before writing code. Narrates thinking as they work ("I'm going to use a hashmap here because..."). Receives and integrates hints gracefully. Tells the interviewer when they are stuck rather than going silent. |
| 3 | Occasionally narrates thinking. Handles hints without frustration. May not ask clarifying questions spontaneously but does when prompted. |
| 2 | Long stretches of silence. Does not integrate hints - either ignores them or abandons their approach entirely. Seems frustrated when challenged. |
| 1 | Non-communicative. Cannot describe what they are doing. Hostile to hints or suggestions. |
System Design Interview Rubric
Round purpose: Assess architectural thinking, scalability reasoning, and ability to navigate ambiguity in open-ended problems.
Dimension 1: Requirements Clarification
| Score | Behavioral indicators |
|---|---|
| 4 | Asks structured scoping questions before drawing anything: scale (DAU, QPS), read/write ratio, latency SLAs, consistency requirements, geographic distribution. Captures answers visibly. Explicitly states assumptions made. |
| 3 | Asks at least 3 meaningful clarifying questions. May miss one dimension (e.g., does not ask about consistency) but covers scale and latency. States assumptions at the start of the design. |
| 2 | Asks one or two surface questions, then jumps to design. Does not state assumptions. Design later requires backtracking when constraints are revealed. |
| 1 | Immediately starts drawing without asking any questions. Design is untethered from actual requirements. |
Dimension 2: High-Level Design Quality
| Score | Behavioral indicators |
|---|---|
| 4 | Draws clean component boundaries. Identifies all major components (load balancer, app layer, database, cache, queue) with correct relationships. Explains the responsibility of each component. Notes where data flows and in what format. |
| 3 | Correct high-level design with minor gaps (e.g., misses a CDN or doesn't specify database type). Components and relationships are clear. Could build from this diagram. |
| 2 | Design has significant omissions or incorrect component relationships. Unclear how components communicate. Single-point-of-failure not acknowledged. |
| 1 | Design is incomplete or fundamentally incorrect for the stated requirements. |
Dimension 3: Scalability and Bottleneck Analysis
| Score | Behavioral indicators |
|---|---|
| 4 | Identifies the bottleneck in their design proactively and proposes a mitigation. Uses back-of-envelope math to justify component choices ("at 10k QPS we need at least 5 app servers"). Discusses horizontal vs vertical scaling trade-offs for specific components. |
| 3 | Can identify bottlenecks when prompted. Knows that databases are often the first bottleneck. Can describe caching and read replicas as mitigations. |
| 2 | Does not think about scale until pushed. Proposes general solutions ("just add more servers") without reasoning. Cannot estimate required capacity. |
| 1 | Cannot identify bottlenecks even when told where to look. No concept of horizontal scaling. |
Dimension 4: Trade-off Articulation
| Score | Behavioral indicators |
|---|---|
| 4 | For every significant design decision, names the trade-off: "I chose eventual consistency here because strong consistency would require a distributed transaction which hurts write latency." Adjusts the design when the interviewer changes a constraint without starting over from scratch. |
| 3 | Can explain the primary trade-off for major design choices when asked. Adapts to changed constraints with moderate prompting. |
| 2 | Presents a single design as "the answer" without acknowledging alternatives. Struggles to adapt when a constraint changes. |
| 1 | Cannot discuss trade-offs. Defensive about design choices. Cannot adjust when constraints change. |
Behavioral Interview Rubric (STAR)
Round purpose: Assess past behavior as a predictor of future behavior across key competencies.
Dimension: STAR Response Completeness
| Score | Behavioral indicators |
|---|---|
| 4 | Provides clear Situation and Task context without over-explaining. Actions are specific and personal ("I did X" not "we did X"). Result is quantified or concretely described. Includes a reflection or learning unprompted. |
| 3 | All four STAR components present, though one may be thin. Actions are mostly specific. Result is stated, though may not be quantified. |
| 2 | Missing one component (usually Result or specific Actions). Answers tend toward the hypothetical ("what we would normally do") rather than a specific past event. Difficult to tell what the candidate personally did vs the team. |
| 1 | Cannot provide a specific example. Speaks only in generalities. No concrete result described. |
Dimension: Ownership and Accountability
| Score | Behavioral indicators |
|---|---|
| 4 | Takes clear personal ownership of both successes and failures. When describing something that went wrong, explains exactly what they did to fix it and what they changed afterward. Does not deflect to circumstances or other people. |
| 3 | Takes ownership in most cases. May occasionally frame failures as externally caused, but can reflect on their own contribution when probed. |
| 2 | Consistently credits others for successes and blames others or circumstances for failures. Difficulty identifying their own contribution. |
| 1 | No evidence of personal ownership. All outcomes attributed to the team or to luck. |
Dimension: Impact and Scope
| Score | Behavioral indicators |
|---|---|
| 4 | Examples are proportional to the seniority of the role. Impact is clearly described and significant (e.g., "reduced P99 latency by 40%", "drove adoption across 3 teams"). Can describe how they influenced decisions beyond their direct scope. |
| 3 | Examples are reasonable for the level. Impact is described, though may lack quantification. Scope is appropriate for individual contributor level. |
| 2 | Examples are too small for the target level (task-level, not project-level). Impact is vague or unmeasured. |
| 1 | Cannot describe meaningful impact. Examples are entirely task execution with no broader outcome. |
Culture / Values Interview Rubric
Round purpose: Assess alignment with how the team operates and makes decisions - not personality, not "vibe."
Before using this rubric, define your values explicitly. The rubric below uses example values (Direct Communication, Customer Obsession, Long-term Thinking). Replace with your team's actual values.
Dimension: Direct Communication
| Score | Behavioral indicators |
|---|---|
| 4 | Gives a specific example of delivering difficult feedback directly, including the setting, what they said, and how it was received. Can describe a time they received hard feedback and acted on it. Speaks candidly about past disagreements. |
| 3 | Shows directness in their example, but may have softened the delivery in ways that reduced impact. Open about receiving feedback. |
| 2 | Example involves conflict avoidance or routing concerns through a third party. Difficulty recalling a time they delivered hard feedback directly. |
| 1 | No examples of direct communication. Describes a style that relies on consensus or avoiding difficult conversations entirely. |
Dimension: Customer Obsession
| Score | Behavioral indicators |
|---|---|
| 4 | Brings up the end user unprompted when describing technical decisions. Can give a specific example of changing a technical plan because of customer feedback or data. Understands the difference between what customers ask for and what they need. |
| 3 | Can describe customer-driven decisions when asked. Aware of the end user, though it may not be their primary frame for technical choices. |
| 2 | Discusses technical decisions entirely in terms of technical elegance or internal process. Customer is an afterthought in examples. |
| 1 | Cannot provide an example involving customer impact. Decisions are purely internally-motivated. |
Dimension: Long-term Thinking
| Score | Behavioral indicators |
|---|---|
| 4 | Can give a specific example of accepting short-term pain (shipping slower, taking on tech debt deliberately) for long-term gain. Explains the reasoning and the outcome. Thinks about second-order effects of decisions. |
| 3 | Shows awareness of long-term trade-offs in examples. May not always have acted on long-term thinking but can reason about it. |
| 2 | Examples are dominated by short-term metrics (shipping fast, hitting the sprint goal) without consideration of downstream effects. |
| 1 | No evidence of thinking beyond the immediate task or sprint. |
Rubric Calibration Worksheet
Use this worksheet when running calibration sessions with new interviewers.
Instructions:
- Each interviewer reads the same candidate response (transcript or recording)
- Score each dimension independently - do not share scores yet
- Reveal all scores simultaneously
- For any dimension with a spread of 2+, discuss: what evidence led to each score?
- Reach consensus and document the "anchor" score with a one-sentence rationale
| Dimension | Interviewer A | Interviewer B | Interviewer C | Consensus | Rationale |
|---|---|---|---|---|---|
| Problem Decomposition | | | | | |
| Code Quality | | | | | |
| Complexity Analysis | | | | | |
| Communication | | | | | |
Calibration is complete when: every interviewer can independently arrive at the consensus score when presented with the same evidence, within 1 point.