What Gets Documented, Gets Rewarded
Gen AI promises to anchor performance reviews in evidence rather than narrative. The shift trades one bias for a quieter one—and the new bias punishes a different kind of person.

Gen AI in performance reviews is being sold as the cure for manager fiction—surface the artifacts, replace storytelling with retrieval, anchor the review in what got written down. Boston Consulting Group reports its internal tool cuts review-writing time by 40 percent. Citi and JPMorgan have shipped their own. The pitch is that managers will stop telling stories and start reading evidence.
It is the wrong fix, framed as the right one.
Boston University’s Chrysanthos Dellarocas makes a sharper version of the case in an essay. Gen AI should not be polishing narratives; it should be surfacing the artifacts—the decision memo where a flawed assumption was challenged, the postmortem where a failed initiative got restructured, and the email thread where a regional restructuring got driven. He is right that the current deployments are pointed in the wrong direction. He is also working from an assumption that breaks down once you look at how work is distributed inside an org.
The trade-off
Narrative reviews have a bias problem. Different managers describe identical performance differently, because each description is shaped by memory, relationship, and storytelling instinct. That inconsistency is real, and it is worth fixing.
Artifact-based reviews have a different bias problem. They can only see the work that left a trail. A review system anchored in retrieved evidence will recognize the contributors whose work generates documentation. It will quietly fail to recognize the contributors whose value sits in work that doesn’t.
Neither distortion is neutral. Both under-recognize quiet operators—the first because managers tell stories about people they remember, the second because the system can only count what was filed. The shift Gen AI enables is not from biased to unbiased. It is from one bias to another.
The new bias is harder to argue with. A manager’s narrative is contestable—an employee can push back, a calibration committee can interrogate it, an external reviewer can spot the soft spots. An artifact corpus looks like data, and decisions justified by “the system surfaced these episodes” are hard to push back on. The people most disadvantaged by what the system didn’t surface have the weakest hand in arguing the gap.
Whose work goes invisible
Picture four people on a team.
The senior engineer who untangles a critical system in an afternoon and types “shipped” in Slack. The principal designer who shapes a junior over six months of one-on-ones and never writes any of it down. The operator who walks into a stuck cross-functional meeting and realigns it in 20 minutes—a calendar invite and someone else’s notes are the only surface artifacts. The technical IC who debugs a bad architectural decision live in a Zoom call, before it ships, before there is anything to document.
An evaluation regime that runs on artifacts will not recognize any of them in proportion to their contribution. The product manager on the same team who narrates her reasoning in a public Slack thread before every decision will look strategic by comparison. She may also be strategic—but the system cannot tell the difference between strategy and the documentation of strategy.
None of this is new. Research on workplace recognition has documented the pattern for years: work that doesn’t generate visible artifacts, such as mentorship, glue work, in-flight problem-solving, and the unglamorous translation between functions, is systematically under-credited. The people doing it skew predictably: more women, more mid-career operators, and more ICs whose effectiveness sits in conversations rather than commits.
Gen-AI-curated evidence reviews do not introduce this bias. They industrialize it.
Why Gen AI accelerates the bias
Before Gen AI, evidence-based review had one accidental safeguard.
It was expensive.
A manager who could only read so many artifacts had to make judgment calls about which ones. That judgment had room for “I know this person delivers, the receipts aren’t in this folder.” That margin is gone. Once retrieval is cheap and semantic analysis is automated, the artifact corpus becomes the de facto definition of the work.
Cheap retrieval also creates a second-order effect. Documentation hygiene becomes the fitness function. People learn that the system grades what is legible to it, and they adapt.
People write decision memos for future evaluators, not to decide. They post in public channels instead of direct messages. They generate retros for work that didn’t previously warrant them. The performative writing the current pitch was trying to eliminate at the manager layer reappears at the employee layer—and now everyone is producing it. It is the same dynamic that makes AI adoption stall inside otherwise functional rollouts: the tool works, the surrounding workflow has not been redesigned, and employees adapt to the parts the system can see.
Why orgs will deploy it
None of this will slow adoption. Orgs are not buying these tools primarily for accuracy. They are buying them for defensibility.
A Gen-AI-curated evidence trail is a legal and HR document regardless of whether it captures the right work. Calibration meetings move faster when every rating links to a paragraph and a citation. Comp decisions are easier to defend in a court filing when the supporting evidence looks systematic. The same is true of performance-improvement plans, terminations, and promotion denials.
None of that is a hidden agenda—it is the rational product of the constraints HR operates under. But it means the implementation logic of these tools is being shaped by defensibility, not by signal. Signal accuracy is a feature of the same system. It is not the buying criterion.
What to ask before deploying
For a CEO or chief people officer weighing this deployment, the move is not to refuse it. It is to know what the system is being optimized for, and to instrument the deployment so the bias doesn’t win quietly.
Three moves.
Run a comparison pilot, not a single-system rollout
In the pilot quarter, run two reviews on the same population: Gen-AI-evidence only, and Gen-AI-evidence plus manager narrative. Compare calibration outputs. The people who rank differently are the ones the artifact system is missing. If that delta correlates with role, level, or demographic, you have your answer about the bias and where to act on it.
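One way to read the pilot output is sketched below. It assumes each review arm produces a numeric score per person and that each person carries a single group label (role, level, or a demographic field). The function names, the sample names, and the scores are all hypothetical and exist only to illustrate the comparison.

```python
from collections import defaultdict


def rank_positions(scores):
    """Map each person to a 1-based rank; the highest score ranks first."""
    # Ties are broken alphabetically so the ranking is deterministic.
    ordered = sorted(scores, key=lambda person: (-scores[person], person))
    return {person: i + 1 for i, person in enumerate(ordered)}


def rank_shift_by_group(evidence_only, evidence_plus_narrative, group_of):
    """Average rank change per group between the two review arms.

    A positive value means the group ranks better once manager narrative
    is added, i.e. the evidence-only arm was under-recognizing it.
    """
    rank_evidence = rank_positions(evidence_only)
    rank_narrative = rank_positions(evidence_plus_narrative)
    shifts = defaultdict(list)
    for person in evidence_only:
        shifts[group_of[person]].append(
            rank_evidence[person] - rank_narrative[person]
        )
    return {group: sum(v) / len(v) for group, v in shifts.items()}


# Hypothetical pilot data: four people, two groups.
evidence_only = {"ana": 4.5, "ben": 4.0, "cy": 2.5, "dee": 2.0}
evidence_plus_narrative = {"ana": 4.0, "ben": 3.5, "cy": 4.5, "dee": 3.8}
group_of = {"ana": "pm", "ben": "pm", "cy": "mentor", "dee": "mentor"}

result = rank_shift_by_group(evidence_only, evidence_plus_narrative, group_of)
print(result)  # {'pm': -1.5, 'mentor': 1.5}
```

In this made-up sample the mentors climb an average of 1.5 places once narrative is added, which is the kind of delta the pilot is meant to expose. With real populations the averages are only a first look; a small group can swing on one person, so treat the output as a pointer to where to investigate, not as a statistical finding.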
Critique the governance, don’t adopt it wholesale
Dellarocas proposes three pillars:
- Verification that every surfaced piece of evidence links to its source
- Employee control over what gets included in the portfolio
- Scope boundaries that exclude DMs and casual messages
They make sense. They are also governance for the failure mode he is paying attention to—surveillance drift.
They do nothing about the failure mode this piece is pointing at: the corpus itself is uneven. Adopt the three pillars, then add a fourth—a structural check on whose work is and isn’t being surfaced.
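A minimal sketch of what that fourth check could look like follows. It assumes the tool can export the author of each surfaced artifact and that headcount per group is known. The function names, the sample data, and the 0.5 threshold are illustrative assumptions, not part of Dellarocas's proposal or of any vendor's product.

```python
from collections import Counter
from statistics import median


def surfaced_per_head(surfaced_authors, headcount_by_group, group_of):
    """Surfaced artifacts per person, for each group in the review population."""
    counts = Counter(group_of[author] for author in surfaced_authors)
    return {group: counts[group] / n for group, n in headcount_by_group.items()}


def under_surfaced_groups(rates, floor_ratio=0.5):
    """Groups whose rate falls below floor_ratio times the median group rate.

    floor_ratio is an arbitrary illustrative threshold, not a standard.
    """
    mid = median(rates.values())
    return sorted(group for group, rate in rates.items() if rate < floor_ratio * mid)


# Hypothetical export: one entry per artifact the system surfaced.
surfaced_authors = ["ana", "ana", "ana", "ben", "ben", "cy"]
group_of = {"ana": "pm", "ben": "engineer", "cy": "designer", "dee": "designer"}
headcount_by_group = {"pm": 1, "engineer": 1, "designer": 2}

rates = surfaced_per_head(surfaced_authors, headcount_by_group, group_of)
flagged = under_surfaced_groups(rates)
print(rates)    # {'pm': 3.0, 'engineer': 2.0, 'designer': 0.5}
print(flagged)  # ['designer']
```

A flagged group is not a verdict on anyone's performance. It marks where the corpus is thin, which is where manager narrative and direct observation need to carry more of the review.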
Fund what AI cannot replace
What Gen AI does not provide is a manager who has spent enough time with each report to know what they did without retrieving it. That is a function of span of control, manager tenure, and calibration committee composition. Each is a budget item.
If the Gen AI deployment is being used to extend manager span or shorten manager tenure, the savings come out of recognition accuracy. Name that trade when the deployment is justified on productivity grounds.
• • •
The three moves still leave one open question.
Can an org run an artifact-curation system without the system redefining what counts as good work?
The answer differs by industry, by function, and by the maturity of the manager bench.
Legible work is not the same as good work.
Bring two questions to the next leadership review.
- Which patterns in the current performance system already reward visibility over contribution?
- Is the AI being deployed about to scale those patterns or correct them?
The deployments that don’t put these questions on the table end up scaling the patterns they were sold to correct.