Credit Scoring Fairness

Context

An ML-based credit scoring model ranks applicants for loan approval. Historical data may embed societal or institutional biases. The system must ensure equitable performance and decisions across protected groups while maintaining predictive validity.

Trigger

A retail banking credit risk platform (scoring + decisioning) processes loan applications for underwriting teams.

Acceptance Criteria

Demographic parity difference (selection-rate gap) ≤ 0.10 for every protected-group comparison
Disparate impact ratio (a group's selection rate divided by the highest group's selection rate) investigated and remediated if it falls below 0.80 (four-fifths rule per EEOC Uniform Guidelines)
Equal opportunity difference (TPR gap) ≤ 0.05 for “good payer” prediction across protected groups
Equalized odds (TPR/FPR gaps) reported each release
Group-wise calibration error (Brier score or ECE) differs by no more than 2 percentage points (0.02 absolute) between protected groups
Mitigation documented (reweighing, adversarial debiasing, post-processing thresholds) with justified trade-offs
Model cards and data sheets provided with quarterly group-wise performance dashboards
Continuous disparity monitoring runs in production
Alerts triggered when a fairness metric breaches its threshold for more than 24 h, followed by rollback or human review
Auditable log of model versions, features, thresholds, and mitigation settings retained for at least 24 months
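The numeric criteria above can be checked with a few lines of code. The following is a minimal sketch in plain Python, assuming binary labels where 1 means "good payer", binary decisions where 1 means approved, and one protected-group value per applicant. The function names (group_rates, fairness_summary, check_thresholds) and the toy data are illustrative assumptions, not part of any library or of the platform described here.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Per-group selection rate, TPR, and FPR from binary labels and decisions."""
    counts = defaultdict(lambda: {"n": 0, "sel": 0, "pos": 0, "tp": 0, "neg": 0, "fp": 0})
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts[g]
        c["n"] += 1
        c["sel"] += yp
        if yt == 1:
            c["pos"] += 1
            c["tp"] += yp
        else:
            c["neg"] += 1
            c["fp"] += yp
    rates = {}
    for g, c in counts.items():
        if c["pos"] == 0 or c["neg"] == 0:
            # TPR or FPR is undefined without both classes in the group.
            raise ValueError(f"group {g!r} lacks positives or negatives")
        rates[g] = {
            "selection_rate": c["sel"] / c["n"],
            "tpr": c["tp"] / c["pos"],
            "fpr": c["fp"] / c["neg"],
        }
    return rates


def fairness_summary(rates):
    """Worst-case gaps and ratio across all groups."""
    sel = [r["selection_rate"] for r in rates.values()]
    tpr = [r["tpr"] for r in rates.values()]
    fpr = [r["fpr"] for r in rates.values()]
    return {
        "demographic_parity_diff": max(sel) - min(sel),
        "disparate_impact_ratio": min(sel) / max(sel) if max(sel) > 0 else float("nan"),
        "equal_opportunity_diff": max(tpr) - min(tpr),
        "fpr_diff": max(fpr) - min(fpr),
    }


def check_thresholds(summary):
    """True means the criterion is met (thresholds taken from the list above)."""
    return {
        "demographic_parity_diff": summary["demographic_parity_diff"] <= 0.10,
        "disparate_impact_ratio": summary["disparate_impact_ratio"] >= 0.80,
        "equal_opportunity_diff": summary["equal_opportunity_diff"] <= 0.05,
    }


# Toy data (hypothetical): two groups of four applicants each.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

summary = fairness_summary(group_rates(y_true, y_pred, groups))
print(summary)
print(check_thresholds(summary))
```

Taking the maximum and minimum over all groups gives the worst-case pairwise comparison, which matches "any protected-group comparison" in the criteria. In practice, Fairlearn's MetricFrame and functions such as demographic_parity_difference cover the same ground with more safeguards.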

Terminology

True Positive Rate (TPR): Share of actual positives correctly predicted as positive; also called recall/sensitivity
False Positive Rate (FPR): Share of actual negatives incorrectly predicted as positive
Expected Calibration Error (ECE): Weighted average, over probability bins, of the absolute gap between mean predicted probability and observed outcome frequency; lower is better
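To make the ECE definition concrete, here is a minimal sketch in plain Python. It assumes equal-width probability bins (ten by default, an assumption; the source does not fix a binning scheme) and includes the Brier score for comparison. The function names and sample values are illustrative.

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for yt, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((yt, p))
    n = len(y_true)
    ece = 0.0
    for b in bins:
        if not b:
            continue  # empty bins carry no weight
        observed = sum(yt for yt, _ in b) / len(b)   # outcome frequency in the bin
        predicted = sum(p for _, p in b) / len(b)    # mean predicted probability
        ece += (len(b) / n) * abs(observed - predicted)
    return ece


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - yt) ** 2 for yt, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical scores for four applicants (1 = good payer).
labels = [1, 0, 1, 1]
probs = [0.9, 0.1, 0.8, 0.3]
print(expected_calibration_error(labels, probs))
print(brier_score(labels, probs))
```

Running the same functions separately on each protected group's rows gives the group-wise values that the calibration criterion compares.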

Measurement & Verification

Use a held-out, stratified evaluation set with sufficient support per group (at least 100 positives and 100 negatives) and fixed data splits
Compute per-group metrics (selection rate, disparate impact ratio, demographic parity difference, TPR/FPR, calibration curves) with bootstrap 95% confidence intervals each release
Validate calibration via reliability diagrams and ECE/Brier
Verify group-wise parity against thresholds
Recommended tooling: scikit-learn + Fairlearn or AIF360
Archive notebooks/reports with data and code hashes
Monitor the same metrics on fresh labeled samples in production
Alert and trigger rollback/human review on sustained breaches
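The bootstrap confidence interval mentioned in the steps above can be sketched as follows. This is a minimal percentile bootstrap in plain Python for the selection-rate gap, under the assumption that decisions are binary (1 = approved); the function names, the number of resamples, and the demo data are illustrative choices, not values from the source.

```python
import random


def selection_rate_gap(y_pred, groups):
    """Largest minus smallest group selection rate."""
    totals, selected = {}, {}
    for yp, g in zip(y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        selected[g] = selected.get(g, 0) + yp
    rates = [selected[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


def bootstrap_ci(y_pred, groups, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for metric(y_pred, groups)."""
    rng = random.Random(seed)  # fixed seed keeps release reports reproducible
    n = len(y_pred)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        stats.append(metric([y_pred[i] for i in idx], [groups[i] for i in idx]))
    stats.sort()
    lower = stats[max(int((alpha / 2) * n_boot), 0)]
    upper = stats[min(int((1 - alpha / 2) * n_boot), n_boot - 1)]
    return lower, upper


# Hypothetical evaluation set: 200 applicants per group,
# approval rates of 60% (group A) and 40% (group B).
y_pred_demo = [1] * 120 + [0] * 80 + [1] * 80 + [0] * 120
groups_demo = ["A"] * 200 + ["B"] * 200

point = selection_rate_gap(y_pred_demo, groups_demo)
low, high = bootstrap_ci(y_pred_demo, groups_demo, selection_rate_gap)
print(f"gap={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

Rows are resampled across the whole set here for brevity. Resampling within each group (a stratified bootstrap) keeps group sizes fixed and may be preferable when some groups are small.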

Related Qualities
