Purpose: To determine the inter-rater reliability (IRR) of a procedure-specific checklist scored in a binary fashion for the evaluation of surgical skill, and whether it meets the minimum level of agreement (≥ 0.8 between 2 raters) required for high-stakes assessment.

Methods: In a prospective, randomized, and blinded fashion, and after detailed assessment training, 10 Arthroscopy Association of North America Master/Associate Master faculty arthroscopic surgeons (in 5 pairs) with an average of 21 years of surgical experience assessed the video-recorded 3-anchor arthroscopic Bankart repair performance of 44 postgraduate year 4 or 5 residents from 21 Accreditation Council for Graduate Medical Education orthopaedic residency training programs from across the United States.

Results: No paired scores of resident surgeon performance evaluated by the 5 teams of faculty assessors dropped below the 0.8 IRR level (mean = 0.93; range, 0.84-0.99; standard deviation = 0.035). A 1-factor analysis of variance comparing the 5 assessor groups showed no significant difference between the groups (P = .205). Pearson's product-moment correlation coefficient revealed a strong and statistically significant negative correlation (r = -0.856, P < .001), indicating that as intra-operative error rate scores increased, the IRR decreased.

Conclusions: Arthroscopy Association of North America shoulder faculty raters from across the United States showed high levels of IRR in the assessment of an arthroscopic 3-anchor Bankart repair procedure. All paired assessments were above the 0.8 level, and the mean IRR of all resident assessments was 0.93, indicating that such assessments could be used for high-stakes decisions.

Clinical Relevance: With the move toward outcomes-based performance evaluation for graduate medical education, high-stakes assessments of surgical skill will require robust, reliable measurement tools that are able to withstand challenge. Surgical checklists employing metrics scored in a binary fashion meet this need and can show a high IRR (>0.8).
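
The abstract does not state how IRR was computed. As an illustration only, the sketch below assumes a simple proportion-of-agreement definition for binary checklist items (agreements divided by total items scored by both raters) and checks the result against the 0.8 level named above. The function names and the example scores are hypothetical and are not taken from the study.

```python
def proportion_agreement(rater_a, rater_b):
    """Fraction of binary checklist items on which two raters agree.

    Assumed definition (not confirmed by the abstract):
    agreements / (agreements + disagreements).
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("Both raters must score the same number of items")
    if not rater_a:
        raise ValueError("At least one scored item is required")
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)


def meets_high_stakes_level(irr, threshold=0.8):
    """True if the IRR reaches the minimum agreement level (>= 0.8)."""
    return irr >= threshold


# Hypothetical scores for one resident video: 1 = step/error observed, 0 = not.
rater_1 = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
rater_2 = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # the raters disagree on one item

irr = proportion_agreement(rater_1, rater_2)
print(irr, meets_high_stakes_level(irr))
```

Proportion agreement is only one possible index; chance-corrected statistics such as Cohen's kappa would give different values for the same scores.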