SAFE-Bench evaluates AI models across seven comprehensive domains that reflect real-world tasks performed by safety and risk professionals. Our methodology ensures objective, reproducible assessments of model capabilities and limitations.
All tests are independently evaluated by certified safety professionals with an average of 10+ years of industry experience.
Foundation AI models like GPT, Claude, and Gemini are regularly tested and benchmarked across general capabilities. However, no industry-standard benchmark exists for safety and risk use cases.
Without clear performance metrics, safety professionals face a dilemma: some trust AI tools too much, while others don't trust them at all. SAFE-Bench provides the clarity needed to make informed decisions about where AI can enhance efficiency and where human expertise remains irreplaceable.
Our goal is to create authoritative, objective measurements that help the industry understand AI capabilities and limitations in real-world safety and risk scenarios.
Each domain tests a critical competency required of safety and risk professionals in real-world scenarios.
Raw cognitive assessment of safety and risk knowledge
100 multiple-choice questions based on CCSP (Certified Safety and Security Professional) standards, testing foundational knowledge across safety principles, risk management, compliance, and industry best practices.
Pass rate is based on the percentage of correct answers; models must achieve 70% or higher to pass this domain.
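As an illustration, the pass/fail computation for this domain can be sketched as follows. This is a minimal sketch with hypothetical function and field names, not the actual SAFE-Bench harness:

```python
def score_knowledge_domain(model_answers, answer_key, pass_threshold=0.70):
    """Illustrative scoring for the 100-question knowledge domain."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer count must match the number of questions")
    correct = sum(given == expected for given, expected in zip(model_answers, answer_key))
    pass_rate = correct / len(answer_key)
    return {
        "correct": correct,
        "pass_rate": round(pass_rate, 3),
        "passed": pass_rate >= pass_threshold,  # 70% or higher passes the domain
    }
```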
Visual analysis and comprehensive hazard reporting
Models analyze a sample set of workplace photos and must accurately identify all hazards present. Performance is measured on completeness (finding all hazards), accuracy (no false positives), and report quality (actionable, clearly written documentation).
Scored on three dimensions: detection accuracy, false positive rate, and a report usability score assigned by safety professionals.
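The two automatic dimensions can be computed directly from an expert-labelled hazard list. The sketch below assumes hazards are matched by label and leaves the usability score to human raters; the names are hypothetical, not the published scoring code:

```python
def score_hazard_identification(identified, ground_truth):
    """Illustrative detection metrics for one photo set.

    identified   -- set of hazard labels the model reported
    ground_truth -- set of hazard labels confirmed by safety professionals
    """
    true_positives = identified & ground_truth
    false_positives = identified - ground_truth
    detection_accuracy = len(true_positives) / len(ground_truth) if ground_truth else 1.0
    false_positive_rate = len(false_positives) / len(identified) if identified else 0.0
    return {
        "detection_accuracy": round(detection_accuracy, 3),
        "false_positive_rate": round(false_positive_rate, 3),
        "missed_hazards": sorted(ground_truth - identified),
    }
```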
Biomechanical analysis and ergonomic risk assessment
Models analyze video footage of workers performing various tasks and provide comprehensive ergonomic assessments. Evaluations must identify risk factors, assess severity using standard frameworks (REBA/RULA), and recommend specific interventions.
Assessed on risk factor identification, severity scoring accuracy, and quality/practicality of recommendations.
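One automatic consistency check that is possible here (an assumption about implementation, not documented SAFE-Bench methodology) is verifying that a model's stated risk level matches the action band implied by its own REBA score. The bands below are the standard published REBA action levels:

```python
def reba_action_band(reba_score):
    """Map a final REBA score (1-15) to the standard REBA action band."""
    if not 1 <= reba_score <= 15:
        raise ValueError("final REBA scores range from 1 to 15")
    if reba_score == 1:
        return "negligible risk - no action necessary"
    if reba_score <= 3:
        return "low risk - change may be needed"
    if reba_score <= 7:
        return "medium risk - investigate further, change soon"
    if reba_score <= 10:
        return "high risk - investigate and implement change"
    return "very high risk - implement change now"
```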
Document analysis and program improvement
Models review safety programs, written procedures, and policy documents to identify gaps, errors, and areas for improvement. Must understand regulatory requirements, industry standards, and best practices while providing specific, actionable feedback.
Evaluated on completeness of findings, accuracy of regulatory knowledge, and usefulness of improvement suggestions.
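At its simplest, the gap-finding part of this task amounts to checking a document against a list of required elements. The checklist below is invented for illustration and is not drawn from any specific regulation or from SAFE-Bench's rubric:

```python
# Hypothetical required-element checklist for a written safety program.
REQUIRED_ELEMENTS = {
    "management commitment",
    "hazard identification",
    "incident reporting",
    "training requirements",
    "emergency procedures",
    "program review schedule",
}

def find_program_gaps(document_sections):
    """Return required elements missing from the reviewed program."""
    present = {section.strip().lower() for section in document_sections}
    return sorted(REQUIRED_ELEMENTS - present)
```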
Root cause analysis and logical reasoning
Given incident scenarios with supporting evidence, models must conduct thorough investigations using recognized methodologies (5 Whys, Fishbone, TapRooT). Analysis must identify contributing factors, determine root causes, and recommend corrective actions.
Judged on logical reasoning, root cause accuracy, comprehensiveness, and practicality of corrective actions.
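The four criteria can be combined into a single domain score. The equal weighting in this sketch is an assumption for illustration, not SAFE-Bench's published weighting:

```python
INVESTIGATION_CRITERIA = (
    "logical_reasoning",
    "root_cause_accuracy",
    "comprehensiveness",
    "corrective_action_practicality",
)

def investigation_score(ratings, weights=None):
    """Combine per-criterion ratings on a 0-1 scale into one domain score."""
    if weights is None:
        # Assumed equal weights; actual weighting is not specified here.
        weights = {c: 1 / len(INVESTIGATION_CRITERIA) for c in INVESTIGATION_CRITERIA}
    return sum(ratings[c] * weights[c] for c in INVESTIGATION_CRITERIA)
```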
Creating training content and safety resources
Models generate common workplace safety materials including toolbox talks, training presentations, safety bulletins, and educational content. Materials must be accurate, engaging, appropriately detailed, and immediately usable by safety professionals.
Human evaluators rate materials on accuracy, clarity, engagement, and practical usability in real training scenarios.
Risk assessment and method statements
Models analyze work scenarios and create or review Risk Assessment Method Statements (RAMS). Must demonstrate understanding of work activities, identify hazards, assess risks using standard matrices, and provide appropriate control measures following the hierarchy of controls.
Evaluated on hazard identification completeness, risk rating accuracy, and quality/appropriateness of control measures.
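For reference, a conventional 5x5 risk matrix (likelihood x severity) against which a model's risk ratings could be checked; the band thresholds below are a common convention rather than the exact matrix used by SAFE-Bench evaluators:

```python
def risk_rating(likelihood, severity):
    """Return the numeric risk score and band for 1-5 likelihood and severity."""
    if not (1 <= likelihood <= 5 and 1 <= severity <= 5):
        raise ValueError("likelihood and severity are rated 1-5")
    score = likelihood * severity
    if score <= 4:
        band = "low"
    elif score <= 9:
        band = "medium"
    elif score <= 15:
        band = "high"
    else:
        band = "very high"
    return score, band
```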
Unlike generic AI benchmarks that test broad capabilities, SAFE-Bench focuses exclusively on tasks that safety and risk professionals encounter daily. Every test reflects actual work scenarios, not theoretical problems.
Each domain is evaluated by certified safety professionals with diverse backgrounds across consulting, insurance, manufacturing, and regulatory compliance. Multiple evaluators ensure objectivity and reduce bias.
AI models improve rapidly. We re-test all models quarterly to track performance changes and ensure our benchmark remains current. New models are added as they become available.
All testing criteria, scoring rubrics, and evaluation processes are publicly documented. We believe transparency builds trust and allows the industry to improve testing standards over time.
Explore our leaderboard to see how different AI models perform across all seven testing domains.