BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//pretalx//cfp.nsec.io//KACS9T
BEGIN:VTIMEZONE
TZID:America/Montreal
BEGIN:STANDARD
DTSTART:20001029T020000
RRULE:FREQ=YEARLY;BYDAY=-1SU;BYMONTH=10;UNTIL=20061029T060000Z
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:STANDARD
DTSTART:20071104T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=11
TZNAME:EST
TZOFFSETFROM:-0400
TZOFFSETTO:-0500
END:STANDARD
BEGIN:DAYLIGHT
DTSTART:20000402T020000
RRULE:FREQ=YEARLY;BYDAY=1SU;BYMONTH=4;UNTIL=20060402T070000Z
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
BEGIN:DAYLIGHT
DTSTART:20070311T020000
RRULE:FREQ=YEARLY;BYDAY=2SU;BYMONTH=3
TZNAME:EDT
TZOFFSETFROM:-0500
TZOFFSETTO:-0400
END:DAYLIGHT
END:VTIMEZONE
BEGIN:VEVENT
UID:pretalx-2026-KACS9T@cfp.nsec.io
DTSTART;TZID=America/Montreal:20260515T141500
DTEND;TZID=America/Montreal:20260515T144500
DESCRIPTION:In March 2025\, the Model Evaluation & Threat Research
  (METR) group introduced **AI task time horizons** as a way to
  measure the length of tasks that models can complete autonomously
  and coherently. They demonstrated rapid capability growth across
  frontier systems\, with horizons roughly doubling every ~7 months.
  While this framework has primarily been applied to general software
  and knowledge work\, its implications for adversarial domains remain
  largely unexplored.\n\nIn this talk\, I present joint work with
  **Sean Peters and Jack Payne**\, extending METR’s methodology to
  **offensive cybersecurity workflows**\, alongside a complementary
  **human baseline study** to ground and interpret model
  performance.\n\nTo better understand offensive model capabilities\,
  we assembled realistic multi-step offensive task sequences from a
  suite of industry-standard benchmarks. Both human participants and
  frontier models were evaluated across increasing task lengths to
  quantify sustained autonomy\, coherence\, and failure modes.\n\nInitial
  results indicate that AI task horizons in offensive cybersecurity
  are already meaningful and extending rapidly. In several domains\,
  models can chain complex tool-driven actions that resemble
  early-stage intrusion playbooks rather than isolated exploitation
  steps. The human study provides critical context\, highlighting
  where models approach or diverge from human performance as task
  length increases.\n\nThe talk covers the experimental design\,
  empirical findings\, and key limitations\, emphasizing how
  horizon-based evaluation combined with human grounding surfaces
  trends that standalone\, static benchmarks may miss.\n\nFinally\,
  this work is positioned as **exploratory research**. It raises
  questions about whether similar horizon trends appear in defensive
  workflows: how could we measure defensive task horizons\, and what
  methods would allow meaningful comparisons to offensive
  performance? If the trend does not replicate in defense\, what
  interventions\, tooling\, or policy changes could help close the
  gap? This framing invites further investigation and provides a
  roadmap for research and practitioner engagement in understanding
  and mitigating offense–defense asymmetries under AI automation.
DTSTAMP:20260507T212444Z
LOCATION:Ville-Marie
SUMMARY:Measuring AI Ability to Complete Long Cybersecurity Tasks - Jeremy 
 Miller
URL:https://cfp.nsec.io/2026/talk/KACS9T/
END:VEVENT
END:VCALENDAR
