The Pentagon’s cutting-edge technology research agency awarded cash prizes worth $8.5 million at hacker conference DEFCON last week as part of a contest to build open-source generative AI tools that can help find and patch software vulnerabilities.
Each of the seven finalist teams in the Defense Advanced Research Projects Agency’s AI Cyber Challenge (AIxCC) built a slightly different large language model toolset, but all were designed to do the same thing: scan a software library of more than 54 million lines of code for hidden flaws; validate the ones they found to make sure a hacker could actually exploit them; and then write and deploy patches to fix each one. All seven toolsets will be publicly released.
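The competing systems are far more elaborate than any short example can convey, but the three-stage loop described above can be sketched in a few lines. The sketch below is a hypothetical illustration only: the function names (scan_for_flaws, confirm_reachable, generate_patch) and the stub logic are invented for this article and are not taken from any team’s actual toolset.

```python
# Hypothetical sketch of the scan -> validate -> patch loop described above.
# All names and logic here are illustrative stand-ins, not any team's real code.
from dataclasses import dataclass

@dataclass
class Finding:
    file: str          # source file where the suspected flaw lives
    line: int          # approximate location
    description: str   # what the model thinks is wrong

def scan_for_flaws(codebase: dict[str, str]) -> list[Finding]:
    """Stage 1: flag suspicious code (an LLM would do this; stubbed here)."""
    return [Finding(path, 1, "possible unchecked input") for path in codebase]

def confirm_reachable(finding: Finding) -> bool:
    """Stage 2: check that an attacker could actually trigger the flaw
    (e.g. by fuzzing or re-querying a model); stubbed as always true."""
    return True

def generate_patch(finding: Finding) -> str:
    """Stage 3: draft a fix for a confirmed flaw; stubbed as a comment."""
    return f"# patched {finding.file}:{finding.line} ({finding.description})"

def run_pipeline(codebase: dict[str, str]) -> list[str]:
    patches = []
    for finding in scan_for_flaws(codebase):
        if confirm_reachable(finding):   # discard unconfirmed reports
            patches.append(generate_patch(finding))
    return patches

if __name__ == "__main__":
    demo = {"parser.c": "int read_input(char *buf) { /* ... */ }"}
    print(run_pipeline(demo))
```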
The winners, Team Atlanta, are a “multi-organizational and international team” backed by Korean technology conglomerate Samsung, said team leader Taesoo Kim, a Georgia Institute of Technology professor who is also vice president of Samsung Research. The 42-member team, which will receive the $4 million first prize, includes current and former Georgia Tech students as well as researchers from prestigious Korean technology institutions, Kim said.
The team had already decided to donate a “large portion” of the prize money back to Georgia Tech, Kim added, to ensure it could take on research projects “without strings attached, so that we can explore any subjects based on students’ interest.”
The field of AI research “is a fast-changing area, and the traditional funding model [for university research] doesn’t work very well. So we really want to push this boundary by ourselves,” Kim said.
The contest was designed to incentivize research to develop LLM tools that could solve a key problem in building “secure by design” software—a top goal of federal cybersecurity policy.
“The cost and difficulty of identifying bugs and creating appropriate patches for them” is a key bottleneck in the software development lifecycle, said AIxCC Program Manager Andrew Carney. “AIxCC helps solve for that.”
The problem is especially severe for end users of software that are “target rich, but cyber poor,” like regional hospital chains and other smaller healthcare institutions, said Jen Roberts, director of the resilient systems office at the Advanced Research Projects Agency for Health, which is cofunding the AIxCC contest with DARPA. In the healthcare sector, she noted, “it takes 491 days on average to patch software.”
On the defense side, officials have noted vulnerabilities in everything from legacy weapon systems to logistics and infrastructure networks, and previous hackathons and bug bounty programs have uncovered thousands of flaws.
For the contest, the Cyber Reasoning Systems (CRSs) built by the competing teams found 59 percent of the artificial vulnerabilities that organizers injected into the open-source codebases in the contest sandbox, and patched 43 percent of them, Carney said.
More significantly, the teams found a total of 18 previously undiscovered real vulnerabilities—and patched 11 of them—which were being responsibly disclosed to the publishers and maintainers of the code, he said.
“CRSs have proven that they can create valuable bug reports and patches for a fraction of the cost of traditional methods like bug bounties,” said Carney, adding that the average cost to find and patch a vulnerability with the AI tools was $152.
The hopeful news from the contest comes as a counterpoint to recent concerns among cybersecurity experts that a rising tide of low-quality, LLM-generated bug reports could overwhelm crowdsourced bug-hunting platforms.
Contestants had to demonstrate that the vulnerabilities their LLMs found could be reached by hackers, but they did not have to provide “proof of concept” exploit code, which was fortunate, said Michael Brown of the second-place team from Trail of Bits, a small research company in New York.
Research that Trail of Bits had done for the UK’s AI Security Institute had revealed that while LLMs are good at writing patches, they are much less skilled at writing exploits. “That’s a much more niche problem and much more complicated,” said Brown.
He added that the “first thing” Trail of Bits will do with their $3 million second prize “is buy a disgustingly expensive dinner tonight, for myself and the team.”
“But after that, we plan to reinvest a lot of our winnings,” he said, explaining that part of the money would go toward maintaining and developing Buttercup, the open-source AI tool the team built.
Like Buttercup, the rest of the AIxCC toolsets are open source, Carney said, and can be used for free by the maintainers of other open-source codebases. Open-source software libraries provide IT tools and applications widely used by companies and government agencies, but they are often maintained by small, under-resourced teams of volunteers.
Carney said the competition had been designed from the beginning so that the tools contestants developed could be accessible through “popular software development platforms.”