A government trial of artificial intelligence has shown it performed markedly worse than humans at summarising public submissions.
As the Federal Government continues to stress that its workforce must learn how to embrace the responsible use of AI, a targeted trial at the Australian Securities and Investments Commission (ASIC) has delivered results that are, to say the least, very interesting.
The summaries were marked by reviewers who were unaware AI had played any role in producing them, and the human work was graded roughly twice as highly as the AI-generated summaries.
Across all criteria, humans far outperformed AI.
A report on the trial concluded that using AI could create unnecessary extra work because of the greater need to fact-check its output.
Amazon Web Services (AWS) conducted a test for ASIC to assess the capability of generative AI to summarise a sample of public submissions made to an external parliamentary joint committee inquiry.
Meta’s Llama2-70B model was prompted to focus on ASIC references and explain what they meant while summarising the submissions.
ASIC staff were given the same task, using identical directions and prompts.
The reviewers found wrong information, lack of nuance, misplaced context and overlooked emphases in some of the summaries they were marking – which turned out to be AI-generated work.
The inquiry was examining audit and consultancy firms, and AWS described the test run on submissions to it as a Proof of Concept (PoC).
When ASIC officials appeared in May before the Senate Committee on Adopting Artificial Intelligence, they were asked about the trial and its results.
ASIC officials took a question on notice and, after being pressed by Greens Senator David Shoebridge and independent Senator David Pocock, committed to tabling the report.
With that report now delivered, AWS stressed the PoC was not used for any of ASIC’s regulatory work or business purposes.
The report does raise, however, a number of red flags.
“The objectives of the PoC were to explore and trial Gen AI technologies, to focus on measuring the quality of the generated output rather than the performance of the models and to understand the future potential for business use of Gen AI,” the report states.
“The final assessment results of the PoC showed that out of a maximum of 75 points, the aggregated human summaries scored 61 (81 per cent), and the aggregated Gen AI summaries scored 35 (47 per cent).
“Whilst the Gen AI summaries scored lower on all criteria, it is important to note the PoC tested the performance of one particular AI model (Llama2-70B) at one point in time.
“The PoC was also specific to one use case with prompts selected for this kind of inquiry.
“In the final assessment, ASIC assessors generally agreed that AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better.
“The assessments showed that one of the most significant issues with the model was its limited ability to pick up the nuance or context required to analyse submissions.”
In its key observations, the report noted that a request to summarise a document appears straightforward to a human, but the AI struggled because the task in fact consists of several different actions "depending on the specifics of the summarisation request".
In the PoC, the summarisation work was achieved by a series of discrete tasks, with the selected AI model found to perform strongly with some actions and less capably with others.
The report found that generic prompting without specific directions or considerations resulted in lower-quality output compared to specific or targeted prompting.
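To illustrate the distinction the report draws, a minimal sketch of what a generic prompt versus a targeted prompt might look like is below; the prompt wording and helper function are hypothetical and are not taken from the PoC itself.

```python
# Hypothetical illustration of generic vs targeted prompting for submission
# summaries. The prompt text is an assumption, not quoted from the ASIC/AWS PoC.

def build_prompt(submission_text: str, targeted: bool) -> str:
    """Return a summarisation prompt, either generic or targeted."""
    if not targeted:
        # Generic prompt: no guidance on what to focus on.
        return f"Summarise the following submission:\n\n{submission_text}"
    # Targeted prompt: spells out the focus (here, ASIC references) and the
    # explanation expected, mirroring the directions described in the article.
    return (
        "Summarise the following submission. Identify every reference to "
        "ASIC, explain what each reference means in context, and note any "
        "recommendations directed at ASIC.\n\n"
        f"{submission_text}"
    )

if __name__ == "__main__":
    sample = "Our firm recommends that ASIC expand its audit inspection program."
    print(build_prompt(sample, targeted=False))
    print("---")
    print(build_prompt(sample, targeted=True))
```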
And it suggests that things can only get better.
“An environment for rapid experimentation and iteration is necessary, as well as monitoring outcomes,” the report states.
“Technology is advancing rapidly in this area. More powerful and accurate models and Gen AI solutions are being continually released, with several promising models released during the period of the PoC. It is highly likely that future models will improve performance and accuracy of the results.
“The PoC provided valuable learnings, demonstrating the current capabilities of Llama2-70B as well as the potential for growth.
“Although there are opportunities for Gen AI, particularly as the technology continues to advance, this PoC also found limitations and challenges for adopting Gen AI for this specific use case.”
Original Article published by Chris Johnson on Riotact.