Like so many others, I’ve been playing around with language models lately. Also like so many others, I’ve been alternately amazed and amused at what they can do; one thing I’ve found myself wishing for is accessible ways to get LLMs to do things other than just chat and generate text.
This led me to OpenAI’s Functions API, which brings the idea of “plugins” to API consumers. The idea is that you can define functions that expose some information from or even actions on the outside world, and tell the LLM about those functions so it can call out to them when it wants to query / act on the world. The LLM can’t really call those functions on its own, of course, but it can tell you which function it’d like you to call on its behalf and await the result. The upshot for us here is that function calling requires the LLM to be able to understand a function’s purpose and signature so it can pass parameters in a compatible way. I hoped to take advantage of that functionality to get nice clean structured responses from the model.
We’ve also been brainstorming ways language models might be able to help with software delivery, so I thought I’d run a little experiment to see how well GPT-3.5-turbo and GPT-4 do at estimating story points on GitHub issues. For the uninitiated, story points are a subjective estimate of the amount of effort and complexity involved in completing a software development objective.
I was pretty skeptical that these models would have enough context or experience to do well at this task, but I was ready to be surprised.
For what it’s worth, I wouldn’t want to replace our planning poker process with LLM-assigned scores even if they were perfect. The most important thing about planning poker is that it gets the team on the same page about what the issue entails and surfaces any misunderstandings, dependencies, or gaps before work starts.
“Methodology”
To do an informal evaluation of story point estimates made by OpenAI’s function-capable models, I decided to do the following:
- Use the GitHub API to pull a few hundred scored user stories from one of our repos
- For each round (a rough sketch in code follows this list):
- Partition stories by their score, and choose one from each of the 5 most common buckets (scores 1, 2, 3, 5, 8). Shuffle their order so there’s no pattern to confuse the model.
- Add in another random story whose score has been redacted
- Generate a prompt asking the LLM to call a function with the correct score for the unscored issue
- Also generate a random score for the unscored issue
- Compare those against the score given to that issue by our project team
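For concreteness, one round might be assembled roughly like the sketch below, assuming `issues` is the list of parsed issues, each a map with a "score" key among its fields (the field name and shapes are my assumptions; the actual parsing code isn’t shown here):

# The five most common score buckets used for the scored examples
score_buckets = [1, 2, 3, 5, 8]

# One example issue per bucket, in shuffled order
examples =
  score_buckets
  |> Enum.map(fn score ->
    issues
    |> Enum.filter(&(&1["score"] == score))
    |> Enum.random()
  end)
  |> Enum.shuffle()

# Pick a different issue to score, remember the real answer, and redact it
target = issues |> Enum.reject(&(&1 in examples)) |> Enum.random()
actual_score = target["score"]
round_issues = examples ++ [%{target | "score" => nil}]
issue_json = Jason.encode!(round_issues)

# The random baseline (13 omitted, as noted in the Results section)
random_score = Enum.random([1, 2, 3, 5, 8])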
I used a Livebook to parse the issues and run the process over 100 rounds using both GPT-3.5-turbo and GPT-4.
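Pulling the issues in the Livebook might look roughly like the sketch below. Req, the repo path, and the token handling are all my assumptions, and since the original doesn’t say how the team’s scores were attached to the issues (labels, a project field, etc.), extracting them is left out:

defmodule IssueFetcher do
  # Placeholder repo path
  @issues_url "https://api.github.com/repos/owner/repo/issues"

  # Fetch one page of closed issues from the GitHub REST API; the decoded body
  # is a list of issue maps (title, body, labels, ...). Parsing the team's
  # scores out of those maps is not covered here.
  def fetch_page(page \\ 1) do
    Req.get!(@issues_url,
      auth: {:bearer, System.fetch_env!("GITHUB_TOKEN")},
      params: [state: "closed", per_page: 100, page: page]
    ).body
  end
end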
Generating the Scores
The interesting pieces are the bits of configuration used when querying the OpenAI chat completions API.
The Functions Parameter
This describes all of the functions the LLM can call. We have just one, which records a score for a given issue. We don’t actually run the function, but defining it this way forces the LLM to structure its response in a machine-parsable way.
[
  {
    "description": "Record the appropriate score for the given issue. Valid scores are the following: 1, 2, 3, 5, 8, 13",
    "name": "record_issue_score",
    "parameters": {
      "properties": {
        "issue_id": {
          "description": "The id of the issue whose complexity score is being recorded.",
          "type": "integer"
        },
        "score": {
          "description": "The complexity score for the issue with the given id.",
          "enum": [1, 2, 3, 5, 8, 13],
          "type": "integer"
        }
      },
      "required": ["issue_id", "score"],
      "type": "object"
    }
  }
]
The Function_Call Parameter
This parameter can be configured a few ways: you can let the LLM decide whether it needs to call a function (and which one), or you can tell it to always call a particular function. Here we force it to always call our scoring function:
{"name":"record_issue_score"}
The Prompt
The prompt is always the same except for the list of issues (five scored, one with a null score), which is encoded as JSON at the bottom of the prompt.
Below is a list of issue / ticket details in JSON format.
Each issue, except one, includes a complexity score, which helps our software delivery team to estimate the amount of effort required to complete the issue.
Acting as a software engineer, you are to propose a complexity score for the issue whose score is null. Some guidelines on scoring:
- Scores follow a Fibonacci sequence, indicating increasing complexity as the score increases
- Complexity might be increased by: novel problems, significant uncertainty, new vendor integrations, complex business rules, coordination between subsystems like front-end, back-end, and databases
- A straight-forward copy change might be a 1, if the correct copy is known and the mechanisms for adding it are in place
- A score of 13 would be reserved only for very large or complex issues that ideally should be broken into multiple issues
Examine the scored issues to learn the complexity scale, then call the record_issue_score() function to record the complexity score of the un-scored issue.
#{issue_json}
I ran several smaller rounds with different prompts before landing on this one, but it could probably be optimized further.
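For reference, here’s a minimal sketch of the prompt/1 helper called in the next section (it would live inside whatever module the Livebook defines); the bracketed line stands in for the full instruction text shown above:

# Builds the full prompt by interpolating the JSON-encoded issues into the
# instruction text. Sketch only: the bracketed line represents the prompt
# text shown above.
def prompt(issue_json) do
  """
  Below is a list of issue / ticket details in JSON format.

  [... the rest of the instruction text from above ...]

  #{issue_json}
  """
end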
Putting It All Together
To get the model’s estimate for a particular story, we call OpenAI’s chat completions API:
{:ok, %{choices: [result]}} =
  OpenAI.chat_completion(
    model: "gpt-3.5-turbo-0613",
    stream: false,
    messages: [%{role: "user", content: prompt(issue_json)}],
    function_call: %{"name": "record_issue_score"},
    functions: [
      %{
        name: "record_issue_score",
        parameters: %{
          properties: %{
            issue_id: %{ ... },
            score: %{ ... }
          },
          required: ["issue_id", "score"],
          ...
        },
        ...
      }
    ]
  )

# response includes a `function_call` object with a JSON-encoded arguments key
args = get_in(result, ["message", "function_call", "arguments"])
{:ok, parsed_args} = Jason.decode(args)
score = Map.get(parsed_args, "score")

# ... store the score somewhere
Results
Score Frequencies
Here are the frequencies of the scores in my 100-issue sample:
Some observations just from the distribution of scores:
- The actual scores look like a normal-ish distribution centered around 4
- Scores of 8 and 13 were more common from the LLMs than in real life, with GPT-3.5 skewing high more often than GPT-4
- I cheated and omitted 13 from the randomly-chosen scores, since I knew they were very very rare in the real dataset. The prompt also warns the LLMs that a score of 13 would be reserved for an issue we’d prefer to break up, so it’s almost fair.
Error
I identified a few ways to measure the error for each sample (a short code sketch follows the list):
- Raw difference: the difference between the scores, calculated as `model_score - actual_score`
- Step difference: the number of “steps” away from the actual score, calculated as `index(model_score) - index(actual_score)`, where `index()` gives the score’s position in the list of scores `[1, 2, 3, 5, 8, 13]`
- Absolute difference: the distance from the actual score, calculated as `abs(model_score - actual_score)`
- Absolute step difference: the number of “steps” away from the actual score, calculated as `abs(index(model_score) - index(actual_score))`
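In Elixir, the measures for a single round work out to something like this sketch:

# The scoring scale and a helper that finds a score's position in it
scale = [1, 2, 3, 5, 8, 13]
index = fn score -> Enum.find_index(scale, &(&1 == score)) end

# Example: the model said 5, the team said 3
model_score = 5
actual_score = 3

raw_diff = model_score - actual_score                   # 2
step_diff = index.(model_score) - index.(actual_score)  # 1
abs_diff = abs(model_score - actual_score)              # 2
abs_step_diff = abs(step_diff)                          # 1

# Tallying step_diff across all rounds gives the frequencies discussed below:
# step_diffs |> Enum.frequencies() |> Enum.sort()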
Frequency of ‘step difference’ amounts
We can make some intuitive observations based on the frequencies of each step error value:
- GPT-4 agreed with the team 44% of the time and was within a step of the team’s score about 79% of the time
- GPT-3.5 agreed with the team 22% of the time and was within a step of the team’s score about 60% of the time
- Randomly chosen scores agreed with the team 21% of the time and came within a step in 59% of the rounds
Average errors across the sample set
| Error measure | random | GPT-3.5 | GPT-4 |
|---|---|---|---|
| Raw difference | -0.19 | 2.3 | 0.33 |
| Step difference | -0.30 | 0.77 | -0.03 |
| Absolute difference | 2.31 | 3.18 | 1.69 |
| Absolute step difference | 1.34 | 1.27 | 0.77 |
In the end, GPT-3.5 looks like it’s randomly choosing scores, with a bias toward the higher end of the scale. In some of the early testing, it was actually significantly worse than random.
I think GPT-4 did OK: better than random but definitely not a substitute for human scoring. It’s not surprising, though, given the very limited context: with just 5 random stories and zero knowledge about the code or the rest of the application, I might be hard-pressed to do better. Interestingly, in that position I would absolutely err on the side of higher scores, as both models did 🙂
Other Thoughts
- In one of my test rounds, GPT-4 submitted a score of 4 for one of the stories, but otherwise every response fit the prescribed format. Functions called by LLMs definitely need to be serious about input validation and error handling, of course (a minimal validation sketch follows this list).
- I would have liked to experiment more with different prompts, but it was getting expensive and time-consuming for an experiment I wasn’t planning to operationalize. I can say the results of my best prompt were a lot better than my worst, and making sure each sample included a range of scores also helped.
- Ultimately, I wanted to try out the idea of using the functions API to generate structured responses, and that part feels like a success. I’ll definitely keep this in my toolbelt as a way to get machine-readable output from LLMs in the future.
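On the validation point above, here’s a minimal sketch (not from the original Livebook) of guarding against out-of-scale scores when decoding the function-call arguments; the example arguments string is made up:

valid_scores = [1, 2, 3, 5, 8, 13]
args = ~s({"issue_id": 1234, "score": 4})  # hypothetical arguments with an invalid score

score =
  case Jason.decode(args) do
    {:ok, %{"score" => s}} when is_integer(s) ->
      if s in valid_scores, do: s, else: nil  # reject values outside the scale
    _ ->
      nil  # malformed JSON or missing key
  end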
What’s Next?
There are two different avenues I’m interested in pursuing further. The first is the general question of whether fine-tuning would produce much better results. It seems at least plausible that the few-shot approach used here just didn’t provide enough data and context for the model to understand how complexity, as measured by story points, maps to the words in the issue.
The other thing I’d like to explore (perhaps to make the above affordable) is the local-llm-function-calling library, to get function-calling behavior from some of the open-source models.
Want to talk AI? Or work together to address a challenge your business is facing? Let’s connect. We love this stuff.
We're building an AI-powered Product Operations Cloud, leveraging AI in almost every aspect of the software delivery lifecycle. Want to test drive it with us? Join the ProdOps party at ProdOps.ai.