Does ChatGPT Have Better Judgement than a U.S. Diplomat?

7/9/2023

In a previous post I used ChatGPT to generate a plot synopsis for a Star Trek episode that has never existed. It’s response was well organized, coherent, and a huge improvement over the previous artificial intelligence/machine learning (AI/ML) methods that I used for a similar project in 2018. Other groups have used ChatGPT to take law and business school classes where it has performed as a C+/B- level student. The newest version of ChatGPT, GPT-4, even passed the bar exam, scoring in the 90th percentile of test takers. Inspired by these case studies, below I describe how I applied ChatGPT to another human-centric decision-making exam: The U.S. Foreign Service Officer Test (FSOT) Situational Judgement Test (SJT).

U.S. Foreign Service Officers (FSO) are the diplomats who represent the United States abroad at embassies and consulates. Their mission is to “promote peace, support prosperity, and protect American citizens while advancing the interests of the U.S. abroad.” Becoming an FSO is a multistep process, but one of the initial steps is taking the FSOT which consists of one essay and three multiple choice sections: job knowledge, English expression, and situational judgement.

The SJT is the focus of today’s experiment. From the Information Guide to FSO Selection: “The [SJT] is designed to assess an individual's ability to determine the most and least appropriate actions given a series of scenarios. The questions were written to assess precepts or competencies that are related to the job of a Foreign Service Officer, including Adaptability, Decision Making and Judgment, Operational Effectiveness, Professional Standards, Team Building, and Workplace Perceptiveness. (NOTE: Knowledge about the State Department's policies, procedures, or organizational culture is NOT required to answer these questions.)”

Each question of the SJT presents descriptions of situations that an FSO might encounter on the job along with possible responses to that situation. For each scenario, test takers are to select the best and worst responses. Extensive blog posts have been written on how to approach these questions because, unlike job knowledge or English language, this section is perceived as less objective and, as the description and name imply, more based on judgement. This is what makes assessing an AI’s performance on this section the most interesting. Would AI have better judgement than an FSO?

To approach this, I used a sample of 20 SJT questions from the internet with available answers. Here is one example SJT question: There are multiple tasks to be completed by the end of the day. Three people will be arriving to complete these tasks. What should you do?
A. Make a detailed list of everything to do, and leave it for the three people to divide up among themselves.
B. Make a detailed list for each person to do, ensuring each list is tailored to each person's strengths.
C. Talk to each person and assign tasks based on their interests.
D. Ask the first person who arrives to be in charge of coordinating the work with the other two.
(answers below)

I sent this question and the other 19 to ChatGPT (version 3.5 Turbo) in a few variations. In the first and simplest variation, I only gave the AI instructions to determine the best and worst responses. In the second variation, I gave the AI the context from the Information Guide to FSO Selection shown above in addition to the instructions. In the third and final variation, I gave the same information as the second variation but used a slightly different model, version 3.5 Turbo-16K (at the time of writing GPT-4 was not available to the author via the API).

Overall, the third variation with the full context and the larger model scored the most points (70%; 28 out of a possible 40) followed by the second variation (65%; 26/40) and the first variation (57.5%; 23/40). There was a lot of variation between the models and the third variation answered some questions incorrectly that the other variations answered correctly. In one extreme example, variations two and three got both parts of the question incorrect while variation one got one part correct. Embarrassingly, for this question the second variation thought that the best answer was the worst answer. On another occasion, variations one and two got both parts correct but variation three guessed incorrect on the worst response. Finally, on the example question given above, only the third variation got the correct answer with the other two variations getting opposite parts of the question correct.

These variations are due to the non-deterministic nature of ChatGPT. Because of this randomness, it would be wise to run each variation multiple times to get average scores for each. Why didn’t I do this? While there is some financial cost associated with running these models, that was not the rate limiting step that prevented these additional tests (this whole post only cost $0.08). The real cost was time. Despite several efforts, ChatGPT was always too verbose in its response. It gave too much context to its answers making it time consuming to figure out the results of each run for the 20 questions. More trial and error with prompt optimization would eventually produce more streamlined answers (deeplearning.ai has a good online course on this topic that informed much of this work).

So, would ChatGPT be an FSO with sound judgement? We don’t know! The test scores are normalized to the average of other test takers, and we do not know what a typical passing score on this part of the test is. With a score of 70%, it is possible that the third variation could be in contention. The fact that all three variations were able to score so highly at all on multipart, multiple-choice questions on complex social scenarios is a testament to how far this technology has advanced.

The SJT portion of the FSOT is the most interesting test or ChatGPT because judgement is the skill that is the most "human" of the three test components. There are already AI tools to check spelling, commas, and other English grammar points. ChatGPT was trained on a broadcast of knowledge and would likely be able to pass the fact-based job knowledge portion of the exam. Taken together with this test, I am confident that ChatGPT could pass the FSOT.

However, the FSOT is only one part of the FSO application. What would an AI write for its personal narratives and previous job experience? How would it participate in the in-person oral exam? Furthermore, this experiment did not evaluate if ChatGPT would be effective at determining who gets a visa, representing government interests in a negotiation, or visiting a citizen in prison. It cannot fully replace a human. But just because it cannot do all aspects of the job, does not mean that it would not be useful. This experiment demonstrates that ChatGPT and similar LLMs may be able to be more supportive in diplomatic work than one might naively assume.

The zeitgeist is grappling with whose jobs are going to be taken by AI in the near and far future. Are Diplomats unreplaceable in the long run? Future iterations of ChatGPT-like AI may be able to handle all diplomatic work without human intervention, but not now. However, this work shows that AI today is capable of applying judgements to social situations, potentially making it more useful than one might first assume. The real short-term risk is not that diplomats are replaced by AI, but rather that diplomats do not embrace AI as a force multiplier and get left behind in this new chapter of our AI-enabled world.

FSOT/SJT Example Question Answers: Best (B); Worst (D)

0 Comments

Does ChatGPT Have Better Judgement than a U.S. Diplomat?

Leave a Reply.

Archives

Categories

Home

Projects

Blog