A Question Answering System for Situation Puzzles with SPQA

Abstract: Many question answering (QA) systems have been built for solving QA tasks. In 2020 and 2022, the Allen Institute and the University of Washington proposed UnifiedQA and UnifiedQA-v2. Their core concept is that, although QA tasks come in different formats, the semantic understanding and reasoning capabilities required by models are shared, so format-specific models may not be necessary. Following this concept, I build a new QA model named SPQA, aimed at answering situation puzzle questions by adding a new situation-puzzle dataset (SpQ). In addition, I evaluate the performance of SPQA and UnifiedQA-v2 under both fine-tuning and prompt-tuning. The fine-tuning results indicate that the SpQ dataset is important for both fine-tuning and prompt-tuning to answer situation puzzle questions well, but it also makes the answering ability on normal yes/no questions worse. The prompt-tuning results indicate that, under the same data scale, the effect of SpQ is larger and more significant on both situation puzzle questions and normal yes/no questions. In future work, further research such as building a larger SpQ dataset should be considered.


Introduction
Situation puzzles (also called "lateral thinking puzzles" or "yes/no puzzles") are usually played by a group of players. The players ask questions that can only be answered with "yes" or "no" to the person hosting the game. Depending on the settings and difficulty of the puzzle, some information can be added to the answers, such as hints, simple explanations of why the answer is what it is, or the reply "not related". The puzzle is declared "solved" when one of the players can state the same process or truth that the host has in mind [1].
In 2018, BERT was proposed by Google. Its "bidirectional encoder representations from transformers" paper won the Best Long Paper Award at the 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), and it set new records on 11 NLP tasks [7].
In the same year, the GPT model proposed by OpenAI showed that generative pre-training can be transferred to a wide range of NLP tasks [8].
In 2019, the BART method combined the BERT and GPT models [10]. Its "Bidirectional and Auto-Regressive Transformers" built a pre-trained language model using a Transformer with an encoder-decoder structure [9].
In the same year, with the introduction of large-scale pre-trained models, almost all NLP tasks moved to a "pre-train then fine-tune" mode. Instead of modifying the pre-trained model itself, people generally introduced a few additional parameters (network layers) and completed downstream tasks by setting various objective functions. At this point, the focus of work shifted to objective function engineering [10].
In 2020, Google released the T5 model. Its most important contribution is providing a common framework for the entire field of NLP pre-trained models by transforming all tasks into one form. After that, the main task became how to convert tasks into appropriate text inputs and text outputs [11].
However, as pre-trained language models (PLMs) become larger and larger, the requirements for hardware and data, and the actual costs, also increase. Moreover, the design of the pre-training and fine-tuning stages becomes complex as a result of the large number of diverse downstream tasks. In order to explore smaller, more lightweight, more universal and efficient methods, researchers turned to the "prompt" approach. In 2021, "pre-train, prompt, and predict" was introduced, and the original "pre-train then fine-tune" mode has gradually been replaced by it. People no longer use customized objective function engineering to adapt pre-trained models to downstream tasks. Instead, various downstream tasks are reformulated with a short text prompt to resemble as closely as possible the problem forms that PLMs solve during pre-training [12].

Related Works
QA tasks are a kind of downstream task in NLP. Although QA task formats differ, the semantic understanding and reasoning capabilities required by models are common, and format-specific models may not be required. Based on this concept, the Allen Institute and the University of Washington proposed the first pre-trained question answering model, UnifiedQA, at EMNLP in November 2020; it can handle multiple formats of questions and answers and became a new SOTA for multiple question answering tasks. All NLP tasks can be converted to seq2seq tasks. Based on the same idea, UnifiedQA is a text-to-text pre-trained question answering model: the encoder receives the question and context spliced with "\n", and the decoder generates the answer [13]. In 2022, the original team simply added more pre-training datasets to the original UnifiedQA, which further improved the performance of the model on both "seen" and "unseen" datasets, yielding UnifiedQA-v2 [14].
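As a concrete illustration, the following minimal sketch runs a yes/no question through this text-to-text convention, assuming the Hugging Face transformers library and the publicly released allenai/unifiedqa-v2-t5-small-1363200 checkpoint; the checkpoint choice and the helper name answer() are illustrative assumptions, not this paper's training setup.

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed public UnifiedQA-v2 checkpoint on the Hugging Face hub.
MODEL_NAME = "allenai/unifiedqa-v2-t5-small-1363200"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def answer(question: str, context: str) -> str:
    # UnifiedQA convention: lower-case the text and splice the question
    # and context with "\n" (written as an escaped "\\n" in the string).
    source = f"{question} \\n {context}".lower()
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=20)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("Was the man wearing a spacesuit?",
             "One day, I was carrying out a mission in space wearing a spacesuit."))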
At the same time, GPT-2 had 1.5 billion parameters in 2019 [15]. In 2020, GPT-3 already had an astonishing 175 billion parameters [16]. In 2022, InstructGPT and ChatGPT were released [17]. On March 14, 2023, GPT-4 was released [18]. This paper focuses on constructing a question answering system for situation puzzles based on SPQA. Taking the UnifiedQA-v2 model as the base model and adding a situation puzzle training set, I constructed a situation-puzzle QA model (SPQA) and then an SPQA prompt-tuning model. Eventually, I compared the behavior of UnifiedQA, SPQA, SPQA-prompt, and ChatGPT (GPT-3.5) on solving situation puzzle problems.

Methodology
In this paper, I use two methods to evaluate the effect of adding the SpQ dataset: UnifiedQA-style (v1 and v2) multi-format training with fine-tuning, and parameter-efficient prompt tuning.

Multi-format Training in UnifiedQA
Firstly, I want to train an SPQA model that can operate over formats F_1, F_2, ..., F_n, like the structure of the UnifiedQA model. For each format F_k there is a set of ℓ_k datasets D_{k,1}, D_{k,2}, ..., D_{k,ℓ_k}, where each D_{k,i} = (T_{k,i}, E_{k,i}) includes a training set T_{k,i} and an evaluation set E_{k,i}. If a dataset is to be used only for evaluation, I ignore its T_{k,i} in order to treat it as an "unseen" dataset. In the SPQA model, the "unseen" datasets only include "yes/no question" format datasets.
In the pre-processing stage, I also transfer each training question q in format F_k into a plain-text input representation enc_k(q), the same as that of the UnifiedQA training datasets. I use the UnifiedQA approach of creating a mixed training pool that includes all available training examples:

T = ∪_{k=1}^{n} ∪_{i=1}^{ℓ_k} T_{k,i}
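A minimal sketch of this mixing step, assuming every training set has already been encoded into (input, output) text pairs; the function name build_pool and the uniform shuffling are illustrative choices, not the exact UnifiedQA sampling recipe.

import random
from typing import Dict, List, Tuple

def build_pool(train_sets: Dict[str, List[Tuple[str, str]]],
               seed: int = 0) -> List[Tuple[str, str]]:
    # Each value is a list of (input_text, output_text) pairs in the
    # UnifiedQA plain-text encoding; keys are dataset names such as
    # "BoolQ" or "SpQ". The pool is the union of all T_{k,i}.
    pool: List[Tuple[str, str]] = []
    for _name, examples in train_sets.items():
        pool.extend(examples)
    random.Random(seed).shuffle(pool)  # mix formats across batches
    return pool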

Adapter Tuning and Parameter-Efficient Prompt Tuning
Adapter tuning is related to multi-task and continual learning but also differs from them, because the tasks do not interact and the shared parameters are frozen, which indicates that the model can remember previous tasks perfectly while using only a few task-specific parameters. It proposed the use of adapter modules for transfer, thereby creating a compact and extensible model: only a few trainable parameters are added for each task, and new tasks can be added without revisiting previous ones. Parameter-efficient prompt tuning (also called soft-prompt tuning) follows the same spirit but, instead of inserting adapter modules, prepends a small number of trainable prompt embeddings to the input while keeping the entire pre-trained model frozen [19].
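A minimal sketch of soft-prompt tuning under these assumptions: a frozen t5-small backbone and a small matrix of trainable prompt embeddings prepended to the token embeddings; the prompt length, learning rate, and names are illustrative, not the exact configuration used in this paper.

import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
for p in model.parameters():
    p.requires_grad = False  # the shared pre-trained parameters stay frozen

N_PROMPT = 20  # number of soft-prompt tokens (a tunable choice)
soft_prompt = torch.nn.Parameter(
    torch.randn(N_PROMPT, model.config.d_model) * 0.02)

def forward_with_prompt(input_ids, labels):
    # Look up the normal token embeddings, then prepend the trainable prompt.
    tok_emb = model.get_input_embeddings()(input_ids)            # (B, L, d)
    prompt = soft_prompt.unsqueeze(0).expand(tok_emb.size(0), -1, -1)
    inputs_embeds = torch.cat([prompt, tok_emb], dim=1)          # (B, N+L, d)
    attention_mask = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    return model(inputs_embeds=inputs_embeds,
                 attention_mask=attention_mask, labels=labels)

optimizer = torch.optim.Adam([soft_prompt], lr=0.3)  # only the prompt is trained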

Experiment
According to the UnifiedQA paper, its concept is suitable for text-to-text encoding, and it therefore used T5 and BART to reach this multi-task target. UnifiedQA eventually used T5-11B and BART-large as the starting points for pre-training. As further research, UnifiedQA-v2 is trained on 20 datasets while UnifiedQA is trained on 8 datasets. In addition, UnifiedQA-v2 is trained for 350k steps and UnifiedQA for 100k steps [13][14].
In this paper, I also use T5 as the starting point for pre-training. Firstly, I collect, generate, and pre-process situation puzzle data into the situation puzzle dataset (SpQ), and then train the model on the 20 UnifiedQA-v2 datasets plus the SpQ dataset. In the end, I fine-tune on TPU and prompt-tune on GPU, and discuss the differences between SPQA and UnifiedQA-v2. As an example of the added situation puzzle data, the contents of one story are: "My pants are torn, I know I'm going to die soon. Because I am an astronaut. One day, I was carrying out a mission in space wearing a spacesuit when I suddenly noticed that my pants were torn. Afterwards, I was exposed to space without air pressure and oxygen, and in less than a few seconds, I would die." The questions and answers are shown in Table 1.

Details on the Experiments
Some details on the experiments are as follows:
(1) Models: T5 (3B, TPU) and T5 (GPU).
(3) Input/output size: token limits of 512 and 100 for inputs and outputs, respectively (see the sketch after this list).
(4) Number of iterations for pre-training on the seed datasets: all models are trained for 102k steps on the seed datasets.
(8) Fine-tuning on datasets: fine-tuned for 102k steps, with checkpoints saved every 20k steps.
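A minimal sketch of how the token limits in item (3) might be applied during pre-processing, assuming a Hugging Face T5 tokenizer; the t5-small size and the helper name encode_example are illustrative, and the actual TPU training pipeline is not reproduced here.

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")  # illustrative size
MAX_INPUT_TOKENS, MAX_OUTPUT_TOKENS = 512, 100       # limits from item (3)

def encode_example(source: str, target: str):
    # Truncate inputs to 512 tokens and outputs to 100 tokens.
    enc = tokenizer(source, max_length=MAX_INPUT_TOKENS,
                    truncation=True, return_tensors="pt")
    labels = tokenizer(target, max_length=MAX_OUTPUT_TOKENS,
                       truncation=True, return_tensors="pt").input_ids
    return enc.input_ids, enc.attention_mask, labels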

Evaluation and Results
In this paper, I compare SPQA with UnifiedQA-v2 and evaluate a fixed checkpoint across the target datasets: BoolQ, BoolQ-np, BoolQ-CS (unseen), SpQ, and SpQ-test (unseen), using the 100k checkpoint for both SPQA and UnifiedQA-v2. In addition, I discuss prompt tuning for SPQA and UnifiedQA-v2. Eventually, I observe the characteristics of the answers given by SPQA, UnifiedQA-v2, and ChatGPT (GPT-3.5).
These evaluation datasets cover both normal yes/no questions (the BoolQ variants) and situation puzzle questions (the SpQ variants).

Metrics
I evaluate each dataset via its common metric, accuracy. For yes/no questions, if the model gives the correct answer ("yes" or "yes, it is right."), it gains one point; otherwise, it gains no point. In addition, I also provide "aggregate scores" that compare the two models. The aggregate scores consist of two metrics: (1) the difference between the average performance scores of SPQA and UnifiedQA-v2 models of the same size (indicated with 'SP − Uni2'); (2) the percentage of datasets on which SPQA performs at least as well as UnifiedQA-v2 of the same size (indicated with 'SP ≥ Uni2?') [14].
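A minimal sketch of these two aggregate scores, computed from per-dataset accuracies; the function name and the example numbers are illustrative only, not results from this paper.

from typing import Dict, Tuple

def aggregate_scores(sp: Dict[str, float],
                     uni2: Dict[str, float]) -> Tuple[float, float]:
    # sp and uni2 map dataset name -> accuracy for two same-sized models.
    # Returns ('SP - Uni2', 'SP >= Uni2?'): the average score difference
    # and the percentage of datasets on which SPQA is at least as good.
    names = sorted(sp)
    diffs = [sp[n] - uni2[n] for n in names]
    avg_diff = sum(diffs) / len(diffs)
    pct_geq = 100.0 * sum(d >= 0 for d in diffs) / len(diffs)
    return avg_diff, pct_geq

# Illustrative numbers only:
print(aggregate_scores({"BoolQ": 0.80, "SpQ-test": 0.95},
                       {"BoolQ": 0.83, "SpQ-test": 0.40}))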

Evaluation
I evaluate each boolean QA dataset via its common metric, accuracy.
A similar tendency shows with the 'SP ≥ Uni2?' metric (the percentage of datasets on which SPQA outperforms UnifiedQA-v2). On this metric, the numbers are only slightly above 10%, which demonstrates that SPQA models generally do not achieve better performance across all yes/no QA datasets. However, on SpQ-test (unseen) in particular, the numbers are always 100%, which demonstrates that SPQA models achieve better performance on all situation puzzle QA datasets. In addition, SPQA of size 'base' outperforms UnifiedQA-v2 of the same size on 100% of the datasets, for both in-domain and out-of-domain datasets.
According to the results of the fine-tuning part, UnifiedQA-v2 consistently performs well, especially on normal yes/no questions at the 'small' and '3B' scales. On the other side, SPQA consistently performs better than UnifiedQA-v2 on situation puzzle questions, from 'small' to '3B'. When the model scales are 'base' and 'large', SPQA reaches acceptable results faster than UnifiedQA-v2. When the model scale reaches 3B, UnifiedQA-v2 begins to surpass SPQA on all the yes/no questions, which means the situation puzzle training dataset may affect the system's judgment on normal questions.
Summarizing the GPU prompt-tuning results from Table 4 and Table 5: across all experiments, SPQA-prompt yields a 1.88% performance improvement over UnifiedQA-v2-prompt on average ('SP − Uni2'). The highest gains appear on the mid-sized 'base' models (11.36% overall, 4.17% in-domain, and 20.0% out-of-domain). On the contrary, the lowest gains appear at the extreme size ('small').
A similar tendency shows with the 'SP ≥ Uni2?' metric (the percentage of datasets on which SPQA-prompt outperforms UnifiedQA-v2-prompt). On this metric, the numbers are only slightly above 30%, which demonstrates that SPQA models generally do not achieve better performance across all yes/no QA datasets. However, on SpQ-test (unseen) in particular, the numbers are always 100%, which demonstrates that SPQA models achieve better performance on all situation puzzle QA datasets. In addition, SPQA-prompt of size 'base' outperforms UnifiedQA-v2-prompt of the same size on 66.7% and 100% of the datasets for in-domain and out-of-domain datasets, respectively.
According to the results of the prompt-tuning part, where all datasets are trained under the same data scale, UnifiedQA-v2 and SPQA keep similar performance at 'small' and 'base'. For the 'large' model, the situation puzzle dataset has a more significant effect, making the performance on SpQ-test better and the performance on the other normal yes/no datasets worse.

SPQA vs UnifiedQA-v2 vs ChatGPT (GPT-3.5)
According to Table 6, I take some examples as references. When the contents of a situation puzzle involve crime or other negative information, ChatGPT rejects the request for legal reasons, even though the topic is just a story. When the contents of a situation puzzle involve other aspects, ChatGPT may sometimes give answers such as "The information given is not clear", "can't answer and need more information", or "Not mentioned in the story", meaning that ChatGPT needs the information asked about in these questions to be clearly mentioned in the stories. On these situation puzzle questions, ChatGPT takes a serious attitude and refuses to answer questions without clear answers, like a serious human. However, the rules require the model to answer only "yes" or "no", so the answers given by ChatGPT are sometimes unacceptable. The observed behaviors (Table 6) can be summarized as follows:
ChatGPT: often refuses to answer ("The information given is not clear", "Not mentioned in the story"); when it does answer, the answers are right.
UnifiedQA-v2: at the 'small' size, it still gives wrong answers that are neither "yes" nor "no".
SPQA: judges the task without format mistakes, but its correct rate is roughly the same as UnifiedQA-v2's; on situation puzzle questions, it reaches the target performance faster.

Conclusion
In this paper, I used the T5 architecture, following UnifiedQA's structure, and trained and fine-tuned a new model named SPQA on TPU-v2.8 to answer situation puzzle questions by adding a new dataset, SpQ. In addition, I used fewer datasets on GPU to train and prompt-tune a new model titled SPQA-prompt to answer situation puzzle questions under the same dataset scale. According to the performance above, the conclusion is that the situation puzzle dataset (SpQ) is important for fine-tuning and prompt-tuning to answer situation puzzle questions, but it also makes the answering ability on normal yes/no questions worse. For fine-tuning, the performance of SPQA at '3B' is good enough at the current stage. However, because of TPU and GPU limits and the data scale of the SpQ dataset, the effects of a larger SpQ dataset, the '11B' model scale, and prompt-tuning with larger datasets could not be considered in this paper. In addition, the performance and gains are not uniform across all datasets. In future work, I will try to build and train the model with a larger SpQ dataset alone and conduct other further research.