A-BBL: A Risk Prediction Model for Patient Readmission based on Electronic Medical Records

Abstract: With the spread of medical digitization, electronic health record data have accumulated in large quantities, laying the foundation for intelligent healthcare. Mining and analyzing ICU data makes it possible to identify the risk of patient readmission in a timely manner, prevent and control the deterioration of patients' conditions, and reduce patients' costs. However, because medical data are of poor quality, their latent information cannot be mined effectively. To address this problem, a patient readmission risk prediction model, A-BBL, is proposed. By extracting and analyzing patients' discharge summaries, it predicts the risk that a discharged patient is readmitted within 30 days. The A-BBL model consists of three parts. First, the pre-trained BioBERT model is used to extract the semantic information of the medical text and generate the corresponding word vectors. Then, the sequence model BiLSTM captures the context information and models the input sequence. Finally, a self-attention mechanism extracts the key information in the input sequence and strengthens the vector representation of the sequence, thereby improving the performance of the model in predicting patient readmission. The proposed A-BBL model is verified on the real-world MIMIC-III medical data set. Compared with the baseline model, accuracy is improved by 7.2%. This study can help medical staff better understand and track the progression of critically ill patients, improve patient survival, and reduce the readmission rate.


Introduction
The digital transformation of the medical field has led to rapid growth in medical and health information. Electronic health records (EHR) and electronic medical records (EMR) are important components of electronic medical data. They contain a large amount of diagnosis and treatment information, such as electrocardiogram (ECG) waveforms, medical text, laboratory test results, treatments, drugs, diagnoses and demographic information. This heterogeneous information has become a valuable resource that helps medical staff make clinical decisions. Electronic health records play an important role in improving the medical system and the health of residents. However, because electronic medical data are large in volume, complex in structure and scattered, it is difficult for doctors to extract the most useful information from them.
In recent years, electronic medical data have become a new focus of academic attention, and more and more scholars have explored them in depth. Mining and analyzing medical data can help medical staff understand the health status and condition of patients more comprehensively, discover patterns and trends in disease development in time, and provide patients with the best diagnosis, treatment and prevention programs, thereby improving the quality and effect of medical services. In addition, evaluating therapeutic effects and predicting patient survival time can reduce medical accidents and misdiagnosis rates, allocate medical resources rationally, and improve the efficiency and quality of medical services. These applications are of great significance in clinical practice. Deep learning technology can make healthcare more intelligent, refined and personalized, and gives a strong boost to its development.
In the medical field, predicting and preventing patient readmission is an important issue. Readmission harms patients in several ways: (1) Increased medical costs: readmission means that diagnosis and treatment must be repeated, which consumes medical resources and time; (2) Increased treatment time: readmitted patients usually need longer treatment and recovery, which may affect their life and work; (3) Increased risk of complications: readmission may lead to more serious health problems, such as nosocomial infection and thrombosis, and these complications may have a longer-term impact on the patient's health; (4) Psychological impact: readmission may affect the patient psychologically, for example by raising doubts about the treatment effect or distrust of medical institutions, which in turn reduces the patient's enthusiasm for and confidence in treatment [5][6]. Therefore, the prediction and prevention of readmission receive more and more attention as a way to improve the efficiency and quality of patient care, and have been adopted as one of the goals of medical quality improvement. In summary, this paper studies the patient readmission risk prediction task with deep learning techniques. Based on publicly available electronic health records, we use the text of patients' pre-discharge medical summaries to build a deep learning model of readmission risk that can further improve the accuracy of patient readmission prediction.

Related Work
Readmission risk prediction is an important medical problem that can be approached with a variety of methods, including statistical learning, machine learning and deep learning.

Statistical methods
Regression analysis and survival analysis are commonly used statistical methods. Regression analysis uses variables such as the patient's basic information (age, gender, etc.), clinical indicators (vital signs, laboratory test results, etc.) and treatment methods to build a linear regression or logistic regression model that predicts the risk of readmission. Blecker et al. used a regression model to study the trend in hospital readmission rates among heart failure patients covered by Medicare in the United States. Survival analysis builds a survival model for patients who are readmitted within a certain period of time. The Kaplan-Meier curve represents the survival probability of patients intuitively, while the Cox proportional hazards model can consider the interaction between multiple variables and therefore assesses the risk of readmission more comprehensively. However, statistical prediction of readmission risk usually requires feature selection and variable screening to ensure the accuracy and interpretability of the model. In addition, because statistical methods rely on assumptions about the data distribution and the model and place high demands on data quality, the data must be preprocessed and cleaned to eliminate the influence of noise and missing values on the prediction results.

Machine learning methods
In the field of machine learning, methods such as decision trees and support vector machines are widely used in readmission risk prediction. Kerexeta et al. used a variety of machine learning algorithms to construct a prediction model for the 30-day readmission rate of heart failure patients. The comparison showed that the random forest algorithm had the highest prediction accuracy. Zheng et al. proposed a patient readmission rate prediction model based on the ant colony meta-heuristic algorithm and data mining technology. Mortazavi et al. analyzed the application of different machine learning techniques to readmission prediction in patients with heart failure, using a series of classification algorithms and comparing their performance. The experimental results show that the random forest algorithm performs best in predicting 30-day readmission of patients with heart failure, with high accuracy and sensitivity. Based on the time series information generated during hospitalization, researchers have also explored recurrent neural networks and their variants. These models can capture the time dependence of disease development and learn the potential links between different diseases in the patient's history. Reddy et al. used a recurrent neural network with long short-term memory (RNN-LSTM) to predict the readmission of patients with systemic lupus erythematosus. The model uses the patient's previous hospitalization records to learn the potential links between diseases. The contribution of this study is to use time-series information to predict the risk of hospital readmission more accurately. Lin et al. used long short-term memory (LSTM) networks and convolutional neural network (CNN) models to analyze and predict unplanned intensive care unit (ICU) readmissions. Important features are extracted from the time series data in medical records and modeled with LSTM to predict whether patients will be admitted to the ICU again.

Research methods and models
With the continuous advancement of Natural Language Processing (NLP) technology, more researchers have begun to use the text data in patients' electronic health records to predict the risk of readmission. Compared with traditional risk prediction models based on structured data, models based on NLP technology can capture the patient's health status and medical history more comprehensively and accurately, thereby improving the prediction accuracy of readmission risk. NLP technology can automatically identify and extract entities, relationships and events in medical texts. It can also analyze unstructured information such as the patient's emotional state and language features, providing a richer data source for risk prediction. Craig et al. proposed a patient readmission prediction model based on a one-dimensional convolutional neural network. The model converts words into vector representations with pre-trained word2vec embeddings, extracts features with convolutional layers, retains the most important features with max pooling layers, and generates predictions with a fully connected output layer. In this way, features are extracted automatically from doctors' notes to predict patient readmission risk. Although this model does not predict as well as some published models, it emphasizes the importance of the unstructured text in medical records for predicting readmission risk, studies text feature learning in depth, and proposes new clinical insights. Based on the BERT model, Huang et al. proposed the ClinicalBERT model, which is trained on medical text. Deep representation learning of clinical texts can automatically learn representations of words, phrases and sentences, as well as the semantic relationships between them, and thus reveal the clinical information hidden deep in the text.
At present, these methods have begun to recognize the importance of patients' medical text data, but they do not extract features from the data sufficiently. This paper proposes an A-BBL model for predicting patients' readmission risk, which extracts more comprehensive text information to judge that risk.

Research methods and models
This study explores the feasibility and effectiveness of predicting patient readmission risk with a BioBERT-BiLSTM-Attention model applied to the medical summary texts from the 48 hours before discharge. For text classification, the BioBERT model is an improved version of the BERT model. It is pre-trained on a large-scale biomedical text corpus and therefore performs well on text processing tasks in the biomedical field. The BiLSTM-Attention model uses BiLSTM to model the input sequence, so that it captures context information better when dealing with long text. At the same time, the attention mechanism helps the model focus on key information. It is robust to a certain amount of noise in the text data and can cope with some outliers and mislabeled data. The A-BBL model is shown in Figure 1.

BioBERT medical text pre-training
The discharge summary text in health records contains a large number of medical terms. The general BERT model cannot understand many of the professional terms, abbreviations and synonyms in medical text. BioBERT is a domain-specific BERT model based on a biomedical corpus. It is pre-trained on large-scale biomedical databases such as PubMed and PMC, and retains the structure and parameters of the original BERT model. The pre-trained model diagram is shown in Figure 2. BERT (Bidirectional Encoder Representations from Transformers) is a Transformer-based pre-training model published by Google in 2018. It is an unsupervised pre-training model that is trained on a large-scale corpus and learns rich language knowledge and general natural language representations, covering vocabulary, grammar, syntax and semantics. The pre-training of BERT includes two tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, BERT randomly masks some words in the input text and then predicts the masked words from the context of the remaining words through the Transformer encoder. In NSP, BERT combines two input sentences into a training instance and then uses the Transformer encoder to predict whether the two sentences are adjacent in the original text.
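The masking step of MLM can be illustrated with a minimal sketch. It is a simplification under stated assumptions: the function name `mask_tokens`, the 15% default masking rate and the fixed seed are chosen for illustration, and the refinement in the original BERT recipe (sometimes substituting a random token or keeping the original token instead of `[MASK]`) is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK].

    Returns (masked_tokens, labels). labels[i] holds the original token at
    masked positions (the prediction target) and None everywhere else.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model must recover this token
            labels.append(tok)
        else:
            masked.append(tok)    # visible context token
            labels.append(None)
    return masked, labels

tokens = "patient admitted with acute heart failure".split()
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

During pre-training, the loss is computed only at the positions where `labels` is not `None`.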
The input representation of the BERT model consists of three parts. Token embeddings are the first part of the input layer; they map each word to a fixed-length vector, that is, the word vector. Segment embeddings distinguish the different sentences or paragraphs in the input sequence. Position embeddings specify the position of each word in the input sequence. The sum of these three vectors is the BERT model's encoding of the input text, as shown in Figure 3. This input representation enables BERT to handle texts of different lengths and the relationships between different text segments effectively.
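A toy sketch of this input representation follows. The table sizes, the random initial values and the helper names `make_table` and `embed_input` are illustrative assumptions; in BERT the three tables are learned parameters, and the result is further normalized before entering the encoder.

```python
import random

def make_table(n_rows, dim, rng):
    """Create a toy embedding table with small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_rows)]

def embed_input(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum token, segment and position embeddings element-wise for each input position."""
    embedded = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ]
        embedded.append(vec)
    return embedded

rng = random.Random(0)
dim = 4
token_table = make_table(10, dim, rng)     # toy vocabulary of 10 tokens
segment_table = make_table(2, dim, rng)    # segment A / segment B
position_table = make_table(8, dim, rng)   # maximum sequence length of 8

vectors = embed_input([2, 5, 1], [0, 0, 1], token_table, segment_table, position_table)
print(len(vectors), len(vectors[0]))
```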

BiLSTM learns text context features
In processing the summary text of the electronic health record, BiLSTM serves as the text encoder. It captures the contextual semantic information of the input text sequence and extracts deeper features from the input text vectors, avoiding the bias of a plain RNN toward the later words in the sequence. Specifically, BiLSTM is a deep recurrent neural network that considers the context both before and after the current word. By learning this context, it models the input text and generates the corresponding semantic representation vector. BiLSTM is composed of two unidirectional LSTMs running in opposite directions, each sharing its weights across time steps, whose outputs are connected to the same output layer, so that the model can remember both past and future information, as shown in Figure 4. Compared with the traditional unidirectional LSTM, BiLSTM captures the dependencies between words more comprehensively and further improves the expressive ability and accuracy of the model. LSTM improves the hidden layer of the RNN by adding three gates, namely the forget gate, the input gate and the output gate, and a new state (the cell state). The LSTM model diagram is shown in Figure 5, and the computations of the forward LSTM are given in the accompanying formulas, including Formulas (2) and (3). The backward LSTM is computed in the same way as the forward LSTM, except that the input sequence is processed in reverse order, so its formulas are not repeated here. The final BiLSTM calculation is shown in Formula (8).
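For reference, the gates described above correspond to the standard LSTM formulation, which can be written as follows. The notation is an assumption of this sketch ($x_t$ is the input at step $t$, $h_t$ the hidden state, $C_t$ the cell state, $\sigma$ the sigmoid function, $\odot$ element-wise multiplication), and the equations are left unnumbered because they may not match the paper's own formulas symbol for symbol.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{(hidden state)} \\
h_t^{\mathrm{Bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] && \text{(BiLSTM output: forward and backward states concatenated)}
\end{aligned}
```

The last line is the usual way of combining the two directions; concatenation is assumed here, although some implementations sum the two states instead.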

Attention mechanism
The attention mechanism was originally inspired by biological attention. It is a technique for weighted aggregation that assigns different weights to different parts of the input. These weights reflect the contribution of each element of the input sequence to the output. To calculate the attention weights, each element of the input sequence is represented as a vector and the target representation (the query vector) as another vector. Calculating the similarity between the target representation and each element vector of the input sequence gives a weight vector. Multiplying each element vector by its weight and summing the results gives a weighted sum vector, which summarizes the elements of the input sequence that are most relevant to the target representation. The structure of the attention mechanism is shown in Figure 6. In the attention mechanism, the elements of the input sequence can be regarded as a set of (Key, Value) pairs, where Key is used to calculate the attention weight and Value is the value to be weighted and summed. Query represents the content that needs attention, and the attention weight coefficient of each Key is obtained by calculating the similarity or correlation between Query and that Key. These weight coefficients are used to weight the Values and obtain the final attention value. Therefore, the essence of the attention mechanism is a weighted sum of the Values of the elements in the input sequence, as shown in Formula (9).
The most commonly used attention mechanism is self-attention, in which the input sequence itself serves as the query, key and value. The weight coefficient of each Value is obtained by calculating the similarity between the Query and the corresponding Key. The weights are then multiplied by the value vectors and summed to obtain the attention vector, from which the output vector of each time step is calculated.
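The self-attention computation described above can be sketched in plain Python. This is a minimal illustration, not the paper's exact layer: scaled dot-product similarity is assumed as the similarity function, learned projection matrices are omitted so that the hidden vectors act directly as Query, Key and Value, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(hidden):
    """hidden: list of T vectors (for example BiLSTM outputs), used as Query, Key and Value.

    Returns (outputs, weights): one attention vector and one weight row per time step.
    """
    dim = len(hidden[0])
    outputs, weights = [], []
    for query in hidden:
        # Similarity between this query and every key (scaled dot product, an assumption).
        scores = [dot(query, key) / math.sqrt(dim) for key in hidden]
        w = softmax(scores)
        # Weighted sum of the value vectors.
        context = [sum(w_j * value[i] for w_j, value in zip(w, hidden)) for i in range(dim)]
        outputs.append(context)
        weights.append(w)
    return outputs, weights

hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(hidden_states)
print(weights[0])
```

Each row of `weights` shows how strongly one time step attends to every other time step, which is what allows the model to emphasize the key parts of a long discharge summary.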
The specific calculation process of the Attention mechanism, according to the similarity score between Query and Key, first obtains the query vector

Experimental data
The experimental data come from version 1.4 of MIMIC-III (Medical Information Mart for Intensive Care), an intensive care database published by the Laboratory for Computational Physiology at the Massachusetts Institute of Technology. MIMIC-III is a very important publicly available medical information database, which is jointly maintained by MIT Computer Science and Artificial Intelligence Laboratory (MIT CSAIL) and Massachusetts General Hospital. The database includes electronic health records (EHRs) of nearly 50,000 patients who were hospitalized in the ICU of Beth Israel Deaconess Medical Center in Boston, Massachusetts from 2001 to 2012.
In this experiment, a subset of MIMIC-III v1.4 was used, consisting of six tables: PATIENTS, ADMISSIONS, ICUSTAYS, DIAGNOSES_ICD, D_ICD_DIAGNOSES and NOTEEVENTS. The information is shown in Table 1.
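One way to derive a 30-day readmission label from the ADMISSIONS table is sketched below. The labeling rule is an assumption, since the text does not spell it out: an admission is labeled 1 if the same patient's next admission starts within 30 days of this discharge. The column names SUBJECT_ID, HADM_ID, ADMITTIME and DISCHTIME are those of the MIMIC-III ADMISSIONS table; the function name and the toy rows are hypothetical, and refinements such as excluding elective readmissions or in-hospital deaths are not handled.

```python
from datetime import datetime, timedelta

def label_readmissions(admissions, window_days=30):
    """Return {HADM_ID: 1 or 0}, where 1 means readmitted within window_days of discharge."""
    by_patient = {}
    for adm in admissions:
        by_patient.setdefault(adm["SUBJECT_ID"], []).append(adm)

    window = timedelta(days=window_days)
    labels = {}
    for stays in by_patient.values():
        stays.sort(key=lambda a: a["ADMITTIME"])       # chronological order per patient
        for current, following in zip(stays, stays[1:] + [None]):
            if following is None:
                labels[current["HADM_ID"]] = 0          # no later admission on record
            else:
                gap = following["ADMITTIME"] - current["DISCHTIME"]
                labels[current["HADM_ID"]] = int(timedelta(0) <= gap <= window)
    return labels

# Toy rows for illustration only (not real MIMIC-III records).
rows = [
    {"SUBJECT_ID": 1, "HADM_ID": 10, "ADMITTIME": datetime(2101, 1, 1), "DISCHTIME": datetime(2101, 1, 5)},
    {"SUBJECT_ID": 1, "HADM_ID": 11, "ADMITTIME": datetime(2101, 1, 20), "DISCHTIME": datetime(2101, 1, 25)},
    {"SUBJECT_ID": 1, "HADM_ID": 12, "ADMITTIME": datetime(2101, 6, 1), "DISCHTIME": datetime(2101, 6, 3)},
    {"SUBJECT_ID": 2, "HADM_ID": 20, "ADMITTIME": datetime(2102, 3, 1), "DISCHTIME": datetime(2102, 3, 4)},
]
print(label_readmissions(rows))
```

The discharge summary for each HADM_ID, taken from NOTEEVENTS, would then be paired with this label to form one training example.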
Because this experiment is based on the medical text (patient discharge summaries) in the electronic health record, and these data are recorded manually by medical staff, they may contain spelling errors, missing words, inconsistent formats, irregular writing, abbreviated medical terms and other quality problems. Before model training, the medical text needs to be preprocessed. The specific operations are as follows:
(1) Remove non-text content. Non-text content in the medical data, such as illegal characters and labels, is removed. This experiment uses Python's regular expression module (re) for the filtering, and also builds a vocabulary of illegal characters to filter out some punctuation marks and special non-English characters.
(2) Tokenization. The NLTK (Natural Language Toolkit) natural language processing library is used to segment the medical text.
(3) Spelling correction. There may be spelling errors in medical texts. Python's third-party library pyenchant is used for spell checking.
(4) Stemming and lemmatization. English words take many forms, such as singular and plural nouns and different verb tenses, and need to be restored to their base form. The WordNetLemmatizer class in NLTK, which is based on the WordNet dictionary, is used for this step.
(5) Convert to lowercase. Because English is case-sensitive, all words are converted to lowercase with Python's built-in API, so that words such as 'Heart' and 'heart' are counted as one word.
(6) Remove stop words. Stop words are words that occur frequently in English text but carry no actual meaning, such as 'a', 'to' and other short words. These words contain no information about the topic of the medical text. They are filtered out using the stop word list provided by the NLTK package.
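The cleaning steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stop word set is a tiny hypothetical stand-in for the NLTK list, whitespace splitting stands in for the NLTK tokenizer, spelling correction (pyenchant) and lemmatization (WordNetLemmatizer) are omitted, and the pattern for bracketed `[** ... **]` de-identification placeholders is an assumption about the note format. Lowercasing is applied before character filtering here, which differs from the order of the list but gives the same tokens.

```python
import re

# Hypothetical, tiny stop word list; the paper uses the list shipped with NLTK.
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "in", "on", "is", "was", "with"}

def preprocess(text):
    """Clean one discharge summary and return a list of tokens."""
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)   # drop de-identification placeholders
    text = re.sub(r"<[^>]+>", " ", text)            # drop tag-like labels
    text = text.lower()                             # 'Heart' and 'heart' become one word
    text = re.sub(r"[^a-z0-9\s]", " ", text)        # drop punctuation and special characters
    tokens = text.split()                           # stand-in for the NLTK tokenizer
    return [tok for tok in tokens if tok not in STOP_WORDS]

note = "Pt was admitted to the ICU on [**2101-1-1**] with <b>Heart</b> failure!"
print(preprocess(note))
# ['pt', 'admitted', 'icu', 'heart', 'failure']
```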
These steps reduce the noise in the medical text and make the data cleaner and more standardized. They also extract the key information in the data in a preliminary way, which avoids unnecessary computation and storage and thus reduces computational cost and processing time. Finally, they make the data more readable and interpretable, which makes the behavior of the data and the algorithm easier to understand and improves the accuracy of the model.

Evaluation index
The patient readmission risk prediction task in this paper is essentially a binary classification task. The goal is to predict whether a patient is at risk of being readmitted for treatment within 30 days of discharge, so that higher-risk patients can receive early intervention and the risk of readmission is reduced. Therefore, four indicators commonly used in classification tasks, Accuracy, Precision, Recall and F1-Measure, are selected to evaluate the model.
Accuracy: the number of correctly predicted samples divided by the total number of samples. This indicator works well for data sets with a balanced class distribution, but on imbalanced data sets it is dominated by the majority class and can hide poor performance on the smaller class. The calculation is shown in Formula (14).
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \tag{14}$$
Precision: the proportion of samples predicted as positive by the classifier that are actually positive. This indicator focuses on the correctness of the classifier's positive predictions, that is, on avoiding negative samples being incorrectly predicted as positive. The calculation is shown in Formula (15).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{15}$$
Here, TP is the number of actually positive samples that the classifier predicts as positive; FP is the number of actually negative samples that the classifier predicts as positive; FN is the number of actually positive samples that the classifier predicts as negative; and TN is the number of actually negative samples that the classifier predicts as negative.
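All four indicators can be computed from the confusion counts, as the sketch below shows. Accuracy and Precision follow the definitions above; Recall and F1 use their standard definitions, Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall), which are assumed here. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = readmitted, 0 = not readmitted)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 as a dict (0.0 when a denominator is zero)."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(classification_metrics(y_true, y_pred))
```

For readmission prediction, Recall matters in particular because a missed high-risk patient (a false negative) receives no early intervention.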

Comparison method
The performance of the A-BBL model on the patient readmission risk prediction task was evaluated by comparison with baseline models. The comparison covers two aspects: (1) verifying the effectiveness of the feature representation based on BioBERT; (2) verifying the effectiveness of the BiLSTM-Attention classification model. The comparative models are:
1. BioBERT: The pre-trained BioBERT model produces the text feature representation of the patient's discharge summary, which is passed through a fully connected layer directly into the Softmax classifier.
2. Word2Vec-BiLSTM: Word2Vec is used to train the word vector representation of the patient's discharge summary text, which is input as features into BiLSTM for classification.
3. BioBERT-RNN: The pre-trained BioBERT model produces the text feature representation of the patient's discharge summary, which is input into an RNN for feature training and classification.
4. BioBERT-CNN: The pre-trained BioBERT model produces the text feature representation of the patient's discharge summary, which is input into a CNN for feature training and classification.
5. BioBERT-BiLSTM: The pre-trained BioBERT model produces the text feature representation of the patient's discharge summary, which is input into BiLSTM for feature training and classification.

Results analysis
The patient discharge summary text extracted from the MIMIC-III data set is used to verify the patient readmission risk prediction model A-BBL proposed in this paper. The experimental results, compared with the other models, are shown in Table 2. Table 2 shows that, in the task of predicting the risk of readmission within 30 days after discharge, the A-BBL model reached an accuracy of 83.5%, while Word2Vec-BiLSTM performed worst, with an accuracy of 72.5%. The A-BBL model also has the highest F1 value, 83.6%, and Word2Vec-BiLSTM the lowest, 72.1%. The BioBERT-BiLSTM model was compared with Word2Vec-BiLSTM to verify the effectiveness of BioBERT pre-training. BioBERT-BiLSTM was also compared with BioBERT-CNN and BioBERT-RNN; because all three use BioBERT for the feature representation, the classifier is the only variable, and the comparison demonstrates the advantage of the BiLSTM model in learning the contextual semantics of text. Finally, the comparison between the proposed A-BBL model and the BioBERT-BiLSTM model shows that the attention mechanism extracts important features and gives the best performance on the patient readmission risk prediction data set.
The test set is used to verify the model, and the accuracy of A-BBL and other comparison models is obtained. The results are shown in Figure 7.
Figure 7 shows that the A-BBL model has the highest accuracy, 82.6%, followed by BioBERT-BiLSTM and BioBERT-RNN, and that Word2Vec-BiLSTM performs worst, with an accuracy of 79.5%. The text representation based on Word2Vec has the lowest indicators, mainly because Word2Vec expresses the semantic relationships between words well but ignores long-distance semantic associations. Overall, the patient readmission risk prediction model A-BBL proposed in this paper has clear advantages over the comparative models.

Conclusion
This paper studies the patient readmission risk prediction task and, following the conventional criterion of medical research, defines the problem as judging whether a discharged patient will be readmitted for treatment within 30 days. Based on the discharge summary text in the MIMIC-III electronic health record data set, a patient readmission risk prediction model, A-BBL (BioBERT-BiLSTM-Attention), was constructed to learn what readmitted patients have in common and to predict the readmission risk of discharged patients. Extensive experiments show that the proposed A-BBL model achieves higher recall, accuracy and F1 values in predicting ICU patients' readmission and is significantly better than the other models. However, due to the limitations of the data, the prediction model did not achieve the expected results. In the future, more data sets will be sought to verify the model, and new patient readmission risk prediction models will be studied to obtain better prediction results and provide better diagnosis and treatment services for patients.