Is machine learning (ML) helping to optimize clinical trials?
According to a study conducted by researchers at MIT, machine learning, artificial intelligence, and data analytics can help predict the outcomes of clinical trials, leading to faster approval of drugs and vaccines at lower cost. To predict trial outcomes, the researchers assembled a large dataset and developed machine-learning algorithms to analyze more than 140 features, such as trial status, accrual rates, duration, and sponsor track records. They also applied statistical techniques to estimate missing values and produce more accurate predictions (Battineni et al., 2019). The study achieved predictive scores of 0.78 for forecasting transitions from phase 2 to approval and 0.81 for transitions from phase 3 to approval. Such predictions can reduce the uncertainty of drug development and increase the amount that investors are willing to commit to clinical trials. The study showed that these algorithms could play a significant role in advancing new treatments from phase 2 or phase 3 to regulatory approval.
These algorithms can be run within analytical modeling and simulation technology that can help the pharmaceutical industry bring sustainable treatments to the public in a shorter time. Modeling and simulation technology can transform the entire drug-development process and save lives. Licensing a drug can take up to 15 years because it must pass through the standard clinical trial phases and procedures. Under US Food and Drug Administration (FDA) standards and regulations, only about one out of every ten drugs developed during research is approved for medicinal use, even though substantial resources have been spent on research and trials. Integrating advanced technological processes and optimizing clinical research and trial procedures will shorten the timeline for developing life-saving drugs at a reasonably affordable cost. The coronavirus that emerged in December 2019 was named COVID-19 by the World Health Organization (WHO), which in February 2020 termed it a global threat and declared the outbreak a global health emergency. There is no specific treatment for the virus, which has affected more than 20 million people globally and more than 4 million American citizens.
Global coronavirus statistics as of April 2020
Machine learning supports disease identification by utilizing available image and textual data. It requires big data to classify and predict disease patterns, and it is useful for analyzing the nature of COVID-19 across the globe. The current pandemic has prompted several researchers and scientists to help solve the problem using X-ray image data provided by Johns Hopkins University, creating models that classify the images as COVID-19 or not (Zhang et al., 2020). The image data is converted into metadata and integrated with clinical reports in text form for easier categorization of the disease, helping to detect the type of coronavirus from early stages and symptoms.
The role of advanced analytics in the development of a COVID-19 vaccine
COVID-19 is a prominent case of applying machine learning and artificial intelligence to optimize clinical trials for drugs and vaccines. Researchers have used these tools to optimize the whole process, from tracking hospital capacity to identifying high-risk patients. The purpose of advanced analytics is to connect a variety of sources, ranging from research papers to clinical trials and drug development, leveraging patented Natural Language Processing (NLP) techniques and curated classifications to extract context and broad insights and to understand the spread of the disease. Researchers believed these technologies would help prepare for similar circumstances in the future; however, the disease outpaced the technology, showing that it still needs more effort and maturity to resolve the pandemic. The quality of the data, the methods of accessing it, and the network for sharing it all determine the accuracy of an algorithm. Scientists and experts across technology and health research departments have been working together to find a vaccine for COVID-19, incorporating artificial intelligence in their research to beat the virus.
Artificial Intelligence (AI) is being relied on as the hope in clinical trials to make a difference and to learn the behaviour of the virus in the body. Natural language processing (NLP), a branch of AI, allows software to read and analyze written and spoken words. In healthcare and medicine, NLP allows a computer program to search doctors' notes and pathology reports for potential participants in a clinical trial (Charles & Emrouznejad, 2019). Unstructured data is the problem in this case: the text is usually free-flowing, and information may be implicit and require background knowledge to be understood (Shi et al., 2020). Doctors have several ways of describing the same illness; for example, diabetes may be recorded as diabetes mellitus, and a heart attack may be described as a myocardial infarction or a myocardial infarct (MI). An NLP program can be trained to map such terms and group them all under one disease, and the resulting algorithm can then be used to interpret unannotated records. Many open-source web tools help researchers and administrators search databases without needing a technical background; these programs translate plain-language queries into a standardized, coded query format that the database can understand, and much work is being done to make this task easier.
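The term-mapping idea above can be sketched in a few lines: synonyms found in free-text notes are normalized to one canonical disease label. The synonym lists here are invented stand-ins for a real clinical ontology.

```python
# Minimal sketch of NLP-style term normalization; the synonym sets are
# illustrative placeholders, not a real clinical vocabulary.
SYNONYMS = {
    "diabetes": {"diabetes", "diabetes mellitus", "dm type 2"},
    "heart attack": {"heart attack", "myocardial infarction",
                     "myocardial infarct", "mi"},
}

def normalize(term: str) -> str:
    """Return the canonical disease name for a free-text term, or the term itself."""
    t = term.lower().strip()
    for canonical, variants in SYNONYMS.items():
        if t in variants:
            return canonical
    return t

print(normalize("Myocardial Infarction"))  # -> heart attack
```

A production system would learn these mappings from annotated records rather than hard-coding them, but the lookup structure is the same.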
Methodology
The procedure includes five steps: (1) data collection, (2) definition and refining of the data, (3) preprocessing, (4) feature extraction, and (5) traditional and ensemble machine learning algorithms. The data in the proposed methodology is represented in charts and graphs.
- Data collection
Research centres, hospitals, and other health facilities have given access to data about the pandemic via open-source repositories such as GitHub, which were used for data collection and analysis in this research. We obtained data on about 212 patients showing signs and symptoms of the coronavirus and other viruses. The data had several attributes, including patient ID, age, temperature, name, lymphocyte count, neutrophil count, leukocyte count, offset, pO2_saturation, sex, finding, survival, intubated, date, location, view, folder, went_ICU, needed_supplemental_O2, extubated, modality, and DOI.
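Reading such records into memory is straightforward; the snippet below uses a tiny, invented CSV fragment with a subset of the attributes listed above, standing in for the ~212-patient dataset pulled from GitHub.

```python
import csv
import io

# Invented two-row sample covering a few of the attributes named above;
# a real run would read the repository's CSV file instead.
raw = """patientid,age,sex,temperature,finding,survival
1,65,M,38.6,COVID,Y
2,48,F,37.2,SARS,Y
"""

rows = list(csv.DictReader(io.StringIO(raw)))
print(len(rows), rows[0]["finding"])  # -> 2 COVID
```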
- Relevant datasets
In the data collection and analysis processes, several algorithms were used to define and refine the relevant datasets extracted from clinical notes and findings for preprocessing. The datasets cover ARDS, SARS, COVID, and both (COVID and ARDS), as shown in the graph below.
- Preprocessing
Preprocessing involves the procedures required to refine the data so that machine learning can be applied, following various steps in a phased manner. This stage involves deleting unnecessary text, punctuation, symbols, stop words, and links to enhance the accuracy of the data, as shown in the image below.
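A minimal cleaning pass in the spirit of this step might look as follows; the stop-word list is a tiny stand-in for a full one (e.g. the STOPWORDS collection mentioned later).

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a full set.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "with"}

def clean(text: str) -> str:
    """Strip links, punctuation/symbols, and stop words from clinical text."""
    text = re.sub(r"https?://\S+", " ", text)   # remove links
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # remove punctuation and symbols
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

print(clean("Patient presents with ARDS; see https://example.org!"))
# -> patient presents ards see
```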
After refining and defining the data, specific features are extracted according to predetermined semantics and then converted into probability values using TF-IDF techniques. In this case, 40 features were identified and categorized as input for the machine learning algorithms.
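The feature-extraction step can be sketched with scikit-learn's TF-IDF vectorizer, capped at 40 features as in the study; the clinical notes below are invented examples.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins for cleaned clinical notes.
notes = [
    "fever cough bilateral infiltrates covid suspected",
    "acute respiratory distress ards on ventilation",
    "severe acute respiratory syndrome sars confirmed",
]

# max_features=40 mirrors the 40 features used in the study.
vectorizer = TfidfVectorizer(max_features=40)
X = vectorizer.fit_transform(notes)
print(X.shape)  # (documents, terms), with at most 40 terms
```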
The machine learning categorization
The categories correspond to four distinct conditions in the given text: ARDS, SARS, COVID (patient with coronavirus), and both (COVID and ARDS). Different supervised ML algorithms were applied across all categories, including Multinomial Naïve Bayes (MNB), support vector machine (SVM), decision tree, random forest, stochastic gradient boosting, and logistic regression.
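Fitting several of the listed classifiers on TF-IDF features follows a uniform pattern in scikit-learn; the texts and labels below are invented stand-ins for the clinical reports.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy corpus and labels mirroring the four categories in the study.
texts = ["fever cough covid", "ards ventilation", "sars outbreak", "covid ards both"]
labels = ["COVID", "ARDS", "SARS", "both"]

X = TfidfVectorizer().fit_transform(texts)

# Three of the supervised algorithms named above, with default settings.
models = {"LR": LogisticRegression(), "MNB": MultinomialNB(), "SVM": LinearSVC()}
for name, model in models.items():
    model.fit(X, labels)
    print(name, model.predict(X[:1]))
```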
The logistic regression algorithm uses the relationship between the numerical variables and the class label to make predictions, calculating the probability of class membership.
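In its standard form, this probability is the sigmoid of a linear combination of the features:

```latex
P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}
```

Here the $\beta_i$ are coefficients learned from the training data and the $x_i$ are the feature values for a given report.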
Multinomial Naïve Bayes uses the Bayes rule to compute the class probabilities of the provided text.
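In its standard form, the probability of class $c$ for a document with terms $t_1, \ldots, t_n$ is proportional to the class prior times the per-term likelihoods, assuming terms are conditionally independent:

```latex
P(c \mid d) \propto P(c) \prod_{i=1}^{n} P(t_i \mid c)
```

The classifier then assigns the class with the highest resulting probability.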
The support vector machine (SVM) is a supervised ML algorithm that groups text into various categories to construct a classifier. In this study, the 40 features selected during feature engineering, with their values, were represented in the form of a table and used as input.
Results and conclusion
Tables
Charts
The system used during the research had the following specifications: a Microsoft Windows operating system, a 3.88 GHz processor, and 6 GB of RAM. The scikit-learn tool was used to execute the machine learning categorization, with the help of several libraries, such as STOPWORDS, to improve the accuracy and correctness of the algorithm pipeline. Deeper insights into the data were obtained from statistical computations on the datasets, with 70% of the data used for model training and the other 30% used to test the model (Khanday et al., 2020). The classification used ML algorithms supplied with features from the feature engineering step; in addition, to explore how the model generalizes from training data to unseen data and to minimize the chances of overfitting, we split the original dataset into separate training and test subsets.
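The 70/30 split described above corresponds directly to scikit-learn's `train_test_split`; the arrays here are placeholders.

```python
from sklearn.model_selection import train_test_split

# Placeholder feature vectors and labels standing in for the real dataset.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# 70% training / 30% testing, as in the study; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
print(len(X_train), len(X_test))  # -> 7 3
```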
Each algorithm underwent tenfold cross-validation five to six times independently to ensure that no bias would arise during partitioning of the dataset in the validation process. Table 1 (above) provides a comparative analysis of all classical ML methods used in the task, while Table 2 (above) compares classical ML and ensemble learning methods used during the classification of clinical text into the four groups. After training and testing the algorithms and models, the logistic regression and multinomial Naïve Bayes classifiers gave the best results, with 94% precision, 96% recall, an F1-score of 95%, and 96.2% accuracy. Random forest and gradient boosting had relatively good results of 94% accuracy each. The model was evaluated in two stages to obtain a realistic accuracy level, reaching 75% accuracy in phase 1, where less data was used. The accuracy rose in phase 2, where all the data was used; it is therefore clear that the more data we provide to the model, the more accurate the results and the better the performance.
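Tenfold cross-validation of this kind can be sketched on synthetic data; repeating the whole procedure with different shuffles, as the study did, guards against a lucky or unlucky partition.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the clinical dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# cv=10 performs the tenfold cross-validation described above.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(len(scores), round(scores.mean(), 3))
```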
Conclusion
The lack of a vaccine or drug to treat COVID-19 has caused great concern among researchers and scientists. However, various researchers and institutions are working closely to find a cure for the virus. After running the ML algorithms on the 212 patients' clinical reports across the four categories (COVID, SARS, ARDS, and both [COVID and ARDS]) and classifying the reports, it was clear that the logistic regression and multinomial Naïve Bayes classifiers give the best results, with 94% precision, 96% recall, an F1-score of 95%, and 96.2% accuracy. The other algorithms showed reasonable results, but they could not be relied on as much. Increasing the amount of data can enhance the models' efficiency. The information could also be analyzed and classified by gender to determine which gender is most affected, and to find the influencing factors and appropriate countermeasures. In the future, more feature engineering is needed to get better results, along with deep learning for the models.
References
Khanday, A. M. U. D., Rabani, S. T., Khan, Q. R., et al. (2020). Machine learning based approaches for detecting COVID-19 using clinical text data. International Journal of Information Technology. https://doi.org/10.1007/s41870-020-00495-9
Charles, V., & Emrouznejad, A. (2019). Big Data for the Greater Good: An Introduction. In Big Data for the Greater Good (pp. 1-18). Springer, Cham.
Zhang, K., Liu, X., Shen, J., Li, Z., Sang, Y., Wu, X., … & Ye, L. (2020). Clinically applicable AI system for accurate diagnosis, quantitative measurements, and prognosis of COVID-19 pneumonia using computed tomography. Cell.
Shi, F., Wang, J., Shi, J., Wu, Z., Wang, Q., Tang, Z., … & Shen, D. (2020). Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Reviews in Biomedical Engineering.
Battineni, G., Chintalapudi, N., & Amenta, F. (2019). Machine learning in medicine: Performance calculation of dementia prediction by support vector machines (SVM). Informatics in Medicine Unlocked.