Award Date
August 2025
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computer Science
First Committee Member
Kazem Taghva
Second Committee Member
Laxmi Gewali
Third Committee Member
Wolfgang Bein
Fourth Committee Member
Mingon Kang
Fifth Committee Member
Emma Regentova
Number of Pages
110
Abstract
This dissertation demonstrates that carefully adapted language-model pipelines can transform unstructured clinical-trial and pharmacological prose into reliable, low-latency structured data. Four interconnected studies support this claim.
Tri-AL platform. An open-source dashboard ingests all 440k+ ClinicalTrials.gov records, including every historical revision, into a normalized schema. It parses the 20 GB XML archive more than 10x faster than a BeautifulSoup baseline, exposes hooks for demographic analytics, and supports integration of user-defined modules.
Clinical trial summarization. An encoder–decoder model is trained on 57k description–summary pairs to condense clinical trials into a few sentences. ROUGE evaluation shows a 20% improvement over the baseline, and graph-based evaluation indicates that the model preserves 71% of critical biomedical entities, yielding concise yet informative summaries suitable for evidence scans.
MoA classification. A collection of models, including traditional classifiers (decision trees, random forests, XGBoost) and contrastively fine-tuned masked-language-model variants, achieves a macro F1 of 97%, effectively handling class imbalance and drug-class sparsity while also providing interpretable insights.
Scalable medical NER pipeline. A dynamic and scalable pipeline is introduced for training lightweight Named Entity Recognition (NER) models adaptable to different entity types. Knowledge distillation compresses the large teacher model into a 110M-parameter student that retains 70% of gold-label accuracy (F1 = 0.61) while running 1000x faster and consuming just 6% of the memory.
Collectively, these contributions provide scalable tools and empirical evidence that domain-specific NLP methods can be integrated to accelerate trial discovery, enhance drug-development analytics, and support data-driven clinical decision-making.
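As a reading aid for the macro F1 figure reported for MoA classification, the following is a minimal sketch of how macro-averaged F1 is commonly computed: per-class F1 scores are averaged without weighting by class size, which is why the metric is sensitive to performance on rare classes. The function name `macro_f1` and the toy labels are hypothetical illustrations and are not taken from the dissertation.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, imbalanced toy data: 8 examples of one class, 2 of another.
y_true = ["kinase_inhibitor"] * 8 + ["antagonist"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["kinase_inhibitor"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro F1 = {macro_f1(y_true, y_pred):.2f}")
```

In this toy case the always-majority classifier reaches 80% accuracy but a much lower macro F1, because the minority class contributes an F1 of zero to the average. A high macro F1 therefore suggests that rare classes are also being predicted well.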
Keywords
clinical trials; information extraction; Tri-AL
Disciplines
Artificial Intelligence and Robotics | Computer Engineering
File Format
Degree Grantor
University of Nevada, Las Vegas
Language
English
Repository Citation
Nahed, Pouyan, "Investigating Information Extraction and Language Models in Medical Domain Text Processing" (2025). UNLV Theses, Dissertations, Professional Papers, and Capstones. 5391.
http://dx.doi.org/10.34917/39385617
Rights
IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/