Award Date

8-15-2025

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computer Science

First Committee Member

Kazem Taghva

Second Committee Member

Laxmi Gewali

Third Committee Member

Wolfgang Bein

Fourth Committee Member

Mingon Kang

Fifth Committee Member

Emma Regentova

Number of Pages

110

Abstract

This dissertation demonstrates that carefully adapted language-model pipelines can transform unstructured clinical-trial and pharmacological prose into reliable, low-latency structured data. Four interconnected studies support this claim.

Tri-AL platform. An open-source dashboard ingests all 440k+ ClinicalTrials.gov records, including every historical revision, into a normalized schema and parses the 20 GB XML archive more than 10x faster than a BeautifulSoup baseline, while exposing hooks for demographic analytics and supporting integration of user-defined modules.

Clinical trial summarization. An encoder–decoder model is trained on 57k description–summary pairs to condense clinical trials into a few sentences. ROUGE evaluation shows a 20% improvement over the baseline, while graph-based evaluation indicates the model preserves 71% of critical biomedical entities, yielding concise yet informative summaries suitable for evidence scans.

MoA (mechanism of action) classification. A collection of models, including traditional classifiers (decision trees, random forests, XGBoost) and contrastively fine-tuned masked-language-model variants, achieves a macro F1 of 97%, effectively handling class imbalance and drug-class sparsity while also providing interpretable insights.

Scalable medical NER pipeline. A dynamic and scalable pipeline is introduced for training lightweight Named Entity Recognition (NER) models adaptable to different entity types. Knowledge distillation compresses the large teacher model into a 110M-parameter student that retains 70% of gold-label accuracy (F1=0.61) while running 1000x faster and consuming just 6% of the memory.

Collectively, these contributions provide scalable tools and empirical evidence that domain-specific NLP methods can be integrated to accelerate trial discovery, enhance drug-development analytics, and support data-driven clinical decision-making.
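The abstract reports a macro F1 for the MoA classifiers and ties it to class imbalance. As a minimal sketch of why that metric suits imbalanced labels, the snippet below computes macro F1 as the unweighted mean of per-class F1 scores. The function name `macro_f1` and the toy labels are hypothetical illustrations, not code or data from the dissertation.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 8 samples of a common class, 2 of a rare class,
# and a classifier that always predicts the common class.
y_true = ["inhibitor"] * 8 + ["agonist"] * 2
y_pred = ["inhibitor"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.2f} macro_f1={score:.2f}")
```

In this toy case the always-majority classifier looks acceptable on accuracy, but its macro F1 is much lower because the rare class contributes an F1 of zero with the same weight as the common class.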

Keywords

clinical trials; information extraction; Tri-AL

Disciplines

Artificial Intelligence and Robotics | Computer Engineering

File Format

pdf

File Size

3700 KB

Degree Grantor

University of Nevada, Las Vegas

Language

English

Repository Citation

Nahed, Pouyan, "Investigating Information Extraction and Language Models in Medical Domain Text Processing" (2025). UNLV Theses, Dissertations, Professional Papers, and Capstones. 5391.
http://dx.doi.org/10.34917/39385617

Rights

IN COPYRIGHT. For more information about this rights statement, please visit http://rightsstatements.org/vocab/InC/1.0/

Included in

Artificial Intelligence and Robotics Commons, Computer Engineering Commons

UNLV Theses, Dissertations, Professional Papers, and Capstones

Investigating Information Extraction and Language Models in Medical Domain Text Processing
