
Benchmarking Moss AI’s Clinical Recommendation System (CRS)

1. Executive Summary

The integration of Artificial Intelligence (AI) into clinical decision-making workflows has created significant opportunities to augment healthcare delivery, enabling rapid, evidence-based recommendations in time-sensitive environments. Clinical Recommendation Systems (CRS) leverage patient data to assist healthcare providers in generating accurate diagnoses, identifying relevant actions, and optimizing treatment strategies.


We benchmarked our CRS using five diverse datasets — Acibench, Primock57, Cliniknote, HealthBench, and the eka.care Medical History dataset — covering a mix of conversation-based and structured prompt–completion formats. The evaluation was designed to simulate realistic decision-support scenarios and measure the system’s ability to infer clinically relevant information from limited patient-provided input.


Performance was assessed using BERTScore with the emilyalsentzer/Bio_ClinicalBERT model to capture semantic similarity between predicted and gold-standard outputs. Results indicate that Moss AI’s CRS consistently aligns with reference clinical decisions across varied dataset structures and medical domains, with particularly strong recall for key clinical facts. While high recall ensures comprehensive coverage, occasional inclusion of clinically related but non-reference details affects precision-oriented metrics.


This benchmarking effort provides a foundational performance baseline for our CRS, demonstrating its robustness across heterogeneous medical data sources and identifying opportunities for further enhancement, including improved filtering of ancillary predictions and integration of confidence scoring.


2. Introduction

Clinical Recommendation Systems (CRS) play a pivotal role in bridging the gap between raw clinical data and actionable medical decisions. By synthesizing patient information — whether from structured health records or unstructured clinical narratives — CRS tools can assist healthcare professionals in making timely, evidence-based decisions that directly impact patient outcomes.


With the increasing digitization of healthcare and the widespread adoption of Electronic Health Records (EHRs), the volume of available patient data has grown exponentially. While this data provides immense potential for improving diagnosis and treatment, manually deriving decision-support insights is both time-intensive and susceptible to human variability. As such, AI-powered CRS solutions have emerged as critical enablers for enhancing efficiency, consistency, and scalability in clinical decision-making.


The CRS evaluated in this study is designed to extract, infer, and validate clinically relevant information from varied input sources, including doctor–patient conversation transcripts and structured patient data summaries. Given the complexity of clinical decision-making — and the high stakes associated with incorrect or incomplete recommendations — it is essential to rigorously benchmark such systems for accuracy, reliability, and robustness across diverse data types and domains.


This benchmarking effort draws on five heterogeneous datasets — Acibench, Primock57, Cliniknote, HealthBench, and the Medical History dataset from eka.care — each selected or adapted to represent distinct clinical interaction styles and decision-support challenges. By combining conversation-based datasets with structured prompt–completion formats, the evaluation captures a realistic range of scenarios a CRS may encounter in deployment.


Effectiveness is measured using BERTScore with the emilyalsentzer/Bio_ClinicalBERT model, enabling the assessment of semantic similarity between CRS-generated outputs and gold-standard clinical decisions. This approach prioritizes clinically meaningful equivalence over exact word matching, ensuring that conceptually accurate but lexically varied outputs are appropriately recognized.


This paper outlines the methodology, results, and analysis of the benchmarking study, establishing a foundational performance baseline for the CRS and identifying targeted areas for future enhancement.


3. Dataset Overview


3.1 Dataset Description

This benchmarking study leverages five distinct datasets, each selected to capture different facets of clinical decision-making and to test the CRS across varied input formats, domains, and complexity levels.

  1. Acibench – A conversation-based dataset containing doctor–patient dialogues. Clinician turns often include explicit decision statements, while patient turns provide the context from which decisions must be inferred.

  2. Primock57 – A similar doctor–patient conversation dataset with diverse medical scenarios, designed to test the system’s ability to generalize across multiple specialties.

  3. Cliniknote – Another conversation-driven dataset with richer narrative styles and longer consultations, enabling evaluation of performance on extended, multi-turn contexts.

  4. HealthBench – A prompt–completion dataset containing structured “prompt” fields with partial patient information and corresponding “ideal completions” representing the target decision-support outputs.

  5. Medical History Summarization Dataset (eka.care) – A large-scale, structured dataset containing key medical history details in the “prompt” column and expected outputs in the “ideal completions” column, simulating real-world patient intake and history-taking scenarios.

Given that no single benchmark directly targets end-to-end CRS capabilities, each dataset was adapted for this evaluation. For conversation-based datasets (Acibench, Primock57, Cliniknote), patient turns were extracted and used as prediction inputs, while full conversations containing doctor turns served as gold-standard references. For prompt–completion datasets (HealthBench, eka.care), the prompt text acted as input, and the ideal completion served as the reference output.

This combination of datasets provides both controlled, well-structured cases and noisy, real-world-like scenarios, enabling a holistic assessment of the CRS’s robustness, adaptability, and semantic accuracy across multiple clinical communication formats.


3.2 Overview

Dataset name | Source | Number of samples
HealthBench | [2] | 3891
Primock57 | [3] | 57
ACIBench | [4] | 205
CliniKnote | [5] | 20
Eka.care (MHSD) | [6] | 56

3.3 File Structure

This section outlines the dataset structures and preparation steps used for benchmarking the CRS across both conversation-based and prompt–completion datasets.


3.3.1 Conversation-Based Datasets

The Acibench, Primock57, and Cliniknote datasets were originally structured with multiple metadata columns, including:

  • Conversation ID – Unique identifier for each doctor–patient conversation.

  • Conversations – Full turn-based dialogue between doctor and patient.

  • Medical Record – Supplemental structured clinical data (not directly used).

For benchmarking purposes, the Conversations column served as the primary input source. From each conversation, two distinct subsets were extracted:

  • Patient Turns – Containing only the patient’s utterances; used as prediction inputs to the CRS API.

  • Full Conversations – Containing both patient and doctor turns; used as reference inputs to the CRS API, with doctor turns providing the gold-standard clinical decisions.


3.3.2 Prompt–Completion Datasets

The HealthBench and Medical History Dataset (eka.care) were provided with structured columns, including:

  • Prompt ID – Unique identifier for each record.

  • Ideal_Completions_Data – Text representing the target clinical decision or structured information.

  • Prompts – Input text containing partial patient details or case context.

  • Rubrics – Evaluation criteria (not used in this benchmarking).

For the evaluation, the following fields were extracted:

  • Ideal_Completions_Data – Used as the reference input to the CRS API.

  • Prompts – Used as the prediction input to the CRS API.


3.4 Preprocessing Details

Given the absence of a purpose-built dataset for CRS benchmarking, multiple existing datasets were repurposed through a standardized preprocessing pipeline to produce reference inputs and prediction inputs suitable for evaluation.


3.4.1 Conversation-Based Datasets (Acibench, Primock57, Cliniknote)

  • Source Column: Conversations — containing turn-based doctor–patient dialogues.

  • Reference Input: Full conversation text, including both doctor and patient turns (doctor turns containing clinical decisions served as gold-standard references).

  • Prediction Input: Patient-only conversation turns, extracted from the full conversation.

  • For each dataset, a new CSV file was created with the following columns:

Column Name | Description
reference_input | Full doctor–patient conversation
prediction_input | Extracted patient-only conversation turns
This format was consistently applied across all conversation-based datasets.
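A minimal sketch of this step is shown below. It assumes the raw CSV exposes a conversations column in which each utterance begins with a speaker tag such as "[patient]" or "patient:"; the tag convention, column names, and file names are illustrative assumptions rather than the exact pipeline.

import pandas as pd

# Assumed speaker markers; the actual tags vary between Acibench, Primock57, and Cliniknote.
PATIENT_TAGS = ("[patient]", "patient:")

def extract_patient_turns(conversation: str) -> str:
    """Keep only the utterances spoken by the patient."""
    patient_lines = [
        line.strip()
        for line in conversation.splitlines()
        if line.strip().lower().startswith(PATIENT_TAGS)
    ]
    return "\n".join(patient_lines)

def build_conversation_benchmark(raw_csv: str, out_csv: str) -> None:
    """Write the two-column benchmark CSV described above."""
    df = pd.read_csv(raw_csv)
    curated = pd.DataFrame({
        # Full dialogue (doctor + patient turns) serves as the gold-standard reference.
        "reference_input": df["conversations"],
        # Patient-only turns serve as the prediction input to the CRS API.
        "prediction_input": df["conversations"].apply(extract_patient_turns),
    })
    curated.to_csv(out_csv, index=False)

# Example usage (hypothetical file names):
# build_conversation_benchmark("acibench_raw.csv", "acibench_benchmark.csv")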


3.4.2 Prompt–Completion Datasets (HealthBench, Medical History Dataset – eka.care)

HealthBench

  • Source Column: ideal_completions_data (JSON) — containing:

    • ideal_completion – Expected output text.

    • ideal_completions_group – Grouping metadata (not used).

    • ideal_completions_ref_completions – Additional reference completions.

    • Processing: Extracted strings from ideal_completion and ideal_completions_ref_completions, merged them, and used the result as reference_input.

  • Source Column: prompts (JSON) — containing:

    • content – Input text describing the clinical case.

    • user – User metadata (not used).

    • Processing: Extracted string from content as prediction_input.

  • Curated into a new CSV with columns:

Column Name | Description
reference_input | Merged strings from ideal_completion and ideal_completions_ref_completions
prediction_input | Extracted string from content in prompts

Medical History Summarization Dataset (eka.care)

  • Source Column: ideal_completions_data (JSON) — containing:

    • ideal_completion – Expected output text.

    • Processing: Extracted string from ideal_completion as reference_input.

  • Source Column: prompts (JSON) — containing:

    • content – Input text describing the case.

    • role – Metadata (not used).

    • Processing: Extracted string from content as prediction_input.

  • Curated into a new CSV with columns:

Column Name | Description
reference_input | Extracted string from ideal_completion
prediction_input | Extracted string from content in prompts
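The extraction for both prompt–completion datasets can be sketched in a few lines. The sketch below assumes the ideal_completions_data and prompts columns hold JSON-encoded strings with the fields listed above; the parsing details and function names are illustrative assumptions rather than the exact pipeline.

import json
import pandas as pd

def extract_reference(raw: str) -> str:
    """Merge ideal_completion with any additional reference completions (HealthBench only)."""
    data = json.loads(raw)
    parts = [data.get("ideal_completion", "")]
    parts += data.get("ideal_completions_ref_completions") or []
    return "\n".join(p for p in parts if p)

def extract_prediction(raw: str) -> str:
    """Pull the clinical case text out of the prompts field."""
    data = json.loads(raw)
    # The prompts field may hold a single message object or a list of them.
    if isinstance(data, list):
        return "\n".join(msg.get("content", "") for msg in data)
    return data.get("content", "")

def build_prompt_completion_benchmark(raw_csv: str, out_csv: str) -> None:
    """Write the two-column benchmark CSV described above."""
    df = pd.read_csv(raw_csv)
    curated = pd.DataFrame({
        "reference_input": df["ideal_completions_data"].apply(extract_reference),
        "prediction_input": df["prompts"].apply(extract_prediction),
    })
    curated.to_csv(out_csv, index=False)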


4. Data Exploration

Given that the datasets were custom-curated for this CRS benchmarking study, a conventional exploratory data analysis was not required. Instead, the focus was on data profiling to assess the diversity, coverage, and representation of key clinical parameters across multiple datasets.

This step was critical to ensure that each dataset contained sufficient variability in its content to meaningfully benchmark the CRS across different clinical scenarios.


4.1 Parameter Coverage Analysis

We examined the number of unique values for each parameter across all datasets to better understand the data diversity. The results are summarized below:


Fig 1. Distribution of samples across parameters from HealthBench
Fig 2. Distribution of samples across parameters from Primock57
Fig 3. Distribution of samples across parameters from ACIBench
Fig 4. Distribution of samples across parameters from CliniKnote
Fig 5. Distribution of samples across parameters from Eka.care MHS
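For reproducibility, the coverage counts behind these figures can be derived with a short profiling script. The sketch below assumes a long-format profile file with one row per (sample_id, parameter, value) triple — a hypothetical layout used only to illustrate the counting step.

import pandas as pd

def parameter_coverage(profile_csv: str) -> pd.DataFrame:
    """Count samples and unique values per clinical parameter.

    Assumes a long-format CSV with sample_id, parameter, and value columns;
    this layout is hypothetical and shown only to illustrate the profiling step.
    """
    df = pd.read_csv(profile_csv)
    return df.groupby("parameter").agg(
        samples=("sample_id", "nunique"),
        unique_values=("value", "nunique"),
    ).sort_values("samples", ascending=False)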

4.2 Observations

  • High variability in total entries and clinical coverage across datasets.

  • Certain critical safety parameters (e.g., drug-to-drug interactions, potential drug allergies) had very low representation, highlighting potential evaluation biases for rare but important scenarios.

  • Richer datasets like HealthBench provided strong coverage across almost all parameters, while smaller datasets like Primock57 and Cliniknote were more limited but still valuable for scenario-specific evaluation.

This profiling confirms that the dataset pool offers both breadth (covering many parameters) and depth (rich examples in some datasets), making it suitable for benchmarking CRS performance across varied clinical decision-making tasks.


5. Evaluation Metrics

To assess the performance of the CRS system in a clinically meaningful way, we focused on a semantic similarity–based evaluation rather than purely lexical or token-level matching. This decision was motivated by the nature of CRS outputs, which may use clinically equivalent but lexically different expressions that traditional metrics like accuracy or recall would penalize unfairly.


5.1 Selected Metric

Metric: BERTScore (BioClinicalBERT)

Purpose: Measures semantic similarity between predicted outputs and reference outputs using contextual embeddings.

Reason for Inclusion: Captures domain-specific clinical meaning beyond exact text matches, ensuring that medically equivalent but differently phrased outputs are recognized as correct.


5.2 Metric Computation Strategy

  • Model Used: BioClinicalBERT — a domain-tuned BERT model trained on biomedical and clinical text corpora.

  • Computation Steps (sketched in code after this list):

    • Extract token embeddings from both predicted and reference outputs.

    • Compute pairwise cosine similarity between embeddings.

    • Aggregate into precision, recall, and F1 scores, with F1 used as the primary evaluation metric.

  • Interpretation:

    • High BERTScore indicates that the model’s predictions are semantically aligned with the reference, even if the exact words differ.

    • Lower scores highlight clinically or contextually incorrect outputs.
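A minimal sketch of these steps using the open-source bert_score package is shown below; the layer choice and the corpus-level averaging are assumptions for illustration rather than the exact configuration used in this study.

from bert_score import score

def bertscore_clinical(predictions, references):
    """Compute corpus-level BERTScore P/R/F1 with Bio_ClinicalBERT embeddings."""
    # num_layers must be given for models outside bert_score's built-in defaults;
    # 12 (the final layer of a BERT-base encoder) is an assumption here.
    P, R, F1 = score(
        predictions,
        references,
        model_type="emilyalsentzer/Bio_ClinicalBERT",
        num_layers=12,
        lang="en",
    )
    return P.mean().item(), R.mean().item(), F1.mean().item()

# Example usage:
# p, r, f1 = bertscore_clinical(["start metformin 500 mg"], ["initiate metformin 500 mg daily"])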


5.3 Why Only BERTScore Was Chosen

  • CRS outputs often involve paraphrased recommendations, varied clinical terminologies, and context-dependent phrasing, which makes token-based metrics (e.g., Match Ratio, Jaccard) less meaningful.

  • Many parameters in the evaluation involve full-sentence or paragraph-level advice, where semantic fidelity is more important than surface-level similarity.

  • BERTScore with BioClinicalBERT ensures that the metric reflects true clinical equivalence rather than just exact wording overlap.


5.4 Metrics Not Used & Justification

Metric

Reason for Exclusion

Accuracy/ Precision

Not suitable for multi-sentence or variable-length free-text outputs; can be misleading when wording differs but meaning is preserved.

Hamming Loss

Designed for discrete label sets; not applicable to natural language recommendation outputs.

ROUGE/ BLEU

Focused on n-gram overlap; inadequate for capturing semantic correctness in clinical decision-making text.


6. Results and Observations

This section presents the quantitative results from benchmarking the CRS system across the curated datasets described earlier, using BERTScore with BioClinicalBERT as the sole evaluation metric.


6.1 Dataset-level Summary

Dataset | Avg. Precision | Avg. Recall | Avg. F1 | Std. Dev. Precision | Std. Dev. Recall | Std. Dev. F1
HealthBench | 0.9188 | 0.9155 | 0.9163 | 0.0783 | 0.0770 | 0.0837
Primock57 | 0.9117 | 0.9137 | 0.9119 | 0.0749 | 0.0752 | 0.0718
ACIBench | 0.9002 | 0.8983 | 0.8984 | 0.0705 | 0.0720 | 0.0684
CliniKnote | 0.8944 | 0.9050 | 0.8987 | 0.0844 | 0.0879 | 0.0833
Eka.care (MHSD) | 0.8415 | 0.8560 | 0.8474 | 0.1076 | 0.0994 | 0.1002

Note: Values are averages over the 14 parameters for each dataset.
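These summary figures can be reproduced by averaging the per-parameter BERTScore results. A minimal sketch follows, assuming a per-sample results file with dataset, parameter, precision, recall, and f1 columns — a hypothetical layout used only for illustration.

import pandas as pd

def summarise_results(results_csv: str) -> pd.DataFrame:
    """Aggregate per-sample BERTScore results into dataset-level means and standard deviations."""
    df = pd.read_csv(results_csv)
    # Average within each (dataset, parameter) pair first, then aggregate over the 14 parameters.
    per_param = df.groupby(["dataset", "parameter"])[["precision", "recall", "f1"]].mean()
    return per_param.groupby("dataset").agg(["mean", "std"]).round(4)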



6.2 Qualitative Observations

Diagnosis & Plan Prediction

Across all datasets, the CRS achieves consistently strong performance in diagnosis and plan prediction, with F1-scores in the 0.77–0.93 range. Notably, Primock57 and CliniKnote yield the highest scores (F1 ≈ 0.92–0.93), reflecting their smaller, high-quality sample sets. Larger-scale benchmarks like HealthBench (F1 ≈ 0.87) show slightly lower averages but greater stability across thousands of samples. This suggests that while model generalization is robust, smaller curated datasets allow the CRS to approach near-clinician-level accuracy.

Risk & Complication Assessment

Performance in predicting risk complications is similarly strong (F1 ≈ 0.80–0.85 across datasets). The relatively low standard deviation across benchmarks (<0.07) indicates high reliability. This stability is encouraging given the clinical importance of anticipating adverse outcomes, where both precision and recall are critical.

Drug Safety (Interactions, Allergies, Incorrect Medications)

The CRS demonstrates near-ceiling performance in drug-to-drug interactions, drug allergy detection, and incorrect medication identification, with precision, recall, and F1 often >0.99 in HealthBench and ACIBench. Even smaller datasets such as Eka.care MHS (F1 ≈ 0.86–0.99) exhibit strong detection capability, though higher variance suggests sensitivity to limited sample sizes. This high performance underscores the system’s reliability in core safety-critical domains.

Medication Advice (Name, Dosage, Frequency)

In medication counseling tasks, performance is more variable. While CliniKnote achieves F1 ≈ 0.86–0.88, Eka.care MHS lags behind (F1 ≈ 0.65–0.74), and HealthBench averages around F1 ≈ 0.86. This indicates that free-text medication advice extraction remains a challenging area, particularly with diverse phrasing and ambiguous contexts. Standard deviations are notably higher (>0.10), highlighting inconsistencies in generalization compared to structured diagnostic and safety tasks.

Potential Adverse Events & Corollary Orders

The system achieves high scores (F1 ≈ 0.88–0.93) across adverse drug events and corollary prescription orders, with relatively stable variance. Importantly, recall remains high, showing that the CRS is effective at surfacing potential risks even when not all contextual cues are present.

Standard Protocols & Future Event Complications

Performance in adhering to standard protocols and anticipating possible future complications shows moderate but reliable accuracy (F1 ≈ 0.78–0.92). Variance across datasets is modest, suggesting the CRS can generalize reasonably well across practice settings. However, slightly lower recall in Eka.care MHS (F1 ≈ 0.78) indicates that smaller datasets may underrepresent the full spectrum of protocol deviations.

Key Patterns Observed

  • High Consistency in Safety-Critical Domains: Tasks involving medication errors, allergies, and drug interactions exhibit near-perfect accuracy across all benchmarks.

  • Dataset Size vs. Stability Tradeoff: Larger datasets (HealthBench, ACIBench) produce more stable scores, while smaller ones (CliniKnote, Eka.care MHS) occasionally yield inflated variance.

  • Challenging Free-text Advice Tasks: Extracting medication name/dosage/frequency shows the largest drop in consistency, highlighting an area for improvement.

  • Precision–Recall Balance: Most tasks show balanced precision and recall, though advice-related tasks occasionally lean towards higher recall, suggesting the system prioritizes coverage over strict accuracy.


7. Discussion


7.1 Implications for Clinical Practice

The benchmarking results of the CRS across multiple datasets highlight several implications for real-world clinical use:

  • Medication Safety & Risk Mitigation: Near-perfect performance (>0.99 F1 in several datasets) in detecting drug–drug interactions, allergies, and incorrect medication prescriptions demonstrates the CRS’s ability to act as a safety net in high-risk prescribing scenarios. This reliability supports deployment in hospital formularies, CPOE systems, and e-prescribing platforms, where even a single missed interaction could be catastrophic.

  • Decision Support for Complex Diagnoses: Strong and stable results in diagnosis and risk complication prediction (F1 ≈ 0.85–0.93 across datasets) suggest that the system can complement clinical judgment, particularly in multi-morbidity cases. By surfacing possible risks early, the CRS can reduce oversight in chronic care, geriatrics, and intensive care contexts.

  • Improved Clinical Workflows: The system’s ability to accurately predict standard protocols and corollary prescription orders provides opportunities to streamline workflows, reducing cognitive load for clinicians. This could enhance adherence to best practices and reduce variability in care.

  • Support for Real-world Data Capture: By reliably structuring unstandardized free-text inputs into parameterized outputs (e.g., medication name, dosage, frequency), the CRS can enhance the quality of EHR data. This in turn benefits secondary uses like research, billing, and population health management.


7.2 Limitations

Despite promising results across diverse datasets, several limitations must be acknowledged:

  • Variability in Free-text Medication Advice: Performance for medication name, dosage, and frequency extraction was notably lower and more variable (F1 ≈ 0.65–0.86, with higher standard deviations). This reflects the challenge of parsing inconsistent phrasing and brand names used in colloquial clinical documentation, which may impact reliability in outpatient or low-resource settings.

  • Incomplete or Biased Ground Truth: Since datasets were curated from varied sources (HealthBench, Primock57, CliniKnote, eka.care), annotations may not fully capture the clinical intent or latent conditions. As a result, the CRS may appear to “over-predict” parameters that were in fact clinically relevant but undocumented.

  • Dataset Imbalance: Parameters such as drug allergies and incorrect medication indications had far fewer samples than diagnosis or plan, limiting the ability to fully benchmark the system’s performance. High scores in these cases may not generalize to larger, real-world distributions.

  • Interpretability Challenges: The reliance on transformer-based embeddings (e.g., BioClinicalBERT for semantic evaluation) means that while semantic alignment is strong, explainability is limited. Clinicians may struggle to understand why a certain parameter was surfaced, which could impede trust and adoption.

  • Domain Generalization & Context Sensitivity: While the CRS generalized well across curated datasets, its performance may degrade in specialty domains (e.g., oncology, pediatrics) or with non-English datasets. Adaptation to local guidelines, languages, and regulatory frameworks will be critical for deployment.


8. Conclusion

This benchmarking study evaluated a Clinical Recommendation System (CRS) across multiple medical domains and benchmark datasets, using precision, recall, and F1-score to measure performance. The results indicate that the CRS demonstrates strong performance across most evaluated parameters, particularly in tasks requiring high-precision classification.


The system consistently achieved high precision, recall, and F1-scores, generally above 0.85, for core tasks like 'diagnosis' and 'risk complications' across most benchmarks, suggesting a high degree of accuracy and reliability in identifying clinically relevant information. The exceptionally high scores for medication-related tasks, such as 'drug to drug interactions' (up to 0.99 F1) and 'drug allergies' (up to 0.99 F1), highlight the system's robustness in areas where a single error could have significant clinical consequences. This performance demonstrates the system's effectiveness in providing accurate, critical safety information to clinicians.


While performance was robust overall, some variation was observed. Tasks related to 'medicine advice' (name, dosage, frequency) and 'possible future event complications' generally showed a slightly lower F1-score compared to core diagnostic tasks, indicating areas for potential improvement. This suggests that while the system is highly proficient at structured, classification-based tasks, there is an opportunity to enhance its performance on more nuanced, free-form, or predictive tasks.


Collectively, these findings demonstrate that the CRS is a highly effective tool for automating and assisting with core clinical tasks. The system is particularly adept at high-stakes classification, offering a reliable layer of decision support that could enhance patient safety and streamline clinical workflows.


9. References

[1] Moss AI, “Official Website,” 2024. [Online]. Available: https://moss.simplibot.com/

[2] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal, “HealthBench: Evaluating Large Language Models Towards Improved Human Health,” arXiv preprint arXiv:2505.08775, May 13, 2025. [Online]. Available: https://arxiv.org/abs/2505.08775

[3] A. Papadopoulos Korfiatis, F. Moramarco, R. Sarac, and A. Savkov, “PriMock57: A Dataset Of Primary Care Mock Consultations,” arXiv preprint arXiv:2204.00333, Apr. 1, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2204.00333

[4] W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “ACI-BENCH: a Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” arXiv preprint arXiv:2306.02022, Jun. 3, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.02022

[5] Y. Li, S. Wu, C. Smith, T. Lo, and B. Liu, “Improving Clinical Note Generation from Complex Doctor-Patient Conversation,” arXiv preprint arXiv:2408.14568, Jun. 16, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2408.14568

[6] Eka Care, “ekacare/ekacare_medical_history_summarisation,” Hugging Face, Jul. 25, 2025. [Online]. Available: https://huggingface.co/datasets/ekacare/ekacare_medical_history_summarisation

[7] E. Alsentzer, J. R. Murphy, W. Boag, W. Weng, D. Jin, T. Naumann, and M. B. A. McDermott, “Publicly Available Clinical BERT Embeddings,” in Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, Jun. 2019, pp. 72–78. doi: 10.18653/v1/W19-1909. [Online]. Available: https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT
