
Benchmarking Moss AI: Advancing Clinical Intelligence through Automated, High-Fidelity Clinical Note Generation

1. Summary


The emergence of Ambient Clinical Intelligence (ACI) has enabled a paradigm shift in medical documentation by leveraging AI to transcribe and structure doctor-patient conversations automatically. Moss AI is a cutting-edge automated note-generation system built to address the growing need for efficient, accurate, and semantically faithful clinical documentation.


To rigorously assess Moss AI’s capabilities, we conducted a comprehensive benchmarking initiative using a diverse suite of publicly available datasets: ACI-BENCH, MTS-Dialogue, PriMock57, and CliniKnote (sample pipeline). These datasets span various clinical domains, specialties, and note formats, offering a robust foundation to evaluate Moss AI’s generalizability and performance.


The evaluation focused on both quantitative and qualitative metrics, analyzing how well Moss AI reproduces structured clinical notes from naturalistic dialogue inputs. Across multiple benchmarks, Moss AI exhibited strong performance in capturing essential clinical content, especially in sections like Chief Complaint, Social History, Vitals, History of Present Illness (HPI), Assessment, and Plan, while handling contextual cues, medical abbreviations, and patient-specific details with high reliability.


Our results demonstrate Moss AI’s potential to significantly reduce the documentation burden for clinicians, enhance note consistency, and improve the overall efficiency of electronic health record (EHR) workflows. While the system shows high promise, we also identify areas for further refinement, particularly in abstracting and generalizing across semantically equivalent but differently worded expressions and terminologies.


This white paper highlights Moss AI’s positioning as a state-of-the-art solution in automated clinical note generation, offering practical value in real-world medical settings while laying a solid foundation for future development in the domain of ambient clinical AI.



2. Introduction


Clinical documentation is a critical yet time-consuming component of modern healthcare delivery. Physicians often spend substantial portions of their workday manually entering patient information into Electronic Health Records (EHRs), leading to cognitive overload, reduced patient interaction time, and clinician burnout. 


Automated clinical note generation—the ability of AI systems to transcribe, extract, and structure relevant medical information from doctor-patient dialogues—has emerged as a promising solution. By reducing the manual burden of documentation, these systems have the potential to enhance productivity, improve accuracy, and restore clinician focus to patient care.


Despite rapid advancements in large language models and speech processing technologies, deploying AI systems in clinical environments requires more than raw performance; it demands rigorous validation, transparency, and generalizability across diverse clinical scenarios. The complexity of medical language, presence of abbreviations, varied dialogue structures, and overlapping semantic content present significant challenges to automated systems.


To address these challenges and evaluate AI capabilities in a standardized manner, benchmarking against curated, domain-representative datasets is essential. ACI-BENCH has become one such benchmark in the field, offering structured clinical dialogue and reference notes for objective comparison. In this study, we extend this approach by incorporating additional datasets—MTS-Dialogue, PriMock57, and CliniKnote (sample pipeline)—to capture a broader spectrum of clinical contexts and documentation styles.


This white paper presents an in-depth benchmarking of Moss AI, an automated note-generation system, across multiple datasets. We aim to assess its real-world applicability, robustness, and fidelity in replicating structured clinical notes from natural language dialogues, laying the groundwork for broader adoption in ambient clinical intelligence.



3. Background: Moss AI and Ambient Clinical Intelligence


3.1 Overview of Moss AI

Moss AI is an advanced mobile and API-based ambient clinical intelligence platform designed to automate and enhance physician documentation. Using state-of-the-art speech‑to‑text and natural language processing, it captures clinician–patient conversations in real time and automatically generates structured SOAP-style notes. Key features include:

  • Real-time transcription & note synthesis from clinician–patient dialogue

  • Customizable note templates, supporting sections such as History, Objective, Assessment, and Plan


3.2 Ambient Clinical Intelligence

Ambient Clinical Intelligence (ACI) refers to systems that unobtrusively “listen” to clinician–patient interactions and assist by capturing relevant information. Rather than burdening clinicians with manual note entry, ACI systems:

  • Significantly reduce documentation time, offering efficiency boosts of ~20% and enabling more time with patients

  • Lower cognitive load by shifting data entry from memory to structured automation

  • Foster better clinician–patient interactions by reducing screen exposure and allowing more focused engagement


3.3 Use Cases & Real-World Applications

  • Primary care follow-ups: Clinicians leverage Moss AI for chronic disease check-ups, medication reviews, and family history documentation

  • Integrations with EHR/EMR/HIMS systems: Moss output supports downstream workflows such as coding, billing, and decision prompts

  • Clinician feedback: Early adopters report that Moss AI-infused ambient scribe tools increased same-day note completion rates from 66% to over 72%, shortened after-hours work by roughly 30%, and remained well received after 7 weeks of use



4. Dataset Overview


4.1 Dataset Description

To evaluate Moss AI's ability to generate high-quality clinical notes from doctor-patient conversations, we employed a diverse set of benchmark datasets:

  • ACI-BENCH: A publicly available dataset designed for benchmarking automatic clinical note generation from spoken clinical interactions. It includes complete transcripts of clinical encounters paired with corresponding reference notes, and serves as a strong foundation for evaluating systems in real-world, clinically relevant scenarios.

  • MTS-Dialogue: Provides multi-turn medical conversations focused on common conditions and clinical reasoning. Its dialogue-based structure makes it suitable for evaluating models’ handling of interactive context and question-answer patterns.

  • PriMock57: Contains 57 structured doctor-patient dialogues along with their corresponding clinical summaries. The dataset emphasizes concise consultations and poses challenges in summarization and information prioritization.

  • CliniKnote (sample pipeline): A diverse dataset containing synthetic and curated clinical conversations with detailed SOAP-format reference notes. It serves well for assessing template adherence, completeness, and note structuring.

These datasets were selected to collectively represent varying conversation styles, clinical domains, and documentation patterns, enabling robust benchmarking of Moss AI’s generative capabilities.
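To benchmark across these heterogeneous sources, each encounter is normalized into a common dialogue-plus-reference representation before scoring. The sketch below shows one minimal way to hold such records; the Encounter class, its field names, and the to_flat_reference helper are illustrative assumptions made for this paper’s examples, not part of any dataset’s official tooling.

    from dataclasses import dataclass, field
    from typing import Dict

    @dataclass
    class Encounter:
        """One doctor-patient encounter normalized for benchmarking (illustrative schema)."""
        dataset: str                   # e.g. "aci-bench", "mts-dialog", "primock57", "cliniknote"
        encounter_id: str              # unique identifier within the dataset
        dialogue: str                  # full transcript with speaker turns joined as plain text
        reference_sections: Dict[str, str] = field(default_factory=dict)  # e.g. {"HPI": "...", "Plan": "..."}

    def to_flat_reference(enc: Encounter) -> str:
        """Flatten section headers and bodies into a single reference note string."""
        return "\n".join(f"{name}:\n{text}" for name, text in enc.reference_sections.items())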


4.2 Data Sources and Collection

  • ACI-BENCH is a publicly curated dataset. The conversations are semi-synthetic but validated by medical professionals to ensure clinical realism.

  • PriMock57 and MTS-Dialogue contain expert-authored content reflecting realistic diagnostic and consultative exchanges.

  • CliniKnote currently provides only a sample pipeline but offers well-structured notes. The conversations are role-played but validated by medical professionals to ensure clinical realism.

  • Three of the four datasets include high-quality reference notes either manually written by experts or automatically generated and validated to ensure fidelity to actual clinical documentation styles.


4.3 Data Exploration (EDA)

Due to variation in dataset design and note structuring, we performed dataset-specific EDA to better understand the data distribution. ACI-BENCH and CliniKnote (sample pipeline) contain rich, structured SOAP-style notes, while MTS-Dialogue and PriMock57 contain shorter, often unstructured interactions.

We present both individual and comparative summaries to showcase the range of challenges posed across datasets.

  • Conversation and Note Length:

    • Ranges from brief symptom checks to extended diagnostic interviews.

  • Sectional Distribution:

    • ACI-BENCH: Chief Complaint; History of Present Illness; Review of Systems; Physical Examination; Assessment and Plan; Results

    • MTS-Dialog: Family History; History of Present Illness; Past Medical History; Chief Complaint; Past Surgical History; Allergy; Review of Systems; Medications; Assessment; Exam; Diagnosis; Disposition; Plan; Emergency Department Course; Immunization; Imaging; Gynecologic History; Procedures; Other History; Labs

    • PriMock57: HPI; PMH; SH; FH; ROS; Assessment; Plan

    • CliniKnote (sample pipeline): Chief Complaint; History of Presenting Illness; Past Medical History; Past Surgical History; Family History; Allergies; Social History; Medication List; Immunization History; Review of Systems; Vital Signs; Physical Exam; Diagnostic Tests; Assessment; Plan


  • Clinical Entities: The figures below show the frequency of presenting complaints in each dataset.


Fig 1. Frequency of each complaint across the ACI-BENCH dataset

Fig 2. Frequency of each complaint across the MTS-Dialog dataset

Fig 3. Frequency of each complaint across the PriMock57 dataset

Fig 4. Frequency of each complaint across the CliniKnote dataset
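The summaries and figures above come from straightforward corpus statistics. The sketch below, which assumes the illustrative Encounter records described in Section 4.1, shows how length distributions and section frequencies of this kind can be computed; it is a simplified stand-in rather than the exact analysis code behind the figures.

    from collections import Counter
    from statistics import mean

    def eda_summary(encounters):
        """Basic EDA: dialogue/note lengths and section frequency over a list of Encounter records."""
        dialogue_lens = [len(e.dialogue.split()) for e in encounters]
        note_lens = [sum(len(text.split()) for text in e.reference_sections.values()) for e in encounters]
        section_freq = Counter(name for e in encounters for name in e.reference_sections)
        return {
            "n_encounters": len(encounters),
            "avg_dialogue_tokens": round(mean(dialogue_lens), 1),
            "avg_note_tokens": round(mean(note_lens), 1),
            "most_common_sections": section_freq.most_common(10),
        }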

5. Benchmarking Methodology


5.1 Metrics Used for Evaluation

To holistically assess the quality and clinical fidelity of automatically generated clinical notes, we initially considered a broad set of evaluation metrics across three dimensions: lexical overlap, semantic similarity, and clinical relevance.


However, for the final benchmarking analysis, we prioritized BERTScore (Precision, Recall, F1) as our primary evaluation metric, for the following reasons (a brief scoring example follows the list):

  • Semantic Fidelity Over Surface Overlap: Unlike ROUGE, BLEU, or CHRF++, which focus on exact or partial lexical matches, BERTScore evaluates semantic similarity using contextual embeddings. This allows it to accurately capture the meaning of paraphrased or reordered text, which is common in real-world clinical documentation.

  • Better Alignment with Clinical Relevance: Clinical notes often vary in phrasing but must preserve essential medical information. BERTScore is more robust in detecting whether core clinical facts are retained—even when sentence structure or terminology differs from the reference.

  • Reduced Penalization for Valid Variation: Lexical metrics tend to penalize synonymous or stylistically different expressions. BERTScore’s embedding-based comparison avoids this, making it more suitable for diverse writing styles across datasets.

  • Strong Correlation with Human Judgments: BERTScore has been shown to correlate better with human ratings in summarization and paraphrasing tasks, making it a reliable proxy when human evaluation is not feasible.
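As a concrete illustration of how these scores are produced, the snippet below computes BERTScore with the open-source bert-score package using the roberta-large backbone reported in this paper; the example note strings are placeholders rather than dataset excerpts.

    from bert_score import score  # pip install bert-score

    generated = ["Patient reports three days of watery diarrhea with no blood."]
    reference = ["Chief complaint: watery diarrhea for 3 days; no blood noted."]

    # Compare contextual embeddings from the roberta-large backbone.
    P, R, F1 = score(generated, reference, model_type="roberta-large")
    print(f"Precision={P.mean().item():.4f}  Recall={R.mean().item():.4f}  F1={F1.mean().item():.4f}")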


Why Other Metrics Were Not Selected

  • ROUGE/BLEU/CHRF++: While useful for structured summarization tasks, these metrics fall short in assessing paraphrased or semantically equivalent variants, which are frequent in clinical settings.

  • BLEURT: Though powerful, BLEURT requires significant computational resources and may introduce variability due to its learned nature. In practice, it did not offer a significant advantage over BERTScore in our experiments.

  • MedCon: Although ideal for concept-level comparison, MedCon's performance is highly dependent on the accuracy of entity recognition, which varied significantly across datasets due to inconsistent annotation quality.


5.2 Benchmarking Process

Data Preparation

All datasets—including ACI-BENCH, MTS-Dialogue, PriMock57, and CliniKnote—were preprocessed to match the expected input format for Moss AI’s note generation pipeline; a simplified cleanup sketch follows the list below. This involved:

  • Standardizing section headers across datasets to align with the SOAP (Subjective, Objective, Assessment, Plan) format.

  • Removing artifacts, including control characters, incomplete transcriptions, and HTML or timestamp noise.

  • Tokenizing and flattening structured fields for compatibility with downstream evaluation (e.g., transforming dictionaries or nested fields into plain text).

  • Skipping incomplete entries—such as those with missing references, failed API calls, or blank dialogue segments—to maintain evaluation quality.
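The following is a simplified sketch of this cleanup step, reusing the illustrative Encounter records and to_flat_reference helper from Section 4.1; the regular expressions and the skip rule approximate, rather than reproduce, the actual pipeline.

    import re

    def clean_text(text: str) -> str:
        """Strip HTML tags, [hh:mm:ss]-style timestamps, and control characters, then collapse whitespace."""
        text = re.sub(r"<[^>]+>", " ", text)                       # HTML remnants
        text = re.sub(r"\[\d{1,2}:\d{2}(?::\d{2})?\]", " ", text)  # timestamp noise
        text = re.sub(r"[\x00-\x08\x0b-\x1f]", " ", text)          # control characters
        return re.sub(r"\s+", " ", text).strip()

    def prepare(encounters):
        """Clean each dialogue and flattened reference note, skipping incomplete entries."""
        prepared = []
        for enc in encounters:
            dialogue = clean_text(enc.dialogue)
            reference = clean_text(to_flat_reference(enc))
            if not dialogue or not reference:
                continue  # missing reference or blank dialogue: excluded to protect evaluation quality
            prepared.append((enc.dataset, enc.encounter_id, dialogue, reference))
        return prepared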


Each cleaned input (whether a doctor-patient conversation or structured case summary) was then passed through Moss AI’s inference engine, which generated structured notes in a consistent SOAP format.


Evaluation Protocol

To ensure fairness, reproducibility, and dataset-agnostic performance tracking, the following protocol was applied (a scoring and aggregation sketch follows the list):

  • Symmetric preprocessing: Identical normalization steps (flattening, lowercasing, filtering) were applied to both reference and generated notes before running the evaluation.

  • Section-wise evaluation: Notes were evaluated both holistically and per section to better understand strengths and gaps in specific parts of the generation.

  • Dataset-wise aggregation: Results were computed individually for each dataset and later aggregated to derive macro-level performance insights.

  • Open-source tools: All scoring was done using publicly available Python libraries to ensure transparency and reproducibility.
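Under the same assumptions as the earlier sketches, this protocol can be realized in a few lines: score all (dataset, section, reference, generated) tuples in one pass, then aggregate BERTScore-F1 per dataset and per section. Exact-header pairing of sections is a simplification of the real alignment step.

    from collections import defaultdict
    from statistics import mean, stdev
    from bert_score import score

    def evaluate(pairs):
        """pairs: list of (dataset, section, reference_text, generated_text) tuples."""
        # Symmetric preprocessing: identical normalization on reference and generated text.
        refs = [ref.lower() for _, _, ref, _ in pairs]
        cands = [gen.lower() for _, _, _, gen in pairs]
        _, _, F1 = score(cands, refs, model_type="roberta-large")

        by_dataset, by_section = defaultdict(list), defaultdict(list)
        for (dataset, section, _, _), f1 in zip(pairs, F1.tolist()):
            by_dataset[dataset].append(f1)
            by_section[section].append(f1)

        # Dataset-wise aggregation (mean F1 and standard deviation) plus section-wise means.
        dataset_report = {ds: (mean(v), stdev(v) if len(v) > 1 else 0.0) for ds, v in by_dataset.items()}
        section_report = {sec: mean(v) for sec, v in by_section.items()}
        return dataset_report, section_report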



6. Results


6.1 Quantitative Results


We evaluated Moss AI’s note generation performance across four benchmark datasets using a diverse set of metrics. The evaluation includes both a cross-model comparison table and granular metric breakdown for BERTScore.


Benchmark Comparison Across Models and Datasets

The table below compares Moss AI's BERTScore-F1 performance against strong baseline models reported in prior literature. All scores were computed using the roberta-large model configuration.

SNO   Model            ACI-BENCH   MTS-Dialog   PriMock57   CliniKnote
1     MediGEN          0.721       N/A          N/A         N/A
2     BART-finetuned   N/A         0.81         0.81        N/A
3     Cammel-13B       N/A         N/A          N/A         0.895
4     Moss AI          0.9324      0.9815       0.9417      0.8703

(All values are BERTScore-F1; N/A indicates the baseline score was not reported for that dataset.)


Detailed Semantic Similarity Scores

To further understand performance, the following table breaks down BERTScore Precision, Recall, and F1 across each dataset.

Dataset       BERTScore Precision   BERTScore Recall   BERTScore F1   Std. Dev.
MTS-Dialog    0.9811                0.9820             0.9815         0.1684
PriMock57     0.9424                0.9410             0.9417         0.0464
ACI-BENCH     0.9322                0.9326             0.9324         0.0285
CliniKnote    0.8721                0.8686             0.8703         0.2645
Average       0.9320                0.9311             0.9315         0.1269

Interpretation

  • High BERTScore F1 across datasets indicates strong semantic alignment between Moss AI-generated notes and reference notes, even when surface form varies—an essential quality for clinical accuracy.

  • The Precision vs. Recall balance reflects Moss AI’s ability to capture reference information completely (Recall) while avoiding unsupported or hallucinated content (Precision). Moss AI maintains a healthy balance between the two across all datasets.

  • In most datasets, Moss AI outperforms or closely matches baseline systems, reinforcing its generalization ability across different note styles and dataset structures.


6.2 Qualitative Analysis


Strengths

  • Semantic Consistency: Moss AI captures key clinical facts like symptom duration, stool characteristics, past medical history, and treatment plans with high fidelity.

  • Section Structure: Outputs follow the expected SOAP format even when source transcripts or notes are unstructured.

  • Paraphrasing Flexibility: Generated notes use clinically valid paraphrases without losing critical information, improving readability.


Areas for Improvement

  • Detail Compression: In some instances, Moss AI omits secondary but relevant details (e.g., “family also unwell” became “family had mild symptoms”).

  • Precision in Temporal Phrasing: While timelines are generally correct, expressions like “last vomited 1 hour ago” may be generalized or dropped entirely.



7. Discussion


7.1 Implications for Clinical Practice

Time Savings: Moss AI has the potential to significantly reduce documentation burden. Based on industry benchmarks, ambient clinical intelligence systems like Moss AI can reduce documentation time by up to 70–78%, allowing clinicians to shift focus from typing notes to delivering patient care. This translates to potentially reclaiming several hours per week per provider.

Clinician Productivity: By automating note generation from conversations, Moss AI enables clinicians to maintain complete, structured documentation without manual transcription or post-visit dictation. This facilitates real-time note finalization, better adherence to templates and billing standards, and reduced cognitive load.


7.2 Limitations

Dataset Limitations: Although ACI-BENCH, MTS-Dialogue, PriMock57, and CliniKnote cover a broad range of clinical encounters, some datasets include role-played or synthetic dialogues, which may not fully reflect real-world variability. There may also be biases in note structure or over-representation of certain specialties. In addition, the datasets focus on English-language dialogues and notes and mostly follow U.S.-style documentation practices.


Model Limitations: Moss AI, like other generative systems, can occasionally omit low-frequency details, produce repetitive text, or struggle with rare abbreviations or disfluent speech. While performance is strong in common sections such as Chief Complaint (CC), HPI, Assessment, Plan, and Past Medical History (PMH), areas such as Review of Systems or Family History can show inconsistency depending on phrasing or note conventions.


Generalizability: The evaluation focused on English-language notes and U.S.-style documentation practices. The current benchmark may not reflect performance in non-English settings, pediatric or subspecialty domains, or non-SOAP-based formats. Future work in benchmarking could evaluate Moss AI’s robustness across more languages, EHR systems, and geographies.



8. Conclusion and Future Directions


This benchmarking study demonstrates the strong capabilities of Moss AI in generating high-quality, semantically faithful clinical notes across a diverse set of datasets. Leveraging the roberta-large configuration for BERTScore evaluation, Moss AI consistently outperforms or matches strong baseline models across various benchmarks, including ACI-BENCH, MTS-Dialog, PriMock57, and CliniKnote (sample pipeline). Notably, it achieved its highest BERTScore-F1 of 0.9815 on MTS-Dialog, highlighting its robust semantic understanding and alignment with ground-truth documentation.


Beyond quantitative performance, Moss AI also exhibits strengths in preserving structured note formats, minimizing hallucinations, and maintaining clinical relevance. These attributes make it a promising tool for easing the burden of documentation in real-world clinical environments.

Future Directions


To further evaluate and stress-test Moss AI’s performance, future benchmarking efforts can explore the following directions:

  • Broader Language Coverage: Incorporate non-English and multilingual clinical datasets to assess generalizability across global healthcare settings.

  • Domain Diversity: Include datasets from specialty areas (e.g., oncology, psychiatry, pediatrics) to test robustness in less-represented or high-complexity clinical domains.

  • Fact-Specific Evaluation: Integrate fine-grained factual accuracy benchmarks that target entity-level correctness and hallucination rates.

  • Human-Centered Evaluation: Conduct human-in-the-loop assessments involving clinicians to validate utility, readability, and trust in real-world settings.

  • Generalization to Noisy Inputs: Assess model robustness against less structured, noisy, or ASR-derived dialogues that mimic real-world conditions more closely.



9. References


[1] W.-W. Yim, Y. Fu, A. Ben Abacha, N. Snider, T. Lin, and M. Yetisgen, “ACI-BENCH: A Novel Ambient Clinical Intelligence Dataset for Benchmarking Automatic Visit Note Generation,” arXiv preprint arXiv:2306.02022, 2023. [Online]. Available: https://arxiv.org/abs/2306.02022/ [Dataset]

[2] A. Ben Abacha, W.-W. Yim, Y. Fan, and T. Lin, “An Empirical Study of Clinical Note Generation from Doctor-Patient Encounters,” in Proc. EACL, Dubrovnik, Croatia, May 2023, pp. 2291–2302. [Online]. Available: https://aclanthology.org/2023.eacl-main.168/ [Dataset]

[3] A. Papadopoulos Korfiatis, F. Moramarco, R. Sarac, and A. Savkov, “PriMock57: A Dataset of Primary Care Mock Consultations,” in Proc. ACL (Short Papers), Dublin, Ireland, May 2022, pp. 588–598. [Online]. Available: https://aclanthology.org/2022.acl-short.65/ [Dataset]

[4] Y. Li, S. Wu, C. Smith, T. Lo, and B. Liu, “Improving Clinical Note Generation from Complex Doctor-Patient Conversation,” arXiv preprint arXiv:2408.14568, 2024. [Online]. Available: https://arxiv.org/abs/2408.14568 [Dataset]

[5] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” in Proc. ICLR, Addis Ababa, Ethiopia, Apr. 2020. [Online]. Available: https://arxiv.org/abs/1904.09675

[6] C.-Y. Lin, “ROUGE: A Package for Automatic Evaluation of Summaries,” in Proc. ACL Workshop on Text Summarization Branches Out, Barcelona, Spain, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013/

[7] T. Sellam, D. Das, and A. Parikh, “BLEURT: Learning Robust Metrics for Text Generation,” in Proc. ACL, Online, Jul. 2020, pp. 7881–7892. [Online]. Available: https://aclanthology.org/2020.acl-main.704/

[8] L. Zhuang, W. Lin, Y. Shi, and J. Zhao, “A Robustly Optimized BERT Pre-training Approach with Post-training,” in Proc. CCL, Huhhot, China, Aug. 2021, pp. 1218–1227. [Online]. Available: https://aclanthology.org/2021.ccl-1.108/

[9] Moss AI, “Official Website,” 2024. [Online]. Available: https://moss.simplibot.com/
