After testing a Metadata Extraction model for TMF Bot, you have a great deal of data available to help you evaluate the effectiveness of your Metadata Extraction models. This article explains how to use our Key Metrics to evaluate your Metadata Extraction models, defines each of the artifacts a Trained Model provides, and describes how to identify issues and improve your model testing.

Metadata Extraction Evaluation Key Metrics

Key Metrics Definitions

There are three Key Metrics for evaluating whether your Metadata Extraction Trained Model is effective: Extraction Coverage, Study Metadata Population Coverage, and Study Metadata Population Error Rate. This section contains a basic description of each, along with suggested targets and real-world example values based on an actual Trained Model.

  • Extraction Coverage
    • Definition: The number of documents that had appropriate information to train the model
    • What drives this result? Documents that fit our extraction criteria (extractable text, English text)
    • Suggested Target: 50–90%, with the lower end being appropriate for non-English customers
    • Example: A primarily English-based enterprise customer trained a model on 200,000 of their latest documents; their Extraction Coverage was 85.34%
  • Study Metadata Population Coverage
    • Definition: The number of documents in whose filename or content a single study was found
    • What drives this result? The documents in your training set and patterns in their naming or content
    • Suggested Target: 40–60% in sponsor Vaults; possibly lower in CRO Vaults if internal codes are used for study names
    • Example: A customer with a large variety of studies across geographies saw a Study Metadata Population Coverage rate of 44%, whereas a customer with less variety in documentation and vendors saw population coverage of 52%
  • Study Metadata Population Error Rate
    • Definition: The number of documents in whose filename or content a single study was found, and where that study did not match the value on the document
    • What drives this result? The documents in your training set and patterns in their naming or content
    • Suggested Target: Your target will depend on how risk-averse your organization is. Typically, lower is better, but keep in mind that users can still update metadata that has been set by the TMF Bot, and that TMF Bot does this work automatically, saving users time
    • Example: A customer with strong guidelines to include the study number in naming conventions saw a Study Metadata Population Error Rate of 1%, while another with a larger set of studies across geographies saw an error rate of 4%
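To make the arithmetic behind these percentages concrete, here is a minimal sketch of how the three Key Metrics relate to one another. The document counts, variable names, and the denominators chosen for each ratio are illustrative assumptions, not Vault's actual calculation; verify the real values against your Training Summary Results.

```python
# Hypothetical counts for a training document set; none of these values come from Vault.
total_documents = 200_000        # documents submitted for training
extractable_documents = 170_680  # documents with extractable English text
single_study_found = 88_750      # documents where a single study was found in filename/content
study_mismatches = 1_800         # of those, documents where the found study did not match

# Assumed denominators: coverage is measured against all submitted documents, and the
# error rate against the documents where a single study was found.
extraction_coverage = extractable_documents / total_documents
population_coverage = single_study_found / total_documents
population_error_rate = study_mismatches / single_study_found

print(f"Extraction Coverage:                {extraction_coverage:.2%}")
print(f"Study Metadata Population Coverage: {population_coverage:.2%}")
print(f"Study Metadata Population Error:    {population_error_rate:.2%}")
```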

Using Key Metrics for Evaluation

In the section above, we introduced three Key Metrics for evaluating the effectiveness of your Metadata Extraction model. You can locate them in the Training Summary Results field.

Extraction Coverage

Extraction Coverage is the only Key Metric that you cannot improve. While this might be disconcerting, the purpose of this metric is to set the right expectations for documents added to the Document Inbox. If your company has many audio, video, or other non-text files; a significant number of non-English documents; or regular problems with blurry scans, this metric can help you understand why particular documents are not auto-classified in the Document Inbox.

Study Metadata Population Coverage

You can improve your Study Metadata Population Coverage metric via the following methods:

  • Review SOPs and/or internal best practices to recommend that the study name be included in the filename of documents.
  • Evaluate your Trained Model Performance Metrics for low-performing classifications. If certain classifications have low values for Precision, Recall, or F1-Score, metadata extraction is likely failing to find matches or finding erroneous matches for those document types. Creating an Excluded Classification record in the Trained Model for those document types will exclude them from your testing.

Study Metadata Population Error Rate

You can improve your Study Metadata Population Error Rate metric via the following methods:

  • Review SOPs and/or internal best practices to recommend that the study name be included in the filename of documents.
  • Evaluate your Trained Model Performance Metrics for low-performing classifications. If certain classifications have low values for Precision, Recall, or F1-Score, metadata extraction is likely failing to find matches or finding erroneous matches for those document types. Creating an Excluded Classification record in the Trained Model for those document types will exclude them from your testing (one way to spot these low-performing classifications is sketched after this list).
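One practical way to act on the second bullet, for either metric, is to scan an export of your Classification Performance records for low scores. The sketch below assumes you have exported those records to a CSV; the file name, column names, and threshold are hypothetical placeholders, not a Vault-defined format.

```python
import csv

# Hypothetical export of Classification Performance records; adjust the file name,
# column names, and threshold to match your own export and risk tolerance.
THRESHOLD = 0.60

with open("classification_performance.csv", newline="") as f:
    for row in csv.DictReader(f):
        precision = float(row["precision"])
        recall = float(row["recall"])
        f1_score = float(row["f1_score"])
        if min(precision, recall, f1_score) < THRESHOLD:
            # These document types are candidates for Excluded Classification records
            print(f"{row['metric_subtype']}: P={precision:.2f} R={recall:.2f} F1={f1_score:.2f}")
```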

Finishing Evaluation

Once you have evaluated the Key Metrics for your Trained Model, you can compare them against our suggested targets and set your own company targets. If your Trained Model meets those targets, it is a good candidate to deploy.

Trained Model Performance Metrics

Every Metadata Extraction model has a series of Performance Metrics records once testing has been completed. You can find these in the Model Performance Metrics section of your Metadata Extraction model record. There are two Metric Types for Metadata Extraction Models:

  • Global Weighted Average: Contains average Precision, Recall, and F1-Score, weighted by the number of documents in each classification (the weighting is illustrated in the sketch after this list).
  • Classification Performance: Contains the Precision, Recall, and F1-Score for the document type listed in the Metric Subtype.
    • A special OTHERS UNKNOWN Classification Performance record catches the documents with classifications that didn’t meet the Minimum Documents per Document Type threshold. These documents are still used in training but are grouped together to better inform predictions on the valid classifications.
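To illustrate how the Global Weighted Average relates to the individual Classification Performance records, here is a minimal sketch. The classification names, document counts, and scores are made up for the example; only the weighting logic is the point.

```python
# Hypothetical per-classification results (not from a real Trained Model).
# Format: classification -> (testing documents, precision, recall, f1_score)
classification_metrics = {
    "Protocol":         (400, 0.95, 0.92, 0.93),
    "Informed Consent": (250, 0.88, 0.90, 0.89),
    "OTHERS UNKNOWN":   (150, 0.70, 0.65, 0.67),
}

total_docs = sum(values[0] for values in classification_metrics.values())

def global_weighted_average(index):
    """Average the metric at `index` (1=precision, 2=recall, 3=f1), weighted by document count."""
    return sum(values[0] * values[index] for values in classification_metrics.values()) / total_docs

print(f"Global Weighted Precision: {global_weighted_average(1):.3f}")
print(f"Global Weighted Recall:    {global_weighted_average(2):.3f}")
print(f"Global Weighted F1-Score:  {global_weighted_average(3):.3f}")
```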

Each record shows the following metrics:

  • Precision: How often the prediction made was accurate
  • Recall: Percentage of items within this Metric Subtype that were correctly predicted
  • F1-Score: The balance between Precision and Recall (the sketch after this list shows the standard formulas)
  • Testing Documents: The number of documents used for testing the model
  • Correct Predictions: The total number of times the model correctly predicted the Study name, regardless of the prediction confidence
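For reference, here is a minimal sketch of the standard formulas that the Precision, Recall, and F1-Score definitions above describe; the counts in the example are hypothetical and purely illustrative.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard formulas: precision = TP / (TP + FP), recall = TP / (TP + FN),
    and F1-Score is the harmonic mean of the two."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f1_score

# Hypothetical example: 90 correct study predictions, 10 incorrect predictions,
# and 20 documents where the correct study was missed entirely.
print(precision_recall_f1(90, 10, 20))  # (0.9, 0.818..., 0.857...)
```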

It is important to note that all predictions marked as correct assume that the metadata on the documents used for testing is itself correct. Any incorrect metadata on documents used to test the model may lead to inaccurate measurements of extraction performance in your Vault. The Trained Model Artifacts listed below can help reveal potential issues.

Trained Model Artifacts

Trained Models have a series of Trained Model Artifacts attached, each containing valuable data. You can find these on a Trained Model object record in the Trained Model Artifacts section. The artifacts include the following files for Trained Models using the Metadata Extraction type:

  • Document Set Extract Results (documentset_extract_results.csv): Extraction results for each document requested for training this model.
    • This file is most helpful in seeing why some documents were not used during the training process; a sketch after this list shows one way to tally the failure reasons.
    • You can use the Document ID & Major/Minor Version within this file to view the appropriate document within Vault.
    • See Reasons for Extraction Failures for a list of potential failure reasons.
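As one way to summarize this artifact, the sketch below tallies failure reasons from documentset_extract_results.csv. The column name used here is an assumption for illustration; check the header row of your own file before relying on it.

```python
import csv
from collections import Counter

# "failure_reason" is an assumed column name; confirm it against your own
# documentset_extract_results.csv header before using this sketch.
failure_reasons = Counter()
with open("documentset_extract_results.csv", newline="") as f:
    for row in csv.DictReader(f):
        reason = (row.get("failure_reason") or "").strip()
        if reason:  # an empty reason means the document was extracted successfully
            failure_reasons[reason] += 1

for reason, count in failure_reasons.most_common():
    print(f"{count:6d}  {reason}")
```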