Creating & Testing Metadata Extraction Models

To use TMF Bot features like study metadata extraction, a Metadata Extraction model must be tested and deployed. This model does not create a custom machine learning model specific to your document types, but instead relies on a standard approach to search documents’ content and filenames for matches to records in your Vault. Testing a Metadata Extraction model allows you to assess potential metadata extraction performance prior to enabling it in your Vault.

How to Test a Model

The TMF Bot requires input to learn before performing tasks on its own. Generally, the larger and more accurate the inputs, the better the resulting model will be. Vault stores accumulated input in Trained Model object records.

Prediction Confidence Threshold

In machine learning models such as Document Classification, Vault uses the Prediction Confidence Threshold field value on a Trained Model record to determine what score is required before the system will use a prediction. Because metadata extraction follows stricter logic that searches for exact matches, this value will be set to .9999 for Metadata Extraction models.
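
As a rough illustration of how a confidence threshold gates predictions, the sketch below keeps only predictions whose score meets the threshold. The Prediction class, the usable_predictions function, and the use of "greater than or equal to" as the comparison are assumptions made for this example; they are not Vault's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical prediction record, used only for this illustration."""
    document_id: int
    value: str    # for example, a predicted Study value
    score: float  # confidence between 0 and 1


def usable_predictions(predictions, threshold):
    """Return the predictions whose score meets the threshold.

    Assumption: a score equal to the threshold is accepted.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return [p for p in predictions if p.score >= threshold]


predictions = [
    Prediction(101, "NCT12345678", 1.0),
    Prediction(102, "NCT12345678", 0.95),
]
# With the .9999 threshold described above, only the exact match survives.
kept = usable_predictions(predictions, 0.9999)
```

With a threshold this close to 1, only near-certain predictions pass, which fits the exact-match logic described above.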

Creating a Study Metadata Extraction Model

Before creating a Study Metadata Extraction model, carefully consider the following limitations:

Vault cannot extract Study data from certain categories of documents, and these documents cannot be used for testing. These include:
- Video and audio files
- Non-text files, such as ZIP files, statistical files, or database files
- Documents where Vault cannot extract text, for example, if the text is too blurry.
We recommend using at least 3,000 documents in steady states, such as Approved or Final, to test the Study Metadata Extraction model. You may use TMF Bot on Vaults with fewer documents, but note that metrics measuring its potential performance will not be as robust.
If any of your input documents have an incorrect Study value, your performance metrics will be negatively impacted. For example, suppose several documents should have a Study ID of NCT12345678 and include that Study ID in their content, but are erroneously assigned a Study ID of NCT12345679. Vault counts these as incorrect predictions in your Trained Model Performance Metrics, so the metrics may underestimate the effectiveness of metadata extraction once deployed in your Vault.
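
One way to spot such mismatches before testing is to compare each document's assigned Study value with the identifiers that appear in its text. The sketch below is a minimal illustration, assuming Study values follow the NCT-plus-eight-digits pattern used in the example above and that you already have each document's text available. The function name and the input tuple layout are hypothetical and not part of Vault.

```python
import re

# Assumption: Study IDs look like "NCT" followed by eight digits,
# as in the example above. Your Vault's Study values may differ.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def suspicious_study_values(documents):
    """Flag documents whose text mentions Study IDs but not the assigned one.

    `documents` is an iterable of (document_id, assigned_study_id, text).
    Returns a list of (document_id, assigned_study_id, ids_found_in_text).
    """
    flagged = []
    for document_id, assigned, text in documents:
        found = set(NCT_PATTERN.findall(text))
        if found and assigned not in found:
            flagged.append((document_id, assigned, sorted(found)))
    return flagged


sample = [
    (1, "NCT12345679", "Protocol for study NCT12345678."),
    (2, "NCT12345678", "Site list for study NCT12345678."),
]
flagged = suspicious_study_values(sample)
```

Documents flagged this way are candidates for manual review; a flag does not prove the assigned value is wrong.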

Creating the Study Metadata Extraction Object Record

Navigate to Admin > Business Admin and click into the Trained Model object.
Click Create.
For the Trained Model Type, select Metadata Extraction.
Enter a Prediction Confidence Threshold.
- TMF Bot will not use any predictions below this threshold for auto-classification. While Vault will accept any value between zero (0) and one (1), we recommend using a value of .9 or above.
- Once you have tested a Trained Model, you cannot change this value.
- Generally, the higher the number, the more accurate the classifications; however, you may extract metadata from fewer documents.
If you intend to use the Training Window training method, set the Training Window Start Date accordingly.
Under Model Parameters, the Minimum Documents per Document Type will default to 100. If you wish to adjust this, you can do so from a list of the Trained Models.
- All documents in your document set will be evaluated for metadata extraction. To have Performance Metric records for an individual classification, the document set must include, for that classification, at least the Minimum Documents per Document Type.
- Classifications with fewer than the Minimum Documents per Document Type will still be evaluated, but will be combined under a record with a Metric Subtype of “OTHERS UNKNOWN (Remaining Document Types that did not meet the Minimum Documents per Document Type).”
- If you find that you have a large number of classifications being combined under “OTHERS UNKNOWN,” you can lower the Minimum Documents per Document Type. Below are suggested values depending on the number of documents in your Vault at the time of model testing.
  - 1,000 to 10,000 documents = 10
  - 10,000 to 25,000 documents = 15
  - 25,000 to 50,000 documents = 25
  - 50,000 to 100,000 documents = 50
  - 100,000 to 150,000 documents = 75
  - 150,000 to 200,000 documents = 100
- The Advanced Model Parameters field is system-managed; you do not need to set anything here.
Click Save.
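
The Minimum Documents per Document Type guidance in the steps above can be sketched as a lookup plus a grouping rule. This is only an illustration: the function names are hypothetical, and because the suggested ranges share their endpoints (for example, 10,000 appears in two ranges), the sketch assumes each shared endpoint belongs to the lower range.

```python
from collections import Counter

# (upper bound of document count, suggested minimum), from the list above.
SUGGESTED_MINIMUMS = [
    (10_000, 10),
    (25_000, 15),
    (50_000, 25),
    (100_000, 50),
    (150_000, 75),
    (200_000, 100),
]

OTHERS = "OTHERS UNKNOWN"


def suggested_minimum(document_count):
    """Return the suggested minimum, or None outside the listed ranges."""
    if document_count < 1_000:
        return None
    for upper_bound, minimum in SUGGESTED_MINIMUMS:
        if document_count <= upper_bound:
            return minimum
    return None


def metric_groups(classifications, minimum):
    """Count documents per classification, folding small ones into OTHERS.

    A classification keeps its own entry only if it has at least
    `minimum` documents; the rest are summed under OTHERS.
    """
    groups = Counter()
    for name, count in Counter(classifications).items():
        groups[name if count >= minimum else OTHERS] += count
    return dict(groups)
```

For example, metric_groups(["A", "A", "A", "B", "C"], 2) keeps "A" with three documents and combines "B" and "C" into a single "OTHERS UNKNOWN" entry with two documents. Lowering the minimum to 1 would give each classification its own entry.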

Choosing a Document Set Method

To test your model, you’ll need to choose a method to pull documents to use as input in this Metadata Extraction Trained Model. There are two options: Training Window Start Date and Attached CSV of Document IDs.

Training Window Start Date

The Training Window Start Date method ignores Archived documents; this is a known issue. If you want to train on Archived documents, you must use the Attached CSV of Document IDs method.

This method pulls all documents in a steady state, such as Approved or Final, with a Version Created Date value between the Training Window Start Date and the current date. If more than 40,000 documents meet these criteria, Vault uses the 40,000 most recent documents. If you choose this method, ensure you have filled in a Training Window Start Date value on your Trained Model record.
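
The selection rule for this method can be sketched as a filter followed by a cap. The Doc class, its field names, and the set of steady states below are assumptions for illustration only; the sketch also assumes "most recent" means the latest Version Created Date.

```python
from dataclasses import dataclass
from datetime import date

# Only the two steady states named in the text; a real Vault may define others.
STEADY_STATES = {"Approved", "Final"}
MAX_DOCUMENTS = 40_000


@dataclass
class Doc:
    """Hypothetical document record, used only for this illustration."""
    document_id: int
    state: str
    version_created: date
    archived: bool = False


def training_window_set(docs, start, today, limit=MAX_DOCUMENTS):
    """Select steady-state, non-archived documents inside the window.

    If more than `limit` documents qualify, keep the most recent ones.
    """
    eligible = [
        d for d in docs
        if not d.archived
        and d.state in STEADY_STATES
        and start <= d.version_created <= today
    ]
    eligible.sort(key=lambda d: d.version_created, reverse=True)
    return eligible[:limit]


docs = [
    Doc(1, "Approved", date(2023, 1, 10)),
    Doc(2, "Draft", date(2023, 2, 1)),                     # not a steady state
    Doc(3, "Final", date(2023, 3, 1)),
    Doc(4, "Final", date(2022, 12, 1)),                    # before the window
    Doc(5, "Approved", date(2023, 4, 1), archived=True),   # archived, ignored
]
selected = training_window_set(docs, date(2023, 1, 1), date(2023, 12, 31))
```

Here only documents 3 and 1 are selected, in that order, because they are the only non-archived, steady-state documents inside the window.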

Attached CSV of Document IDs

This method takes a list of Document IDs as input. A Document ID is Vault’s unique identifier for a document, so this method allows admins to tailor the list of documents used to train the Metadata Extraction Trained Model. While you can use any process that results in a list of Document IDs, the following steps create a report to get a list of Document IDs:

Create a new report. Add filters to find the documents you wish to use.
Add the Document ID field as a column.
Run the report and export the results to CSV.
Open the exported file. Change the name of the Document ID column to id.
Save the file as documentset.csv.

Your CSV file cannot contain more than 40,000 Document IDs.
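
If you prefer to script the rename and the size check instead of editing the export by hand, something like the following sketch could work. It assumes the exported column header is literally "Document ID" and that a file containing only the id column is acceptable; neither is confirmed here, and build_document_set is a hypothetical helper, not a Vault tool.

```python
import csv

MAX_IDS = 40_000  # limit stated above


def build_document_set(export_path, output_path="documentset.csv",
                       id_column="Document ID"):
    """Write a single-column CSV with header "id" from a report export.

    Returns the number of Document IDs written. Raises ValueError if the
    expected column is missing or the limit is exceeded.
    """
    # utf-8-sig tolerates a byte order mark if the export includes one.
    with open(export_path, newline="", encoding="utf-8-sig") as src:
        reader = csv.DictReader(src)
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise ValueError(f"column {id_column!r} not found in export")
        ids = [(row[id_column] or "").strip() for row in reader]
    ids = [value for value in ids if value]  # drop empty cells

    if len(ids) > MAX_IDS:
        raise ValueError(f"{len(ids)} Document IDs exceeds the limit of {MAX_IDS}")

    with open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id"])
        writer.writerows([value] for value in ids)
    return len(ids)
```

Calling build_document_set("report_export.csv") would write documentset.csv in the current directory, where report_export.csv stands in for whatever name your exported report has.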

Using the Document ID method allows admins to select any documents to train the model. We strongly recommend that the IDs provided include all document types that users may send to the Document Inbox. The TMF Bot will try to classify every document that comes into the Inbox, and if it didn’t learn certain document types, documents of those types are likely to be misclassified.

Once you have created the documentset.csv, upload it as a Trained Model Artifact to your Trained Model record.

Testing the Metadata Extraction Model

Once you have determined the appropriate Document Set Method, perform the Test Model action. Choose the appropriate Document Set Method when prompted, then click Start. The Trained Model record will move to the In Training state.

An asynchronous job tracks two activities as part of testing:

Document Extraction: During this process, the system collects the data from the specified document set. The output is a CSV file (document_extract_results.csv), attached under Trained Model Artifacts, in which an Admin can see which documents could be used as input and which could not. Vault sends a notification to the Admin who started the action when the extraction is complete.
Model Testing: During this process, the system will test metadata extraction using the specified document set (the documents in document_extract_results.csv where extraction was successful). The output is a number of performance metrics in the Trained Model Performance Metrics object. Vault sends a notification to the Admin who started the action when testing is complete.

The time required to complete these jobs varies depending on the number of documents used as input, from about 1 hour for Vaults testing on 3,000 documents to about 12 hours for Vaults testing on 40,000 documents.
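
For planning purposes, you could interpolate between the two figures above. The sketch below assumes the duration grows linearly with the number of documents. That is an assumption made for illustration and not a documented behavior, so treat the result as a rough guess only.

```python
def estimated_hours(document_count):
    """Rough estimate of testing time, interpolated between the two
    figures given above (3,000 documents: about 1 hour; 40,000: about 12).

    Assumption: time scales linearly with document count.
    """
    low_docs, low_hours = 3_000, 1.0
    high_docs, high_hours = 40_000, 12.0
    if not low_docs <= document_count <= high_docs:
        raise ValueError("estimate only covers 3,000 to 40,000 documents")
    fraction = (document_count - low_docs) / (high_docs - low_docs)
    return low_hours + fraction * (high_hours - low_hours)
```

Under this assumption, a document set halfway between the two sizes (21,500 documents) would take roughly 6.5 hours.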

Once Model Testing is complete, the Trained Model record will move to the Trained state.

Testing a Metadata Extraction Model in Pre-Release or Sandbox Environments with Production Data

You can test a Metadata Extraction model in your Pre-Release or Sandbox Vault with production documents for evaluation purposes. You cannot move the resulting Metadata Extraction model to your production environment.

Both methods for document selection are available. If you’re using the Attached CSV of Document IDs method, be sure to use Document IDs from your production Vault.

To test using production data, run the Test Model From Production Data action. This action is only visible in Pre-Release and Sandbox Vaults.

After evaluating your Metadata Extraction model, you’ll need to perform testing again in your production Vault to begin using TMF Bot features there. If you use the Attached CSV of Document IDs method of document selection, you can use this same list of documents to create a similar Metadata Extraction model in your production environment.

Evaluating the Trained Model

Three Key Metrics indicate whether your Metadata Extraction model is viable: Extraction Coverage, Study Metadata Extraction Coverage, and Study Metadata Extraction Error Rate. These can be found in the Training Summary Results field of your Trained Model record.

Deploying the Trained Model

Once you have evaluated your Metadata Extraction model, select the Deploy Model action from the Trained Model record, review the prompt to ensure you agree with the outcome, and click Start. The Trained Model record will move to the In Deployment state.

An asynchronous job tracks the deployment of this Metadata Extraction model in your Vault. Vault sends a notification to the Admin who performed the action when deployment is complete.

Once the deployment job finishes, the Trained Model record will move to the Deployed state and Vault will immediately begin extracting metadata from documents added to the Document Inbox.

Only one (1) Trained Model per Trained Model Type can be deployed at a time.

Replacing a Deployed Trained Model

To replace a deployed model with a new Trained Model, simply deploy the new model. It will replace the currently active model, and metadata extraction will not be interrupted. This is the recommended method for replacing models.

Additional Trained Model Actions & Details

You can only have five (5) Trained Models per Trained Model Type. If you attempt to train a sixth, Vault will advise you to archive a model before training another. To do so, select the Archive Model action on a Trained Model record. The Trained Model record will move to the Archived state. Archived models are not recoverable.

You can also remove deployed models and disable metadata extraction by using the Withdraw Model action on a Trained Model in the Deployed state. Doing so will move the Trained Model record back to the Trained state.
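
The lifecycle states mentioned throughout this article can be summarized as a small transition table. The sketch below is a reading aid, not Vault's implementation: the event labels "testing complete" and "deployment complete" are names invented here for the job-completion steps, and allowing Archive Model only from the Trained state is an assumption, since the text does not say which states permit archiving.

```python
# (current state, event) -> next state
TRANSITIONS = {
    ("In Training", "testing complete"): "Trained",
    ("Trained", "Deploy Model"): "In Deployment",
    ("In Deployment", "deployment complete"): "Deployed",
    ("Deployed", "Withdraw Model"): "Trained",
    ("Trained", "Archive Model"): "Archived",  # assumption: source state not stated
}


def next_state(state, event):
    """Return the state reached from `state` on `event`."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None


def walk(state, events):
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


final = walk("In Training",
             ["testing complete", "Deploy Model", "deployment complete"])
```

The table has no transitions out of Archived, which mirrors the statement that archived models are not recoverable.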