# Creating & Testing Metadata Extraction Models

To use TMF Bot features like study metadata extraction, a Metadata Extraction model must be tested and deployed. This model does not create a custom machine learning model specific to your document types, but instead relies on a standard approach to search documents' content and filenames for matches to records in your Vault. Testing a Metadata Extraction model allows you to assess potential metadata extraction performance prior to enabling it in your Vault.
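
The exact-match idea can be illustrated with a small sketch. This is not Vault's implementation: the function name, the case-insensitive comparison, and the rule that a match must not sit inside a longer alphanumeric token are all assumptions made for illustration.

```python
import re


def find_study_matches(content: str, filename: str, study_names: list[str]) -> list[str]:
    """Return the study names that appear verbatim in the content or filename.

    Hypothetical helper: matching is case-insensitive and a hit is rejected
    when it is embedded in a longer run of letters or digits.
    """
    haystack = f"{filename}\n{content}".lower()
    matches = []
    for name in study_names:
        # Lookarounds keep "NCT1234567" from matching inside "NCT12345678",
        # while still allowing separators such as "_" or "." in filenames.
        pattern = r"(?<![a-z0-9])" + re.escape(name.lower()) + r"(?![a-z0-9])"
        if re.search(pattern, haystack):
            matches.append(name)
    return matches
```

For example, `find_study_matches("", "NCT12345678_protocol.pdf", ["NCT12345678"])` finds the study from the filename alone.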

## How to Test a Model

The TMF Bot requires input to learn before performing tasks on its own. Generally, the larger and more accurate the inputs, the better the resulting model will be. Vault stores accumulated input in _Trained Model_ object records.

## Prediction Confidence Threshold

In machine learning models such as Document Classification, Vault uses the Prediction Confidence Threshold field value on a _Trained Model_ record to determine what score is required before the system will use that Prediction. Because metadata extraction follows stricter logic that searches for exact matches, this value will be set to 0.9999 for Metadata Extraction models.
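
As a minimal sketch of how a confidence threshold gates predictions, consider the following. The `Prediction` class and `usable_predictions` function are hypothetical names, and treating a score equal to the threshold as usable is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical container for one extracted value and its score."""
    document_id: int
    value: str
    confidence: float


def usable_predictions(predictions: list[Prediction], threshold: float) -> list[Prediction]:
    """Keep only predictions whose score reaches the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    # Assumption: a score exactly equal to the threshold is accepted.
    return [p for p in predictions if p.confidence >= threshold]
```

With a threshold of 0.9999, a prediction scored 0.95 is dropped while one scored 1.0 is kept.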

## Creating a Study Metadata Extraction Model

Before creating a Study Metadata Extraction model, carefully consider the following limitations:

* Vault cannot extract _Study_ data from certain categories of documents, and these documents cannot be used for testing. These include:
    * Video and audio files
    * Non-text files, such as ZIP files, statistical files, or database files
    * Non-English documents, unless your Vault has the Multilingual Model feature enabled.
    * Documents where Vault cannot extract text, for example, if the text is too blurry or if the file is password-protected or encrypted.
* We recommend using at least 3,000 documents in steady states, such as _Approved_ or _Final_, to test the Study Metadata Extraction model. You may use TMF Bot on Vaults with fewer documents, but note that metrics measuring its potential performance will not be as robust.
* If any of your input documents have an incorrect _Study_ value, your performance metrics will be negatively impacted. For example, suppose several documents should have a _Study_ ID of NCT12345678 and include that ID in their content, but are erroneously recorded with a _Study_ ID of NCT12345679. Vault counts them as incorrect predictions in your _Trained Model Performance Metrics_, so the metrics may underestimate the effectiveness of metadata extraction once deployed in your Vault.
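
The effect of mislabeled inputs can be seen in a small worked example. This sketch does not reproduce Vault's metric formulas; it only assumes that a prediction is counted as incorrect whenever the extracted value differs from the value recorded on the document.

```python
def count_incorrect(records: list[tuple[str, str]]) -> int:
    """Count (recorded_study, extracted_study) pairs that disagree."""
    return sum(1 for recorded, extracted in records if extracted != recorded)


records = [
    ("NCT12345678", "NCT12345678"),  # recorded correctly, extracted correctly
    ("NCT12345679", "NCT12345678"),  # recorded wrongly; the extraction was right
    ("NCT12345679", "NCT12345678"),  # recorded wrongly; the extraction was right
]

incorrect = count_incorrect(records)
apparent_error_rate = incorrect / len(records)
print(f"{incorrect} of {len(records)} counted as incorrect")
```

Here every extraction found the right study, yet two of the three documents are scored as errors because their recorded _Study_ value is wrong.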

<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: To enable auto-classification and metadata extraction functionality for documents in languages other than English, contact Veeva Services to enable the Multilingual Model feature.</p>
    </div>
  </div>
</div>



## Creating the Study Metadata Extraction Object Record

1. Navigate to **Admin > Business Admin** and click into the _Trained Model_ object.
2. Click **Create**.
3. For the **Trained Model Type**, select **Metadata Extraction**.
4. Enter a **Prediction Confidence Threshold**.
    * TMF Bot will not use any predictions below this threshold for auto-classification. While Vault will accept any value between zero (0) and one (1), we recommend using a value of .9 or above.
    * Once you have tested a _Trained Model_, you cannot change this value.
    * Generally, the higher the number, the more accurate the classifications; however, you may extract metadata from fewer documents.
5. If you intend to use the Training Window training method, set the **Training Window Start Date** accordingly.
6. Under **Model Parameters**, the **Minimum Documents per Document Type** will default to 100. If you wish to adjust this, you can do so from a list of the _Trained Models_.
    * All documents in your document set will be evaluated for metadata extraction. To have _Performance Metric_ records for an individual classification, the document set must include, for that classification, at least the _Minimum Documents per Document Type_.
    * Classifications with fewer than the _Minimum Documents per Document Type_ will still be evaluated, but will be combined under a record with a Metric Subtype of "OTHERS UNKNOWN (Remaining Document Types that did not meet the Minimum Documents per Document Type)".
    * If you find that you have a large number of classifications being combined under "OTHERS UNKNOWN," you can lower the _Minimum Documents per Document Type_. Below are suggested values depending on the number of documents in your Vault at the time of model testing.
        * 1,000 to 10,000 documents = 10
        * 10,000 to 25,000 documents = 15
        * 25,000 to 50,000 documents = 25
        * 50,000 to 100,000 documents = 50
        * 100,000 to 150,000 documents = 75
        * 150,000 to 200,000 documents = 100
    * The _Advanced Model Parameters_ field is system-managed; you do not need to set anything here.
7. Click **Save**.
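
The suggested values in step 6 and the "OTHERS UNKNOWN" grouping can be sketched as follows. The function names are hypothetical. Because the listed ranges share their boundary values (10,000 appears in two ranges, for example), this sketch assumes a boundary count belongs to the lower range. It returns `None` outside the listed ranges rather than guessing.

```python
from collections import Counter
from typing import Optional

# (upper bound of range, suggested minimum), taken from the list in step 6.
SUGGESTED_MINIMUMS = [
    (10_000, 10),
    (25_000, 15),
    (50_000, 25),
    (100_000, 50),
    (150_000, 75),
    (200_000, 100),
]

OTHERS = "OTHERS UNKNOWN"


def suggested_minimum(document_count: int) -> Optional[int]:
    """Look up the suggested Minimum Documents per Document Type."""
    if document_count < 1_000 or document_count > 200_000:
        return None  # no suggestion is listed for this size
    for upper, minimum in SUGGESTED_MINIMUMS:
        # Assumption: a count on a shared boundary uses the lower range.
        if document_count <= upper:
            return minimum
    return None


def metric_groups(classifications: list[str], minimum: int) -> dict[str, int]:
    """Group document counts the way the text describes metric records.

    Classifications with at least `minimum` documents keep their own entry;
    the rest are pooled under OTHERS.
    """
    groups: Counter = Counter()
    for name, count in Counter(classifications).items():
        groups[name if count >= minimum else OTHERS] += count
    return dict(groups)
```

For instance, `metric_groups(["Protocol"] * 3 + ["CV", "Memo"], 2)` keeps "Protocol" as its own group and pools the other two documents under "OTHERS UNKNOWN".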

### Choosing a Document Set Method

To test your model, you'll need to choose a method to pull documents to use as input in this Metadata Extraction _Trained Model_. There are two options: **Training Window Start Date** and **Attached CSV of Document IDs**.

#### Training Window Start Date

The **Training Window Start Date** method ignores Archived documents; this is a known issue. If you want to train on Archived documents, you must use the **Attached CSV of Document IDs** method.

This method pulls all documents in a steady state, such as _Approved_ or _Final_, with a Version Created Date value between the Training Window Start Date and the current date. If more than 40,000 documents fit these criteria, Vault uses the 40,000 most recent documents. If you choose this method, ensure you have filled in a _Training Window Start Date_ value on your _Trained Model_ record.
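
The selection rule can be sketched as a filter, a sort, and a cap. The `Doc` class, its field names, and the inclusive date bounds are assumptions for illustration. The set of steady states contains only the two examples named above, and real lifecycles may define others.

```python
from dataclasses import dataclass
from datetime import date

STEADY_STATES = {"Approved", "Final"}  # examples from the text, not a full list
MAX_DOCUMENTS = 40_000


@dataclass
class Doc:
    """Hypothetical stand-in for a Vault document record."""
    doc_id: int
    state: str
    version_created: date
    archived: bool = False


def select_training_window(docs: list[Doc], start: date, today: date,
                           limit: int = MAX_DOCUMENTS) -> list[Doc]:
    """Pick steady-state, non-archived documents in the window, newest first."""
    eligible = [
        d for d in docs
        if d.state in STEADY_STATES
        and not d.archived  # this method ignores Archived documents
        and start <= d.version_created <= today  # inclusive bounds assumed
    ]
    eligible.sort(key=lambda d: d.version_created, reverse=True)
    return eligible[:limit]
```

The `limit` parameter exists only so the cap can be exercised with a small sample.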

#### Attached CSV of Document IDs

This method takes a list of _Document IDs_ as input. A _Document ID_ is Vault's unique identifier for a document, so this method allows admins to tailor the list of documents used to train the Metadata Extraction _Trained Model_. While you can use any process that results in a list of _Document IDs_, the following steps use a report to produce one:

1. Create a new [report](/en/lr/3606/). Add filters to find the documents you wish to use.
2. Add the _Document ID_ field as a column.
3. Run the report and export the results to CSV.
4. Open the exported file. Change the name of the _Document ID_ column to `id`.
5. Save the file as `documentset.csv`.

Your CSV file cannot contain more than 40,000 Document IDs.
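
Steps 4 and 5 can also be scripted. The sketch below assumes the exported header is literally `Document ID` and that the export is UTF-8, neither of which is confirmed here, so adjust both to match your actual export. The function name is hypothetical. It renames only the header, leaves all rows unchanged, and refuses to write a file that exceeds the 40,000-ID limit.

```python
import csv

MAX_IDS = 40_000


def build_document_set(export_path: str,
                       output_path: str = "documentset.csv",
                       id_column: str = "Document ID") -> int:
    """Rename the Document ID column to "id" and save the result.

    Returns the number of data rows written.
    """
    # utf-8-sig strips a byte order mark if the export contains one.
    with open(export_path, newline="", encoding="utf-8-sig") as src:
        rows = list(csv.reader(src))
    if not rows or id_column not in rows[0]:
        raise ValueError(f"column {id_column!r} not found in export")

    header = rows[0]
    data = [row for row in rows[1:] if row]  # skip blank lines
    if len(data) > MAX_IDS:
        raise ValueError(f"{len(data)} IDs exceeds the limit of {MAX_IDS}")

    header[header.index(id_column)] = "id"
    with open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(data)
    return len(data)
```

Calling `build_document_set("report_export.csv")` would write `documentset.csv` to the current directory, where `report_export.csv` is a placeholder for your exported report.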

Using the Document ID method allows admins to select any documents to train the model. We strongly recommend that the IDs provided include all document types that users may send to the Document Inbox. The TMF Bot will try to classify every document that comes into the Inbox, and if it didn't learn certain document types, documents of those types are likely to be misclassified.

Once you have created `documentset.csv`, upload it as a _Trained Model Artifact_ to your _Trained Model_ record.

### Testing the Metadata Extraction Model

Once you have determined the appropriate Document Set Method, perform the **Test Model** action. Choose the appropriate Document Set method when prompted, then click **Start**. The _Trained Model_ record will move to the In Training state.

An asynchronous job tracks two activities as part of testing:
1. **Document Extraction**: During this process, the system collects the data from the specified document set. The output is a CSV file (`document_extract_results.csv`), attached under _Trained Model Artifacts_, in which an Admin can see which documents could be used as input and which could not. Vault sends a notification to the Admin who started the action when the extraction is complete.
2. **Model Testing**: During this process, the system will test metadata extraction using the specified document set (the documents in `document_extract_results.csv` where extraction was successful). The output is a number of performance metrics in the _Trained Model Performance Metrics_ object. Vault sends a notification to the Admin who started the action when testing is complete.

The time required to complete these jobs varies depending on the number of documents used as input, from about 1 hour for Vaults testing on 3,000 documents to about 12 hours for Vaults testing on 40,000 documents.

Once Model Testing is complete, the _Trained Model_ record will move to the Trained state.

### Testing a Metadata Extraction Model in Pre-Release or Sandbox Environments with Production Data

You can test a Metadata Extraction model in your Pre-Release or Sandbox Vault with production documents for evaluation purposes. You cannot move the resulting Metadata Extraction model to your production environment.

Both methods for document selection are available. If you're using the _Attached CSV of Document IDs_ method, be sure to use Document IDs from your production Vault.

To test using production data, run the **Test Model From Production Data** action. This action is only visible in Pre-Release and Sandbox Vaults.

After evaluating your Metadata Extraction model, you'll need to perform testing again in your production Vault to begin using TMF Bot features there. If you use the _Attached CSV of Document IDs_ method of document selection, you can use this same list of documents to create a similar Metadata Extraction model in your production environment.

### Evaluating the Trained Model

There are three Key Metrics to ensure your Metadata Extraction model is viable: Extraction Coverage, Study Metadata Extraction Coverage, and Study Metadata Extraction Error Rate. These can be found in the Training Summary Results field of your _Trained Model_ record.

### Deploying the _Trained Model_

Once you have evaluated your Metadata Extraction model, select the **Deploy Model** action from the _Trained Model_ record, review the prompt to ensure you agree with the outcome, and click **Start**. The _Trained Model_ record will move to the In Deployment state.

An asynchronous job tracks the deployment of this Metadata Extraction model in your Vault. Vault sends a notification to the Admin who performed the action when deployment is complete.

Once the deployment job finishes, the _Trained Model_ record will move to the Deployed state and Vault will immediately begin extracting metadata from documents added to the Document Inbox.

Only one (1) _Trained Model_ per _Trained Model Type_ can be deployed at a time.

### Replacing a Deployed Trained Model

To replace a deployed model with a new _Trained Model_, simply deploy the new model. It will replace the currently active model, and metadata extraction will not be interrupted. This is the recommended method for replacing models.

### Additional _Trained Model_ Actions & Details

You can only have five (5) _Trained Models_ per _Trained Model Type_. If you attempt to train a sixth, Vault will advise you to archive a model before training another. To do so, select the **Archive Model** action on a _Trained Model_ record. The _Trained Model_ record will move to the Archived state. Archived models are not recoverable.

You can also remove deployed models and disable metadata extraction by using the **Withdraw Model** action on a _Trained Model_ in the Deployed state. Doing so will move the _Trained Model_ record back to the Trained state.
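
The state changes described on this page can be summarized as a small transition table. This is a reading aid, not a description of how Vault enforces them. The event names are hypothetical, a `None` prerequisite means this page does not name the required starting state, and counting only non-archived models toward the limit of five is an inference from the archive advice above.

```python
from typing import Optional

MAX_MODELS_PER_TYPE = 5

# event: (required current state, resulting state)
TRANSITIONS: dict[str, tuple[Optional[str], str]] = {
    "test_model_started": (None, "In Training"),
    "model_testing_complete": ("In Training", "Trained"),
    # Assumption: Deploy Model is run from the Trained state.
    "deploy_model_started": ("Trained", "In Deployment"),
    "deployment_complete": ("In Deployment", "Deployed"),
    "withdraw_model": ("Deployed", "Trained"),
    "archive_model": (None, "Archived"),
}


def apply_event(state: str, event: str) -> str:
    """Return the state a Trained Model record moves to after an event."""
    if state == "Archived":
        raise ValueError("archived models are not recoverable")
    required, result = TRANSITIONS[event]
    if required is not None and state != required:
        raise ValueError(f"{event} expects state {required!r}, got {state!r}")
    return result


def can_train_another(states_for_type: list[str]) -> bool:
    """Check the five-model limit, ignoring archived records (an inference)."""
    active = [s for s in states_for_type if s != "Archived"]
    return len(active) < MAX_MODELS_PER_TYPE


# "Untested" is a hypothetical label; the page does not name the initial state.
state = "Untested"
for event in ("test_model_started", "model_testing_complete",
              "deploy_model_started", "deployment_complete", "withdraw_model"):
    state = apply_event(state, event)
    print(f"{event} -> {state}")
```

Running the loop walks one record through testing, deployment, and withdrawal, ending back in the Trained state.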
