# Evaluating TMF Bot Auto-classification Models



<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: If you are unable to locate the answers you need, please see our <a href="/en/gr/73297/">TMF Bot FAQ</a>.</p>
    </div>
  </div>
</div>





<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: This article’s contents are specific to TMF Bot and do not apply to RIM Bot, a separate Vault AI feature. Users seeking information for the RIM Bot can reference <a class="external-link " href="https://regulatory.veevavault.help/en/gr/518092/" target="_blank" rel="noopener">Evaluating RIM Bot Auto-Classification Models<i class="fa fa-external-link" aria-hidden="true"></i></a>.</p>
    </div>
  </div>
</div>



After <a href="/en/gr/72747/">training an auto-classification model for TMF Bot</a>, there is a great deal of data available to help you evaluate the effectiveness of your _Trained Models_. This article explains how to use our Key Metrics to evaluate your _Trained Models_, defines each _Trained Model Artifact_ they provide, and describes how to identify issues and improve your model training.

## Auto-classification Evaluation Key Metrics

### Key Metrics Definitions

Three Key Metrics help you judge whether your _Trained Model_ is effective: Extraction Coverage, Auto-classification Coverage, and Auto-classification Error Rate. This section gives a basic description of each, along with suggested targets and example values from an actual _Trained Model_.

<table>
  <tr>
   <td><strong>Metric</strong>
   </td>
   <td><strong>Definition</strong>
   </td>
   <td><strong>What drives this result?</strong>
   </td>
   <td><strong>Suggested Target</strong>
   </td>
   <td><strong>Example</strong>
   </td>
  </tr>
  <tr>
   <td>Extraction Coverage
   </td>
   <td>The percentage of documents that had appropriate information to train the model
   </td>
   <td>Documents that fit our extraction criteria (extractable text)
   </td>
   <td>50–90%, with the lower end being appropriate for non-English customers
   </td>
   <td>An enterprise customer trained a model on 200,000 of their latest documents; their Extraction Coverage was 85.34%
   </td>
  </tr>
  <tr>
   <td>Auto-classification Coverage
   </td>
   <td>The percentage of documents that had a prediction above your <em>Prediction Confidence Threshold</em>
   </td>
   <td>Your <em>Prediction Confidence Threshold</em>, as well as the number and accuracy of documents used for training
   </td>
   <td>45–95%, with the lower end being appropriate for customers who train with a small number of documents (fewer than 5,000)
   </td>
   <td>A customer with a .90 <em>Prediction Confidence Threshold</em> was able to achieve 94% Auto-classification Coverage, whereas the same customer with a .99 <em>Prediction Confidence Threshold</em> had 89.65% Auto-classification Coverage
   </td>
  </tr>
  <tr>
   <td>Auto-classification Error Rate
   </td>
   <td>The percentage of documents with predictions above your <em>Prediction Confidence Threshold</em> that were incorrectly classified
   </td>
   <td>Your <em>Prediction Confidence Threshold</em>, as well as the number and accuracy of documents used for training
   </td>
   <td>Your target for this will depend on how risk-averse your organization is. Typically, lower is better, but keep in mind:
    <ul>
    <li>Users can still reclassify documents that have been auto-classified</li>
    <li>TMF Bot is not meant to be perfect, but instead more accurate than manual classification</li>
    <li>TMF Bot does this work automatically, saving users time on both classification and surfacing classification issues
    </li>
    </ul>
  </td>
   <td>A customer with a .90 <em>Prediction Confidence Threshold</em> had a .58% Auto-classification Error Rate, whereas the same customer with a .99 <em>Prediction Confidence Threshold</em> was able to achieve a .28% Auto-classification Error Rate
   </td>
  </tr>
</table>

### Using Key Metrics for Evaluation

In the section above, we introduced three Key Metrics for evaluating the effectiveness of your _Trained Model_. You can locate them in the _Training Summary Results_ field.

### Extraction Coverage

Extraction Coverage is the only Key Metric that you cannot improve. While this might be disconcerting, the purpose of this metric is to set the right expectations for documents added to the Document Inbox. If your company has many audio, video, or other non-text files; a significant number of non-English documents; or regular problems with blurry scans, this metric can help you understand why particular documents are not auto-classified in the Document Inbox.

### Improving Auto-classification Coverage

You can improve your _Auto-classification Coverage_ metric via the following methods:

* Lower your _Prediction Confidence Threshold_: A lower _Prediction Confidence Threshold_ may allow more documents to be covered by auto-classification, but be aware that it may raise your _Auto-classification Error Rate_.
* Evaluate outliers within the _Model Result Confusion Matrix_: Outliers are the counts that fall off the matrix's diagonal. You may find regularly occurring confusion between some document types, which you can reduce by reclassifying documents within your Vault or by excluding a particular document type from training. You will need to train a new _Trained Model_ record to capture these changes.
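
The outlier review described above can be sketched in a few lines of Python. This is a minimal illustration, not Vault functionality: the function name, the CSV layout (a header row of document types and a label in the first column of each row), and the sample document types are all assumptions, so check the layout of your own confusion matrix file before adapting it.

```python
import csv
import io


def off_diagonal_cells(csv_text, min_count=1):
    """Return (row_label, column_label, count) for cells off the diagonal, largest first.

    Assumes the first row holds column labels and the first column holds row labels.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    column_labels = rows[0][1:]
    cells = []
    for row in rows[1:]:
        row_label = row[0]
        for column_label, value in zip(column_labels, row[1:]):
            count = int(value)
            # A cell is off the diagonal when its row and column labels differ.
            if row_label != column_label and count >= min_count:
                cells.append((row_label, column_label, count))
    return sorted(cells, key=lambda cell: cell[2], reverse=True)


# Invented sample data; real files will contain your own document types.
SAMPLE_MATRIX = """label,Protocol,Informed Consent Form,CV
Protocol,120,3,0
Informed Consent Form,9,80,1
CV,0,0,45
"""

outliers = off_diagonal_cells(SAMPLE_MATRIX)
for row_label, column_label, count in outliers:
    print(f"{row_label} / {column_label}: {count}")
```

Sorting by count puts the pairs of document types with the most confusion first, which are usually the ones worth investigating in your Vault.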



<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: Users cannot edit the <em>Prediction Confidence Threshold</em> for system-trained models.</p>
    </div>
  </div>
</div>



### Improving Auto-classification Error Rate

You can improve your _Auto-classification Error Rate_ metric via the following methods:

* Raise your _Prediction Confidence Threshold_: A higher _Prediction Confidence Threshold_ may reduce your error rate as the model will be more confident in its auto-classification, but be aware that it may reduce your _Auto-classification Coverage_ as well.
* Evaluate documents within the **_Model Results Individual Predictions_** CSV that are above your _Prediction Confidence Threshold_ but are misclassified. Evaluating these documents in your Vault may reveal that TMF Bot was correct and the document was actually misclassified in your Vault. Or, if TMF Bot was wrong, the review may help you understand why.
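
The trade-off between coverage and error rate can be illustrated with a small sketch. The function and the sample predictions are hypothetical, and the denominators used here (all predictions for coverage, predictions above the threshold for error rate) are an assumption about how the Key Metrics are calculated rather than a confirmed formula.

```python
def coverage_and_error_rate(predictions, threshold):
    """Return (coverage, error_rate) for a list of (score, is_correct) pairs.

    Assumption: a prediction counts as "above" the threshold only when
    its score is strictly greater than the threshold.
    """
    above = [is_correct for score, is_correct in predictions if score > threshold]
    coverage = len(above) / len(predictions) if predictions else 0.0
    errors = len(above) - sum(above)
    error_rate = errors / len(above) if above else 0.0
    return coverage, error_rate


# Invented (first prediction score, was the prediction correct) pairs.
sample_predictions = [
    (0.999, True), (0.995, True), (0.97, True), (0.95, False),
    (0.92, True), (0.80, False), (0.60, True), (0.40, False),
]

for threshold in (0.90, 0.99):
    coverage, error_rate = coverage_and_error_rate(sample_predictions, threshold)
    print(f"threshold={threshold}: coverage={coverage:.1%}, error rate={error_rate:.1%}")
```

With this sample, raising the threshold from .90 to .99 lowers both the coverage and the error rate, which mirrors the pattern in the example values in the Key Metrics table.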

### Finishing Evaluation

Once you have evaluated the Key Metrics for your _Trained Model_, you can compare them against our suggested targets and form your own company targets. If your _Trained Model_ meets or exceeds your set target, this would be a good _Trained Model_ to deploy.

## _Trained Model Performance Metrics_

Every _Trained Model_ has a series of _Trained Model Performance Metrics_ records once training has been completed. You can find these in the _Model Performance Metrics_ section of your _Trained Model_ record. There are three _Metric Types_ for _Document Classification Trained Models_:

* Global Weighted Average: Contains average _Precision_, _Recall_, and _F1-Score_ weighted by the number of documents in each classification.
* Global Non-Weighted Average: Contains average _Precision_, _Recall_, and _F1-Score_ across all classifications regardless of the number of documents in each classification.
* Classification Performance: Contains the _Precision_, _Recall_, and _F1-Score_ for the document type listed in the _Metric Subtype_.
    * A special OTHERS UNKNOWN _Classification Performance_ record catches the documents with classifications that didn't meet the _Minimum Documents per Document Type_ threshold. These documents are still used in training but are grouped together to better inform predictions on the valid classifications.
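
The difference between the two global averages can be shown with a short sketch. The function name and the per-classification values are invented for illustration, and the sketch assumes standard weighted and unweighted means, which may not match Vault's calculation exactly.

```python
def global_averages(per_classification):
    """Return (weighted, non_weighted) averages.

    per_classification: list of (metric_value, document_count) pairs,
    one pair per classification.
    """
    total_documents = sum(count for _, count in per_classification)
    # Weighted: each classification contributes in proportion to its document count.
    weighted = sum(value * count for value, count in per_classification) / total_documents
    # Non-weighted: every classification contributes equally.
    non_weighted = sum(value for value, _ in per_classification) / len(per_classification)
    return weighted, non_weighted


# Invented example: one large, well-predicted classification and one small, weak one.
sample = [(0.9, 900), (0.5, 100)]
weighted, non_weighted = global_averages(sample)
print(f"weighted={weighted:.2f}, non-weighted={non_weighted:.2f}")
```

In this sample the weighted average is about 0.86 and the non-weighted average is about 0.70. A large gap between the two suggests that smaller classifications perform worse than larger ones.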

Each record shows the following metrics:

* **Precision**: How often the prediction made was accurate

<a href="https://platform.veevavault.help/assets/images/precision.png" data-lightbox="precision.png" data-title="" data-alt="Precision">
  <img class="docimage" src="https://platform.veevavault.help/assets/images/precision.png" alt="Precision" style=""  />
</a>

* **Recall**: Percentage of items within this _Metric Subtype_ that were correctly predicted

<a href="https://platform.veevavault.help/assets/images/recall.png" data-lightbox="recall.png" data-title="" data-alt="Recall">
  <img class="docimage" src="https://platform.veevavault.help/assets/images/recall.png" alt="Recall" style=""  />
</a>

* **F1-Score**: The balance between _Precision_ and _Recall_

<a href="https://platform.veevavault.help/assets/images/f1.png" data-lightbox="f1.png" data-title="" data-alt="F1-Score">
  <img class="docimage" src="https://platform.veevavault.help/assets/images/f1.png" alt="F1-Score" style=""  />
</a>

* **Training Documents**: The number of documents used for training the model. It will be 80% of the total documents used as input. This 80% is randomly selected within each classification.
* **Testing Documents**: The number of documents used for testing the model. This is the remaining 20% of the total documents used as input.
* **Correct Predictions**: The total number of times the model correctly predicted a classification regardless of the prediction confidence.
* **Predictions above Threshold**: The total number of predictions above the _Prediction Confidence Threshold_ selected on this _Trained Model_.
* **Correct Predictions above Threshold**: The total number of correct predictions above the _Prediction Confidence Threshold_ selected on this _Trained Model_.
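
As a rough illustration, the sketch below computes _Precision_, _Recall_, and _F1-Score_ using their standard definitions, and derives an error rate from the two above-threshold counts. The function names are hypothetical, and treating the error rate as the share of above-threshold predictions that were incorrect is an assumption, not a documented Vault formula.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall, and F1 for a single classification."""
    predicted = true_positives + false_positives   # everything predicted as this type
    actual = true_positives + false_negatives      # everything that truly is this type
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def error_rate_above_threshold(predictions_above, correct_predictions_above):
    """Assumed formula: share of above-threshold predictions that were incorrect."""
    if predictions_above == 0:
        return 0.0
    return (predictions_above - correct_predictions_above) / predictions_above


# Invented counts for illustration only.
precision, recall, f1 = precision_recall_f1(true_positives=8, false_positives=2, false_negatives=8)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"error rate={error_rate_above_threshold(1000, 995):.2%}")
```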

It is important to note that all predictions marked as correct assume the inputs are classified correctly. Any misclassified documents used to train the model may lead to inaccurate auto-classification. The _Trained Model Artifacts_ listed below can help reveal potential issues.

## _Trained Model Artifacts_

_Trained Models_ have a series of _Trained Model Artifacts_ attached, each containing valuable data. You can find these on a _Trained Model_ object record in the _Trained Model Artifacts_ section. The artifacts include the following files for _Trained Models_ using the _Document Classification_ type:

* **Document Set Extract Results** (documentset_extract_results.csv): Extraction results for each document requested for training this model.
    * This file is most helpful in seeing why some documents were not used during the training process.
    * You can use the _Document ID_ & _Major/Minor Version_ within this file to view the appropriate document within Vault.
    * See Reasons for Extraction Failures below for a list of potential failure reasons.
* **Model Results Confusion Matrix** (model_results_confusion_matrix.csv): Compares the actual classification of documents (the X-axis) to TMF Bot's predicted classification (the Y-axis).
    * The diagonal should have the highest numbers, as this is where the true classification and predicted classification intersect.
    * Numbers above and below the diagonal indicate confusion. You should investigate classifications with larger numbers of incorrect predictions to understand the reason for the _Trained Model_ confusion.
* **Model Results Document Type Frequency** (model_results_doctype_frequency.csv): Lists all the document types used, the total documents used from each, and the numbers used for training (80%) and testing (20%), respectively. Classifications below the _Minimum Documents per Document Type_ are grouped into OTHERS UNKNOWN.
* **Model Results Individual Predictions** (model_results_individual_predictions.csv): Shows the true document type, encoded document type, top three predictions, and the top three prediction scores for each document, and if the document was misclassified.
    * The _True Document Type_ column lists the actual classification from Vault. The _Encoded Document Type_ column shows what you provided to the model: the actual classification or OTHERS UNKNOWN, the latter being for those document types with fewer than the _Minimum Documents per Document Type_ on the _Trained Model_.
    * The file shows three _Prediction Scores_, one for each _Document Type Prediction_. Auto-classification only uses the _First Prediction Score_. If this score is above the _Prediction Confidence Threshold_, that document would have been auto-classified. The second and third scores are for informational purposes only.
    * Lastly, _Misclassified_ shows if the _Encoded Document Type_ and the _First Document Type Prediction_ match or not. Filtering to see the _Misclassified_ items can quickly reveal potential issues with your existing documents. For example, if the _Trained Model_ has a _First Prediction Score_ of .9999887 and was misclassified, the document may actually be misclassified in your Vault.
* **Model Results Performance Metrics** (model_results_performance_metrics.csv): A CSV version of the _Model Performance Metrics_ data on this record.
* **Model Results Training Set** (model_results_training_set.csv): Lists the individual documents used for training (the 80% of the entire set of documents) and their classifications. This file can be helpful if you want to review the documents used to train a specific classification, especially if you notice that classification is often misclassified.
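
Filtering the **Model Results Individual Predictions** file for confident but misclassified documents could look like the sketch below. The column headers, the `true`/`false` values in the misclassified column, and the sample rows are assumptions for illustration; adjust them to match the headers in your own CSV.

```python
import csv
import io

# Assumed column headers; the actual file may spell these differently.
SCORE_COLUMN = "First Prediction Score"
MISCLASSIFIED_COLUMN = "Misclassified"


def confident_misclassifications(csv_text, threshold):
    """Return misclassified rows scoring above the threshold, highest score first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    flagged = [
        row for row in reader
        if row[MISCLASSIFIED_COLUMN].strip().lower() == "true"
        and float(row[SCORE_COLUMN]) > threshold
    ]
    return sorted(flagged, key=lambda row: float(row[SCORE_COLUMN]), reverse=True)


# Invented sample rows; a real file has more columns.
SAMPLE_PREDICTIONS = """Document ID,Encoded Document Type,First Document Type Prediction,First Prediction Score,Misclassified
101,Protocol,Protocol,0.9991,false
102,CV,Protocol,0.9999887,true
103,CV,Protocol,0.61,true
104,Protocol,CV,0.97,true
"""

flagged_rows = confident_misclassifications(SAMPLE_PREDICTIONS, threshold=0.90)
for row in flagged_rows:
    print(row["Document ID"], row["Encoded Document Type"], row[SCORE_COLUMN])
```

The rows at the top of the result are the best candidates for checking whether the document is misclassified in your Vault.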

### Reasons for Extraction Failures

The **Document Set Extract Results** CSV file gives one of the following reasons for an extraction failure:

* **Language Not Detected**: The system could not detect a language
* **Language Not Supported**: The language detected was not English
* **Not Confident in Language Detection**: The system was not confident in its language detection
* **No Text Available**: The document had no extractable text
* **OCR failed for PDF**: Optical character recognition (OCR) could not complete for a PDF file
* **OCR failed for Complex Image**: OCR could not complete for a complex image format, such as TIFF
* **OCR failed for Simple Image**: OCR could not complete for a simple image format, such as PNG or JPG
* **PDF Rendering failed**: TMF Bot could not render the document as a PDF
* **Current Document Type is Inactive**: The document type for this document is no longer active
* **Steady State not Found**: The document does not have a steady state version
* **This Document Type is Intentionally Excluded**: Document was a binder or in the TMF Document or Final CRF Document type
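
To see which of these reasons affects your documents most, you could tally them from the **Document Set Extract Results** file. The sketch below is a minimal example: the `Failure Reason` column header and the sample rows are assumptions, so substitute the header used in your own file.

```python
import csv
import io
from collections import Counter


def count_failure_reasons(csv_text, reason_column="Failure Reason"):
    """Count extraction failure reasons, skipping rows with no reason recorded.

    The default column name is an assumption; pass the real header if it differs.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    reasons = ((row.get(reason_column) or "").strip() for row in reader)
    return Counter(reason for reason in reasons if reason)


# Invented sample rows; an empty reason stands for a successful extraction.
SAMPLE_EXTRACT_RESULTS = """Document ID,Failure Reason
201,No Text Available
202,Language Not Supported
203,
204,No Text Available
"""

reason_counts = count_failure_reasons(SAMPLE_EXTRACT_RESULTS)
for reason, count in reason_counts.most_common():
    print(f"{reason}: {count}")
```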

<div class="note-border alert-info">
  <div class="alert alert-info" role="alert">
    <div><i class="far fa-info-circle"></i></div>
    <div class="alert-text">
      <p><strong>Note</strong>: To enable auto-classification and metadata extraction functionality for documents in languages other than English, contact Veeva Services to enable the Multilingual Model feature.</p>
    </div>
  </div>
</div>




[comment]: # Images in this file are also used in the RIM version of this article (518092)
