To use certain TMF Bot features like document auto-classification, a Trained Model must be trained and deployed. This training allows the machine learning model to learn from your inputs, preparing it to intelligently process data.

Vault automatically creates a Trained Model, of type Document Classification, in all eTMF Vaults with at least 1,500 Steady state documents. As long as a Trained Model is not already deployed, Vault also deploys the model for you. This process occurs once per release, so if you wish to update your Trained Model at any time (to reflect new document types, or to attempt to improve your results, for example), you must follow the process described here to train, evaluate, and deploy it.

Trained Models with a Trained Model type of Document Classification allow auto-classification of documents in the Document Inbox, and quality control of manual document classification.

How Auto-trained Models Work

The TMF Bot is auto-on for all eTMF customers with at least 1,500 Steady state documents. The system creates, trains, and deploys a Document Classification Trained Model as follows:

  1. The Auto-Train Models job runs each night at 1:00 AM EST on all production and pre-release Vaults (see note 1). This job checks that:
    1. No system-created model has been created since the last major release
    2. There are at least 1,500 Steady state documents in this Vault
  2. The job creates a Trained Model of Trained Model Type Document Classification with .95 as the Prediction Confidence Threshold and an appropriate value for Minimum Documents per Document Type based on the number of documents that will be used for training
  3. The latest document versions (based on the Version Created Date) that fall into the following categories are used to train this model:
    1. In a Steady state (Approved/Final)
    2. Not a Binder
    3. Not in the TMF Document or Final CRF document type
    4. Not in a document type mapped to “Sites Evaluated but not Selected”
    5. The document has pages
  4. If there is already a deployed model of the same Trained Model Type in this Vault, the auto-trained model stays in the training state; otherwise, it is automatically deployed after it finishes training.
  5. Once the auto-trained model is deployed, any documents uploaded to the Inbox may be auto-classified by the TMF Bot.
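The checks and filters in the steps above can be summarized in a short Python sketch. The Document class, its field names, and the helper functions are hypothetical and exist only for illustration; they are not part of any Vault API.

```python
from dataclasses import dataclass

MIN_STEADY_STATE_DOCS = 1500  # threshold stated in the steps above
EXCLUDED_TYPES = {"TMF Document", "Final CRF"}  # document types named in step 3


@dataclass
class Document:
    # Hypothetical, simplified view of the latest version of a document.
    doc_type: str
    steady_state: bool
    is_binder: bool
    page_count: int
    mapped_to_sites_not_selected: bool = False


def should_auto_train(steady_state_count: int, system_model_since_release: bool) -> bool:
    """Step 1: both job checks must pass before a model is created."""
    return (not system_model_since_release
            and steady_state_count >= MIN_STEADY_STATE_DOCS)


def is_training_candidate(doc: Document) -> bool:
    """Step 3: apply the five document filters."""
    return (
        doc.steady_state
        and not doc.is_binder
        and doc.doc_type not in EXCLUDED_TYPES
        and not doc.mapped_to_sites_not_selected
        and doc.page_count > 0
    )


def auto_deploys(model_of_same_type_deployed: bool) -> bool:
    """Step 4: deploy automatically only when nothing is deployed yet."""
    return not model_of_same_type_deployed
```

For example, should_auto_train(1499, False) is False because the Vault is one document short of the minimum.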

With each major release a new auto-trained model will be created and trained. If you are currently using the previous release’s auto-trained model, Vault automatically deploys this new model to replace the old Trained Model. This ensures the system is training on the latest documents that represent your document hierarchy.

Given the number of eTMF customers, auto-training and deploying models in your Vault may take one to three days after each major release.

How To Train a Model

Like all machine learning tools, the TMF Bot requires input to learn before performing tasks on its own. Generally, the larger and more accurate the inputs, the better the resulting model will be. Vault stores accumulated input in Trained Model object records.

Prediction Confidence

Vault uses a Prediction Confidence score to indicate how certain TMF Bot is that its prediction is correct. This value is between 0 (likely wrong) and 1 (likely correct). The better your inputs, the higher the Prediction Confidence will be. Vault stores Prediction Confidence scores in Prediction object records.

Prediction Confidence Threshold

Vault uses the Prediction Confidence Threshold field value on a Trained Model record to determine what score is required before the system will use that Prediction. For example, in the case of auto-classification, if your Prediction Confidence Threshold value is .95 and the Prediction Confidence for the document uploaded to the Document Inbox is .9728, Vault auto-classifies that document.
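The comparison described above can be written as a one-line check. This is a minimal sketch: the function name is hypothetical, and treating a score exactly equal to the threshold as passing is an assumption based on the statement that TMF Bot does not use predictions below the threshold.

```python
def should_auto_classify(confidence: float, threshold: float) -> bool:
    """Return True when a prediction is confident enough to be used."""
    for value in (confidence, threshold):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and thresholds must be between 0 and 1")
    # Assumption: a score equal to the threshold counts as meeting it.
    return confidence >= threshold


# The example from the text: threshold .95, confidence .9728.
print(should_auto_classify(0.9728, 0.95))
```

With the same threshold, a document scored at .90 would be left for manual classification.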

Creating a Document Classification Trained Model

Before creating a Trained Model, carefully consider the following limitations:

  • Vault allows Admins to train models in Pre-release or Sandbox environments using their production environment documents, verifying the training process. These models, however, cannot be moved to your production Vault, so Trained Models must be created and trained in the production environment as well.
  • Certain categories of documents cannot be auto-classified or used in model training. These include:
    • Video and audio files
    • Non-text files, such as ZIP files, statistical files, or database files
    • Non-English documents. You may use documents that are only partially in English for model training.
    • Documents where Vault cannot extract text, for example, if the text is too blurry.
  • We recommend using at least 3,000 documents in steady states, such as Approved or Final, to train the machine learning model. You may use TMF Bot on Vaults with 1,000-3,000 documents, but note that it may limit the quality of your predictions.
  • If any of your inputs are misclassified documents, your predictions may be negatively impacted. For example, if several documents that should have been classified as Legal > Contracts > Vendor were classified as Legal > Agreements > External, TMF Bot will be less confident about predictions for those document types.

Creating the Trained Model Object Record

  1. Navigate to Admin > Business Admin and click into the Trained Model object.
  2. Click Create.
  3. For the Trained Model Type, select Document Classification.
  4. Enter a Prediction Confidence Threshold.
    • TMF Bot will not use any predictions below this threshold for auto-classification. While Vault will accept any value between zero (0) and one (1), we recommend using a value of .9 or above.
    • Once you have sent a Trained Model for training, you cannot change this value.
    • Generally, the higher the number, the more accurate the classifications; however, you may auto-classify fewer documents. See more details on evaluation.
  5. If you intend to use the Training Window training method, set the Training Window Start Date accordingly.
  6. Under Model Parameters, set the Minimum Documents per Document Type.
    • Any document types with fewer than this minimum number of documents cannot be auto-classified. Higher minimums may yield better Prediction Confidence but will exclude more document types from auto-classification. Suggested values, by number of documents used for training:
      • 1,000 to 10,000 documents = 10
      • 10,000 to 25,000 documents = 15
      • 25,000 to 50,000 documents = 25
      • 50,000 to 100,000 documents = 50
      • 100,000 to 150,000 documents = 75
      • 150,000 to 200,000 documents = 100
    • The Advanced Model Parameters field is system-managed; you do not need to set anything here.
  7. Click Save.
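The document-count ranges listed under step 6 can be encoded as a small lookup. This is only an illustrative sketch: the function names are hypothetical, and because the listed ranges share their boundary values (10,000 appears in two ranges, for example), assigning a boundary value to the lower range is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Set

# (upper bound of range, Minimum Documents per Document Type), from step 6.
_MINIMUMS_BY_UPPER_BOUND = [
    (10_000, 10),
    (25_000, 15),
    (50_000, 25),
    (100_000, 50),
    (150_000, 75),
    (200_000, 100),
]


def suggested_minimum(training_doc_count: int) -> int:
    """Look up the listed minimum for a given number of training documents."""
    if not 1_000 <= training_doc_count <= 200_000:
        raise ValueError("the listed ranges cover 1,000 to 200,000 documents")
    for upper_bound, minimum in _MINIMUMS_BY_UPPER_BOUND:
        if training_doc_count <= upper_bound:  # boundary goes to lower range
            return minimum
    raise AssertionError("unreachable: range was validated above")


def types_below_minimum(doc_types: Iterable[str], minimum: int) -> Set[str]:
    """Document types with too few examples to be auto-classified."""
    counts: Dict[str, int] = Counter(doc_types)
    return {doc_type for doc_type, n in counts.items() if n < minimum}
```

Running types_below_minimum over the document types in a planned training set shows which types a given minimum would exclude before you commit to it.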

After creating the Trained Model object record, choose a training method.

Training Model Filters

You can customize a Trained Model by compiling a custom list of documents to use for training. There are two possible methods.

Method 1: Attach CSV of Document IDs

You can attach a CSV with Document IDs to your Trained Model. Vault evaluates this list of documents when you train the model so that it knows which set of documents to use to train the model.

To add a CSV to a Trained Model:

  1. Under Document Trained Model Artifacts, click on Upload.
  2. Browse your computer and select your CSV file containing the desired document IDs.
  3. Train your model and select Attached CSV of Document IDs as the Document Set Source.

Once you train the model, Vault automatically sets the Training Set Type to List of Document IDs.

Method 2: VQL Query

You can add a custom VQL Query to your Trained Model. Vault evaluates this query when you train the model so that it knows which set of documents to use to train the model.

To add a custom VQL Query to a custom Trained Model:

  1. Under Document Criteria, enter your Document Criteria - VQL.
  2. Click Validate. Vault evaluates the VQL Query.
  3. When your VQL query is valid, train your model and select Training Window Start Date as Document Set Source.

When you Validate the Document Criteria - VQL field, Vault displays a green banner at the top of the screen if the query is valid.

If the query is invalid, Vault displays an error message below the Document Criteria - VQL field.

Once you train the model, Vault automatically sets the Training Set Type to Document Criteria.

Choosing a Document Set Method

To train your model, you’ll need to choose a method to pull documents to use as input in this Trained Model. There are two options: Training Window Start Date and Attached CSV of Document IDs.

Training Window Start Date

The Training Window Start Date method ignores Archived documents; this is a known issue. If you want to train on Archived documents, you must use the Attached CSV of Document IDs method.

This method pulls all documents in a Steady State, such as Approved or Final, with a Version Created Date value between the Training Window Start Date and the current date. If more than 200,000 documents fit these criteria, Vault uses the 200,000 most recent documents. If you choose this method, ensure you have filled in a Training Window Start Date value on your Trained Model record.
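The selection rule for this method can be sketched in Python as follows. The DocVersion class and its fields are hypothetical stand-ins for document metadata, and treating both ends of the date window as inclusive is an assumption; the sketch also skips Archived documents, as described above.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

MAX_TRAINING_DOCS = 200_000  # cap stated in the text


@dataclass
class DocVersion:
    # Hypothetical, simplified document metadata.
    doc_id: int
    steady_state: bool
    archived: bool
    version_created: date


def select_training_window(docs: List[DocVersion], start: date, today: date,
                           limit: int = MAX_TRAINING_DOCS) -> List[DocVersion]:
    """Keep Steady state, non-Archived documents inside the window,
    then keep only the most recent ones if there are more than `limit`."""
    eligible = [
        d for d in docs
        if d.steady_state
        and not d.archived
        and start <= d.version_created <= today  # assumption: inclusive ends
    ]
    eligible.sort(key=lambda d: d.version_created, reverse=True)
    return eligible[:limit]
```

Passing a small limit makes the "most recent first" behavior easy to check on a handful of sample records.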

Attached CSV of Document IDs

This method takes a list of Document IDs as input. A Document ID is Vault’s unique identifier for a document, so supplying a list of IDs lets Admins tailor exactly which documents are used to train the Trained Model. While you can use any process that results in a list of Document IDs, the following steps use a report to produce one:

  1. Create a new report. Add filters to find the documents you wish to use.
  2. Add the Document ID field as a column.
  3. Run the report and export the results to CSV.
  4. Open the exported file. Change the name of the Document ID column to id.
  5. Save the file as documentset.csv.

Your CSV file cannot contain more than 200,000 Document IDs.
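If you prefer to script steps 4 and 5, the following Python sketch converts an exported report into documentset.csv and enforces the 200,000-ID limit. It assumes the exported column header is exactly "Document ID" and that writing only the id column is acceptable; both are assumptions, and the function name is hypothetical.

```python
import csv

MAX_DOCUMENT_IDS = 200_000  # limit stated in the text


def build_documentset(report_csv_path: str,
                      output_path: str = "documentset.csv",
                      id_column: str = "Document ID") -> int:
    """Write a one-column CSV with header 'id' and return the ID count."""
    with open(report_csv_path, newline="", encoding="utf-8-sig") as src:
        reader = csv.DictReader(src)
        if reader.fieldnames is None or id_column not in reader.fieldnames:
            raise ValueError(f"column {id_column!r} not found in report export")
        # Skip blank cells so the output holds only real IDs.
        ids = [row[id_column].strip() for row in reader
               if row[id_column] and row[id_column].strip()]

    if len(ids) > MAX_DOCUMENT_IDS:
        raise ValueError(f"{len(ids)} IDs exceeds the {MAX_DOCUMENT_IDS} limit")

    with open(output_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id"])  # the renamed column from step 4
        writer.writerows([doc_id] for doc_id in ids)
    return len(ids)
```

Calling build_documentset("report_export.csv") writes documentset.csv to the current directory; the input file name here is only a placeholder.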

Using the Document ID method allows Admins to select any documents to train the model. We strongly recommend that the IDs provided cover all document types that users may send to the Document Inbox. The TMF Bot tries to classify every document that comes into the Inbox, so a document whose type was not represented in training is likely to be misclassified.

Once you have created the documentset.csv, upload it as a Trained Model Artifact to your Trained Model record.

Creating Excluded Classifications

You can define classifications that will be excluded from your Trained Model. The TMF Bot excludes the specified classification(s) from all extraction, training, and testing during model deployment. Additionally, later predictions from the TMF Bot, made as it processes documents, will not be actioned if a document is in (or predicted to be in) an excluded classification.

You can specify excluded classifications before or after a model is trained. If you add an excluded classification after the model’s training, the model is not automatically retrained; however, the TMF Bot will not take any action against documents of the excluded classification.

This exclusion applies only to the Trained Model to which the Excluded Classification belongs. For example, let’s say you have Trained Models deployed for Auto-Classification and Metadata Extraction, and your Metadata Extraction model has an Excluded Classification for “Protocol”. You then create a document that the TMF Bot thinks is a Protocol belonging to study AVEG 027. The TMF Bot will classify the document as a Protocol, but it will not set the Study field because the Protocol is an Excluded Classification for Metadata Extraction.
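The per-model scope of an exclusion can be illustrated with a short Python sketch. The function and the dictionary of exclusions are hypothetical; they simply restate the rule that a model takes no action when a document is in, or predicted to be in, one of that model's excluded classifications.

```python
from typing import Optional, Set


def bot_may_act(excluded: Set[str], current: Optional[str], predicted: str) -> bool:
    """True when neither the document's current classification nor the
    predicted one is excluded for this particular Trained Model."""
    return current not in excluded and predicted not in excluded


# The example from the text: only Metadata Extraction excludes "Protocol".
exclusions = {
    "Document Classification": set(),
    "Metadata Extraction": {"Protocol"},
}

classify = bot_may_act(exclusions["Document Classification"], None, "Protocol")
set_study = bot_may_act(exclusions["Metadata Extraction"], None, "Protocol")
print(classify, set_study)
```

Here the document is classified as a Protocol, but the Study field is not set, which matches the scenario described above.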

To create an excluded classification:

  1. Under Excluded Classifications, click Create.
  2. Select the Status of the Excluded Classification.
  3. Select the Classification you wish to exclude.
  4. Enter any relevant Comments.
  5. Click Save.

Training the Trained Model

Once you have determined the appropriate Document Set Method, perform the Train Model action. Choose the appropriate Document Set method when prompted, then click Start. The Trained Model record will move to the In Training state.

An asynchronous job tracks two activities as part of training:

  1. Document Extraction: During this process, the system collects the data from the specified document set. The output is a CSV file (document_extract_results.csv), attached under Trained Model Artifacts, in which an Admin can see which documents could be used as input and which could not. Vault sends a notification to the Admin who started the action when the extraction is complete.
  2. Model Training: During this process, the system will use 80% of the extracted data to build a machine learning neural network model, then test that model using the remaining 20%. The output is a number of performance metrics in both the Trained Model Performance Metrics object and attached CSVs under Trained Model Artifacts. Vault sends a notification to the Admin who started the action when training is complete.
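The 80/20 division described in the Model Training step is a standard holdout split. The Python sketch below shows the general idea only; how Vault actually selects the two portions is not documented here, so the random shuffle and the function name are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(records: Sequence[T], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the records and split them into training and test portions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(3000))
print(len(train), len(test))
```

Because the test portion is never used to build the model, metrics computed on it estimate how the model behaves on documents it has not seen.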

The time required to complete these jobs varies depending on the number of documents used as input, from about 1 hour for Vaults training on 3,000 documents to about 24 hours for Vaults training on 200,000 documents.

Once Model Training is complete, the Trained Model record will move to the Trained state.

Training a Trained Model in Pre-Release or Sandbox Environments with Production Data

You can train a Trained Model in your Pre-Release or Sandbox Vault with production documents for evaluation purposes. You cannot move the resulting Trained Model to your production environment.

Both methods for document selection are available. If you’re using the Attached CSV of Document IDs method, be sure to use Document IDs from your production Vault.

To train using production data, run the Train Model From Production Data action. This action is only visible in Pre-Release and Sandbox Vaults.

After evaluating your Trained Model, you’ll need to perform training again in your production Vault to begin using TMF Bot features there. If you use the Attached CSV of Document IDs method of document selection, you can use this same list of documents to create a similar Trained Model in your production environment.

Evaluating the Trained Model

There are three Key Metrics to ensure your Trained Model is viable: Extraction Coverage, Auto-classification Coverage, and Auto-classification Error Rate. These can be found in the Training Summary Results field of your Trained Model record. See the definition of these key metrics and how to improve them.

Deploying the Trained Model

Once you have evaluated your Trained Model, select the Deploy Model action from the Trained Model record, review the prompt to ensure you agree with the outcome, and click Start. The Trained Model record will move to the In Deployment state.

An asynchronous job tracks the deployment of this Trained Model in your Vault. The time required to complete these jobs varies, and it can take anywhere from 30 minutes to 2 hours. Vault sends a notification to the Admin who performed the action when deployment is complete.

Once the deployment job finishes, the Trained Model record will move to the Deployed state and documents added to the Document Inbox will immediately begin being auto-classified.

Only one (1) Trained Model per Trained Model Type can be deployed at a time.

Replacing a Deployed Trained Model

To replace a deployed model with a new Trained Model, simply deploy the new model. It will replace the currently active model, and auto-classification will not be interrupted. This is the recommended method for replacing models.

Additional Trained Model Actions & Details

You can only have five (5) Trained Models per Trained Model Type. If you attempt to train a sixth, Vault will advise you to archive a model before training another. To do so, select the Archive Model action on a Trained Model record. The Trained Model record will move to the Archived state. Archived models are not recoverable.

You can also remove deployed models and disable auto-classification by using the Withdraw Model action on a Trained Model in the Deployed state. Doing so will move the Trained Model record back to the Trained state.
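The record states and actions described in this article can be pictured as a small state machine. The Python sketch below is illustrative only: the "Created" label for a new, untrained record and the event names for job completion are hypothetical, and allowing Archive Model only from the Trained state is an assumption, since the article does not list every state from which a model can be archived.

```python
MAX_MODELS_PER_TYPE = 5  # limit stated above

# (current state, action or event) -> next state
TRANSITIONS = {
    ("Created", "Train Model"): "In Training",
    ("In Training", "training complete"): "Trained",
    ("Trained", "Deploy Model"): "In Deployment",
    ("In Deployment", "deployment complete"): "Deployed",
    ("Deployed", "Withdraw Model"): "Trained",
    ("Trained", "Archive Model"): "Archived",  # assumption: from Trained only
}


def next_state(state: str, event: str) -> str:
    """Return the state a Trained Model record moves to, or raise."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not modeled from state {state!r}") from None


def can_train_another(non_archived_models_of_type: int) -> bool:
    """A sixth model of the same type requires archiving one first."""
    return non_archived_models_of_type < MAX_MODELS_PER_TYPE
```

Note that "Archived" has no outgoing transitions, which reflects the statement that archived models are not recoverable.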

Document Quality Control with TMF Bot

Document Classification Trained Models can also be used for quality control purposes. TMF Bot can make predictions about a document as part of a workflow. If the existing classification does not match TMF Bot’s suggestion, the Document Info panel lets a user see the suggestion. The user can then reclassify the document if desired.

Document Quality Control can be used by itself or alongside auto-classification. This allows Vaults not using the Document Inbox to benefit from TMF Bot.

To enable this, first deploy a Trained Model using any automatic or manual training method. Then, include an AI Document QC system action step in any document workflow. You can add this to existing workflows, or design a new workflow that only performs this action. When the action is performed, predictions are generated and suggestions become available to the user.

About the Prediction Object

When a Trained Model is deployed and used to predict data for a document, the Prediction object keeps track of each individual prediction attempt. It’s unlikely that Admins will need to work with this object directly, but it may be useful to understand the object fields.

  • Prediction ID: Unique identifier for that prediction, automatically assigned by Vault
  • Related Record Unique ID: Identifier for the file being evaluated, automatically assigned by Vault
  • Related Record: Metadata for the document being evaluated, formatted as JSON. You can locate the Vault Document ID, Major version, and Minor version here if needed.
  • Predictions: The prediction data for this attempt from TMF Bot, formatted as JSON. You can use this field to understand if a prediction failed and why; which Trained Model was used to make the prediction; and, in the case of Document Classification, the first, second, and third top predictions from the model along with their Prediction Confidence scores. If the firstPrediction score is above the deployed Trained Model Prediction Confidence Threshold, the document will have been auto-populated with that prediction. This can also be seen with the auto-populated JSON parameter.
  • Feedback: Post-prediction activity. This field shows the current value for the data being predicted in the trueValue JSON parameter and if that value matches the corresponding first Prediction in the Predictions field in the trueValueMatch JSON parameter.
  • Additional Details: Lists from where Vault generates the prediction. This can include multiple sources. For example, if a bulk auto-classification generates a prediction and the document is sent for a QC check, the Additional Details field contains the values BULK and QC_CHECK.

About the Prediction Metrics Object

When a Trained Model is deployed and used to predict data for a document, the Prediction Metrics object keeps track of the model’s performance over time. The Prediction Metrics job generates records that track the overall numbers, as well as document Classification-specific performance. You can view this object from the Trained Model page layout and it contains the following fields:

  • Model Performance ID: Unique ID, assigned by Vault
  • Created Date: Date the prediction metric was calculated
  • Trained Model Type: The Trained Model Type (Auto-Classification or Metadata Extraction) being evaluated
  • Metric Type: Metric type presented
  • Metric Subtype: Subtype of the metric presented
  • Number of Documents: The number of documents used to test this model
  • Success Rate: The rate at which predictions on which the system acted were confirmed as true predictions (Correct Predictions divided by Number of Documents)
  • Correct Predictions: The number of times the prediction made was accurate (only applies to Auto-Classification models)
  • Correct Predictions Above Threshold: The number of times the prediction made was accurate and the Bot acted on it. With Auto-Classification models, this means it was above the selected Trained Model Confidence Threshold; with Metadata Extraction models, this means that the Bot attempted to set the corresponding field.
  • Predictions Above Threshold: The number of times the Bot acted upon the prediction made. With Auto-Classification models, this means it was above the selected Trained Model Confidence Threshold; with Metadata Extraction models, this means that the Bot attempted to set the corresponding field.
  • Partial Predictions Above Threshold: The number of times that some, but not all, of multiple values predicted were accurate (note: only applies to Metadata Extraction models.)
  • Field Extracted: The field extracted by the Metadata Extraction model (does not apply to Auto-classification models)
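The Success Rate definition above (Correct Predictions divided by Number of Documents) is simple arithmetic, sketched here in Python. The function names are hypothetical. The second function computes an additional ratio from two of the listed fields; it is not a Vault field and is included only as an example of what the counts allow you to derive.

```python
from typing import Optional


def success_rate(correct_predictions: int, number_of_documents: int) -> Optional[float]:
    """Correct Predictions divided by Number of Documents."""
    if number_of_documents <= 0:
        return None  # no documents evaluated, so no rate to report
    return correct_predictions / number_of_documents


def acted_on_accuracy(correct_above_threshold: int,
                      predictions_above_threshold: int) -> Optional[float]:
    """Share of acted-on predictions that were correct (derived, not a Vault field)."""
    if predictions_above_threshold <= 0:
        return None
    return correct_above_threshold / predictions_above_threshold


print(success_rate(90, 100))
```

Returning None instead of dividing by zero keeps an empty reporting period from raising an error.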

Note 1: Pre-release Vaults use the production Vault’s documents to auto-train the model.