Training machine learning models

Can multiple Trained Models be deployed at the same time?

You can deploy multiple Trained Models if they have a different Trained Model Type. Currently there are two model types: Auto-classification models and Metadata Extraction models. You can deploy one of each of these model types, but you cannot have two Auto-classification models or two Metadata Extraction models deployed at the same time. Deploying a new Auto-classification model or a new Metadata Extraction model replaces the existing one.
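
The one-model-per-type rule can be pictured as a small registry with one slot per model type. This is only an illustrative sketch: the class name, the type labels, and the method names are hypothetical and are not part of any Vault API.

```python
from typing import Dict, Optional

# Hypothetical labels for the two model types named in this FAQ.
MODEL_TYPES = ("auto_classification", "metadata_extraction")


class DeploymentSlots:
    """Illustrative registry that holds at most one deployed model per type."""

    def __init__(self) -> None:
        self._deployed: Dict[str, str] = {}

    def deploy(self, model_type: str, model_name: str) -> Optional[str]:
        """Deploy a model and return the name of the model it replaced, if any."""
        if model_type not in MODEL_TYPES:
            raise ValueError(f"unknown model type: {model_type}")
        replaced = self._deployed.get(model_type)
        # A new deployment of the same type overwrites the existing one.
        self._deployed[model_type] = model_name
        return replaced

    def deployed(self) -> Dict[str, str]:
        """Return a copy of the currently deployed models, keyed by type."""
        return dict(self._deployed)
```

In this sketch, deploying a second Auto-classification model returns the name of the first one, while a Metadata Extraction model deployed alongside it is left untouched.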

Can we retrain the TMF Bot?

Yes. You can create as many Trained Models as you like, using different sets of documents, different Prediction Confidence Thresholds, or different Minimum Documents per Document Type values. You can review and evaluate all of them within the system before deploying, which lets you select the best one to deploy.

If we train and deploy a model but then withdraw it, will the nightly job detect that a model is not deployed and attempt to train and deploy a model automatically?

The Auto-on TMF Bot feature will only create, train, and deploy one Auto-Classification Trained Model per major release. If you withdraw that model, it will not be re-deployed. In the next major release, a new model will be created, trained, and deployed if no other model is deployed, or if the previous release’s Auto-Trained Model is deployed.
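
The decision described above can be written as a single condition. The sketch below is a reading of this answer only, not the nightly job's actual logic; the function name and all three parameters are hypothetical.

```python
from typing import Optional


def should_auto_train(
    auto_trained_this_release: bool,
    deployed_model: Optional[str],
    previous_auto_model: Optional[str],
) -> bool:
    """Return True if an Auto-Trained Model would be created in this release.

    Assumptions (taken from the FAQ text, not from Vault internals):
    - at most one Auto-Trained Model is created per major release, so a
      withdrawn model is not re-deployed within the same release;
    - a new one is created only if nothing is deployed, or if the deployed
      model is the previous release's Auto-Trained Model.
    """
    if auto_trained_this_release:
        return False
    return deployed_model is None or deployed_model == previous_auto_model
```

For example, a Vault whose only deployed model was manually trained would not get a new Auto-Trained Model under this reading, because a different model is already deployed.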

Can the TMF Bot be trained on similar documents that file to different document types in different scenarios?

We have seen some success with filing similar documents into different classifications, though these documents often do not have a high enough confidence for TMF Bot to auto-classify them. Setting different classifications for similar documents with no data-driven reason (information in the file content or file name that specifies where it should go) does not perform well in TMF Bot.

Can we modify or add to the parameters used to train the machine learning models?

Currently, you can control two parameters: the Prediction Confidence Threshold (the confidence the model must have in its prediction before it will auto-classify the document) and the Minimum Documents per Document Type (the number of documents a document type must contain before the model will predict that document type). You cannot add or modify any other parameters.

Note that the Prediction Confidence Threshold (PCT) is not the same as accuracy. In our analysis of TMF documents, we have found that a PCT of 0.80 can result in predictions that are 96-99% accurate, depending on the customer. At this time, we recommend setting the PCT to 0.85.
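
To see why the threshold and accuracy are different numbers, it can help to compute both on a set of scored predictions. The following is a minimal sketch, assuming each prediction is a (confidence, was_correct) pair and that a prediction is acted on when its confidence is at or above the threshold; the function name and the inclusive comparison are assumptions, not documented Vault behavior.

```python
from typing import List, Optional, Tuple


def accuracy_above_threshold(
    results: List[Tuple[float, bool]], threshold: float = 0.85
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy) for predictions at or above the threshold.

    coverage: share of all predictions that clear the threshold.
    accuracy: share of those predictions that were correct, or None if
              no prediction clears the threshold.
    """
    kept = [correct for confidence, correct in results if confidence >= threshold]
    if not kept:
        return 0.0, None
    coverage = len(kept) / len(results)
    accuracy = sum(kept) / len(kept)
    return coverage, accuracy
```

Raising the threshold typically lowers coverage (fewer documents are auto-classified) while raising accuracy among the documents that are auto-classified, which is how a threshold below 1.0 can still yield very high accuracy.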

It is also important to note that the Bot does not need many examples of a classification to train on it and predict with accuracy. Setting Minimum Documents per Document Type to a higher value will result in the Bot training on fewer classifications. Because the Bot can only predict a classification on which it has trained, if you do not train on a classification and a document of that type is loaded to the Inbox, the Bot will either fail to classify it or misclassify it. For that reason, we recommend you set the Minimum Documents per Document Type to 10 to ensure the Bot trains on more classifications.

In the future, we may remove the option to adjust these parameters, and instead set them to values that ensure optimal performance.

How many documents do I need to train the TMF Bot?

We recommend at least 1,000 steady state (Approved or Final) documents. In general, the more documents you have, the more accurate the model will be.

At minimum, to train or test a Trained Model, your document set must have at least 3 document types with more documents than the Minimum Documents per Document Type selected on the Trained Model.
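
The minimum requirement above, and the effect of raising Minimum Documents per Document Type, can be checked with a simple count. This is a sketch under stated assumptions: the function names are hypothetical, and "more documents than" is read as a strict comparison.

```python
from collections import Counter
from typing import List, Set

MIN_QUALIFYING_TYPES = 3  # "at least 3 document types", per this FAQ


def qualifying_types(doc_types: List[str], min_docs: int) -> Set[str]:
    """Return the document types that have more than min_docs documents."""
    counts = Counter(doc_types)
    return {doc_type for doc_type, n in counts.items() if n > min_docs}


def can_train(doc_types: List[str], min_docs: int) -> bool:
    """Return True if enough document types qualify for training or testing."""
    return len(qualifying_types(doc_types, min_docs)) >= MIN_QUALIFYING_TYPES
```

With the same document set, a higher min_docs value leaves fewer qualifying types, which is why a lower value lets the Bot train on more classifications.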

Is there a limit or an optimal number of documents that I can use to train the Auto-classification Trained Model?

Trained Models have a limit of 200,000 documents that you can use for training. The optimal number of documents is 200,000. If you have fewer than 200,000 documents, use as many as you can. Optimally, all of the documents sent to the Trained Model would be 100% correct in terms of classification and metadata assignment.

Will the Trained Model be more accurate if I give it more documents?

In general, yes, the more documents you provide the Auto-classification Trained Model, the more accurate the model will be. However, this depends on the Prediction Confidence Threshold and the accuracy of the classifications in your system.

Testing more documents on a Metadata Extraction Trained Model will not improve its accuracy, but it will provide a better representation, via Trained Model performance metrics, of the potential performance of metadata extraction in your Vault.

How can I make a Metadata Extraction Trained Model more accurate?

Metadata Extraction Trained Models search the document filename and content for strings that match the names of Study, Study Country, and Study Site records in your Vault. They do not interpret other values, such as Country Code or Principal Investigator name.

To optimize TMF Bot metadata extraction, we recommend using site names that are at least six (6) characters long and assigning each site a name that is unique across studies. You can also increase metadata extraction coverage by updating your business process to include these values in the filename as a best practice (note that you need a separator between the values).
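
The note about separators can be illustrated with a toy matcher that splits a filename into tokens and compares them with known record names. This is not how the TMF Bot matches values; it is a simplified sketch that handles only single-token names, and the function name, the separator set, and the sample filenames are all hypothetical.

```python
import re
from typing import List


def match_records(filename: str, record_names: List[str]) -> List[str]:
    """Return the record names that appear as whole tokens in the filename."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    # Split on underscores, hyphens, whitespace, and dots (assumed separators).
    tokens = {t.lower() for t in re.split(r"[_\-\s.]+", stem) if t}
    return [name for name in record_names if name.lower() in tokens]
```

In this sketch, "STUDY001_SITE1001_cv.pdf" yields both the study and the site, while "STUDY001SITE1001cv.pdf" yields nothing because the values run together without a separator.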

Will training a Trained Model cause slow performance in my Vault?

No, it will not. The Trained Model uses queues and threads that are separate from those Vault uses on a daily basis.

Does Vault automatically split the document set into a training set and testing set while training Trained Models?

With Auto-classification Trained Models, Vault will automatically split the information pulled for training a Trained Model into 80% for the training set and 20% for the testing set. We stratify on document type to ensure we have the 80/20 split for each document type.
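
A stratified 80/20 split of this kind can be sketched in a few lines. The code below illustrates the general technique only; it is not Vault's implementation, and the function name, the input shape, the fixed seed, and the rounding of each cut point are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Doc = Tuple[str, str]  # (document id, document type)


def stratified_split(
    docs: List[Doc], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[List[Doc], List[Doc]]:
    """Split docs into (train, test), applying the fraction per document type."""
    by_type: Dict[str, List[str]] = defaultdict(list)
    for doc_id, doc_type in docs:
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train: List[Doc] = []
    test: List[Doc] = []
    for doc_type in sorted(by_type):
        ids = list(by_type[doc_type])
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)  # size of this type's training share
        train.extend((doc_id, doc_type) for doc_id in ids[:cut])
        test.extend((doc_id, doc_type) for doc_id in ids[cut:])
    return train, test
```

Splitting within each document type, rather than across the whole set, keeps rare document types represented in both the training set and the testing set.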

With Metadata Extraction Trained Models, the entire document set will be used for testing, up to a maximum of 40,000 documents.

Does the TMF Bot populate Study Country and Site?

The TMF Bot can currently populate the Study, Study Country, and Site values on documents if you have a Study Metadata Extraction model deployed. The TMF Bot only populates Site names that are at least three characters long. The TMF Bot will only populate Site names with fewer than five characters if there is a matching Study.
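
The two Site name length rules above combine as follows. This sketch encodes only what this answer states; the function name and parameters are hypothetical.

```python
def site_name_eligible(site_name: str, study_matched: bool) -> bool:
    """Return True if a Site name could be populated under the stated rules.

    - Names shorter than three characters are never populated.
    - Names with fewer than five characters need a matching Study.
    - Longer names do not depend on a matching Study in this sketch.
    """
    length = len(site_name)
    if length < 3:
        return False
    if length < 5:
        return study_matched
    return True
```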

Evaluation and User Acceptance Testing

How do I evaluate the TMF Bot?

There are two main approaches to evaluating the TMF Bot:

  1. Create and train Trained Models in your Production environment and evaluate the results provided after training completes. If the results meet your expectations, deploy the Trained Model.
  2. Create, train, and deploy an Auto-classification Trained Model in your QA or Sandbox environment using the Train Model From Production Data action. Verify that the auto-classification works properly. Then create and train Trained Models in your Production environment and evaluate the provided results after training completes. If the results meet your expectations, deploy the Trained Model.

Can I pilot the TMF Bot within a Sandbox environment?

Yes, by using the Train Model From Production Data action (for Auto-Classification models) or the Test Model from Production Data action (for Metadata Extraction models) to train or test a Trained Model. You’ll still need to train or test a Trained Model in your production Vault, however, as the Trained Model cannot be moved from Sandbox or Pre-Release to production.

How do I disable TMF Bot if I do not wish to use it?

The TMF Bot is auto-on for all eTMF customers with more than 1,500 steady state documents. To disable an automatically deployed model, you will need to withdraw it via the Withdraw Model user action.

Uploaded Documents

Can I still choose to classify documents immediately?

Yes, you can. The TMF Bot will work on auto-classifying documents within the Document Inbox. However, if you choose to use the Classify Now option on the upload screen, you can still classify the document yourself.

Can we have the TMF Bot only auto-classify certain document types or documents for certain Studies?

The TMF Bot will only auto-classify documents to document types on which you’ve trained it. However, we DO NOT recommend only training on a small number of document types as this will result in many misclassified documents. All documents that enter the Document Inbox when you deploy a Trained Model will be auto-classified. There is no concept of pilot studies; however, you could work with a few studies at first to exclusively use the Document Inbox.

Do I need to name the file in a specific way for the TMF Bot to auto-classify the document correctly?

No, you do not. We’ve seen success with the TMF Bot where there are no defined naming conventions. However, having a good naming convention is one of the factors that can help the TMF Bot be more accurate, even though it’s not a requirement.

Does the TMF Bot include scanned documents or any documents that need Optical Character Recognition (OCR)?

Yes, it does. The TMF Bot has an express pipeline where it performs Optical Character Recognition (OCR) on the uploaded document if necessary.

Does the TMF Bot use the text within a document when it auto-classifies?

Yes, it does. The TMF Bot uses the following information to auto-classify a document:

  • Number of pages
  • Number of characters
  • File type, file size
  • File name
  • Extracted text from the document

If the document has hand-written text, is that ok?

Yes, it is. Our OCR generally ignores hand-written text, and TMF Bot uses all other text for auto-classification.

Can the TMF Bot classify emails?

Yes, it can. We have seen some great success in the TMF Bot auto-classifying emails into the appropriate Relevant Communication document type. As with all auto-classification scenarios, there are cases where the confidence isn’t high enough for TMF Bot to auto-classify the emails.

What happens after auto-classification?

After the TMF Bot auto-classifies a document, who updates the Name?

If you are using document type auto-naming, the name will be automatically updated. If the document’s name is manually controlled, you need to modify the name while you Complete the document from the Document Inbox.

How does TMF Bot consider Security Profiles?

A user must have the Create Document permission on the document type the TMF Bot is trying to use for auto-classification. A user must also have general classify permissions within their permission set.

If the TMF Bot cannot auto-classify a document, how are the owners notified?

Because we have a goal to auto-classify documents within 5 seconds, we will not notify the Owner. Instead, you can use the TMF Bot field to understand if the document has finished processing with the TMF Bot or not. If it has finished and there is no auto-classification, then the user can Complete the document, selecting the appropriate document type.

Is auto-classification captured in the audit trail?

Yes, it is. The audit trail will show that the system has updated the Type, Subtype, and Classification of the document.

Who is the Owner of a document that TMF Bot has classified?

The user who uploaded the document is still the document Owner. The TMF Bot simply updates the document after uploading.

Will sharing settings be applied to documents in the Document Inbox picked up by the TMF Bot?

Yes, they will. Sharing Settings are not affected by the TMF Bot.

Will the TMF Bot also promote my document to the Final or Approved state?

No, it will not. The TMF Bot auto-classifies documents, but those documents remain in the Document Inbox until completed by a user. Their state does not change.

How do documents leave the Document Inbox?

You must still Complete the document to move it out of your Document Inbox.

Do I have a chance to check the document type the TMF Bot provided?

You can check the document’s classification within the Document Inbox before Completing it.

Misclassifications

Can I reclassify a document if the TMF Bot selected the wrong classification?

Yes, you can. The Reclassify option is available within the Document Inbox. When you manually reclassify a document, Vault tags the document as TMF Bot Misclassified.

Can I report on documents that TMF Bot misclassified (documents a user reclassified after the TMF Bot auto-classified them)?

The Prediction Metrics object tracks information about a model’s performance and is available as a section on the Trained Model page layout. You can use this object to create Prediction Metrics reports at the Classification-specific and Global Weighted Average levels.

We also capture this information in the Predictions object in Business Admin. This object is not reportable within the system as Vault stores this data in JSON strings. However, this data can be exported to and manipulated within Excel.

You can also use the TMF Bot Misclassified tag to filter on documents that users have manually reclassified so that you can identify and analyze misclassified documents. The TMF Bot Misclassified tag only applies to documents loaded into the Inbox.

When a user reclassifies a document auto-classified by the TMF Bot, does the machine learning model learn from this?

The machine learning model does not learn from reclassifications due to the large amount of time required to train it. However, we will track this feedback so that we can recommend retraining your machine learning models in a future release.

What happens after Study Metadata Extraction?

How does the TMF Bot consider Security Profiles?

A user must have View permission on the Study that the TMF Bot is trying to set on the document. If the user cannot View that Study, the TMF Bot will not set that Study on the document.

If the TMF Bot cannot set the Metadata values on a document, how are the owners notified?

Because we have a goal to set metadata quickly, we will not notify the Owner. Instead, you can use the TMF Bot field to understand if the document has finished processing with the TMF Bot or not. If it has finished and there is no Study, then the user can Complete the document, selecting the appropriate Study.

Is Metadata Extraction captured in the audit trail?

Yes, it is. The audit trail will show that the system has updated the Study field on the document.

Will sharing settings be applied to documents in the Document Inbox picked up by the TMF Bot?

Yes, they will. Sharing Settings are not affected by the TMF Bot unless the TMF Bot updates metadata that is used to determine Sharing Settings on a document while it is in the Unclassified lifecycle.

Will the TMF Bot also promote my document to the Final or Approved state?

No, it will not. The TMF Bot acts on documents, but those documents remain in the Document Inbox until completed by a user. Their state does not change.

How do documents leave the Document Inbox?

You must still Complete the document to move it out of your Document Inbox.

Do I have a chance to check the Study the TMF Bot provided?

You can check the document’s Study within the Document Inbox before Completing it.

Future functionality

Does Veeva plan to have the TMF Bot used as part of quality control?

Yes, we still plan to allow the TMF Bot to be used as part of the QC process where the TMF Bot can check a document before it goes to the QC Reviewer and, if it finds issues, send those issues back to the Owner to resolve before the QC Reviewer proceeds.

Can I use the TMF Bot to check the classifications of my older documents?

This enhancement is on the roadmap. We plan to allow the TMF Bot to be run against an entire Study and provide details about which documents it believes are misclassified.

Does the TMF Bot support multiple languages?

In the initial release, only documents with English text can be used for training and auto-classification. TMF Bot may support non-English documents in a future release.

When will the TMF Bot be able to auto-classify non-English documents?

There is no specific timeline, though this is on the roadmap. Note that there are cases where a multilingual document in both English and another language may still be auto-classified by the TMF Bot.

Will document auto-classification be available in the Vault Platform to be leveraged by other Vault Applications?

As of right now, the document auto-classification capabilities are specific to Clinical Operations. We may see more capabilities in other Vault Applications in the future.

Can the TMF Bot determine if a document is a new version of an existing document?

It cannot. We have considered this for the future, but it is not a near-term priority.