Training machine learning models

What data is used to train the TMF Bot?

The TMF Bot has two options for sourcing data to train its model. For Production Vaults, TMF Bot uses document content from the same Vault. In non-Production Vaults, Admins can choose to use content from the Production Vault or the same non-Production Vault. In all cases, the training data and AI model are fully encapsulated within your environment, just as documents and data are completely separated among customer environments.

Can multiple Trained Models be deployed at the same time?

You can deploy multiple Trained Models if they have a different Trained Model Type. Currently there are two model types: Auto-classification models and Metadata Extraction models. You can deploy one of each of these model types, but you cannot have two Auto-classification models or two Metadata Extraction models deployed at the same time. Deploying a new Auto-classification model or a new Metadata Extraction model replaces the existing one.
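
The "one deployed model per Trained Model Type" rule above can be pictured as one slot per type. The following is only an illustrative sketch: the type keys, the model names, and the deploy function are hypothetical and are not Vault identifiers or API calls.

```python
# Hypothetical sketch: one deployment slot per Trained Model Type.
# These names are illustrative only and are not Vault identifiers.
MODEL_TYPES = ("auto_classification", "metadata_extraction")


def deploy(deployed: dict, model_type: str, model_name: str) -> dict:
    """Return a new mapping in which model_name occupies the slot for model_type."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type}")
    updated = dict(deployed)
    updated[model_type] = model_name  # replaces any model already in this slot
    return updated


deployed = {}
deployed = deploy(deployed, "auto_classification", "Model A")
deployed = deploy(deployed, "metadata_extraction", "Model B")
# Deploying a second Auto-classification model replaces Model A.
deployed = deploy(deployed, "auto_classification", "Model C")
```

After these three calls, one model of each type is deployed, and Model A is no longer deployed.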

Can we retrain the TMF Bot?

You can create as many Trained Models as you like, using different sets of documents, different Prediction Confidence Thresholds, or different Minimum Documents per Document Type values. These can all be reviewed and evaluated before deploying within the system, allowing you to select the best one to deploy.

If we train and deploy a model but then withdraw it, will the nightly job detect that a model is not deployed and attempt to train and deploy a model automatically?

The Auto-on TMF Bot feature will only create, train, and deploy one Auto-Classification Trained Model per major release. If you withdraw that model, it will not be re-deployed. In the next major release, a new model will be created, trained, and deployed if no other model is deployed, or if the previous release’s Auto-Trained Model is deployed.
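
One way to read the rule above is as a small decision function. This is a sketch of the described behavior under our own assumptions, not the actual nightly job; the function and parameter names are hypothetical.

```python
from typing import Optional


def should_auto_train(already_auto_trained_this_release: bool,
                      deployed_model: Optional[str],
                      previous_auto_model: Optional[str]) -> bool:
    """Decide whether the Auto-on feature would create, train, and deploy a model."""
    # Only one auto-trained model per major release, even if it was withdrawn.
    if already_auto_trained_this_release:
        return False
    # In a new release: proceed if nothing is deployed, or if the model that is
    # deployed is the previous release's auto-trained model.
    return deployed_model is None or deployed_model == previous_auto_model
```

For example, a Vault whose auto-trained model was withdrawn earlier in the same release gets no new model until the next major release, and a Vault with a manually created model deployed is left alone.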

Can the TMF Bot be trained on similar documents that file to different document types in different scenarios?

We have seen some success with filing similar documents into different classifications, though these documents often do not have a high enough confidence for TMF Bot to auto-classify them. Setting different classifications for similar documents with no data-driven reason (information in the file content or file name that specifies where it should go) does not perform well in TMF Bot.

Can we modify or add to the parameters used to train the machine learning models?

Currently, you can control two parameters: the Prediction Confidence Threshold (what confidence does the model need to have in its prediction before it will auto-classify the document) and the Minimum Documents per Document Type (how many documents need to be in a document type before the model predicts that document type). However, you cannot add or modify any other parameters.

Note that the Prediction Confidence Threshold (PCT) is not the same as accuracy. In our analysis of TMF documents, we have found that a PCT of 0.85 can result in predictions that are 96-99% accurate, depending on the customer. For that reason, we currently recommend setting the PCT to 0.85.
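
To illustrate why the threshold and accuracy are different quantities, here is a minimal sketch. It assumes that a prediction is accepted when its confidence is at or above the PCT (the exact comparison is our assumption), and the sample values are made up for illustration only.

```python
def accuracy_above_threshold(predictions, threshold=0.85):
    """predictions: list of (confidence, was_correct) pairs.

    Returns the share of correct predictions among those that meet the
    threshold, or None if no prediction meets it.
    """
    accepted = [correct for confidence, correct in predictions
                if confidence >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Made-up sample: four predictions meet the 0.85 threshold, three are correct.
sample = [(0.95, True), (0.90, True), (0.88, True),
          (0.86, False), (0.60, False), (0.40, True)]
accuracy = accuracy_above_threshold(sample)
```

The threshold controls which predictions are acted on. The accuracy is measured afterwards on those accepted predictions and depends on the documents, not on the threshold value itself.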

It is also important to note that the Bot does not need many examples of a classification to train on it and predict with accuracy. Setting Minimum Documents per Document Type to a higher value will result in the Bot training on fewer classifications. Because the Bot can only predict a classification on which it has trained, if you do not train on a classification and a document of that type is loaded to the Inbox, the Bot will either fail to classify it or misclassify it. For that reason, we recommend you set the Minimum Documents per Document Type to 10 to ensure the Bot trains on more classifications.
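
The effect of raising Minimum Documents per Document Type can be sketched as a simple count filter. The document type names and counts are made up, and the strict "more than the minimum" comparison is an assumption about how the boundary is handled.

```python
from collections import Counter


def classifications_trained(doc_types, minimum):
    """Return the document types that have more than `minimum` documents."""
    counts = Counter(doc_types)
    return sorted(name for name, n in counts.items() if n > minimum)


# Made-up document set: one entry per document, holding its document type.
docs = (["Protocol"] * 40 + ["CV"] * 25
        + ["Monitoring Visit Report"] * 12 + ["FDA 1572"] * 4)

low = classifications_trained(docs, minimum=10)
high = classifications_trained(docs, minimum=30)
```

With a minimum of 10, three of the four types qualify for training. With a minimum of 30, only one does, so documents of the other types could not be predicted correctly.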

In the future, we may remove the option to adjust these parameters, and instead set them to values that ensure optimal performance.

How many documents do I need to train the TMF Bot?

We recommend at least 1,000 steady state (Approved or Final) documents. In general, the more documents you have, the more accurate the model will be.

At minimum, to train or test a Trained Model, your document set must have at least 3 document types with more documents than the Minimum Documents per Document Type selected on the Trained Model.
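
As a rough pre-check of this requirement, you could count document types in an exported list before creating a Trained Model. The helper below is a hypothetical sketch and is not part of Vault.

```python
from collections import Counter


def can_train(doc_types, minimum, required_types=3):
    """True if at least `required_types` document types each have more
    than `minimum` documents."""
    counts = Counter(doc_types)
    qualifying = [name for name, n in counts.items() if n > minimum]
    return len(qualifying) >= required_types
```

For example, with a minimum of 10, three document types with 11 documents each would pass this check, while two types with 11 documents and one with 10 would not.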

Is there a limit or an optimal number of documents that I can use to train the Auto-classification Trained Model?

Trained Models have a limit of 200,000 documents that you can use for training. The optimal number of documents is 200,000. If you have fewer than 200,000 documents, use as many as you can. Optimally, all of the documents sent to the Trained Model would be 100% correct in terms of classification and metadata assignment.

Will the Trained Model be more accurate if I give it more documents?

In general, yes, the more documents you provide the Auto-classification Trained Model, the more accurate the model will be. However, this depends on the Prediction Confidence Threshold and the accuracy of the classifications in your system.

Testing more documents on a Metadata Extraction Trained Model will not improve its accuracy, but it will provide a better representation, via Trained Model performance metrics, of the potential performance of metadata extraction in your Vault.

How can I make a Metadata Extraction Trained Model more accurate?

Metadata Extraction Trained Models search the document filename and content for strings that match the names of Study, Study Country, and Study Site records in your Vault. They do not interpret other values such as Country Code or Principal Investigator name.

To optimize TMF Bot metadata extraction, we recommend using site names at least six (6) characters long and assigning site names that are unique across studies. You can also increase metadata extraction coverage by updating your business process to include these values in the filename as a best practice (note that you need a separator between the values).
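
The sketch below shows why a separator between values in the filename matters. It is not how the TMF Bot matches names internally. It assumes a simple matcher that splits the filename on underscores, hyphens, and spaces, and the filenames and record names are made up.

```python
import re


def find_record_names(filename, record_names):
    """Return the record names that appear as whole, separator-delimited
    tokens in the filename (case-insensitive)."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    tokens = {t.lower() for t in re.split(r"[_\-\s]+", stem) if t}
    return [name for name in record_names if name.lower() in tokens]


records = ["STUDY001", "Germany", "SITE1001"]  # hypothetical record names
with_separator = find_record_names("STUDY001_Germany_SITE1001_CV.pdf", records)
without_separator = find_record_names("STUDY001GermanySITE1001CV.pdf", records)
```

With separators, all three values are found as distinct tokens. Without them, the filename is one long token and this simple matcher finds nothing.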

Will training a Trained Model cause slow performance in my Vault?

No, it will not. The Trained Model uses queues and threads that are separate from those Vault uses on a daily basis.

Does Vault automatically split the document set into a training set and testing set while training Trained Models?

With Auto-classification Trained Models, Vault will automatically split the information pulled for training a Trained Model into 80% for the training set and 20% for the testing set. We stratify on document type to ensure we have the 80/20 split for each document type.

With Metadata Extraction Trained Models, the entire document set will be used for testing, up to a maximum of 40,000 documents.
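
The 80/20 split stratified on document type, described above for Auto-classification Trained Models, can be sketched as follows. This illustrates the general technique, not Vault's implementation; the function name, the seed, and the rounding rule are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(docs, train_fraction=0.8, seed=0):
    """docs: list of (doc_id, doc_type) pairs. Returns (train, test) lists,
    splitting each document type separately."""
    by_type = defaultdict(list)
    for doc_id, doc_type in docs:
        by_type[doc_type].append(doc_id)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for doc_type, ids in by_type.items():
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train += [(i, doc_type) for i in ids[:cut]]
        test += [(i, doc_type) for i in ids[cut:]]
    return train, test


# Made-up document set: 50 documents of one type and 20 of another.
docs = ([(f"protocol-{n}", "Protocol") for n in range(50)]
        + [(f"cv-{n}", "CV") for n in range(20)])
train, test = stratified_split(docs)
```

Each document type contributes 80% of its documents to the training set and 20% to the testing set, so a rare type is not left out of either set by chance.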

Does the TMF Bot populate Study Country and Site?

The TMF Bot can currently populate the Study, Study Country, and Site values on documents if you have a Study Metadata Extraction model deployed. The TMF Bot only populates Site names that are at least three characters long. The TMF Bot will only populate Site names with fewer than five characters if there is a matching Study.
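
The Site name length rules above can be written as a small predicate. This is a sketch of the stated rules only; the function and parameter names are hypothetical.

```python
def may_populate_site(site_name: str, has_matching_study: bool) -> bool:
    """Apply the Site name length rules described above."""
    length = len(site_name)
    if length < 3:
        return False  # names shorter than three characters are never populated
    if length < 5:
        return has_matching_study  # short names need a matching Study
    return True
```

For example, a four-character Site name is populated only when a matching Study is found, while a five-character name does not need one.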

For documents under Approved legal hold, the TMF Bot will not populate a Study value if the value to set matches the Study value on a legal hold record in an Approved state.

Evaluation and User Acceptance Testing

How do I evaluate the TMF Bot?

There are two main approaches to evaluating the TMF Bot:

Create and train Trained Models in your Production environment and evaluate the results provided after training completes. If the results meet your expectations, deploy the Trained Model.
Create, train, and deploy an Auto-classification Trained Model in your QA or Sandbox environment using the Train Model From Production Data action. Verify that the auto-classification works properly. Then create and train Trained Models in your Production environment and evaluate the provided results after training completes. If the results meet your expectations, deploy the Trained Model.

Can I pilot the TMF Bot within a Sandbox environment?

Yes, by using the Train Model From Production Data action (for Auto-Classification models) or the Test Model from Production Data action (for Metadata Extraction models) to train or test a Trained Model. You’ll still need to train or test a Trained Model in your production Vault, however, as the Trained Model cannot be moved from Sandbox or Pre-Release to production.

How do I disable TMF Bot if I do not wish to use it?

The TMF Bot is auto-on for all eTMF customers with more than 1,500 Steady state documents. To disable an automatically deployed model, you will need to withdraw it via the Withdraw Model user action.

Uploaded Documents

Can I still choose to classify documents immediately?

Yes, you can. The TMF Bot will work on auto-classifying documents within the Document Inbox. However, if you choose to use the Classify Now option on the upload screen, you can still classify the document yourself.

Can we have the TMF Bot only auto-classify certain document types or documents for certain Studies?

The TMF Bot will only auto-classify documents to document types on which you’ve trained it. However, we DO NOT recommend training on only a small number of document types, as this will result in many misclassified documents. Once you deploy a Trained Model, all documents that enter the Document Inbox will be auto-classified. There is no concept of pilot studies; however, you could start by having only a few studies use the Document Inbox.

Do I need to name the file in a specific way for the TMF Bot to auto-classify the document correctly?

No, you do not. We’ve seen success with the TMF Bot where there are no defined naming conventions. However, having a good naming convention is one of the factors that can help the TMF Bot be more accurate, even though it’s not a requirement.

Does the TMF Bot include scanned documents or any documents that need Optical Character Recognition (OCR)?

Yes, it does. The TMF Bot has an express pipeline where it performs Optical Character Recognition (OCR) on the uploaded document if necessary.

Does the TMF Bot use the text within a document when it auto-classifies?

Yes, it does. The TMF Bot uses the following information to auto-classify a document:

Number of pages
Number of characters
File type
File size
File name
Extracted text from the document

If the document has hand-written text, is that ok?

Yes, it is. Our OCR generally ignores hand-written text, and TMF Bot uses all other text for auto-classification.

Can the TMF Bot classify emails?

Yes, it can. We have seen some great success in the TMF Bot auto-classifying emails into the appropriate Relevant Communication document type. As with all auto-classification scenarios, there are cases where the confidence isn’t high enough for TMF Bot to auto-classify the emails.

What happens after auto-classification?

After the TMF Bot auto-classifies a document, who updates the Name?

If you are using document type auto-naming, the name will be automatically updated. If the document’s name is manually controlled, you need to modify the name while you Complete the document from the Document Inbox.

How does TMF Bot consider Security Profiles?

A user must have the Create Document permission on the document type the TMF Bot is trying to use for auto-classification. A user must also have general classify permissions within their permission set.

If the TMF Bot cannot auto-classify a document, how are the owners notified?

Because we have a goal to auto-classify documents within 5 seconds, we will not notify the Owner. Instead, you can use the TMF Bot field to understand if the document has finished processing with the TMF Bot or not. If it has finished and there is no auto-classification, then the user can Complete the document, selecting the appropriate document type.

Is auto-classification captured in the audit trail?

Yes, it is. The audit trail will show that the system has updated the Type, Subtype, and Classification of the document.

Who is the Owner of a document that TMF Bot has classified?

The user who uploaded the document is still the document Owner. The TMF Bot simply updates the document after uploading.

Will sharing settings be applied to documents in the Document Inbox picked up by the TMF Bot?

Yes, they will. Sharing Settings are not affected by the TMF Bot.

Will the TMF Bot also promote my document to the Final or Approved state?

No, it will not. The TMF Bot auto-classifies documents, but those documents remain in the Document Inbox until completed by a user. Their state does not change.

How do documents leave the Document Inbox?

You must still Complete the document to move it out of your Document Inbox.

Do I have a chance to check the document type the TMF Bot provided?

You can check the document’s classification within the Document Inbox before Completing it.

Misclassifications

Can I reclassify a document if the TMF Bot selected the wrong classification?

Yes, you can. The Reclassify option is available within the Document Inbox. When you manually reclassify a document, Vault tags the document as TMF Bot Misclassified.

Can I report on documents that TMF Bot misclassified (documents a user reclassified after the TMF Bot auto-classified them)?

The Prediction Metrics object tracks information about a model’s performance and is available as a section on the Trained Model page layout. You can use this object to create Prediction Metrics reports at the Classification-specific and Global Weighted Average levels.

We also capture this information in the Predictions object in Business Admin. This object is not reportable within the system as Vault stores this data in JSON strings. However, this data can be exported to and manipulated within Excel.
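
If you export the Predictions data, a short script can flatten the JSON strings into columns before you open the file in Excel. The sketch below assumes each exported value is a JSON object. The field names in the sample (document, predicted_type, confidence) are hypothetical placeholders and do not reflect the actual Predictions schema.

```python
import csv
import io
import json


def predictions_to_csv(json_strings):
    """Flatten a list of JSON object strings into CSV text, one row per object."""
    rows = [json.loads(s) for s in json_strings]
    # Use every key seen in any row as a column; missing values stay empty.
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample values with hypothetical field names.
sample = [
    '{"document": "doc-1", "predicted_type": "CV", "confidence": 0.93}',
    '{"document": "doc-2", "predicted_type": "Protocol", "confidence": 0.71}',
]
csv_text = predictions_to_csv(sample)
```

The resulting text can be saved with a .csv extension and opened in Excel, where each JSON key becomes a column.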

You can also use the TMF Bot Misclassified tag to filter on documents that users have manually reclassified so that you can identify and analyze misclassified documents. The TMF Bot Misclassified tag only applies to documents loaded into the Inbox.

When a user reclassifies a document auto-classified by the TMF Bot, does the machine learning model learn from this?

The machine learning model does not learn from reclassifications due to the large amount of time required to train it. However, we will track this feedback so that we can recommend retraining your machine learning models in a future release.

What happens after Study Metadata Extraction?

How does the TMF Bot consider Security Profiles?

A user must have View permission on the Study that the TMF Bot is trying to set on the document. If the user cannot View that Study, the TMF Bot will not set that Study on the document.

If the TMF Bot cannot set the Metadata values on a document, how are the owners notified?

Because we have a goal to set metadata quickly, we will not notify the Owner. Instead, you can use the TMF Bot field to understand if the document has finished processing with the TMF Bot or not. If it has finished and there is no Study, then the user can Complete the document, selecting the appropriate Study.

Is Metadata Extraction captured in the audit trail?

Yes, it is. The audit trail will show that the system has updated the Study field on the document.

Will sharing settings be applied to documents in the Document Inbox picked up by the TMF Bot?

Yes, they will. Sharing Settings are not affected by the TMF Bot unless the TMF Bot updates metadata that is used to determine Sharing Settings on a document while it is in the Unclassified lifecycle.

Will the TMF Bot also promote my document to the Final or Approved state?

No, it will not. The TMF Bot acts on documents, but those documents remain in the Document Inbox until completed by a user. Their state does not change.

How do documents leave the Document Inbox?

You must still Complete the document to move it out of your Document Inbox.

Do I have a chance to check the Study the TMF Bot provided?

You can check the document’s Study within the Document Inbox before Completing it.

Future functionality

Can I use the TMF Bot to check the classifications of my older documents?

Yes. You can configure a workflow to include a step where the TMF Bot will check the classification and metadata of an existing document as it goes through the workflow.

Will document auto-classification be available in the Vault Platform to be leveraged by other Vault Applications?

As of right now, the document auto-classification capabilities are specific to Clinical Operations. We may see more capabilities in other Vault Applications in the future.

Can the TMF Bot determine if a document is a new version of an existing document?

No, it cannot. We have considered this for the future, but it is not planned for the near term.
