Ten Mistakes to Avoid when Integrating Clinical Trials and Investigator Repositories

The best clinical trial plans rely on information from the best sources available. For some questions, curated databases are not always available or accessible, so public search portals are the best remaining option. However, insight comes from integration, and public repositories by definition are not integrated. Thus integration across sources is challenging for even the most programming-savvy clinical trial professionals.

The Project

We recently undertook a project to reconcile investigator records from ClinicalTrials.gov, FDA/BMIS and CMS/Sunshine. The problem we wished to solve is:

Can we build a coherent transcript of a clinical trial investigator's activity by integrating data across these sources?

The key requirement is to merge records to eliminate redundancy and increase the value of information. Having 5 records of John Smith from Clinical Site X from Cambridge, Massachusetts is not as valuable as 1 consolidated record for John Smith with references to 5 trials he participated in. Of course, therein lies the rub as that is only valuable if we feel confident that the 5 records are referring to the same individual. If we have too aggressively merged 5 John Smiths into 1 composite record, where these are five separate individuals, we have not added value to the data but rather subtracted it. 

In general we want to merge the obvious records and situations where we have high confidence, and avoid overly aggressive merge strategies. 

In the process of building the database we learned a lot about the limits of the data. In this article we share some of these lessons in the form of mistakes to avoid when turning public domain data into actionable insights:

Mistake #1: Assuming basic data such as city and state names are entered correctly

Not surprisingly, there are many possible spellings for Saint Louis. We have seen St. Louis, Ste. Louis, St Louis, St. Louise, etc. These are of course all Saint Louis, Missouri, or Saint Louis, MO, but to the computer they are at least 5 different cities and possibly 10.

In order to merge records based on Saint Louis, we need to make sure that it is spelled consistently within a data source, and across data sources.
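As a minimal sketch of this kind of normalization, the Python below lowercases the name, strips punctuation, and maps known variants to one canonical spelling. The alias table and the function name `normalize_city` are illustrative assumptions, not the code used in the project; a real table would be built from the variants actually found in each source.

```python
import re

# Hypothetical alias table: known variant spellings mapped to one canonical form.
# The entries are illustrative; a real table would be derived from the data.
CITY_ALIASES = {
    "st louis": "saint louis",
    "ste louis": "saint louis",
    "st louise": "saint louis",
}

def normalize_city(raw: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace, then apply aliases."""
    cleaned = re.sub(r"[^\w\s]", "", raw.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return CITY_ALIASES.get(cleaned, cleaned)

variants = ["St. Louis", "Ste. Louis", "St Louis", "St. Louise", "Saint Louis"]
canonical = {normalize_city(v) for v in variants}
print(canonical)  # all five variants collapse to a single city name
```

Applying the same function to every source before comparing records keeps the spelling consistent both within and across data sources.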

Mistake #2: Failing to reconcile country/city/state names with the geographic coordinates of a country/city/state

Not all countries, cities and towns have verified latitude and longitude coordinates. If you intend to provide geographic sorting for search results or density statistics based on location, you need to be sure that each country/city/state can be found in the geolocation database under the same name. For example, the country 'czechia' is referred to in a clinical trial, but the geolocation database uses 'czech republic'.

We used SimpleMaps.com to verify the lookup coordinates for every record found.
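A lookup of this kind might be sketched as follows. The table layout, the alias map, and the function name `lookup_coordinates` are assumptions made for illustration and do not reflect the SimpleMaps file format; the coordinates shown are approximate.

```python
from typing import Optional, Tuple

# Hypothetical slice of a geolocation table keyed by (country, state, city).
# The coordinates are approximate and for illustration only.
GEO_TABLE = {
    ("czech republic", "", "prague"): (50.08, 14.43),
}

# Hypothetical map from names seen in trial records to names used by the table.
COUNTRY_ALIASES = {"czechia": "czech republic"}

def lookup_coordinates(country: str, state: str, city: str) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) for a location, or None if it cannot be verified."""
    c = country.strip().lower()
    c = COUNTRY_ALIASES.get(c, c)
    return GEO_TABLE.get((c, state.strip().lower(), city.strip().lower()))

print(lookup_coordinates("Czechia", "", "Prague"))   # found via the alias
print(lookup_coordinates("Czechia", "", "Nowhere"))  # None: flag for review
```

Records that come back as None are the ones worth collecting into a review list, since they point to either a missing alias or a misspelled place name.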

Mistake #3: Assuming repository identifiers will be used consistently

The BMIS and CMS repositories attach an identifier to each record. It would seem that these could be used to de-duplicate and merge instances where an investigator is mentioned multiple times. It turns out this is NOT the case. We found that even within a given program year, for both BMIS and CMS, multiple records that obviously refer to the same investigator did not share a consistent identifier.

Repository identifiers are useful only for tracking back to the originating records. Do not rely on them for merging records. 

Mistake #4: Assuming investigative site names will be used consistently

A single physical site could be represented by as many names as there are trials participated in. For example, a site can be mentioned with its legal name in one trial, and with a department name as a suffix in another. Other word fragments found in legal entities are rarely used consistently. We found “hlth” for “health” and “mgmt” for “management” and “surg” for “surgery” just to name a few.

We overcame this by eliminating fragments and “stopwords” from site names and calculating the similarity of two site names for a given country/city/state combination. Using this method, site names can be safely merged when the geographic location is confirmed to be the same.
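One way to sketch this comparison in Python is shown below. It is not the project's implementation: the stopword list, the 0.85 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions, and the sketch expands the fragments to full words rather than deleting them, which is a variation on the approach described above.

```python
import re
from difflib import SequenceMatcher

# Fragments mentioned in the article, expanded to their full words.
ABBREVIATIONS = {"hlth": "health", "mgmt": "management", "surg": "surgery"}
# Assumed stopword list; a real one would be tuned against the data.
STOPWORDS = {"the", "of", "and", "inc", "llc", "dept", "department"}

def site_tokens(name: str) -> list:
    """Lowercase, split into words, expand fragments, drop stopwords."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    expanded = [ABBREVIATIONS.get(w, w) for w in words]
    return [w for w in expanded if w not in STOPWORDS]

def site_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two cleaned site names."""
    return SequenceMatcher(None, " ".join(site_tokens(a)), " ".join(site_tokens(b))).ratio()

def same_site(a: str, b: str, loc_a: tuple, loc_b: tuple, threshold: float = 0.85) -> bool:
    """Only compare names when country/state/city are confirmed identical."""
    if loc_a != loc_b:
        return False
    return site_similarity(a, b) >= threshold

loc = ("united states", "massachusetts", "cambridge")
print(same_site("Mercy Hlth Surg Center", "The Mercy Health Surgery Center", loc, loc))
```

Checking the location first keeps two similarly named sites in different cities from being merged, whatever the name similarity turns out to be.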

Aggregate analysis of clinical trial sites remains elusive until we are able to curate the remaining site names. Until then, using these techniques it is possible to see the aggregate analysis of a given geographic location.

Mistake #5: Failing to look in multiple places within a document for records

For the BMIS and CMS repositories, the records in the files they provide are ordered by investigator. However, ClinicalTrials.gov documents contain investigators in multiple places. The files contain “Overall Official” and “Location” sections, each of which can be a source of investigator information. All of these need to be evaluated in order to extract comprehensive investigator listings.
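The sketch below illustrates walking both sections of a single trial document. The element names (`overall_official`, `location`, `investigator`, `last_name`, and so on) and the sample document are assumptions about the XML layout made for illustration; they should be checked against the files actually downloaded.

```python
import xml.etree.ElementTree as ET

# Hypothetical trial document; element names are assumed, not confirmed.
SAMPLE = """
<clinical_study>
  <overall_official>
    <last_name>Jane Doe, MD</last_name>
    <affiliation>Clinical Site X</affiliation>
  </overall_official>
  <location>
    <facility>
      <name>Clinical Site X</name>
      <address><city>Cambridge</city><state>Massachusetts</state></address>
    </facility>
    <investigator>
      <last_name>John Smith, MD</last_name>
    </investigator>
  </location>
</clinical_study>
"""

def extract_investigators(xml_text: str) -> list:
    """Collect investigators from both the official and location sections."""
    root = ET.fromstring(xml_text)
    found = []
    for official in root.iter("overall_official"):
        found.append({
            "name": official.findtext("last_name", default=""),
            "section": "overall_official",
            "site": official.findtext("affiliation", default=""),
            "city": "",
        })
    for location in root.iter("location"):
        site = location.findtext("facility/name", default="")
        city = location.findtext("facility/address/city", default="")
        for inv in location.iter("investigator"):
            found.append({
                "name": inv.findtext("last_name", default=""),
                "section": "location",
                "site": site,
                "city": city,
            })
    return found

records = extract_investigators(SAMPLE)
for r in records:
    print(r["section"], r["name"], r["site"], r["city"])
```

Recording which section each name came from is useful later, because the location section carries geography while the official section in this sketch does not.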

Mistake #6: Merging records too aggressively

Obvious candidates for merge have identical values for first name, last name, city, country, state, site. The difficulty comes when one or more pieces of information are either missing or somehow different. 

There are cases where the name is known but the geography is not. In this case, if the name is otherwise unique, it can be safely merged. However, if names are identical but the geography is different, it’s likely that this person has moved over the course of a career. Despite the likelihood that it’s the same person, it’s risky to let the machine do the merge, as you will incur a high number of false merges.

Mistake #7: Merging records too conservatively

In cases where mismatches seem minor, such as ‘first name’ spelling differences, one can fall back to ‘first initial’ plus ‘middle initial’ in place of ‘first name’. If geography is consistent, this is generally safe. Also, additional pieces of information can be used to complete the merge, such as email and phone number.
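The balance between merging too aggressively and too conservatively can be expressed as a small decision rule. The following is a sketch under stated assumptions, not the rule set used in the project: the `Investigator` fields, the requirement that geography be complete, and the order of the checks are illustrative choices. The case of a unique name with no geography is left out because it needs knowledge of the whole dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Investigator:
    first: str
    last: str
    middle: str = ""
    city: str = ""
    state: str = ""
    country: str = ""
    email: str = ""
    phone: str = ""

def geography(r: Investigator) -> tuple:
    return (r.city.lower(), r.state.lower(), r.country.lower())

def initials(r: Investigator) -> tuple:
    return (r.first[:1].lower(), r.middle[:1].lower())

def should_merge(a: Investigator, b: Investigator) -> bool:
    """Merge only when last name and full geography agree, then relax first name."""
    if a.last.lower() != b.last.lower():
        return False
    geo_a, geo_b = geography(a), geography(b)
    if not all(geo_a) or not all(geo_b) or geo_a != geo_b:
        return False  # missing or different geography: leave for human review
    if a.first.lower() == b.first.lower():
        return True
    # First names differ: fall back to first initial plus middle initial.
    if all(initials(a)) and initials(a) == initials(b):
        return True
    # Last resort: a shared email or phone number completes the merge.
    same_email = bool(a.email) and a.email.lower() == b.email.lower()
    same_phone = bool(a.phone) and a.phone == b.phone
    return same_email or same_phone

a = Investigator("Jon", "Smith", "A", "Cambridge", "MA", "US")
b = Investigator("John", "Smith", "A", "Cambridge", "MA", "US")
c = Investigator("John", "Smith", "A", "Boston", "MA", "US")
print(should_merge(a, b), should_merge(b, c))
```

Returning False does not mean the records are different people, only that the machine should not merge them on its own.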

Mistake #8: Failing to extract the core value from each data source

Each of the three data sources (ClinicalTrials.gov, BMIS, CMS) brings different value, and it’s important to extract and retain that value as records are merged and deduplicated. Taken together, these provide a comprehensive and composite picture of an investigator’s activity.

The most valuable record of an investigator’s activity results from the aggregate analysis of the clinical trials that investigator has participated in. Thus, the trial identifiers hold the key to unlock that investigator’s clinical trial resume.

The BMIS resource is the most challenging and least valuable resource in terms of building composite views of investigator trial activity. In addition to contact information, the content of the BMIS record provides just the submission date for results of a clinical trial. In order to get the value of that information we have to merge ALL BMIS records for the individual. Only then can a timeline of review dates be constructed to show the trial activity of that individual. Unfortunately, there is no information about what diseases were studied or what specific trials were connected to the result. This information is enhanced when combined with trial history from ClinicalTrials.gov and CMS. Absent that, if a sponsor is recruiting for a specific geographic area and the investigator is shown to have recent activity, even if there is no corresponding transcript in the trial archive, it may be worth a contact as part of the recruitment process.

The CMS Sunshine data provides sponsor payment information to the investigator together with clinical trial identifiers. For US investigators at least, this allows the reconstruction of the list of participating investigators for a given trial, where that information was absent from the clinical trial record available at ClinicalTrials.gov. In addition, the volume of payments for a given program year, together with the condition and indication information from the trial documents themselves, can bring great insight into an investigator’s activity and expertise.

Mistake #9: Losing information as you merge across sources

As we merge and disambiguate investigator records from multiple sources, it’s important to retain a reproducible “track-back” to the originating source records. That way, if we merged in error, we can rewind the record back to the erroneous merge and restore the two records in place of the one.

The basic algorithm requires that the identifier (ID) for the first occurrence of the investigator prevail throughout all subsequent merge operations. While the originating ID prevails, updates to data items should reflect the latest data found, for example the most recent email or phone number. The surviving record should contain links back to the “ancestor” records so that the merge history can be recreated.
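A minimal sketch of such a merge, assuming a simple in-memory `Record` type (a hypothetical structure, not the project's schema), could look like this. The survivor keeps its original ID, non-empty values from the later record overwrite older ones, and both pre-merge records are kept as ancestors so the merge can be undone.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    id: str
    source: str
    data: dict
    ancestors: list = field(default_factory=list)

def merge(survivor: Record, incoming: Record) -> Record:
    """Keep the survivor's ID, take the latest non-empty values, link ancestors."""
    # Snapshot the survivor as it was before this merge so it can be restored.
    snapshot = Record(survivor.id, survivor.source, dict(survivor.data), list(survivor.ancestors))
    merged_data = dict(survivor.data)
    merged_data.update({k: v for k, v in incoming.data.items() if v})
    return Record(survivor.id, survivor.source, merged_data, [snapshot, incoming])

def undo(merged: Record) -> tuple:
    """Rewind one merge: return the two records that were combined."""
    before, other = merged.ancestors
    return before, other

first = Record("ctgov-1", "ClinicalTrials.gov", {"name": "John Smith", "email": "old@example.org"})
later = Record("cms-9", "CMS", {"name": "John Smith", "email": "new@example.org", "phone": ""})

merged = merge(first, later)
print(merged.id, merged.data["email"])  # originating ID, latest email
restored, other = undo(merged)
print(restored.data["email"], other.id)
```

Because each snapshot carries its own ancestors, repeated merges form a chain that can be rewound one step at a time.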

Mistake #10: Forgetting that eventually human intervention may be needed

Many clinical trials include contacts with names like “Call Center at Pharma XYZ”. These names are consistent across trials, resulting in highly active yet phony investigator records. It’s impossible to ferret these out until the database is built and these records become visible.

A simple “Mark as Sponsor” button on the UI marks these for removal from the active database. However, we want to keep them around as a filter for subsequent database build operations.
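Reusing the marked names as a filter in the next build might be sketched as below. The record shape and the function name `split_by_blocklist` are hypothetical; the point is only that marked names are kept in a list and applied before new records enter the active database.

```python
def split_by_blocklist(records: list, blocked_names: list) -> tuple:
    """Separate records whose name was previously marked as a sponsor contact."""
    blocked = {n.strip().lower() for n in blocked_names}
    kept, removed = [], []
    for r in records:
        target = removed if r["name"].strip().lower() in blocked else kept
        target.append(r)
    return kept, removed

# Names marked through the UI in an earlier build (illustrative).
marked = ["Call Center at Pharma XYZ"]
incoming = [{"name": "call center at pharma xyz"}, {"name": "John Smith"}]

kept, removed = split_by_blocklist(incoming, marked)
print(len(kept), len(removed))
```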

Other issues arise when two records persist when one is desired, for example when an investigator works for a sponsor and the sponsor address is captured. A simple UI sequence to select the two records followed by “Merge Selected” will perform the desired operation and provide the necessary audit trail in case we need to “undo” the manual merge operation.


The result is more than 300,000 investigators annotated with at least 1 trial to their credit. Interestingly, 221,043 of these have just ONE trial; 47,047 were added in the last year. 75,699 have 2-5 trials and 14,481 have more than 5 trials.

The resulting archive of clinical trials and investigator records provides an enriched database for mining active and relevant investigators. The database can be used standalone or as a complement to existing CRO and sponsor databases.


With careful attention to detail, we can build a coherent transcript of a clinical trial investigator’s activity that is useful for investigator identification and recruiting. By integrating data across many sources and annotating with detailed clinical trial transcripts, we get an excellent means to search and identify relevant investigators. Moreover, using automation allows for continuous monitoring of changes to investigator profiles, making this living transcript an excellent complement to curated and well-annotated legacy databases.
