Importing data into the Viral Host Range database

The guidance below explains how to import data into the VHRdb. The section labels correspond to those displayed during the import steps.

General information

Step: “Provide a new data source: General information”

Users must provide a name for their data source and indicate the corresponding domain of life (Bacteria, Archaea or Eukaryote). Users should tick the box to make their data public. Public data can be explored by every user of the VHRdb. Note that all data that have been published should be public. Users who choose to keep their data private can still share their data source with collaborators (see Visibility of your data source). Collaborators need a VHRdb account to access the data source. This option is useful for sharing work in progress among lab members.

Visibility of your data source

Step: “Provide a new data source: Adjusting its visibility”

When a data source is not public, its owners can share it with collaborators (registered VHRdb users) by providing the collaborator's first and last name, or the email address the collaborator uses in the application. See example below:

Sharing a data source

Description of the data source

Step: “Provide a new data source: Description”

This step is crucial for the quality of the information that will be available in the VHRdb. Users should provide as many experimental details as possible, including but not limited to: origin of viruses, origin of hosts, method used for the tests, who performed the experiment, and where the experiment was performed. A description is mandatory when the data source is public and recommended otherwise. When the data have been published or relate to a publication, users are strongly encouraged to indicate this here with a link to the publication. This will allow users to cite the data source appropriately.

How to provide the data

Step: “Provide a new data source: Way to provide it”

Users can either upload a compatible file, or generate a template to be filled in offline and then uploaded.

Upload a compatible file

Step: “Provide a new data source: Upload a file”

Select the compatible file and upload it.

Generating a template to share your data

Step: “Provide a new data source: Preparing the template”

Users who have a limited dataset can directly copy/paste the list of viruses and hosts into the respective boxes, following the examples shown. When available, NCBI identifiers should be indicated in parentheses. Note that identifiers can also be added after the data source has been uploaded, or at any later time once identifiers have been assigned to new sequences of viruses or hosts.

Hereafter are examples of viruses and hosts using various accepted formats:

Example 1, with viruses:

Copy/pasted from a spreadsheet

T4        NC_000866.4    27
T7        NC_001604.1
MyVirus

Example 2, with hosts:

One entry per line

E. coli MG1655 (NC_000913.3)
E. coli O157:H7 (AE005174.2)

Example 3, with hosts:

Entries separated with a semicolon

E. coli MG1655 (NC_000913.3); E. coli O157:H7 (AE005174.2)
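To make the accepted formats concrete, here is a minimal Python sketch of how such pasted text could be split into names and identifiers. It is not the VHRdb's own parser: the function name `parse_entries` and the regular expressions are assumptions made for illustration. It covers only the one-entry-per-line and semicolon-separated formats with identifiers in parentheses, not the spreadsheet format of Example 1.

```python
import re

# Hypothetical helper for illustration; not part of the VHRdb code base.
# An entry is a name optionally followed by identifiers in parentheses.
ENTRY_RE = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<ids>[^()]*)\))?$")


def parse_entries(text):
    """Split pasted text into (name, [identifiers]) tuples."""
    entries = []
    # Entries are separated by newlines or semicolons. A semicolon inside
    # parentheses separates identifiers, so split only outside parentheses.
    for raw in re.split(r"[;\n](?![^()]*\))", text):
        raw = raw.strip()
        if not raw:
            continue
        match = ENTRY_RE.match(raw)
        if match is None:
            raise ValueError(f"Unrecognized entry: {raw!r}")
        ids = match.group("ids") or ""
        identifiers = [i.strip() for i in ids.split(";") if i.strip()]
        entries.append((match.group("name").strip(), identifiers))
    return entries


hosts = parse_entries("E. coli MG1655 (NC_000913.3); E. coli O157:H7 (AE005174.2)")
viruses = parse_entries("T4 (NC_000866.4; HER:27)\nT7 (NC_001604.1)\nMyVirus")
```

Entries without parentheses, such as MyVirus, simply come back with an empty identifier list.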

Filling and uploading the template

Step: “Provide a new data source: Uploading the template”

Users can download the template file, which is an Excel spreadsheet (xlsx) where hosts are listed in columns and viruses in rows (see example below). Responses, which correspond to the results of interaction tests, can be entered in any numerical format, because the VHRdb allows users to map their data to the global three-state scheme afterwards (see Mapping responses to the global scheme). Any Excel file (preferably with the extension .xlsx) that is organized as suggested is a compatible file.

Below are shown examples of the template file before and after entering the responses:

Before entering the responses:

                          | E. coli MG1655 (NC_000913.3) | E. coli O157:H7 (AE005174.2)
T4 (NC_000866.4; HER:27)  | <response>                   | <response>
T7 (NC_001604.1)          | <response>                   | <response>
MyVirus                   | <response>                   | <response>

After entering the responses:

                          | E. coli MG1655 (NC_000913.3) | E. coli O157:H7 (AE005174.2)
T4 (NC_000866.4; HER:27)  | 2                            | 0
T7 (NC_001604.1)          | 1                            | 2
MyVirus                   | 2                            | 1

A complete description of the properties of compatible files is available here.
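As an illustration of the layout (hosts in the first row, viruses in the first column, numeric responses elsewhere), the following Python sketch checks a grid of cell values before upload. It is a hypothetical pre-check, not the validation performed by the VHRdb: the function name `check_grid` is invented here, the grid is assumed to have already been read from the spreadsheet into a list of rows, and treating empty cells as problems may be stricter than the VHRdb itself.

```python
def check_grid(rows):
    """Return a list of problems found in a virus-by-host response grid.

    `rows` mirrors the spreadsheet: the first row holds the host labels
    (its first cell is empty), and each later row starts with a virus
    label followed by one response per host.
    """
    if not rows or len(rows[0]) < 2:
        return ["the first row must list at least one host"]
    problems = []
    hosts = rows[0][1:]
    for line_no, row in enumerate(rows[1:], start=2):
        if not row or not str(row[0]).strip():
            problems.append(f"row {line_no}: missing virus label")
            continue
        responses = row[1:]
        if len(responses) != len(hosts):
            problems.append(
                f"row {line_no}: expected {len(hosts)} responses, got {len(responses)}"
            )
            continue
        for host, value in zip(hosts, responses):
            try:
                float(value)  # assumption: every response must parse as a number
            except (TypeError, ValueError):
                problems.append(
                    f"row {line_no}: response for {host!r} is not a number: {value!r}"
                )
    return problems


grid = [
    ["", "E. coli MG1655 (NC_000913.3)", "E. coli O157:H7 (AE005174.2)"],
    ["T4 (NC_000866.4; HER:27)", 2, 0],
    ["T7 (NC_001604.1)", 1, 2],
    ["MyVirus", 2, 1],
]
problems = check_grid(grid)
```

An empty `problems` list means the grid matches the layout assumed by this sketch.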

Mapping responses to the global scheme

Step: “Defining the mapping of data source …”

The core purpose of the VHRdb is to provide a tool for comparing independent data sources. We defined a three-state global scheme to summarize virus-host interactions, with “No infection” and “Infection” states, which are self-explanatory. We also included an “Intermediate” state to reflect situations where virus-host interactions are clearly different from the two other states. Note that users can submit data using one, two or three states.

The result of a virus-host interaction, hereafter called the response, must be a number. This is the only requirement. To make it easy to accept data sources from different users, any numeric value can be uploaded. For example, users can submit data with a two-state scheme (0/1), a continuous scale from 0 to 4, or a scale from 1 to 0.0001. The next step allows users to map the original scheme of the data source to the global scheme.

In the example shown below, original responses range from 0 to 4. To distribute the original responses among the states of the global scheme, users can adjust the thresholds between states. Here, any value below 0.58 corresponds to the “No infection” state, any value between 0.58 and 2.3 corresponds to the “Intermediate” state, and any value higher than 2.3 corresponds to the “Infection” state.

Example

Mapping example with float responses
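The thresholding described in this example can be sketched in a few lines of Python. This is an illustration only: the function name `map_response` is hypothetical, and the handling of values exactly equal to a threshold (assigned to “Intermediate” here) is an assumption, since the text does not specify it.

```python
def map_response(value, low, high):
    """Map an original response to one of the three global states.

    `low` is the threshold between "No infection" and "Intermediate";
    `high` is the threshold between "Intermediate" and "Infection".
    Assumption: values equal to a threshold count as "Intermediate".
    """
    if value < low:
        return "No infection"
    if value <= high:
        return "Intermediate"
    return "Infection"


# Thresholds taken from the example: 0.58 and 2.3 on a 0-to-4 scale.
states = [map_response(v, 0.58, 2.3) for v in (0, 1, 2.5, 4)]
```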

Computing details about the mapping thresholds

To compute the thresholds, a clustering algorithm is run on the original responses of the data source with three clusters \(c_1\), \(c_2\) and \(c_3\); the threshold between \(c_k\) and \(c_{k+1}\) is then the mean of the centroids of these two clusters. The code can be found in the RangeMappingForm implementation.
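The following Python sketch shows one way such thresholds could be computed. It is not the RangeMappingForm code: the text does not name the clustering algorithm, so a simple one-dimensional k-means is assumed here, and the function names `kmeans_1d` and `thresholds` as well as the even initial spread of centroids are choices made for this illustration.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster one-dimensional values with k-means; return sorted centroids.

    Assumption: k-means stands in for the unspecified clustering algorithm.
    """
    data = sorted(float(v) for v in values)
    if len(set(data)) < k:
        raise ValueError(f"need at least {k} distinct values")
    # Initialise the centroids evenly over the observed range.
    lo, hi = data[0], data[-1]
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the previous centroid when a cluster ends up empty.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def thresholds(values, k=3):
    """Return the midpoints between the centroids of adjacent clusters."""
    centroids = kmeans_1d(values, k)
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


cuts = thresholds([0, 0, 0, 2, 2, 2, 4, 4, 4])
```

With three clusters this yields two thresholds, one between “No infection” and “Intermediate” and one between “Intermediate” and “Infection”.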