# Importing data into the Viral Host Range database

The guidance below explains how to import data into the VHRdb. Section titles correspond to the labels displayed during the import steps.

## General information

Step: “Provide a new data source: General information”

Users need to provide a name for their data source and indicate the corresponding life domain (Bacteria, Archaea or Eukaryote). To make the imported data publicly visible, users should tick the corresponding box; this allows all VHRdb users to search and explore the data source. Note that all published data should be public. Users who choose to keep their data private can still share their data source with collaborators (see Visibility of your data source). Collaborators need a VHRdb account to access the data source. This option can be useful, for instance, for sharing work in progress among lab members.

## Visibility of your data source

Step: “Provide a new data source: Adjusting its visibility”

When a data source is not public, its owners can share it with collaborators (registered VHRdb users) by providing each collaborator's first and last name, or the email address they use in the application.

## Description of the data source

Step: “Provide a new data source: Description”

This step is crucial for the quality of the information that will be available in the VHRdb. Users should provide as many experimental details as possible, including but not limited to: origin of viruses, origin of hosts, method used for the tests, who performed the experiment, and where the experiment was performed. This description is mandatory for public data sources and recommended otherwise. If the data are published or related to a publication, it is strongly recommended to include a link to that publication here. This allows other users to cite the data source properly.

## How to provide the data

Step: “Provide a new data source: Way to provide it”

Users can either upload a compatible file, or generate a template to be filled in offline and then uploaded.

Step: “Provide a new data source: Upload a file”

Select the compatible file and upload it.

## Generating a template to share your data

Step: “Provide a new data source: Preparing the template”

Users who have a limited dataset can directly copy and paste the lists of viruses and hosts into the respective boxes, following the examples shown. When available, users are encouraged to indicate the NCBI identifiers in parentheses. Note that identifiers can also be added after the data source has been uploaded, or at any later time once identifiers have been assigned to new sequences of viruses or hosts. More information is given in Identifying viruses and hosts.

Hereafter are examples of viruses and hosts using various accepted formats:

### Example 1, with viruses:

One entry per line

    T4 (NC_000866.4; HER:27)
    T7 (NC_001604.1)
    MyVirus

### Example 2, with hosts:

One entry per line

    E. coli MG1655 (NC_000913.3)
    E. coli O157:H7 (AE005174.2)

### Example 3, with hosts:

Entries separated with a semicolon

    E. coli MG1655 (NC_000913.3); E. coli O157:H7 (AE005174.2)
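As an illustration of the two layouts above (one entry per line, or entries separated with a semicolon), the following Python sketch splits pasted text into entries and extracts the identifiers given in parentheses. The function names `split_entries` and `parse_entry` are hypothetical and this is not the VHRdb parser; it assumes that semicolons inside parentheses separate identifiers rather than entries.

```python
import re

# Hypothetical helpers for illustration; VHRdb's own parsing may differ.
_ENTRY = re.compile(r"^(?P<name>[^()]+?)\s*(?:\((?P<ids>[^()]*)\))?$")


def split_entries(text):
    """Split pasted text on newlines and on semicolons outside parentheses."""
    entries, current, depth = [], [], 0
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth = max(depth - 1, 0)
        if char == "\n" or (char == ";" and depth == 0):
            entries.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    entries.append("".join(current).strip())
    # Drop blank entries produced by empty lines or trailing separators.
    return [entry for entry in entries if entry]


def parse_entry(entry):
    """Return (name, [identifiers]) for one 'Name (id1; id2)' entry."""
    match = _ENTRY.match(entry)
    if match is None:
        raise ValueError(f"Unrecognized entry: {entry!r}")
    ids = match.group("ids") or ""
    identifiers = [part.strip() for part in ids.split(";") if part.strip()]
    return match.group("name").strip(), identifiers


hosts = split_entries("E. coli MG1655 (NC_000913.3); E. coli O157:H7 (AE005174.2)")
print([parse_entry(host) for host in hosts])
```

An entry without parentheses, such as `MyVirus`, is returned with an empty identifier list.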

Users can download the template file, which is an Excel spreadsheet (xlsx) where hosts are listed in columns and viruses in rows (see example below). Responses, which correspond to the results of interaction tests, can be entered in many numerical formats, as the VHRdb allows users to later map their data to the global three-state scheme (see Mapping responses to the global scheme). Any Excel file (preferably with the extension .xlsx) that is organized as suggested is a compatible file.

Below are examples of the template file before and after entering the responses.

Before:

| | E. coli MG1655 (NC_000913.3) | E. coli O157:H7 (AE005174.2) |
| --- | --- | --- |
| T4 (NC_000866.4; HER:27) | | |
| T7 (NC_001604.1) | | |
| MyVirus | | |

After:

| | E. coli MG1655 (NC_000913.3) | E. coli O157:H7 (AE005174.2) |
| --- | --- | --- |
| T4 (NC_000866.4; HER:27) | 2 | 0 |
| T7 (NC_001604.1) | 1 | 2 |
| MyVirus | 2 | 1 |

A complete description of the properties of compatible files is available here.
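To make the layout concrete, here is a minimal Python sketch that flattens a grid organized like the template (hosts in the first row, viruses in the first column) into one record per virus-host pair. The function name `grid_to_triples` and the in-memory `sheet` are illustrative assumptions; reading an actual .xlsx file would need a spreadsheet library and is left out.

```python
def grid_to_triples(rows):
    """Flatten a template-style grid into (virus, host, response) triples.

    `rows` is a list of rows: the first row holds an empty corner cell
    followed by host labels, and each later row holds a virus label
    followed by one response per host. Empty cells are skipped.
    """
    header, *body = rows
    hosts = header[1:]
    triples = []
    for row in body:
        virus, responses = row[0], row[1:]
        for host, response in zip(hosts, responses):
            if response is None or response == "":
                continue
            triples.append((virus, host, float(response)))
    return triples


# Same content as the filled-in template example.
sheet = [
    ["", "E. coli MG1655 (NC_000913.3)", "E. coli O157:H7 (AE005174.2)"],
    ["T4 (NC_000866.4; HER:27)", 2, 0],
    ["T7 (NC_001604.1)", 1, 2],
    ["MyVirus", 2, 1],
]

for triple in grid_to_triples(sheet):
    print(triple)
```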

## Mapping responses to the global scheme

Step: “Defining the mapping of data source …”

One of the main goals of the VHRdb is to enable the comparison of independent data sources. We defined a three-state global scheme to summarize virus-host interactions, with “No infection” and “Infection” states, which are self-explanatory. We also included an “Intermediate” state to reflect situations where virus-host interactions are clearly different from the two other states. Note that users can submit data using one, two or three states.

The result of a virus-host interaction, hereafter called the response, must be a number. This is the only requirement. To accommodate data sources from different users, any numerical value can be uploaded. For example, users can submit data with a two-state scheme (0/1), a continuous scale from 0 to 4, or a scale from 1 to 0.0001. The next step allows users to map the original scheme of the data source to the global scheme.

For example, with original responses ranging from 0 to 4, users can adjust the thresholds separating the states of the global scheme to distribute the original responses among them. With thresholds set at 0.58 and 2.3, any value below 0.58 corresponds to the “No infection” state, any value between 0.58 and 2.3 to the “Intermediate” state, and any value above 2.3 to the “Infection” state.
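The threshold rule from this example can be written as a short Python sketch. The function name `map_response` is hypothetical, and how a value exactly equal to a threshold is handled is an assumption made here (it is placed in “Intermediate”), since the text does not specify it.

```python
def map_response(value, low=0.58, high=2.3):
    """Map an original response to one of the three global states.

    Assumption: values exactly equal to `low` or `high` are treated as
    "Intermediate"; the application may handle boundaries differently.
    """
    if value < low:
        return "No infection"
    if value <= high:
        return "Intermediate"
    return "Infection"


# Original responses on a 0 to 4 scale.
for response in [0, 1, 2, 3, 4]:
    print(response, map_response(response))
```

With these thresholds, 0 maps to “No infection”, 1 and 2 to “Intermediate”, and 3 and 4 to “Infection”.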

### Computing details about the mapping thresholds

To compute the thresholds, a clustering algorithm is run on the original responses of the data source with three clusters $$c_1$$, $$c_2$$ and $$c_3$$; the threshold between $$c_k$$ and $$c_{k+1}$$ is then the mean of the centroids of these two clusters. The source code corresponding to this algorithm can be found in the RangeMappingForm implementation.
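The following Python sketch illustrates the idea under stated assumptions: it uses a plain one-dimensional k-means (Lloyd's algorithm) as the clustering step, which may not be the algorithm used in RangeMappingForm, and the names `kmeans_1d` and `thresholds` are hypothetical. Each threshold is the mean of two adjacent centroids, as described above.

```python
def kmeans_1d(values, k=3, iterations=100):
    """One-dimensional Lloyd's k-means; returns the centroids in sorted order.

    Assumes k >= 2. This is an illustrative stand-in for the clustering
    step, not the algorithm used by the application.
    """
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly over the observed range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: abs(value - centroids[i]))
            clusters[nearest].append(value)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def thresholds(centroids):
    """Mean of each pair of adjacent centroids."""
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


responses = [0, 0.5, 2, 2, 3.5, 4]
centroids = kmeans_1d(responses)
print(centroids)              # [0.25, 2.0, 3.75]
print(thresholds(centroids))  # [1.125, 2.875]
```

For this small data set, the clusters are {0, 0.5}, {2, 2} and {3.5, 4}, so the two thresholds fall halfway between the neighbouring centroids.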