What is a compatible file?

Short version

A compatible file is an Excel spreadsheet (.xlsx) in which hosts are listed in columns and viruses are listed in rows. Each cell contains the response (a number) for the interaction between one host and one virus (see example below).

| | E. coli MG1655 (NC_000913.3) | E. coli O157:H7 (AE005174.2) |
|---|---|---|
| T4 (NC_000866.4; HER:27) | 2 | 0 |
| T7 (NC_001604.1) | 1 | 2 |
| MyVirus | 2 | 1 |

Warning

  • Only data located in the first Excel sheet will be taken into account.

  • Duplicated hosts or viruses are not allowed.

  • At least three empty lines must remain between the main table and any other information you may want to add below it, such as the legend in a generated template.

Users can also download the xlsx version here.
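The layout described above can be sketched in a few lines of Python. This is only an illustration, not the VHRdb import code: the function name `read_responses` and the representation of the sheet as a list of rows are assumptions, and reading the .xlsx file itself (which needs a spreadsheet library) is left out.

```python
def read_responses(rows):
    """Turn a grid laid out as in the example (hosts in the first row,
    viruses in the first column) into {(virus, host): response}."""
    header, *body = rows
    hosts = header[1:]  # the top-left cell is empty
    responses = {}
    for row in body:
        virus, values = row[0], row[1:]
        for host, value in zip(hosts, values):
            responses[(virus, host)] = float(value)
    return responses


# Hypothetical in-memory copy of the first sheet of the example file.
grid = [
    [None, "E. coli MG1655 (NC_000913.3)", "E. coli O157:H7 (AE005174.2)"],
    ["T4 (NC_000866.4; HER:27)", 2, 0],
    ["T7 (NC_001604.1)", 1, 2],
    ["MyVirus", 2, 1],
]
responses = read_responses(grid)
print(responses[("T4 (NC_000866.4; HER:27)", "E. coli O157:H7 (AE005174.2)")])  # 0.0
```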

Detailed version

Values accepted as responses

Users record the output of host range tests in different ways, from a basic two-state result (infection/no infection) to more detailed information such as numerical values derived from efficiency of plating calculations. The VHRdb accepts any response that is a number. It is highly recommended to assign the lowest value to the no-infection state and the highest value to the infection state; optional intermediate states should then take values between the two. The range between the lowest and highest values can be as large as you wish: for example, you can upload a data source ranging from 0 to 100 or from 0.0001 to 1. The VHRdb mapping scheme lets users freely modify the threshold values that assign each response to a state of the global scheme (No infection, Intermediate, Infection).
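As an illustration of threshold-based mapping, here is a minimal sketch. The function `to_global_state`, its two threshold parameters, and the choice of which bound is inclusive are all assumptions made for the example; the VHRdb may define its thresholds differently.

```python
def to_global_state(response, intermediate_from, infection_from):
    """Map a numeric response onto the three-state global scheme.

    Both thresholds are hypothetical, user-chosen parameters:
    responses below `intermediate_from` count as "No infection",
    responses at or above `infection_from` count as "Infection",
    and everything in between is "Intermediate".
    """
    if intermediate_from > infection_from:
        raise ValueError("intermediate_from must not exceed infection_from")
    if response < intermediate_from:
        return "No infection"
    if response >= infection_from:
        return "Infection"
    return "Intermediate"


# One source recorded on a 0-2 scale, another ranging from 0.0001 to 1.
print([to_global_state(v, 1, 2) for v in (0, 1, 2)])
print([to_global_state(v, 0.001, 0.5) for v in (0.0001, 0.01, 1)])
```

Both calls give the same three states in order, which is what makes sources recorded on different scales comparable once their thresholds are set.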

Providing the identifier of a virus or a host

Users can add identifiers for viruses and hosts in two different ways. When filling in the spreadsheet, users can add identifiers (between parentheses) in the same heading cell as the virus or host name (see example below). Identifiers can be NCBI identifiers, HER identifiers and/or any custom identifiers. Multiple identifiers must be separated by a semicolon (;). More documentation can be found here.

The second way to enter identifiers is to edit the VHRdb data after the source table has been uploaded. This is particularly convenient for adding NCBI identifiers that are obtained only after the source table has been uploaded.

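Examples: T4 (NC_000866.4; HER:27) for a virus with an NCBI identifier and a HER identifier; E. coli MG1655 (NC_000913.3) for a host with an NCBI identifier.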
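A heading cell written in this format can be split into a name and a list of identifiers. The sketch below is an assumption about how such a cell could be parsed, not the parser used by the VHRdb; `split_header_cell` is a hypothetical helper, and it expects the parentheses to appear only once, at the end of the cell.

```python
import re

# A name, optionally followed by "(id1; id2; ...)" at the end of the cell.
HEADER_CELL = re.compile(r"^(?P<name>[^()]*?)\s*(?:\((?P<ids>[^()]*)\))?\s*$")


def split_header_cell(cell):
    """Return (name, [identifiers]) for one virus or host heading cell."""
    match = HEADER_CELL.match(cell.strip())
    if match is None:
        raise ValueError(f"unrecognised header cell: {cell!r}")
    name = match.group("name").strip()
    ids = match.group("ids") or ""
    # Identifiers are separated by semicolons; empty pieces are dropped.
    identifiers = [part.strip() for part in ids.split(";") if part.strip()]
    return name, identifiers


print(split_header_cell("T4 (NC_000866.4; HER:27)"))  # ('T4', ['NC_000866.4', 'HER:27'])
print(split_header_cell("MyVirus"))                   # ('MyVirus', [])
```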

How to compare data across several data sources

Every data source is processed using the same mapping procedure linked to a simplified three-state global scheme. Therefore, all data sources can be compared.

Can cells of the Excel source file be colored?

Cell colors in the source file are not taken into account. Therefore, if you ranked your responses using a color scheme only, you must convert it to numbers before submission to the VHRdb. If you use colors in addition to numbers, you do not need to remove them: they will not interfere with the upload. Note that colors are also used in exported files, when possible, to improve readability.

Variants in the file layout

  1. The header of a row (or column) in the spreadsheet can be preceded by one or more columns (or rows). Only the last header column (or row) will be taken into account. In the following example, only the host names, the virus names and the responses will be uploaded into the VHRdb; the "Obtained from", "On" and "Virus" cells are ignored.

| | | Obtained from John doe | Obtained from Jane doe |
|---|---|---|---|
| | | On 2009-02-21 | On 2012-03-21 |
| | | E. coli MG1655 | E. coli O157:H7 |
| Virus | T4 (NC_000866.4) | 2 | 0 |
| Virus | T7 (NC_001604.1) | 1 | 2 |
| Virus | MyVirus | 2 | 1 |

  2. The data must be on the first sheet of the file.

  3. There can be additional rows after the responses. They will not be imported as long as nothing is written where the virus (or host) name is expected. In the following example, the last row will not be imported.

| | | E. coli MG1655 | E. coli O157:H7 |
|---|---|---|---|
| Virus | T4 (NC_000866.4) | 2 | 0 |
| Virus | T7 (NC_001604.1) | 1 | 2 |
| Virus | MyVirus | 2 | 1 |
| Infection Ratio | <Must be blank> | 100% | 50% |

  4. There can be any additional rows after the responses as long as at least three empty lines separate them from the responses. In the following example, Batch will not be considered as a virus.

| | E. coli MG1655 | E. coli O157:H7 |
|---|---|---|
| T4 (NC_000866.4) | 2 | 0 |
| T7 (NC_001604.1) | 1 | 2 |
| MyVirus | 2 | 1 |
| | | |
| | | |
| | | |
| Batch | 11603 | Jane Doe |
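The two rules on trailing rows can be sketched as follows. This is one possible reading of the rules, not the actual import module: `importable_rows`, `name_col` and `max_gap` are hypothetical names, the rows are assumed to be the part of the sheet below the host names, and all rows are assumed to have the same length.

```python
def is_blank(cell):
    return cell is None or str(cell).strip() == ""


def importable_rows(rows, name_col=0, max_gap=3):
    """Yield the rows that would be read as viruses under the rules above."""
    empty_streak = 0
    for row in rows:
        if all(is_blank(cell) for cell in row):
            empty_streak += 1
            if empty_streak >= max_gap:
                break  # everything after three empty rows is ignored
            continue
        empty_streak = 0
        if is_blank(row[name_col]):
            continue  # nothing where the virus name is expected
        yield row


body = [
    ["T4 (NC_000866.4)", 2, 0],
    ["T7 (NC_001604.1)", 1, 2],
    ["MyVirus", 2, 1],
    [None, None, None],
    [None, None, None],
    [None, None, None],
    ["Batch", 11603, "Jane Doe"],
]
print([row[0] for row in importable_rows(body)])
# ['T4 (NC_000866.4)', 'T7 (NC_001604.1)', 'MyVirus']

annotated = [
    ["Virus", "T4 (NC_000866.4)", 2, 0],
    ["Infection Ratio", None, "100%", "50%"],
]
print([row[1] for row in importable_rows(annotated, name_col=1)])
# ['T4 (NC_000866.4)']
```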

Duplication of virus or host

Files containing a duplicated host or virus are not allowed. If you provide one, the import will keep only one occurrence of the duplicated entry. Whether the first or the last occurrence is kept depends on implementation details and may change at any time. When a file with duplicates is imported, a warning is displayed indicating which occurrence is kept.
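Duplicates can be spotted before upload with a short check such as the one below. It is a sketch under a simplifying assumption: two entries are treated as duplicates only when their heading cells are exactly identical, which may be stricter or looser than the comparison the VHRdb performs. `duplicated_names` is a hypothetical helper.

```python
from collections import Counter


def duplicated_names(names):
    """Return the names that appear more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in dict.fromkeys(names) if counts[name] > 1]


viruses = ["T4 (NC_000866.4)", "T7 (NC_001604.1)", "T4 (NC_000866.4)"]
print(duplicated_names(viruses))  # ['T4 (NC_000866.4)']
```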

Empty virus name or host name

A virus or a host must have at least a name or an identifier, ideally both. A heading cell with neither a name nor an identifier is not allowed, and the import will reject such data.

Robustness of file import

The resilience of the import module when reading and interpreting files is of paramount importance. We therefore generated many configurations in which a file could be written, together with how each should be read. At each change in the programme, we test that every file is still read as expected. The file collection can be browsed at https://gitlab.pasteur.fr/hub/viralhostrangedb/tree/master/src/viralhostrange/test_data, where for an input file <filename>.xlsx the data we extract from it is stored in <filename>.xlsx.json.
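The idea of pairing each <filename>.xlsx with a <filename>.xlsx.json can be sketched as a small regression check. This is not the project's test suite: `check_collection` is a hypothetical function, `parse` stands in for the import module, and the content of the JSON files is simply compared for equality with whatever `parse` returns.

```python
import json
from pathlib import Path


def check_collection(directory, parse):
    """Compare parse(<file>.xlsx) with the content of <file>.xlsx.json.

    `parse` is a hypothetical callable standing in for the import module;
    it takes a path and returns JSON-compatible data.
    Returns the names of the files whose output differs.
    """
    failures = []
    for xlsx in sorted(Path(directory).glob("*.xlsx")):
        expected_path = xlsx.with_name(xlsx.name + ".json")
        expected = json.loads(expected_path.read_text(encoding="utf-8"))
        if parse(xlsx) != expected:
            failures.append(xlsx.name)
    return failures
```

An empty list means every file in the directory is still read as recorded.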

If you tried to import a file that should work but did not, please do not hesitate to submit an issue at https://gitlab.pasteur.fr/hub/viralhostrangedb/issues with the file cleaned of its private data. You can also e-mail the file and the steps to reproduce the bug to viralhostrangedb@pasteur.fr. We will correct the import module so that it either processes the file successfully or returns a clear and understandable error message to users.