Data Requirements
There are only two pieces of information you need for dms-viz:
- You'll need input data. Your data should have a quantitative metric associated with mutations to a protein sequence.
- You'll need a 3D structure for your protein. The structure can be associated with an RCSB ID or provided in a custom .pdb file.
Everything beyond these two requirements is optional.
There are certain cases where you will need to provide additional information. For example, if the reference positions in your input data don't match the positions in your .pdb file, you'll need to specify a sitemap. Also, if you have data from another dataset that you wish to include in filters or tooltips, you can provide join data to merge with the input data.
The formatting requirements for the input data and optional data are explained below in detail.
Input Data
The Input Data is the mutation-based data that you'd like to summarize and visualize on an interactive protein structure. It must contain a column with a quantitative metric that's associated with mutations in a protein sequence. For example, this data could be a fitness score associated with mutations to a protein, or a score that represents how a mutation changes antibody binding to an antigen. For detailed examples of use cases for dms-viz, check out these Vignettes.
Important!
The input data must be in .csv format. If your data is tabular but in another format, please convert it to .csv.
The input data must contain the following columns with exactly these names:
site or reference_site
This column should contain the site in the protein at which each measurement was made. This column can be numeric (e.g., [1, 2, 3, 4]) or it can contain strings (e.g., [1, 2, 2a, 2b, 3]). Additionally, the sites do not need to be contiguous (e.g., [1, 4, 5, 8]). The order of your sites is assumed to be their order in the data unless it is specified in the Sitemap using the sequential_site column. These reference sites will label the x-axis of all summary plots in dms-viz. In addition, the site or reference_site in the input data is assumed to match the position in the provided protein structure. If the sites are numbered differently between your data and protein structure, you must specify the correct mapping in the protein_site column of the Sitemap. For more details on what we mean by 'reference_site', check out the description of the sitemap file.
mutant
This column should contain the identity of the mutation that each measurement is associated with. These mutations should be represented using the IUPAC single-letter codes along with symbols for stop codons and gaps (e.g., R, M, P, *, -). If you need to extend or shrink this alphabet, you can do so using the --alphabet flag of configure-dms-viz.
wildtype
This column should contain the wildtype identity of the residue at a given site in the protein. For example, if a Proline (P) was mutated to an Alanine (A) at position 120 in the protein (P120A), there should be a P in the wildtype column for every row where the value of the site column is 120. This column will also be used to check how well the sequence of the protein structure you provided matches your data. Significant discrepancies can indicate that your reference, sequential, and protein sites are misaligned.
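The wildtype rule above (one wildtype residue per site) can be checked before handing the data to dms-viz. The following is a minimal sketch using only the Python standard library. The example rows and the helper name inconsistent_wildtypes are hypothetical, and this is not how dms-viz itself validates the column.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input data. Site 121 is deliberately inconsistent (K and Q).
EXAMPLE_CSV = """site,wildtype,mutant,fitness
120,P,A,-1.2
120,P,G,-0.4
121,K,R,0.3
121,Q,E,0.1
"""

def inconsistent_wildtypes(rows, site_col="site"):
    """Return {site: sorted wildtypes} for sites listing more than one wildtype."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[site_col]].add(row["wildtype"])
    return {site: sorted(wts) for site, wts in seen.items() if len(wts) > 1}

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
problems = inconsistent_wildtypes(rows)
print(problems)  # {'121': ['K', 'Q']}
```

An empty dictionary means every site has exactly one wildtype residue. If your file uses reference_site instead of site, pass site_col="reference_site".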
In addition to these three mandatory columns, you will also need to specify a metric column. The identity of this column is specified with the --metric flag of configure-dms-viz, and it can have any name:
<metric>
This column should contain the quantitative metric that you'd like to summarize and view on a protein structure. For example, this column could be called fitness and contain a score that reflects how individual mutations alter a protein's fitness.
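As a quick pre-flight check, the column requirements described so far can be tested against the header row of a file. This is a sketch under stated assumptions: the helper check_columns and the example rows are hypothetical, and the metric name is whatever you plan to pass to --metric.

```python
import csv
import io

REQUIRED = ("wildtype", "mutant")
SITE_NAMES = ("site", "reference_site")  # either name is accepted

def check_columns(header, metric):
    """Return a list of problems found in the header row of an input CSV."""
    problems = []
    if not any(name in header for name in SITE_NAMES):
        problems.append("missing 'site' or 'reference_site' column")
    for name in REQUIRED:
        if name not in header:
            problems.append(f"missing '{name}' column")
    if metric not in header:
        problems.append(f"missing metric column '{metric}'")
    return problems

# Hypothetical input data with a metric column called 'fitness'.
EXAMPLE_CSV = """site,wildtype,mutant,fitness
1,M,A,-2.1
2,K,R,0.3
2a,T,S,0.0
"""

header = next(csv.reader(io.StringIO(EXAMPLE_CSV)))
print(check_columns(header, metric="fitness"))  # []
print(check_columns(header, metric="escape"))   # ["missing metric column 'escape'"]
```

Column names are compared exactly, so a header such as Site or Mutant would be reported as missing.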
Depending on the design of your experiment, you can also include a "condition column" that specifies how your data is grouped if there are multiple conditions. This column is optional in general, but it is required if there are multiple measurements for the same mutations.
condition
This column should only be included if there are multiple measurements in the <metric> column for the same site/mutation combinations. For example, you'll need a condition column if your data contains a measurement like an antibody's escape for multiple 'epitopes' in an antigen. This column contains a unique identifier that's used to delineate between these measurements for each mutation. This 'identifier' will show up in an interactive legend next to the visualization.
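One way to decide whether a condition column is needed is to count how often each site/mutant pair occurs. The sketch below assumes a hypothetical escape metric with made-up values and hypothetical helper names. It illustrates the rule described above and is not taken from the dms-viz code.

```python
import csv
import io
from collections import Counter

# Hypothetical data: each mutation is measured once per epitope.
EXAMPLE_CSV = """site,wildtype,mutant,escape,condition
484,E,K,0.91,epitope_1
484,E,K,0.05,epitope_2
417,K,N,0.02,epitope_1
417,K,N,0.77,epitope_2
"""

def needs_condition(rows, site_col="site"):
    """True if any (site, mutant) pair appears in more than one row."""
    counts = Counter((row[site_col], row["mutant"]) for row in rows)
    return any(n > 1 for n in counts.values())

def is_unique_with_condition(rows, site_col="site"):
    """True if (site, mutant, condition) identifies every row uniquely."""
    counts = Counter(
        (row[site_col], row["mutant"], row["condition"]) for row in rows
    )
    return all(n == 1 for n in counts.values())

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
print(needs_condition(rows))           # True
print(is_unique_with_condition(rows))  # True
```

If needs_condition returns True and the rows are still not unique once the condition is included, the file likely contains duplicated measurements that should be resolved first.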
Sitemap
The Sitemap is a tabular .csv file that specifies the order of the site (reference_site) column in your input data and, optionally, how the site column corresponds to the numbering in the protein structure you provide.
Important!
The sitemap must be in .csv format. If your data is tabular but in another format, please convert it to .csv.
reference_site
This column must correspond to the site or reference_site column in your input data. If the protein_site isn't provided, this column is also assumed to correspond to the identity of the sites in the protein structure.
The reference_site refers to the identity of the sites that are mutated in your dataset. These sites will ultimately label the x-axis of the visualization. These 'reference' sites can sometimes differ from the sequential_site (described below); for example, the current SARS-CoV-2 Spike protein variants have insertions and deletions that cause the widely used Wuhan-Hu-1 'reference' numbering to differ from the sequential, numeric order of the data.
sequential_site
This column is the sequential order of the reference sites and must be a numeric column. This will determine the order of the protein sites in the visualizations.
protein_site
This optional column is only necessary if the reference_site values are different from the sites (residue numbering) in your provided protein structure. If they are different, this column is the position in the protein structure that corresponds to the reference_site values in your data.
chains
This optional column is only necessary if you've provided the protein_site column and there are multiple reference_site values for the same value of protein_site. This might be the case if your data corresponds to discontinuous chains in the protein structure, for example, if your data is measured over two separate chains with overlapping numbering schemes. Influenza HA protein structures, for instance, usually have separate chains with overlapping numbering for the stalk and the head, so the reference sites 102 and 30(HA) might both correspond to the residue number 102 in the PDB file. In that case, the only way to distinguish between them on the structure is with the identity of the chain (e.g., A vs. B). This column should have chains in the same format as the chains provided to --included-chains (i.e., a space-separated string of chains: "A B C D").
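To make the sitemap columns concrete, the sketch below builds a sitemap from an ordered list of reference sites, numbering them 1, 2, 3, and so on for sequential_site. The helper names build_sitemap and to_csv are hypothetical, and the optional protein_sites mapping stands in for whatever numbering your own structure uses.

```python
import csv
import io

# Reference sites in the order they should appear on the x-axis.
reference_sites = ["1", "2", "2a", "2b", "3"]

def build_sitemap(reference_sites, protein_sites=None):
    """Pair each reference site with a 1-based sequential_site.

    protein_sites is an optional {reference_site: protein_site} mapping.
    """
    rows = []
    for i, ref in enumerate(reference_sites, start=1):
        row = {"reference_site": ref, "sequential_site": i}
        if protein_sites is not None:
            row["protein_site"] = protein_sites[ref]
        rows.append(row)
    return rows

def to_csv(rows):
    """Serialize sitemap rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

sitemap = build_sitemap(reference_sites)
print(to_csv(sitemap))
```

Here the string sites 2a and 2b receive the sequential positions 3 and 4, which is what lets a numeric column define the plotting order for non-numeric reference sites.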
Join Data
Optionally, you might have additional data that you want to combine with your Input Data so that you can include its columns in the filters or tooltips of your visualization. The Join Data option streamlines that workflow.
Important!
The join data must be in .csv format. If your data is tabular but in another format, please convert it to .csv.
You can specify more than one .csv file if there are multiple sources of data that you want to take columns from. Check out the API reference entry on the --join-data flag for more details.
The Join Data must contain a site, wildtype, and mutant column, as these are used to join your incoming data with the Input Data.
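The effect of joining on site, wildtype, and mutant can be sketched with the standard library. This is only an illustration of a key-based left join. The expression column, the example values, and the helper names are hypothetical, and dms-viz may handle unmatched rows differently.

```python
import csv
import io

KEYS = ("site", "wildtype", "mutant")

# Hypothetical input data and join data sharing the three key columns.
INPUT_CSV = """site,wildtype,mutant,fitness
120,P,A,-1.2
120,P,G,-0.4
121,K,R,0.3
"""

JOIN_CSV = """site,wildtype,mutant,expression
120,P,A,0.8
121,K,R,1.1
"""

def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def left_join(input_rows, join_rows, columns):
    """Add the requested join columns to every input row.

    Input rows without a match keep an empty string in the new columns.
    """
    lookup = {tuple(row[k] for k in KEYS): row for row in join_rows}
    merged = []
    for row in input_rows:
        match = lookup.get(tuple(row[k] for k in KEYS), {})
        extra = {col: match.get(col, "") for col in columns}
        merged.append({**row, **extra})
    return merged

merged = left_join(read_rows(INPUT_CSV), read_rows(JOIN_CSV), ["expression"])
for row in merged:
    print(row["site"], row["mutant"], row["fitness"], row["expression"])
```

Every input row is kept, and the row for P120G has an empty expression value because the join data has no entry for it. Running a check like this beforehand shows how many mutations would lack the joined value in a tooltip or filter.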