User guide

Obtaining an account for the server

We have an online form to sign up for an account. It asks for your academic or non-profit contact details. Users from other sectors who wish to have access should send an email to cdweb@mail.cryst.bbk.ac.uk.


UserID

This is your user identification. You will have been informed of it when you received confirmation that your account had been created. It is case-sensitive.


IDpassword

This is your password. You will have been informed of its value when you received confirmation that your account had been created. It is case-sensitive.


Protein Name

The Protein Name element of the form acts as an identifier and becomes the prefix of any input and output files produced from an analysis. It is restricted to 12 alphanumeric characters only.
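
As an illustration, a name could be checked before submission with a short Python sketch. It reads the 12-character restriction as an upper limit, which is an assumption, and the function name is hypothetical rather than part of DichroWeb.

```python
import re

# Assumption: the guide's "12 alphanumeric characters" is read as a maximum length.
MAX_NAME_LENGTH = 12


def is_valid_protein_name(name: str) -> bool:
    """Return True if name consists of 1 to 12 ASCII letters or digits."""
    pattern = r"[A-Za-z0-9]{1,%d}" % MAX_NAME_LENGTH
    return re.fullmatch(pattern, name) is not None
```

Under this reading, a name such as lysozyme1 would be accepted, while a name containing spaces or more than 12 characters would not.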


File Location

The file location is the path name to the CD data file on your local computer. This string is checked for errors and if the server cannot locate the file then the analysis will be terminated and an error message generated. It is advisable to use the browse button as it will specify the correct file location automatically.

The files that you upload should be in text format, for example raw text (.txt). DichroWeb will not accept non-text file formats such as .exe, .gif, .jpg, .doc or .ppt.

Also, because some users use the comma as a decimal separator, DichroWeb does not interpret comma-separated value (.csv) files.


File Format

The select options in the file format field are derived from the file formats output from different CD spectroscopy machines. Mainly, the formats differ in the size of the header and the column layout of the data.

Example file formats can be viewed below:

Format              Example file   Header lines   Data columns
Aviv 60 DS v4.1*    Aviv1.txt      25             2
Aviv CDS            AvivC.txt      14             2
Aviv v2.86          Aviv2.txt      19             2
Jasco 1.30          Jasco.txt      19             2
YY                  YY.txt         4              1 or 5 (reading across rows)
DRS                 DRS.txt        13             7
BP (2nd column)**   BP2.txt        22             4
BP (4th column)**   BP.txt         22             4

* The 60 DS format may be obtained, even in later versions of the software, by choosing the "export to 60 DS format" option in the instrument data browser window, from the "export data set" pulldown.
** It has been reported that the dichroism data can appear in either column 2 or column 4 for the BP format. Please check which format your BP file is in and select BP (data in col. 2) or BP (data in col. 4) accordingly.

If your data exists in some other format please edit it to match one of the above file formats or use the FREE format option which requires two columns, wavelength and CD data respectively. The data may begin with either high or low wavelength. If the format has been incorrectly chosen an error message will be generated stating that the file uploaded was not suitable for analysis.
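
To check a file against the FREE format before uploading, something like the following Python sketch could be used. It assumes whitespace-separated columns, a full stop as the decimal separator and no header lines; the function name is hypothetical and this is not DichroWeb's own parser.

```python
def parse_free_format(text: str) -> list[tuple[float, float]]:
    """Parse two-column data: wavelength, then CD value, one point per line.

    Assumptions: columns are separated by whitespace and there is no header.
    """
    points = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, found {len(fields)}"
            )
        # float() raises ValueError for non-numeric text such as "1,5"
        points.append((float(fields[0]), float(fields[1])))
    return points
```

The order of the points is preserved, so the data may run from high to low wavelength or the reverse.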


Input Units

Circular dichroism can be measured in several ways. Within the literature there are several conflicting measures and definitions. Most of these have been accommodated in the select box, but for clarity, the conversion equations used are detailed below:
 

Delta Epsilon Δε
The per-residue molar absorption unit of circular dichroism, measured in M⁻¹ cm⁻¹. Δε is sometimes referred to as molar circular dichroism. Data peaks are usually in the range 0 - 10.

All of the analysis programmes except K2D accept these input units, so if your data is in Δε no conversion is required.
 

Mean Residue Ellipticity MRE [θ]
Mean residue ellipticity is the most commonly reported unit and is measured in degrees cm² dmol⁻¹ residue⁻¹. Data peaks are usually in the tens of thousands, and the relationship between [θ] and Δε is shown below:

Δε =  [θ] / 3298
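
A minimal Python sketch of this conversion is shown below; the function name is illustrative only.

```python
MRE_PER_DELTA_EPSILON = 3298.0  # conversion factor from the equation above


def mre_to_delta_epsilon(mre: float) -> float:
    """Convert mean residue ellipticity [θ] to Δε."""
    return mre / MRE_PER_DELTA_EPSILON
```

For example, a mean residue ellipticity of -32980 corresponds to a Δε of -10.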

Theta Machine Units θ
To convert from machine units (millidegrees) to delta epsilon, the following equation is applied. Machine units measure the difference in molar extinction coefficients between left- and right-handed light, with values usually between 1 and 100, and need to be corrected to account for the amount of protein used in the sample.

Note: on selection of this option you will be asked to specify the mean residue weight (MRW) of the protein in atomic mass units (daltons), the path length (P) in cm and the protein concentration (CONC) in mg/ml. MRW = protein molecular weight (in atomic mass units/daltons) / number of residues.

Δε = θ * (0.1 * MRW) / ((P * CONC) * 3298)
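
The same conversion can be sketched in Python. The function and parameter names are illustrative, and the input checks are an addition of this sketch rather than documented DichroWeb behaviour.

```python
def machine_units_to_delta_epsilon(
    theta_mdeg: float, mrw: float, path_cm: float, conc_mg_ml: float
) -> float:
    """Convert machine units (millidegrees) to Δε.

    mrw is the mean residue weight in daltons, path_cm the path length (P)
    in cm and conc_mg_ml the protein concentration (CONC) in mg/ml.
    """
    if path_cm <= 0 or conc_mg_ml <= 0:
        raise ValueError("path length and concentration must be positive")
    return theta_mdeg * (0.1 * mrw) / ((path_cm * conc_mg_ml) * 3298)
```

For example, a reading of 10 millidegrees with MRW = 110, P = 0.1 cm and CONC = 0.1 mg/ml gives a Δε of roughly 3.3.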

DRS yy units
Often, CD data units are particularly large measurements and in order to achieve accurate data measures after unit conversion, it may be necessary to multiply the machine values. These units are commonly used at Daresbury with the yy file format. The data is usually in the range 0.001 - 0.01.

DRS-yy units are Theta machine units multiplied by a factor of 100. Therefore, the relationship with Delta epsilons is as follows:

Δε = (θ * 100) * (0.1 * MRW) / ((P * CONC) * 3298)

DRS units
These are standard Daresbury units (machine units that have been divided by a factor of 10,000). The relationship with delta epsilons is shown below:

Δε = (θ / 10 000) * (0.1 * MRW) / ((P * CONC) * 3298)

 

Molar Ellipticity (θ)m
Molar ellipticity is a little-used unit which has the dimensions degrees decilitres mol⁻¹ decimetre⁻¹. DichroWeb does not accept data in units of (θ)m, but such data may be converted to units of Δε by using the following formula, where Nr represents the number of amino acids in the protein:

Δε = (θ)m / (Nr * 3298)

If you have data in units of (θ)m, please convert the values to units of Δε and then submit to DichroWeb.


Initial Wavelength

The initial wavelength should correspond to the first wavelength that appears in your data file (i.e. towards the top). This could be either the numerically highest or lowest. If in doubt, open up your data file in a text editor and take a look.


Final Wavelength

The final wavelength should correspond to the last wavelength that appears in your data file (i.e. towards the bottom). This could be either the numerically highest or lowest. If in doubt, open up your data file in a text editor and take a look.


Wavelength Step

CD spectrophotometers can be set to record data at various wavelength intervals. All of the DichroWeb-supported analysis programmes accept data at 1 nm intervals only, so all other data points will be discarded. DichroWeb performs no smoothing of the data; if you believe that smoothing is required, you must perform this yourself beforehand. If the wrong wavelength step is specified, the server will detect this and return an error message stating that your file is unsuitable for analysis.
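
The effect of discarding intermediate points can be pictured with the Python sketch below. How DichroWeb actually selects the points is not described here, so the rule used (keep points a whole number of nanometres from the first wavelength) is an assumption, as are the function name and tolerance.

```python
def keep_one_nm_points(
    points: list[tuple[float, float]], tolerance: float = 1e-6
) -> list[tuple[float, float]]:
    """Keep only points lying a whole number of nm from the first wavelength.

    points is a list of (wavelength, value) pairs in file order.
    """
    if not points:
        return []
    start = points[0][0]
    kept = []
    for wavelength, value in points:
        offset = abs(wavelength - start)
        # Assumed rule: retain the point if its offset is (nearly) an integer
        if abs(offset - round(offset)) <= tolerance:
            kept.append((wavelength, value))
    return kept
```

With data recorded every 0.5 nm from 260 nm, this keeps the points at 260, 259, 258 and so on, and no averaging or smoothing is applied to the values that remain.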


Lowest Datapoint

Sometimes part of a data set may be collected under conditions which are less than optimal. In these cases, it is desirable to remove the block of unreliable data points from the dataset and avoid trying to use them in any analysis. The "lowest wavelength datapoint" box allows for this without the need to edit the input file which is being submitted to DichroWeb. Just enter the wavelength of the last data point which is of good quality and DichroWeb will ensure that any data below that value cannot be submitted in an analysis. The suspect data is always taken as being the wavelengths below the entered value as the low wavelength data is generally the problematic area of a CD spectrum.
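
In effect, the cut-off behaves like the following Python sketch, in which the entered wavelength itself is kept and everything below it is dropped. The function name is hypothetical.

```python
def truncate_below(
    points: list[tuple[float, float]], lowest_wavelength: float
) -> list[tuple[float, float]]:
    """Drop (wavelength, value) pairs whose wavelength is below the cut-off."""
    return [(w, v) for w, v in points if w >= lowest_wavelength]
```

For example, with a cut-off of 190 nm, a point at 190 nm is retained and a point at 189 nm is excluded from the analysis.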

Why would data be unreliable?
With a conventional radiation source (such as a xenon lamp), the intensity of the emitted signal drops significantly towards the lowest wavelengths in its range. The lower intensities can still be collected and used, but to compensate for the loss of signal strength the detector (typically a photomultiplier unit) has to increase its sensitivity, and consequently requires an increased high tension voltage. There is a maximum high tension voltage at which a photomultiplier unit can accurately record transmitted radiation, and when this is approached the readings become unreliable. Data collected when the high tension voltage is abnormally high should not be used in the analysis, and the "lowest wavelength datapoint" box provides a convenient method for truncating a dataset for this purpose.

If, after applying this cut-off, your data does not extend to sufficiently low wavelengths for the various databases and methods to be used in the analyses, it is suggested that you re-collect the data under different conditions, for example using shorter pathlengths, lower concentrations of buffers/additives or different buffers/additives. As a good practice guideline, the high tension voltage should not be above 550 mV at 190 nm for the sample, or above 500 mV at any wavelength for the baseline.


Analysis Programmes

CONTINLL

The original version of CONTIN implemented the ridge regression algorithm of Provencher & Glockner, 1982. The latest version incorporates the locally linearised model (Van Stokkum et al) in selecting basis set proteins from the reference database.
 

Average run time: < 1 minute
Graphical output produced
Choice between 7 reference datasets


SELCON3

SELCON was designed by Sreerama & Woody, 1993, and combines the self-consistent method with the SVD algorithm to assign protein secondary structure. The programme produces results at a number of stages in the analysis. The first stage makes an initial guess at the fractional composition; this result corresponds to the Hennessey & Johnson method using SVD. In the second stage, the SVD calculations are iterated until a convergent solution is produced (equivalent to the original self-consistent method). The third stage selects a number of likely solutions from the basis set calculations by constraining the summed fractional contents to equal one and each individual fraction to be greater than -0.05. The fourth stage applies a further constraint, the helix limit theorem, from which a range for helix content is determined and the results screened. The range is taken from the solution obtained using the Hennessey & Johnson method.
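
The two third-stage constraints can be illustrated with a short Python sketch. This is not SELCON3 code: the function name is hypothetical, and the tolerance used to decide whether the fractions "equal one" is an assumption.

```python
def passes_sum_and_fraction_rules(
    fractions: list[float],
    sum_tolerance: float = 0.05,  # assumed tolerance around a total of 1
    min_fraction: float = -0.05,  # lower bound quoted in the text above
) -> bool:
    """Check a candidate solution against the sum and fraction constraints."""
    total_ok = abs(sum(fractions) - 1.0) <= sum_tolerance
    fractions_ok = all(f > min_fraction for f in fractions)
    return total_ok and fractions_ok
```

A candidate with fractions 0.4, 0.3, 0.2 and 0.1 would be retained, whereas one containing a fraction of -0.1 would be rejected even if its total were 1.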
 

Average run time: <1 min
Graphical output is produced
Choice between 7 reference datasets


CDSSTR

This programme is a modification of the original VARSLC written by WC Johnson. It implements the variable selection method by performing all possible calculations using a fixed number of proteins from the reference set. The algorithm identifies reference proteins whose characteristics do not reflect those of the test protein and removes them from the basis set. The SVD algorithm assigns secondary structure.

This method probably produces the most accurate analysis results, but can take up to 15 minutes to run due to the sheer volume of calculations. It will, however, produce results where other methods fail to analyse proteins.
 
 
 

Average run time ~5min
Graphical output
7 Reference datasets


VARSLC

The original implementation of the variable selection method. The programme is flexible in that the user may configure input data files to specify the number of proteins to be selected from the reference set, the number of proteins to be eliminated at a time from the reference set, and the total number of calculations tried before selecting solutions. The constraints applied can also be configured; for example, results are selected if their RMSD, sum of squares error, individual fractional content and summed total content are within certain limits.

To incorporate some of this flexibility into the website, several configuration files have been set up. The first follows the guidelines set out in the readme.txt that comes with the programme, which recommends approximately 500 iterations with a basis set of 5-7 proteins, removing 1-2 per iteration.

Details of the settings files:
 
 

Choice       RMSD max   Individual fraction min   Total sum of fractions   No. proteins removed   No. basis proteins   No. calculations
Default      0.55       -0.15                     0.95 - 1.14              1                      6                    300
Settings 1   0.55       -0.15                     0.95 - 1.30              1                      20                   528
Settings 2   0.55       -0.20                     0.95 - 1.40              1                      30                   700
Settings 3   2.55       -0.20                     0.95 - 1.20              1                      6                    900

The second settings file reflects the recommended values for accurate protein analysis. The default settings exist for quicker analyses in which only 300 calculations are performed. When this programme was tested with a variety of CD data files, it was found that in the majority of cases results are overlooked because the total fraction of secondary structures is significantly greater than 1. Settings file 2 therefore exists for cases where the default and settings 1 have not produced valid results, so that the user can look at the kind of values resulting from the analysis as a rough guide. Settings 3 is an extension of settings 1 with 900 calculations and a high maximum RMSD value. If no results are obtained with any of the settings files, CDSSTR uses the same method but places no restriction on the number of calculations.

There is only one reference database that comes with the programme, containing 33 reference proteins. This programme does not produce reconstructed spectral data, so no graphical output exists.
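
To show how the limits in the settings table act on a candidate solution, here is a Python sketch. It is not VARSLC code: the class and function names are hypothetical, and treating the limits as inclusive is an assumption. The numerical limits are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SelectionLimits:
    rmsd_max: float
    fraction_min: float
    total_min: float
    total_max: float


# Values taken from the settings table in this section
LIMITS = {
    "Default": SelectionLimits(0.55, -0.15, 0.95, 1.14),
    "Settings 1": SelectionLimits(0.55, -0.15, 0.95, 1.30),
    "Settings 2": SelectionLimits(0.55, -0.20, 0.95, 1.40),
    "Settings 3": SelectionLimits(2.55, -0.20, 0.95, 1.20),
}


def is_selected(rmsd: float, fractions: list[float], limits: SelectionLimits) -> bool:
    """Return True if a solution falls within the given limits (inclusive)."""
    total = sum(fractions)
    return (
        rmsd <= limits.rmsd_max
        and min(fractions) >= limits.fraction_min
        and limits.total_min <= total <= limits.total_max
    )
```

A solution with an RMSD of 0.3 and fractions summing to 1.25 would fall outside the Default limits but inside those of Settings 1 and Settings 2, which is the situation the wider total ranges are intended to cover.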


K2D

K2D is one of a few neural network programmes. The neural network operates via an input layer connected by neurons to the output layer. The output layer (secondary structure) is calculated as a function of the input layer (CD data) by assigning weightings to each neuron. The weightings are initially assigned random values. In a training phase, the network is fed large volumes of CD and structural data (equivalent to reference proteins) and the weightings are adjusted in an iterative process until an accurate secondary structure profile is obtained.

In K2D the weights file is fixed and therefore there is no choice of reference dataset. Accuracy is calculated by , and results for beta sheet and mixed proteins tend to be far less accurate than for helical proteins, although when compared with other methods (Greenfield 1996) these results are an improvement.
 


Reference Set

All of the programmes except K2D rely upon reference datasets of proteins, from which a set of basis spectra is selected for the analysis. CONTINLL, SELCON3 and CDSSTR offer a choice of reference database, which should be chosen in accordance with the range of the input data. Note also that the choice of reference dataset affects the analysis results, particularly if there is mixed or high beta sheet content. The reference set that best represents the characteristics of the protein of interest is likely to give the most accurate result.

A full breakdown of the contents of the reference sets can be found here.


Optional Scaling Factor

The scaling factor allows the user to modify the experimental data by small amounts to compensate for errors in the intensity of the spectra, which may improve the fit. Some spectrometers may have incorrect intensity calibration, and where this is known a scaling factor may be applied to compensate for such errors.

The scaling factor is applied to all data points and has a default value of 1.0, meaning no scaling. It would be highly unusual to require a large scaling factor; typical values are in the range 0.95 - 1.05. Scaling factors outside the range 0.5 - 1.5 are considered unfeasible and will be ignored by DichroWeb.
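
The behaviour described above can be sketched in Python as follows. The function name is hypothetical, and interpreting "ignored" as leaving the data unscaled is an assumption.

```python
def apply_scaling(values: list[float], factor: float = 1.0) -> list[float]:
    """Multiply every data point by the scaling factor.

    Assumption: a factor outside 0.5 - 1.5 is ignored, i.e. treated as 1.0.
    """
    if not 0.5 <= factor <= 1.5:
        factor = 1.0
    return [value * factor for value in values]
```

With this reading, a factor of 2.0 leaves the data unchanged, while a factor of 0.5 halves every value.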

WARNING

Scaling factors should only be applied to data where there is a known reason for doing so. It is possible to improve the NRMSD of an analysis by tweaking the scaling factor randomly, but this does not necessarily mean that the structure assignment is improved. Scaling factors should be used with caution.


Output Units

Output units are selectable, irrespective of the input units. The options for output units are the same as those for the input units.
