Deconvolution Algorithms
Singular value deconvolution
A single matrix C is constructed with dimensions a x b, where a is the number of reference proteins and b is the number of CD data points. This matrix is multiplied by its transpose C^T and the product is diagonalized to produce a matrix of a eigenvectors, U, and a diagonal matrix of a eigenvalues, E.
$$(C\,C^{T})\,U = U\,E$$
The matrix of the orthogonal basis CD spectra B is then calculated as follows:
$$B = U^{T} C$$
The relative importance of each basis CD spectrum with regard to the test protein spectrum is represented by the square root of the eigenvalue associated with the corresponding row of the orthogonal basis spectra matrix B.
The number of basis spectra considered necessary to accurately reconstruct the experimental protein CD spectrum is defined by μ.
$$\sigma_{\mu}^{2} = \frac{1}{n\,(m-\mu)} \sum_{i=\mu+1}^{m} e_i$$
n = number of data points in each original vector
m = the number of original vectors
σ = the standard deviation
e_i = the eigenvalues arranged in decreasing magnitude
μ values of 5 or 6 are generally found to reconstruct spectra with good accuracy.
The CD data matrix (C) and orthogonal basis matrix B can then both be truncated (Ctr and Btr) to include only information important in reconstructing the experimental data.
$$C_{tr} = U_{tr}\,B_{tr}$$
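The decomposition and truncation steps can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used by any particular package: the function name `svd_basis`, the argument `n_keep` and the random stand-in data are all hypothetical.

```python
import numpy as np

def svd_basis(C, n_keep):
    """Return (U_tr, e_tr, B_tr) for a reference matrix C (proteins x data points)."""
    # Diagonalise C C^T; eigh returns eigenvalues in ascending order.
    e, U = np.linalg.eigh(C @ C.T)
    order = np.argsort(e)[::-1]           # largest eigenvalue first
    e, U = e[order], U[:, order]
    B = U.T @ C                           # orthogonal basis CD spectra (one per row)
    return U[:, :n_keep], e[:n_keep], B[:n_keep]

# Hypothetical stand-in data: 8 reference proteins, 41 data points.
rng = np.random.default_rng(0)
C = rng.normal(size=(8, 41))

U_tr, e_tr, B_tr = svd_basis(C, n_keep=5)
C_tr = U_tr @ B_tr                        # truncated reconstruction of C
```

In this sketch the length of each row of `B_tr` equals the square root of the matching entry of `e_tr`, which is the importance measure described above.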
Each CD spectrum is then described as a function of Btr, the truncated B matrix, and each reference protein contributing to Btr has known secondary structure from crystallographic or NMR data. The structural data accompanying the truncated basis set are then used to derive structural fractions for the experimental protein. Because the relationship between conformation and CD data is direct, the structural data can be substituted into the original equations and the same relationships apply. Each basis spectrum Bi, where i indexes the basis vectors and coefficients, then represents a mix of secondary structures corresponding to D'i.
The vector D' is derived (similarly to B) from the relationship:
$$D' = U^{T} S'$$
where S' is the c x d matrix of secondary structures from the reference set of proteins, c is the number of proteins in the set, and d is the number of secondary structures.
The structural data are then truncated to match the truncated CD matrix derived earlier:
$$S'_{tr} = U_{tr}\,D'_{tr}$$
Finally, the conformational fractions of the experimental protein are found by solving the equation below (given that there are 5 secondary structures):
$$S'_i = m_{i1} D'_1 + m_{i2} D'_2 + m_{i3} D'_3 + m_{i4} D'_4 + m_{i5} D'_5$$
The SVD algorithm does not constrain the sum of secondary structure fractions to equal one within the algorithm, a constraint adopted by most other methods.
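The full chain from reference data to structure fractions can be sketched as below. This is an illustration only: the function name `svd_fractions` and the random data are hypothetical, and the sketch assumes that the coefficients m are the projections of the test spectrum onto the basis spectra divided by the corresponding eigenvalues.

```python
import numpy as np

def svd_fractions(C, S, y, n_keep):
    """Estimate structure fractions of spectrum y from references C (a x b) and S (a x d)."""
    e, U = np.linalg.eigh(C @ C.T)
    order = np.argsort(e)[::-1][:n_keep]  # indices of the n_keep largest eigenvalues
    e_tr, U_tr = e[order], U[:, order]
    B_tr = U_tr.T @ C                     # truncated basis spectra
    D_tr = U_tr.T @ S                     # structure vector paired with each basis spectrum
    m = (B_tr @ y) / e_tr                 # assumed form of the coefficients of y
    return m @ D_tr                       # fractions, not forced to sum to one

# Hypothetical stand-in data: 6 reference proteins, 30 data points, 3 structure classes.
rng = np.random.default_rng(0)
C = rng.normal(size=(6, 30))
S = rng.dirichlet(np.ones(3), size=6)

# With every basis spectrum kept, a reference spectrum returns its own fractions.
f = svd_fractions(C, S, C[2], n_keep=6)
```

Reducing `n_keep` gives an approximate answer, and nothing in the calculation forces the returned fractions to sum to one.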
Ref: Hennessey & Johnson 1981
Neural Network Analysis
In this case the CD wavelengths form the input layer neurons and the secondary structure information forms the output neurons. The patterns represent the relationship between CD spectra and secondary structure for the proteins used in the training phase. There are currently two neural network applications available: CDNN (Bohm et al 1992), a single hidden layer approach that uses a linear transfer function to pass information between layers, and K2D (Andrade et al 1996), which uses a Kohonen neural network with a two dimensional output layer. Both algorithms treat the CD spectrum as a superposition of secondary structure information, cysteine absorption, aromatic side chain absorption, random noise from the CD measurement, and non-random errors in protein concentration measurement.
In CDNN, error propagation during the training phase changes each weight at step t according to the following equation:
$$\Delta W_{ij}(t) = \sigma\,\delta_i\,O_j + \mu\,\Delta W_{ij}(t-1)$$
W_ij = weight between two connected elements i and j
δ_i = error at processing element i
O_j = output of element j
σ = learning constant
μ = momentum
Weightings are adjusted using the following calculation:
$$W_{ij}(t+1) = W_{ij}(t) + \Delta W_{ij}(t)$$
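A weight update with momentum of this form can be sketched for a single layer as follows. The function name `momentum_step`, the learning constant of 0.1, the momentum of 0.9 and the input values are hypothetical choices made for the example and are not taken from CDNN.

```python
import numpy as np

def momentum_step(W, delta, outputs, prev_dW, learning_rate=0.1, momentum=0.9):
    """One weight update with momentum for a single fully connected layer."""
    # dW[i, j] = learning_rate * delta[i] * outputs[j] + momentum * prev_dW[i, j]
    dW = learning_rate * np.outer(delta, outputs) + momentum * prev_dW
    return W + dW, dW

W = np.zeros((2, 3))                      # weights between 3 sending and 2 receiving elements
prev_dW = np.zeros_like(W)                # no previous change before the first step
delta = np.array([0.5, -1.0])             # error at each receiving element
outputs = np.array([1.0, 0.0, 2.0])       # output of each sending element

W, prev_dW = momentum_step(W, delta, outputs, prev_dW)   # first step
W, prev_dW = momentum_step(W, delta, outputs, prev_dW)   # second step reuses the first change
```

Because the same error and outputs are presented twice, the second change is larger than the first, which shows how the momentum term carries the previous step forward.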
Ref: Bohm et al 1992
Least squares analysis
Two least squares algorithms are available, one constrained and one unconstrained. Both minimize the total deviation between the experimental and calculated model protein spectra. They differ in that the constrained version requires the conformation fractions to total 1 and each individual fraction to be greater than 0.
The nature of circular dichroism analysis is such that the accuracy of analysis is sensitive to protein concentration. The unconstrained least squares method accounts better for these errors, and normalization can be applied after the analysis. In the constrained fit mechanism, the normalization is carried out before results are obtained, thus subjecting the method to over- or underestimation due to protein concentration error.
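$$\mathrm{rsq} = \sum_{k=1}^{n}\Big[\theta_{\mathrm{exp}}(\lambda_k) - \big(f_h\,\theta_h(\lambda_k) + f_s\,\theta_s(\lambda_k) + \dots\big)\Big]^{2}$$
where rsq = residual squares error, n = number of data points, θ_exp = experimental spectrum, f = fractional weighting, θ_h = helix contribution, θ_s = sheet contribution, and so forth for all secondary structure classes considered.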
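The effect of normalizing after an unconstrained fit can be sketched with NumPy. The function name `unconstrained_fit`, the basis spectra values and the scale factor of 0.9 are hypothetical and serve only to imitate a concentration error.

```python
import numpy as np

def unconstrained_fit(basis, spectrum):
    """Least squares fractions for basis (data points x classes): raw and normalized."""
    raw, *_ = np.linalg.lstsq(basis, spectrum, rcond=None)
    return raw, raw / raw.sum()

# Hypothetical basis spectra for helix, sheet and coil at 5 data points.
basis = np.array([
    [ 40.0,  10.0, -20.0],
    [-30.0,  -5.0,   5.0],
    [-35.0, -15.0,   2.0],
    [-20.0,  -8.0,  -1.0],
    [ -5.0,  -2.0,   0.5],
])
true_f = np.array([0.5, 0.3, 0.2])
# A concentration error is imitated by scaling the whole spectrum by 0.9.
spectrum = 0.9 * (basis @ true_f)

raw, normalized = unconstrained_fit(basis, spectrum)
```

Here the raw fractions sum to 0.9 rather than 1, and dividing by that sum recovers the original fractions. A constrained fit would need a dedicated solver and is not shown.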
Ref: Mao et al 1984, Yang et al 1978
Ridge regression analysis
The equation below is in keeping with the concept of circular dichroism analysis, in which a CD spectrum is expressed as a weighted sum of model or reference spectra.
$$y(\lambda) = \sum_{i=1} f_i\, r_i(\lambda)$$
f_i - fraction of residues in secondary structure class i
r_i(λ) - CD spectrum of secondary structure class i
y(λ) - experimental CD spectrum
In order to deconvolute the experimental protein CD spectrum, the reference protein data are considered. The experimental CD spectrum is now expressed as a weighted sum of the reference protein spectra R_j(λ).
$$y(\lambda) = \sum_{j=1}^{N_g} g_j\, R_j(\lambda)$$
By finding the g values for a set of reference protein CD spectra R_j(λ), the experimental CD spectrum y(λ) can be reconstructed. Each fraction of secondary structure f_i is found by multiplying each weight coefficient g_j by F_ij, the fraction of residues of protein j in conformational class i, and summing over the reference proteins.
$$f_i = \sum_{j=1}^{N_g} g_j\, F_{ij}$$
The g values are then determined from the N_y measured CD data values, y_obsd(λ_k), by minimizing:
$$\sum_{k=1}^{N_y}\Big[y_{\mathrm{obsd}}(\lambda_k) - \sum_{j=1}^{N_g} g_j\, R_j(\lambda_k)\Big]^{2} + \alpha \sum_{j=1}^{N_g}\Big(g_j - \frac{1}{N_g}\Big)^{2}$$
Without the second term in the above expression, the procedure would be identical to least squares deconvolution. When α = 0, the solution is equivalent to the least squares solution, but when α > 0, the second term (the regularizer) stabilizes the solution, keeping each g_j small and therefore optimising the contributions from different proteins of the reference set in the final solution. For example, if the conformation, and therefore the spectral characteristics, of a particular reference protein contributes either negatively or not at all to the experimental spectrum, its g coefficient in an unregularized fit can become zero or negative, which can destabilize the solutions obtained. This confers advantages over the linear least squares method, which assumes that each of the reference protein spectra exhibits structural and spectral characteristics in common with the experimental protein.
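A regularized fit of this kind can be sketched with NumPy. This is a minimal illustration, not the published program: the function name `ridge_weights` and the random data are hypothetical, and the sketch assumes a regularizer that penalizes the squared distance of each weight from 1/N, where N is the number of reference spectra.

```python
import numpy as np

def ridge_weights(R, y_obs, alpha):
    """Minimize |y_obs - R g|^2 + alpha * sum((g_j - 1/N)^2) over the weights g."""
    n_ref = R.shape[1]
    # Setting the gradient to zero gives (R^T R + alpha I) g = R^T y + alpha / N.
    lhs = R.T @ R + alpha * np.eye(n_ref)
    rhs = R.T @ y_obs + alpha / n_ref
    return np.linalg.solve(lhs, rhs)

rng = np.random.default_rng(1)
R = rng.normal(size=(30, 6))              # columns: 6 hypothetical reference spectra
F = rng.dirichlet(np.ones(3), size=6)     # F[j, i]: fraction of protein j in class i
y_obs = R @ np.array([0.6, 0.4, 0.0, 0.0, 0.0, 0.0])

g = ridge_weights(R, y_obs, alpha=0.5)
f = g @ F                                 # f_i = sum over j of g_j * F[j, i]
```

With `alpha=0` the function returns the ordinary least squares weights, and as `alpha` grows the weights are drawn towards equal contributions from every reference protein.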
Ref: Provencher & Glockner 1987
Principal component analysis
Principal component factor analysis was applied to circular dichroism analysis in order to identify a number of subspectra that reflect the variations in fractional secondary structure within the proteins in the reference database. The process expresses the reference set as a small number of characteristic parameters, by treating the CD of each reference protein at a particular wavelength as vectors within a functional space. The subspectra are similar to the orthogonal basis set in the SVD algorithm.
$$\theta_i(v) = \sum_{j=1}^{p} a_{ij}\, S_j(v)$$
θ_i(v) is the CD spectrum of protein i in the reference set, represented as a linear combination of p orthogonal functions, the subspectra S_j(v), where p << n (p is the number of proteins selected from the reference set to derive the subspectra set). a_ij is the fraction of each subspectrum S_j in that protein spectrum.
The vectors are then normalised and the correlation coefficient between the test protein CD spectrum and each reference protein spectrum is evaluated. This produces an R matrix. The R matrix is diagonalised, which defines the dimensionality of the dataset, p (p << n), as the number of significant eigenvalues.
The subspectra matrix is then derived by using the projection matrix P.
A complete back transformation is then performed relating the weighting coefficients to secondary structure in a similar manner to the SVD algorithm.
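The normalisation, correlation and diagonalisation steps can be sketched with NumPy. This is an illustration under assumptions rather than the published procedure: the function name `pca_subspectra`, the synthetic spectra and the rule that an eigenvalue is significant when it exceeds 1% of the eigenvalue total (`min_share`) are all hypothetical.

```python
import numpy as np

def pca_subspectra(spectra, min_share=0.01):
    """Return (coefficients, subspectra, p) for spectra given as proteins x data points."""
    # Centre each spectrum and scale it to unit length.
    X = spectra - spectra.mean(axis=1, keepdims=True)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    R = X @ X.T                                   # correlation matrix between proteins
    e, P = np.linalg.eigh(R)
    order = np.argsort(e)[::-1]                   # largest eigenvalue first
    e, P = e[order], P[:, order]
    p = int(np.sum(e > min_share * e.sum()))      # eigenvalues counted as significant
    coefficients = P[:, :p]                       # weight of each subspectrum per protein
    subspectra = coefficients.T @ X               # mutually orthogonal subspectra
    return coefficients, subspectra, p

# Six synthetic spectra built from two underlying curves.
v = np.linspace(0.0, 2.0 * np.pi, 30, endpoint=False)
curves = np.vstack([np.sin(v), np.sin(2.0 * v)])
mix = np.array([[1, 0], [0, 1], [1, 1], [1, -1], [2, 1], [1, 2]], dtype=float)
spectra = mix @ curves

coefficients, subspectra, p = pca_subspectra(spectra)
```

Because the synthetic set is built from two curves, only two eigenvalues are significant and the six spectra are fully described by two subspectra.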
Another implementation of this method has been described by Pancoska et al, using clustering techniques instead of a back transformation in order to assign the experimental protein secondary structure. The reference proteins are clustered according to their calculated PC/FA spectra coefficients, and similarly the reference data are also clustered according to their known structure fractions. A combination of multilinear and linear regression analysis enables the fractional content of an unknown protein to be evaluated by systematically searching for the combinations of coefficients that produce the best regression scores. In this way the algorithm assumes that not all spectral and structural properties of proteins are correlated. Using multilinear regression analysis allows for the fact that a particular subspectrum may include information regarding more than one structural class, or that more than one subspectrum may contain information concerning a particular conformation.
Ref: Pancoska et al 1996
Convex constraint analysis
The following simplified equation is used to generate the CD spectra of hypothetical proteins assuming 3 classes of structure (P = 3):
$$f^{h}_{j}(\lambda) = C_{\alpha\text{-helix}}\, g_{\alpha\text{-helix}}(\lambda) + C_{\beta\text{-sheet}}\, g_{\beta\text{-sheet}}(\lambda) + C_{\beta\text{-turn}}\, g_{\beta\text{-turn}}(\lambda)$$
C_i is the weight of the ith component curve, g_i(λ). The data sets are represented in P dimensional (Euclidean) space by pinpointing coefficients derived from their conformational weights.
The hypothetical f^h_j(λ) curves are then fitted by a calculated curve f^c_j(λ), where i indexes the component curves and j indexes the spectra in the data set. The f^c_j(λ) curve is of the form:
$$f^{c}_{j}(\lambda) = \sum_{i=1}^{P} C_{ij}\, g_i(\lambda)$$
Here g_i(λ) is the ith component curve. In effect, the algorithm at this point searches systematically for the most likely candidates to act as component spectra. A protein is defined as a pure spectrum prototype if its CD curve is identical to g_i(λ).
The algorithm does not distinguish between the test protein and the proteins in the reference set. It constrains the weight coefficients C(i, j) of each spectrum to sum to one over the secondary structure components, and each individual weighting coefficient must be non-negative.
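The way spectra are built from component curves under these constraints can be sketched as follows. The sketch assumes that the weights are non-negative fractions summing to one for each spectrum. The function name `mix_components` and all numerical values are hypothetical, and the search for the component curves themselves is not shown.

```python
import numpy as np

def mix_components(weights, components):
    """Build spectra as convex combinations: f_j = sum over i of C[i, j] * g_i."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be non-negative")
    if not np.allclose(weights.sum(axis=0), 1.0):
        raise ValueError("the weights of each spectrum must sum to one")
    return weights.T @ components

# Hypothetical component curves at 4 data points (rows: helix, sheet, turn).
components = np.array([
    [ 60.0, -30.0, -35.0, -10.0],
    [ 20.0, -10.0, -18.0,  -5.0],
    [-15.0,   5.0,   2.0,   1.0],
])
# C[i, j]: weight of component i in spectrum j; each column sums to one.
C = np.array([
    [0.5, 1.0],
    [0.3, 0.0],
    [0.2, 0.0],
])
spectra = mix_components(C, components)
```

The second spectrum has a weight of one on the first component, so it coincides with that component curve and plays the role of a pure spectrum prototype.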
Once the pinpointing coefficients of the experimental spectra have been defined, the structure fractions are known and a solution is obtained. The best solution is selected by applying a volume minimization algorithm to the Euclidean space. The smaller this volume, the better the approximation of results and therefore the assignment of pinpointing coefficients.
The volume minimization algorithm allows a finite number of component curves to be extracted from the reference spectra. It is effectively a measure of maximum likelihood, ensuring sensible values for the g_i(λ) functions are obtained in the following equation,
$$f(\lambda) = \sum_{i} w_i\, g_i(\lambda) + \text{noise}$$
where w_i is the weighting of the secondary structure component in the protein.
The method is proposed to be well suited to monitoring the conformation of proteins and peptides in solution as a function of temperature, pH or ligand binding, as it can be used effectively for deconvoluting large datasets.
Ref: Perczel et al 1992
The matrix descriptor method estimates quantitatively the amounts of each secondary structure present in the test protein, with an accuracy of prediction comparable to that of conventionally predicted average fractional secondary structures (Pancoska et al 1999). The method uses matrix descriptors to represent the proteins of the basis set, indicating the number of segments of each secondary structure within the protein and the number of residues contributing to joins between secondary conformations.
Analysis begins by clustering the reference set proteins in terms of structure content and spectral similarity. The spectra are then processed by neural networks that are optimized to predict protein structure for the particular structural class to which the spectrum of the protein of interest belongs.
The matrix descriptor results from the neural network are then normalised according to the structural class data. The most successful analyses required neural networks with 100 input neurons, 27 hidden layers and 8 output neurons to assign the descriptor values.
A hyperbolic transfer function and a normalised cumulative delta rule algorithm were used to update the weightings of the layers during the training phase.
Example matrix descriptor: Protein A
[matrix not shown]
H = number of helical segments
E = number of sheet segments
C = number of random coil segments
he = helix to sheet connections
ec = sheet to coil connections
ch = coil to helix connections
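Counting the segments and connections for a descriptor of this kind can be sketched in Python. The layout is an assumption made for illustration: a symmetric 3 x 3 matrix with segment counts on the diagonal and connection counts off the diagonal, with each connection counted regardless of direction. The function name `matrix_descriptor` and the input string are hypothetical.

```python
from itertools import groupby

def matrix_descriptor(states):
    """Count segments and segment-to-segment connections in a string of H/E/C states."""
    segments = [state for state, _ in groupby(states)]   # one entry per segment
    counts = {"H": 0, "E": 0, "C": 0}
    for state in segments:
        counts[state] += 1
    links = {"he": 0, "ec": 0, "ch": 0}
    for first, second in zip(segments, segments[1:]):
        pair = {first, second}                           # direction is ignored
        if pair == {"H", "E"}:
            links["he"] += 1
        elif pair == {"E", "C"}:
            links["ec"] += 1
        elif pair == {"C", "H"}:
            links["ch"] += 1
    # Assumed layout: rows and columns ordered H, E, C.
    return [
        [counts["H"], links["he"], links["ch"]],
        [links["he"], counts["E"], links["ec"]],
        [links["ch"], links["ec"], counts["C"]],
    ]

# Hypothetical per-residue assignment: helix, coil, sheet, coil, helix.
descriptor = matrix_descriptor("HHHHCCEEECCHHH")
```

For this input there are two helical segments, one sheet segment and two coil segments, joined by two coil to helix and two sheet to coil connections.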
The theory behind this type of analysis is that two proteins with identical fractional components of helix, sheet and random coil can yield different spectral effects as a function of the average strand or helix length. In addition, the phi/psi torsion angles in the peptide backbone are often distorted at the ends of secondary structural elements, particularly where secondary structures are adjacent. This can lead to changes in the ordered arrays of chromophores in secondary structural positions, and therefore to spectral effects. Individual effects are thought to be small, yet 20% variability between the spectra of two helical peptides of identical fractional content has been observed, where the peptides differed threefold in helix length.
Ref: Pancoska et al 1999