 User guide

# CD Deconvolution Algorithms

## Singular value deconvolution

Singular value decomposition (Hennessey & Johnson 1981) is mathematically similar to principal component factor analysis and is a multi-component analysis using eigenvector equations (Greenfield 1996). Orthogonal basis curves are extracted from the reference protein set. Each basis curve has a distinct shape, with its own maxima and minima, and is related to a known mix of secondary structures. The sum of secondary structure fractions is not constrained to equal one in this method.

A single matrix C is constructed with dimensions a × b, where a is the number of reference proteins and b is the number of CD data points. This matrix is multiplied by its transpose $C^{T}$ and the product is diagonalized to produce a matrix U of a eigenvectors and a diagonal matrix E of a eigenvalues:

$$(C\,C^{T})\,U = U\,E$$

The matrix of the orthogonal basis CD spectra B is then calculated as follows:

$$B = U^{T} C$$

The relative importance of each basis CD spectrum with regard to the test protein spectrum is given by the square root of the eigenvalue associated with the corresponding row of the orthogonal basis spectra matrix B.

The number of basis spectra needed to reconstruct the experimental protein CD spectrum accurately is denoted m. The standard deviation $\sigma_m$ that remains when only the first m basis spectra are retained is:

$$\sigma_m^{2} = \frac{1}{n\,(M - m)} \sum_{i=m+1}^{M} e_i$$

where:

- n = the number of data points in each original vector
- $\sigma_m$ = the standard deviation
- M = the number of original vectors
- $e_i$ = the eigenvalues arranged in decreasing magnitude

Values of m of 5 or 6 are generally found to reconstruct spectra with good accuracy.
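The selection of m can be sketched in a few lines of Python. This is a minimal illustration rather than the published procedure: the eigenvalues are made-up numbers, and the stopping rule (take the smallest m whose residual standard deviation falls to an assumed noise level `noise_sigma`) is an assumption that the text above does not state. The function names are hypothetical.

```python
import math

def residual_sigma(eigenvalues, n_points, m):
    """Standard deviation left after keeping the first m basis spectra."""
    M = len(eigenvalues)  # number of original vectors
    if not 0 < m < M:
        raise ValueError("m must satisfy 0 < m < M")
    discarded = sum(eigenvalues[m:])  # e_(m+1) ... e_M
    return math.sqrt(discarded / (n_points * (M - m)))

def choose_m(eigenvalues, n_points, noise_sigma):
    """Smallest m whose residual falls to the assumed noise level."""
    ordered = sorted(eigenvalues, reverse=True)  # decreasing magnitude
    for m in range(1, len(ordered)):
        if residual_sigma(ordered, n_points, m) <= noise_sigma:
            return m
    return len(ordered)

# Hypothetical eigenvalues for six reference spectra of 50 data points each.
eigs = [5000.0, 800.0, 120.0, 2.0, 1.0, 0.5]
m = choose_m(eigs, n_points=50, noise_sigma=0.2)
print(m)
```

With these made-up eigenvalues the residual drops below the assumed noise level once three basis spectra are kept. Real reference sets typically need the 5 or 6 quoted above.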

The CD data matrix C and the orthogonal basis matrix B can then both be truncated (to $C_{tr}$ and $B_{tr}$) to include only the information important in reconstructing the experimental data:

$$C_{tr} = U_{tr} B_{tr}$$

Each CD spectrum is then described as a function of $B_{tr}$, the truncated B matrix. Every reference protein contributing to $B_{tr}$ has a known secondary structure from crystallographic or NMR data, and this structural data is used to derive structural fractions for the experimental protein. Because the relationship between conformation and CD data is direct, the structural data can be substituted into the original equations and the same relationships apply. Each basis spectrum $B_i$, where i indexes the basis vectors and their coefficients, represents a mix of secondary structures given by $D'_i$.

The matrix D' is derived (similarly to B) from the relationship:

$$D' = U^{T} S'$$

where S' is the c × d matrix of secondary structures for the reference set of proteins, c is the number of proteins in the set, and d is the number of secondary structures.

The structural data are truncated in the same way as the CD matrix:

$$S'_{tr} = U_{tr} D'_{tr}$$

Finally, the conformational fractions of the experimental protein are found by solving the equation below (given that there are 5 secondary structures):

$$S'_i = m_{i1} D'_1 + m_{i2} D'_2 + m_{i3} D'_3 + m_{i4} D'_4 + m_{i5} D'_5$$

Unlike most other methods, the SVD algorithm does not constrain the secondary structure fractions to sum to one.
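The whole procedure can be followed on a toy reference set of two proteins, for which the 2 × 2 matrix $C\,C^{T}$ can be diagonalized in closed form. The sketch below is an illustration under several assumptions: the spectra and structure fractions are invented, no truncation is applied, and the coefficients of the test spectrum are obtained by projecting it onto the orthogonal basis spectra (a step the text above does not spell out). A real reference set would need a numerical eigensolver.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def svd_fractions(C, S, y):
    """Estimate structure fractions of spectrum y from two reference proteins.

    C: 2 x b reference CD spectra, S: 2 x d known structure fractions.
    """
    # C Ct is a symmetric 2 x 2 matrix, so one Jacobi rotation diagonalizes it.
    p, q, r = dot(C[0], C[0]), dot(C[0], C[1]), dot(C[1], C[1])
    theta = 0.5 * math.atan2(2.0 * q, p - r)
    c, s = math.cos(theta), math.sin(theta)
    U = [[c, -s], [s, c]]  # columns of U are the eigenvectors

    def ut_times(M):
        # Row i of Ut M combines the two rows of M with eigenvector i.
        return [[U[0][i] * a + U[1][i] * b for a, b in zip(M[0], M[1])]
                for i in range(2)]

    B = ut_times(C)  # orthogonal basis spectra, B = Ut C
    D = ut_times(S)  # structure mix of each basis spectrum, D' = Ut S'
    E = [dot(row, row) for row in B]  # eigenvalues of C Ct

    # Assumed projection step: weight of basis spectrum i in y.
    coeffs = [dot(y, B[i]) / E[i] for i in range(2)]
    return [sum(coeffs[i] * D[i][k] for i in range(2)) for k in range(len(S[0]))]

# Hypothetical data: two proteins, three wavelengths, three structure classes.
C = [[-30.0, -32.0, 60.0], [-15.0, -5.0, 20.0]]
S = [[0.8, 0.1, 0.1], [0.1, 0.6, 0.3]]
y = [0.5 * a + 0.5 * b for a, b in zip(C[0], C[1])]  # equal mix of both
fractions = svd_fractions(C, S, y)
print([round(f, 3) for f in fractions])
```

Because the test spectrum is an equal mix of the two reference spectra, the recovered fractions are the average of the two rows of `S`. Nothing in the calculation forces the fractions to sum to one.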

Ref: Hennessey & Johnson 1981

## Neural Network Analysis

Neural network methods are becoming increasingly common in applications developed for protein secondary structure analysis. The first neural network method (Bohm et al 1992) demonstrated assignment of secondary structure via the non-algorithmic deconvolution of protein CD spectra. The network is organised into input, hidden, and output layers of nodes (neurons) linked by weighted connections. The weights are determined in a training phase, usually with a back propagation algorithm, in which the network is presented with a large number of input patterns whose correct outputs are known. Random weights are first assigned to the connections, and for each input pattern the output pattern is calculated as a function of the input. The difference between the calculated and the known output is the residual error, which is used to adjust the weights. This is repeated until the residual error is minimised and accurate results are obtained.

In this case the CD values at the measured wavelengths form the input layer neurons and the secondary structure information forms the output neurons. The patterns represent the relationship between CD spectra and secondary structure for the proteins used in the training phase. There are currently two neural network applications available: CDNN (Bohm et al 1992), a single hidden layer approach that uses a linear transfer function to transfer information between layers, and K2D (Andrade et al 1996), which uses a Kohonen neural network with a two-dimensional output layer. Both algorithms consider the CD spectrum as a superposition of secondary structure information, cysteine absorption, aromatic side chain absorption, random noise from CD measurement, and non-random errors in protein concentration measurement.

In CDNN the weight change produced by error propagation at training step t is calculated using the following equation:

$$\Delta W_{ij}(t) = \sigma\,\delta_i\,O_j + \mu\,\Delta W_{ij}(t-1)$$

where:

- $W_{ij}$ = weight between two connected elements i and j
- $\delta_i$ = error at processing element i
- $O_j$ = output of element j
- $\sigma$ = learning constant
- $\mu$ = momentum

Weightings are adjusted using the following calculation:

$$W_{ij}(t+1) = W_{ij}(t) + \Delta W_{ij}(t)$$
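The two update equations can be written as a short Python function. This is a sketch of the update rule only, not of CDNN itself: the function name, the nested-list layout of the weights, and the numbers are illustrative assumptions, and the errors are taken as given rather than back-propagated through a real network.

```python
def update_weights(weights, prev_delta, errors, outputs, rate, momentum):
    """One training step for the weights W[i][j] linking element j to element i."""
    new_weights, new_delta = [], []
    for i, row in enumerate(weights):
        # Delta W_ij(t) = rate * error_i * output_j + momentum * Delta W_ij(t-1)
        d_row = [rate * errors[i] * outputs[j] + momentum * prev_delta[i][j]
                 for j in range(len(row))]
        new_delta.append(d_row)
        # W_ij(t+1) = W_ij(t) + Delta W_ij(t)
        new_weights.append([w + d for w, d in zip(row, d_row)])
    return new_weights, new_delta

# Hypothetical layer: one receiving element fed by two sending elements.
W = [[0.5, -0.2]]
dW = [[0.0, 0.0]]  # no previous change before the first step
for _ in range(2):
    W, dW = update_weights(W, dW, errors=[0.1], outputs=[1.0, 2.0],
                           rate=0.5, momentum=0.9)
print(W, dW)
```

On the second step the momentum term adds 0.9 times the previous change, so the weights move further than on the first step even though the error is unchanged.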

Ref: Bohm et al 1992

## Least squares analysis

Circular dichroism curves can be deconvoluted by two types of least squares analysis: constrained and unconstrained. Both algorithms treat the experimental protein spectrum as the sum of a set of reference or model protein spectra, where the fraction of each spectrum in the sum represents the proportion of the corresponding secondary structure present.

The algorithms minimize the total deviation between the experimental and calculated spectra. The difference between the two is that in the constrained version the conformation fractions must sum to 1 and each individual fraction must be greater than 0.

The accuracy of circular dichroism analysis is sensitive to errors in protein concentration. The unconstrained least squares method accounts better for these errors because normalization can be applied after the analysis. In the constrained fit, the normalization is imposed before results are obtained, which exposes the method to over- or underestimation caused by protein concentration error.

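$$r^{2} = \sum_{i=1}^{n} \left[\theta_{obs}(i) - \left(f_h\,\theta_h(i) + f_s\,\theta_s(i) + f_t\,\theta_t(i)\right)\right]^{2}$$

where $r^{2}$ = residual squared error, n = number of data points, $\theta_{obs}$ = experimental spectrum, f = fractional weighting, $\theta_h$ = helix contribution, $\theta_s$ = sheet contribution, and so forth for all secondary structure classes considered.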
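The effect of a concentration error on the unconstrained fit can be seen in a small Python sketch. It is an illustration under assumptions: the helix, sheet and turn spectra are invented, the fit solves the normal equations with a hand-written Gaussian elimination, and the 10% concentration error is simulated by scaling the spectrum. None of the function names come from a published program.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def least_squares_fractions(basis, spectrum):
    """Unconstrained fit of spectrum as a weighted sum of the basis spectra."""
    k = len(basis)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    A = [[dot(basis[i], basis[j]) for j in range(k)] for i in range(k)]
    rhs = [dot(basis[i], spectrum) for i in range(k)]
    return solve(A, rhs)  # normal equations of the residual r^2

# Hypothetical helix, sheet and turn spectra at four wavelengths.
basis = [[-36.0, -33.0, 70.0, 10.0],
         [-18.0, -8.0, 30.0, -5.0],
         [2.0, 5.0, -10.0, 20.0]]
true_f = [0.5, 0.3, 0.2]
# Simulate a 10% underestimate of the protein concentration.
spectrum = [0.9 * sum(f * b[i] for f, b in zip(true_f, basis)) for i in range(4)]

fit = least_squares_fractions(basis, spectrum)
total = sum(fit)
normalised = [f / total for f in fit]
print(round(total, 3), [round(f, 3) for f in normalised])
```

The raw fractions sum to 0.9 rather than 1, which exposes the concentration error. Normalizing after the fit recovers the fractions used to build the spectrum.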

Ref: Mao et al 1984, Yang et al 1978

## Ridge regression analysis

This algorithm is based on the least squares method but applies an additional statistical constraint that is similar in principle to ridge regression. The added regularizer acts to stabilize the error-prone least squares method (as defined in the Gauss-Markov model).
The equation below follows the usual treatment of circular dichroism, in which a CD spectrum is expressed as a weighted sum of model or reference spectra:

$$y(\lambda) = \sum_{i=1}^{N_f} f_i\, r_i(\lambda)$$

- $r_i(\lambda)$ = reference CD spectrum for secondary structure class i (structural content as determined from X-ray data)
- $f_i$ = fraction of residues in secondary structure class i
- $y(\lambda)$ = experimental CD spectrum

In order to deconvolute the experimental protein CD spectrum, the reference protein data is considered. The experimental CD spectrum is now expressed as a weighted sum of the reference protein spectra $R_j(\lambda)$:

$$y(\lambda) = \sum_{j=1}^{N_g} g_j\, R_j(\lambda)$$

By finding the g values for a set of reference protein CD spectra $R_j(\lambda)$, the experimental CD spectrum $y(\lambda)$ can be reconstructed. Each fraction of secondary structure ($f_i$) can be found by multiplying the weight coefficient $g_j$ by $F_{ij}$, the fraction of residues of protein j in conformational class i, and summing over the reference proteins:

$$f_i = \sum_{j=1}^{N_g} g_j\, F_{ij}$$

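The g values are then determined from the measured CD data values $y_{obsd}(\lambda_k)$ by minimising:

$$\sum_{k} \left[y(\lambda_k) - y_{obsd}(\lambda_k)\right]^{2} + \alpha \sum_{j=1}^{N_g} \left(g_j - \frac{1}{N_g}\right)^{2} = \text{minimum}$$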

Without the second term in the above equation, the procedure would be identical to least squares deconvolution. When α = 0, the solution is equivalent to the least squares solution, but when α > 0, the second term on the left of the equation (the regularizer) stabilizes the solution by keeping each $g_j$ small (close to $1/N_g$), which moderates the contributions from the different proteins of the reference set in the final solution. For example, if the conformation, and therefore the spectral characteristics, of a particular reference protein contribute either negatively or not at all to the experimental protein, then its coefficient becomes zero or negative, which can destabilize the solutions obtained. This confers an advantage over the linear least squares method, which assumes that each of the reference protein spectra exhibits structural and spectral characteristics in common with the experimental protein.
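The role of α can be checked numerically. The sketch below minimises the expression above by solving its normal equations, $(R R^{T} + \alpha I)\,g = R\,y_{obsd} + \alpha/N_g$, and then converts the weights to structure fractions. It is an illustration under assumptions: the reference spectra and fractions are invented, α is fixed by hand rather than chosen statistically, and the function names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ridge_weights(R, y_obs, alpha):
    """Weights g minimising the squared misfit plus alpha * sum (g_j - 1/Ng)^2."""
    n_g = len(R)
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    A = [[dot(R[i], R[j]) + (alpha if i == j else 0.0) for j in range(n_g)]
         for i in range(n_g)]
    rhs = [dot(R[j], y_obs) + alpha / n_g for j in range(n_g)]
    return solve(A, rhs)

def fractions(g, F):
    """f_i = sum_j g_j F_ij, with F[j][i] the fraction of class i in protein j."""
    return [sum(g[j] * F[j][i] for j in range(len(g))) for i in range(len(F[0]))]

# Hypothetical reference spectra (two proteins, three wavelengths) and fractions.
R = [[-30.0, -32.0, 60.0], [-15.0, -5.0, 20.0]]
F = [[0.8, 0.1, 0.1], [0.1, 0.6, 0.3]]
y_obs = [0.7 * a + 0.3 * b for a, b in zip(R[0], R[1])]

g_ls = ridge_weights(R, y_obs, alpha=0.0)      # plain least squares
g_reg = ridge_weights(R, y_obs, alpha=1e12)    # regularizer dominates
f_ls = fractions(g_ls, F)
print([round(g, 3) for g in g_ls], [round(g, 3) for g in g_reg])
```

With α = 0 the weights reproduce the mixing proportions used to build the spectrum, and with a very large α every weight is pulled to $1/N_g$. Useful values of α lie between these extremes.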

Ref: Provencher & Glockner 1987

## Principal component analysis

Principal component factor analysis was applied to circular dichroism analysis in order to identify a number of subspectra that reflect the variations in fractional secondary structure within the proteins in the reference database. The process expresses the reference set as a small number of characteristic parameters, by treating the CD of each reference protein at a particular wavelength as vectors within a functional space. The subspectra are similar to the orthogonal basis set in the SVD algorithm.

$$q_i(v) = \sum_{j=1}^{p} a_{ij}\, S_j(v)$$

Here $q_i(v)$, the CD spectrum of protein i in the reference set, is represented as a linear combination of p orthogonal functions (p << n), the subspectra $S_j(v)$, where p is the number of subspectra derived from the reference set. $a_{ij}$ is the fraction of subspectrum $S_j$ in spectrum i.

The vectors are then normalised and the correlation coefficients between the test protein CD spectrum and each reference protein spectrum are evaluated. This produces a matrix R. The R matrix is diagonalised, and the number of significant eigenvalues defines the dimensionality p (p << n) of the dataset.
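The normalisation and correlation step can be sketched as follows. The sketch assumes that the correlation coefficient between two spectra is the scalar product of the unit-length vectors, with no mean-centring, which is an assumption about a detail the text does not give. The spectra are invented and the diagonalisation of R is not shown.

```python
import math

def normalise(spectrum):
    """Scale a spectrum to unit length so that only its shape is compared."""
    norm = math.sqrt(sum(x * x for x in spectrum))
    return [x / norm for x in spectrum]

def correlation_matrix(spectra):
    """R[i][j] is the scalar product of the normalised spectra i and j."""
    unit = [normalise(s) for s in spectra]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

# Hypothetical spectra: the second is twice the first, the third is orthogonal.
spectra = [[1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [2.0, -1.0, 0.0]]
R = correlation_matrix(spectra)
print([[round(x, 3) for x in row] for row in R])
```

Spectra that differ only in magnitude give a coefficient of 1, which is why the magnitudes removed here have to be restored when the spectra are reconstructed.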

$$P^{T} R\, P = \Lambda$$

where $\Lambda$ is the diagonal matrix of eigenvalues. The $a_{ij}$ values relating the experimental protein spectra to the subspectra are determined from matrix P by:

$$a = P^{T}$$

The first p rows of the matrix $P^{T}$, which represent the most important components of the experimental spectra, are used to determine a. The other subspectra represent uncorrelated noise.

The subspectra matrix is then derived by using the projection matrix P:

$$S = q\,P$$

Using the S matrix and the coefficients a, the experimental spectrum of each studied sample can be reconstructed by summing the linear combinations of the $S_j$ functions, after multiplying the $a_{ij}$ by the corresponding normalisation factors to restore the CD magnitude that was normalised out.

A complete back transformation is then performed relating the weighting coefficients to secondary structure in a similar manner to the SVD algorithm.

Another implementation of this method has been described by Pancoska et al, using clustering techniques instead of a back transformation in order to assign the experimental protein secondary structure. The reference proteins are clustered according to their calculated PC/FA spectral coefficients, and the reference data are similarly clustered according to their known structure fractions. A combination of linear and multilinear regression analysis enables the fractional content of an unknown protein to be evaluated by systematically searching for the combinations of coefficients that produce the best regression scores. In this way the algorithm assumes that not all spectral and structural properties of proteins are correlated. Using multilinear regression analysis allows for the fact that a particular subspectrum may include information regarding more than one structural class, or that more than one subspectrum may contain information concerning a particular conformation.

Ref: Pancoska et al 1996

## Convex constraint analysis

This algorithm is based on three constraints and can deconvolute any set of CD spectra into its component curves and conformational weights. The method does not rely directly upon X-ray crystallographic data for the reference set of proteins. Instead, hypothetical protein data sets are constructed with systematically varying conformational weights, and the goal is to assign conformational weights of a set of subspectra to the protein of interest.

The following simplified equation is used to generate the CD spectra of hypothetical proteins assuming 3 classes of structure (P = 3):

$$f^{h}_{j}(\lambda) = C_{\alpha\text{-helix}}\; g_{\alpha\text{-helix}}(\lambda) + C_{\beta\text{-sheet}}\; g_{\beta\text{-sheet}}(\lambda) + C_{\beta\text{-turn}}\; g_{\beta\text{-turn}}(\lambda)$$

$C_i$ is the weight of the ith component curve $g_i(\lambda)$. The data sets are represented in P-dimensional (Euclidean) space by pinpointing coefficients derived from their conformational weights.

The hypothetical curves $f^{h}_{j}(\lambda)$ are then fitted by calculated curves $f^{c}_{j}(\lambda)$, where i runs over the P component curves and j over the spectra in the data set. The calculated curve has the form:

$$f^{c}_{j}(\lambda) = \sum_{i=1}^{P} C_{ij}\; g_i(\lambda)$$

$g_i(\lambda)$ is the ith component curve. At this point the algorithm is, in effect, searching systematically for the most likely candidates to act as component spectra. A protein is defined as a pure spectral prototype if its CD curve is identical to $g_i(\lambda)$.

The algorithm does not distinguish between the test protein and the proteins in the reference set. It constrains the weight coefficients $C_{ij}$ of each spectrum to sum to one, and each individual weighting coefficient must be non-negative.
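The construction of hypothetical spectra under these constraints can be sketched in Python. This illustrates only the mixing step and the two weight constraints, not the volume minimization. It assumes that the weights are non-negative real numbers that sum to one, the component curves are invented, and the function names are hypothetical.

```python
def check_weights(weights, tol=1e-9):
    """Raise if the conformational weights violate the two constraints."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")

def mix(weights, components):
    """f(l) = sum_i C_i * g_i(l) for one hypothetical protein."""
    check_weights(weights)
    return [sum(w * g[k] for w, g in zip(weights, components))
            for k in range(len(components[0]))]

def weight_grid(steps):
    """All weight triples on a regular grid, for P = 3 classes of structure."""
    return [(i / steps, j / steps, (steps - i - j) / steps)
            for i in range(steps + 1) for j in range(steps + 1 - i)]

# Hypothetical helix, sheet and turn component curves at three wavelengths.
components = [[-36.0, -33.0, 70.0], [-18.0, -8.0, 30.0], [2.0, 5.0, -10.0]]
grid = weight_grid(4)  # systematically varying conformational weights
curves = [mix(w, components) for w in grid]
example = mix((0.5, 0.25, 0.25), components)
print(len(curves), example)
```

Every triple produced by `weight_grid` satisfies both constraints by construction, so the set of hypothetical proteins covers the allowed region of weight space evenly.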

Once the pinpointing coefficients of the experimental spectra have been defined, the structure fractions are known and a solution is obtained. The best solution is selected by applying a volume minimization algorithm to the Euclidean space. The smaller this volume, the better the approximation of the results and therefore the assignment of pinpointing coefficients.

The volume minimization algorithm allows a finite number of component curves to be extracted from the reference spectra. It is effectively a measure of maximum likelihood, ensuring that sensible values for the $g_i(\lambda)$ functions are obtained in the following equation:

$$f(\lambda) = \sum_{i} w_i\, g_i(\lambda) + \text{noise}$$

where $w_i$ is the weighting of the secondary structure component in the protein.

The method is proposed to be well suited to monitoring how the conformation of proteins and peptides in solution changes with temperature, pH or ligand binding, as it can be used effectively for deconvoluting large datasets.

Ref: Perczel et al 1992

## Matrix descriptors

This method quantitatively estimates the amount of each secondary structure present in the test protein, with an accuracy of prediction comparable to that of conventionally predicted average fractional secondary structures (Pancoska et al 1999). The method uses matrix descriptors to represent the proteins of the basis set, indicating the number of segments of each secondary structure within the protein and the number of residues contributing to joins between secondary conformations.

Analysis begins by clustering the reference set proteins in terms of structure content and spectral similarity. The spectra are then processed by neural networks that are optimized to predict protein structure of a particular structural class, to which the spectra of the protein of interest belongs.

The matrix descriptor results from the neural network are then normalised according to the structural class data. The most successful analyses required neural networks with 100 input neurons, 27 hidden layers and 8 output neurons to assign the descriptor values. A hyperbolic transfer function and a normalised cumulative delta rule algorithm were used to update the weightings of the layers during the training phase.

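Example matrix descriptor for Protein A (rows give the structure a connection starts from, columns the structure it leads to):

| from \ to | helix | sheet | coil |
|---|---|---|---|
| helix | H = 3 | he = 0 | hc = 3 |
| sheet | eh = 1 | E = 9 | ec = 8 |
| coil | ch = 2 | ce = 9 | C = 13 |

- H = number of helical segments
- E = number of sheet segments
- C = number of random coil segments
- he = helix to sheet connections
- ec = sheet to coil connections
- ch = coil to helix connections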

The theory behind this type of analysis is that two proteins with identical fractional components of helix, sheet and random coil can yield different spectral effects as a function of the average strand or helix length. A further reason is that the phi/psi torsion angles in the peptide backbone are often distorted at the ends of secondary structural elements, particularly where secondary structures are adjacent. This can lead to changes in the ordered arrays of chromophores in secondary structural positions, and therefore to spectral effects. Individual effects are thought to be small, yet 20% variability between the spectra of two helical peptides of identical fractional content has been observed, where the peptides differed threefold in helix length.

Ref: Pancoska et al 1999