Max-Planck-Institut für Informatik

Recco - Recombination Analysis Using Cost Optimization

Make sure to read the Walkthrough Guide - it summarizes my experience using Recco!

Contents

  1. Download and Installation
  2. The Graphical User Interface
    1. Main View
    2. Alignment View
    3. The P-Value Inspector - How to Keep Complexity at Bay
    4. File Menu
    5. View Menu
    6. Settings Menu
      1. Mutation Cost
      2. Gap Cost
  3. Command-line Arguments

1. Download and Installation

Recco was tested with Java 1.4.2 and 1.5.0, but might also work with older versions of Java. Please download and install the Java Runtime Environment, if you do not have it already.

 

2. The Graphical User Interface

2.1 Main View

An example of the output of Recco for the dataset R1R2.fa. Description:

Most results are visualized interactively by the GUI. View settings (in the view menu) usually affect the display immediately, while settings affecting the computation (in the settings menu or in the alignment or parametric view) update the results and schedule new computation jobs. These jobs are processed immediately if autostart is enabled.

 

2.2 Alignment View

The alignment view falls into three parts, the names, the sequences and the positions. The name component has the following tasks:

The positions and sequences components do not accept mouse input. Depending on the setting in the view menu, the sequences component shows one of the following:

  1. optimal solution
    • the sequence strips that are part of an optimal explanation of the putative recombinant are shown with a red background.
  2. cost measure cip
    • the color of a nucleotide visualizes the total cost of an explanation forced through that nucleotide
    • red is low cost and blue is high cost. Bright red nucleotides visualize the optimal solution as in 1.
  3. robustness measure rip
    • the color of a nucleotide visualizes the robustness score.
    • red
  4. breakpoint p-values
    • does not refer to an optimal solution (!) and is not very helpful
    • red visualizes low (=more significant) p-values, blue visualizes high p-values

The computation of the optimal solution always refers to a setting of alpha as shown e.g. in the parametric view. A white foreground color in the alignment view highlights mutations with respect to the putative recombinant sequence.

Examples:

The following image shows cip with R1 as the putative recombinant and R2 excluded from the analysis:

The same analysis result as obtained by selecting View=>Optimal Solution Only:

2.3 P-Value Inspector - How to Keep Complexity at Bay

You can find a lot more information on how to use the P-Value Inspector in the Walkthrough Guide.
The P-Value Inspector condenses the information of an analysis and displays the discovered recombination events. It is only shown if View=>Show P-Value Inspector is selected:

The table shows the recombination events in the dataset that satisfy the filter criteria. Each row in the table describes a single recombination event. For more details on the computation of the p-values, see the paper and the following section.

Selecting a row (i.e. a recombination event) visualizes the recombination in the alignment view: the sequence and alpha value are set accordingly, and the breakpoint position is centered in the alignment view.

Other actions of the P-value Inspector:

 

2.3.1 How P-Values are Computed

The following exposition is for anybody who wants to know what happens behind the scenes. You can safely skip this section.

Recco computes p-values for recombination in the whole dataset, for each sequence, at each position, and at each position for a specific sequence. The p-values are based on sij, the amount of mutation cost that can be saved by allowing for recombination at position i in the explanation of sequence j. By permuting the columns of the alignment and recomputing sij for the permuted dataset, we can estimate the distribution of sij under the null hypothesis of no recombination. Now let Xij be the random variable (i.e. distribution) for sij under the null hypothesis and xij be the values for the unpermuted dataset. Then we define:

p-value for the whole dataset
p-value for sequence j
p-value for position i
p-value for position i and sequence j
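
The column-permutation idea described above can be illustrated with a minimal Python sketch. It is not Recco's implementation: toy_savings is a hypothetical stand-in for sij that allows at most one breakpoint and uses plain mismatch counts, the first sequence is assumed to be the putative recombinant, and the estimator (hits + 1) / (n + 1) is a common convention assumed here rather than taken from the paper. All function and parameter names are made up for illustration.

```python
import random


def mismatches(a, b):
    """Count positions at which two equally long strings differ."""
    return sum(x != y for x, y in zip(a, b))


def toy_savings(alignment):
    """Toy stand-in for the savings statistic (NOT Recco's cost optimization).

    The first sequence is treated as the putative recombinant. The result is
    the number of mismatches saved by allowing a single breakpoint.
    """
    query, parents = alignment[0], alignment[1:]
    no_recombination = min(mismatches(query, p) for p in parents)
    best = no_recombination
    for i in range(1, len(query)):
        for left in parents:
            for right in parents:
                cost = (mismatches(query[:i], left[:i])
                        + mismatches(query[i:], right[i:]))
                best = min(best, cost)
    return no_recombination - best


def permute_columns(alignment, rng):
    """Return a copy of the alignment with its columns in shuffled order."""
    order = list(range(len(alignment[0])))
    rng.shuffle(order)
    return ["".join(seq[k] for k in order) for seq in alignment]


def empirical_p_value(alignment, statistic, n_permutations=200, seed=0):
    """Estimate how often a permuted dataset reaches the observed statistic."""
    rng = random.Random(seed)
    observed = statistic(alignment)
    hits = sum(
        statistic(permute_columns(alignment, rng)) >= observed
        for _ in range(n_permutations)
    )
    # Assumed convention: add one to both counts so the estimate is never zero.
    return (hits + 1) / (n_permutations + 1)
```

Permuting whole columns keeps the content of every column intact but destroys the order in which the columns appear, which is why a statistic that depends on contiguous stretches of similarity is expected to drop under the null hypothesis.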

In the following, we focus on a single recombination event. We define a recombination event as some interval i1 ≤ i ≤ i2 for some sequence j where c := xij has a constant value. We then assign to each recombination event the following p-values:

dataset p-value
the p-value for recombination in the dataset if the recombination event was the strongest in the whole dataset
sequence p-value
the p-value for recombination in the sequence if the recombination event was the strongest in the sequence
sequence breakpoint p-value
the median of the p-values for sequence j and any position between i1 and i2. Please use this value as an indicator only, as it is statistically hard to justify taking the median of some p-values
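
As a rough illustration of the interval definition and the median-based breakpoint value, the following Python sketch groups consecutive positions with the same savings value for one sequence and takes the median of the per-position p-values inside each interval. The function name, the 0-based positions, and the choice to skip runs with zero savings are assumptions made for this example, not documented Recco behavior.

```python
from itertools import groupby
from statistics import median


def find_events(savings, p_values):
    """Group positions of one sequence into intervals of constant savings.

    savings[i] and p_values[i] refer to position i (0-based here).
    Runs with zero savings are skipped (an assumption of this sketch).
    """
    events = []
    start = 0
    for value, run in groupby(savings):
        end = start + len(list(run)) - 1
        if value > 0:
            events.append({
                "i1": start,
                "i2": end,
                "savings": value,
                # Indicator only: the median of the p-values in the interval.
                "breakpoint_p": median(p_values[start:end + 1]),
            })
        start = end + 1
    return events
```

Calling find_events with one list of savings and one list of p-values for a sequence returns one dictionary per interval, each holding its boundaries i1 and i2.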

 

2.4 File Menu

This is pretty self-explanatory:

 

2.5 View Menu

This menu is rather self-explanatory and changes how and which data is visualized. Be sure to enable Show P-value Inspector.

 

2.6 Settings Menu

Besides the "Stop Computation" menu item, greyed out menu items are not implemented. The settings menu is used to change the following input parameters:

 

2.6.1 Mutation Cost

The mutation cost m(a, b) defines the cost of matching a character a with a character b. Gaps '-' and unknown characters 'N', '?' are treated like any other character in the algorithm. Therefore, it is important to set the associated costs carefully. For example, we can avoid pairing gaps preferentially if we set the mutation cost m(-, a)=0 for any character a.
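
To make the role of m(a, b) concrete, here is a small Python sketch of a mutation cost table and the column-by-column cost of matching two aligned sequences. The function names, the default mismatch cost of 1, and the alphabet are assumptions for illustration; this is neither Recco's code nor its matrix file format.

```python
def make_mutation_cost(alphabet="ACGTN?-", mismatch=1.0, gap_to_any=0.0):
    """Build a table m[(a, b)] with hypothetical default costs.

    Identical characters cost 0. A gap in the first argument costs
    gap_to_any, mirroring the m(-, a) = 0 example in the text. Every other
    pair costs `mismatch`, so 'N' and '?' are treated like any other
    character.
    """
    cost = {}
    for a in alphabet:
        for b in alphabet:
            if a == b:
                cost[(a, b)] = 0.0
            elif a == "-":
                cost[(a, b)] = gap_to_any
            else:
                cost[(a, b)] = mismatch
    return cost


def total_mutation_cost(x, y, cost):
    """Sum m(a, b) over the columns of two aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same aligned length")
    return sum(cost[(a, b)] for a, b in zip(x, y))
```

Note that the table built here is asymmetric: only a gap in the first argument is free, so which sequence is passed first matters.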

Predefined mutation cost matrices include:

Additionally, it is possible to create, load and save user-defined mutation matrices by selecting the "User defined..." menu option. The file format is a pure text file and straightforward to adapt.

 

2.6.2 Gap Cost

As we use a multiple alignment including gaps as an input, we have to decide how to score gaps. Consider this example:

recombinant  AC----GT----CTGGTAGCGCT
explanation  ACGAGCCT----CCT----GCGC

The upper sequence shows the (putative) recombinant that we seek to explain by recombination and mutation. The explanation is the sequence that is obtained by recombination of the other sequences in the alignment. In our case, there are three different kinds of gaps (in order of appearance in the alignment above):

  1. a gap in the recombinant
    As our goal is to explain the recombinant, this gap is only discarding information we do not need. As such it should only involve low or zero costs.
  2. a gap in the recombinant and the explanation
    This setting is a result of using a multiple alignment as input and does not constitute a real gap. Incorporating a paired gap in the solution does not involve any cost.
  3. a gap in the explanation
    An interpretation of this gap is that we do not have information to explain part of the recombinant. Consequently, it should be scored with a rather high cost.
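
The three gap types can be told apart column by column. The following Python sketch classifies each column of the example alignment and sums per-column extension costs for types 1 and 3. The function names and the type 3 cost of 2.0 are arbitrary assumptions for illustration, not Recco defaults.

```python
def classify_gaps(recombinant, explanation):
    """Label each column: 0 = no gap, 1 = gap in the recombinant only,
    2 = gap in both sequences, 3 = gap in the explanation only."""
    kinds = []
    for r, e in zip(recombinant, explanation):
        if r == "-" and e == "-":
            kinds.append(2)
        elif r == "-":
            kinds.append(1)
        elif e == "-":
            kinds.append(3)
        else:
            kinds.append(0)
    return kinds


def gap_extension_cost(recombinant, explanation, type1_cost=0.0, type3_cost=2.0):
    """Sum hypothetical per-column extension costs; paired gaps (type 2) are free."""
    kinds = classify_gaps(recombinant, explanation)
    return type1_cost * kinds.count(1) + type3_cost * kinds.count(3)


recombinant = "AC----GT----CTGGTAGCGCT"
explanation = "ACGAGCCT----CCT----GCGC"
kinds = classify_gaps(recombinant, explanation)
```

For the example alignment, each of the three gap types covers four columns, so with the assumed costs only the four type 3 columns contribute to the total.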

In the gap cost dialog you can assign gap extension costs for gaps of type 1 and type 3 separately. Biologically, the cost for gaps of type 1 should be very small, so a cost of 0 is appropriate. The cost for gaps of type 3 can also easily be changed in the toolbar, in case you need to experiment with it.

Gap open costs have been disabled as the permutation test for computing p-values reports wrong results in this case.

 

3. Command-line Arguments

Recco runs an interactive GUI only if no command-line arguments are specified. A help text is displayed if you specify a single or an invalid command-line argument. The output format is the same as for the "Save Analysis as Text..." menu item.