Creating a Q-matrix to use in PopGenHelpR
Source:vignettes/articles/PopGenHelpR_createQmatrix.Rmd
PopGenHelpR_createQmatrix.Rmd
Purpose
To generate a q-matrix of ancestry coefficients for use in the
PopGenHelpR
functions Ancestry_barchart
and
Piechart_map
.
What is a Q-matrix?
A q-matrix is a matrix containing as many rows as individuals and columns as genetic clusters. Each cell represents an ancestry coefficient (also known as cluster assignments), which is the contribution of a genetic cluster to a particular individual. Q-matrices are commonly used in population genomics to evaluate gene flow between populations (e.g., admixture) or species (e.g., introgression). ADMIXTURE (Alexander et al., 2009) and sNMF (Frichot et al., 2014) are commonly used software to estimate the number of genetic clusters in data and generate ancestry bar charts with q-matrices.
Let’s generate some q-matrices now that you know what they are! We will create q-matrices using each of the programs mentioned above.
sNMF
We will start with sNMF because it is implemented in the R package LEA
(Frichot & Francois, 2015). After running sNMF (see this
tutorial if you need help) you just need to use the Q
function.
# If I have a sNMF project named sNMFobject with K number of ancestral populations (genetic clusters), and my best run is run 1 (determined as the run with the lowets cross-entropy)
Qmat <- Q(sNMFobject, K = K, run = 1)
All you need to do now is append the sample names to the q-matrix as
the first column (you can do this with cbind
or in any text
editor). Then you can use it in PopGenHelpR
. Note that you
must be careful that your order in the q-matrix is the same as the order
of the samples you are appending.
Example of formatting the q-matrix for PopGenHelpR
Here we show you how to format a q-matrix generated with the
Q
function from LEA for use in
PopGenHelpR
.
First, we will create a matrix that we may expect from LEA. We also need to create fake sample names Please note that this is only a toy example and is not real data.
# Create fake matrix
Qmat <- t(matrix(data = c(0.25, 0.4, 0.35), nrow = 3, ncol = 3))
Fake_inds <- c("FS_1", "FS_2", "FS_3")
Cool! We have data, so can we use it in PopGenHelpR
? No,
because Ancestry_barchart
and Piechart_map
need a data.frame or CSV; these functions also need the first column to
be the individual names. PopGenHelpR
uses the individual
names as a key to link the q-matrix data with populations and
coordinates.
Let’s add the individual names!
# Add the names
Qmat_wnames <- cbind(Fake_inds, Qmat)
So can we use this Qmat_wnames
now? No, because
Qmat_wnames
is still a matrix and let’s see what
cbind
did to our numeric data. Notice that
cbind
make everything a character, we need the cluster
contributions (columns 2 through 4 here) to be numeric. We will fix this
using the sapply
function.
# Check the structure of the Qmat_wnames
str(Qmat_wnames)
#> chr [1:3, 1:4] "FS_1" "FS_2" "FS_3" "0.25" "0.25" "0.25" "0.4" "0.4" "0.4" ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:4] "Fake_inds" "" "" ""
Qmat_df <- as.data.frame(Qmat_wnames)
Qmat_df[2:4] <- sapply(Qmat_df[2:4], as.numeric)
# Check again
str(Qmat_df)
#> 'data.frame': 3 obs. of 4 variables:
#> $ Fake_inds: chr "FS_1" "FS_2" "FS_3"
#> $ V2 : num 0.25 0.25 0.25
#> $ V3 : num 0.4 0.4 0.4
#> $ V4 : num 0.35 0.35 0.35
Notice that our cluster contribution columns are now numeric and that
our Qmat_df
object is a data.frame.
Now we can use it in PopGenHelpR
with a population
assignment file/data.frame to generate figures.
ADMIXTURE
ADMIXTURE is a little more complex because it is not associated with an R package, but it is nice because it gives us the q-matrix automatically. See this tutorial for more details.
In the example below, we tell ADMIXTURE to use a bed file as input and run the analysis with a K value of 5.
This will output a file with the .Q extension, which contains ancestry coefficients for each individual (our q-matrix).
Questions???
Please email Keaka Farleigh (farleik@miamioh.edu) if you need help generating a q-matrix or with anything else.
References
Alexander, D. H., Novembre, J., & Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), 1655-1664.
Frichot, E., & François, O. (2015). LEA: An R package for landscape and ecological association studies. Methods in Ecology and Evolution, 6(8), 925-929.
Frichot, E., Mathieu, F., Trouillon, T., Bouchard, G., & François, O. (2014). Fast and efficient estimation of individual ancestry coefficients. Genetics, 196(4), 973-983.