R Generator and a Colorful PCA

Feb 23rd, 2014

As simple as it may seem, sample data generation is not a trivial task, especially when random landmarks are to be generated. Usually, one would use multivariate normal distribution-based generator (like mvrnorm in R`s MASS package) in order to generate correlated data. The following function was used to generate data similar to the real world datased, unfortunately based directly on it, by using the coefficients from regressions between successive columns in a data matrix, which represent landmark coordinates in XY data matrix. The following function uses this data matrix (446 individuals and 28 landmarks), and performs regressions between X-Y pairs for all coordinates. Finally it uses intercepts and slopes to infer mean and SD for rnorm function, random number generator.

Resampler function for random resamples of a real data matrix

resampler <- function(mat) #simulations based on rnorm random sampling
{
  x <- dim(mat)[1]
  y <- dim(mat)[2]
  indexrow <- c(2:x)
  combinations <- matrix(c(1:y,2:y), ncol = 2) #ignore the warning message
  slope <- numeric(y)
  intercept <- numeric(y)
  sds <- numeric(y)
  for(i in 1:y)
  {
    veca <- mat[,combinations[i,]][,1]
    vecb <- mat[,combinations[i,]][,2]
    model <- lm(veca~vecb)
    slope[i] <- coef(model)[2]
    intercept[i] <- coef(model)[1]
    sds[i] <- sd(mat[,combinations[i,]][,1])
  }
  coefs <- data.frame(intercept,slope,sds)
  cexox <- data.frame(c(1:x))
  for (i in 1:y)
  {
    dataCol <- rnorm(length(mat[,combinations[i,]][,2]),mean=intercept[i]+slope[i]*mat[,combinations[i,]][,2],sd=sds)
    cexox <- cbind(cexox, dataCol)
  }
  sampleMatrix <- as.matrix(cexox)
  sampleMatrix <- sampleMatrix[,-1]
  return(sampleMatrix)
}

resampledCap <- resampler(capreolusMatrix) #resample the original matrix-generate random coordinates

When the function finishes the output is also an XY matrix, which needs to be converted to an array in order to use it in gpagen function from the geomorph package. After that the procedure follows all the usual steps of the GM analysis, with the exception of factor levels generation in order to simulate grouping, and finally performing PCA on the Procrustes shape variables.

Basic GM procedures and factor level generation

library(geomorph)
capreolusArray <- arrayspecs(resampledCap, 28, 2, byLand = FALSE)
capreolusGPA <- gpagen(capreolusArray, ShowPlot = FALSE)
pop <- sample(5, 446, replace = TRUE) #generate random 5 population partition
pop[which(pop == 1)] <- "pop1"
pop[which(pop == 2)] <- "pop2"
pop[which(pop == 3)] <- "pop3"
pop[which(pop == 4)] <- "pop4"
pop[which(pop == 5)] <- "pop5"
capreolusGPA2d <- two.d.array(capreolusGPA$coords) #get the data in XY format for PCA

PCA is then done using the usual R`s prcomp function and ggplot2 for plotting the data points using fantastic ColorBrewer color schemes (which are the names of types and palettes in ggplot2 scale_color_brewer geom). In order to fine-tune the PCA figure, the ggplot2 can also use custom fonts for plot annotation. Prior to that, fonts must be imported and registered, which is greatly facilitated by using the extrafont library.

PCA and ggplot2 code for a PCA scatterplot

capPCAwhole <- prcomp(capreolusGPA2d)
capPCA <- data.frame(capPCAwhole$x[,1], capPCAwhole$x[,2], capPCAwhole$x[,3], capPCAwhole$x[,4])
capPCA <- data.frame(capPCA, pop)
names(capPCA) <- c("PC1","PC2","PC3","PC4","pop") #prepare a data.frame for ggplot2
meanPCA1 <- aggregate(capPCA[,1], mean, by = list(capPCA[,5])) #calculate average PC score per group for plotting
meanPCA2 <- aggregate(capPCA[,2], mean, by = list(capPCA[,5]))
meanPCA <- data.frame(meanPCA1, meanPCA2[,2])
names(meanPCA) <- c("pop","PC1","PC2")

library(extrafont) #for using i.e. Times New Roman Fonts in ggplots
font_import(pattern="[T/t]imes") #this imports Times font family
loadfonts(device="pdf")

library (ggplot2)
theme_set(theme_bw())
pcaplot <- ggplot(capPCA, aes(x=PC1, y=PC2, group = pop)) + geom_point(size = 7, shape = 19, aes(color=pop)) + scale_color_brewer(palette="Set1")
pcaplot <- pcaplot + theme(panel.grid.major = element_line(size = 0.8, linetype = 2)) + theme(panel.grid.minor = element_line(size = 1, linetype = 2))
pcaplot <- pcaplot + theme(text=element_text(size=20, family="Times New Roman"), legend.text=element_text(size = 22, family = "Times New Roman"), legend.title = element_text(family ="Times New Roman")) + xlab("PC1") + ylab("PC2")
pcaplot <- pcaplot + geom_point(data = meanPCA, size = 14, shape = 19) + geom_text(data = meanPCA, size = 10, label = meanPCA$pop, family = "Times New Roman", vjust = -0.9)
pcaplot

'PCA'

The plot indicates very little differentiation between the populations, but I guess that`s well expected since so much randomness is at hand.

XY Outlines in GIMP and imageJ

Feb 1st, 2014

This post continues on the first post about outline deformations and presents a simple way for generating outlines as XY data. XY format is very convenient for representing outline and line art in R or python. This type of visualization can be produced using GIMP and imageJ through a sequence of easy steps. GIMP enables easy extraction of the central object from the image background. First step in outline extraction should be the precise definition of the object boundaries, such that the contrast between the object and the background is maximal. Example picture for this post is available here. After opening the picture in GIMP workspace it looks like in Figure 1.

'Gimp import' 'Gimp scissors'

The easiest way to separate this chamois cranium from its background is to use Scissor Select Tool from the GIMP toolbox. This tool can nicely track the path between sucessive control points that should be placed along the desired object (Figure 2). When control points are placed along the shape, selection can be completed by pressing enter, and the background can be deleted by inverting selection pressing Ctrl+I. and deleting it using the delete key (Figure 3). Selection should be inverted once more and converted to path; from the Select menu To Path. Finally, stroke path (Figure 4, path card and right click on selection, Stroke path) gives the desired outline that can be exported as .tiff for imageJ through Export in the File menu.

'Gimp no background' 'Gimp stroke path'

In imageJ, picture should be converted to binary in Process menu, by selecting Binary and Make Binary. Final step is saving the binary image selection as XY data. First the outline should be selected by the magic wand selection tool (Figure 5) and then the image should be saved in XY format, in File menu, Save as and select XY coordinates. The XY data is also available here.

'imageJ'

Now the .txt file could be easily imported into R and python.

Random Landmark Monsters in Python

Jan 17th, 2014

Generating random datasets is facilitated by powerful algorithms in both R and python that are capable of drawing random numbers from predefined distributions. For simulating geometric morphometric data the only important constraint is correlation between x and y coordinates of landmarks. Spacing of landmarks in also important, especially if the aim is to generte life-like data set, but it is not essential (hopefully) for morphometric analyses to work. Python scipy stack (numpy, scipy, pandas and matplotlib + Ipython console) is a great budnle which enables both random number generators and great visualizations. The code presented in this post uses all of these libraries, especially pandas, which offer the basic data structure used for storing landmark data, DataFrame. Also, it can just be pasted into Ipython console window (afer installing any of the scipy stack python distributions). The basic idea is to use uniform distribution random number generator (line 7) in order to generate ranges for the spacing of landmark groups, and then multivariate normal random distribution for creating the two-column correlated data (x, y coordinates, line 8). This code will simulate the situation with 10 landmarks recorded on 200 individuals. Spacing of landmarks can be controlled by the ranges of uniform generators (175-220 in code), and the spread of landmarks is contolled by input covariance matrix for multivariate normal generators ([[3,0],[0,3]] in code).

Importing libraries and generating random landmarks

import numpy as np
import pandas as pd
import brewer2mpl as bmpl
import matplotlib.pyplot as plt
import scipy.interpolate

coordstest = np.vstack([np.random.uniform(175, 220, 10), np.random.uniform(175, 220, 10)]).T #genereta coordinate ranges - spacing of landmarks
coords = np.vstack([np.random.multivariate_normal(coordstest[i,:], [[3,0],[0,3]], 200) for i in range(10)]) #correlated x and y from multivariate normal
coordinates = (np.arange(0,10).reshape(-1,1)*np.ones(200).reshape(1,-1)).flatten() #generate factor coodinates each landmark has 200 recorded points
coordinates = pd.Series(coordinates)
allCoords = pd.DataFrame(coords, columns = ['x','y']) #coordinates DataFrame
allCoords['coordinates'] = coordinates
meanCoords = allCoords.groupby('coordinates').mean() #mean landmark coordiantes
ind = np.arange(0,10)
ind = pd.Series(ind)
meanCoords['ind'] = ind

allCoords contain all generated landmarks, while meanCoords is the collection of mean coordinates for each landmark. In order to visualize the generated data, a convenient “outline” can be drawn, connecting all landmarks smoothly. Since generated landmarks are not ordered in a way that permits simple connect-the-dots line between them, they should be rotated, and ordered differently. This can be done by using the centroid coordinates (from all landmarks) and the polar angle between the lines connecting centroid and each landmark. If polar angle is used, then landmarks (from meanCoords) can be ordered according to its value, clockwise. Polar angle can be calculated as the arctan between x and y coordinates of mean landmarks and the centroid.

Calculating polar angle and re-ordering of the meanCoords

x = np.array(meanCoords['x']) #convenience function
y = np.array(meanCoords['y'])
points = np.array((x,y)).T
cent = (np.sum(meanCoords['x'])/len(meanCoords), np.sum(meanCoords['y'])/len(meanCoords)) #the overall centroid
angle = np.arctan2(points[:,1]-cent[1],points[:,0]-cent[0]) #summary polar angle of all points
points = pd.DataFrame(points, columns = ['x','y'])
angle = pd.Series(angle)
points['angle'] = angle
points = points.sort('angle') #sort by polar angle from the centroid

If a path would be drawn between the mean landmarks now, it would be irregular, and not too informative. One way of constructing the smooth connection between landmarks is to use interpolation algorithms that are part of scipy.interpolate. One issue with this approach is the connection between the first and the last landmark, since it must be added to points DataFrame as eleventh landmark, with x,y coords the same as for the first one, in order to complete the path. This added segment behaves erratically during interpolation so the generated figures might be distorted. But this will not happen always and all program routines could be re-run as long as the desired, nice, result emerges.

Interpolation of the outline data

x_hull = np.array(points['x'])
x_hull = np.append(x_hull, x_hull[0]) #add first landmark as the last one (x)
y_hull = np.array(points['y'])
y_hull = np.append(y_hull, y_hull[0]) #add first landmark as the last one (y)
t = np.zeros(x_hull.shape)
t[1:] = np.sqrt((x_hull[1:] - x_hull[:-1])**2 + (y_hull[1:] - y_hull[:-1])**2)
t = np.cumsum(t)
t /= t[-1]
nt = np.linspace(0, 1, 100)
x1 = scipy.interpolate.spline(t, x_hull, nt) #interpolated coordines for smooth lines
y1 = scipy.interpolate.spline(t, y_hull, nt)

Finally, plots are produced sequentially, first all landmarks, then mean landmarks, and finally the outline.

Plotting

plt.scatter(allCoords['x'], allCoords['y'], s = 25, c = "#FFD699", edgecolors = 'none') #plot all sampled points
plt.scatter(x, y, c = ind, s = 60, cmap = bmpl.get_map('Set3', 'qualitative', 7).mpl_colormap) #plot means with color brewer palette
plt.plot(x1, y1, '--', c = "#97CAFF", lw = 3) #plot outlines
labels = ['Landmark {0}'.format(i) for i in range(10)]
for label, x, y in zip(labels, points['x'], points['y']): #annotate mean landmarks by numbers
 plt.annotate(label, xy = (x, y+0.3),ha = 'right', va = 'bottom')
plt.grid()

'First monster' 'Second monster'

Generated “monsters” are just there to get the idea of a possible shape, although the create the impression of the “outline”, such that all landmarks are sampled from the outer perimeter of the object. The code presented would not be complete if it wasn`t for the help from people from stackoverflow (here, here and here).

Colorful Outlines for Shape Comparison

Jan 9th, 2014

This procedure is based on the outlines generated from the digital photos using imageJ and converting imageJ images to x-y continuous outline data available from here. Its goal is to present a visual overview of global shape differences between individuals from natural populations, using landmark data from ventral projection of their crania. All outlines are based on deformation via Thin Plate Splines, using mean shapes for populations as deformation targets and references. Superimposition methods as well as preliminary GM analyses were done in R and marvelous geomorph package by Dean Adams and Erik Otarola-Castillo. Additionally, since one of the common points of contemporary scientific research is the reproducibility of solutions offered all posts will also contain randomly generated sample data that could be used similarly to real-world datasets. Sample generation can be very useful, especially in teaching, so I intend to focus on it in future posts.

Importing libraries and generating the basic dataset (files should be placed in your working directory)

library(geomorph)
library(ggplot2)
library(Morpho)
d <- read.table("ventralnoOutlineCap.txt") #outline data
load("capreolusRgen.RData")
capreolusArray <- capreolusSample1 #or capreolusSample2-5

The workspace “capreolusRgen.RData” (which can be downloaded from here) contains several randomly generated datasetes of 657 individuals and 28 landmarks, named “capreolusSample#”. These data was generated on the basis of real-world values, using the linear regression model to control random number generators. The code that was used probably does not repoduce the sampling of landmarks well, especially regarding correlations between pairs of landmark coordinates or the landmarks that conform to the object symmetry, but for the purpose of illustration in this post, I hope they should be fine.

Procrustes superimposition

capreolusGPA <- gpagen(capreolusArray, ShowPlot = FALSE)
pop <- sample(3, 657, replace = TRUE) #generate random 3 population partition
pop[which(pop == 1)] <- "pop1"
pop[which(pop == 2)] <- "pop2"
pop[which(pop == 3)] <- "pop3"
capreolus <- capreolusGPA$coords #extract landmark coordinates

Following the Procrustes superimposition is the calculation of mean shapes, both for all males and for separate populations. After mean shapes are calculated the only thing left is to use TPS in order to deform outlines (variable d), using mean shape of all males as a reference and mean shape of populations as target. This can all be done using Morpho R-package from Stefan Schlager.

Mean shapes and TPS deformations

meanCap <- mshape(capreolus)
meanPop1 <- mshape(capreolus[,,which(pop == "pop1")]) #by population
meanPop2 <- mshape(capreolus[,,which(pop == "pop2")])
meanPop3 <- mshape(capreolus[,,which(pop == "pop3")])
pop1 <- data.frame(tps3d(as.matrix(d), meanCap, meanPop1)) #for each population
pop2 <- data.frame(tps3d(as.matrix(d), meanCap, meanPop2))
pop3 <- data.frame(tps3d(as.matrix(d), meanCap, meanPop3))
capWhole <- rbind(pop1, pop2, pop3) #combine data
pops <- c(rep("pop1", 2836), rep("pop2", 2836), rep("pop3", 2836)) #outline has 2836 points

Finally, depicting shape changes can be achieved by wonderful Hadley Wickham`s ggplot2 R-package. This package has a neat way of “forcing” you to keep your data organized, so all variables are inside one data frame, both quantitative and qualitative.

ggplot2 plotting of shape outline deformations

wholeCap <- data.frame(capWhole, pops) #deformed outlines and population membership
theme_set(theme_bw()) #change default ggplot theme to b&w
dplot <- ggplot(wholeCap, aes(wholeCap[,1],wholeCap[,2], group = pops)) #initialize ggplot object
dplot <- dplot + geom_path(size = 1, aes(color = pops)) + facet_grid(.~pops) #add layers
dplot + theme(axis.title = element_blank(), axis.text = element_blank(), axis.ticks = element_blank())

'Outlines'

By inspecting outlines it can be seen that the individuals from pop1 are the smallest while the ones from pop2 are the largest. Shape differences are also determined by the relationship of length to width, so that individuals from pop2 have the widest crania, while the ones from pop1 have the narrowest. Also, it can be seen that in the individuals with the largest crania, size differences are detemined mostly by dimensions of the anterior part, maxillary and rostral regions, that are both wider and longer with respect to individuals with smaller crania. Posterior part of the cranium is more similar between individuals from different populations, and it may be more stable.

Blog Archives Newer →

Creative morphometrics

Using R and python for colorful research of biological shape

R Generator and a Colorful PCA

XY Outlines in GIMP and imageJ

Random Landmark Monsters in Python

Colorful Outlines for Shape Comparison