As simple as it may seem, sample data generation is not a trivial task, especially when random landmarks are to be generated. Usually, one would use multivariate normal distribution-based generator (like mvrnorm in R`s MASS package) in order to generate correlated data. The following function was used to generate data similar to the real world datased, unfortunately based directly on it, by using the coefficients from regressions between successive columns in a data matrix, which represent landmark coordinates in XY data matrix. The following function uses this data matrix (446 individuals and 28 landmarks), and performs regressions between X-Y pairs for all coordinates. Finally it uses intercepts and slopes to infer mean and SD for rnorm function, random number generator.
Resampler function for random resamples of a real data matrix
resampler <-function(mat)#simulations based on rnorm random sampling{ x <- dim(mat)[1] y <- dim(mat)[2] indexrow <- c(2:x) combinations <- matrix(c(1:y,2:y), ncol =2)#ignore the warning message slope <- numeric(y) intercept <- numeric(y) sds <- numeric(y)for(i in1:y){ veca <- mat[,combinations[i,]][,1] vecb <- mat[,combinations[i,]][,2] model <- lm(veca~vecb) slope[i]<- coef(model)[2] intercept[i]<- coef(model)[1] sds[i]<- sd(mat[,combinations[i,]][,1])} coefs <- data.frame(intercept,slope,sds) cexox <- data.frame(c(1:x))for(i in1:y){ dataCol <- rnorm(length(mat[,combinations[i,]][,2]),mean=intercept[i]+slope[i]*mat[,combinations[i,]][,2],sd=sds) cexox <- cbind(cexox, dataCol)} sampleMatrix <- as.matrix(cexox) sampleMatrix <- sampleMatrix[,-1]return(sampleMatrix)}resampledCap <- resampler(capreolusMatrix)#resample the original matrix-generate random coordinates
When the function finishes the output is also an XY matrix, which needs to be converted to an array in order to use it in gpagen function from the geomorph package. After that the procedure follows all the usual steps of the GM analysis, with the exception of factor levels generation in order to simulate grouping, and finally performing PCA on the Procrustes shape variables.
Basic GM procedures and factor level generation
12345678910
library(geomorph)capreolusArray <- arrayspecs(resampledCap,28,2, byLand =FALSE)capreolusGPA <- gpagen(capreolusArray, ShowPlot =FALSE)pop <- sample(5,446, replace =TRUE)#generate random 5 population partitionpop[which(pop ==1)]<-"pop1"pop[which(pop ==2)]<-"pop2"pop[which(pop ==3)]<-"pop3"pop[which(pop ==4)]<-"pop4"pop[which(pop ==5)]<-"pop5"capreolusGPA2d <- two.d.array(capreolusGPA$coords)#get the data in XY format for PCA
PCA is then done using the usual R`s prcomp function and ggplot2 for plotting the data points using fantastic ColorBrewer color schemes (which are the names of types and palettes in ggplot2 scale_color_brewer geom). In order to fine-tune the PCA figure, the ggplot2 can also use custom fonts for plot annotation. Prior to that, fonts must be imported and registered, which is greatly facilitated by using the extrafont library.
PCA and ggplot2 code for a PCA scatterplot
1234567891011121314151617181920
capPCAwhole <- prcomp(capreolusGPA2d)capPCA <- data.frame(capPCAwhole$x[,1], capPCAwhole$x[,2], capPCAwhole$x[,3], capPCAwhole$x[,4])capPCA <- data.frame(capPCA, pop)names(capPCA)<- c("PC1","PC2","PC3","PC4","pop")#prepare a data.frame for ggplot2meanPCA1 <- aggregate(capPCA[,1], mean, by = list(capPCA[,5]))#calculate average PC score per group for plottingmeanPCA2 <- aggregate(capPCA[,2], mean, by = list(capPCA[,5]))meanPCA <- data.frame(meanPCA1, meanPCA2[,2])names(meanPCA)<- c("pop","PC1","PC2")library(extrafont)#for using i.e. Times New Roman Fonts in ggplotsfont_import(pattern="[T/t]imes")#this imports Times font familyloadfonts(device="pdf")library (ggplot2)theme_set(theme_bw())pcaplot <- ggplot(capPCA, aes(x=PC1, y=PC2, group = pop))+ geom_point(size =7, shape =19, aes(color=pop))+ scale_color_brewer(palette="Set1")pcaplot <- pcaplot + theme(panel.grid.major = element_line(size =0.8, linetype =2))+ theme(panel.grid.minor = element_line(size =1, linetype =2))pcaplot <- pcaplot + theme(text=element_text(size=20, family="Times New Roman"), legend.text=element_text(size =22, family ="Times New Roman"), legend.title = element_text(family ="Times New Roman"))+ xlab("PC1")+ ylab("PC2")pcaplot <- pcaplot + geom_point(data = meanPCA, size =14, shape =19)+ geom_text(data = meanPCA, size =10, label = meanPCA$pop, family ="Times New Roman", vjust =-0.9)pcaplot
The plot indicates very little differentiation between the populations, but I guess that`s well expected since so much randomness is at hand.
This post continues on the first post about outline deformations and presents a simple way for generating outlines as XY data. XY format is very convenient for representing outline and line art in R or python. This type of visualization can be produced using GIMP and imageJ through a sequence of easy steps. GIMP enables easy extraction of the central object from the image background. First step in outline extraction should be the precise definition of the object boundaries, such that the contrast between the object and the background is maximal. Example picture for this post is available here. After opening the picture in GIMP workspace it looks like in Figure 1.
The easiest way to separate this chamois cranium from its background is to use Scissor Select Tool from the GIMP toolbox. This tool can nicely track the path between sucessive control points that should be placed along the desired object (Figure 2). When control points are placed along the shape, selection can be completed by pressing enter, and the background can be deleted by inverting selection pressing Ctrl+I. and deleting it using the delete key (Figure 3). Selection should be inverted once more and converted to path; from the Select menu To Path. Finally, stroke path (Figure 4, path card and right click on selection, Stroke path) gives the desired outline that can be exported as .tiff for imageJ through Export in the File menu.
In imageJ, picture should be converted to binary in Process menu, by selecting Binary and Make Binary. Final step is saving the binary image selection as XY data. First the outline should be selected by the magic wand selection tool (Figure 5) and then the image should be saved in XY format, in File menu, Save as and select XY coordinates. The XY data is also available here.
Now the .txt file could be easily imported into R and python.
Generating random datasets is facilitated by powerful algorithms in both R and python that are capable of drawing random numbers from predefined distributions. For simulating geometric morphometric data the only important constraint is correlation between x and y coordinates of landmarks. Spacing of landmarks in also important, especially if the aim is to generte life-like data set, but it is not essential (hopefully) for morphometric analyses to work. Python scipy stack (numpy, scipy, pandas and matplotlib + Ipython console) is a great budnle which enables both random number generators and great visualizations. The code presented in this post uses all of these libraries, especially pandas, which offer the basic data structure used for storing landmark data, DataFrame. Also, it can just be pasted into Ipython console window (afer installing any of the scipy stack python distributions). The basic idea is to use uniform distribution random number generator (line 7) in order to generate ranges for the spacing of landmark groups, and then multivariate normal random distribution for creating the two-column correlated data (x, y coordinates, line 8). This code will simulate the situation with 10 landmarks recorded on 200 individuals. Spacing of landmarks can be controlled by the ranges of uniform generators (175-220 in code), and the spread of landmarks is contolled by input covariance matrix for multivariate normal generators ([[3,0],[0,3]] in code).
Importing libraries and generating random landmarks
12345678910111213141516
importnumpyasnpimportpandasaspdimportbrewer2mplasbmplimportmatplotlib.pyplotaspltimportscipy.interpolatecoordstest=np.vstack([np.random.uniform(175,220,10),np.random.uniform(175,220,10)]).T#genereta coordinate ranges - spacing of landmarkscoords=np.vstack([np.random.multivariate_normal(coordstest[i,:],[[3,0],[0,3]],200)foriinrange(10)])#correlated x and y from multivariate normalcoordinates=(np.arange(0,10).reshape(-1,1)*np.ones(200).reshape(1,-1)).flatten()#generate factor coodinates each landmark has 200 recorded pointscoordinates=pd.Series(coordinates)allCoords=pd.DataFrame(coords,columns=['x','y'])#coordinates DataFrameallCoords['coordinates']=coordinatesmeanCoords=allCoords.groupby('coordinates').mean()#mean landmark coordiantesind=np.arange(0,10)ind=pd.Series(ind)meanCoords['ind']=ind
allCoords contain all generated landmarks, while meanCoords is the collection of mean coordinates for each landmark. In order to visualize the generated data, a convenient “outline” can be drawn, connecting all landmarks smoothly. Since generated landmarks are not ordered in a way that permits simple connect-the-dots line between them, they should be rotated, and ordered differently. This can be done by using the centroid coordinates (from all landmarks) and the polar angle between the lines connecting centroid and each landmark. If polar angle is used, then landmarks (from meanCoords) can be ordered according to its value, clockwise. Polar angle can be calculated as the arctan between x and y coordinates of mean landmarks and the centroid.
Calculating polar angle and re-ordering of the meanCoords
123456789
x=np.array(meanCoords['x'])#convenience functiony=np.array(meanCoords['y'])points=np.array((x,y)).Tcent=(np.sum(meanCoords['x'])/len(meanCoords),np.sum(meanCoords['y'])/len(meanCoords))#the overall centroidangle=np.arctan2(points[:,1]-cent[1],points[:,0]-cent[0])#summary polar angle of all pointspoints=pd.DataFrame(points,columns=['x','y'])angle=pd.Series(angle)points['angle']=anglepoints=points.sort('angle')#sort by polar angle from the centroid
If a path would be drawn between the mean landmarks now, it would be irregular, and not too informative. One way of constructing the smooth connection between landmarks is to use interpolation algorithms that are part of scipy.interpolate. One issue with this approach is the connection between the first and the last landmark, since it must be added to points DataFrame as eleventh landmark, with x,y coords the same as for the first one, in order to complete the path. This added segment behaves erratically during interpolation so the generated figures might be distorted. But this will not happen always and all program routines could be re-run as long as the desired, nice, result emerges.
Interpolation of the outline data
1234567891011
x_hull=np.array(points['x'])x_hull=np.append(x_hull,x_hull[0])#add first landmark as the last one (x)y_hull=np.array(points['y'])y_hull=np.append(y_hull,y_hull[0])#add first landmark as the last one (y)t=np.zeros(x_hull.shape)t[1:]=np.sqrt((x_hull[1:]-x_hull[:-1])**2+(y_hull[1:]-y_hull[:-1])**2)t=np.cumsum(t)t/=t[-1]nt=np.linspace(0,1,100)x1=scipy.interpolate.spline(t,x_hull,nt)#interpolated coordines for smooth linesy1=scipy.interpolate.spline(t,y_hull,nt)
Finally, plots are produced sequentially, first all landmarks, then mean landmarks, and finally the outline.
Plotting
1234567
plt.scatter(allCoords['x'],allCoords['y'],s=25,c="#FFD699",edgecolors='none')#plot all sampled pointsplt.scatter(x,y,c=ind,s=60,cmap=bmpl.get_map('Set3','qualitative',7).mpl_colormap)#plot means with color brewer paletteplt.plot(x1,y1,'--',c="#97CAFF",lw=3)#plot outlineslabels=['Landmark {0}'.format(i)foriinrange(10)]forlabel,x,yinzip(labels,points['x'],points['y']):#annotate mean landmarks by numbersplt.annotate(label,xy=(x,y+0.3),ha='right',va='bottom')plt.grid()
Generated “monsters” are just there to get the idea of a possible shape, although the create the impression of the “outline”, such that all landmarks are sampled from the outer perimeter of the object. The code presented would not be complete if it wasn`t for the help from people from stackoverflow (here, here and here).
This procedure is based on the outlines generated from the digital photos using imageJ and converting imageJ images to x-y continuous outline data available from here. Its goal is to present a visual overview of global shape differences between individuals from natural populations, using landmark data from ventral projection of their crania. All outlines are based on deformation via Thin Plate Splines, using mean shapes for populations as deformation targets and references. Superimposition methods as well as preliminary GM analyses were done in R and marvelous geomorph package by Dean Adams and Erik Otarola-Castillo. Additionally, since one of the common points of contemporary scientific research is the reproducibility of solutions offered all posts will also contain randomly generated sample data that could be used similarly to real-world datasets. Sample generation can be very useful, especially in teaching, so I intend to focus on it in future posts.
Importing libraries and generating the basic dataset (files should be placed in your working directory)
The workspace “capreolusRgen.RData” (which can be downloaded from here) contains several randomly generated datasetes of 657 individuals and 28 landmarks, named “capreolusSample#”. These data was generated on the basis of real-world values, using the linear regression model to control random number generators. The code that was used probably does not repoduce the sampling of landmarks well, especially regarding correlations between pairs of landmark coordinates or the landmarks that conform to the object symmetry, but for the purpose of illustration in this post, I hope they should be fine.
Following the Procrustes superimposition is the calculation of mean shapes, both for all males and for separate populations. After mean shapes are calculated the only thing left is to use TPS in order to deform outlines (variable d), using mean shape of all males as a reference and mean shape of populations as target. This can all be done using Morpho R-package from Stefan Schlager.
Finally, depicting shape changes can be achieved by wonderful Hadley Wickham`s ggplot2 R-package. This package has a neat way of “forcing” you to keep your data organized, so all variables are inside one data frame, both quantitative and qualitative.
By inspecting outlines it can be seen that the individuals from pop1 are the smallest while the ones from pop2 are the largest. Shape differences are also determined by the relationship of length to width, so that individuals from pop2 have the widest crania, while the ones from pop1 have the narrowest. Also, it can be seen that in the individuals with the largest crania, size differences are detemined mostly by dimensions of the anterior part, maxillary and rostral regions, that are both wider and longer with respect to individuals with smaller crania. Posterior part of the cranium is more similar between individuals from different populations, and it may be more stable.