The convex hull is one of the most widely used techniques for finding the minimum enclosing shape around a set of points in a plane, but some tasks require an enclosing shape that follows concave parts of the outline as well. Convex hulls are also sensitive to localized groups of points lying far apart from the majority of other points. In these cases the hull encloses a lot of empty space, so the possible underlying shape is obscured. In geometric morphometrics, outlines around a set of landmarks in the plane can give an idea of shape and are useful for visualization of the underlying shape (i.e. the shape described by these points). Such outlines may possibly be analysed further with the usual GM analytic procedures. The alpha shape[1] is a generalization of the convex hull for a set of points in the plane which can be used for shape reconstruction, since the frontier of an alpha shape is a linear approximation of the original shape. In R, this technique is implemented in the alphahull library. For this post, randomly generated points-in-the-plane data will be used.
Generating data and loading the library
library(alphahull)
sampleData <- data.frame(x = runif(30, 10, 50), y = runif(30, 10, 50)) #with uniform number generators
For a set of points, the alpha shape is based on the Delaunay triangulation and its corresponding Voronoi diagram, so these are computed first, followed by the alpha shape and the alpha hull.
Calculating Delaunay triangulation, Voronoi diagram and alpha shape of the sample data
The value of the parameter alpha is chosen with respect to the scale and spacing of the data points, since it represents the minimal distance governing the inclusion or exclusion of neighbouring points within the alpha shape procedure.
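The code for this step is not shown above, so the following is a minimal sketch of how the sampleDel and sampleAlpha objects used in the plotting code below could be computed with alphahull; the alpha value of 10 is an arbitrary choice for this random data and would normally be tuned to the point spacing.

sampleDel <- delvor(sampleData$x, sampleData$y) #Delaunay triangulation and corresponding Voronoi diagram
sampleAlpha <- ashape(sampleData$x, sampleData$y, alpha = 10) #alpha shape for a trial value of alpha
sampleHull <- ahull(sampleData$x, sampleData$y, alpha = 10) #alpha hull for the same alpha value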
Plotting of the corresponding shapes
plot(sampleDel, col = c(1, "darkturquoise", "grey80", 1), lwd = c(2, 2), cex = 0.1, xlab = NA, ylab = NA)
points(sampleData, col = "darkorange", pch = 16, cex = 1.5) #Plot of the Delaunay
plot(sampleAlpha, wlines = "del", col = c("goldenrod4", 1, "darkturquoise"), lwd = c(2, 4), cex = 0.1, xlab = NA, ylab = NA)
points(sampleData, col = "darkorange", pch = 16, cex = 1.3) #Plot of the alpha shape
When the value of alpha is selected taking the spacing between the points into account, the resulting enclosing outline approximates the most probable shape of the point set (Figure 1, Figure 2).
Edelsbrunner, H., Kirkpatrick, D.G. and R. Seidel. 1983. On the shape of a set of points in the plane. IEEE Transactions on Information Theory, 29: 551-559.↩
Continuing the post about climate data extraction in R, this post demonstrates a simple approach to using PLS (partial least squares) for determining the common covariance patterns between morphology and mean monthly temperature (although the model is likely to be bad). Morphology data can be found here. This dataset contains only the PCA scores of individuals for the first two PC axes (Figure 1), extracted from the Fourier descriptors of horn shape. The PC1 axis (92%) clearly separates males and females on the basis of their respective horn shapes.
Temperature data are extracted using geoTiffs from WorldClim database, as in the mentioned post about climate data extraction. The ready-made dataset of mean monthly temperatures for the above study locality can be found here.
Importing data and plotting the Figure 1 PCA
fourierPC <- read.table("~/fourierPC.txt", header = TRUE, sep = ",") #by default they are in the home directory
matTaraMean <- read.table("~/matTaraMean.txt", header = TRUE, sep = ",")
library(ggplot2)
theme_set(theme_bw())
sex <- c("f","f","m","m","f","m","m","m","f","m","f","f","m","f","f","m","f","f","f","m","f","m","f","f","m","f") #individual sex is known in advance
ggplot(fourierPC, aes(PC1, PC2)) + geom_point(size = 5, shape = 19, aes(color = sex))
PLS is a useful multivariate technique for determining the common variation patterns in two blocks of data and is sometimes referred to as PLS regression. Of all the PLS implementations in R, this post uses the fabulous plsdepot library, developed and maintained by Gaston Sanchez, whose blog/personal page and work in general were a great inspiration for Creative Morphometrics. His approach is very well explained and documented over at his page, so only the direct application to the data above will be provided here.
PLS analysis with the plsdepot library
library(plsdepot)
taraMat <- matTaraMean[sample(221, 26), ] #extract 26 samples from 221 raster grid values
climatePLS <- plsreg1(taraMat, fourierPC$PC1, comps = 3)
plot(climatePLS) #plot circle of correlations in order to determine the influence of predictor variables
It is obvious from Figure 2 that the variables in question share no common variation pattern and are totally unrelated. If the chosen data were better, some of the blue lines (predictors) would run parallel to the orange one (response). If the R2 value for this model is examined (it is part of the climatePLS object, climatePLS$R2), the unrelatedness of predictors and response in this model gets even more obvious (R2 = 0.035). Predictors in this model are also highly correlated among themselves, which renders them rather useless for prediction, as was stated at the onset.
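As a quick check of both points, assuming the objects created in the block above, the per-component R2 values and the correlations among the monthly predictors can be inspected directly:

climatePLS$R2 #R2 of the response explained by each extracted component
cor(taraMat) #monthly mean temperatures are strongly intercorrelated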
Quantification of biological shape often relies on manual extraction of information from a set of digital images or 3D models, such as manual landmark placement or outlining the object of interest by hand. With a computer vision approach and, especially, with the help of the openCV library, a set of image manipulation techniques as well as automated feature extraction are facilitated to a large extent. Since there are excellent python bindings for this library, it is easy to test ideas for image analysis and object extraction in the context of geometric morphometrics.
This post will serve as a general introduction to openCV in python, and will continue the earlier posts about outline extraction from digital photos, which is common in GM research. The sample image can be obtained here.
Library import, reading and displaying the image
import numpy as np
import cv2 #this is the main openCV class, the python binding file should be in /pythonXX/Lib/site-packages
from matplotlib import pyplot as plt
gwash = cv2.imread("~/taster1.JPG") #import image
gwashBW = cv2.cvtColor(gwash, cv2.COLOR_BGR2GRAY) #change to grayscale
plt.imshow(gwashBW, 'gray') #this is the matplotlib solution (Figure 1)
plt.xticks([]), plt.yticks([])
plt.show()
cv2.imshow('gwash', gwashBW) #this is for native openCV display
cv2.waitKey(0)
This image is suitable for automated outline extraction of the skull, since there is a significant amount of contrast between the object and the background. The idea is to use thresholding and a bit of erosion around the edges to delimit the skull from the rest of the picture, and then apply automated contour extraction.
Image threshold, erosion and contour extraction
ret, thresh1 = cv2.threshold(gwashBW, 15, 255, cv2.THRESH_BINARY) #the value of 15 is chosen by trial-and-error to produce the best outline of the skull
kernel = np.ones((5,5), np.uint8) #square image kernel used for erosion
erosion = cv2.erode(thresh1, kernel, iterations = 1) #refines all edges in the binary image
opening = cv2.morphologyEx(erosion, cv2.MORPH_OPEN, kernel)
closing = cv2.morphologyEx(opening, cv2.MORPH_CLOSE, kernel) #this is for further removing small noises and holes in the image
plt.imshow(closing, 'gray') #Figure 2
plt.xticks([]), plt.yticks([])
plt.show()
contours, hierarchy = cv2.findContours(closing, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) #find contours with simple approximation
cv2.imshow('cleaner', closing) #Figure 3
cv2.drawContours(closing, contours, -1, (255,255,255), 4)
cv2.waitKey(0)
Finally, since other objects of no interest are also outlined in Figure 3, the skull outline must be identified in the contours list. This can be done with a for loop and the openCV method for calculating contour areas, since the skull outline encloses the largest continuous area in the image.
Calculating outline areas and extracting the skull outline
areas = [] #list to hold all areas
for contour in contours:
    ar = cv2.contourArea(contour)
    areas.append(ar)
max_area = max(areas)
max_area_index = areas.index(max_area) #index of the list element with largest area
cnt = contours[max_area_index] #largest area contour
cv2.drawContours(closing, [cnt], 0, (255,255,255), 3, maxLevel = 0)
cv2.imshow('cleaner', closing)
cv2.waitKey(0)
cv2.destroyAllWindows()
The extracted contour is a numpy array with the xy coordinates of contour points in rows, so it can be used directly for sampling landmarks along the outline, fitting the normalized Fourier transform and further shape analysis. Of course, the procedure can easily be generalized to work over all images in a dataset, extract the contour with the largest area from each image, and generate a dataset for shape analysis, as in the sketch below.
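A minimal sketch of that generalization, assuming openCV 2.x as in the code above; the folder path and file pattern are hypothetical and the threshold value is simply reused from the single-image example.

import glob
import os
import numpy as np
import cv2

def largest_contour(path, thresh_val = 15): #largest-area contour from one image
    img = cv2.imread(path)
    bw = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, th = cv2.threshold(bw, thresh_val, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5,5), np.uint8)
    cleaned = cv2.erode(th, kernel, iterations = 1)
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_OPEN, kernel)
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
    contours, hierarchy = cv2.findContours(cleaned, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
    areas = [cv2.contourArea(c) for c in contours]
    return contours[areas.index(max(areas))] #numpy array of outline point coordinates

outlines = [largest_contour(f) for f in glob.glob(os.path.expanduser("~/skulls/*.JPG"))] #one outline per image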
Since Procrustes superimposition introduces the data into a curved shape space which corresponds to a hyperhemisphere of radius 1 with numerous dimensions (as defined by Rohlf, 1999[1]), the usual multivariate statistical methods are not readily applicable to such data. These data must be projected onto a flat, Euclidean tangent space that touches the hyperhemisphere in one point, taken as the mean shape of the whole dataset. This projection can be applied only if the variation in shapes is not overly large, so that their "shadows" on the flat space are not too far apart. In this post, data will be generated as usual and the Procrustes procedures will be taken from previous posts, using the previously defined functions centsize, transScale, mshapr, pProc and pGP.
Importing libraries, generating data and utility functions
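The code for this step is condensed here into a sketch that assumes centsize, transScale, mshapr, pProc and pGP are defined exactly as in the previous posts; it also assumes that PCA comes from scikit-learn, that pyc refers to the Pycluster library and that brewer2mpl supplies the color palettes used in the plots further below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.spatial.distance as sd
from sklearn.decomposition import PCA #PCA decomposition used below
import Pycluster as pyc #kmeans clustering used below
import brewer2mpl #color palettes for the 3D plot
from mpl_toolkits.mplot3d import Axes3D #registers the 3D projection in matplotlib

#random landmark data generated as in the previous posts: 200 individuals, 10 2D landmarks
coordstest = np.vstack([np.random.uniform(175, 220, 10), np.random.uniform(175, 220, 10)]).T
coords = np.vstack([np.random.multivariate_normal(coordstest[i,:], [[3,0],[0,3]], 200) for i in range(10)])
coordinates = pd.Series((np.arange(0,10).reshape(-1,1)*np.ones(200).reshape(1,-1)).flatten())
individuals = np.tile(np.arange(0,200), 10)
allCoords = pd.DataFrame(coords, columns = ['x','y'])
allCoords['coordinates'] = coordinates
allCoords['individuals'] = individuals
allCoords = allCoords.sort_index(by = ['individuals','coordinates'])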
After Procrustes superimposition, data can be projected either orthogonally or stereographically. The more common method is the orthogonal projection, which minimizes large shape differences between landmark configurations. The procedure is based on Rohlf (1999): the matrix of aligned, centered preshapes is multiplied by the difference between an identity matrix of comparable dimensionality and the outer product of the mean shape vector scaled to unit centroid size (Claude, 2008).
Orthogonal projection of superimposed data, Procrustes superimposition and projection
def ortproj(A, numland, dim, numind): #A is a pandas DataFrame of superimposed coordinates, numland = number of landmarks, numind = number of individuals
    mshape = A.iloc[:,0:2].groupby(A.iloc[:,2]).mean()
    mshStd = mshape / centsize(mshape)
    mshMel = pd.melt(mshStd).drop('variable', 1).T
    #mshRep = pd.concat([mshMel]*numind, ignore_index = True) #calculate repeating size standardized meanShape
    mshRep = np.dot(np.ones([numind,1]), mshMel)
    simDia = np.identity(numland*dim) #identity matrix
    #lonWid = A[["x","y"]].values.reshape(numind,numland*dim) #must have column names "x" and "y" for coordinates
    lonWidX = A["x"].values.reshape(numind, numland)
    lonWidY = A["y"].values.reshape(numind, numland)
    lonWid = np.concatenate((lonWidX, lonWidY), axis = 1)
    temp1 = simDia - np.dot(mshMel.T, mshMel)
    Xi = np.dot(lonWid, temp1) #orthogonal projection according to Rohlf
    Xproj = Xi + mshRep
    return Xproj

procCoords = pGP(allCoords, 200, 2, 10) #perform Procrustes superimposition
procoordinates = np.tile(np.arange(0,10), 200)
procindividuals = allCoords['individuals']
procCoords['coordinates'] = procoordinates
procCoords['individuals'] = np.sort(procindividuals)
projCoords = ortproj(procCoords, 10, 2, 200) #perform orthogonal projection
The numpy array projCoords holds a wide representation of the projected data (rows are individuals, columns are all landmarks, first all x and then all y coordinates), ready to be used in PCA or any other multivariate method. In order to illustrate potential grouping in the PCA morphospace, since the data is randomly generated, a kmeans clustering (pyc.kcluster below) with 3 groups will be performed, just to split the data effectively. PCA can be done in python in numerous ways, but in this post the PCA decomposition from the scikit-learn package is used.
PCA, cluster analysis and plots
pca = PCA(n_components = 3)
pca.fit(projCoords)
print(pca.explained_variance_ratio_)
projCoordsP = pca.transform(projCoords) #get the individual scores on PC axes
clusters = pd.Series(pyc.kcluster(projCoordsP, nclusters = 3, method = 'a', dist = 'e')[0])
set2 = brewer2mpl.get_map('RdYlBu', 'diverging', 3).mpl_colors #generate nice colors
projCoordsPa = np.column_stack((projCoordsP, clusters)) #concatenate all for sequential plot-building
group1PCA = projCoordsPa[projCoordsPa[:,3] == 0] #split by group for sequential plot-building
group2PCA = projCoordsPa[projCoordsPa[:,3] == 1]
group3PCA = projCoordsPa[projCoordsPa[:,3] == 2]
fig = plt.figure()
ax = fig.add_subplot(111, projection = '3d')
ax.plot(group1PCA[:,0], group1PCA[:,1], group1PCA[:,2], 'o', c = set2[0], alpha = 1, label = 'Group1')
ax.plot(group2PCA[:,0], group2PCA[:,1], group2PCA[:,2], 'o', c = set2[1], alpha = 1, label = 'Group2')
ax.plot(group3PCA[:,0], group3PCA[:,1], group3PCA[:,2], 'o', c = set2[2], alpha = 1, label = 'Group3')
ax.set_title("PCA with grouping according to kmeans 3 group solution")
plt.legend(loc = 'upper left')
ax.set_xlabel('PC1 (14.45%)')
ax.set_ylabel('PC2 (13.29%)')
ax.set_zlabel('PC3 (12.46%)')
plt.show()
Since the data is randomly generated, there is not much structured variation in it, as the first three PC axes describe 14.45%, 13.29% and 12.46% of the total sample variability, respectively. On the other hand, kmeans forces the data into three groups, which is useful for demonstrating plotting in matplotlib with 3D axes and a grouping structure in PCA scatterplots (Figure 1) similar to real-world data.
Rohlf, J.F. 1999. Shape statistics: Procrustes superimpositions and tangent spaces. Journal of Classification, 16: 197-223.↩
Angles were important for some of the most interesting research in morphometrics. The approach of Rao and Suryawanshi (1998)[1] introduced the angles within landmark triangulation grids as an alternative to Procrustes superimposition for extracting shape data. This post does not perform the procedures described by Rao (which are ported to R in Morphometrics with R by J. Claude, 2008), but instead aims at extracting angle data from configurations of landmarks after Procrustes superimposition. The code presented here is motivated by the polarRotator function from some of the previous python-related posts, and it presents landmark data as polar angles. These angles represent the angular deviation of each landmark from the centroid of its respective configuration (Figure 1), so that the variability of landmark positions with respect to the centroid coordinates may represent their shape differences. Since the superimposition procedure sets all configuration centroids to the origin, the centroids are all at (0, 0). The dataset used is the same as in the triangulation post.
Plotting the circle, centroid axes and mean configuration
library(geomorph)
library(ggplot2)
library(reshape2)
library(plyr)
theme_set(theme_bw())
load("capSample.RData")
capGPA <- gpagen(capSample)
capCoords <- capGPA$coords
mshapeCap <- mshape(capCoords) #mean configuration is great for illustrations
capCentro <- capGPA$Csize
circleFun <- function(center = c(0,0), npoints = 100) #drawing an enclosing circle set at the origin (0,0)
{
  r = 0.5
  tt <- seq(0, 2*pi, length.out = npoints)
  xx <- center[1] + r * cos(tt)
  yy <- center[2] + r * sin(tt)
  return(data.frame(x = xx, y = yy))
}
enclosing <- circleFun(npoints = 100) #it should be centered at zero and have radius 0.5
ggCirc <- ggplot(enclosing, aes(x = x, y = y)) + geom_path(size = 1, color = "darkgrey", linetype = 1)
ggCirc <- ggCirc + geom_hline(size = 1, color = "darkgrey", linetype = 1) + geom_vline(size = 1, color = "darkgrey", linetype = 1)
linesAng <- data.frame(x1 = c(0,0,0,0), y1 = c(0,0,0,0), x2 = mshapeCap[,1], y2 = mshapeCap[,2]) #point coordinates
linesAng <- data.frame(linesAng, x3 = linesAng$x2 * 20, y3 = linesAng$y2 * 20) #so that the lines would extend through landmarks
ggCirc <- ggCirc + geom_segment(data = linesAng, aes(x = x1, y = y1, xend = x3, yend = y3), colour = "lightblue", size = 1)
ggCirc <- ggCirc + geom_point(size = 4.2, color = "orange", data = data.frame(mshapeCap), aes(x = X1, y = X2))
ggCirc <- ggCirc + geom_text(data = data.frame(mshapeCap), aes(x = X1, y = X2, label = c(1:16)), vjust = -1.2)
ggCirc <- ggCirc + coord_cartesian(xlim = c(-0.55, 0.55), ylim = c(-0.55, 0.55))
ggplot2 has one great coordinate system, coord_polar(), that will transform any plot from Cartesian to polar coordinates. Unfortunately it does not report the polar angles themselves, but the plot is informative (Figure 2), since landmarks are ordered according to their angular deviation from 0, i.e. from the X and Y axes defining the 0.5 centroid-centered circle.
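A minimal sketch of that transformation, assuming the ggCirc plot built above (adding coord_polar() replaces the Cartesian coordinate system):

ggCirc + coord_polar() #re-draws the circle plot in polar coordinates (Figure 2)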
Polar angles can be calculated using the atan2 (arctangent) function, which takes two arguments, y and x, and gives one point estimate of the polar angle in radians. This number is really the angle between the positive x-axis of the plane (X from Figure 1) and the point given by its xy coordinates. Since after superimposition the centroids of all landmark configurations lie at the origin, it is sufficient to use only the xy coordinates of the landmarks themselves.
Polar angles with atan2 function
polarAngler <- function(A) #function to calculate polar angles from the array
{
  siz <- dim(A)[3]
  len <- dim(A)[1]
  polarAngles <- vector("list", siz)
  for (i in 1:siz) {
    temp <- atan2(A[,2,i], A[,1,i])
    polarAngles[[i]] <- temp
  }
  polarAngles <- array(unlist(polarAngles), dim = c(len, 1, siz))
  return(polarAngles)
}
library(circular)
library(colortools)
pAngles <- polarAngler(capCoords) #angles for all configurations
mAngles <- apply(pAngles, 1, mean) #mean angles for each landmark
col1 <- pals("fish") #palettes from colortools
col2 <- pals("ocean")
col3 <- pals("mystery")
colorPoint <- c(col1, col2, col3, "#9A76F5") #16 colors for landmarks in the wheel
plot.circular(circular(t(mAngles), type = "angles"), sep = 0.0003, col = colorPoint, cex = 1.5)
legend(0.3, 0.6, legend = 1:16, fill = colorPoint, cex = 0.7)
Figure 3 shows the distribution of landmarks along a circular path (from -π to π). It is obvious that landmarks form two distinct groups, one above and one below the x-axis, with the exception of landmarks 7, 8 and 10, which are the most parallel to the x-axis. The proportion of the overall shape variability left in these angles must be thoroughly checked, but that is not easy, since angle data cannot be analysed by conventional statistics, only by circular or compositional methods. R provides some libraries for this, circular and CircStats, which are both derived from the book of Jammalamadaka and Sengupta, "Topics in Circular Statistics" (2001). If one more individual is added to the circle plot (Figure 3), it can be visually inspected how much its coordinates (landmark polar angles) deviate from the sample mean (Figure 4).
Adding individual 10 to circular plot
ce <- circular(rbind(t(mAngles), pAngles[,,10])) #deviation of individual 10 from the mean
plot.circular(ce, sep = 0.0003, col = colorPoint, cex = 1.5)
legend(0.3, 0.6, legend = 1:16, fill = colorPoint, cex = 0.7)
Some variability is present around landmarks 10, 6 and 5, while others are nearly or completely overlapping. The variability may not be large, but it will be checked in future posts, as well as a possible PCA of the angular data.
Rao, C.R. and S. Suryawanshi. 1998. Statistical analysis of shape of objects based on landmark data. PNAS, 93: 12132-12136.↩
Procrustes superimposition is the first analytic step in geometric morphometrics, and this post shows one possible solution for performing it in Python[1], using several functions defined in previous posts. The first steps include data generation and the definition of functions important for extracting shape variables from randomly generated landmark data. These functions are centsize, which calculates centroid size from a numpy array (xy data), and transScale, which translates landmark coordinates to the origin of the coordinate system and scales them to unit centroid size. The mshapr function is needed in the main superimposition function since it calculates successive mean shapes (over all configurations), excluding the configuration that is currently being rotated.
Import libraries, function definition and data generation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.spatial.distance as sd
def centsize(M): #centroid size
    p = M.shape[0]
    csize = np.sqrt(np.sum(M.var(0)) * (p-1))
    return csize
def transScale(M): #translation and scale
    tM = M - M.mean()
    centSize = centsize(M)
    tM = tM / centSize
    return tM
def mshapr(pdf): #pdf is a pandas DataFrame with coordinates and individuals columns
    meanShapes = pd.DataFrame()
    dimen = len(allCoords.individuals.unique())
    for ind in range(0, dimen):
        temp = pdf[pdf.individuals != ind]
        meanShape = temp.iloc[:,0:2].groupby(temp.iloc[:,2]).mean()
        meanShapes = meanShapes.append(meanShape)
    return meanShapes
coordstest = np.vstack([np.random.uniform(175, 220, 10), #usual data generation
                        np.random.uniform(175, 220, 10)]).T
coords = np.vstack([np.random.multivariate_normal(coordstest[i,:], [[3,0],[0,3]], 200) for i in range(10)])
coordinates = pd.Series((np.arange(0,10).reshape(-1,1)*np.ones(200).reshape(1,-1)).flatten()) #coordinates column
individuals = np.tile(np.arange(0,200), 10) #individuals column
allCoords = pd.DataFrame(coords, columns = ['x','y'])
allCoords['coordinates'] = coordinates
allCoords['individuals'] = individuals
allCoords = allCoords.sort_index(by = ['individuals','coordinates'])
Definition of the pProc function follows the same rules as in the previous post, with a robust (and, for now, ugly) if clause that is included only to allow either a pandas DataFrame or a numpy array as the input data (both for the configurations and for the mean shape objects). It also returns only the rotated matrix, i.e. the landmark configuration (pmat1), and not the mean shape.
def pProc(mat1, mat2): #returning only centred preshape of mat1 on mat2
    if type(mat1) is np.ndarray and type(mat2) is np.ndarray:
        k = mat1.shape[1] - 1
        m = mat1.shape[0]
        sscaledMat1 = mat1 / centsize(mat1)
        sscaledMat2 = mat2 / centsize(mat2)
        z1 = sscaledMat1 - [sscaledMat1[:,0].mean(), sscaledMat1[:,1].mean()]
        z2 = sscaledMat2 - [sscaledMat2[:,0].mean(), sscaledMat2[:,1].mean()]
        tempSv = np.dot(np.transpose(z2), z1)
        svdMat = np.linalg.svd(tempSv)
        U = svdMat[0]
        V = svdMat[2]
        Ds = svdMat[1]
        tempSig = np.linalg.det(np.dot(np.transpose(z1), z2))
        sig = np.sign(tempSig)
        Ds[k] = sig * np.absolute(Ds[k])
        U[:,k] = sig * U[:,k]
        Gam = np.dot(V, np.transpose(U))
        beta = np.sum(Ds)
        pmat1 = np.dot(z1, Gam)
        pmat1 = pd.DataFrame(pmat1, columns = ['x','y'])
        return pmat1
    else:
        mat1x = mat1.iloc[:,0]
        mat1y = mat1.iloc[:,1]
        mat1 = np.concatenate([mat1x, mat1y], axis = 1).reshape(mat1.shape[1], mat1.shape[0]).T
        mat2x = mat2.iloc[:,0]
        mat2y = mat2.iloc[:,1]
        mat2 = np.concatenate([mat2x, mat2y], axis = 1).reshape(mat2.shape[1], mat2.shape[0]).T
        k = mat1.shape[1] - 1
        m = mat1.shape[0]
        sscaledMat1 = mat1 / centsize(mat1)
        sscaledMat2 = mat2 / centsize(mat2)
        z1 = sscaledMat1 - [sscaledMat1[:,0].mean(), sscaledMat1[:,1].mean()]
        z2 = sscaledMat2 - [sscaledMat2[:,0].mean(), sscaledMat2[:,1].mean()]
        tempSv = np.dot(np.transpose(z2), z1)
        svdMat = np.linalg.svd(tempSv)
        U = svdMat[0]
        V = svdMat[2]
        Ds = svdMat[1]
        tempSig = np.linalg.det(np.dot(np.transpose(z1), z2))
        sig = np.sign(tempSig)
        Ds[k] = sig * np.absolute(Ds[k])
        U[:,k] = sig * U[:,k]
        Gam = np.dot(V, np.transpose(U))
        beta = np.sum(Ds)
        pmat1 = np.dot(z1, Gam)
        pmat1 = pd.DataFrame(pmat1, columns = ['x','y'])
        return pmat1
Finally, the pGP function can perform partial Procrustes superimposition for any number of individuals (landmark configurations) and landmarks. The generated data has 200 individuals and 10 2D landmarks. For now this function asks for that information in the function call, so mmat1 is a pandas DataFrame with x, y, coordinates and individuals columns (generated above), numind is the number of individuals, dim is 2 (3D is not yet supported), and numland is the number of landmarks. For clearer understanding of the following code, comments are included where appropriate.
def pGP(mmat1, numind, dim, numland):
    groupCoords = mmat1.iloc[:,0:2].groupby(mmat1.iloc[:,3])
    transCoords = groupCoords.apply(transScale) #translate and scale the sample data
    individuals = np.tile(np.arange(0, numind), numland)
    coordinates = pd.Series((np.arange(0, numland).reshape(-1,1)*np.ones(numind).reshape(1,-1)).flatten())
    transCoords['coordinates'] = coordinates
    transCoords['individuals'] = np.sort(individuals).astype(str)
    #this part of the code calculates Q - the convergence criterion
    #that is really the sum of the pairwise squared distances between all shapes in the sample
    #so that, after rotation, these distances are minimized as much as possible
    arrayx = transCoords.iloc[:,0]
    arrayy = transCoords.iloc[:,1]
    forDist = np.concatenate([arrayx, arrayy], axis = 1).reshape(dim, numland*numind).T
    forDist = forDist.reshape(numind, numland*2)
    Qm1 = sd.pdist(forDist) #distance function from scipy
    Q = Qm1.sum()
    while np.absolute(Q) > 0.00001: #execute the following code until Q cannot be reduced anymore
        groupTrans = transCoords.iloc[:,0:2].groupby(transCoords.iloc[:,3])
        meanShapes = mshapr(transCoords) #calculate DataFrame of mean shapes excluding one individual at the time
        meanShapes['individuals'] = np.sort(individuals).astype(str)
        meanGroups = meanShapes.iloc[:,0:2].groupby(meanShapes.iloc[:,2])
        tempRot = pd.DataFrame()
        #the following for loop performs partial Procrustes superimposition (pProc function from above) between
        #mean shapes and each configuration (individual)
        for ind in range(0, numind):
            tosto = pProc(groupTrans.get_group(str(ind)), meanGroups.get_group(str(ind)))
            tempRot = tempRot.append(tosto)
        tempX = tempRot.iloc[:,0]
        tempY = tempRot.iloc[:,1]
        tempDist = np.concatenate([tempX, tempY], axis = 1).reshape(dim, numland*numind).T
        tempDist = tempDist.reshape(numind, numland*2)
        Qm2 = sd.pdist(tempDist)
        Q = Qm1.sum() - Qm2.sum()
        Qm1 = Qm2 #re-calculate the convergence criterion at each step
        transCoords = tempRot
    return tempRot
The DataFrame tempRot holds the superimposed configurations (shape variables) that can subsequently be used in standard GM analyses and beyond. Of course, graphical display of superimposition results can be very interesting; this post gives only basic plots, while some further posts may include more visualizations. Figure 1 shows the original raw-generated data, Figure 2 the spatial relationship between raw and superimposed data, and Figure 3 only the superimposed data.
Plotting the data and the relationship of raw and superimposed data
procCoords = pGP(allCoords, 200, 2, 10) #perform superimposition
procoordinates = np.tile(np.arange(0,10), 200)
procCoords['coordinates'] = procoordinates
meanShape1 = allCoords.iloc[:,0:2].groupby(allCoords.iloc[:,2]).mean() #calculate mean shape for raw data
meanShape1 = pd.DataFrame(polarRotator(np.array(meanShape1))) #natural ordering for landmark labels
plt.scatter(allCoords['x'], allCoords['y'], s = 30, c = "#82CDFF", edgecolors = 'none') #plot original sampled points
plt.scatter(meanShape1[0], meanShape1[1], s = 50, c = "#7909E8", edgecolors = 'none')
plt.plot(meanShape1[0], meanShape1[1], '-', color = "#7909E8")
labels = ['Landmark {0}'.format(i) for i in range(10)]
for label, x, y in zip(labels, meanShape1[0], meanShape1[1]): #annotate mean landmarks by numbers
    plt.annotate(label, xy = (x, y+0.4), ha = 'right', va = 'bottom')
plt.grid()
meanShape2 = procCoords.iloc[:,0:2].groupby(procCoords.iloc[:,2]).mean() #calculate mean shape for superimposed data
meanShape2 = pd.DataFrame(polarRotator(np.array(meanShape2))) #natural ordering for landmark labels
plt.scatter(procCoords['x'], procCoords['y'], s = 30, c = "#FFD699", edgecolors = 'none')
plt.scatter(meanShape2[0], meanShape2[1], s = 50, c = "#7909E8", edgecolors = 'none')
plt.plot(meanShape2[0], meanShape2[1], '-', color = "#7909E8")
for label, x, y in zip(labels, meanShape2[0], meanShape2[1]): #annotate mean landmarks by numbers
    plt.annotate(label, xy = (x, y+0.01), ha = 'right', va = 'bottom')
plt.grid()
#for Figure 2 just plt.scatter of both allCoords and procCoords in the same window
If using IPython (:)) the best way is to paste code with the %paste magic function.↩
Continuing the previous post, the procedures presented here compare the usual assessment of modularity in a collection of landmarks (same dataset as before) with a depiction of modular structure through graph theory. Usually, modularity in the mammalian cranium is determined a-priori, according to one of the established hypotheses about the developmental origin of cranial elements. One of the most prevalent hypotheses is the two-module organization: one module comprising elements originating mainly from the neural crest embryonic tissue (anterior cranium) and one comprising elements of somitomeric origin (posterior cranium). The following code tests this specific hypothesis using geomorph package functions, with additional use of deldir for visualization of the modularity hypotheses.
Importing data, basic GM, modularity and plots
library(geomorph)
library(deldir)
library(ggplot2)
load("capSample.RData")
theme_set(theme_bw())
capGPA <- gpagen(capSample)
capCoords <- capGPA$coords
mshapeCap <- mshape(capCoords) #mean shape is useful for plotting
#vector depicting the division of landmarks according to modularity
modularity1.gps <- c("A","A","A","A","A","A","A","A","B","B","A","B","B","B","B","B")
#test for modularity based on the RV coefficient
modularity1 <- compare.modular.partitions(capCoords, landgroups = modularity1.gps)
#plot of landmark membership to a-priori modules
colormap <- c("#FF8740", "#61B4CF") #module colors, reused for the graph plot below
deldir(mshapeCap[,1], mshapeCap[,2], plotit = TRUE, wlines = "triang", xlim = c(-0.2, 0.55))
points(mshapeCap[c(1:8,11),1], mshapeCap[c(1:8,11),2], col = "#FF8740", pch = 19, cex = 1.2)
points(mshapeCap[c(9,10,12:16),1], mshapeCap[c(9,10,12:16),2], col = "#61B4CF", pch = 19, cex = 1.2)
legend(0.2, 0.4, legend = c("neural crest", "somitomere"), fill = colormap)
Figure 1 shows the theoretical distribution of RV coefficients, calculated for all possible two-module subdivisions of landmarks, as well as the observed RV value for the hypothesis depicted in Figure 2. Since the observed value is lower than most of the hypothetical values, it could be said that this modularity hypothesis holds.
The idea of using a graph theory representation and analysis of modularity is based on the correlation matrix, the same one from the previous post. The correlation matrix was obtained through Delaunay triangulation, and every vertex (node) in the graph actually represents 1/3 of the total area of all Delaunay triangles emanating from the corresponding landmark.
Delaunay triangulation and derivation of the correlation matrix
cicoCap <- t(apply(capCoords, 3, function(A) deldir(A[,1], A[,2])$summary[,4]))
cicoCap <- data.frame(cicoCap)
names(cicoCap) <- paste("lm", c(1:16), sep = "")
capCor <- abs(cor(cicoCap)) #correlation matrix
diag(capCor) <- 0 #diagonals must be set to zero
A graph based on the correlation matrix can be created and manipulated with the wonderful igraph package. Vertices of this graph will be the abovementioned areas, while the edges will represent the correlations (edge weights) between the areas around each landmark. According to these weights, edges can be colored and their line widths increased, so that the most correlated landmarks become more obvious; one way to build such a graph is sketched below.
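The graph object used in the community-detection code further below is not constructed in the original snippet, so this is a minimal sketch of how it could be built from capCor with igraph; the edge-width scaling and the 0.5 color cutoff are illustrative choices.

library(igraph)
graph <- graph.adjacency(capCor, mode = "undirected", weighted = TRUE, diag = FALSE) #weighted graph from the correlation matrix
E(graph)$weights <- E(graph)$weight #weights attribute referenced in the community detection below
E(graph)$width <- E(graph)$weight * 4 #thicker edges for more correlated landmarks
E(graph)$color <- ifelse(E(graph)$weight > 0.5, "grey30", "grey80") #darker edges for correlations above 0.5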
Modularity, or community structure, in graph theory represents the degree of compartmentalization in the graph, based on the spatial relationships or the weights of graph edges. Although an a-priori subdivision of graph vertices is possible, and the evaluation of such a graph community can easily be done (a sketch follows this paragraph), the advantage of graph theory for modularity is the availability of algorithms that search for community structure, so that the subdivision can be derived a-posteriori and compared to the hypothesis from Figure 1.
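As a side note, the a-priori two-module hypothesis itself can be scored as a graph modularity value, assuming the graph object sketched above and the modularity1.gps vector from the first block:

aprioriMembership <- as.numeric(factor(modularity1.gps)) #1 = module A, 2 = module B
modularity(graph, membership = aprioriMembership, weights = E(graph)$weights) #modularity of the a-priori subdivision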
Graph modularity and plotting
gsc <- leading.eigenvector.community(graph, weights = E(graph)$weights)
modularity(graph, membership(gsc))
colormap <- c("#FF8740", "#61B4CF")
V(graph)$color <- colormap[gsc$membership] #color graph vertices according to community
#graph layout will be circular since values for the edges range from 0.2 to 0.9,
#and they all form similar attractive force between the vertices.
graph$layout <- layout.fruchterman.reingold #really no effect on the shape of this graph
plot(graph, vertex.size = 20, vertex.shape = "circle")
legend(0.8, 1.4, legend = c("neural crest", "somitomere"), fill = colormap)
The leading eigenvector community algorithm tries to find community structure in the graph by calculating the eigenvector of the modularity matrix for the largest positive eigenvalue and then separating vertices into two communities based on the sign of the corresponding element in the eigenvector (Newman, 2006[1]). This method was preferred over many others because it is closest to the usual multivariate methods, such as PCA, that are commonly used to depict variability patterns.
The graph community structure depicts modularity in the landmark configurations accurately, with the exception of lm16, which is positioned on the posterior basicranium. Higher correlations are, on average, present within modules, while between modules only one connection is higher than 0.5, the one between lm2 and lm14. Since these landmarks lie on opposite sides of the cranium, they may capture the variability in total anterior-posterior length, i.e. cranial size.
Newman, M.E.J. 2006. Modularity and community structure in networks. PNAS, 103: 8577-8582.↩
Since landmark data consist of x and y Cartesian coordinates, it would be useful if each landmark could be represented by one summary number that preserves the spatial information. This is just a proposal of one possibility, using Delaunay triangulation of the landmark configurations and the average area of all triangles emanating from an individual landmark. Since a landmark's position within the triangulation grid influences the shape and size as well as the number of triangles around it, the area should preserve enough information about the spatial relationships of landmarks. The dataset for this post consists of ventral-aspect half configurations from here. Delaunay triangulation will be performed using the deldir R package, since it reports the mentioned triangle areas as part of the object summary.
Importing data, basic GM, calculating the triangulation, extracting areas and plotting
library(deldir)
library(geomorph)
load("capSample.RData")
capreolusGPA <- gpagen(capSample)
capGPAcoords <- capreolusGPA$coords
delCap <- deldir(capSample[,,10][,1], capSample[,,10][,2]) #extract any individual from the array for visualization
plot(delCap, col = c("lightblue", "lightgrey", 1, 1, 1), lwd = c(2, 2), xlim = c(150, 400), cex = 0.1, ann = FALSE)
points(capSample[,,10], col = "orange", pch = 16, cex = 1.4)
text(capSample[,,10], label = c(1:16), pos = 3, cex = 0.5)
areaCap <- t(apply(capGPAcoords, 3, function(A) deldir(A[,1], A[,2])$summary[,4])) #calculate areas per landmark
areaCap <- data.frame(areaCap)
names(areaCap) <- paste("lm", c(1:16), sep = "")
Values in the areaCap matrix represent the average areas of all triangles emanating from the landmark in question, so its dimensions are 60x16. A correlation matrix (16x16) can be calculated from the areaCap matrix, and it should reflect the relative positioning of landmarks through a spatial grouping pattern. This can be checked visually using the corrplot package's visualization of the correlation matrix, which allows direct assessment of emerging patterns in the matrix. Additionally, this package allows reordering of rows and columns according to hierarchical clustering algorithms and marking the desired number of clusters with rectangles in the correlation plot, as sketched below. This can only be considered an idea of general modularity, since landmarks from the same cranial region should be correlated more than distant landmarks. The transformation of xy coordinates to average triangle areas may have introduced non-biological variation or obscured some of the natural variation, so these procedures, as well as subsequent analyses, should be treated only as a fun experiment and visualization tool for now.
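The corrplot call itself is not shown in the original, so here is a minimal sketch of the visualization described above, assuming the areaCap data frame from the previous block; the choice of three rectangles matches the three proposed landmark groups.

library(corrplot)
areaCor <- cor(areaCap) #16x16 correlation matrix of per-landmark triangle areas
corrplot(areaCor, method = "circle", order = "hclust", addrect = 3) #reorder by hierarchical clustering and outline 3 clusters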
The correlation plot with three proposed general landmark groups reveals that correlations are, on average, higher for locally grouped landmarks, especially the anterior ones, from 1 to 7. This cannot be used as a reliable test of a modularity hypothesis, but it can serve as a basis for further analyses based on constructing linked graphs or networks from correlation matrices of landmark data, where the xy coordinates are transformed to one number. It can also be a useful way of representing modular structure visually, since both Delaunay and correlation plots are highly customizable and can be colored according to real modularity hypotheses. Testing some of these will be the subject of future posts.
With the ongoing expansion of available geospatial databases in raster format (GeoTiff), it is getting easier to analyse the relationship between any complex morphology, captured in landmark data, and e.g. climate or precipitation data with the help of partial least squares (PLS). R has wonderful raster, sp and rgdal packages that facilitate the extraction of geospatial data from GeoTiff raster grids. The GeoTiff used in this post is from the WorldClim website, containing climate data for European tile 16, where some of my real samples originate from. In order to get this data, follow the downloads section of the WorldClim website and choose download data by tile, 30 arc-seconds resolution. When the map opens, click on square zone 16 and download the Mean Temperature dataset listed under the map. This is a .zip file with twelve GeoTiffs, one for each month. Unpack the files into a chosen directory and declare it the working directory in the R session.
Importing libraries, reading and initial plots of the GeoTiff data
setwd("~/tmean")#this is the folder with all the GeoTiffslibrary(raster)library(sp)library(rgdal)meanJan <- raster("tmean1_16.tif")#mean temp for January etc.meanFeb <- raster("tmean2_16.tif")meanMar <- raster("tmean3_16.tif")meanApr <- raster("tmean4_16.tif")meanMaj <- raster("tmean5_16.tif")meanJun <- raster("tmean6_16.tif")meanJul <- raster("tmean7_16.tif")meanAvg <- raster("tmean8_16.tif")meanSep <- raster("tmean9_16.tif")meanOkt <- raster("tmean10_16.tif")meanNov <- raster("tmean11_16.tif")meanDec <- raster("tmean12_16.tif")plot(meanJan)#will simply get the basic R plot of a GeoTiff raster
After reading in the basic data, the raster package offers several formats in which to keep the data besides the basic raster. The most useful one is a raster stack, which is really just the piled-up rasters that can be manipulated together. Another useful visual aid is the addition of country boundaries, followed by zooming in to the extent of the country or region where the samples originate. The country boundaries .shp file is freely available from thematicmapping. When downloaded, it should be placed in the same directory as the GeoTiffs.
Creating the raster stack, adding boundaries and zooming to an extent
mtStack <- stack(meanJan, meanFeb, meanMar, meanApr, meanMaj, meanJun, meanJul, meanAvg, meanSep, meanOkt, meanNov, meanDec)
bounds <- readOGR(dsn = getwd(), layer = "TM_WORLD_BORDERS-0.3") #import borders
plot(meanJan) #plot the raster file for any month
extentR <- drawExtent() #drawExtent enters the interactive mode where you can draw the polygon for zooming based on two points, top right and lower left
#values in extentR are coordinates in WGS84 datum and expressed as longitude data
plot(zoom(bounds, extentR), add = TRUE) #unfortunately this does not work well since it will plot the bounds in the zoom extent
#and if both the raster and bounds are needed then it might be ok to overlap them in GIMP
plot(zoom(meanJan, extentR), add = TRUE)
Before the final extraction of the climate data, it can be useful to plot each month's mean temperature for the selected extent of the raster. This can be easily achieved using the rasterVis package's levelplot and the extentR variable, which can be used to cut through all raster layers in the stack.
Cutting the stack and drawing levelplot for all months
library(rasterVis)
cutStack <- (crop(mtStack, extentR)) / 10 #division by 10 is needed since the temperature data in this GeoTiff is x10 for saving memory - important
names(cutStack) <- month.abb
levelplot(cutStack) #may take a long time to plot
histogram(cutStack)
Slightly more efficient than drawExtent is the SpatialPolygons constructor from the sp package, which will be used to define the real sampling localities, either by drawing locality boundaries by hand or by using known positional data. It is best if approximate latitude/longitude coordinates are known in advance, so that the defining polygon can be drawn by connecting several corner points that form the border of the broader sampling area. If the coordinates are not known in advance, drawExtent should be used first to find the latitude/longitude of the spatial polygon points, and the SpatialPolygons object is then built from them, as described next.
Creating SpatialPolygons from latitude/longitude pairs
taraCoord <- rbind(c(19.241867, 44.012571), c(19.351730, 43.976511), c(19.518585, 43.923121), c(19.435501, 43.894924), c(19.317398, 43.896903), c(19.261780, 43.955259), c(19.241867, 44.012571))
R1Coord <- rbind(c(24.656067, 45.627484), c(25.252075, 45.575600), c(25.628357, 45.525592), c(25.488281, 45.3000007), c(25.046082, 45.381080), c(24.672546, 45.539060), c(24.656067, 45.627484))
#these objects are matrices with two columns, one for the longitude of a polygon corner and the other for its latitude
polyTara <- SpatialPolygons(list(Polygons(list(Polygon(taraCoord)), 1))) #declare object type
polyR1 <- SpatialPolygons(list(Polygons(list(Polygon(R1Coord)), 1)))
After all spatial polygons are defined, the final step involves extracting the mean temperature values from the points enclosed by the polygon in question, for all months. Since GeoTiffs are a raster format, they carry the temperature information in the points of the bitmap grid, so if the polygon is too small, the number of grid points inside it will also be small, maybe smaller than the number of individuals. To avoid this, make the sampling area broader; in this case polyTara has some 220 individual grid points inside, which means 220 values for the mean temperature in the area. If there are, for example, ten individuals per population, it is necessary to have ten mean temperature values as well, so that both blocks in the subsequent PLS have the same dimensionality. R's basic sample function can be used to extract random values from, say, the 220 points within the population (Tara and R1) polygons. This simulates random sampling of individuals from the sampling locality, and after that any analysis can be done, from linear models to PLS, which will be shown in future posts.
Extracting climate data and preparing the dataset according to the number of individuals
matTaraMean <- extract(cutStack, polyTara)[[1]]
matR1Mean <- extract(cutStack, polyR1)[[1]]
plot(cutStack$Jan)
plot(polyTara, add = TRUE)
plot(polyR1, add = TRUE)
n <- dim(matTaraMean)[1] #for sampling convenience
m <- dim(matR1Mean)[1]
taraMat <- matTaraMean[sample(n, 10), ] #this is the final extracted data
R1Mat <- matR1Mean[sample(m, 10), ]
library(ggplot2)
library(reshape2)
plotter <- data.frame(rbind(taraMat, R1Mat), loc = c(rep("Ta", 10), rep("R1", 10))) #for ggplot2
forplo <- melt(plotter)
m <- ggplot(forplo, aes(x = value, fill = loc)) + geom_histogram() + facet_grid(variable ~ .)
Figure 5 shows the locations of the Tara and R1 spatial polygons within the zoomed extent of zone 16, while Figure 6 shows histograms of the 10 randomly sampled points for all months and both localities, for comparison.
It is obvious that the Tara locality has a higher mean monthly temperature, on average, for every month, and also less temperature variation than R1.
Continuing one of the previous posts about data generation in python, the next natural step in the analytic procedure is Procrustes superimposition. Since this procedure enables direct analysis of the configurations' shape, and all subsequent explorative visualizations, it must be employed first. This post deals with the basic superimposition procedure, involving only two landmark configurations, i.e. two generated monsters. One monster will be used as a reference and one as a target, which undergoes the superimposition. The first steps in data generation are the same as before, with the exception of the polarRotator function, which can take any numpy array and reorder it according to the polar rotation angle. This is very useful since the generated landmarks are not ordered properly and do not "feel natural" in plots and analyses.
Library import, data generation and some functions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def centsize(M): #centroid size of any configuration matrix
    p = M.shape[0]
    csize = np.sqrt(np.sum(M.var(0)) * (p-1))
    return csize
def polarRotator(M): #polar rotation for natural ordering of landmarks
    x = np.array(M[:,0])
    y = np.array(M[:,1])
    points = np.array((x, y)).T
    cent = (np.sum(x)/len(M), np.sum(y)/len(M))
    angle = np.arctan2(points[:,1]-cent[1], points[:,0]-cent[0])
    points = pd.DataFrame(points, columns = ['x','y'])
    angle = pd.Series(angle)
    points['angle'] = angle
    points = points.sort('angle')
    rotatedMat = np.array(points[['x','y']])
    return rotatedMat
coordstest = np.vstack([np.random.uniform(175, 220, 10), np.random.uniform(175, 220, 10)]).T
monster1 = np.vstack([np.random.multivariate_normal(coordstest[i,:], [[3,0],[0,3]], 1) for i in range(10)])
monster2 = np.vstack([np.random.multivariate_normal(coordstest[i,:], [[3,0],[0,3]], 1) for i in range(10)])
monster1 = polarRotator(monster1)
monster2 = polarRotator(monster2)
The plot of the configurations reveals their spatial relationship, as well as the general mean shape. This time, since the landmarks are ordered properly, one line is enough for representing the mean shape and the shapes of the respective configurations.
Plotting the original monsters
plt.scatter(monster1[:,0], monster1[:,1], s = 80, c = "#FFD200", edgecolors = 'none', label = "monster1")
plt.plot(monster1[:,0], monster1[:,1], '--', color = "#FFD200")
plt.scatter(monster2[:,0], monster2[:,1], s = 80, c = "#47BFDD", edgecolors = 'none', label = "monster2")
plt.plot(monster2[:,0], monster2[:,1], '--', color = "#47BFDD")
meanMon = (monster1 + monster2) / 2 #calculate mean shape
plt.scatter(meanMon[:,0], meanMon[:,1], s = 90, c = "#6a12c4", edgecolors = 'none', label = "mean monster")
plt.plot(meanMon[:,0], meanMon[:,1], '-', color = "#6a12c4")
labels = ['Landmark {0}'.format(i) for i in range(10)]
for label, x, y in zip(labels, meanMon[:,0], meanMon[:,1]): #annotate mean landmarks by numbers
    plt.annotate(label, xy = (x, y+0.3), ha = 'right', va = 'bottom')
plt.grid()
plt.legend(loc = "lower left")
Procrustes superimposition revolves around three features of shape extraction, namely the invariance of landmark configurations to position, scale and rotation. There are a number of excellent textbooks about the mathematics and logic, as well as the procedures, of Procrustes superimposition (Bookstein, 1991[1]; Dryden and Mardia, 1998[2]; Zelditch et al., 2012[3]), but the direct inspiration for this post was Morphometrics with R (Claude, 2008), especially for the basic procedure presented in the following function definition.
Partial Procrustes superimposition of two configurations
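The original definition of pProc is missing from this post, so the following is a minimal sketch of a version that superimposes two configurations and returns a single DataFrame with both centred, scaled preshapes stacked (the rotated monster1 in the first ten rows and the scaled, centred monster2 in the last ten), as assumed by the plotting code below; the rotation step mirrors the SVD-based rotation from Claude (2008) mentioned above.

def pProc(mat1, mat2): #partial Procrustes superimposition of two configurations (a sketch)
    k = mat1.shape[1] - 1
    z1 = mat1 / centsize(mat1) #scale both configurations to unit centroid size
    z2 = mat2 / centsize(mat2)
    z1 = z1 - z1.mean(0) #centre both configurations on the origin
    z2 = z2 - z2.mean(0)
    tempSv = np.dot(np.transpose(z2), z1) #optimal rotation of z1 onto z2 through SVD
    U, Ds, V = np.linalg.svd(tempSv)
    sig = np.sign(np.linalg.det(np.dot(np.transpose(z1), z2))) #guard against reflection
    Ds[k] = sig * np.absolute(Ds[k])
    U[:,k] = sig * U[:,k]
    Gam = np.dot(V, np.transpose(U)) #rotation matrix
    pmat1 = np.dot(z1, Gam) #rotated target configuration
    return pd.DataFrame(np.vstack([pmat1, z2]), columns = ['x','y']) #target stacked on top of the reference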
The plot of the superimposed configurations reveals that the Procrustes python was really able to force the monsters' shapes to be more similar, removing the effects of position, scale and rotation.
Plot of the superimposed configurations
rotatedMon = pProc(monster1, monster2)
rotatedMon1 = np.array(rotatedMon[:10])
rotatedMon2 = np.array(rotatedMon[10:20])
plt.scatter(rotatedMon1[:,0], rotatedMon1[:,1], s = 80, c = "#FFD200", edgecolors = 'none', label = "super monster1")
plt.plot(rotatedMon1[:,0], rotatedMon1[:,1], '--', color = "#FFD200")
plt.scatter(rotatedMon2[:,0], rotatedMon2[:,1], s = 80, c = "#47BFDD", edgecolors = 'none', label = "super monster2")
plt.plot(rotatedMon2[:,0], rotatedMon2[:,1], '--', color = "#47BFDD")
meanRotMon = (rotatedMon1 + rotatedMon2) / 2
plt.scatter(meanRotMon[:,0], meanRotMon[:,1], s = 90, c = "#6a12c4", edgecolors = 'none', label = "super mean monster")
plt.plot(meanRotMon[:,0], meanRotMon[:,1], '-', color = "#6a12c4")
labelsRot = ['Landmark {0}'.format(i) for i in range(10)]
for label, x, y in zip(labelsRot, meanRotMon[:,0], meanRotMon[:,1]): #annotate mean landmarks by numbers
    plt.annotate(label, xy = (x, y+0.01), ha = 'right', va = 'bottom')
plt.grid()
plt.legend(loc = "lower left")
The following posts should continue from this one and describe how partial Procrustes superimposition of multiple configurations can be performed with fabulous scientific python.
Bookstein, F.L. 1991. Morphometric tools for landmark data: Geometry and Biology. Cambridge University Press, Cambridge, UK.↩
Dryden, I.E. and K.V. Mardia. 1998. Statistical shape analysis. Wiley, Chichester, UK.↩
Zelditch, M.L., Swiderski, D.L. and H.D. Sheets. 2012. Geometric morphometrics for biologists: A primer. Second edition, Elsevier.↩