
CVu Journal Vol 29, #6 - January 2018

Title: Visualisation of Multidimensional Data

Author: Bob Schmidt

Date: Sat, 06 January 2018 16:40:18 +00:00

Summary: Frances Buontempo considers how to represent large data sets.

Body: 

We are familiar with scatter plots for two-dimensional data. Simply use the x-axis for one dimension and the y-axis for the other and plot your points. Job done. How do you visualise data with more than two dimensions? You can manage a three-dimensional plot, though you either need it to be interactive so you can look at your graph from different angles, or you need to print plots from a variety of projections to draw out salient features. For more than three dimensions you are in trouble.

Let’s look at a few ways to display high-dimensional data. The UCI Machine Learning Repository holds a variety of data sets that are commonly used to showcase machine learning algorithms. One frequently used set is the iris data [1]. This contains 150 instances of measurements of iris flowers, along with the category or class of iris each belongs to:

  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. class:
    • Iris Setosa
    • Iris Versicolour
    • Iris Virginica

There are 50 examples or instances in each class, in blocks, so you know which is which. A machine learning algorithm will try to find ways to group the data correctly. We will ignore the class and just concentrate on the overwhelming (!) four dimensions of the data.

Scatter plots

This data set is extremely common. You can just load it in R:

  data(iris)

You can then ask for pairs of scatter plots. Exclude the final class column to see how the attributes correlate:

  pairs(iris[, 1:4])

This plots a matrix of scatter plots showing each pair of attributes in turn (Figure 1). (See the Quick-R website [2] for further details.)

Figure 1

You can’t immediately see the three different types of iris in these plots. You can see some apparent correlations between the attributes, though. This will get out of hand for more than a few dimensions. Let’s see an alternative approach for plotting many attributes at once.
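Incidentally, you do not have to switch to R for a pairs-style plot. Here is a minimal sketch in Python, assuming pandas and matplotlib are installed and the iris.data file [1] is in the working directory (the column names are my own):

  import pandas as pd
  from pandas.plotting import scatter_matrix
  import matplotlib.pyplot as plt

  cols = ['sepal length', 'sepal width',
          'petal length', 'petal width', 'class']
  iris = pd.read_csv('iris.data', names=cols)
  # Plot every pair of the four numeric attributes,
  # much like R's pairs(iris[, 1:4]).
  scatter_matrix(iris[cols[:4]])
  plt.show()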

Parallel coordinates

A common way to plot such data is one line per data point, drawn across parallel y-axes. Wikipedia [3] has an example plot of the fabled iris data set. I said it was common! This approach, unlike our pairs of scatter plots, is scalable: you just need an extra y-axis in parallel for each new attribute.
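Before rolling our own, it is worth knowing that pandas ships a ready-made helper for this. A quick sketch, assuming pandas and matplotlib are installed and the data file is to hand:

  import pandas as pd
  from pandas.plotting import parallel_coordinates
  import matplotlib.pyplot as plt

  cols = ['sepal length', 'sepal width',
          'petal length', 'petal width', 'class']
  iris = pd.read_csv('iris.data', names=cols)
  # One line per flower, coloured by its class column.
  parallel_coordinates(iris, 'class')
  plt.show()

Building it by hand is more instructive, though.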

If you download the data and put quotes around the text in the final category column, you can load it easily in Python. Using numpy will make your life easier. Import pylab, or a graph package of your choice, and load the data (see Listing 1).

import numpy as np
from pylab import *

def display():
  # Load the data; the non-numeric class column parses as NaN.
  csv = np.genfromtxt('iris.data', delimiter=",")
  fig = figure(figsize=(11,11))
  # Transpose so plot() draws one line per flower,
  # not one per attribute.
  plot(csv[:, 0:4].T)  # Magic!
  show()

display()
			
Listing 1

This code plots 150 lines on the same graph. The magic T transposes the data, swapping the rows and columns. Without the magic T, you are sending the four attributes to pylab, so you get four lines, one per attribute, with 150 data points on each. Once transposed, you have 150 lines with four points on each, one point per attribute. This gives you the four parallel coordinates you are after.
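You can check the shapes for yourself; a quick sketch, assuming the same iris.data file:

  import numpy as np

  csv = np.genfromtxt('iris.data', delimiter=",")
  print(csv[:, 0:4].shape)    # (150, 4): one row per flower
  print(csv[:, 0:4].T.shape)  # (4, 150): after the magic T
  # pylab's plot() draws one line per column, so the transposed
  # array gives 150 lines of 4 points each.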

You can see (in Figure 2) where the four numeric attribute columns are: at the left and right edges, and at the trough and spike in between. We end up with a different colour for each line, which makes for a rather multi-coloured plot. For those viewing in black and white, that’s a relief, I assure you.

Figure 2

We just get a y-axis labelled for the first attribute, which we could add to. The x-axis is a bit pointless. We want four y-axes, in parallel. I’ll leave that as an exercise for the reader; it’s easy enough to use axvline to draw vertical lines where you need them. You could also try to plot each line in a colour dependent on the iris type. In general, you should normalise the data. In the iris case, the numbers are all between 0 and 8 or so; however, the last column has smaller values, so to do this properly we should scale everything between 0 and 1. If you had data with, say, height in centimetres and shoe size, the shoe sizes are likely to be squashed up compared to the heights.
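As a starting point for that exercise, here is one possible sketch (my own, not part of the original listing), assuming numpy and matplotlib, min-max scaling each column to [0, 1], and colouring by the known blocks of 50:

  import numpy as np
  import matplotlib.pyplot as plt

  csv = np.genfromtxt('iris.data', delimiter=",")
  data = csv[:, 0:4]
  # Min-max scale each attribute so no column is squashed.
  scaled = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

  # The rows come in blocks of 50: setosa, versicolour, virginica.
  colours = ['red'] * 50 + ['green'] * 50 + ['blue'] * 50
  fig, ax = plt.subplots(figsize=(11, 6))
  for row, colour in zip(scaled, colours):
    ax.plot(row, color=colour, alpha=0.4)
  for x in range(4):
    ax.axvline(x, color='black', linewidth=0.5)  # one parallel axis per attribute
  ax.set_xticks(range(4))
  ax.set_xticklabels(['sepal length', 'sepal width',
                      'petal length', 'petal width'])
  plt.show()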

As your data sets grow in dimension, you can easily add extra coordinates. One potential downside of parallel coordinates is that the patterns you see can depend on the order you place the axes in. Many visualisations are interactive, allowing you to drag the columns around to see what’s happening. Let’s look at one last way to visualise multi-dimensional data that sidesteps this problem.

Chernoff faces

Herman Chernoff designed a way to show multi-dimensional data using faces in 1973 [4]. The idea is to map attributes of a data set to salient features of a face, for example face shape, nose size, slant of mouth, eye shape and so on. His motivation was the assumption that we are wired up to recognise faces, so we will spot similar and different patterns easily. I stumbled across this on Twitter and tried it for the iris data set. Wikipedia [5] shows a plot of ratings of judges, in which you can see some very similar faces and a few outliers (see Figure 3).

Figure 3

Wikipedia has a link to some Python code [6], providing a function called cface, which I will leave you to experiment with. This uses 18 attributes:

  1. height of upper face
  2. overlap of lower face
  3. half of vertical size of face
  4. width of upper face
  5. width of lower face
  6. length of nose
  7. vertical position of mouth
  8. curvature of mouth
  9. width of mouth
  10. vertical position of eyes
  11. separation of eyes
  12. slant of eyes
  13. eccentricity of eyes
  14. size of eyes
  15. position of pupils
  16. vertical position of eyebrows
  17. slant of eyebrows
  18. size of eyebrows

Using this for the iris data is not very sensible, since this data set only has four attributes. This didn’t stop me, though, and after a bit of experimentation I chose four of the eighteen facial features. The sample code fixed the first feature, height of upper face, to 0.9, so I followed suit. I set the others to 0.5, apart from the four I chose, related to face size and eye size, to produce 150 faces for the 150 data points. (See Listing 2 and Figure 4.)

import csv
from pylab import figure, show

# cface is the face-drawing function from the gist in [6];
# here it is assumed to have been saved locally as cface.py.
from cface import cface

with open('iris.data', 'r') as csvfile:
  reader = csv.reader(csvfile)
  fig = figure(figsize=(11,11))
  i = 0
  for row in reader:
    if not row:
      continue  # skip blank trailing lines in the data file
    ax = fig.add_subplot(15, 10, i+1, aspect='equal')
    data = [0.5]*17            # the remaining 17 cface features
    data[0] = float(row[1])    # sepal width -> overlap of lower face
    data[1] = float(row[0])    # sepal length -> half of vertical size of face
    data[2] = float(row[2])    # petal length -> width of upper face
    data[12] = float(row[3])   # petal width -> size of eyes
    cface(ax, .9, *data)       # height of upper face fixed at 0.9
    ax.axis([-1.2, 1.2, -1.2, 1.2])
    ax.set_xticks([])
    ax.set_yticks([])
    i += 1
show()
			
Listing 2
Figure 4

You can clearly see that the first 50, the setosa iris flowers, look very different to the next 100. Since the data is in blocks of 50 (setosa, versicolor, then virginica) we know the first 50 rows are setosa, and so on. Had I labelled the earlier plots more clearly, you would have seen that the petal widths and lengths are much smaller for setosa flowers. The scatter plots have a clump of separated points for the petal attributes. The parallel coordinates also have a bunch of lines separated from the rest on the last two columns. You can also see a slight difference between the next two groups of flowers. The eye size, mapped to petal width, does provide some disambiguation between versicolor and virginica flowers.
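You can back up the visual impression with a quick numeric check; a sketch, again assuming the same data file:

  import numpy as np

  csv = np.genfromtxt('iris.data', delimiter=",")
  for i, name in enumerate(['setosa', 'versicolour', 'virginica']):
    block = csv[i * 50:(i + 1) * 50, 0:4]
    print(name, block.mean(axis=0).round(2))
  # The last two columns (petal length and width) are much
  # smaller for setosa, matching the distinct faces in Figure 4.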

Conclusion

There are many ways to display multi-variate data. None are ideal, so it’s worth trying a few approaches if you have some data you want to explore. Everyone has seen scatter plots before; they are common because they can be very informative. Parallel coordinates are not so widely known - you probably didn’t study them at school - but they crop up quite frequently in serious data analysis studies, so are worth knowing about. Chernoff faces seem to be relatively obscure; a search turns up a few academic articles and critiques of the technique. I think the idea of a projection onto facial features is worth considering for data analysis. However, I suggest you have the same number of features as data columns, rather than spending far too much time trying to find the best four of eighteen to use, if you want to spot differences in the iris data!

References

[1] Iris database: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/

[2] Quick-R: https://www.statmethods.net/graphs/scatterplot.html

[3] Wikipedia: https://en.wikipedia.org/wiki/Parallel_coordinates

[4] Chernoff faces: https://en.wikipedia.org/wiki/Chernoff_face

[5] Wikipedia (Chernoff faces): https://en.wikipedia.org/wiki/Chernoff_face

[6] Python example: https://gist.github.com/aflaxman/4043086
