next_inactive up previous

pdf version

Dimension Reduction Algorithms
for
Data Mining and Visualization

Richard Holbrey (VOTech/DS6)

University of Leeds/Edinburgh




Contents

1 Introduction

1.1 Background

The need for visual insight into data has been understood for nearly a century, but many years the field only progressed within the realm of statistics (see eg. Tukey [197]). Attempts to popularise the subject more generally were somewhat limited1 and developments in data mining and visualization owed little to this work: data mining researchers were spurred on by the advance of database technology and the creation of huge commercial datasets in early 1990s [209] (see 1.2); the impetus for a more dedicated visualization effort came in 1987, with the publication of the NSF Report [154], which recognised the need to render the huge datasets resulting from engineering advances in fluid dynamics, brain scanning etc.

Since that time, researchers have expended a good deal of effort developing a series of powerful, interactive applications. Indeed, some analysts are now able to consider the fidelity of their visualizations by direct comparison with the original data [77]. Recently, however, the literature reveals a slightly more introspective mood [200]. The growth of `visual analytics', around questions of homeland security, has emphasised the danger of information overload and the need to harness new technology, such as high resolution and stereo monitors, in order to maintain an overview of the data [191]. Other researchers have stressed the need to grasp the semantics of visualization, and for the creation of ontologies, to ease communication between scientists using, for example, the e-Science grid [99].

Despite these advances, the problem of finding suitable methods for the exploration and analysis of large astronomical datasets remains formidable [148,208]. This document was instigated at the request of members of the VOTech DS6 team in August 2005 with the aim of providing a synthesis of relevant data mining and visualization algorithms, towards Stage 2 of the project [38]. The emphasis is on methods of bridging the gap between accurate representations of the data and the capabilities of current technology and users (see 1.2, below). Methods of dimension reduction of data are therefore prominent. Some mathematical background is given in this area (especially with respect to eigenanalysis) although readers are generally referred to other texts: supporting references have been drawn from Jackson [134], Manly [147], Rao et al [206] and Hastie et al [123].


1.2 Data mining and visualization: some definitions

1.2.1 Data mining

Data mining (DM) involves the retrospective analysis of large bodies of data, with the intention of recovering new facets which were hitherto unknown. Statisticians, however, remain divided over the need for the formulation of a priori hypotheses. On one hand, researchers have welcomed the chance to try new methods upon existing data sets without such constrictions. On the other, this has meant the abandonment of the probabilistic roots of the profession. In particular, it is true that data which has not been generated by deliberate random sampling cannot be generalised to the population at large. Nevertheless, the sheer size of some data sets indicates a measure of validity.

In this light, Wegman and Solka have defined a number of feasibility limits for data mining based on current processing power, the cost of storage and streaming capabilities [206]. For example, an algorithm was considered feasible if, given the size of the data set and the complexity of the algorithm applied, it could be completed in less than a week. At the time of writing (2004), this was thought to be achievable for most $ O\left(n\right)$ type problems without need for special computing resources. For $ O\left(n^{2}\right)$ algorithms, eg. clustering, this limit restricted the size of data table which could be processed to around $ 10^{6}$ entries ie. `medium' sized data tables, see Table 1.

Table 1: Data table sizes and execution times (in seconds) for algorithms of various complexities, executing on a 1 gigaflop machine (from [206])

Table size entries $ O\left(n\right)$

eg. $ \mu$, $ \sigma^{2}$ etc.

$ n\log\left(n\right)$

eg. quicksort

$ O\left(n^{2}\right)$

eg. clustering

small $ 10^{4}$ $ 10^{-5}$ $ 4\times10^{-5}$ $ 0.1$
medium $ 10^{6}$ $ 10^{-3}$ $ 6\times10^{-3}$ $ 1004.4$
large $ 10^{8}$ $ 0.1$ $ 0.78$ $ 116  days$
huge $ 10^{10}$ $ 10.02$ $ 100.1$ $ 3170  years$


A defining characteristic of DM is that data sets are often so large that -- without special hardware -- it may be impossible to load the complete dataset into memory at once (see [36]). It follows that the objective of DM searches is usually more exploratory than experimental ie. performed to generate hypotheses, rather than to test them. Also, it suggests that methods are needed initially to process the data as a stream, perhaps with some loss of precision, and that, later, a subset of the data will need to be selected.


1.2.2 Exploratory data analysis

A further distinction is sometimes made between exploratory data analysis (EDA) and DM. EDA makes use of various techniques designed specifically to identify patterns in multivariate data (or univariate sequences), such as factor analysis, clustering and so on (see Section 3). But the focus of DM is on producing a solution that can generate useful predictions. DM researchers, therefore, have a much greater acceptance of `black box' approaches, such as neural networks. Whilst these are certainly capable of generating valid predictions, it is not possible to identify the nature of the interrelations between the variables on which the predictions are based [183]2.

Scott also points out that exploratory analyses should ideally be non-parametric, despite the fact that assumptions about normality are common. Many parametric techniques are reasonably robust to deviations from normality, but for exploratory work the results can occasionally be very unpredictable [173]. This point has recently been echoed by Gray and Moore [118], who agree that non-parametric methods ``are generally the method of choice in exploratory data analyses...[but] often come at the heaviest computational cost.'' This issue is discussed further in Section 5.3.


1.2.3 Visualization

The role of graphics in data analysis has long been recognized as important, if not universally. In the 1930s, Fisher wrote: ``Diagrams prove nothing, but bring outstanding features readily to the eye; they are...no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them'' [173]. Despite this realization, however, it was not until the 1960s that Tukey finally made them respectable [196, p.53]; and not until 1987 that visualization was recognized as separate discipline with distinct aims.

Since that time, a good deal of effort has been expended to develop a number of powerful, highly interactive applications: Iris Explorer [1], Amira [2], Spotfire [57], Visage [208] etc). Humans are particularly good at detecting certain patterns and interactivity-based approaches are aimed at engaging our senses in order to aid our cognitive abilities. The hope is, in effect, to let the user `see' something new and from this perspective, interactivity is vital [66,64,175]. Against this view, however, various costs must be set:

It should also be realised that many effective visualizations have been created from careful consideration of the data and entirely static displays. As a result, Tufte places qualities such as design and data-ink maximization in much higher regard [194,196]. These issues are explored further in Section 2, and, given the difficulty of handling high dimensional data in 2D displays, the capabilities -- and interactivity -- of various software tools, notably HCE [174], R [164], Topcat [3] and XmdvTool [62] are also described.

1.3 Approach

1.3.1 The nature of the data

The approach adopted here is essentially borne of practical necessity. Queries of astronomical catalogues generally result in large data tables which may be hundreds of columns wide (representing the number of measured variables) and contain $ 10^{3}$ - $ 10^{9}$ rows (the number of observations), placing these data sets easily within Wegman and Solka's medium to large categories ( $ \leq10^{8}$ table entries, see above). In preliminary work up to this point, it was noted that some software tools -- perhaps the majority? -- work reasonably well with data sets of the order of MB (`small' to `medium' sized), but struggle with data tables of larger sizes for two main reasons:

Further discussion of these issues is available online at [37,38].


1.3.2 Multivariate analysis and `knowledge discovery'

It was thought convenient to treat the task of dimension reduction as two separate issues relating to (a) column reduction, and (b), row reduction. These topics are introduced in Sections 3.2 and 3.3. Readers who are already familiar with this territory are referred to the more succinct report by Fodor [108]. The treatment here is rather more in depth and a number of themes are apparent. The first, and probably the strongest, is that of the analysis of covariance leading to the techniques of principal components and factor analysis (amongst others). Indeed, an understanding of the difficulties and differences between these techniques is useful towards the understanding of the subject (multivariate analysis) as a whole. For this reason, these techniques are described in some detail in Section 4.

Another important theme is that of adopting a cyclic approach to data mining. A misleading term, which occurs in the literature, is knowledge discovery. The implication is that, after some processing step, domain experts should able to interpret newly `mined' data with much greater clarity than was previously possible. Wegman and Solka point out that this only really becomes possible if a considerable amount of effort is expended in cleaning up the data first. Any analysis must therefore be considered as part of a general data flow. Fig. 1 (adapted from [206, fig. 1]) illustrates this flow as a cycle to emphasise that each step in the process should be seen as one of data refinement, rather than as an immediate endpoint and visualization is no exception

Figure 1: Data flow
\begin{figure}\par
\begin{center}
\par
\input{data_flow.pstex_t}
\par\par
\par
\end{center}\par
\end{figure}
. Performing a principal components analysis, for example, may not produce immediately usable results (since the axes are difficult to interpret) but could be used to maximise the linear separation of clusters (for visualization) or to reduce the set of variables under study (selection criteria).

1.3.3 Data preparation

The development of astronomical instruments and procedures for data gathering and storage is beyond the scope of this report. Nevertheless, it is useful to say something about the level of processing which can normally be expected. It is common, for example, for data to be calibrated depending upon the instrument used before the data is stored, but additional steps may be required to achieve the highest possible levels of precision. A number of software tools are able to assist here, such as the Aips++ library [32], and others provide facilities for immediate analysis, notably IDL [47], Aladin [4] and Xmatch [65,5]. Also, there are standardisation issues to do with the adoption of various naming schemes, units and so on, although it is hoped that the creation of the IVOA UCD and VOTable standards [6] may make this aspect easier to handle in future.

One of the most obvious features of Fig. 1 is that curation and preparation of the data forms the bulk of the work. In particular, care is needed to remove errors (noise, missing values etc) and to select appropriate variates for analysis. This may require domain expertise, since it may be necessary to:-

Where the data need to be kept, it would seem to be desirable to be able to store the upper and lower limits of a given process. Although such limits might be stored in the VOTable format, it seems that these boundaries must often be inferred from the data and hence a satisfactory solution to this problem remains elusive. Otherwise, for removing potentially conflicting data, scatterplots and other visual techniques (see 2) may be useful step in identifying outliers or clusters. It should be noted, however, that it is the view of the DS6 group that visualization tools may be able to assist in all stages of this cycle, not just in the initial assay or the final presentation. The cleaning and normalisation of data are discussed in Section 4.

1.3.4 Objectives

Given the number of different but relevant domains upon which this research might touch, the scope of the present report is restricted to:-


2 The Limits of Visual Thinking

2.1 Background

Visualization is a complex research area which has developed over the past few decades by building on theory from information design, computer graphics, human-computer interaction and cognitive science. In view of this conglomeration of ideas, the title (`limits of visual thinking') is intended to reflect that the development of computing algorithms must be matched by an understanding of the psychology of the user. In other words, it is not sufficient for researchers to demonstrate a given algorithmic complexity to ensure a successful visualization, since little is achieved if the display hinders rather than helps. However, rigorous user testing of visualization software is both difficult and rare [85]. In order to get some appreciation for the current limits of hardware and software, therefore, this section gives a brief survey of available tools but begins by revisiting some of the guidelines for good graphic design set out by Tufte (see also [42]) and Bertin [75].


2.2 Guidelines for visual displays

2.2.1 Above all else, show the data

Tufte has recently published a harsh critique of NASA engineers' usage of MicroSoft PowerPoint©  instead of producing formal written reports prior to the Columbia disaster in 2003 [192]. This is a new variation of an old complaint: in the past, many authors have often exempted themselves from working with the data at hand and allowed new formats to obscure the end conclusions. Inappropriate scaling and ordering of data are common symptoms (see 2.2.4). Ultimately though, this is an issue of integrity and authors should be encouraged to think through, prepare and attach their own names to their own work [196,195, ch. 2].


2.2.2 Juxtaposition: contrasts and parallels

Bertin observed that tabulating data is an essential part of constructing a well-defined problem, and may require a good deal of imagination. But a homogeneous table is unlikely to be effective in capturing the eye. The next stage of the process, therefore, involves graphic illustration, but this should leave open the question of the organisation of the table. Indeed it is unlikely that the first effort will be optimal. Exposing contrasts in the data involves some skill at the removal of unnecessary data-ink, juxtaposition and scaling [87]4. This may be automated to some extent, since it is partly based on patterns of sectors in the data or graphic, but still requires (re-)organisation in line with semantic reasoning [75].

Similarly, Tufte has argued that presenting similar graphics side-by-side can enhance the visualization by suggesting motion between the images, as if in animation. This motion indicates greater dimensionality and, hence, helps to reinforce the differences between each `frame' [195, ch. 6]. It is also one of the guiding principles behind the scattergram displays (see 2.4.2) and those of Andrew's Curves (see 2.4.5). Tufte has also proposed that `sparklines' can give intense word-sized contrasts, as shown in Fig. 2, which may be still more effective when used in multiples [193].

Figure 2: Using `sparklines' (from [193])
Image 0_home_richardh_reports_ds6_alg_figures_sparklines

2.2.3 Maximize the data-ink and the data density ratios

A common pitfall of new graphic designs (and 2D/3D computer displays) is to emphasise line-art and to overlook supporting orientation or scaling data. Tufte describes the effect as one of proliferating `chartjunk'. Often, the added weight of this detail tends to impose on the graphic and masks some of the meaning. Two quantities describe this problem:

\begin{displaymath}
\begin{array}{ccc}
Data  ink  ratio & = & (data  ink)/(to...
... entries  in  data  matrix)/(area  of  graphic)\end{array}\end{displaymath}

In [196], Tufte advocates the removal of all non-data ink and any redundant ink. The axes of a scatterplot, for example (see 2.4.2), might be trimmed to show only the range of the data or to show median and quartiles. In this way, the data-ink ratio is maximized and the possibility for gratuitous self-decoration and moiré effects is removed. At the same time, thinned or greyed-out lines and varied line thickness can be used to great effect, if, say, a background grid is really needed.

In computer graphics, it might be supposed that the equivalent to the data-ink ratio would be data pixels (pixels representing data elements) to non-background window pixels; the data density might be data pixels to window pixels. Although providing useful concepts, these quantities may not be entirely appropriate, however, for electronic displays (see below).


2.2.4 Emphasise causal and multivariate relationships in data

Whilst data may not always contain cause and effect type relationships, the organisation of data plots is critical if such relationships are to be exposed. An oft-quoted example is that of the problem of `O'-ring failures on the Challenger space shuttle in 1986. NASA engineers went to some lengths to describe the risk prior to the launch, by using plots of rocket-shaped symbols with various kinds of cross-hatching to indicate particular failures. The rocket symbols, however, contributed nothing to the readability of the graphic and since they were ordered by the testing sequence, the explanatory variable (temperature) was completely obscured.

Tufte's reordering the plot, so that the symbols appear at the correct location along the temperature axis, makes the decision to launch looks a good deal riskier [195, p.49]; by removing the symbols and using a weighted metric to describe an overall damage statistic, this relationship is very much clearer [195, p.45]. More generally in scientific testing, it is necessary to employ experimental controls to provide a baseline for comparison. This baseline then permits us to relate variations in cause with variations in effect. Without such controls, we must assume that there may be many possible causes [195, p. 50-52].


2.3 Initial assessment and standardisation

Although some graphical techniques are useful for standardisation, it is useful to consider some of the basic requirements for cleaning data before proceeding. In the first place, the question of missing data needs to be tackled. Many software packages simply ignore data units with missing entries and this seems acceptable where there is sufficient data to do so. Often, however, this may not be possible or may even be unwise if there is some underlying cause which demands further explanation [104,147] (also note the earlier comments about keeping data in Section 1.3.2).

Scatterplots (see below) are undeniably useful for initial assessment, and it would be unusual today if this were not one of the analyst's first steps. Apart from gaining insight into the distribution, the nature and influence of outliers needs to be gauged. One method is to plot regression lines over the scatterplot, which may be locally weighted if the distribution seems uneven (see Fig. 4 and Appendix A.1 for a description of the data). Finding and possibly removing the `boundary' or convex hull of the data or is one way of assessing the influence outliers, see Fig. 3.

Figure: Left: convex hull plots define bounding boxes to show potential outliers; Right: `chi-plots' (with boundary points removed) indicate departure from independence if most of the points lie outside the horizontal band (after [103]). Note red=`star' and green=`galaxy' candidates; see Fig. 4, which shows combined views of these data.
Image 1_home_richardh_reports_ds6_alg_figures_1-2-chi-plots

Image 2_home_richardh_reports_ds6_alg_figures_1-3-chi-plots

In these plots, outliers do not appear to be particularly distant and keeping or removing the boundary points does not affect the adjacent chi-plots. The basic principle of these is to assess the independence of variables using the expected $ \chi^{2}$-distribution, where dependency is indicated by deviation from the central band running horizontally across the centre of each plot. In this case, the lower plots show a greater level of structure (ie. less independence), since relatively few points lie in this zone [103]. Note that the presence of two supposed groups may be more rigorously tested by the use of canonical variate analysis (see 5.6.1). The detection of outliers is discussed with reference to self-organised maps in Section 4.6.

A common requirement before proceeding with many forms of analysis is to scale the data, so that the range or units of the data do not figure. Conversion to Z-score (autoscaling) is probably the most common technique, so that the distribution has unit variance, although division by the sample range has been shown to be more robust in many clustering applications [104]. Principal components analysis and some clustering techniques provide ready made methods of normalization which are discussed later. For the multivariate case, autoscaling can be achieved by employing the covariance matrix, $ S$, where the transformation is:

$\displaystyle Z=S^{-\frac{1}{2}}(X-\mu)$ (1)

given that $ X$ describes the distribution of $ p$ variates $ \mu$ contains $ p$ sample means. It is possible, though for large problems, not generally advisable, to obtain $ S$ by using the identities which eigenanalysis provides (see 4.2.2).


2.4 Plotting in 2D

2.4.1 Issues with reducing nD to 2D

Given the high dimensionality of most astronomical datasets, reduction to 2/3D for visualization purposes seems unavoidable. Also, the preference for many researchers is naturally for 2D graphics, so that (amongst other things) publication drawings can be considered. Various techniques for plotting in 2D are considered in this section.

Where the dimensionality gets very high, however, several problems arise with reduction to 2D [173,205]. If we consider the hypercube $ \left[-1,1\right]^{n}$, the $ 2^{n}$ diagonals, $ {\bf v}$, from the origin to the corners are of the form $ \left(\pm1,\pm1,\pm1,\cdots,\pm1\right)^{T}$. The angle between the diagonals and the Euclidean axes, $ {\bf e}$, is given by $ \theta$, where:

$\displaystyle \cos\theta=\frac{\left\langle {\bf v},{\bf e}\right\rangle }{\sqr...
...ht\rangle }}=\frac{\pm1}{\sqrt{n}}\quad\xrightarrow[n\rightarrow\infty]{}\quad0$ (2)

so that, $ \theta\rightarrow90{}^{\circ}$ and the diagonals are nearly orthogonal to the axes. Table 2 shows this angle for progressively higher dimensional multiples:

Table 2: Angle between Euclidean axes and diagonals
$ n$ 2 4 8 16 32 64 128 256 512 1024
$ \theta ({}^{\circ})$ 45 60 69.3 75.5 79.8 82.8 84.9 86.4 87.5 88.2


In 2D, such diagonals would tend to become invisible, so that a simple `shadow' projection down to lower dimensions would tend to lose data near these diagonals.

In higher dimensions, the volume of hyperspace also appears distorted, so that the distribution of data is also counter-intuitive. Since the area of a circle can be written as $ c_{2}r^{2}$ and the volume of sphere as $ c_{3}r^{3}$, we can generalise the volume of a hypersphere as $ c_{n}r^{n}$. It is easy to show that the volume of a thin hypershell relative to that of the hypersphere approaches 1 as $ n$ increases. In other words, most of the volume occurs within the shell. The result is that if the data were uniformly distributed within the hypersphere, most data would, in fact, lie within the shell. However, by projecting the data onto a 2D plane, most of the data will appear to lie near the origin of a circle, thus giving an unrealistic impression [205].


2.4.2 Scatterplots

A typical scattergram (ie. a collection of scatterplots) is shown in Fig. 4, showing some of the SuperCOSMOS data set as depicted in R, with colour used as an additional `dimension' to reflect different classes in the data (here red implies `star', green a `galaxy'). Due to the linear nature of the plotting algorithm, the number of points which can be plotted is almost unbounded (but see below). There is a good deal of redundancy in the display (a poor data-ink ratio) since each smaller graphic is plotted twice, although statisticians seem to prefer this, so there is no particular bias to `x' or `y' axes. Some applications can make use of the plot windows on the diagonal eg. to plot histograms, as is the case here. The blue curvilinear feature in the scatterplots is a locally weighted regression (LOWESS) curve, after Cleveland [164]5.

Figure 4: SuperCOSMOS data scatterplot
Image 3_home_richardh_reports_ds6_alg_figures_col-pairs

Nevertheless, a practical limit of 10-15 variates exists for most displays. The use of `coplots' in R or dimensional stacking in XmdvTool [211] can make some relationships more explicit for, say, up to 5 variables.

Scott has observed that where the number of observations is large, the areas of greatest density lose visual impact and the eye is drawn to outlying data in the tails of the distribution [173]. One response has been the development of sunflower plots, which make use of hexagonal bins to form a 2D histogram [101], Fig. 5. Individual observations are plotted when there is only $ 1$ observation per bin as in a conventional scatter plot. Each bin with from $ 1-p$ observations contains a light sunflower. Other bins contain a dark sunflower. In a light sunflower each petal represents one observation. In a dark sunflower, each petal represents $ k$ observations. (A dark sunflower with $ p$ petals represents between $ pk-k/2$ and $ pk+k/2$ observations.)

Figure 5: Sunflower scatterplot (from [101])
Image 4_home_richardh_reports_ds6_alg_figures_sunflower


2.4.3 Parallel coordinates

Described by Inselberg over two decades ago [133], many tools are now capable of rendering parallel coordinates. The basic premise is very simple: instead of plotting data by using orthogonal axes, the axes are maintained parallel and the data plotted as lines across the axes. In this space, therefore, a point is mapped to a continuous line (and vice versa). This has the effect that the amount of data-ink increases relative to the background, but may obscure the display. Also, since it is common for the axes to be automatically scaled to the maximum and minimum for each variable, it is usually difficult to assess directly the correlation between any pair (cf. [205]).

Examples using the XmdvTool display are shown in Fig. 6, using a subset of 20 variables from the SuperCOSMOS dataset (1000 rows) [7].

Figure 6: XmdvTool: (a) parallel coordinate display, with brushing (blue-shaded region) to emphasise one possible pattern; (b) with hierarchical clustering (see text) [212]
(a) Image 5_home_richardh_reports_ds6_alg_figures_supercos_brush



(b) Image 6_home_richardh_reports_ds6_alg_figures_supercos_hierarc

The upper plot (a) emphasises the brushing technique, in which the `brushed' area is used to highlight a perceived group in the data, which might be used as an input for spatial restriction (note that Mirage employs this tactic; see 3.3.1). In the lower plot, (b), a simplified representation of the data was created by the hierarchical clustering feature of XmdvTool. Note that the clustering is effective in reducing the amount of clutter in the display but these seem to be rather too arbitrarily defined. As another illustration of this artefact, the well-known `iris' dataset displays 4 clusters, where there are only 3 species present and where, in fact, only two groups are linearly separable [48] (see Appendix A.2 and [173, Pl.1]). This exposes one of the difficulties with the hierarchical clustering technique, ie. in obtaining valid clusters, see 3.3.3.

A further problem is apparent when the number of variables increases (beyond, say, 15-20) in that it is difficult to separate the axes or associate axes with the respective variables. In fact, as the distance between axes decreases, the display becomes saturated with `data-ink' and any visible pattern is lost. An interesting response to the saturation problem was proposed by Lederman and implemented in Parvis [41]. Here, the author has created a fading brush technique and is also able to highlight single strands of data to assist the user. As a more innovative use of the display, histograms can also be plotted against the axes to indicate the basic data distribution.

2.4.4 Glyphs

A classic example of the use of glyphs was that of Chernoff faces, Fig. 7.

Figure 7: Chernoff faces (from [8])
Image 7_home_richardh_reports_ds6_alg_figures_chernoff

The assumption behind this approach is that the human eye is particularly sensitive to facial expressions, so that by mapping variables to particular features (eyebrow height, mouth curvature etc), we may be able to distinguish characteristic elements of the data more easily. Many other types of glyph, including stars, weathervanes and profiles, have been proposed. Ward has observed that the success of these is highly dependent upon colour and glyph placement [203], since they may be discrete or overplotted as required.

Tufte has observed that despite the quixotic character of Chernoff faces, `ink' may saved by halving the faces, since they are usually symmetrical [195]. One problem which frequently occurs, however, is that of representing the key in glyph plots, eg. see Fig. 8. The use of text in the key creates a fairly large mental load and has the effect of distancing the user, even for this small data set.

Figure: XmdvTool: Subset of SuperCOSMOS data (see A.1) set as glyphs -- note key. Each radial spoke represents a separate variable in this case, with the length of spoke reflecting the magnitude of the variable. The colours of the glyphs reflect a hierarchical clustering
Image 8_home_richardh_reports_ds6_alg_figures_xmdv_glyph

Another criticism is that glyphs are asymmetrical ie. some glyphs have greater visual impact, and it is recommended that all combinations are tried [173].


2.4.5 Andrew's Curves

An interesting implementation of the `juxtaposition' principle (2.2.2) is that of Andrew's curves [70], in which each observation is represented by a combination of the (scaled) variates, $ x_{i}$, into a sine wave of the form :

$\displaystyle f_{x}\left(t\right)=x_{1}\sin\left(n_{1}t\right)+x_{2}\cos\left(n_{1}t\right)+x_{3}\sin\left(n_{1}t\right)+\hdots$ (3)

This allows each row of data to be plotted as a single unique wave-form, typically in the same display for comparison (see [9] for a graphic demonstration). A number of extensions to this technique have recently been advanced by García-Osirio to visualize self-organizing maps using various over-plotting methods [113], for which a more illustrative movie demonstration also exists at [112].


2.4.6 Linked displays and tours

Several tools provide a greatly enhanced level of interactivity by combining displays and allowing highlights to be broadcast from one to the other, eg. Mirage [52]. One of the first tools to make use of so-called `brushing' of data between displays was Arc (itself based on XlispStat) which - even to this day - remains one of the few tools capable of generating regression planes to describe data [89]. XGobi [185] and XGVis [81] also adopted the brushing technique, adding various methods to display data in parallel coordinate, 3D rotation and making use of tours (animations) particularly with in conjunction with dimensional scaling (see 4.6 below, CViz [120] and HiSee [45]). Another valuable property of these systems is that data can can be excluded to examine the effect of outliers.

Several commercial systems have now been developed in this idiom, such as Spotfire [57]. One system which deserves further mention, however, is Hierarchical Clustering Explorer (HCE) [44], see Fig. 9.

Figure 9: Hierarchical Clustering Explorer
Image 9_home_richardh_reports_ds6_alg_figures_hce-2

A unique feature of this tool is that 1D and 2D histograms are used to suggest interesting distributions (and hence variates) to the user (but note 3.2.2). However, the use of a clustering algorithm as data are input forbids the processing of `medium' or larger data tables (see Table 1). This issue also affects XmdvTool when using the InterRing [213] or clustering data selection tools.


2.5 Tree-based plots

In cladistics6, trees are common tools for representing the distance between species graphically. To represent such trees within limited computer displays, however, various forms of non-linear presentation have been adopted using `backbone' spanning trees and focus/context motifs, such as the i-Disc [127], or hyperbolic forms, such as HypViewer, see Fig. 10, which shows the current UCD1+ hierarchy [160].

Figure 10: Hypviewer
Image 10_home_richardh_reports_ds6_alg_figures_ucd

This system was capable of displaying some $ 10^{5}$ edges; in later schemes, using `focus + context' techniques, the authors claim to support over $ 10^{7}$ nodes (see TreeJuxtaposer [73,161,10]). The hyperbolic model of Hypviewer has been adapted into various later systems, notably Walrus [131], Inxight [11], Prefuse [125], Kartoo [12] and Touchgraph [13].


2.6 3D tools

Graph-based techniques have also been adopted within the visual programming environments of IRIS Explorer [1] and OpenDX [14] (see also ViSta [214]). Here, customised applications exist as modules of code which can be connected together graphically, so that data appears to flow around the system. For astronomy, VisIVO, which also employs a modular library approach (using VTK [60,172]), is currently being adapted for virtual observatory use [72]. A natural tendency of these systems has been towards the display of 3D, spatial data, making use of hardware rendering.

These tools have been particularly successful at visualizing scalar data and flow problems (see eg. [143] for more recent developments). For more general multidimensional data, there are several challenges. Figure 11 shows an example from VTK using `splatting' with the quasar data [59]. This technique aggregates the data by overlapping areas or volumes associated with each data point (here the u, g and r variables are described by the three axes; and i by the attributed colour scalar.

Figure 11: VTK `splat' example (see text)
Image 11_home_richardh_reports_ds6_alg_figures_vtk-ugri-splat

Higher dimensional (say 5D or 6D) data sets can also be rendered if glyphs, animation or contours are also be employed, although issues of occlusion and human perception (which varies particularly for colour) begin to take precedence [203]. However, the technique of hierarchical `splatting' (see Fig. 12) may offer faster rendering of very large tables of unstructured data [129].
Figure 12: Hierarchical splatting of n-body simulation (from [128])
Image 12_home_richardh_reports_ds6_alg_figures_splatting

Working in 3D (and higher) therefore requires that a number of issues must be faced:-

With respect to the above navigational issues, it is useful to describe research at Leeds undertaken by dos Santos [98]. An interaction graph was proposed to allow users to select variables for 1D, 2D or 3D plotting with additional, movable nodes along the spokes of the graph to select the range of the data and to locate the viewing position. For a 57-dimensional sample of the SuperCOSMOS data, however, it was necessary to expand the graph to a radial spiral shape, Fig. 13.
Figure 13: Interaction graph for high dimensional data
Image 13_home_richardh_reports_ds6_alg_figures_selan-57

Unfortunately (as with glyphs above), this scheme begins to distance the user from the data since it is no longer easy to display labels or keys, nor use the central spoke nodes to alter the range of the data.


2.7 Comment

It has already been observed that only the least complex (eg.  $ O\left(n\right)$) schemes are able to cope with `large' data sets as proposed. For this reason, the simpler schemes here are described more thoroughly here than might have been expected. Although this list is not exhaustive, numerous web-pages exist for comparisons elsewhere [39,16,17,18,199]. In addition, it is convenient here to make a number of further observations:


3 Dimensional Reduction


3.1 The curse of dimensionality

The `curse of dimensionality' (following Bellman (1961) [20]) refers to the exponential growth of hypervolume as a function of dimensionality. In most cases, this issue is manifested as one of data distribution, which is usually very uneven within the hyperspace. Hence data may be abundant in some regions, but inadequate in others, especially if statistical inferences need to be drawn.

For astronomy, however, it is recognized that some variables are of greater interest than others. In particular, error terms, although important, are generally of lesser interest than their respective observational parameters. Hence there may be a hierarchy of data terms, which may be different for each researcher. The proposed UCD1+ terminology also suggests a data hierarchy [95,49] which might be exploited. It was proposed to examine this issue further by investigating current tree-building tools, and developing, if necessary, some new method of configurconfiguringating the display of data (see Section 6.4). The remainder of this section considers methods of table reduction, by column and then row.


3.2 Column reduction

The main column (ie. variate) reduction methods are summarised in Fig. 14.

Figure 14: Column reduction techniques
Image 14_home_richardh_reports_ds6_alg_columns

A brief description of the elements in this figure is given below.


3.2.1 Manual Section


3.2.2 Automated selection


3.2.3 Cardinality


3.2.4 Factor Analysis


3.3 Row reduction

These techniques aim to give focus to particular units of data (often called `individuals') and, in this sense, provide a narrowing of the extent of the data. For example, the focus might be used as an input to select rows from a table. The techniques are summarised in Fig. 15 and below.

Figure 15: Row reduction techniques
Image 15_home_richardh_reports_ds6_alg_rows


3.3.1 Spatial restriction


3.3.2 Sampling and quantization


3.3.3 Clustering


3.3.4 Classification and discriminant analysis


4 Column Reduction Techniques

4.1 Introduction

4.1.1 Background

The techniques of factor analysis, principal components etc have developed over the course of the 20th Century, fuelled by research in fields as diverse as psychology and ballistics. One of the original aims was to extract the line or plane of maximal variance. Such a line, plane or hyperplane -- known as the principal component -- is actually one of regression about which the orthogonal error is minimised, see Fig. 16.

Figure 16: Principal components of a 2D data set (first: long axis, red; second: short axis, green)[54]
Image 16_home_richardh_reports_ds6_alg_figures_pca_pcs

The advantage of this kind of error minimisation is that it allows any one variable to `predict' another (within certain tolerances) and these are then described as predictor and response variables. Also, if this axis is adopted as the main component of interest (typically, by transforming the data to this axis), then the remaining variance (orthogonal to this plane) appears to be procedural in origin. Moreover, the variables under this transformation are now said to be uncorrelated ie. it is no longer true that one variable may be used to predict the other. Only after this step can the variables be considered to be independent and hence this term is avoided in multivariate analysis.

4.1.2 Methods and models

Factor analysis (FA) has often been confused with principal component's analysis (PCA) as the techniques are similar, although the underlying mathematical model is rather different [84]. Given the importance of these techniques to the present discussion, some background theory to FA and PCA is given below.

4.2 Covariance and correlation


4.2.1 Extracting factors

If we have $ p$ factors (usually columns of data, each representing a separate variable), we may denote the factors as $ F_{1},F_{2},\cdots,F_{p}$. Generally, we may suppose that these factors can be linearly combined for each of $ n$ observations, to give an overall index, $ X_{i}$ ( $ i=1,2,\ldots,n$), which can be written:

\begin{displaymath}\begin{array}{ccccccccccc} X_{1} & = & a_{11}F_{1} & + & a_{1...
...{n2}F_{2} & + & \cdots & + & a_{np}F_{p} & + & e_{n}\end{array}\end{displaymath} (4)

If $ p$ is large, however, it is advantageous to select the factors (and coefficients) of greatest interest -- usually those factors which display some interesting distribution or the greatest variance. If some factors are to be omitted because their contribution is thought to be negligible, then the error term, $ e_{i}$, in each of the expressions of Equation 4 increases. That is to say, this term acknowledges that some error has been introduced.

In Equation 4, the coefficients $ a_{ij}$ represent a matrix, denoted $ A$. But the usual starting point for this matrix is the sample covariance matrix, $ S$, which is defined by the relation:

$\displaystyle s_{ij}=\frac{n\Sigma x_{ik}x_{jk}-\Sigma x_{ik}\Sigma x_{jk}}{n\left(n-1\right)}$ (5)

where $ s_{ij}$ are elements of $ S$ ( $ i,j=1,2,\ldots,p$) , $ n$ is the number of observations and $ k$ is used to sum the values in columns $ x_{i}$ and $ x_{j}$ from 1 to $ n$. In this case, $ S$ will be $ p\times p$ symmetric and the diagonal values ($ s_{i=j}$) will be the usual column variances. It is convenient here to point out the similarity between $ S$ and the correlation matrix, $ R,$where:

$\displaystyle r_{ij}=\frac{s_{ij}}{s_{ii}s_{jj}}$ (6)

It turns out that the factors, $ a_{ij}$, may then be drawn from the characteristic forms of the $ s_{ij}$ or $ r_{ij}$ matrices, which arise from eigenanalysis [103]. In fact, if $ S$ is nonsingular (ie. the determinant of $ S$, ie.  $ \left\vert S\right\vert$ is non-zero), it can be shown that:

$\displaystyle U^{\prime}SU=L$ (7)

where $ U$ is orthonormal, so that $ U^{\prime}U$ is identity ($ I$). Since $ S$ is relatively easy to compute, its decomposition to $ ULU^{\prime}$ is more natural. Under these conditions, $ L$ is a diagonal matrix, the elements of which comprise the latent roots or eigenvalues (literally `characteristic values') of the system. The columns of $ U$ then give the eigenvectors associated with each root and thereby define the principal components. These roots may also be obtained by supposing that for some constant $ \lambda$ and vector $ {\bf u}$, there is a non-trivial solution to the problem:

$\displaystyle S{\bf u}=\lambda{\bf u}$

so that:

$\displaystyle \left(S-\lambda I\right){\bf u}=0$ (8)

Hence, ignoring trivial cases, there will be solutions to this system for characteristic vectors $ {\bf u}$ where $ \left\vert S-\lambda I\right\vert=0$ (see [40]). Each solution for $ {\bf u}$ then has an associated eigenvalue $ \lambda$, which may be assembled (respectively) into the $ U$ and $ L$ matrices as described above.

We may visualize this transformation as a rotation of the axes along the axes of maximal variation and, if normalised, the columns of $ U$ give the direction cosines between the old and new axes. Determinants may be used to obtain such roots for small problems, but for large systems, iterative `power' methods are more appropriate. The Lapack library provides a well-tested and freely available set of such routines, using eg. single value decomposition [58,68].


4.2.2 Principal component transforms

It is useful to consider several further extensions to Equation 7 so that we may elucidate the process of PCA and data transformation (including `sphering'; see Equation 1). Jackson [134], for example, employs the identities:

\begin{displaymath}\begin{array}{ccc} V & = & UL^{1/2}\ W & = & UL^{-1/2}\end{array}\end{displaymath} (9)

to give the results shown in Table 3.


Table 3: U,V,W vector products of PCA (after Jackson [134])
  PCA transform Squared terms Notes
U-vector $ z_{i}=U^{\prime}\left[x_{i}-\overline{x}\right]$ \begin{displaymath}\begin{array}{ccc}
UU^{\prime} & = & I\end{array}\end{displaymath} PCs have variances equal to corresponding root
V-vectors $ \sqrt{l_{i}}z_{i}=V^{\prime}\left[x_{i}-\overline{x}\right]$ \begin{displaymath}\begin{array}{cc}
VV^{\prime}= & S\end{array}\end{displaymath} PCs have variances equal to square of corresponding root
W-vectors $ \frac{z_{i}}{\sqrt{l_{i}}}=W^{\prime}\left[x_{i}-\overline{x}\right]$ $ WW^{\prime}=S^{-1}$ PCs have variances equal to unity


The object of PCA is the linear transformation of a group of correlated variables, such that some set of optimal conditions is achieved. The most important condition is that the new variables are uncorrelated. Mathematically, this is ensured by the requirement that $ u_{i}^{\prime}u_{j}=\delta_{ij}$ (using Kronecker's delta function: $ \delta_{ij}=1$ if $ i=j$, or 0 otherwise). Even so, these identities suggest many possible transforms. From the above, $ W$ might be written as $ S^{-1/2}$ (the inverse square root of covariance), so that the data sphering problem of Equation 1 can be solved in this way (although for large problems, the Lapack library offers more viable inversion routines). Hence, we might apply the $ W$-transform:

$\displaystyle Z=W^{\prime}\left(X-\mu\right)$ (10)

to obtain variables centred at $ \mu$ with unit variance. On the other hand, the comparable $ V$-transform scales the variables to the same units as the original data, which may be convenient for some problems (typically in engineering) where the units happen to be the same for all the variables. Finally, the $ U$-transform scales the resultant vectors to unity and hence the coefficients are always in the range $ \pm1$, which may be useful for more general diagnostic purposes (eg. in the analysis of shapelet functions; see [165]). Note that, in this latter case, the variance of each component is then equal to the corresponding root.

4.2.3 Using correlation

The choice of the forms of scaling presented above will be influenced by the type of application and the nature of inferential statistics required. In practice, however, it is often more appropriate to work with the correlation matrix, $ R$, rather than that of covariance ($ S$), although this may be less powerful statistically. The reason for this is that PCA is sensitive to additional variance which may be introduced by:

If uncorrected, these sources of variability will easily dominate the first few component terms. It should be borne in mind though, that the variables which result from using correlation acquire similar importance. Also, there is no obvious correspondence between components derived from correlation and covariance matrices. It is possible, however, to characterise the relationship between original variable, $ i$, and the component, $ j$, using the following:

where $ l_{j}$ is the $ j$th root of the diagonal matrix $ L$ and $ u_{ij}$ is the $ i,j$th element of the matrix of eigenvectors, $ U$ [103].

4.3 Using PCA


4.3.1 Assessing the number of components

Amongst the many properties of latent roots, we may cite two of special interest:

  1. $\displaystyle \left\vert S\right\vert=\left\vert L\right\vert=l_{11}l_{22}...l_{pp}$

  2. $\displaystyle trace\left(S\right)=trace\left(L\right)=l_{11}+l_{22}+\ldots+l_{pp}$

ie. two important properties of the covariance matrix $ S$, the determinant $ \left\vert S\right\vert$ (sometimes known as the generalised variance) and trace (defined as the sum of the diagonal terms), can be represented respectively by the product or sum of the latent root terms. This is especially useful since it implies that the largest root indicates the most important component. These properties also provide a mechanism to assess the importance of a particular component quantitatively. Several methods exist to perform this kind of analysis [134, Sec. 2.8, p.41]. The simplest is simply to take the ratio, so that $ P_{j}$ defines the proportion of variation accounted for by the $ j$th term as follows:

$\displaystyle P_{j}=\frac{l_{j}}{trace(S)}$

Another useful technique -- the `scree' plot -- depicts each eigenvalue as a sequence from largest to smallest. Those components which then lie above the `scree' at the base of the slope are thought to be the most useful candidates for further analysis, see Fig. 17.

Figure 17: Example scree-plot from R [105]
Image 17_home_richardh_reports_ds6_alg_Rplot001

An alternative stopping point is the Kaiser criterion [26], according to which we only retain factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much information as the equivalent of one original variable, we drop it. It is noted that a more sophisticated test, developed by Bartlett and extended by Anderson, also exists to provide a significance test for the equality of smaller roots. Having chosen to adopt a given number of components, it is also possible to assess the error from the `lost' components by inverting the system represented by Equation 4 and using residual analysis [134, p.34,87].


4.3.2 Understanding PCA

Although used widely for decades, many authors have given warnings about the wisdom of interpreting PCA results. Jackson has observed that the physical importance of a component is not necessarily related to the size of the characteristic root; indeed, the equivalent variation in a lesser component might have a greater effect on the human eye or ear and hence scaling might be a more important factor in visualization. The use of inferential statistics such as Hotelling $ T^{2}$-test may help, although this would probably not qualify as an exploratory technique [134, p.55].

Everitt, on the other hand, favours the `labelling' of PCs, albeit with caution, and advocates the re-scaling of variables (eg. by negating) if this makes the process easier [103]. Whether or not labels can be attached, however, a useful graphical technique is to take the first two or three components and perform pairs plots (or a scatterplot; see 2.4.2) to identify clusters and outliers. To facilitate this, PCA `scores' can be obtained by taking the element-wise product of the $ m$ chosen components with $ m$ data terms ie.  $ u_{ij}^{\prime}x_{ij}$ for $ j=1,\cdots,m \forall i$.

Fig. 18 shows the use of PCA scores in two different problems. In the upper plot (a), the iris data are separated largely by the first principal component and although the groups are not distinct, there is relatively little overlap (cf Section 2.4.3). In the lower plot (b), three components are distinguished by classification into two groups, ie. star or galaxy for the SuperCOSMOS data (see A.1). This difference is highlighted by quadratic regression surfaces through these scores (cf Fig. 4).

Figure 18: Using PCA scores to plot the data. Top (a): a maximal linear separation of the iris data using first and second components (colours indicate the separately identified species); bottom (b): regression planes of PCA scores of the SuperCOSMOS data (generated by R Commander)
(a) Image 18_home_richardh_reports_ds6_alg_figures_r-pca-iris



(b) Image 19_home_richardh_reports_ds6_alg_figures_r-pca

These surfaces show some degree of separation along the first component axis, but their convoluted shape makes this representation seem less than satisfactory, even when viewed with rotation in 3D. A more useful technique, suggested by Everitt [103], may be to omit a particular variable of interest from the PCA and then to plot/regress scores from the first few PCs against the raw data for the selected variable. The effect of outliers in these plots should again signal a note of caution, and it may be advisable to choose more robust estimators for covariance in these cases.


4.3.3 Re-interpretation by correlation

A surprising feature of PCA is that, although the vector components are uncorrelated, the individual coefficients are not. On reflection, this is less surprising within each component since the terms are usually constrained by their sum (see Table 3) but it turns out that the coefficients of adjacent components are also often strongly correlated. This would seem to indicate that correlation tables and perhaps (graphically) parallel coordinates may be able to help, especially if components with small roots might be important, since, otherwise, these would likely be rejected.

4.3.4 Rotation of PCs

A more widespread technique of re-interpretation is to attempt to rotate the PCs until they are aligned with some notionally simpler system, usually so that the coefficients are as close to zero or one as possible. If the new system has $ k$ components assembled as $ B=\left[b_{1}\mid b_{2}\mid\hdots\mid b_{k}\right]$, then the requirements may be stated as:

  1. Each row of $ B$ should contain at least one zero, so that each of the original variables is uncorrelated with at least one of the rotated components.
  2. Each column of $ B$ should contain at least $ k$ zeros, the more the better.
  3. Each pair of columns should be distinguished by the presence of several ones (or zeros) at particular locations. If $ k\ge4$, the aim should be to increase the number of zeros, so that the matrix remains fairly sparse.
These methods, in essence, provide a simple clustering of variables, which (it is hoped) are independent. But the rotated variables are no longer uncorrelated. The forms of rotation can be summarised as follows:

Many rotation techniques had their origin with factor analysis and are discussed further below, but it is fair to say that they have had a chequered history, with many objections to their ad hoc application. They have been most useful in psychological and market research and may be more valuable with very high dimensional data.


4.3.5 Removal of variables

Instead of rotating the system to maximise (or minimise) the influence of variables of interest, Mardia et al [150] have suggested that PCA may be used to remove variables directly. First, PCA is undertaken and a method selected to find the most appropriate number of components, $ k$, such as by scree-plot. Then one begins a process of deselection by examining the least significant PC and rejecting the variable associated with the largest coefficient. The variable associated with the largest coefficient of the second least significant PC is then rejected and so on, until only $ k$ variables remain.

4.3.6 Non-linear PCA

It is noted here that PCA has also been developed to provide an analysis of principal curves, particularly useful where it is suspected that the primary relationships between variables are non-linear. This approach is available in S-PLUS and R pcurve library, based on work by Hastie and De'ath (see [35]), and has also been applied to astronomy data via neural networks (see 5.8.4).


4.4 Factor analysis

4.4.1 Difficulties with PCA, FA and rotation

One of the main objections to PCA is the absence an underlying statistical model. It is difficult, therefore, to interpret the meaning of the resulting components with any certainty or to give an estimate of the effect of sampling error. Whilst FA overcomes some of these issues, it introduces new problems. To start with, it must be assumed that the underlying factors actually exist, and that, where there is more than one factor, that the factors are independent of each other and any error terms (see Equation 4) [134].

For many authors, another disquieting feature was the willingness of statisticians to employ rotation techniques to find more readily amenable factors. Since these rotations often appeared to be equally plausible, this suggested a considerable degree of subjectivity. The result, particularly for FA, was a considerable amount of adverse publicity. In 1989, Chatfield and Collins observed that few cases existed where the technique had proven to be useful outside the field of psychology [84]. But in recent years, a significant step-forward has been the development and inclusion of maximum-likelihood procedures in software packages, such as SPSS [61] and R, and applications of FA in science and engineering are much more common (see eg. [119,140]).

4.4.2 Background to factor analysis

In PCA, the fact that the sum of the latent roots and the sum of the original variance is the same (4.3.1) is often interpreted as supposing that PCA accounts for all the variation in the data. FA, on the other hand, is said to explain correlations. In the early 20th Century, Spearman observed that the ratio of off-diagonal covariance was often similar between rows, suggesting the presence of underlying `factors' [147].

The general FA model is stated in Equation 4. Ideally, the variable $ X_{i}$ in this model is the $ i$th test score with mean zero and unit variance; the coefficients $ a_{ij}$ are the factor loadings for the $ i$th test; and $ F_{i}$ ( $ i=1,\hdots,k$) are uncorrelated common factors, also with mean zero and unit variance. It is assumed that the variables($ X_{i}$) are correlated with the factors, but are otherwise uncorrelated with each other. With this model:


$\displaystyle Var\left(X_{i}\right)$ $\displaystyle =$ $\displaystyle a_{i1}^{2}Var\left(F_{1}\right)+a_{i2}^{2}Var\left(F_{2}\right)+\hdots+a_{ik}^{2}Var\left(F_{k}\right)+Var\left(e_{i}\right)$  
  $\displaystyle =$ $\displaystyle a_{i1}^{2}+a_{i2}^{2}+\hdots+a_{im}^{2}+Var\left(e_{i}\right)$  
  $\displaystyle =$ $\displaystyle 1$ (11)

The $ a_{i1}^{2}+a_{i2}^{2}+\hdots+a_{im}^{2}$ term is called the communality of $ X_{i}$: the part of the variance that is related to common factors. It's complement, $ Var\left(e_{i}\right)$, is called the specificity of