next_inactive up previous

pdf version

Dimension Reduction Algorithms
for
Data Mining and Visualization

Richard Holbrey (VOTech/DS6)

University of Leeds/Edinburgh




Contents

1 Introduction

1.1 Background

The need for visual insight into data has been understood for nearly a century, but many years the field only progressed within the realm of statistics (see eg. Tukey [197]). Attempts to popularise the subject more generally were somewhat limited1 and developments in data mining and visualization owed little to this work: data mining researchers were spurred on by the advance of database technology and the creation of huge commercial datasets in early 1990s [209] (see 1.2); the impetus for a more dedicated visualization effort came in 1987, with the publication of the NSF Report [154], which recognised the need to render the huge datasets resulting from engineering advances in fluid dynamics, brain scanning etc.

Since that time, researchers have expended a good deal of effort developing a series of powerful, interactive applications. Indeed, some analysts are now able to consider the fidelity of their visualizations by direct comparison with the original data [77]. Recently, however, the literature reveals a slightly more introspective mood [200]. The growth of `visual analytics', around questions of homeland security, has emphasised the danger of information overload and the need to harness new technology, such as high resolution and stereo monitors, in order to maintain an overview of the data [191]. Other researchers have stressed the need to grasp the semantics of visualization, and for the creation of ontologies, to ease communication between scientists using, for example, the e-Science grid [99].

Despite these advances, the problem of finding suitable methods for the exploration and analysis of large astronomical datasets remains formidable [148,208]. This document was instigated at the request of members of the VOTech DS6 team in August 2005 with the aim of providing a synthesis of relevant data mining and visualization algorithms, towards Stage 2 of the project [38]. The emphasis is on methods of bridging the gap between accurate representations of the data and the capabilities of current technology and users (see 1.2, below). Methods of dimension reduction of data are therefore prominent. Some mathematical background is given in this area (especially with respect to eigenanalysis) although readers are generally referred to other texts: supporting references have been drawn from Jackson [134], Manly [147], Rao et al [206] and Hastie et al [123].


1.2 Data mining and visualization: some definitions

1.2.1 Data mining

Data mining (DM) involves the retrospective analysis of large bodies of data, with the intention of recovering new facets which were hitherto unknown. Statisticians, however, remain divided over the need for the formulation of a priori hypotheses. On one hand, researchers have welcomed the chance to try new methods upon existing data sets without such constrictions. On the other, this has meant the abandonment of the probabilistic roots of the profession. In particular, it is true that data which has not been generated by deliberate random sampling cannot be generalised to the population at large. Nevertheless, the sheer size of some data sets indicates a measure of validity.

In this light, Wegman and Solka have defined a number of feasibility limits for data mining based on current processing power, the cost of storage and streaming capabilities [206]. For example, an algorithm was considered feasible if, given the size of the data set and the complexity of the algorithm applied, it could be completed in less than a week. At the time of writing (2004), this was thought to be achievable for most $ O\left(n\right)$ type problems without need for special computing resources. For $ O\left(n^{2}\right)$ algorithms, eg. clustering, this limit restricted the size of data table which could be processed to around $ 10^{6}$ entries ie. `medium' sized data tables, see Table 1.

Table 1: Data table sizes and execution times (in seconds) for algorithms of various complexities, executing on a 1 gigaflop machine (from [206])

Table size entries $ O\left(n\right)$

eg. $ \mu$, $ \sigma^{2}$ etc.

$ n\log\left(n\right)$

eg. quicksort

$ O\left(n^{2}\right)$

eg. clustering

small $ 10^{4}$ $ 10^{-5}$ $ 4\times10^{-5}$ $ 0.1$
medium $ 10^{6}$ $ 10^{-3}$ $ 6\times10^{-3}$ $ 1004.4$
large $ 10^{8}$ $ 0.1$ $ 0.78$ $ 116  days$
huge $ 10^{10}$ $ 10.02$ $ 100.1$ $ 3170  years$


A defining characteristic of DM is that data sets are often so large that -- without special hardware -- it may be impossible to load the complete dataset into memory at once (see [36]). It follows that the objective of DM searches is usually more exploratory than experimental ie. performed to generate hypotheses, rather than to test them. Also, it suggests that methods are needed initially to process the data as a stream, perhaps with some loss of precision, and that, later, a subset of the data will need to be selected.


1.2.2 Exploratory data analysis

A further distinction is sometimes made between exploratory data analysis (EDA) and DM. EDA makes use of various techniques designed specifically to identify patterns in multivariate data (or univariate sequences), such as factor analysis, clustering and so on (see Section 3). But the focus of DM is on producing a solution that can generate useful predictions. DM researchers, therefore, have a much greater acceptance of `black box' approaches, such as neural networks. Whilst these are certainly capable of generating valid predictions, it is not possible to identify the nature of the interrelations between the variables on which the predictions are based [183]2.

Scott also points out that exploratory analyses should ideally be non-parametric, despite the fact that assumptions about normality are common. Many parametric techniques are reasonably robust to deviations from normality, but for exploratory work the results can occasionally be very unpredictable [173]. This point has recently been echoed by Gray and Moore [118], who agree that non-parametric methods ``are generally the method of choice in exploratory data analyses...[but] often come at the heaviest computational cost.'' This issue is discussed further in Section 5.3.


1.2.3 Visualization

The role of graphics in data analysis has long been recognized as important, if not universally. In the 1930s, Fisher wrote: ``Diagrams prove nothing, but bring outstanding features readily to the eye; they are...no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them'' [173]. Despite this realization, however, it was not until the 1960s that Tukey finally made them respectable [196, p.53]; and not until 1987 that visualization was recognized as separate discipline with distinct aims.

Since that time, a good deal of effort has been expended to develop a number of powerful, highly interactive applications: Iris Explorer [1], Amira [2], Spotfire [57], Visage [208] etc). Humans are particularly good at detecting certain patterns and interactivity-based approaches are aimed at engaging our senses in order to aid our cognitive abilities. The hope is, in effect, to let the user `see' something new and from this perspective, interactivity is vital [66,64,175]. Against this view, however, various costs must be set:

It should also be realised that many effective visualizations have been created from careful consideration of the data and entirely static displays. As a result, Tufte places qualities such as design and data-ink maximization in much higher regard [194,196]. These issues are explored further in Section 2, and, given the difficulty of handling high dimensional data in 2D displays, the capabilities -- and interactivity -- of various software tools, notably HCE [174], R [164], Topcat [3] and XmdvTool [62] are also described.

1.3 Approach

1.3.1 The nature of the data

The approach adopted here is essentially borne of practical necessity. Queries of astronomical catalogues generally result in large data tables which may be hundreds of columns wide (representing the number of measured variables) and contain $ 10^{3}$ - $ 10^{9}$ rows (the number of observations), placing these data sets easily within Wegman and Solka's medium to large categories ( $ \leq10^{8}$ table entries, see above). In preliminary work up to this point, it was noted that some software tools -- perhaps the majority? -- work reasonably well with data sets of the order of MB (`small' to `medium' sized), but struggle with data tables of larger sizes for two main reasons:

Further discussion of these issues is available online at [37,38].


1.3.2 Multivariate analysis and `knowledge discovery'

It was thought convenient to treat the task of dimension reduction as two separate issues relating to (a) column reduction, and (b), row reduction. These topics are introduced in Sections 3.2 and 3.3. Readers who are already familiar with this territory are referred to the more succinct report by Fodor [108]. The treatment here is rather more in depth and a number of themes are apparent. The first, and probably the strongest, is that of the analysis of covariance leading to the techniques of principal components and factor analysis (amongst others). Indeed, an understanding of the difficulties and differences between these techniques is useful towards the understanding of the subject (multivariate analysis) as a whole. For this reason, these techniques are described in some detail in Section 4.

Another important theme is that of adopting a cyclic approach to data mining. A misleading term, which occurs in the literature, is knowledge discovery. The implication is that, after some processing step, domain experts should able to interpret newly `mined' data with much greater clarity than was previously possible. Wegman and Solka point out that this only really becomes possible if a considerable amount of effort is expended in cleaning up the data first. Any analysis must therefore be considered as part of a general data flow. Fig. 1 (adapted from [206, fig. 1]) illustrates this flow as a cycle to emphasise that each step in the process should be seen as one of data refinement, rather than as an immediate endpoint and visualization is no exception

Figure 1: Data flow
\begin{figure}\par
\begin{center}
\par
\input{data_flow.pstex_t}
\par\par
\par
\end{center}\par
\end{figure}
. Performing a principal components analysis, for example, may not produce immediately usable results (since the axes are difficult to interpret) but could be used to maximise the linear separation of clusters (for visualization) or to reduce the set of variables under study (selection criteria).

1.3.3 Data preparation

The development of astronomical instruments and procedures for data gathering and storage is beyond the scope of this report. Nevertheless, it is useful to say something about the level of processing which can normally be expected. It is common, for example, for data to be calibrated depending upon the instrument used before the data is stored, but additional steps may be required to achieve the highest possible levels of precision. A number of software tools are able to assist here, such as the Aips++ library [32], and others provide facilities for immediate analysis, notably IDL [47], Aladin [4] and Xmatch [65,5]. Also, there are standardisation issues to do with the adoption of various naming schemes, units and so on, although it is hoped that the creation of the IVOA UCD and VOTable standards [6] may make this aspect easier to handle in future.

One of the most obvious features of Fig. 1 is that curation and preparation of the data forms the bulk of the work. In particular, care is needed to remove errors (noise, missing values etc) and to select appropriate variates for analysis. This may require domain expertise, since it may be necessary to:-

Where the data need to be kept, it would seem to be desirable to be able to store the upper and lower limits of a given process. Although such limits might be stored in the VOTable format, it seems that these boundaries must often be inferred from the data and hence a satisfactory solution to this problem remains elusive. Otherwise, for removing potentially conflicting data, scatterplots and other visual techniques (see 2) may be useful step in identifying outliers or clusters. It should be noted, however, that it is the view of the DS6 group that visualization tools may be able to assist in all stages of this cycle, not just in the initial assay or the final presentation. The cleaning and normalisation of data are discussed in Section 4.

1.3.4 Objectives

Given the number of different but relevant domains upon which this research might touch, the scope of the present report is restricted to:-


2 The Limits of Visual Thinking

2.1 Background

Visualization is a complex research area which has developed over the past few decades by building on theory from information design, computer graphics, human-computer interaction and cognitive science. In view of this conglomeration of ideas, the title (`limits of visual thinking') is intended to reflect that the development of computing algorithms must be matched by an understanding of the psychology of the user. In other words, it is not sufficient for researchers to demonstrate a given algorithmic complexity to ensure a successful visualization, since little is achieved if the display hinders rather than helps. However, rigorous user testing of visualization software is both difficult and rare [85]. In order to get some appreciation for the current limits of hardware and software, therefore, this section gives a brief survey of available tools but begins by revisiting some of the guidelines for good graphic design set out by Tufte (see also [42]) and Bertin [75].


2.2 Guidelines for visual displays

2.2.1 Above all else, show the data

Tufte has recently published a harsh critique of NASA engineers' usage of MicroSoft PowerPoint©  instead of producing formal written reports prior to the Columbia disaster in 2003 [192]. This is a new variation of an old complaint: in the past, many authors have often exempted themselves from working with the data at hand and allowed new formats to obscure the end conclusions. Inappropriate scaling and ordering of data are common symptoms (see 2.2.4). Ultimately though, this is an issue of integrity and authors should be encouraged to think through, prepare and attach their own names to their own work [196,195, ch. 2].


2.2.2 Juxtaposition: contrasts and parallels

Bertin observed that tabulating data is an essential part of constructing a well-defined problem, and may require a good deal of imagination. But a homogeneous table is unlikely to be effective in capturing the eye. The next stage of the process, therefore, involves graphic illustration, but this should leave open the question of the organisation of the table. Indeed it is unlikely that the first effort will be optimal. Exposing contrasts in the data involves some skill at the removal of unnecessary data-ink, juxtaposition and scaling [87]4. This may be automated to some extent, since it is partly based on patterns of sectors in the data or graphic, but still requires (re-)organisation in line with semantic reasoning [75].

Similarly, Tufte has argued that presenting similar graphics side-by-side can enhance the visualization by suggesting motion between the images, as if in animation. This motion indicates greater dimensionality and, hence, helps to reinforce the differences between each `frame' [195, ch. 6]. It is also one of the guiding principles behind the scattergram displays (see 2.4.2) and those of Andrew's Curves (see 2.4.5). Tufte has also proposed that `sparklines' can give intense word-sized contrasts, as shown in Fig. 2, which may be still more effective when used in multiples [193].

Figure 2: Using `sparklines' (from [193])
Image 0_home_richardh_reports_ds6_alg_figures_sparklines

2.2.3 Maximize the data-ink and the data density ratios

A common pitfall of new graphic designs (and 2D/3D computer displays) is to emphasise line-art and to overlook supporting orientation or scaling data. Tufte describes the effect as one of proliferating `chartjunk'. Often, the added weight of this detail tends to impose on the graphic and masks some of the meaning. Two quantities describe this problem:

\begin{displaymath}
\begin{array}{ccc}
Data  ink  ratio & = & (data  ink)/(to...
... entries  in  data  matrix)/(area  of  graphic)\end{array}\end{displaymath}

In [196], Tufte advocates the removal of all non-data ink and any redundant ink. The axes of a scatterplot, for example (see 2.4.2), might be trimmed to show only the range of the data or to show median and quartiles. In this way, the data-ink ratio is maximized and the possibility for gratuitous self-decoration and moiré effects is removed. At the same time, thinned or greyed-out lines and varied line thickness can be used to great effect, if, say, a background grid is really needed.

In computer graphics, it might be supposed that the equivalent to the data-ink ratio would be data pixels (pixels representing data elements) to non-background window pixels; the data density might be data pixels to window pixels. Although providing useful concepts, these quantities may not be entirely appropriate, however, for electronic displays (see below).


2.2.4 Emphasise causal and multivariate relationships in data

Whilst data may not always contain cause and effect type relationships, the organisation of data plots is critical if such relationships are to be exposed. An oft-quoted example is that of the problem of `O'-ring failures on the Challenger space shuttle in 1986. NASA engineers went to some lengths to describe the risk prior to the launch, by using plots of rocket-shaped symbols with various kinds of cross-hatching to indicate particular failures. The rocket symbols, however, contributed nothing to the readability of the graphic and since they were ordered by the testing sequence, the explanatory variable (temperature) was completely obscured.

Tufte's reordering the plot, so that the symbols appear at the correct location along the temperature axis, makes the decision to launch looks a good deal riskier [195, p.49]; by removing the symbols and using a weighted metric to describe an overall damage statistic, this relationship is very much clearer [195, p.45]. More generally in scientific testing, it is necessary to employ experimental controls to provide a baseline for comparison. This baseline then permits us to relate variations in cause with variations in effect. Without such controls, we must assume that there may be many possible causes [195, p. 50-52].


2.3 Initial assessment and standardisation

Although some graphical techniques are useful for standardisation, it is useful to consider some of the basic requirements for cleaning data before proceeding. In the first place, the question of missing data needs to be tackled. Many software packages simply ignore data units with missing entries and this seems acceptable where there is sufficient data to do so. Often, however, this may not be possible or may even be unwise if there is some underlying cause which demands further explanation [104,147] (also note the earlier comments about keeping data in Section 1.3.2).

Scatterplots (see below) are undeniably useful for initial assessment, and it would be unusual today if this were not one of the analyst's first steps. Apart from gaining insight into the distribution, the nature and influence of outliers needs to be gauged. One method is to plot regression lines over the scatterplot, which may be locally weighted if the distribution seems uneven (see Fig. 4 and Appendix A.1 for a description of the data). Finding and possibly removing the `boundary' or convex hull of the data or is one way of assessing the influence outliers, see Fig. 3.

Figure: Left: convex hull plots define bounding boxes to show potential outliers; Right: `chi-plots' (with boundary points removed) indicate departure from independence if most of the points lie outside the horizontal band (after [103]). Note red=`star' and green=`galaxy' candidates; see Fig. 4, which shows combined views of these data.
Image 1_home_richardh_reports_ds6_alg_figures_1-2-chi-plots

Image 2_home_richardh_reports_ds6_alg_figures_1-3-chi-plots

In these plots, outliers do not appear to be particularly distant and keeping or removing the boundary points does not affect the adjacent chi-plots. The basic principle of these is to assess the independence of variables using the expected $ \chi^{2}$-distribution, where dependency is indicated by deviation from the central band running horizontally across the centre of each plot. In this case, the lower plots show a greater level of structure (ie. less independence), since relatively few points lie in this zone [103]. Note that the presence of two supposed groups may be more rigorously tested by the use of canonical variate analysis (see 5.6.1). The detection of outliers is discussed with reference to self-organised maps in Section 4.6.

A common requirement before proceeding with many forms of analysis is to scale the data, so that the range or units of the data do not figure. Conversion to Z-score (autoscaling) is probably the most common technique, so that the distribution has unit variance, although division by the sample range has been shown to be more robust in many clustering applications [104]. Principal components analysis and some clustering techniques provide ready made methods of normalization which are discussed later. For the multivariate case, autoscaling can be achieved by employing the covariance matrix, $ S$, where the transformation is:

$\displaystyle Z=S^{-\frac{1}{2}}(X-\mu)$ (1)

given that $ X$ describes the distribution of $ p$ variates $ \mu$ contains $ p$ sample means. It is possible, though for large problems, not generally advisable, to obtain $ S$ by using the identities which eigenanalysis provides (see 4.2.2).


2.4 Plotting in 2D

2.4.1 Issues with reducing nD to 2D

Given the high dimensionality of most astronomical datasets, reduction to 2/3D for visualization purposes seems unavoidable. Also, the preference for many researchers is naturally for 2D graphics, so that (amongst other things) publication drawings can be considered. Various techniques for plotting in 2D are considered in this section.

Where the dimensionality gets very high, however, several problems arise with reduction to 2D [173,205]. If we consider the hypercube $ \left[-1,1\right]^{n}$, the $ 2^{n}$ diagonals, $ {\bf v}$, from the origin to the corners are of the form $ \left(\pm1,\pm1,\pm1,\cdots,\pm1\right)^{T}$. The angle between the diagonals and the Euclidean axes, $ {\bf e}$, is given by $ \theta$, where:

$\displaystyle \cos\theta=\frac{\left\langle {\bf v},{\bf e}\right\rangle }{\sqr...
...ht\rangle }}=\frac{\pm1}{\sqrt{n}}\quad\xrightarrow[n\rightarrow\infty]{}\quad0$ (2)

so that, $ \theta\rightarrow90{}^{\circ}$ and the diagonals are nearly orthogonal to the axes. Table 2 shows this angle for progressively higher dimensional multiples:

Table 2: Angle between Euclidean axes and diagonals
$ n$ 2 4 8 16 32 64 128 256 512 1024
$ \theta ({}^{\circ})$ 45 60 69.3 75.5 79.8 82.8 84.9 86.4 87.5 88.2


In 2D, such diagonals would tend to become invisible, so that a simple `shadow' projection down to lower dimensions would tend to lose data near these diagonals.

In higher dimensions, the volume of hyperspace also appears distorted, so that the distribution of data is also counter-intuitive. Since the area of a circle can be written as $ c_{2}r^{2}$ and the volume of sphere as $ c_{3}r^{3}$, we can generalise the volume of a hypersphere as $ c_{n}r^{n}$. It is easy to show that the volume of a thin hypershell relative to that of the hypersphere approaches 1 as $ n$ increases. In other words, most of the volume occurs within the shell. The result is that if the data were uniformly distributed within the hypersphere, most data would, in fact, lie within the shell. However, by projecting the data onto a 2D plane, most of the data will appear to lie near the origin of a circle, thus giving an unrealistic impression [205].


2.4.2 Scatterplots

A typical scattergram (ie. a collection of scatterplots) is shown in Fig. 4, showing some of the SuperCOSMOS data set as depicted in R, with colour used as an additional `dimension' to reflect different classes in the data (here red implies `star', green a `galaxy'). Due to the linear nature of the plotting algorithm, the number of points which can be plotted is almost unbounded (but see below). There is a good deal of redundancy in the display (a poor data-ink ratio) since each smaller graphic is plotted twice, although statisticians seem to prefer this, so there is no particular bias to `x' or `y' axes. Some applications can make use of the plot windows on the diagonal eg. to plot histograms, as is the case here. The blue curvilinear feature in the scatterplots is a locally weighted regression (LOWESS) curve, after Cleveland [164]5.

Figure 4: SuperCOSMOS data scatterplot
Image 3_home_richardh_reports_ds6_alg_figures_col-pairs

Nevertheless, a practical limit of 10-15 variates exists for most displays. The use of `coplots' in R or dimensional stacking in XmdvTool [211] can make some relationships more explicit for, say, up to 5 variables.

Scott has observed that where the number of observations is large, the areas of greatest density lose visual impact and the eye is drawn to outlying data in the tails of the distribution [173]. One response has been the development of sunflower plots, which make use of hexagonal bins to form a 2D histogram [101], Fig. 5. Individual observations are plotted when there is only $ 1$ observation per bin as in a conventional scatter plot. Each bin with from $ 1-p$ observations contains a light sunflower. Other bins contain a dark sunflower. In a light sunflower each petal represents one observation. In a dark sunflower, each petal represents $ k$ observations. (A dark sunflower with $ p$ petals represents between $ pk-k/2$ and $ pk+k/2$ observations.)

Figure 5: Sunflower scatterplot (from [101])
Image 4_home_richardh_reports_ds6_alg_figures_sunflower


2.4.3 Parallel coordinates

Described by Inselberg over two decades ago [133], many tools are now capable of rendering parallel coordinates. The basic premise is very simple: instead of plotting data by using orthogonal axes, the axes are maintained parallel and the data plotted as lines across the axes. In this space, therefore, a point is mapped to a continuous line (and vice versa). This has the effect that the amount of data-ink increases relative to the background, but may obscure the display. Also, since it is common for the axes to be automatically scaled to the maximum and minimum for each variable, it is usually difficult to assess directly the correlation between any pair (cf. [205]).

Examples using the XmdvTool display are shown in Fig. 6, using a subset of 20 variables from the SuperCOSMOS dataset (1000 rows) [7].

Figure 6: XmdvTool: (a) parallel coordinate display, with brushing (blue-shaded region) to emphasise one possible pattern; (b) with hierarchical clustering (see text) [212]
(a) Image 5_home_richardh_reports_ds6_alg_figures_supercos_brush



(b) Image 6_home_richardh_reports_ds6_alg_figures_supercos_hierarc

The upper plot (a) emphasises the brushing technique, in which the `brushed' area is used to highlight a perceived group in the data, which might be used as an input for spatial restriction (note that Mirage employs this tactic; see 3.3.1). In the lower plot, (b), a simplified representation of the data was created by the hierarchical clustering feature of XmdvTool. Note that the clustering is effective in reducing the amount of clutter in the display but these seem to be rather too arbitrarily defined. As another illustration of this artefact, the well-known `iris' dataset displays 4 clusters, where there are only 3 species present and where, in fact, only two groups are linearly separable [48] (see Appendix A.2 and [173, Pl.1]). This exposes one of the difficulties with the hierarchical clustering technique, ie. in obtaining valid clusters, see 3.3.3.

A further problem is apparent when the number of variables increases (beyond, say, 15-20) in that it is difficult to separate the axes or associate axes with the respective variables. In fact, as the distance between axes decreases, the display becomes saturated with `data-ink' and any visible pattern is lost. An interesting response to the saturation problem was proposed by Lederman and implemented in Parvis [41]. Here, the author has created a fading brush technique and is also able to highlight single strands of data to assist the user. As a more innovative use of the display, histograms can also be plotted against the axes to indicate the basic data distribution.

2.4.4 Glyphs

A classic example of the use of glyphs was that of Chernoff faces, Fig. 7.

Figure 7: Chernoff faces (from [8])
Image 7_home_richardh_reports_ds6_alg_figures_chernoff

The assumption behind this approach is that the human eye is particularly sensitive to facial expressions, so that by mapping variables to particular features (eyebrow height, mouth curvature etc), we may be able to distinguish characteristic elements of the data more easily. Many other types of glyph, including stars, weathervanes and profiles, have been proposed. Ward has observed that the success of these is highly dependent upon colour and glyph placement [203], since they may be discrete or overplotted as required.

Tufte has observed that despite the quixotic character of Chernoff faces, `ink' may saved by halving the faces, since they are usually symmetrical [195]. One problem which frequently occurs, however, is that of representing the key in glyph plots, eg. see Fig. 8. The use of text in the key creates a fairly large mental load and has the effect of distancing the user, even for this small data set.

Figure: XmdvTool: Subset of SuperCOSMOS data (see A.1) set as glyphs -- note key. Each radial spoke represents a separate variable in this case, with the length of spoke reflecting the magnitude of the variable. The colours of the glyphs reflect a hierarchical clustering
Image 8_home_richardh_reports_ds6_alg_figures_xmdv_glyph

Another criticism is that glyphs are asymmetrical ie. some glyphs have greater visual impact, and it is recommended that all combinations are tried [173].


2.4.5 Andrew's Curves

An interesting implementation of the `juxtaposition' principle (2.2.2) is that of Andrew's curves [70], in which each observation is represented by a combination of the (scaled) variates, $ x_{i}$, into a sine wave of the form :

$\displaystyle f_{x}\left(t\right)=x_{1}\sin\left(n_{1}t\right)+x_{2}\cos\left(n_{1}t\right)+x_{3}\sin\left(n_{1}t\right)+\hdots$ (3)

This allows each row of data to be plotted as a single unique wave-form, typically in the same display for comparison (see [9] for a graphic demonstration). A number of extensions to this technique have recently been advanced by García-Osirio to visualize self-organizing maps using various over-plotting methods [113], for which a more illustrative movie demonstration also exists at [112].


2.4.6 Linked displays and tours

Several tools provide a greatly enhanced level of interactivity by combining displays and allowing highlights to be broadcast from one to the other, eg. Mirage [52]. One of the first tools to make use of so-called `brushing' of data between displays was Arc (itself based on XlispStat) which - even to this day - remains one of the few tools capable of generating regression planes to describe data [89]. XGobi [185] and XGVis [81] also adopted the brushing technique, adding various methods to display data in parallel coordinate, 3D rotation and making use of tours (animations) particularly with in conjunction with dimensional scaling (see 4.6 below, CViz [120] and HiSee [45]). Another valuable property of these systems is that data can can be excluded to examine the effect of outliers.

Several commercial systems have now been developed in this idiom, such as Spotfire [57]. One system which deserves further mention, however, is Hierarchical Clustering Explorer (HCE) [44], see Fig. 9.

Figure 9: Hierarchical Clustering Explorer
Image 9_home_richardh_reports_ds6_alg_figures_hce-2

A unique feature of this tool is that 1D and 2D histograms are used to suggest interesting distributions (and hence variates) to the user (but note 3.2.2). However, the use of a clustering algorithm as data are input forbids the processing of `medium' or larger data tables (see Table 1). This issue also affects XmdvTool when using the InterRing [213] or clustering data selection tools.


2.5 Tree-based plots

In cladistics6, trees are common tools for representing the distance between species graphically. To represent such trees within limited computer displays, however, various forms of non-linear presentation have been adopted using `backbone' spanning trees and focus/context motifs, such as the i-Disc [127], or hyperbolic forms, such as HypViewer, see Fig. 10, which shows the current UCD1+ hierarchy [160].

Figure 10: Hypviewer
Image 10_home_richardh_reports_ds6_alg_figures_ucd

This system was capable of displaying some $ 10^{5}$ edges; in later schemes, using `focus + context' techniques, the authors claim to support over $ 10^{7}$ nodes (see TreeJuxtaposer [73,161,10]). The hyperbolic model of Hypviewer has been adapted into various later systems, notably Walrus [131], Inxight [11], Prefuse [125], Kartoo [12] and Touchgraph [13].


2.6 3D tools

Graph-based techniques have also been adopted within the visual programming environments of IRIS Explorer [1] and OpenDX [14] (see also ViSta [214]). Here, customised applications exist as modules of code which can be connected together graphically, so that data appears to flow around the system. For astronomy, VisIVO, which also employs a modular library approach (using VTK [60,172]), is currently being adapted for virtual observatory use [72]. A natural tendency of these systems has been towards the display of 3D, spatial data, making use of hardware rendering.

These tools have been particularly successful at visualizing scalar data and flow problems (see eg. [143] for more recent developments). For more general multidimensional data, there are several challenges. Figure 11 shows an example from VTK using `splatting' with the quasar data [59]. This technique aggregates the data by overlapping areas or volumes associated with each data point (here the u, g and r variables are described by the three axes; and i by the attributed colour scalar.

Figure 11: VTK `splat' example (see text)
Image 11_home_richardh_reports_ds6_alg_figures_vtk-ugri-splat

Higher dimensional (say 5D or 6D) data sets can also be rendered if glyphs, animation or contours are also be employed, although issues of occlusion and human perception (which varies particularly for colour) begin to take precedence [203]. However, the technique of hierarchical `splatting' (see Fig. 12) may offer faster rendering of very large tables of unstructured data [129].
Figure 12: Hierarchical splatting of n-body simulation (from [128])
Image 12_home_richardh_reports_ds6_alg_figures_splatting

Working in 3D (and higher) therefore requires that a number of issues must be faced:-

With respect to the above navigational issues, it is useful to describe research at Leeds undertaken by dos Santos [98]. An interaction graph was proposed to allow users to select variables for 1D, 2D or 3D plotting with additional, movable nodes along the spokes of the graph to select the range of the data and to locate the viewing position. For a 57-dimensional sample of the SuperCOSMOS data, however, it was necessary to expand the graph to a radial spiral shape, Fig. 13.
Figure 13: Interaction graph for high dimensional data
Image 13_home_richardh_reports_ds6_alg_figures_selan-57

Unfortunately (as with glyphs above), this scheme begins to distance the user from the data since it is no longer easy to display labels or keys, nor use the central spoke nodes to alter the range of the data.


2.7 Comment

It has already been observed that only the least complex (eg.  $ O\left(n\right)$) schemes are able to cope with `large' data sets as proposed. For this reason, the simpler schemes here are described more thoroughly here than might have been expected. Although this list is not exhaustive, numerous web-pages exist for comparisons elsewhere [39,16,17,18,199]. In addition, it is convenient here to make a number of further observations:


3 Dimensional Reduction


3.1 The curse of dimensionality

The `curse of dimensionality' (following Bellman (1961) [20]) refers to the exponential growth of hypervolume as a function of dimensionality. In most cases, this issue is manifested as one of data distribution, which is usually very uneven within the hyperspace. Hence data may be abundant in some regions, but inadequate in others, especially if statistical inferences need to be drawn.

For astronomy, however, it is recognized that some variables are of greater interest than others. In particular, error terms, although important, are generally of lesser interest than their respective observational parameters. Hence there may be a hierarchy of data terms, which may be different for each researcher. The proposed UCD1+ terminology also suggests a data hierarchy [95,49] which might be exploited. It was proposed to examine this issue further by investigating current tree-building tools, and developing, if necessary, some new method of configurconfiguringating the display of data (see Section 6.4). The remainder of this section considers methods of table reduction, by column and then row.


3.2 Column reduction

The main column (ie. variate) reduction methods are summarised in Fig. 14.

Figure 14: Column reduction techniques
Image 14_home_richardh_reports_ds6_alg_columns

A brief description of the elements in this figure is given below.


3.2.1 Manual Section


3.2.2 Automated selection


3.2.3 Cardinality


3.2.4 Factor Analysis


3.3 Row reduction

These techniques aim to give focus to particular units of data (often called `individuals') and, in this sense, provide a narrowing of the extent of the data. For example, the focus might be used as an input to select rows from a table. The techniques are summarised in Fig. 15 and below.

Figure 15: Row reduction techniques
Image 15_home_richardh_reports_ds6_alg_rows


3.3.1 Spatial restriction


3.3.2 Sampling and quantization


3.3.3 Clustering


3.3.4 Classification and discriminant analysis


4 Column Reduction Techniques

4.1 Introduction

4.1.1 Background

The techniques of factor analysis, principal components etc have developed over the course of the 20th Century, fuelled by research in fields as diverse as psychology and ballistics. One of the original aims was to extract the line or plane of maximal variance. Such a line, plane or hyperplane -- known as the principal component -- is actually one of regression about which the orthogonal error is minimised, see Fig. 16.

Figure 16: Principal components of a 2D data set (first: long axis, red; second: short axis, green)[54]
Image 16_home_richardh_reports_ds6_alg_figures_pca_pcs

The advantage of this kind of error minimisation is that it allows any one variable to `predict' another (within certain tolerances) and these are then described as predictor and response variables. Also, if this axis is adopted as the main component of interest (typically, by transforming the data to this axis), then the remaining variance (orthogonal to this plane) appears to be procedural in origin. Moreover, the variables under this transformation are now said to be uncorrelated ie. it is no longer true that one variable may be used to predict the other. Only after this step can the variables be considered to be independent and hence this term is avoided in multivariate analysis.

4.1.2 Methods and models

Factor analysis (FA) has often been confused with principal component's analysis (PCA) as the techniques are similar, although the underlying mathematical model is rather different [84]. Given the importance of these techniques to the present discussion, some background theory to FA and PCA is given below.

4.2 Covariance and correlation


4.2.1 Extracting factors

If we have $ p$ factors (usually columns of data, each representing a separate variable), we may denote the factors as $ F_{1},F_{2},\cdots,F_{p}$. Generally, we may suppose that these factors can be linearly combined for each of $ n$ observations, to give an overall index, $ X_{i}$ ( $ i=1,2,\ldots,n$), which can be written:

\begin{displaymath}\begin{array}{ccccccccccc} X_{1} & = & a_{11}F_{1} & + & a_{1...
...{n2}F_{2} & + & \cdots & + & a_{np}F_{p} & + & e_{n}\end{array}\end{displaymath} (4)

If $ p$ is large, however, it is advantageous to select the factors (and coefficients) of greatest interest -- usually those factors which display some interesting distribution or the greatest variance. If some factors are to be omitted because their contribution is thought to be negligible, then the error term, $ e_{i}$, in each of the expressions of Equation 4 increases. That is to say, this term acknowledges that some error has been introduced.

In Equation 4, the coefficients $ a_{ij}$ represent a matrix, denoted $ A$. But the usual starting point for this matrix is the sample covariance matrix, $ S$, which is defined by the relation:

$\displaystyle s_{ij}=\frac{n\Sigma x_{ik}x_{jk}-\Sigma x_{ik}\Sigma x_{jk}}{n\left(n-1\right)}$ (5)

where $ s_{ij}$ are elements of $ S$ ( $ i,j=1,2,\ldots,p$) , $ n$ is the number of observations and $ k$ is used to sum the values in columns $ x_{i}$ and $ x_{j}$ from 1 to $ n$. In this case, $ S$ will be $ p\times p$ symmetric and the diagonal values ($ s_{i=j}$) will be the usual column variances. It is convenient here to point out the similarity between $ S$ and the correlation matrix, $ R,$where:

$\displaystyle r_{ij}=\frac{s_{ij}}{s_{ii}s_{jj}}$ (6)

It turns out that the factors, $ a_{ij}$, may then be drawn from the characteristic forms of the $ s_{ij}$ or $ r_{ij}$ matrices, which arise from eigenanalysis [103]. In fact, if $ S$ is nonsingular (ie. the determinant of $ S$, ie.  $ \left\vert S\right\vert$ is non-zero), it can be shown that:

$\displaystyle U^{\prime}SU=L$ (7)

where $ U$ is orthonormal, so that $ U^{\prime}U$ is identity ($ I$). Since $ S$ is relatively easy to compute, its decomposition to $ ULU^{\prime}$ is more natural. Under these conditions, $ L$ is a diagonal matrix, the elements of which comprise the latent roots or eigenvalues (literally `characteristic values') of the system. The columns of $ U$ then give the eigenvectors associated with each root and thereby define the principal components. These roots may also be obtained by supposing that for some constant $ \lambda$ and vector $ {\bf u}$, there is a non-trivial solution to the problem:

$\displaystyle S{\bf u}=\lambda{\bf u}$

so that:

$\displaystyle \left(S-\lambda I\right){\bf u}=0$ (8)

Hence, ignoring trivial cases, there will be solutions to this system for characteristic vectors $ {\bf u}$ where $ \left\vert S-\lambda I\right\vert=0$ (see [40]). Each solution for $ {\bf u}$ then has an associated eigenvalue $ \lambda$, which may be assembled (respectively) into the $ U$ and $ L$ matrices as described above.

We may visualize this transformation as a rotation of the axes along the axes of maximal variation and, if normalised, the columns of $ U$ give the direction cosines between the old and new axes. Determinants may be used to obtain such roots for small problems, but for large systems, iterative `power' methods are more appropriate. The Lapack library provides a well-tested and freely available set of such routines, using eg. single value decomposition [58,68].


4.2.2 Principal component transforms

It is useful to consider several further extensions to Equation 7 so that we may elucidate the process of PCA and data transformation (including `sphering'; see Equation 1). Jackson [134], for example, employs the identities:

\begin{displaymath}\begin{array}{ccc} V & = & UL^{1/2}\ W & = & UL^{-1/2}\end{array}\end{displaymath} (9)

to give the results shown in Table 3.


Table 3: U,V,W vector products of PCA (after Jackson [134])
  PCA transform Squared terms Notes
U-vector $ z_{i}=U^{\prime}\left[x_{i}-\overline{x}\right]$ \begin{displaymath}\begin{array}{ccc}
UU^{\prime} & = & I\end{array}\end{displaymath} PCs have variances equal to corresponding root
V-vectors $ \sqrt{l_{i}}z_{i}=V^{\prime}\left[x_{i}-\overline{x}\right]$ \begin{displaymath}\begin{array}{cc}
VV^{\prime}= & S\end{array}\end{displaymath} PCs have variances equal to square of corresponding root
W-vectors $ \frac{z_{i}}{\sqrt{l_{i}}}=W^{\prime}\left[x_{i}-\overline{x}\right]$ $ WW^{\prime}=S^{-1}$ PCs have variances equal to unity


The object of PCA is the linear transformation of a group of correlated variables, such that some set of optimal conditions is achieved. The most important condition is that the new variables are uncorrelated. Mathematically, this is ensured by the requirement that $ u_{i}^{\prime}u_{j}=\delta_{ij}$ (using Kronecker's delta function: $ \delta_{ij}=1$ if $ i=j$, or 0 otherwise). Even so, these identities suggest many possible transforms. From the above, $ W$ might be written as $ S^{-1/2}$ (the inverse square root of covariance), so that the data sphering problem of Equation 1 can be solved in this way (although for large problems, the Lapack library offers more viable inversion routines). Hence, we might apply the $ W$-transform:

$\displaystyle Z=W^{\prime}\left(X-\mu\right)$ (10)

to obtain variables centred at $ \mu$ with unit variance. On the other hand, the comparable $ V$-transform scales the variables to the same units as the original data, which may be convenient for some problems (typically in engineering) where the units happen to be the same for all the variables. Finally, the $ U$-transform scales the resultant vectors to unity and hence the coefficients are always in the range $ \pm1$, which may be useful for more general diagnostic purposes (eg. in the analysis of shapelet functions; see [165]). Note that, in this latter case, the variance of each component is then equal to the corresponding root.

4.2.3 Using correlation

The choice of the forms of scaling presented above will be influenced by the type of application and the nature of inferential statistics required. In practice, however, it is often more appropriate to work with the correlation matrix, $ R$, rather than that of covariance ($ S$), although this may be less powerful statistically. The reason for this is that PCA is sensitive to additional variance which may be introduced by:

If uncorrected, these sources of variability will easily dominate the first few component terms. It should be borne in mind though, that the variables which result from using correlation acquire similar importance. Also, there is no obvious correspondence between components derived from correlation and covariance matrices. It is possible, however, to characterise the relationship between original variable, $ i$, and the component, $ j$, using the following:

where $ l_{j}$ is the $ j$th root of the diagonal matrix $ L$ and $ u_{ij}$ is the $ i,j$th element of the matrix of eigenvectors, $ U$ [103].

4.3 Using PCA


4.3.1 Assessing the number of components

Amongst the many properties of latent roots, we may cite two of special interest:

  1. $\displaystyle \left\vert S\right\vert=\left\vert L\right\vert=l_{11}l_{22}...l_{pp}$

  2. $\displaystyle trace\left(S\right)=trace\left(L\right)=l_{11}+l_{22}+\ldots+l_{pp}$

ie. two important properties of the covariance matrix $ S$, the determinant $ \left\vert S\right\vert$ (sometimes known as the generalised variance) and trace (defined as the sum of the diagonal terms), can be represented respectively by the product or sum of the latent root terms. This is especially useful since it implies that the largest root indicates the most important component. These properties also provide a mechanism to assess the importance of a particular component quantitatively. Several methods exist to perform this kind of analysis [134, Sec. 2.8, p.41]. The simplest is simply to take the ratio, so that $ P_{j}$ defines the proportion of variation accounted for by the $ j$th term as follows:

$\displaystyle P_{j}=\frac{l_{j}}{trace(S)}$

Another useful technique -- the `scree' plot -- depicts each eigenvalue as a sequence from largest to smallest. Those components which then lie above the `scree' at the base of the slope are thought to be the most useful candidates for further analysis, see Fig. 17.

Figure 17: Example scree-plot from R [105]
Image 17_home_richardh_reports_ds6_alg_Rplot001

An alternative stopping point is the Kaiser criterion [26], according to which we only retain factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much information as the equivalent of one original variable, we drop it. It is noted that a more sophisticated test, developed by Bartlett and extended by Anderson, also exists to provide a significance test for the equality of smaller roots. Having chosen to adopt a given number of components, it is also possible to assess the error from the `lost' components by inverting the system represented by Equation 4 and using residual analysis [134, p.34,87].


4.3.2 Understanding PCA

Although used widely for decades, many authors have given warnings about the wisdom of interpreting PCA results. Jackson has observed that the physical importance of a component is not necessarily related to the size of the characteristic root; indeed, the equivalent variation in a lesser component might have a greater effect on the human eye or ear and hence scaling might be a more important factor in visualization. The use of inferential statistics such as Hotelling $ T^{2}$-test may help, although this would probably not qualify as an exploratory technique [134, p.55].

Everitt, on the other hand, favours the `labelling' of PCs, albeit with caution, and advocates the re-scaling of variables (eg. by negating) if this makes the process easier [103]. Whether or not labels can be attached, however, a useful graphical technique is to take the first two or three components and perform pairs plots (or a scatterplot; see 2.4.2) to identify clusters and outliers. To facilitate this, PCA `scores' can be obtained by taking the element-wise product of the $ m$ chosen components with $ m$ data terms ie.  $ u_{ij}^{\prime}x_{ij}$ for $ j=1,\cdots,m \forall i$.

Fig. 18 shows the use of PCA scores in two different problems. In the upper plot (a), the iris data are separated largely by the first principal component and although the groups are not distinct, there is relatively little overlap (cf Section 2.4.3). In the lower plot (b), three components are distinguished by classification into two groups, ie. star or galaxy for the SuperCOSMOS data (see A.1). This difference is highlighted by quadratic regression surfaces through these scores (cf Fig. 4).

Figure 18: Using PCA scores to plot the data. Top (a): a maximal linear separation of the iris data using first and second components (colours indicate the separately identified species); bottom (b): regression planes of PCA scores of the SuperCOSMOS data (generated by R Commander)
(a) Image 18_home_richardh_reports_ds6_alg_figures_r-pca-iris



(b) Image 19_home_richardh_reports_ds6_alg_figures_r-pca

These surfaces show some degree of separation along the first component axis, but their convoluted shape makes this representation seem less than satisfactory, even when viewed with rotation in 3D. A more useful technique, suggested by Everitt [103], may be to omit a particular variable of interest from the PCA and then to plot/regress scores from the first few PCs against the raw data for the selected variable. The effect of outliers in these plots should again signal a note of caution, and it may be advisable to choose more robust estimators for covariance in these cases.


4.3.3 Re-interpretation by correlation

A surprising feature of PCA is that, although the vector components are uncorrelated, the individual coefficients are not. On reflection, this is less surprising within each component since the terms are usually constrained by their sum (see Table 3) but it turns out that the coefficients of adjacent components are also often strongly correlated. This would seem to indicate that correlation tables and perhaps (graphically) parallel coordinates may be able to help, especially if components with small roots might be important, since, otherwise, these would likely be rejected.

4.3.4 Rotation of PCs

A more widespread technique of re-interpretation is to attempt to rotate the PCs until they are aligned with some notionally simpler system, usually so that the coefficients are as close to zero or one as possible. If the new system has $ k$ components assembled as $ B=\left[b_{1}\mid b_{2}\mid\hdots\mid b_{k}\right]$, then the requirements may be stated as:

  1. Each row of $ B$ should contain at least one zero, so that each of the original variables is uncorrelated with at least one of the rotated components.
  2. Each column of $ B$ should contain at least $ k$ zeros, the more the better.
  3. Each pair of columns should be distinguished by the presence of several ones (or zeros) at particular locations. If $ k\ge4$, the aim should be to increase the number of zeros, so that the matrix remains fairly sparse.
These methods, in essence, provide a simple clustering of variables, which (it is hoped) are independent. But the rotated variables are no longer uncorrelated. The forms of rotation can be summarised as follows:

Many rotation techniques had their origin with factor analysis and are discussed further below, but it is fair to say that they have had a chequered history, with many objections to their ad hoc application. They have been most useful in psychological and market research and may be more valuable with very high dimensional data.


4.3.5 Removal of variables

Instead of rotating the system to maximise (or minimise) the influence of variables of interest, Mardia et al [150] have suggested that PCA may be used to remove variables directly. First, PCA is undertaken and a method selected to find the most appropriate number of components, $ k$, such as by scree-plot. Then one begins a process of deselection by examining the least significant PC and rejecting the variable associated with the largest coefficient. The variable associated with the largest coefficient of the second least significant PC is then rejected and so on, until only $ k$ variables remain.

4.3.6 Non-linear PCA

It is noted here that PCA has also been developed to provide an analysis of principal curves, particularly useful where it is suspected that the primary relationships between variables are non-linear. This approach is available in S-PLUS and R pcurve library, based on work by Hastie and De'ath (see [35]), and has also been applied to astronomy data via neural networks (see 5.8.4).


4.4 Factor analysis

4.4.1 Difficulties with PCA, FA and rotation

One of the main objections to PCA is the absence an underlying statistical model. It is difficult, therefore, to interpret the meaning of the resulting components with any certainty or to give an estimate of the effect of sampling error. Whilst FA overcomes some of these issues, it introduces new problems. To start with, it must be assumed that the underlying factors actually exist, and that, where there is more than one factor, that the factors are independent of each other and any error terms (see Equation 4) [134].

For many authors, another disquieting feature was the willingness of statisticians to employ rotation techniques to find more readily amenable factors. Since these rotations often appeared to be equally plausible, this suggested a considerable degree of subjectivity. The result, particularly for FA, was a considerable amount of adverse publicity. In 1989, Chatfield and Collins observed that few cases existed where the technique had proven to be useful outside the field of psychology [84]. But in recent years, a significant step-forward has been the development and inclusion of maximum-likelihood procedures in software packages, such as SPSS [61] and R, and applications of FA in science and engineering are much more common (see eg. [119,140]).

4.4.2 Background to factor analysis

In PCA, the fact that the sum of the latent roots and the sum of the original variance is the same (4.3.1) is often interpreted as supposing that PCA accounts for all the variation in the data. FA, on the other hand, is said to explain correlations. In the early 20th Century, Spearman observed that the ratio of off-diagonal covariance was often similar between rows, suggesting the presence of underlying `factors' [147].

The general FA model is stated in Equation 4. Ideally, the variable $ X_{i}$ in this model is the $ i$th test score with mean zero and unit variance; the coefficients $ a_{ij}$ are the factor loadings for the $ i$th test; and $ F_{i}$ ( $ i=1,\hdots,k$) are uncorrelated common factors, also with mean zero and unit variance. It is assumed that the variables($ X_{i}$) are correlated with the factors, but are otherwise uncorrelated with each other. With this model:


$\displaystyle Var\left(X_{i}\right)$ $\displaystyle =$ $\displaystyle a_{i1}^{2}Var\left(F_{1}\right)+a_{i2}^{2}Var\left(F_{2}\right)+\hdots+a_{ik}^{2}Var\left(F_{k}\right)+Var\left(e_{i}\right)$  
  $\displaystyle =$ $\displaystyle a_{i1}^{2}+a_{i2}^{2}+\hdots+a_{im}^{2}+Var\left(e_{i}\right)$  
  $\displaystyle =$ $\displaystyle 1$ (11)

The $ a_{i1}^{2}+a_{i2}^{2}+\hdots+a_{im}^{2}$ term is called the communality of $ X_{i}$: the part of the variance that is related to common factors. It's complement, $ Var\left(e_{i}\right)$, is called the specificity of $ X_{i}$ and is unrelated to the common factors. For any two variables, it can also be shown that the correlation:

$\displaystyle r_{ij}=a_{i1}a_{j1}+a_{i2}a_{j2}+\hdots+a_{ik}a_{jk}$ (12)

and hence the two scores can only be highly correlated if they have high loadings on the same factors.

4.4.3 Analysis

If the provisional loadings for the analysis are generated by PCA, then the resulting test is one of principal factors. The procedure for performing the analysis has 3 steps:

The choice of factors and rotations has led many in the past to argue that FA too closely resembles witchcraft. A more respectable approach appears to be that of maximum likelihood analysis, which attempts to define a `distance measure' between the observed covariance matrix and that generated by the FA model. The technique then tries to maximise the likelihood function for the $ k$-factor model (by minimising the distance) under the assumption of multivariate normality. The main advantage, however, is that this approach has an associated formal testing procedure for the number of factors, although it is still necessary to iterate, usually by starting with $ k=1$ and increasing until convergence [103, p.70].

Modern software tools permit more rigorous techniques, such as confirmatory analysis, as well as the more exploratory methods discussed here [147]. In general, however, it should be noted that that FA is likely to be more successful than PCA if there is reason to believe that there is some underlying structure to the data.


4.4.4 Rotation of factors

Everitt notes that despite the large number of published algorithms available to do rotations, only a few are in general use. These are summarised in Table 4.


Table 4: Factor rotation methods
Method Summary Notes
Varimax

(Orthogonal)

Uses an iterative maximisation of the variance of the loadings; see [134,150] Factors have high correlations with a subset of variables and little or none with others. Some `general' factors may disappear
Quartimax (Orthogonal) Uses modification of above; see [82] Forces a given variable to correlate highly with one factor
Oblimin (Oblique) Tries to find a parameter which controls the degree of correlation between factors; see [135] Values between $ -0.5$ and $ 0.5$ seem to work well for most applications
Promax (Oblique) Operates by raising the loadings of an orthogonal solution to some power; see [126] Aim is to find a balance between the lowest power and the smallest correlation between factors




Despite the indeterminacy of particular rotations, it is clear that the distribution of points remains invariant. FA simply seeks to describe this distribution as `simply' as possible. In some ways, a more intractable problem is that, unlike PCA, it is difficult to calculate scores using these factors since the relationship essentially has to be inverted. However, some attempt can be made, if it is assumed that the distribution is normal [103, p.76].

As a simple example, Table 5 shows loadings for factor analysis (with promax rotation) and PCA (unrotated) of sunspot data contributed by S. Dalla of AstroGrid (see Appendix A.6; in press). The first factor has tended to combine variates associated with the magnitude and size of sunspots; the second, the number of spots and longitude; the third is bound to the variable `mindistar', the great circle angle associated with he nearest adjacent sunspot; and the fourth relates mainly to time along the sun's 11 year cycle (denoted `dt'). It is straightforward to obtain plots of scores on this factor data to identify outliers, anomalies or other patterns using eg. parallel coordinates.

Table 5: Factor analysis and PCA of sunspot data

  FA PCA
Variable fa1 fa2 fa3 fa4 pc1 pc2 pc3 pc4
dt - - - 0.459 -0.222 -0.172 0.682 -
latitude - - - -0.23 - - -0.339 -0.132
longitude - 0.446 - - -0.196 0.72 - 0.107
long_carr - - - - - - - 0.98
area 0.743 -0.161 - - -0.428 -0.499 0.14 -
long_extent 0.763 0.237 - - -0.584 - -0.264 -
n_spots 0.878 0.9 - - -0.581 0.119 -0.288 -
mindistar - - 0.95 -0.117 0.223 -0.407 -0.489 -
n_flares 0.501 - - - na na na na


Dashes indicate that the factor loading is effectively zero for this variable. In contrast, the principal components analysis creates a spread of loadings, which is hard, if not impossible, to disentangle (note also that number of flares could not be included in this particular analysis).

4.4.5 Comparison of PCA and FA

For data which shows a strong degree of correlation, both FA and PCA should give similar results and the converse is also true: poorly correlated data will give components similar to the original variables, whilst FA will generate meaningless results. Table 6 gives a comparison of the two techniques.


Table 6: Comparison of PCA and FA
Factor Analysis Principal Components
Tries to explain correlations or covariances by means of a few common factors Tries to explain the variance of the observed variables
Increasing the number of factors included in the study has a profound effect on the factors as a whole Increasing the number of components to $ m+1$ has no effect on the first $ m$ components
Computing factor scores is complex, and a number of assumptions have to be made (usually normality) Computing factor scores is straightforward
Analysing the correlation or covariance matrix makes little difference, especially if using maximum likelihood techniques The PCs obtained through using the correlation or covariance matrix are substantially different
Factors are uncorrelated only within the factor space PCs are uncorrelated without condition
Solutions robust with respect to changes in scale Solutions will differ with scale
Computationally heavy, especially with more complex models Computations relatively straightforward


In general, the application of PCA or FA depends very much on the problem at hand. If the data contain a large number of relatively anonymous or indistinct variables, such as shapelet coefficients [152,165], then PCA is the more obvious choice. This is also true if it is suspected that many variables are contributing very little variance to the results, in which case, a `weeding-out' of such variables (eg. by association with the smallest eigenvalues) is recommended.

On the other hand, if there are fewer, more distinct variables (as with the sunspot data), then PCA is unlikely to be helpful. The result will probably be a mixing of these variables which will be highly counter-intuitive. In this case, FA with rotation (preferably using maximum likelihood procedures) may give a more satisfactory grouping of variables, which will be all the more plausible if this agrees with the hypothesis under test or some notion of the underlying behaviour of the system. By definition, PCA is entirely arbitrary -- no underlying model can be supported, and it is not possible to attach significance to the results.


4.5 Canonical correlation analysis

This form of analysis dates back to the work of Hotelling in the 1930s, in which he was interested whether reading ability was related to mathematical ability in schoolchildren. Realising that the observed variables: reading power ($ X_{1}$) and reading speed ($ X_{2}$), and mathematical power ($ Y_{1}$) and mathematical speed ($ Y_{2}$), naturally fell into two groups, he proposed that:

\begin{displaymath}
\begin{array}{ccc}
Reading  score & = & a_{1}X_{1}+a_{2}X_{2}\\
Mathematics  score & = & b_{1}Y_{1}+b_{2}Y_{2}\end{array}\end{displaymath}

and then sought to make the correlation between the two scores as large as possible. This idea is somewhat similar to that behind PCA, except that here the correlation is maximised in instead of variance. The largest correlation was found between pupils where a large difference between scores for reading also indicated a large difference between scores for mathematics. This produced coefficients of similar but opposite sign, such that both scores were apparently measuring the difference between power and speed. In other words, it was this aspect of the data which showed most correlation. Further details are given in [147]. Of potentially greater benefit for astronomers, however, is the technique of canonical variate analysis (see 5.6).


4.6 Multidimensional scaling

4.6.1 Background

If a table of distances (or proximity matrix), $ d_{ij}$, can be defined between objects (or observations), it is usually possible to employ multidimensional scaling (MDS) to give a lower dimensional representation. This immediately suggests that such an algorithm is $ O\left(n^{2}\right)$ since $ n\left(n-1\right)/2$ operations are needed to build the distance table. The observations will rarely lie in the same plane, however, and hence these distances will need to be adjusted if they are to be plotted in the lower dimensional space. This leads to the concept of a stress function in MDS which will usually need to be minimised.

In some respects, MDS presents the opposite problem to PCA, although components still provide the solution. Our starting point, however, is the matrix $ d_{ij}$, from which were are attempting to determine a data matrix, $ X$, but usually in some `reduced' form ie. with lower dimensionality.

4.6.2 Classical multidimensional scaling

In classical MDS, the distance matrix, $ D$, is usually defined by the Euclidean distance, so that:

$\displaystyle d_{ij}^{2}=\left(x_{i}-x_{j}\right)^{T}\left(x_{i}-x_{j}\right)=x_{i}^{T}x_{i}+x_{j}^{T}x_{j}-2x_{i}^{T}x_{j}$

and we define an $ n\times n$ inner product matrix $ B$, such that:

$\displaystyle B=XX^{\prime}$ (13)

To overcome the indeterminacy of the solution to arbitrary translation, the centroid of the configuration of points is placed at the origin. Hence:

$\displaystyle \sum_{i=1}^{n}x_{ik}=0\forall k$

These constraints imply that the sum of terms of any row in $ B$ must be zero and hence that $ B$ can be expressed in the form:

$\displaystyle b_{ij}=-\frac{1}{2}\left[d_{ij}^{2}-d_{i.}^{2}-d_{.j}^{2}+d_{..}^{2}\right]$ (14)

where the dot notation is used to imply summing across all columns, all rows or both. Since $ B$ is symmetric, it can also be expressed as:

$\displaystyle B=VAV^{\prime}$

and hence we have:

$\displaystyle X=VA^{1/2}$

where $ A$ is a matrix of eigenvalues of $ B$ and $ V$, the matrix of associated eigenvectors. It should be noted that by employing Euclidean distances, this technique is equivalent to that employed for PCA (see 4.2.1 and 4.2.2). The main difference is that the matrix $ D$ was used to determine $ X$ and was of rank $ n$ (the total number of observations), rather than $ p$ (the number of variables). Assuming that $ p<n$, the last $ n-p$ eigenvalues of $ B$ will be zero, so that $ B$ is usually re-written:

$\displaystyle B=V_{1}A_{1}V_{1}^{\prime}$

where $ V_{1}$ contains the first $ p$ eigenvectors and $ A_{1}$, the first $ p$ eigenvalues. Sometimes, however, the distances are not Euclidean, so that $ B$ is no longer positive definite and negative eigenvalues will occur. This leads to coordinates (sometimes called principal coordinates) which are complex.

For plotting purposes, however, the best-fitting $ k$-dimensional representation is given by the $ k$ eigenvectors of $ B$ corresponding to the $ k$ largest (positive) eigenvalues. As an example, Fig. 19 shows a projection of sunspot data (from longitude/latitude information) alongside that for flare activity, which can be cycled as an animated gif file.

Figure 19: Sunspot and flare activity
Image 20_home_richardh_reports_ds6_alg_figures_anim

The adequacy of a $ k$-dimensional representation can be judged by the size of the criterion:

$\displaystyle P_{k}=\frac{\sum_{i=1}^{k}\lambda_{i}}{\sum_{i=1}^{n-1}\lambda_{i}}$

where values in excess of $ 0.8$ suggest a reasonable fit. If negative eigenvalues do occur, these may often be rejected and/or $ P_{k}$ assessed by replacing the $ \lambda_{i}$ terms with either the absolute or the squared values. Several other criteria have been suggested [103, p.96].

4.6.3 By regression

A simple regression procedure for performing MDS can be described as follows [147]:

$\displaystyle stress=\left\{ \sum\left(d_{ij}-\hat{d_{ij}}\right)^{2}/\sum\hat{d_{ij}^{2}}\right\}$ (15)

The original coordinates of each object may then be adjusted to improve the stress coefficient, after which the table of distances may need to be recomputed and so on until some level of convergence is reached. This often requires a balance between the choice of a small number of dimensions and the desired level of stress.

It should be noted that this algorithm relates to metric scaling ie. some linear or polynomial relationship exists between $ d_{ij}$ and $ \delta_{ij}$. In non-metric scaling, only the ordering of the data distances is important. This gives more flexibility to obtain clearer low-dimensional representations.

4.6.4 Robustness and grouping

Cox and Cox have observed that very little work has been done on diagnostics for MDS, other than the use of stress formulae and Shepard plots (scatterplots of the interpoint distances against the original dissimilarities, eg. see [153]). Some attempt, however, has been made to visualize disparities by use of interactive plots using colour `smearing' and density estimates, though the effects of extreme outliers can be rather devastating [90, p.94]. Nevertheless, Spence and Lewandowsky have proposed a number of more robust parameterisations, using transformations with medians [180]. Dissimilarities may also be grouped, especially for larger data sets and this is the principle behind the implementation of CViz [97].

4.6.5 Non-linear scaling methods

Non-linear techniques include Sammon and self-organised (or Kohonen) maps (see the demo at [170]; [163] for discussion of outliers, and 5.8 below). Kohonen maps were first introduced in the 1980s and have found many practical applications, but been criticised for their ad hoc approach to convergence and handling of missing or categorical data. Verbeek, however, has proposed various extensions to improve these criteria [201].

Fig. 20, gives some impression of Kohonen maps using the R package `KlaR' [168] with the iris data set (Appendix A.2) and shows six separate plots by starting with different seed parameters.

Figure 20: KlaR Kohonen map of iris data set
Image 21_home_richardh_reports_ds6_alg_figures_klaR-iris-all

The starting parameters provide an initial random distribution of the data across a rectangular grid. Each data point is represented by an eight direction edge map, the final configuration of which is dictated through iterations of a stress function. As with many of these component-based and neural network techniques, this is rather difficult understand intuitively. However, a notable feature of this visualization is that, unlike the hierarchical clustering results described earlier, there are only two clearly separated groups, although a third is suggested in most of the plots. Since two of the groups are not linearly separable (see 2.4.3), this display would appear to give an accurate reflection of the data, despite the evident variation in the plots.

4.6.6 Correspondence analysis

Correspondence analysis (CA) is closely related to classical MDS, except that it is applied to $ \chi^{2}$-squared distances, following the assumption that most multivariate problems will give a reasonable approximation to the $ \chi^{2}$-distribution and, consequently, that the proportions of the data in each row and column provide the basis for measurement. Everitt has observed that correspondence analysis (CA) is essentially a technique for displaying multivariate categorical data by deriving appropriate coordinates to represent both the row and column variables. In effect, CA can be regarded as either [103, p.104]:

For a contingency table with $ I$ rows and $ J$ columns, it can be shown that the chi-squared distances can be represented exactly in $ \min\left\{ I-1,J-1\right\} $ dimensions. Thus, when both $ I,J>3$, an exact 2D representation is no longer possible, although some approximation may be. In this case, however, the goodness of fit should be independently tested and for very large tables, it would be advisable to consider a separate method of data reduction. The computation of chi-squared distances is of $ O\left(n^{3}\right)$ and it is also easier to interpret the proximity of labelled groups, rather than individual data items.

4.7 Independent components analysis

Independent components analysis (ICA) can be thought of as another factor rotation method, where the goal is to find rotations that maximise certain independence criteria. The assumption is that a given signal will be a mixed response and where this is so, blind ICA separation of the signal gives very good results. It is also used for signals that are not supposed to be generated by a mixing for analysis purposes.

The statistical method finds the independent components (aka factors, latent variables or sources) by maximising the statistical independence of the estimated components. Non-Gaussianity, motivated by the central limit theorem, is one method for measuring the independence of the components. Non-Gaussianity can be measured, for instance, by kurtosis or approximations of negentropy. Typical algorithms for ICA use centering, whitening and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition. Algorithms for ICA include infomax, FastICA and JADE, but there are many others also. The ICA method is not able to extract the actual number of source signals, the order of the source signals, nor the signs or the scales of the sources [108,132].

A major problem in application of independent component analysis (ICA) is that the reliability of the estimated independent components is not known. Firstly, the finite sample size induces statistical errors in the estimation. Secondly, as real data never exactly follows the ICA model, the contrast function used in the estimation may have many local minima which are all equally good, or the practical algorithm may not always perform properly, for example getting stuck in local minima with strongly suboptimal values of the contrast function.

One implementation which is relevant here is Icasso (available as a package for MATLAB; see [46]). This tool attempts to provide an exploratory visualization method for investigating the relations between estimates from FastICA, by running the algorithm many times with different initial values or with differently bootstrapped data sets (see 3.3.2). Reliable estimates are said to correspond to tight clusters, and unreliable ones to points which do not belong to any such cluster.

4.8 Projection pursuit

Projection pursuit (PP) is a type of statistical technique which involves finding the most `interesting' possible projections in multidimensional data. Often, projections which deviate more from a Normal distribution are considered to be more interesting. PP is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods.

Given a projection index that defines the most interesting direction, PP looks for the directions that optimise that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, with orthogonal projections, PP yields the familiar PCA. A commonly used higher-order projection index is based on the negative Shannon entropy. A commonly used higher-order projection index is based on negative Shannon entropy. The FastICA algorithm for ICA can also used to find projection pursuit directions (see above) [108].

4.9 Regression

Regression methods can be used for dimension reduction when the goal is to model a response variable $ y$ in terms of a set of $ X_{i}$ variables. The regression function can be linear or non-linear. In traditional analyses, it was generally assumed that the explanatory variables ($ X_{i}$) were carefully selected and uncorrelated. In current data mining applications, however, those assumptions rarely hold. Variable selection, or dimension reduction, is therefore needed for such cases. A well-known statistical variable selection method is step-wise regression, where different models are fitted using different subsets of the explanatory variables. The results are then compared by calculating various goodness-of-fit measures, and the subset with the best measure is chosen as the explanatory variables with the reduced dimension. A similar approach, selecting the most relevant features by evaluating random subsets of the original features, is called the wrapper method in the machine learning community [108,67,207].

5 Sampling, Clustering and Classification

5.1 Introduction

This section introduces a number of row-reduction techniques which might be described generally as clustering methods. Conventional cluster analysis (3.3.3) comprises at least three different techniques: hierarchical, partitional (notably by k-means) and spectral. Other algorithms, such as vector quantization (5.4.5), provide mechanisms to perform clustering, or what might be better described as proto-clustering. Density sampling is useful both for modelling the distribution of data and as the basis for a broad class of model-based clustering techniques (Section 5.4). Classification (5.7) may be defined as a supervised form of clustering, implying that some data partitions already exist, either by the efforts of human experts (supervised learning), or by previous processing and statistical testing (unsupervised learning).

An important reason for introducing these various techniques is that clustering is an NP-complex problem ie. one which can only really be solved by trying every available option. This is usually impossible, even with the fastest processors available. As an example, choosing 5 groups from a 100 individuals would require searching through some possible $ 10^{68}$ partitions. Some compromise must be found and it is quite likely that initial solutions will not be optimal; steps or analyses will often need to be repeated before the results are satisfactory [103].


5.2 Conventional clustering techniques


5.2.1 Distance-based measures

The essential aim of clustering is to define groups which possess some average property, but which is distinct from that of other groups. It is common to employ distance-based metrics since this is probably the most intuitive, though Euclidean distance is not always the most obvious choice. Various distance based measures for $ p$-dimensional problems are defined here for convenience:

where $ x_{ij}$ is the data item in the $ i$th row and $ j$th column, $ \mu_{i}$ is the population mean vector for the $ i$th population and $ S^{-1}$ is the inverted matrix of covariance.

Distance-based techniques rely upon the chosen metric to assess the difference between any two data observations. As there are $ \left(_{2}^{n}\right)$ pairs in any data set, this immediately suggests $ O\left(n^{2}\right)$ complexity. The complexity of the distance metric also has an impact (see [114, p.25] and [90, Table 1.1] for a more complete list). The Mahalanobis distance function increases with increasing distance between group centres and with decreasing within-group variation (see 5.6.2), but is similar to Euclidean distance if the correlation between variables is slight. It is particularly useful if:

It should be noted that the naive computation of distance matrices can sometimes be avoided and that tree-based methods may be employed to reduce the complexity of the problem in these cases of eg. two-point correlation functions. In particular, Gray and Moore [117] have described the use of multi-resolution kd-trees to reduce the $ O\left(n^{2}\right)$ complexity to approximately $ O\left(n\log n\right)$ (see 5.4.4.4).

5.2.2 Hierarchical techniques

Hierarchical or agglomerative techniques employ a distance metric (such as the above), but there is no clear notion of the number of clusters. Instead, a tree of clusters is formed either from the bottom (`agglomerative') or from the top (`divisive'). Agglomerative clustering starts by considering each data point as a single cluster and proceeds by iteratively merging the two most similar clusters. Divisive clustering algorithms start with one cluster containing all the data and proceed by iteratively splitting the clusters into two disjoint subsets.

The agglomerative and divisive methods differ greatly in their computational demands. Suppose we have $ N$ data points, then the number of possible mergers an agglomerative algorithm has to consider in the first step is $ N\left(N-1\right)/2$ (any two data points can be merged to form a cluster of size two). In total, $ N-1$ mergers have to be made to construct the complete cluster hierarchy, and in total $ O\left(n^{3}\right)$ possible mergers have to be considered. The number of possible splits a divisive algorithm has to consider in the first step alone is already $ O\left(n^{2}\right)$. Therefore, divisive algorithms that consider every possible split to find the optimal split will already be intractable for moderately sized data sets. Agglomerative algorithms therefore scale much better.

The method of agglomeration, however, differs in the (dis)similarity measure that is used. Most dissimilarity measures, $ d_{AB}$, between clusters A and B are defined in terms of a distance measure $ d_{ij}$ between elements of the clusters. Well-known measures are:

The single-link measure is based on the smallest distance between pairs of members of A and B; clusters are similar if there is at least a single link of similar data items between them. The complete-link measure uses the largest distance between pairs of members, thus clusters are similar if all pairs ( $ i\in A,j\in B$) are similar [201].

Other methods include centroid linkage, median linkage and sum-of-squares error (see Equation 16). The choice of these may be dependent upon the problem at hand: single linkage tends to emphasise outliers, but performs poorly if the clusters are non-spherical, leading to the phenomenon known as `chaining'. Centroid linkage appears to be better if the cluster sizes vary. Above all, it should be realised that hierarchical clustering techniques may give very different results on the same data.

Some estimate of the most appropriate number of clusters can be obtained by monitoring the point at which the difference between pairs gives the maximum difference overall, and this is the approach adopted in HCE [174]. More formal methods (eg. the upper tail rule) are described in [104, p.77], but it should be noted that there is no universally agreed consensus and that `informal and subjective criteria' often predominate.


5.2.3 Partitioning by K-means

For partitioning techniques, the number of clusters, $ N$, must be known (or supposed) in advance. Their main attraction over the hierarchical methods is that the algorithms are generally much more efficient. Their main drawback is that assumptions on the shape of the clusters have to be made in advance. In k-means, for example, there is an implicit assumption that the data exhibits compact spherical clusters. Also note that whilst some work has been done on estimating the number of clusters from the data, this issue remains generally unresolved.

The k-means algorithm is probably the most common partitioning-type algorithm, but expectation-maximisation is also popular (see 5.2.4). Given an initial k-clustering the k-means algorithm adapts this clustering to reduce the sum-of-squares error criterion given by:

$\displaystyle E_{SS}=\sum_{i=1}^{k}\sum_{n\in A_{i}}\left\Vert x_{n}-\mu_{i}\right\Vert ^{2}$ (16)

where there are $ k$ clusters $ A_{i}$ and $ n_{i}$ is the number of data points in cluster $ A_{i}$ (the distance metric is essentially Euclidean here).

The k-means algorithm alternates between two steps to improve a given k-clustering. The idea is to treat $ E_{SS}$ in (16) as a function of both the cluster centres $ \mu_{i}$ and the assignment of data points to the clusters. In the first step we fix the assignment of data to clusters and minimise the error with respect to the cluster means $ \mu_{i}$. The optimal cluster centres are given by the mean of the data assigned to each cluster. After this step the error (as function of both the assignments and cluster means) equals $ E_{SS}$. In the second step we fix the cluster means and minimise the error with respect to the assignment of the data to the clusters. This amounts to assigning every point $ x_{n}$ to the cluster $ i=\arg\min_{j}\left\Vert x_{n}-\mu_{j}\right\Vert ^{2}$. After these two steps we obtain a new k-clustering for which $ E_{SS}$ cannot be larger than for the previous k-clustering [104,201].

The algorithm is guaranteed to terminate after a finite number of steps since: (i) there are a finite number of k-clusterings; and (ii), each step cannot increase the sum-of-squares error. However, the k-means algorithm does not necessarily terminate with the k-clustering yielding the minimum sum-of-squares. In practice, therefore, the analysis is usually repeated with different initial centres ($ \mu_{i}$) to avoid issues with local minima and with increasing values of $ N$ to find the point at which some overall metric (eg. total distance from the selected centres) ceases to improve.

A criticism of techniques such as k-means is that they provide only `hard' estimates of membership to a particular cluster. But an observation which lies at the boundary between cluster divisions will probably be more accurately specified by having associated probabilities of membership to either cluster. this is the basis for fuzzy clustering; see [104, p.164] and implementation at [107]. Another variation is the Quality Threshold (QT) algorithm which defines a maximum radius for clusters and does not, therefore, require $ N$ to be specified in advance and has found particular favour in genetics research; see eg. [115].


5.2.4 Partitioning by expectation-maximization

Probabilistic mixture distributions are another important class of partitional clustering methods. A mixture distribution is a distribution that is a weighted sum (weights are positive and sum to one) of several component distributions. The components are restricted to some parametric family and the fit of the model to the data is defined as the likelihood of the data according to the model.

Often the expectation-maximization (EM) algorithm is used to adjust the parameters such that the likelihood is increased. The EM algorithm is reminiscent of the k-means algorithm: it also iterates between a step assigning data to clusters and a step that optimises parameters that characterise the clusters. However, whereas the k-means algorithm assigns each data point to a single cluster, the EM algorithm uses a `soft' assignment: that is, each data point is assigned to each cluster, but in a weighted manner. The EM algorithm shares two important properties with the k-means algorithm:

In practice, the standard approach to reduce the risk of finding only a poor locally optimal solution is to start the EM algorithm with a number of different initial parameters and the local optimum yielding the highest likelihood is retained. Further discussion of the EM algorithm and model-based clustering is given in 5.5.

5.2.5 Spectral techniques

Spectral clustering is a relatively recent approach which applies an eigenvalue decomposition to the similarity matrix before clustering begins. This becomes, in effect, a dimensional reduction of the space, selected parts of which may be clustered thereafter. A popular application is in image segmentation where different regions of the image may be treated separately. In data mining, the technique has also been applied to the division of documents available on the world wide web into separate genres [27].

Spectral methods are attractive because they make less severe assumptions on the shape of the clusters than partitioning algorithms and can be very fast, depending on the sparsity of the similarity matrix. Moreover, the implementation of most spectral clustering algorithms is quite easy since the main component is a procedure to find a few eigenvectors of a matrix: this is a well studied problem for which highly optimised implementations are available [176,201].

5.2.6 Comparison of clustering techniques

It is useful to give a brief summary of these techniques for comparison. Agglomerative techniques are the least scalable since they are either $ O\left(n^{2}\log n\right)$ or $ O\left(n^{3}\right)$ depending on whether or not after each merge all distances between all clusters have to be computed. The partitioning algorithms discussed here are $ O\left(nk\right)$ for k clusters (in practice, however, the algorithm will need to be repeated many times for different values of $ k$ and initial centres). In the assignment step each combination of data point and cluster has to be considered to find the optimal assignments. Spectral clustering approaches need the similarity for each of the $ n^{2}$ pairs of points and thus take in principle at least $ O\left(n^{2}\right)$ time to compute the similarity matrix.

Depending on the similarity measure, speed-ups might be possible. For example, one could use a matrix in which an entry $ \left(i,j\right)$ is non-zero only if $ x_{i}$ is among the q-nearest neighbours of $ x_{j}$ or vice-versa. Efficient techniques exist to find nearest neighbours among $ N$ points in time $ O\left(N\log N\right)$ for fixed q. If similarity matrix used in spectral clustering is sparse, iterations of the power method take an amount of time proportional to the number of non-zero entries [201].

The assumptions underlying the EM algorithm seem rather restricting, but are at least well-defined, whereas they are often poorly understood for the other techniques. Some classes of problem seem may be better adapted to each. Interestingly, the set of densities that can be implemented using standard component classes can be increased by mapping the data to a space of much larger dimensionality where the new dimensions are (non-linear) functions of the original variables. Efficient parameter estimation of mixtures in the higher dimensional space is possible using the kernel trick which is discussed in [167,139,83,202] and illustrated in Fig. 21.

Figure 21: Illustration of the `kernel trick' to create linearly separable classes (after [139])
Image 22_home_richardh_reports_ds6_alg_figures_kernel-trick

A detailed account of dealing with missing data is presented in Little and Rubin [145]. This issue causes particular difficulty for clustering techniques since the use of summary statistics is likely to produce a bias in the choice of clusters. Gower's general similarity coefficient appears to be one possibility since the existing data are weighted to provide an initial dissimilarity matrix and missing variables are imputed later. Iterative techniques improve on this mechanism since correlations between other members of the group and the available data can also be accommodated [104].


5.3 Sampling


5.3.1 Techniques

Statisticians have developed many methods of estimating unknown distribution parameters using functions of the existing data. A bootstrap estimator, for example, samples with replacement (therefore suggesting `bootstrapping', since many other samples are possible). Whilst the original objective of cross-validation (see 3.3.4) was to verify replicability of results and that of `jackknife' was to detect outliers, Efron developed bootstrap for inferential purposes [102,33]. Markov Chain Monte Carlo methods use a random walk to provide samples (without replacement), the most common application being that of numerically calculating multi-dimensional integrals [51].


5.3.2 Histograms

With relatively small datasets, visual tools such as HCE are able to provide cues with respect to interesting patterns within the distribution of each variable. These cues, backed up by simple statistical parameters (kurtosis, skew, range etc) are therefore able to suggest regions of interest [174]. Histograms (both 1D and 2D) are also used to give an impression of the density of the data, and can be used to great effect when combined with other forms of visualization, eg. in Parvis [41], Fig. 22.

Figure 22: Histograms overlaid on parallel coordinates plot
Image 23_home_richardh_reports_ds6_alg_figures_parvis

Nevertheless, various authors have cautioned that histograms are unreliable for detecting modalities in the data, since discontinuities at the edges of the bins, bin width and the choice of origin can exert strong unwanted effects, especially in higher dimensions [76,104]. Silverman observed that even for the most `well-behaved' distributions, such as the normal, the effects can be highly counter-intuitive. In ten dimensions for example, data in the tails (which might easily be ignored in lower dimensions) will comprise the bulk of the data8. We are used to the one dimensional case where 90% of the data generally will generally lie within $ \pm1.6$ standard deviations of the mean: for ten dimensional data, this situation is reversed [177]. More powerful of distribution modelling are described in Section 5.4.


5.3.3 Entropy

A further method to assess the nature of distributions arises from information theory. Entropy reflects the information content within a given data stream $ X$ and is defined as $ H$, such that:

$\displaystyle H\left(X\right)=-\sum_{i=1}^{m}P\left(x_{i}\right)\log_{2}P\left(x_{i}\right)$ (17)

where we assume that $ X$ has $ m$ discrete values or bins and that $ P\left(x_{i}\right)$ is the probability that $ X$ lies in the $ i$th bin (or, more strictly, that $ X$ is in state $ x_{i}$; note that $ P\left(x_{i}\right)\log_{2}P\left(x_{i}\right)$ is usually taken to be zero if $ P\left(x_{i}\right)=0$). High values of entropy (reflecting very small probabilities) indicate a more scattered or uniform distribution; low values are more interesting, suggesting that more information is encoded into a smaller number of bits. It should be noted that the negative log value is often used in likelihood expressions, as otherwise, the numbers appear vanishingly small. Related to entropy is the concept of mutual information, $ MI$, between two separate variables defined as:


$\displaystyle MI\left(X;Y\right)$ $\displaystyle =$ $\displaystyle \sum_{x\in X}\sum_{y\in Y}P\left(x,y\right)\log_{2}\left[\frac{P\left(x,y\right)}{P\left(x\right)P\left(y\right)}\right]$ (18)
  $\displaystyle =$ $\displaystyle H\left(X\right)+H\left(X\right)-H\left(X;Y\right)$  

where $ H\left(X;Y\right)$ is the entropy of the joint distribution (and may be computed from the above definitions). Fig. 23 illustrates the potential of $ MI$ over conventional statistics (correlation in this case) where the data show non-linear relationships. Here, the symmetric nature of the data reduces the correlation coefficient to zero, whilst the $ MI$ is clearly high.
Figure 23: Mutual information contrasted with correlation in a hypothetical case.
Image 24_home_richardh_reports_ds6_alg_figures_mutual-info

Two datasets X and Y (100 data points) show a hypothetical dependency $ f\left(x\right)=4x\left(1-x\right)$ (top). The Pearson correlation coefficient is not able to detect a significant correlation as shown in the histogram plot of the dataset compared to 300 realisations of shuffled data (left). Mutual information clearly shows that the two datasets are not statistically independent (right) [after Daub et al [92]].

Indeed, the authors tended to find that these approaches were somewhat complementary: high values of $ MI$ with a low correlation indicating a non-linear correspondence; whilst the occurrence of low values of $ MI$ with high correlation was linked with highly correlated outliers [92] (nb code, using B-spline estimates, is available from [91]).

The concept of entropy may also be extended to vectors of any dimension, where, using the form continuous variables, we have:

$\displaystyle H({\bf x})=-\sum p\left(x\right)\ln p\left(x\right)  dx$ (19)

where $ {\bf x=\left(x_{1},x_{2},\ldots,x_{d}\right)^{T}}$. The value of this form is more obvious when used in conjunction with the quantization techniques described in 5.4.5. Using this form, it is also possible to show that the function of maximum entropy for a given mean and variance is that of the normal distribution [76, p.242].


5.4 Density estimation


5.4.1 Probability density

Density estimation is a powerful addition to the techniques described in 5.3 for sampling the data and modelling the distribution. Simple examples of density plots are shown in Fig. 24, which attempts to increase the data-ink ratio (see 2.2) of scatterplots by employing sample density plots below the diagonal.

Figure 24: 1D, 2D and 3D density plots
Image 25_home_richardh_reports_ds6_alg_figures_contour-scatplot3Image 26_home_richardh_reports_ds6_alg_figures_contour-persplot

Note that despite the sampling, which is limited to a 1000 points in each density plot, most plots show evidence of multiple modalities ie. clusters. Indeed, various clustering techniques can be built onto density estimates, and some examples are described later in this section.

For the most basic formulation, we note that a probability density function $ p\left(x\right)$ specifies that the probability $ P$ of of the variable $ x$ lying in the interval between any two points $ x=a$ and $ x=b$ is given by:

$\displaystyle P\left(x\in\left[a,b\right]\right)=\int_{a}^{b}p\left(x\right)dx$ (20)

The function $ p\left(x\right)$ is normalized so that the sum of Equation 20 is unity if the interval of $ \left[a,b\right]$ corresponds to the whole of $ x$-space ($ x$ may also be a $ d$-dimensional vector if the integral is across the whole region $ \Re^{d}$).

5.4.2 Parameterisation

The form of this probability expression is valuable for several different kinds of density sampling techniques. For brevity, this following sections (5.4.3 and 5.4.4) consider two basic types -- parametric and non-parametric . A third, which attempts to take the best features from both, is sometimes known as `semi-parametric' (or mixture modelling) and is described in Section 5.5 and eg. Bishop [76], in describing feed-forward neural networks (see also [123]). Finally, a technique which arises from vision research at Leeds is also introduced as a method of density estimation: Section 5.4.5 describes vector quantization.

Parametric methods model the distribution using some pre-defined form by attempting to optimise the parameters for the model. The simplest and most widely used is the normal distribution, which is discussed immediately below, along with various methods of parameterisation. The main drawback of this method is that the chosen form may not be capable of providing a good representation (see comments in 1.2.2). Non-parametric methods (eg. kernel-based methods, $ K$-nearest neighbours) make fewer assumptions and hence the model must be drawn entirely from the data. In this case, somewhat paradoxically, the number of parameters needed grows with the number of data points in the model, which can therefore become rather unwieldy. Techniques to mitigate this problem are also discussed.


5.4.3 Parametric methods


5.4.3.1 The normal distribution

A useful feature of the density space defined in Equation 20 is that expected values or functions may be included, eg. for function $ Q\left(x\right)$, we have the expected value $ E$, where:

$\displaystyle E\left[Q\right]=\int Q\left(x\right)p\left(x\right)dx\approx\frac{1}{N}\sum_{n=1}^{N}Q\left(x_{n}\right)$ (21)

Thus, for example, the mean or expected value of $ x$ is $ \mu,$ and the expected variance is $ \sigma^{2}$ where:

$\displaystyle \mu=E\left[x\right]=\int_{-\infty}^{\infty}xp\left(x\right)dx$

$\displaystyle \sigma^{2}=E\left[\left(x-\mu\right)^{2}\right]=\int_{-\infty}^{\infty}\left(x-\mu\right)^{2}p\left(x\right)dx$

In $ d$ dimensions, the general multivariate normal probability density can be written:

$\displaystyle p\left(x\right)=\frac{1}{\left(2\pi\right)^{d/2}\left\vert S\right\vert^{1/2}}\exp\left\{ -\frac{D^{2}}{2}\right\}$ (22)

where $ S$ is the covariance matrix, and $ D$ is the Mahalanobis distance defined in 5.2.1. Using eigenanalysis, it can be shown that a given value of $ D^{2}$ defines a constant probability surface which is a hyperellipsoid, the principal axes of which are given by the eigenvectors $ u_{i}$ of $ S$ which satisfy:

$\displaystyle Su_{i}=\lambda_{i}u_{i}$

for which the corresponding eigenvalues $ \lambda_{i}$ give the variances along the respective principal directions (cf. Equations 7 and 8).


5.4.3.2 Simplification and analysis

Several simplifications of the normal model are possible. For example, by ignoring the off-diagonal elements of $ S$, the contours of constant density are aligned with the main axes and, moreover, the components of $ x$ are said to be statistically independent. This implies that the distribution of $ x$ can also be obtained from the product of the distribution for each component, ie:

$\displaystyle p\left(x\right)=\prod_{i=1}^{d}p\left(x_{i}\right)$

The normal distribution has a number of attractive properties, which may be summarised as [76]:

  1. Analytical simplicity: Many results can be be obtained explicitly eg. moments of the distribution (kurtosis, skew etc).
  2. A good approximation: The central limit theorem states that, under general circumstances, we can expect the mean of $ m$ random variables to be distributed normally. For measurements of naturally occurring phenomena with several constituent components and a large value of $ m$, we can expect the approximation to a normal distribution to be plausible.
  3. Invariance under non-singular linear transformations: The Mahalanobis distance keeps its quadratic form under any such transformation, but the resulting distribution has different mean and covariance parameters.
  4. Normality of marginal and conditional densities: If some variables are integrated out of the distribution (`marginal' variables) or fixed (ie. made `conditional'), the resulting distributions are also normal.
  5. Simplifications for cluster constraints: To produce spherical clusters, only one parameter is required ($ s_{ii}=s$); for clusters of similar geometry, but not necessarily spherical, the diagonalised version of $ S$ can be used (as described above) [110].
  6. Maximisation of entropy: For given values of $ \mu$ and covariance $ S$, the normal density function maximizes entropy. This has particular value within the domain of neural networks, since, by the formulation of cross-entropies, the networks can be trained.
Use of normal density estimates to create discriminant functions is discussed in Section 5.7.3.


5.4.3.3 Parameterisation of the normal density function

The two principal approaches to parameterisation are maximum likelihood and Bayesian inference. Both methods tend to produce similar results, but their conceptual basis is rather different. The first method seeks find the optimum values for the parameters by maximising a likelihood function observed from training data. In the Bayesian approach, the parameters are described by a probability distribution. These procedures are outlined briefly below:

Maximum likelihood: A density function $ p\left(x\right)$ is assumed to depend upon a set of parameters $ \theta=\left(\theta_{1},\ldots,\theta_{M}\right)^{T}$. In a classification problem, such a function would exist for each class, but here we consider only one such function. Given a data set of $ N$ vectors, $ X=\left\{ x_{1},\ldots,x_{N}\right\} $, then making the dependency on $ \theta$ explicit, we write that:

$\displaystyle p\left(X\mid\theta\right)=\prod_{i=1}^{N}p\left(x_{i}\mid\theta\right)\equiv\mathcal{L}\left(\theta\right)$

where $ \mathcal{L}\left(\theta\right)$ can be viewed as a function of $ \theta$ for fixed $ X$ and is known as the likelihood of $ \theta$ for given $ X$. To maximise the likelihood, $ \theta$ is chosen to maximise $ \mathcal{L}\left(\theta\right)$. This corresponds to the intuitively reasonable idea of choosing the value of $ \theta$ which is most likely to give rise to the observed data. In practice, it is often more convenient to consider the negative logarithm of the likelihood, in the form of an error function, $ E$, which should be minimised:

$\displaystyle E=\mathcal{-L}\left(\theta\right)=\sum_{i=1}^{N}\ln p\left(x_{i}\mid\theta\right)$

For most choices of density function, the optimum $ \theta$ would be chosen by some iterative numerical procedure eg. by gradient descent. For multivariate normal density, however, it can be shown that the maximum likelihood estimates of the mean and covariance are the sample average and sample covariance. This simplicity has a number of in that the estimator is biased, especially for small $ N$.

Bayesian inference: In this approach, the parameter vector $ \theta$ is itself associated with a prior probability distribution, which is typically very broad, since little is assumed to be known. After the data are `observed', the posterior probability density is much narrower, suggesting that learning has occurred. The algorithm proceeds by expressing the desired density as:

$\displaystyle p\left(x\mid X\right)=\int p\left(x,\theta\mid X\right)  d\theta=\int p\left(x\mid\theta\right)p\left(\theta\mid X\right)  d\theta$

since $ x$ is independent of $ X$ (it is just our assumed form for the parameterised density) and is specified when $ \theta$ has been set. Thus, instead of finding a particular value for $ \theta$, the Bayesian approach uses a weighted average over all values of $ \theta$. The weighting factor $ p\left(\theta\mid X\right)$, which is the posterior distribution of $ \theta$, is determined by starting from some assumed prior distribution $ p\left(\theta\right)$ and then updating it by using the observed data. Using Bayes theorem, the posterior probability can then be expressed as:

$\displaystyle p\left(\theta\mid X\right)=\frac{p\left(x\mid\theta\right)p\left(...
...eft(\theta\right)}{p\left(X\right)}\prod_{i=1}^{N}p\left(x_{i}\mid\theta\right)$

although, typically, the evaluation of $ p\left(X\right)$ involves a difficult integral only possible analytically by using the conjugate prior. It can be shown that the maximum likelihood and Bayesian methods converge for large $ N$, but for smaller values will tend to give rather different results. Bishop gives further details and describes several other methods [76]. Fraley and Raftery [110] have also introduced the use of eigen-analysis into their parametrisation, see 5.5.


5.4.4 Non-parametric methods


5.4.4.1 Histograms

In the simplest case, a histogram is an example of a non-parametric kernel density estimator. Each data point contributes towards a count within a series of bins placed across the range of the distribution. It is clear that the choice of bin width will play a key role in determining the shape of the histogram: too narrow and the result will appear spiky; too broad and the distribution will begin to look uniform. The choice of origin may also play a determining role in the shape of the histogram, especially `edge' effects may be suspected. The basic histogram may be defined by the operator:

$\displaystyle f\left(x\right)=\frac{1}{nh}\left[number  of  X_{i}  in  same  bin  as  x\right]$ (23)

where $ n$ is the total number of observations and $ h$ the width of the bin.

One advantage of the histogram is that it can be constructed sequentially and the data discarded afterwards, if so desired. A distinct disadvantage, however, is that the breaks at the boundaries of the histogram bins cause discontinuities, which make the technique unsuitable for anything but the most casual use. Another problem occurs in higher dimensions: taking $ M$ intervals in $ d$-dimensional space leads to an exponential growth in the number of bins ($ M^{d}$), inevitably resulting in a density estimate of zero in most bins.


5.4.4.2 Kernel density estimation

If the bins are centred over each data point, this leads to a smoother, but not necessarily smooth, plot (see [100] for examples). We may define the estimator as:

$\displaystyle p\left(x\right)=\frac{1}{nh}\sum_{i=1}^{n}H\left(u\right)$ (24)

where:

$\displaystyle H\left(u\right)=\left\{ \begin{array}{cc}
\frac{1}{2} & if\left\vert\frac{x-X_{i}}{h}\right\vert<1\\
0 & otherwise\end{array}\right.$

and is described as `naive' in this simple form. As above, it will be seen that the choice of $ h$ is again critical. Given also, that higher dimensions may be included, we should write:

$\displaystyle p\left(x\right)=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d}}H\left(u\right)$ (25)

so that $ h$ may be adapted for any given axis in $ d$-dimensions. It can be shown that as $ h\rightarrow0$, the kernel approaches a delta function and our approximation for $ p\left(x\right)$ tends towards the true density. The effect of this, however, is that the model becomes overfitted to the data, leading to a spiky histogram. On the other hand, and as noted previously, if $ h$ is too broad, the distribution is over-smoothed.

Rectangular, triangular and Gaussian kernels are popular, and several others have been applied (see Silverman [177, Table 3.1]). Their efficiencies, based on the mean integrated square error statistic all exceed 90% and it is common to find that particular kernels have been chosen for ease of use in a given situation. In general, any kernel may be used so long as the following holds:

$\displaystyle H\left(u\right)\geq0$

and:

$\displaystyle \int H\left(u\right)  du=1$


5.4.4.3 K-Nearest Neighbours

The technique of K-Nearest Neighbours (KNN) is another widely applicable technique for density estimation, and with only a small modification may also be applied to classification problems more generally. The main advantage of KNN is that the difficulty of the choice of $ h$ is avoided. It should also be noted, however, that unlike kernel density estimates, there is no theoretical convergence as $ h\rightarrow0$.

From first principles, the probability that a new vector $ x^{\prime}$ drawn from an unknown density function $ p\left(x\right)$ will fall inside some region $ \Re$ of $ x$-space is:

$\displaystyle P=\int_{\Re}p\left(x^{\prime}\right)=dx^{\prime}$

If we have $ N$ data points drawn independently from $ p\left(x\right)$, then the probability that $ K$ of them will fall inside $ \Re$ follows a binomial law. If $ N$ is large and $ \Re$ small, we can make the following approximations:

$\displaystyle P\approx\frac{K}{N}\approx p\left(x\right)V$

where $ V$ is is the volume of $ \Re$ and $ x$ is some point lying within $ \Re$. From this, it follows that we can also approximate $ p\left(x\right)$ by:

$\displaystyle p\left(x\right)\approx\frac{K}{NV}$ (26)

In applying Equation 26, there are two basic approaches which can be adopted. If we fix the volume and determine $ K$ from the data, this leads to a kernel density estimate as described in 5.4.4.2 above. If, however, the volume is varied until it contains a fixed number of $ K$ points, we have the KNN technique. More sophisticated versions of this algorithm, with the aim of reducing storage and processing requirements, are described in Hand and Batchelor [121]. Tree-search techniques may also be employed to speed up the process of finding the nearest neighbours to a given point [188, p.512] and examples, using kd-trees, are described immediately below.


5.4.4.4 Using kd-trees

Kd-trees were introduced as a method to speed-up many statistical analyses by Moore in 1991 [157]. An early application was to KNN which provided $ O\left(\log n\right)$ rather than $ O\left(n\right)$ search times after construction of the tree (an $ O\left(n\log n\right)$ operation) [158]. The construction of the median-based tree is outlined in Algorithm 1
\begin{algorithm}
% latex2html id marker 1553
[!htbp]
Algorithm BUILDKDTREE(P,de...
...\par
\caption{
Kd-tree construction (after \cite{kd_tree_demo})}
\end{algorithm}
and the principle is illustrated in Figure 25.

Figure 25: Kd-tree construction (after [151])
Image 27_home_richardh_reports_ds6_alg_figures_kd-tree

In essence, the $ k$-dimensional space is divided alternately along the median of each dimension until each data point can be collected as the leaf of a binary tree.

In addition, Moore also observed that the search path could be optimised by the storage of additional information at the dividing nodes, such as the bounding box of data below any given node. Indeed, this only really required the storage of two arrays of $ k$ maximum and minimum values at each node which then define the hypercube around that section of the tree. Interestingly, better performance was obtained (empirically) by dividing along the side of longest dimension rather than the median, at least in the initial stages of construction. This tended to promote the creation of larger hypercubes in regions with no data, allowing empty partitions to be quickly discounted.

The storage of additional data become more firmly established with the concept of multi-resolution kd-trees (mrkd-trees), in which the mean and covariance were also stored and where considerably faster performance was reported for kernel regression [94] and mixture model clustering (see 5.5) [155]. Gray and Moore also realised that a whole class of conventionally $ O\left(n^{2}\right)$ point-by-point comparison algorithms (such as n-point correlation functions) could be much speeded-up by the use of dual kd-trees [117]. This method considers query nodes as chunks, also defined within a kd-tree, to compare against an existing tree. By using approximating functions with additional data stored at nodes (an operation called pruning), it was found that performance was improved whilst errors could be bounded to some constant fraction of the estimate.

The use of mrkd-trees for density estimation was explored in some depth in [118]. The authors observed that non-parametric methods take the strongest stance in allowing the data to ``speak for themselves'', but that existing techniques render these methods intractable for large datasets. The pruning rule for the kd-tree approximation is essentially $ N_{T}K\left(\mu_{T}\right)$ where $ N_{T}$ is the number of elements defined at the node, $ \mu_{T}$ is the centroidal mass and $ K\left(\cdot\right)$ is the given kernel (the Epanechnikov form was preferred). A global error bound is achieved by approximating the overall log likelihood using local bounds maintained for each level of the query tree. Further improvements in accuracy can be obtained by constructing a priority queue of pruned nodes and allowing the approximation to be `undone' later if so desired. Empirical results suggested that the dual-tree algorithm approaches $ O\left(n\right)$ scaling although this was not proved.

A limitation observed in Deng and Moore's earlier work [94] was that the technique does not scale indefinitely into higher dimensions. This was said to arise because data points in 10 dimensions and beyond grow increasingly uniform in terms of distance between the points. Even so, beyond this level, Gray and Moore observed that that their algorithm was polynomial rather than exponential in the number of dimensions [118]. Furthermore, it was anticipated that performance should be improved using so-called ball-tree algorithms, even in very high dimensions [156].


5.4.5 Vector quantization

5.4.5.1 Background

In data compression, vector quantization (VQ) is a reduction technique used in `lossy' data compression in which the basic idea is to code (ie. replace with a key) values from a multi-dimensional vector space into values from a discrete subspace of lower dimension. The lower-space vector requires less storage space and the data is thus compressed. The transformation into the subspace is usually achieved through projection, or by using a codebook [28] (see algorithm below). These forms of compression are not explored further here (but see Khumbah and Wegman [141]), although it is noted that entropy-constrained VQ techniques have gained wide acceptance as effective clustering techniques in signal processing [86].

Instead, this section introduces research undertaken at Leeds in the area of video surveillance [137]. This work is of interest since video generates huge datasets and although the individual frames were analysed in 2D, objects within the scene were tracked in terms of position and trajectory, giving rise to 4D vector data. Semi-parametric methods (see 5.5) were considered for this task, but were rejected due to the limitations of handling large datasets and because the underlying distribution was unknown (see 5.5.5). VQ was essentially developed here as a method of non-parametric density estimation, allowing any number of `prototype' clusters to be derived and updated over an arbitrary period of time. Figure 26 gives a visualization of the movement of objects (pedestrians) in a typical scene resulting from this work.

Figure 26: Trajectories of people moving in a scene (from [136]
Image 28_home_richardh_reports_ds6_alg_figures_njohnson-temporal


5.4.5.2 Algorithm

VQ places a set of $ k$ prototypes within a given $ d$-dimensional space $ c_{i}\in\left[0,1\right]^{d}$, referred to as the codebook, which may be updated over $ N$ iterations. The method is outlined in Algorithm 2.
\begin{algorithm}
% latex2html id marker 1611
[!htbp]
\begin{enumerate}
\item Ra...
... iterations.
\end{enumerate}\par
\caption{
Vector quantization}
\end{algorithm}

This procedure follows Kohonen, who showed that such an algorithm is a gradient descent procedure for the approximation of an optimal VQ which acts to minimise the expected squared reconstruction error (see [142]). Moreover, when $ d\gg2$, the point density of prototypes, after the learning phase, approximates the probability density of training data, and each prototype represents (in a nearest-neighbour sense) an approximately equal number of training vectors. The cooling schedule detailed in step 4 was chosen to give a large initial gain which decreased gradually over the first half of the learning process, allowing prototypes to move rapidly into roughly the desired distribution. During the remainder of the learning process, the relatively small fixed gain allowed fine tuning of the prototype distribution as the system reached equilibrium.

The remaining model parameters must be determined experimentally. The number of iterations required to achieve a reasonable distribution is dependent on the number of prototypes, the dimensionality of the feature space, and the attributes of the distribution being estimated, and, in Johnson's experiments, was typically in the order of millions of iterations. In cluster analysis, a `natural number' of prototypes can be determined by observing the reconstruction error for increasing numbers of prototypes. In the estimation of dispersed distributions, such an analysis is only useful in determining an absolute minimum number of prototypes; the number chosen is then essentially arbitrary, more prototypes giving a more detailed representation.

5.4.5.3 Refinements

Johnson cited three areas where specific action was necessary to improve the performance of the algorithm:

To overcome some of these features of the optimisation technique, Johnson implemented and developed a number of extensions, including leaky learning and altruistic VQ. These neural network approaches are outlined in the author's thesis [137] and [138].


5.5 Mixture models (`semi-parametric methods')


5.5.1 Background

In their review in 2002, Fraley and Raftery observed that although considerable research had been undertaken in the area of conventional clustering techniques (see 5.2), there was little systematic guidance associated with these methods for solving basic practical questions such as how many clusters to select, the type of method or how how outliers should be handled. Moreover, the statistical properties of these methods were generally unknown, precluding any kind of formal inference [110]. At that time, it was a relatively recent discovery that mixture models could provide a principled statistical approach to these problems. The problems of determining the number of clusters and of choosing an appropriate clustering method could be recast as statistical model choice problems, different models compared and more components introduced (if needed) to represent outlying data.

This technique has therefore become firmly established as a clustering method, although it is essentially based on density estimation and exists a compromise between the parametric and non-parametric methods described above. In particular it was hoped to combine: (a) the ability of parametric models to be evaluated quickly for each new input value; with (b), the capacity for non-parametric models to accommodate very general distribution models [76]. Earlier work had also established that in some senses these methods of clustering were complementary. Hierarchical clustering based on maximum-likelihood tends to produce reasonably good partitions when started without any information about the groupings, whereas initialisation is critical for EM-based mixture models (see 5.2.4) because the likelihood surface tends to have multiple modes. EM, however, typically produces improved partitions when started from more reasonable partitions generated elsewhere. By initialising EM with partitions from hierarchical agglomeration and by using approximate Bayes factors with the Bayesian Information Criterion (BIC) approximation, a number of researchers had obtained good results in difficult problem areas such as minefield detection. Fraley and Raftery [109] later extended this method to select the parameterisation of the model as well as the number of clusters simultaneously using BIC.


5.5.2 Initial formulation

Given data $ y$ with independent multivariate observations $ y_{1},\ldots,y_{n}$, the likelihood for a mixture model with $ G$ components is:

$\displaystyle \mathcal{L}_{MIX}\left(\theta_{1},\ldots,\theta_{G};\tau_{1},\ldo...
...=1}^{n}\sum_{k=1}^{G}\tau_{k}p_{k}\left(y_{i}\left\vert\theta_{k}\right.\right)$

where $ p_{k}\left(\cdot\right)$ and $ \theta_{k}$ are the density and parameters of the $ k$th component in the mixture and $ \tau_{k}$ is the probability that an observation belongs to the $ k$th component ( $ \tau_{k}\geq0;\sum_{k=1}^{G}\tau_{k}=1$). Most commonly $ p_{k}\left(\cdot\right)$ is the multivariate normal density, parameterised by its mean $ \mu_{k}$ and covariance matrix $ S_{k}$. Some description of the analysis and breakdown of this form has been give in Section 5.4.3.1; other extensions using eigenanalysis and other techniques are summarised in [110].


5.5.3 The EM algorithm

The EM algorithm, introduced by Dempster et al in 1997 [93] is a general approach to maximum likelihood formulation which avoids the difficult problem of solving otherwise highly non-linear coupled equations [76]. The data are viewed as consisting of $ n$ multivariate observations $ x_{i}$ recoverable from $ \left(y_{i},z_{i}\right)$ in which $ y_{i}$ are said to be observed and $ z_{i}$ unobserved. If the $ x_{i}$ are independent and identically distributed (iid) according to a probability distribution $ p$ with parameters $ \theta$, then the complete-data likelihood is:

$\displaystyle \mathcal{L}_{C}\left(x_{i}\left\vert\theta\right.\right)=\prod_{i=1}^{n}p\left(x_{i}\left\vert\theta\right.\right)$ (27)

Further, if the probability that a particular variable is unobserved depends only upon the observed data $ y$ and not on $ z$, then the observed-data likelihood, $ \mathcal{L}_{O}\left(y\left\vert\theta\right.\right)$, can be obtained by integrating $ z$ out of the complete-data likelihood:

$\displaystyle \mathcal{L}_{O}\left(y\left\vert\theta\right.\right)=\int\mathcal{L}_{C}\left(x\left\vert\theta\right.\right)  dz$ (28)

The maximum likelihood estimate (MLE) for $ \theta$ based on the observed data maximises $ \mathcal{L}_{O}\left(y\left\vert\theta\right.\right)$.

The EM algorithm alternates between two steps in which the log-likelihood estimates are refined as follows:

The unobserved portion of the data may involve values that are missing due to non-response and/or quantities that are introduced to reformulate the problem for EM. Under fairly mild conditions, EM can be shown to converge towards a local maximum of the observed-data likelihood. In EM for mixture models, the ``complete data'' are considered to be $ x_{i}=\left(y_{i},z_{i}\right)$ where $ z_{i}=\left(z_{i1},\ldots,z_{iG}\right)$ is the unobserved portion of the data, with:

$\displaystyle z_{ik}=\left\{ \begin{array}{cc}
1 & if  x_{i}  belongs  to  group  k\\
0 & otherwise\end{array}\right.$

Assuming that each $ z_{i}$ is iid according to a multinomial distribution of one draw from $ G$ categories with probabilities $ \tau_{1},\ldots,\tau_{G}$ and that the density of an observation $ y_{i}$ given $ z_{i}$ is given by $ \prod_{i=1}^{n}p_{k}\left(y_{i}\left\vert\theta_{k}\right.\right)^{z_{ik}}$, the resulting complete-data log-likelihood is:

$\displaystyle \ell\left(\theta_{k},\tau_{k},z_{ik}\left\vert x\right.\right)=\s...
...{ik}\log\left[\tau_{k}p_{k}\left(y_{i}\left\vert\theta_{k}\right.\right)\right]$ (29)

The E-step of the EM algorithm for mixture models is given by:

$\displaystyle \hat{z}_{ik}\leftarrow\frac{\hat{\tau_{k}}p_{k}\left(y_{i}\left\v...
...j=1}^{G}\hat{\tau_{j}}p_{j}\left(y_{i}\left\vert\hat{\theta}_{j}\right.\right)}$ (30)

while the M-step involves maximising (29) in terms of $ \tau_{k}$ and $ \theta_{k}$, with $ z_{ik}$ fixed at the values of computed in the E-step, $ \hat{z}_{ik}$. For multivariate normal mixtures, the E-step is given by (30) with $ p_{k}$ is replaced by the Gaussian distribution regardless of the parameterisation. For the M-step, estimates of the means and probabilities have simple closed-form expressions involving the data and $ \hat{z}_{ik}$ from the E-step:

\begin{displaymath}
\begin{array}{ccccc}
\hat{\tau_{k}}\leftarrow\frac{n_{k}}{n}...
...i}}{n_{k}} & & n_{k}\equiv\sum_{i=1}^{n}\end{array}\hat{z}_{ik}\end{displaymath}

Computation of the estimate for covariance depends upon its parameterisation (Fraley and Raftery, for example, give an eigenvalue decomposition [110]).


5.5.4 Model selection

The two most basic issues in conventional cluster analysis -- the clustering method and the number of clusters -- are reduced to that of a single concern in mixture modelling, that is model selection. This will likely involve a trade-off: a simpler model may need more clusters to produce a good representation whilst fewer clusters may suffice for a more complex model. This is equivalent to the problem in conventional techniques with, say, the choice of using spherical or ellipsoidal clusters (see eg. 5.2.3).

One solution is based on Bayesian model selection via Bayes factors and posterior model probabilities. Prior probabilities are chosen (possibly to be equal) and the posterior probability of the model $ M_{k},  k=1,\ldots,G$ given data $ D$ is given by:

$\displaystyle p\left(M_{k}\left\vert D\right.\right)=p\left(D\left\vert M_{k}\right.\right)p\left(M_{k}\right)$

from which $ p\left(D\left\vert M_{k}\right.\right)$ can be obtained by integration. If the prior probabilities are the same, this amounts to choosing the model with the highest integrated likelihood. Alternatively, two models may be compared by using Bayes factors, $ B_{ik}$, that is the ratio of integrated likelihoods:

$\displaystyle B_{12}=p\left(D\left\vert M_{k}\right.\right)/p\left(D\left\vert M_{k}\right.\right)$

favouring $ M_{1}$ if $ B_{12}>1$ and usually taken as very strong evidence if $ B_{12}>100$. A good approximation to the log-likelihood may be obtained by computing the Bayesian inference criterion (BIC), where:

$\displaystyle BIC_{K}=2\log p\left(D\left\vert\hat{\theta_{k}},M_{k}\right.\right)-\nu_{k}\log n\approx2\log p\left(D\left\vert M_{k}\right.\right)$ (31)

where $ \nu_{k}$ is the number of independent parameters to be estimated in model $ M_{k}$. Numerous other methods of model selection have been proposed eg. using the entropy criterion and cross-validation likelihood (see [110]).


5.5.5 Limitations

It should be noted that EM estimation for mixture models has a number of limitations:

  1. The rate of convergence can be slow;
  2. The algorithm breaks down when the covariance associated with one or more components is singular or near-singular, which may well occur if one of the clusters contains few observations;
although EM generally gives good results if the data conform reasonably well to the model and the iteration is started at reasonable values.

Another limitation of model-based clustering is that in high dimensions, the number of parameters per component in multivariate normal mixtures (which allow the orientation to vary between the clusters) grows as the square of the dimension of the data. Moreover, if the dimension of the data is high relative to the number of observations, the covariance estimates in ellipsoidal models will often be singular, causing the EM algorithm to break down (although other models eg. spherical may still be applicable). Often, some kind of dimension reduction strategy is inevitable: see Mukherjee et al [159] and Murtagh et al [162] for examples using eg. PCA.

A major limiting factor is that model-based clustering as described here does not scale to large datasets. In particular, for efficiency reasons, the hierarchical agglomeration step has a memory requirement which is proportional to the square of the number of groups. (By default, each element is usually assigned to a separate partition.) The most usual approach is take a random sample of the data, and treat this sample as training data in a discriminatory analysis problem (see Section 5.7). Another restriction is that the computation time for EM iteration depends only upon the dimensionality of the data when the data can all be held in memory. If this is not the case, then computation time increases considerably and even if this is acceptable, numerical errors due to machine precision may well be a problem. Again, sampling the data may well be the only remaining possibility [110].

5.5.6 An example mixture model analysis

This example follows [103, Sec 6.4] in applying Fraley and Raftery's mclust routines to exoplanet data, but uses an updated dataset of 170 candidate planets around main sequence stars obtained recently (November 2005) from [171] (see Appendix A.4). Only the mass, period and eccentricity variables are used in this analysis. To generate the clusters, maximum-likelihood analysis identifies the model and groupings with the highest BIC number, as shown in Figure 27.

Figure 27: BIC plot for exoplanet data
Image 29_home_richardh_reports_ds6_alg_figures_exo-bic

This shows that 5 clusters has, in fact, the highest probability using model ``4''. The available models for cluster shapes are:

  1. Spherical, equal volume;
  2. Spherical, unequal volume;
  3. Diagonal, equal volume, equal shape;
  4. Diagonal, varying volume, varying shape;
  5. Ellipsoidal, equal volume, shape, and orientation;
  6. Ellipsoidal, varying volume, shape, and orientation;
where these terms refer to constraints upon the parameterisation of the covariance matrix ($ S$) -- see 5.4.3. The mclust routines also provide a plot of the clusters against these variables, reproduced in Fig. 28.
Figure 28: Pairs plot showing mixed modelling clusters
Image 30_home_richardh_reports_ds6_alg_figures_exo-pairs

Confidence regions can also be placed around the clustered regions as demonstrated in Fig. 29.

Figure 29: Confidence regions, placed around clusterings obtained in mclust
Image 31_home_richardh_reports_ds6_alg_figures_exo-class

It is interesting to note that in Everitt's original working of this example, which contained only 101 exoplanets discovered up to October 2002, model ``4'' was also selected, but that the number of clusters was suggested to be three. This combination also appears to be highly rated in the current BIC classification, but since 5 clusters is slightly higher, this is selected by default (the number of clusters can, however, be enforced in underlying EMclust algorithm).


5.6 Variance techniques


5.6.1 Canonical variate analysis

Canonical Variate Analysis (CVA) is used to examine the interrelationships between a number of populations simultaneously with a goal of objectively representing the interrelationships graphically in two or three dimensions. The axes of variation are chosen to maximize the separation between the populations relative to the variation within each of the populations. In algebraic terms, the first canonical variate is the linear combination of variables that maximizes the ratio of between-groups sums of squares to the within-groups sums of squares for a one-way multivariate analysis of variance of the canonical variate scores (see immediately below for definitions and [34]).


5.6.2 Contrast with other variance techniques

CVA, sometimes known as discriminant function analysis, is related to analysis of variance (ANOVA) techniques, since both employ measures of within-group and inter-group variance. These may be defined as:

the results of which may be analysed using eg. Wilks' lambda statistic:

$\displaystyle \Lambda=\left\vert W\right\vert/\left\vert T\right\vert$ (32)

If $ \Lambda$ is low, this indicates that the variation within samples variation is low compared to the total and is evidence that the samples do not come from populations with the same mean vector. Wilks' lambda can also be obtained from the eigenvalues, $ \lambda_{1}\ge\lambda_{2}\ge\ldots\ge\lambda_{p}\ge0$, of $ W^{-1}B$ using the identity:

$\displaystyle \Lambda=\prod_{i=1}^{p}1/\left(1+\lambda_{i}\right)$ (33)

Manly notes that there are various approximations for this technique along with test of exact critical values [147, sec.4.7]. An important application of these variance techniques is to create elbow plots (similar to scree plots; see Fig. 17): plotting the total within group sum-of-squares against the number of clusters allows an estimate for the optimal number of clusters to be obtained [103].

5.6.3 Usage in clustering

Everitt et al have observed that there are numerous variations on techniques employing these measures for clustering directly, using eg. minimisation of $ trace\left(W\right)$, minimisation of $ \left\vert W\right\vert$ and maximisation of $ trace\left(BW^{-1}\right)$, although the first of these is the most common [104]. One problem with this method, however, is that the results are scale-dependent, which presents problems if the data have to be standardised. A further problem is that it appears to impose a spherical structure upon the generated clusters, even when the observed clusters are clearly visible by eye.

The problem of scale-dependency provided the motivation to develop several other methods, using modified criteria:

ie. both are dependent only upon the eigenvalues of the system. Of these methods, the $ \left\vert W\right\vert$ minimisation method has proved to be the most popular as it seems to cope reasonably well with clusters of different shapes, so long as the clusters concerned contain roughly equal numbers of objects and are similar in shape to each other. Various extensions have been put forward to overcome this shape limitation, such as the use of dispersion matrices [104, p.96]. More recently, however, researchers have turned to statistical models to lend assistance in the interpretation of more the traditional techniques (and see [110] and 5.5).


5.7 Classification

5.7.1 Background

Clustering and classification are opposite sides of the same coin, the main difference being that in classification tasks, we are more concerned with labelling the clusters (and thus, by definition, each observation) in some way. The specification of classes might be through some expert presence, eg. a botanist describing the species present in the iris data, or through some discriminant present in the data eg. ellipticity values to distinguish between stars and galaxies. In either case, some observations may be misclassified and hence the concept of validation is important. This issue is discussed in Section 5.7.2. Given the evident similarity between clustering and classification tasks, this section does not this subject in depth, but we extend some examples discussed earlier in the text to illustrate how clustering procedures may be adapted (Sections 5.7.3 and 5.7.4).


5.7.2 Validation

An important technique in classification, especially in data-mining tasks, is that of cross-validation. Several variations can be described [29]:

It is also important to understand the significance of the decision boundary. The simplest form is to choose the point at probability distributions meet, since this will, by definition, give the smallest region of error [76, Fig. 1.15]. The actual choice, however, will depend upon the risks associated with misclassification: eg. misdiagnosing a benign cancer may be far less damaging than failing to observe another that is malignant. In the general case, therefore, it is only necessary to choose the class for which:

$\displaystyle P\left(C_{k}\left\vert x\right.\right)>P\left(C_{j}\left\vert x\right.\right),\; for  j\neq k$

or, more commonly:

$\displaystyle p\left(x\left\vert C_{k}\right.\right)P\left(C_{k}\right)>p\left(x\left\vert C_{j}\right.\right)P\left(C_{j}\right),\; for  j\neq k$ (34)

For risk management, however, it is also necessary to introduce a likelihood function, usually a loss-matrix of weights indicating penalties for misclassification (see eg. [76, Sec. 1.10]).


5.7.3 Parametric discriminant functions

Equation 34 may be expressed as the following discriminant function:

$\displaystyle y_{k}\left(x\right)=\log p\left(x\left\vert C_{k}\right.\right)+\log P\left(C_{k}\right)$ (35)

where $ C_{k}$ denotes the $ k$th class and $ P\left(C_{k}\right)$ the corresponding prior probability. Each new input vector $ x$ is assigned to the class $ C_{k}$ which gives the largest value for the corresponding discriminant $ y_{k}\left(x\right)$. If each of the class-conditional density functions $ p\left(x\left\vert C_{k}\right.\right)$ in (35) is taken to be an independent normal distribution, then we can write:

$\displaystyle y_{k}\left(x\right)=-\frac{1}{2}\left(x-\mu_{k}\right)^{T}S_{k}^{...
..._{k}\right)-\frac{1}{2}\log\left\vert S_{k}\right\vert+\log P\left(C_{k}\right)$ (36)

where the constant terms have been dropped. The decision boundaries, along which $ y_{k}\left(x\right)=y_{j}\left(x\right)$, are therefore quadratic functions in $ d$-dimensional space. An important simplification occurs if $ S_{k}=S$, since the $ \left\vert S_{k}\right\vert$ and $ x^{T}S^{-1}x$ terms are class independent and may be dropped from Equation 36. Also, since $ S$ is symmetric, its inverse will be symmetric and it follows that $ x^{T}S^{-1}\mu=\mu^{T}S^{-1}x$. This gives a set of discriminant functions which can be written in the form:

$\displaystyle y_{k}\left(x\right)=w_{k}^{T}x+w_{k0}$ (37)

where:

$\displaystyle w_{k}^{T}=\mu^{T}S^{-1}$

$\displaystyle w_{k0}=-\frac{1}{2}\mu^{T}S^{-1}\mu_{k}+\log P\left(C_{k}\right)$

The function in Equation 37 are examples of linear discriminants, since they are linear functions of $ x$. Decision boundaries corresponding to $ y_{k}\left(x\right)=y_{j}\left(x\right)$ are then hyperplanar. Linear discriminants are closely related to neural network models which have a single layer of adaptive weights [76].


5.7.4 Non-parametric discrimination

5.7.4.1 Approach:

As with clustering, non-parametric methods would seem to be most offer the widest scope for astronomy, since fewer assumptions about the data are made. This section considers the nearest neighbour (Section 5.7.4.2) and tree-based approaches (5.7.4.3) and briefly describes refinements which may be obtained by boosting and bagging techniques (5.7.4.4). Another technique which can be mentioned is that of class cover catch digraphs, although libraries for these are not yet available; see [149] for discussion of the application to data from the Sloan Digital Sky Survey.


5.7.4.2 Nearest neighbour classifier:

The computation of densities from the nearest neighbour (NN) rule was described in Section 5.4.4.3. A classifier can be generated through the use of Bayes' Theorem by modelling the class-conditional densities for each class separately and then combining them with priors to give posterior probabilities. The KNN technique can be adapted as follows:

$\displaystyle p\left(x\left\vert C_{k}\right.\right)=\frac{K_{k}}{N_{k}V}$ (38)

$\displaystyle P\left(C_{k}\right)=\frac{N_{K}}{N}$

$\displaystyle p\left(C_{k}\left\vert x\right.\right)=\frac{p\left(x\left\vert C_{k}\right.\right)p\left(C_{k}\right)}{p\left(x\right)}=\frac{K_{k}}{K}$

Thus, to minimise the probability of misclassifying a new vector $ x$, it should be assigned to the class $ C_{k}$ for which the ratio $ K_{k}/K$ is the largest [76]. This is known as the K-nearest-neighbour classification rule. When $ K=1$, this is known simply as the NN rule and new samples are classified by calculating the distance to the nearest training case. It is common to select $ K$ to be small and odd to break ties (typically 1, 3 or 5). Larger $ K$ values help reduce the effects of noisy points within the training data set, and the choice of $ K$ is often performed through cross-validation.

One of the chief drawbacks of the NN classifier is that it is slow to execute, since every point is checked, although techniques are available for improving performance. One approach is to pre-sort the training sets in some way, eg. by kd-trees (see 5.4.4.4) or Voronoi cells. Another solution is to choose a subset of the training data such that classification by the $ 1$-NN rule (using the subset) approximates the Bayes classifier. This can result in significant speed improvements as $ K$ can now be limited to 1 and redundant data points have been removed from the training set. These data modification techniques can also improve the performance through removing points that cause misclassifications (see also: condensing and editing [123,43]).


5.7.4.3 Recursive partition trees:

Recursive partitioning developed as data mining tool in the 1990s and provides a reasonably fast, although computer-intensive, choice for non-parametric classification. The CART system is available commercially (Salford Systems [56]), although similar packages are available for S-Plus and R [189]. Sutton argues that the technique has several further advantages: (i) little expertise is required to interpret the results; (ii) important variables are identified as a result; and (iii) the method can also be used to impute missing values [184].

The tree is built by the following process: first the single variable is found which `best' splits the data into two groups. The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or until no improvement can be made. The resultant model is, with certainty, too complex, and care is needed to decide when to stop. The second stage of the procedure consists of using cross-validation to trim back the full tree.

The details of the forward tree creation and pruning algorithms are somewhat complex, but are fully described in Therneau et al [190]. We note, however, that in the tree-growing process, the split selected at each stage is the one that leads to the greatest reduction in the sum of squared differences between the response values for the learning sample cases, corresponding to a particular node and their sample mean (or possibly the median). The decision may be based upon one or linear combinations of variables and may also include categorical data. An impurity function is usually employed which attempts to minimise this error.

Figure 30 displays a tree generated by the rpart algorithm for the sunspot data described earlier in 4.4.4.

Figure 30: Recursive tree classification for the sunspot data (using [189])
Image 32_home_richardh_reports_ds6_alg_figures_rtree

Only the number of spots, area of spots, number of flares and magnetic classes (4 categories present; see [53]) are included as variates. Interestingly, the first split lies mainly upon the magnetic class, such that the alpha class is placed left in the hierarchy, whilst the remaining classes are distributed to the right. The other divisions are governed by the criteria shown in the figure, with the number of observations in each class given by the ``n='' term, except the node at ``mag_class=b'', which divides the beta and gamma magnetic groups.

It should be noted, however, that Sutton observes several caveats in using recursive partitioning [184, p.317]:


5.7.4.4 Boosting and bagging:

Like recursive tree partitioning (upon which these algorithms depend; see above), these techniques are aimed primarily at reducing the variance associated with prediction from regression. Bagging (a shortened form of `Bootstrap aggregating') is a special case of model averaging approach [79]. The algorithm can be given as [30]:

Later improvements were suggested in [80] to reduce bias. Sutton has observed that bagging works best when the base regression or classification procedure is unstable. That is, when small changes in the learning sample can produce appreciable differences in the predictions obtained.

Boosting is a highly respected method of combining classifiers, which are iteratively created from weighted versions of the learning sample, with the weights adaptively adjusted at each step to give increased weight to cases which were misclassified on the previous step [123]. The final predictions are obtained by weighting the results of the iteratively produced predictors. The technique was originally developed for weak learners ie. a classifier which gives only slightly better results than random guessing. At each stage, the output of the weak learner is then added to the learned function, with some weighting (proportional to how accurate the weak learner is). Then, the data are reweighted: examples that the current learned function gets wrong are ``boosted'' in importance, so that future weak learners will attempt to fix the errors [184].

There are several different boosting algorithms, depending on the exact mathematical form of the strength and weight. One of the most common boosting algorithms is AdaBoost (available for R from [96]). Most boosting algorithms fit into the AnyBoost framework, which shows that boosting performs gradient descent in function space [31].


5.8 Neural Nets

5.8.1 Approach

This section gives a brief overview of neural networks (NNets), which provide many analogues for the dimension reduction and clustering techniques described above. The main difference is that training data will normally have to be provided, so that weights for the NNet can be initialised. The AstroGrid/VOTech project is fortunate to be able to exploit the work of the Astroneural team for astronomical purposes, and this work is briefly reviewed in Section 5.8.4. Section 5.8.2 gives an overview of the history and development of NNet. A simple algorithm is given in 5.8.3. More extensive coverage is available in Bishop [76] and Hastie [123].


5.8.2 Overview of neural networks

Most neural approaches are based on combinations of elementary processors (neurons), each of which takes a number of inputs and generates a single output. Associated with each input is a weight, and the output (in most cases) is then a function of the weighted sum of inputs; this input function may be discrete or continuous, depending on the variety of network. If the inputs are denoted by $ v_{1},v_{2},\ldots$ and the weights by $ w_{1},w,\ldots$, then the total input to the neuron is:

$\displaystyle x=\sum_{i=1}^{n}v_{i}w_{i}-\theta$ (39)

where $ \theta$ is a threshold associated with the neuron. Also associated with the neuron is a transfer function $ f\left(x\right)$ which provides the output; common examples are:

$\displaystyle f\left(x\right)=\left\{ \begin{array}{cc}
0 & if  x\leq0\\
1 & otherwise\end{array}\right.$

and:

$\displaystyle f\left(x\right)=\frac{1}{1+e^{-x}}$

This model saw a lot of use during the earliest phase of development culminating in the concept of the perceptron [76,179]. As these neurons become interconnected, they are intended to mimic actions in the brain. These interconnections may then form a simple system, something which is highly structured or else more difficult non-linearities. Examples may be:


5.8.3 Feed-forward networks

A significant advance in the NNet research came with the development of the back-propagation algorithm which trains strictly layered networks in which it is assumed that at least one layer exists between the input and output (in fact it can be shown that two such `hidden' layers always suffice). Such a NNet is an example of a feed-forward network, in which data are admitted at the inputs and travel towards the outputs, at which the `answer' may be read.

The standard approach to use such networks is to obtain a training set of data ie. a set of labelled vectors.This is used to teach a network with some training algorithm, such that the NNet can perform the association accurately. In `live' mode, the NNet is fed by unknown patterns and the answers are generated according to what has been learned. Back-propagation works by comparing the output of the network to that expected, and computing an error measure based upon the sum of squared differences. This is then minimised using a gradient descent method implemented by changing the weights of the NNet. Denoting a member of the training set by $ v^{i}$, the outputs by $ y^{i}$ and the desired outputs by $ \omega^{i}$, the error is:

$\displaystyle E=\sum_{i}\sum_{j}\left(y_{j}^{i}-\omega_{j}^{i}\right)^{2}$

with updates given by:

$\displaystyle w_{ij}\left(k+1\right)=w_{ij}\left(k\right)-\epsilon\frac{\delta E}{\delta w_{ij}}$

A summary of the procedure is given in Algorithm 3.
\begin{algorithm}
% latex2html id marker 2172
[!htbp]
\begin{list}{}
{
\settowid...
...\emph{epoch}.\end{list}\par
\caption{
Back-propagation learning}
\end{algorithm}
It should be noted that, as presented, this algorithm is rather slow but there is an extensive literature upon speeding-up the algorithm eg. by the introduction of momentum [124]. Numerous other extensions can be be listed, notably by Kohonen [142]for unsupervised learning and Hopfield for optimisation [130].


5.8.4 The Astroneural Project

Early work by the Astroneural project was targeted towards star and galaxy separation in wide field astronomical images. It was found that a feed-forward neural network with two layers was able to generate a robust non-linear PCA reduction of the input images with comparable performance in the classification of stars/galaxies to the SExtractor tool [74][69].

Later, research was focused upon the generation of probabilistic principal surfaces (PPS) in the fields of gene data mining [182] and astronomical visualization [181]. PPS are a nonlinear extension of principal components, such that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, PPS are a generalisation of the Generative Topographic Mapping. The endpoint is a data reduction from high-dimensional space onto a 3D spherical surface. Having projected the data into the latent sphere, the analysis must localize the most interesting data points (this depends on the specific application). For example, the ones lying farthest away from more dense areas, or the ones lying in the overlapping regions between clusters may be of interest and the system allows the user to select and obtain all the information about the typology of the data.

Typically, however the clusters or outliers detected in this way are hard to interpret and hence it is usual for the PPS analysis results to be used as an input into more conventional clustering algorithms eg. hierarchical clustering. Evaluation of the system using the Great Observatory Origin Deep Survey catalogue (28405 objects with 21 parameters; a `sub-medium' sized table of $ 10^{5}$ entries) showed an initial error rate 60%, which was reduced to 2.15% (best case) after tuning (timing for analysis or tuning was not given). At the current time the developers are seeking to port the code (presently MATLAB-based) into a non-proprietary format.


6 Recommendations


6.1 Summary

This report has attempted to give formal definition to the problem of handling large, multidimensional tables of data which frequently result from searches of astronomical catalogues. It has been helpful to employ the nomenclature of Wegman and Solka in describing such tables as `small', `medium', `large' etc, so that we can speak of different classes of problem and the solutions which may be appropriate. This is the approach adopted in Section 6.2, where it will be seen that, with some reservations, the various visualization techniques described may permit access to the next level of table size.

Although such large tables frequently occur, it is also usually the case that many of the included variables are of secondary interest and can thus be immediately discarded or `hidden'. It is somewhat surprising therefore that many software tools give this process little consideration. It is common for example for menus to be displayed accepting all data by default and giving little informative analysis (eg. HCE) or even proceeding without further ado to draw the display (eg. XmdvTool). For medium or large data tables, this is usually a mistake and will generally cause the application to freeze or crash. The possible format for an improved data selection tool is described in Section 6.4.


6.2 Data size limitations

Table 7 presents the applicability of various sample algorithms and visualization techniques for datasets of small to large sizes.

Table 7: Algorithms and techniques for various dataset sizes

Algorithm/

Technique

Small:
$ 10^{4}$ entries
Medium:
$ 10^{6}$ entries
Large:
$ 10^{8}$ entries
Comments
Visual techniques:
Scatterplot y y y (single) Achievable, but over-plotting is usually an issue
P Coords y y n As above. The eye is usually drawn to outliers, rather than main data blocks
Splatting y y y Scaling may be a problem ie. smaller features may be swamped
Glyphs y n n May not be readable, even for small datasets
         
Column reduction algorithms:
PCA y y probably For larger datasets, possibly non-interactively (eg. using Lapack [68], not tested)
FA y y n Possible, if dimensionality is not too high. (Rotation of factors can be expensive: less demanding methods may lack sound basis)
MDS y n n Not suitable for tables with many observations
Projection pursuit y n n  
         
Row reduction algorithms
Conventional clustering y n n eg. distance-based such as K-means
Density estimation y y? n Mileage will vary upon choice of parametric/ non-parametric types
kd-tree y y y Initial investment in the tree construction may be high, since $ O\left(n\log n\right)$. Tested by [118]
mixed-model clustering y n n As tested with Fraley and Raftery's mclust in R [110]

(P Coords = parallel coordinates)


A useful feature of many of the visualization techniques presented is that they remained reasonably interactive (with some patience) at least with medium-sized datasets, although problems occurred if the number of variables exceeded 20, in that the display may be drawable, but is unlikely to be readable. However, if techniques such as PCA or FA can be meaningfully applied, these will often reduce the number of variables to manageable number. This technique can also be recommended as a prelude to further analysis using eg. parallel coordinate plotting to examine for outliers or errors.

An important exception to this general rule is with the R package itself (although it may be argued that R is more of a statistical than a visualization tool). The authors write:

There are limitations on the types of data that R handles well. Since all data being manipulated by R are resident in memory, and several copies of the data can be created during execution of a function, R is not well suited to extremely large data sets. Data objects that are more than a (few) hundred megabytes in size can cause R to run out of memory [166].
Empirical evidence here suggests that R can import medium-sized tables reasonably well, but that $ O\left(n\log n\right)$ or $ O\left(n^{2}\right)$ type algorithms will usually fail. It is reasonably easy to take random or bootstrap samples, however, to perform clustering, with $ 10000$ samples often giving a reasonably fast compromise. Large tables (from experience, $ >100MB$ ASCII) will usually not load.

Even if it were possible to load such large tables, the data are usually too numerous to assess from ordinary plots and histograms or density plots will be needed. (Density plots are essentially 2D histograms accumulated with a bin-width equal to the pixel resolution of the display.) However, various authors have cautioned against the use of histograms in such situations since noise within the data and the origin and orientation of bins can have a profound effects upon the resulting display (see 5.3.2). Density estimation, which applies a smoothing kernel to data samples, is therefore recommended.

6.3 Enabling tools for the Virtual Observatory

6.3.1 R

Appendix B gives a summary of R routines used in ththiss report and a list of others which were provisionally explored. This list is not exhaustive and it is likely that many other statistical functions would be useful (eg. ANOVA). Besides the size limitations described above, the scripting interface represents presents several disadvantages:

  1. A new programming language has to be mastered and commands entered at the command line (albeit scripts may be ``sourced'').
  2. Having overcome this hurdle, this allows the user considerable flexibility with respect to handling variables, repeating analyses and plot presentation, although there is little visual interaction.
  3. Extensions to Rcmdr and GGobi allow an interactive display of the data, but this is currently `one-way' only eg. events in the GGobi interface cannot presently be translated into R code.
  4. The ability to create scripts, combined with access to operating system commands (via ``system'') create a considerable security issue.
Whilst R encourages interactivity with respect to analyses, therefore, there is little support for assessing findings visually. Nevertheless, there are several possibilities for creating a more interactive display, either by using the C interface directly, or via the RServe sockets interface [198]. The latter permits R code to be run from connections from C++ or Java clients and can include a basic security challenge. More securely still, RServe may be run as a local server and alongside the direct C interface, this is the preferred mode for a newly proposed tool (Eirik), described in Section 6.4.

6.3.2 XmdvTool and other tools

XmdvTool has been released in two versions: a `lite' Java version (which has already been VO-enabled - see [186]) and a larger C++ based version which is able to cope with larger data sets (by employing threading and pre-fetching algorithms [169]) and provides a number of extra utilities (such as column reordering and hierarchical clustering; see 2.4.3). The lite version, however, relies upon a native openGL interface (GL4java) which is no longer available and is only capable of handling very small data sets. As a provisional test, the code has been transferred to make use of the later jogl interface [50], and this shows a greater level of interactivity at least into the small-medium table range for the parallel coordinate display.

It is likely that, with further optimisation, the jogl version could be a credible tool for datasets of medium size, but given the higher-level functionality of the full version, it may be more desirable to place the effort into enabling the full version initially. It is intended that a library of xmlrpc calls can be constructed for Eirik to communicate with the AstroGrid system (eg. Workbench), and hence it is supposed that the conversion of either xmdv versions will be of similar complexity. A faster and simpler alternative would be to adopt the R `one-way' approach: data could be pushed out to XmdvTool in its own ``okc'' ASCII format. This would no doubt be less effective, but a similar approach could be adopted for many other tools, including those only available in binary formats such as CViz and HCE (see 2.4.6).


6.4 A data selection tool: Eirik

Given the various problems highlighted throughout this report with respect to the handling of large datasets, it is suggested that a data selection tool in the manner of Dos Santos (see 2.6), may be a useful technique to develop further. However, a tree-based metaphor for data selection (perhaps like that of the hypviewer in Section 2.5) is thought to be more appropriate. It is hoped that this interface may be combined with `sparkline' summaries of the data (see 2.2.2) to maintain an overview of the data.

Another significant concern is that of simple combinatorial complexity in investigating the relationships between variables. Choosing each possible combination of 2 axes from 10 variables, for example, requires searching through 45 possible plots - in 3D, this figure rises to 120 or 210 plots depending on whether 3 or 4 `axes' are selected. This may represent a colossal investment of the users time, especially if the data are densely packed or scattered throughout the volume, since many zoom-type operations will be needed and the possibility of losing the reference frame and getting lost in the data is quite high. It is proposed that, by linking to R (and other statistical libraries) it should be possible to present density plots of each variable of interest alongside `levelplots' of covariance and mutual information statistics (see 5.3.3) so that groupings and relationships in the data are more visibly exposed. Further consideration of the format of this tool will be given in a future document.

Appendices





A. Datasets


A..1 SuperCOSMOS

The dataset is from SuperCOSMOS Science Archive (see [62] "Data sets" for a further reduced version). The objects have been calibrated both as if they were stars and galaxies with the intention that these two classes should be distinct. A small number of objects may have been mis-classified.


A..2 Iris data set

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica [187]. Although setosa a reasonably distinct, the latter are not linearly separable.


A..3 Quasar data

The catalog contains 46,420 objects from DR3 that have apparent i-band PSF magnitude fainter than 15, absolute i-band magnitudes brighter than -22, contain at least one emission line or are unambiguously broad absorption line quasars, and have highly reliable redshifts. The catalog was created by inspecting all spectra that were either targeted as quasar candidates, or classified as a quasar by the spectroscopic pipelines.


A..4 Exaplanet data

The Extrasolar Planets Encyclopaedia, CNRS - Paris Observatory (last updated 22-11-2005)


A..5 SDSS data

A very large dataset from Sloan Digital Sky Survey, submitted to XmdvTool [62].


A..6 Sunspot and flare activity

Data obtained though an AstroGrid Workflow, looking at the activity of the sun, over the last half-solar cycle (approx 11 years).


B. Example R routines employed in this report


B..1 Principal Components Analysis

princomp is a generic function with "formula" and "default" methods. The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp. Note that the default calculation uses divisor N for the covariance matrix.


B..2 Factor Analysis

The factor analysis model is: x = Lambda f + e for a p-element row-vector x, a p x k matrix of loadings, a k-element vector of scores and a p-element vector of errors. None of the components other than x is observed, but the major restriction is that the scores be uncorrelated and of unit variance, and that the errors be independent with variances Phi, the uniquenesses. Thus factor analysis is in essence a model for the covariance matrix of x: Sigma = Lambda'Lambda + Psi


B..3 kde2d

Two-dimensional kernel density estimation with an axis-aligned bivariate normal kernel, evaluated on a square grid.


B..4 klaR

Produces an object of class EDAM which is a two dimensional representation of data in a rectangular, equally spaced grid as known from Self-Organizing Maps.


B..5 MClust


B..6 Rggobi

Interface to GGobi (presently one way). GGobi is visualization software for viewing high-dimensional data. A variety of multivariate graphics such as tours and parallel coordinate plots are available. Plots are interactive and multiple plots are linked with user brushing and identification. The methods have grown from exploratory data analysis, an area pioneered by John Tukey, building on the current fast graphically capable desktop computers to enable highly interactive, real-time, dynamic exploration of data.


B..7 RTree

Recursive Partitioning and Regression Trees. This differs from the tree function mainly in its handling of surrogate variables. In most details it follows Breiman et. al. quite closely.


B..8 Other packages

Other useful contributed packages (list not exhaustive!): ade4, cclust, circular, cluster, dr, e1071, ellipse, fastICA, flexcClust, fpc, gclus, GPArotation, gplots, grnnR, kernlab, KernSmooth, lattice, MASS, mcmc, mda, moments, mvpart, pcurve, pvclust, Rcmdr, rgl, Rmpi, RWeka, simpleboot, sm, snow, vegan

Bibliography

1
http://www.nag.co.uk/welcome_iec.asp.
10-10-2005.

2
http://www.amiravis.com.
14-10-2005.

3
http://www.star.bris.ac.uk/~ mbt/topcat/.
22-09-2005.

4
http://aladin.u-strasbg.fr/aladin.gml.
03-10-2005.

5
http://www.euro-vo.org/pub/workshop/presentation/EUROVO PresentationGAVO.ppt.
03-10-2005.

6
http://www.ivoa.net/Documents/.
03-10-2005.

7
http://davis.wpi.edu/~ xmdv/datasets/supercoss.html.
26-10-2005.

8
http://web.cs.wpi.edu/~ matt/courses/cs563/talks/glyphs/chernoff.html.
04-10-2005.

9
http://www.quantlet.com/mdstat/scripts/mva/htmlbook/mvahtmlnode9.html.
26-10-2005.

10
http://www.cs.ubc.ca/~ tmm/papers/tj/.
26-10-2005.

11
http://www.inxight.com/.
10-10-2005.

12
http://www.kartoo.com/.
10-10-2005.

13
http://www.touchgraph.com/.
10-10-2005.

14
http://www.opendx.org/.
10-10-2005.

15
http://matplotlib.sourceforge.net/.
17-10-2005.

16
http://www.kdnuggets.com/software/visualization.html.
03-10-2005.

17
http://otal.umd.edu/Olive/Product.html.
03-10-2005.

18
http://www.sai.msu.su/sal/D/1/index.shtml.
03-10-2005.

19
http://www.visbox.com/wallMain.html.
11-10-2005.

20
http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-13.html.
23-09-2005.

21
http://www.lc.leidenuniv.nl/awcourse/oracle/em.920/a86647/gls.htm.
22-09-2005.

22
http://www.datamodel.org/DataModelCardinality.html.
23-09-2005.

23
http://en.wikipedia.org/wiki/Dimensional_database.
17-10-2005.

24
http://linkage.rockefeller.edu/wli/glossary/stat.html.
22-09-2005.

25
http://www.geocities.com/mohamedqasem/vectorquantization/vq.html.
22-09-2005.

26
http://www.statsoft.com/textbook/stfacan.html.

27
http://en.wikipedia.org/wiki/Data_clustering.
21-12-2005.

28
http://en.wikipedia.org/wiki/Vector_quantization.
8-12-2005.

29
http://en.wikipedia.org/wiki/Cross-validation.
9-12-2005.

30
http://en.wikipedia.org/wiki/Bootstrap_Aggregating.
10-12-2005.

31
http://en.wikipedia.org/wiki/Boosting.
10-12-2005.

32
Astronomical information processing system (aips++) library.
http://aips2.nrao.edu/docs/aips++.html.
21-12-2005.

33
Bootstrapping.
http://en.wikipedia.org/wiki/Bootstrap.
31-01-2006.

34
Canonical variate analysis.

http://palaeo-electronica.org/2003_2/egypt/methods.htm.
11-11-2005.

35
CRAN Contributed packages.
http://cran.r-project.org/.
17-12-2005.

36
Data mining.
http://en.wikipedia.org/wiki/Data_mining.

37
DS6 planning stage 01.

http://wiki.eurovotech.org/bin/view/VOTech/DS6PlanningStage01.

38
DS6 planning stage 02.

http://wiki.eurovotech.org/bin/view/VOTech/DS6PlanningStage02.

39
DS6 software survey.

http://wiki.eurovotech.org/bin/view/VOTech/DS6SoftwareSurvey.
23-09-2005.

40
Eigenvalue, eigenvector and eigenspace.
http://en.wikipedia.org/wiki/Eigenvalue.
31-01-2006.

41
F. lederman.

http://home.subnet.at/flo/mv/parvis/introduction.html.
03-10-2005.

42
Graphics and web design based on edward tufte's principles.

http://www.washington.edu/computing/training/560/zz-tufte.html.
07-10-2005.

43
Handwritten digit recognition: Nearest neighbour classifier.

http://www.robots.ox.ac.uk/~ dclaus/digits/neighbour.htm.
9-12-2005.

44
Hierarchical clustering explorer.
http://www.cs.umd.edu/hcil/hce/.

45
HiSee.
http://hisee.sourceforge.net/.
10-10-2005.

46
Icasso: software for investigating the reliability of ica estimates by clustering and visualization.

http://www.cis.hut.fi/projects/ica/icasso/.
13-11-2005.

47
IDL.
http://www.rsinc.com/idl/.
03-10-2005.

48
The iris database.

ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases/REAL/iris.
04-11-2005.

49
IVOA UCD1+ Documentation.

http://cdsweb.u-strasbg.fr/w/doc/UCD/.
23-09-2005.

50
JOGL.
https://jogl.dev.java.net/.

51
Markov chain monte carlo.
http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo.
31-05-2006.

52
MIRAGE.
http://cm.bell-labs.com/who/tkh/mirage/.
27-10-2005.

53
Mount wilson sunspot magnetic classification.

http://www.spaceweather.com/glossary/magneticclasses.html.
12-10-2005.

54
Pca - principal component analysis.
http://www.eng.man.ac.uk/mech/merg/Research/datafusion.org.uk/pca.html.
30-01-2006.

55
The plotutils package.

http://www.gnu.org/software/plotutils/.
17-20-2005.

56
Salford systems.
http://www.salford-systems.com.
10-12-2005.

57
Spotfire.
http://www.spotfire.com/products/decisionsite.cfm.
03-10-2005.

58
Standard eigenvalue and singular value problems.

http://www.netlib.org/lapack/lug/node29.html.
31-10-2005.

59
Third edition of the SDSS Quasar catalog.

http://www.sdss.org/dr3/products/value_added/qsocat_dr3.html.
04-11-2005.

60
Visualization toolkit.
http://public.kitware.com/VTK/.
10-10-2005.

61
What's new in Amos 6.0.

http://www.spss.com/amos/whats_new.htm.
28-09-2005.

62
XmdvTool.
http://davis.wpi.edu/~ xmdv/.

63
XmdvTool skyserver data set.

http://davis.wpi.edu/~ xmdv/datasets/skyServer.html.
23-09-2005.

64
Readings in information visualization -- using vision to think.
1999.

65
H.-M. Adorf, G. Lemson, W. Voges, H. Enke, and M. Steinmetz.
Astronomical catalogues -- simultaneous querying and matching.
In F. Ochsenbein, M. Allen, and D. Egret, editors, Astronomical Data Analysis Software and Systems XIII, volume 314, pages 281-284, 2003.

66
C. Ahlberg and B. Shneiderman.
The Craft of Information Visualization, chapter 1 (Visual Information Seeking: Tight Coupling of Dynamic Query Filters with Starfield Displays), pages 7-13.
Morgan Kaufmann Publishers, 2002.

67
A.A. Al-Subaihi.
Variable selection in multivariable regression using sas/iml.
Journal of Statistical Software, 7(12), 2002.

68
E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen.
LAPACK Users' Guide.
Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.

69
S. Andreon, G. Gargiulo, G. Longo, R. Tagliaferri, and N. Capuano.
Neural nets and star/galaxy separation in wide field astronomical images.
In Neural Networks (IJCNN '99), International Joint Conference on, volume 6, pages 3810-3815, July 1999.

70
D.F. Andrews.
Plots of high dimensional data.
Biometrics, 28:126-136, 1972.

71
D.A. Bader and J. Jaja.
Practical parallel algorithms for dynamic data redistribution.
10th International Parallel Processing Symposium (IPPS '96), 00:292-295, 1996.

72
U. Becciani, C. Gheller, M. Comparato, and A. Costa.
VisIVO: A VO enabled tool for scientific visualization and data analysis.
http://wiki.eurovotech.org/bin/view/VOTech/VisIVO.
10-10-2005.

73
D. Beerman, T. Munzner, and G. Humphreys.
Scalable, robust visualization of very large trees.
In K.W. Brodlie, D.J. Duke, and K.I. Joy, editors, Eurovis Data Visualisation 2005 (Eurographics - IEEE VGTC Symposium on Visualization), 2005.
10-10-2005.

74
E. Bertin and S. Arnouts.
Sextractor: Software for source extraction.
Astronomical Astrophysics Supplement, 117:393, 1996.

75
J. Bertin.
Graphics and graphic information-processing.
Walter de Gruyter, Berlin/New York, 1981.
Translated by W.J. Berg and P. Scott.

76
C.M. Bishop.
Neural Networks for Pattern Recognition.
Oxford University Press, 2002.

77
C.P. Botha and F.H. Post.
New technique for transfer function specification in direct volume rendering using real-time visual feedback.
In S.K. Mun, editor, Proceedings of the SPIE International Symposium on Medical Imaging - Visualization, Image-Guided Procedures, and Display, volume 4681, 2002.

78
J.W. Bradley and R.W. West.
Interactive java tools for exploring high-dimensional data.
Journal of Statistical Software, 6(1), 2001.

79
L. Breiman.
Bagging predictors.
Machine Learning, 24(2):123-140, 1996.

80
L. Breiman.
Using iterated bagging to debias regressions.
Machine Learning, 45(3):261-277, Dec 2001.

81
A. Buja, D.F. Swayne, M. Littman, N. Dean, and H. Hofmann.
XGvis: Interactive data visualization with multidimensional scaling.
Journal of Computational and Graphical Statistics, 2001.

82
J.B. Carroll.
An analytical solution for approximating simple structure in factor analysis.
Psychometrika, 18:23-38, 1953.

83
C-C. Chang and C-J. Lin.
LIBSVM - A library for support vector machines.
http://www.csie.ntu.edu.tw/~ cjlin/libsvm/.

84
C. Chatfield and A.J. Collins.
Introduction to Multivariate Analysis.
Chapman and Hall, London, New York, 1980 (reprinted 1989).

85
C. Chen.
Visualization viewpoints.
In T-M. Rhyne, editor, IEEE Computer Graphics and Applications, July/August 2005.

86
P.A. Chou, T. Lookabaugh, and R.M. Gray.
Entropy-constrained vector quantization.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:31-42, 1989.

87
W.S. Cleveland and M.E. McGill, editors.
Dynamic graphics for statistics.
Wadsworth & Brooks/Cole, 1988.

88
D. Cook.
Calibrate your eyes to recognize high-dimensional shapes from their low-dimensional projections.
Journal of Statistical Software, 2(6), 1997.

89
R.D. Cook.
Regression Graphics.
John Wiley & Sons, 1998.

90
T.F. Cox and M.A.A. Cox.
Multidimensional Scaling.
Number 88 in Monographs on Statistics and Applied Probability. Chapman & Hall/CRC, 2nd edition, 2001.

91
C. Daub.
mi - Software for calculating mutual information using B-spline functions.
http://sonnhammer.cgb.ki.se/carsten/mi.html, Apr 2005.
http://www.biomedcentral.com/1471-2105/5/118/.

92
C.O. Daub, R. Steuer, J. Selbig, and S. Kloska.
Estimating mutual information using B-spline functions ? an improved similarity measure for analysing gene expression data.
BMC Bioinformatics, 5:118, 2004.

93
A.P. Dempster, N.M. Laird, and D.B. Rubin.
Maximum likelihood from incomplete data via the EM Algorithm.
Journal of the Royal Statistical Society, 39(1):1-38., 1977.

94
K. Deng and A.W. Moore.
Multiresolution instance-based learning.
In Proceedings of 12th International Joint Conference on Artificial Intelligence, San Fransisco, 1995. Morgan Kauffman.

95
S. Derriere, N. Gray, J. McDowell, R. Mann, F. Ochsenbein, P. Osuna, A.P. Martinez, G. Rixon, and R. Williams.
UCD in the IVOA context.
Astronomical Data Analysis Software and Systems XIII, 314:315-318, 2004.

96
M. Dettling.
http://stat.ethz.ch/~ dettling.
10-12-2005.

97
I.S. Dhillon, D.S. Modha, and W.S. Spangler.
Class visualization of high-dimensional data with applications.
Technical report, IBM Almaden Research Center, 650 Harry Road, San Jose, 1999.

98
S. dos Santos and K. Brodlie.
Gaining understanding of multivariate and multidimensional data through visualization.
Computers and Graphics, 28(3):311-318, June 2004.

99
D.J. Duke, K.W. Brodlie, and D.A. Duce.
Building an ontology of visualization.
In VIS '04: Proceedings of the conference on Visualization '04, page 598.7, Washington, DC, USA, 2004. IEEE Computer Society.

100
T. Duong.
An introduction to kernel density estimation.

http://web.maths.unsw.edu.au/~ tduong/seminars/intro2kde/, May 2001.
02-12-2005.

101
W.D. Dupont and W.D. Plummer Jr.
Density distribution sunflower plots.
Journal of Statistical Software, 8(3), 2003.

102
B. Efron.
Bootstrap methods: Another look at the jackknife.
The Annals of Statistics, 7:1-26, 1979.

103
B. Everitt.
An R and S-Plus Companion to Multivariate Analysis.
Texts in statistics. Springer-Verlag London Ltd, 2005.

104
B.S. Everitt, S. Landau, and M. Leese.
Cluster analysis.
Arnold / Oxford University Press, London/New York, 2001.

105
B. Falissard.
The psy package for R.
http://cran.r-project.org/doc/packages/psy.pdf, June 2005.

106
M. Finch and P.A. Howarth.
A comparison between two methods of controlling movement within a virtual environment.

http://www.lboro.ac.uk/departments/hu/groups/viserg/9606.htm.
10-10-2005.

107
F.Klawonn and R. Kruse.
Derivation of fuzzy classification rules from multidimensional data.
ftp://fuzzy.cs.uni-magdeburg.de/pub/papers/fuzz_class_rule.ps.gz, 1997?

108
I.K. Fodor.
A survey of dimension reduction techniques.
Technical Report UCRL-ID-148494, U.S. Department of Energy/Lawrence Livermore National Laboratory, May 2002.

109
C. Fraley and A.E. Raftery.
How many clusters? which clustering method? answers via model-based cluster analysis.
The Computer Journal, 41(8):578-588, 1998.

110
C. Fraley and A.E. Raftery.
Model-based clustering, discriminant analysis, and density estimation.
Journal of the American Statistical Association, 97(458):611-630, Jun 2002.

111
M. Friendly.
Gallery of data visualization: The best and worst of statistical graphics.

http://www.math.yorku.ca/SCS/Gallery/noframes.html, Dec 2002.
28-10-2005.

112
C. García-Osirio.

http://cis.paisley.ac.uk/fyfe-ci0/cgo/VIIP2004/GTnD.avi.
20-10-2005.

113
C. García-Osirio.
Data Mining and Visualization.
PhD thesis, University of Paisley, UK, 2005.

114
general.
Data Mining and Data Visualization, volume 24 of Handbook of Statistics.
Elsevier North-Holland, Amsterdam, 2005.

115
Silicon Genetics.
QT clustering.

http://www.silicongenetics.com/Support/GeneSpring/GSnotes/analysis_guides /qt_clustering.pdf.
19-11-2005.

116
N. Govindaraju, N. Raghuvanshi, and D. Manocha.
Fast and approximate stream mining of quantiles and frequencies using graphics processors.
In Proceedings of ACM SIGMOD 2005, 2005.

117
A. Gray and A. Moore.
N-Body Problems in statistical learning.
In T.K. Leen and T.G. Dietterich, editors, Advances in Neural Information Processing Systems. MIT Press, 2001.

118
A.G. Gray and A.W. Moore.
Nonparametric density estimation: Toward computational tractability.
In D. Barbará and C. Kamath, editors, Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, USA, May 2003.

119
F.M. Ham and I. Kostanic.
Partial least-squares: theoretical issues and engineering applications in signal processing.
Mathematical Problems in Engineering, 2(1):63-93, 1996.

120
J. Han, A. An, and N. Cercone.
CViz: An interactive visualization system for rule induction.
In AI '00: Proceedings of the 13th Biennial Conference of the Canadian Society on Computational Studies of Intelligence, volume 1822 of Lecture Notes In Computer Science, pages 214-226, London, UK, 2000. Springer-Verlag.

121
D.J. Hand and B.G. Batchelor.
Experiments on the edited condensed nearest neighbor rule.
Information Sciences, 14(3):171-180, 1978.

122
D.J. Hand, H. Mannila, and P. Smyth.
Principles of Data Mining (Adaptive Computation and Machine Learning).
Bradford Books, MIT Press, MA, 2001.

123
T. Hastie, R. Tibshirani, and J. Friedman.
The elements of statistical learning : data mining, inference, and prediction.
Springer, New York; London, 2001.

124
S. Haykin.
Neural Networks.
Macmillan College Publishing Company, Inc, New York, 1994.

125
J. Heer, S.K. Card, and J.A. Landay.
Prefuse: a toolkit for interactive information visualization.
Human Factors in Computing Systems (CHI 2005), 2005.

126
A.E. Hendrickson and P.O. White.
Promax: a quick method for rotation to orthogonal oblique structure.
British Journal of Statistical Psychology, 17:65?-70, 1964.

127
T. Hofmann, H. Wendler, and B. Froehlich.
The i-Disc: A tool to visualize and explore topic maps.
In K.W. Brodlie, D.J. Duke, and K.I. Joy, editors, Eurovis Data Visualisation 2005 (Eurographics - IEEE VGTC Symposium on Visualization), 2005.

128
M. Hopf and T. Ertl.
Hierarchical splatting of scattered data.
In Visualization (VIS 2003, IEEE), pages 433-440, Oct. 2003.

129
M. Hopf, M. Luttenberger, and T. Ertl.
Hierarchical splatting of scattered 4d data.
In Computer Graphics and Applications IEEE, volume 24, pages 64-72, July-Aug. 2004.

130
J.J. Hopfield.
Learning algorithms and probability distributions in feed-forward and feed-back networks.
In Proceedings of the National Academy of Sciences of the USA, volume 84, pages 8429-8433, 1987.

131
Y. Hyun.
Walrus - graph visualization tool.

http://www.caida.org/tools/visualization/walrus/.
10-10-2005.

132
A. Hyvarinen.
Algorithms for ICA.

http://www.cis.hut.fi/aapo/papers/NCS99web/node36.html, April 1999.
26-10-2005.

133
A. Inselberg.
N-dimensional graphics, part i - lines and hyperplanes.
Technical Report G320-2711, IBM LASC, IBM LA Scientific Center, 1981.

134
J.E. Jackson.
A User's Guide to Principal Components.
Wiley Series in Probability and Mathematical Statistics. Wiley-Interscience, 1991.
Reprinted 2003.

135
R.I. Jenrich and P.F. Sampson.
Rotation for simple loadings.
Psychometrika, 31:313-323, 1966.

136
N. Johnson.

http://www.comp.leeds.ac.uk/vision/proj/neilj/welcome.html.
8-12-2005.

137
N. Johnson.
Learning Object Behaviour Models.
PhD thesis, The University of Leeds, UK, 1998.

138
N. Johnson and D. Hogg.
Learning the distribution of object trajectories for event recognition.
In Proceedings of the British Machine Vision Conference, volume 2, pages 583-592, Sept 1995.

139
M.I. Jordan.
Advanced topics in learning & decision making: The kernel trick.

http://www.cs.berkeley.edu/~ jordan/courses/281B-spring04/lectures/lec3.pdf.
26-11-2005.

140
A.S. Kaplunovsky.
Factor analysis in environmental studies.
HAIT Journal of Science and Engineering B, 2(1-2):54-94, 2005.

141
N.-A. Khumbah and E.J. Wegman.
Data Compression by Geometric Quantization, pages 35-48.
Elsevier (North-Holland), 2003.

142
T. Kohonen.
The self-organizing map.
In Proceedings of the IEEE, volume 78, pages 1464-1480, Sept 1990.

143
J. Krüger, P. Kipfer, P. Kondratieva, and R. Westermann.
A particle system for interactive visualization of 3d flows.
In Visualization and Computer Graphics, IEEE Transactions on, volume 11, Nov.-Dec. 2005.

See http://wwwcg.in.tum.de/Research/Projects/VisContest05.

144
S. Lessels and R.A. Ruddle.
Changes in navigational behaviour produced by a wide field of view and a high fidelity visual scene.
In Proceedings of the 10th Eurographics Symposium on Virtual Environments (EGVE'04), pages 71-78. Eurographics Association, 2004.

145
R.J.A. Little and D.B. Rubin.
Statistical analysis with missing data.
Wiley, New York; Chichester, 2002.

146
G. Longo.
Astroneural.

http://people.na.infn.it/~ astroneural/.
27-09-2005.

147
B.F.J. Manly.
Multivariate Statistical Methods: A Primer.
Chapman & Hall/CRC, 3rd edition, 2005.

148
R.G. Mann.
AstroGrid: the UK's Virtual Observatory initiative.
Astronomical Data Analysis Software and Systems XI, 281:3-6, 2002.

149
D.J. Marchette, E.J. Wegman, and C.E. Priebe.
Data Mining and Data Visualization, volume 24 of Handbook of Statistics, chapter 12 (Fast Algorithms for Classification Using Class Cover Catch Digraphs), pages 331-358.
Elsevier North-Holland, Amsterdam, 2005.

150
K.V. Mardia, J.T. Kent, and J.M. Bibby.
Multivariate Analysis.
Probability and Mathematical Statistics. Academic Press, London, 1979.

151
J. Marner.
Computational geometry: An interactive demo showing the construction of 2-dimensional kd-trees.

http://www.rolemaker.dk/nonRoleMaker/uni/algogem/kdtree.htm, May 2001.

152
R. Massey, A. Refregier, C.J. Conselice, and D.J. Bacon.
Image simulation with shapelets.
MON.NOT.ROY.ASTRON.SOC., 348:214, 2004.

153
MathWorks.
Non-classical multidimensional scaling.

http://www.mathworks.com/products/demos/statistics/mdscaledemo.html#5.
25-11-2005.

154
B.H. McCormick, T.A. DeFanti, and M.D. Brown.
Visualization in scientific computing.
Computer Graphics, 21(6):15-21, 1988.

155
A. Moore.
Very fast EM-based mixture model clustering using multiresolution kd-trees.
In M. Kearns and D. Cohn, editors, Advances in Neural Information Processing Systems, pages 543-549, 340 Pine Street, 6th Fl., San Francisco, CA 94104, April 1999. Morgan Kaufman.

156
Andrew W. Moore.
The anchors hierarchy: Using the triangle inequality to survive high dimensional data.
In UAI '00: Proceedings of the 16th Conference in Uncertainty in Artificial Intelligence, pages 397-405, Stanford University, Stanford, California, USA, June 2000.

157
A.W. Moore.
Efficient Memory based Learning for Robot Control.
PhD thesis, Carnegie Mellon University, 1991.

158
A.W. Moore.
A tutorial on kd-trees.
Technical Report 209, Computer Laboratory, University of Cambridge, 1991.
Extract from PhD thesis.

159
S. Mukherjee, E.D. Feigelson, G.J. Babu, F. Murtagh, C. Fraley, and A. Raftery.
Three types of gamma-ray bursts.
Astrophysical Journal, (astro-ph/9802085):1-24, 1998.

160
T. Munzner.
Exploring large graphs in 3D hyperbolic space.
In IEEE Computer Graphics and Applications, volume 18, pages 18-23, July/August 1998.

161
T. Munzner, F. Guimbretiere, S. Tasiran, L. Zhang, and Y. Zhou.
Treejuxtaposer: Scalable tree comparison using focus+context with guaranteed visibility.
In ACM Transactions on Graphics: (SIGGRAPH 2003), volume 22, pages 453-462, 2003.

162
F. Murtagh, J-L. Starck, and M.W. Berry.
Overcoming the curse of dimensionality in clustering by means of the wavelet transform.
The Computer Journal, 43(2):107-120, 2000.

163
A.K. Nag, A. Mitra, and S. Mitra.
Multiple outlier detection in multivariate data using self-organizing maps.
Computational Statistics, 20:245-264, 2005.

164
R Development Core Team.
R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria, 2004.
ISBN 3-900051-00-3.

165
A. Refregier.
Shapelets: I. a method for image analysis.
MON.NOT.ROY.ASTRON.SOC., 338:35, 2003.

166
B. Ripley, D. Bates, and S. DebRoy.
R data import/export.
http://cran.r-project.org/doc/manuals/R-data.html.
18-12-2005.

167
C.C. Rodríguez.
The kernel trick.
http://omega.albany.edu:8008/machine-learning-dir/notes-dir/ker1/ker1.pdf, October 2004.

168
C. Röver, N. Raabe, K. Luebke, and U. Ligges.
klaR: A package including various classification tools.
http://www.ci.tuwien.ac.at/Conferences/useR-2004/abstracts/Roever+Raabe+Luebke+Ligges.pdf.
15-11-2005.

169
E.A. Rundensteiner, M.O. Ward, J. Yang, and P.R. Doshi.
XmdvTool: Visual interactive data exploration and trend discovery of high-dimensional data sets.
In ACM SIGMOD CONFERENCE 2002, Wisconsin, June 2002.

170
R. Saunders.
Self-organising map.

http://www.arch.usyd.edu.au/~ rob/java/applets/neuro/SelfOrganisingMapDemo.html.
15-11-2005.

171
J. Schneider.
The extrasolar planets encyclopaedia.
http://vo.obspm.fr/exoplanetes/encyclo/index.php, 2005.
CNRS - Paris Observatory (last updated 22-11-2005).

172
W. Schroeder, K. Martin, and W. Lorensen.
The Visualization Toolkit An Object-Oriented Approach To 3D Graphics.
Kitware, Inc., 2005.

173
D.W. Scott.
Multivariate Density Estimation: Theory, Practice, Visualization.
Wiley Series in Probability and Mathematical Statistics. Wiley Interscience, 1992.

174
J. Seo and B. Shneiderman.
Knowledge discovery in high dimensional data: Case studies and a user survey for an information visualization tool.
IEEE Transactions on Visualization and Computer Graphics, 2005.

175
J. Seo and B. Shneiderman.
A rank-by-feature framework for interactive exploration of multidimensional data.
Information Visualization, 4(2):99-113, 2005.

176
J. Shi and J. Malik.
Normalized cuts and image segmentationpattern analysis and machine intelligence,.
In IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 22, pages 888-905, Aug 2000.

177
B.W. Silverman.
Density Estimation for Statistics and Data Analysis.
Chapman & Hall, London & New York, 1986.

178
L. Smith.
An introduction to neural networks.
http://www.cs.stir.ac.uk/~ lss/NNIntro/InvSlides.html.
20-10-2005.

179
M. Sonka, V. Hlavac, and R. Boyle.
Image processing, analysis, and machine vision.
Pacific Grove, CA ; London, 2nd edition, 1999.

180
I. Spence and S. Lewandowsky.
Robust multidimensional scaling.
Psychometrika, 54:501-513, 1989.

181
A. Staiano, A. Ciaramella, L. De Vinco, G. Longo, G. Raiconi, R. Tagliaferri, R. Amato, C. Del Mondo, G. Mangano, and G. Miele.
Visualization, clustering and classification of multidimensional astronomical data.
In Computer Architecture for Machine Perception, 2005. CAMP 2005, Proceedings, pages 141-146, July 2005.

182
A. Staiano, L. De Vinco, A. Ciaramella, G. Raiconi, R. Tagliaferri, R. Amato, G. Longo, C. Donalek, G. Miele, and D. Di Bernardo.
Probabilistic principal surfaces for yeast gene microarray data mining.
In Data Mining (ICDM 2004). Proceedings of Fourth IEEE International Conference, pages 202-208, Nov 2004.

183
StatSoft, Inc.
Statsoft: electronic textbook.

http://www.statsoft.com/textbook/stathome.html.
21-09-2005.

184
C.D. Sutton.
Data Mining and Data Visualization, volume 24 of Handbook of Statistics, chapter 11 (Classification and Regression Trees, Bagging and Boosting), pages 303-329.
Elsevier North-Holland, Amsterdam, 2005.

185
D. F. Swayne, D. Cook, and A. Buja.
XGobi: Interactive dynamic data visualization in the X window system.
Journal of Computational and Graphical Statistics, 7(1), 1998.

186
J.D. Taylor.
VO-enabled version of xmdv-lite.
http://software.astrogrid.org/votech/ds6/xmdv/.
31-02-2006.

187
R Development Core Team.
R data import/export.
http://cran.r-project.org/doc/manuals/R-data.pdf, June 2005.
23-09-2005.

188
S. Theodoridis and K. Koutroumbas.
Pattern Recognition.
Academic Press, San Diego & London, 1999.

189
T. Therneau and E. Atkinson.
Mayo clinic division of biostatistics.

http://mayoresearch.mayo.edu/mayo/research/biostat/splusfunctions.cfm.
12-10-2005.

190
T.M. Therneau and E.J. Atkinson.
An introduction to recursive partitioning using the rpart routines.
Technical Report 61, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota, 1997.

191
J.J. Thomas and K.A. Cook, editors.
Illuminating the Path: The Research and Development Agenda for Visual Analytics.
IEEE Press, National Visualization and Analytics Center, 2005.

192
E.R. Tufte.
PowerPoint does rocket science.

http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id =0001yB&topic_id=1.
06-10-2005.

193
E.R. Tufte.
Sparklines: theory and practice.

http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id =0001OR&topic_id=1.
06-10-2005.

194
E.R. Tufte.
Visual Explanations: Images and Quantities, Evidence and Narrative, chapter 2: Visual and Statistical Thinking: Displays of Evidence for Making Decisions.
Graphics Press, 1997.

195
E.R. Tufte.
Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press, Cheshire, Connecticut, 1997.

196
E.R. Tufte.
The Visual Display of Quantitative Information.
Graphics Press, Cheshire, Connecticut, 2nd edition, 2001.

197
J.W. Tukey.
Exploratory Data Analysis.
Addison-Wesley, Reading, Mass, 1974.

198
S. Urbanek.
RServe.
http://stats.math.uni-augsburg.de/Rserve/.
31-01-2006.

199
M. Valle.
http://www.cscs.ch/~ mvalle/visualization/tools.html.
03-10-2005.

200
J.J. van Wijk.
The value of visualization.
In C. Silva, E. Groeller, and H. Rushmeier, editors, Proceedings of IEEE Visualization 2005, 2005.

201
J.J. Verbeek.
Mixture Models for Clustering and Dimension Reduction.
PhD thesis, Universiteit van Amsterdam/ASCI graduate school, 2004.

202
J. Wang, J. Lee, and C. Zhang.
Kernel trick embedded gaussian mixture models.
In The 14th International Conference on Algorithmic Learning Theory (ALT2003), volume 2842 of Lecture Notes in Artificial Intelligence, pages 159-174, Hokkaido University, Sapporo, Japan, 2003.

203
M. Ward.
A taxonomy of glyph placement strategies for multidimensional data visualization.

http://davis.wpi.edu/%7Exmdv/docs/jinfovis02_glyphpos.pdf.
03-10-2005.

204
G.R. Warnes.
HYDRA: a java library for markov chain monte carlo.
Journal of Statistical Software, 7(4), 2002.

205
E.J. Wegman.
Hyperdimensional data analysis using parallel coordinates.
Journal of the American Statistical Association, 85(411):664-675, Sept 1990.

206
E.J. Wegman and J.L. Solka.
Data Mining and Data Visualization, volume 24 of Handbook of Statistics, chapter 1 (Statistical Data Mining), pages 1-46.
Elsevier North-Holland, Amsterdam, 2005.

207
S. Weisberg.
Dimension reduction regression in R.
Journal of Statistical Software, 7(1), 2002.

208
J. Welling and M. Derthick.
Visualization of large multi-dimensional datasets.
In R.J. Brunner, S.G. Djorgovski, and A. Szalay., editors, Proceedings of Virtual Observatories of the Future, number arXiv:astro-ph/0008186, 2000.

http://www.cs.cmu.edu/~ sage/Papers/wellingj.pdf.

209
I.H. Witton and E. Frank.
Data mining : practical machine learning tools and techniques with Java implementations.
Morgan Kaufmann, 1999.

210
C-H. Wu and S-J. Horng.
Fast and scalable selection algorithms with applications to median filtering.
IEEE Transactions on Parallel and Distributed Systems, 14(10):983-992, October 2003.

211
J. Yang, W. Peng, M.O. Ward, and E.A. Rundensteiner.
Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets.
IEEE Symposium on Information Visualization (InfoVis 2003), pages 105-112, October 2003.

212
J. Yang, M.O. Ward, and E.A. Rundensteiner.
Interactive hierarchical displays: A general framework for visualization and exploration of large multivariate data sets.
Computers and Graphics Journal, 27:265-283, 2002.

213
J. Yang, M.O. Ward, and E.A. Rundensteiner.
Interring: An interactive tool for visually navigating and manipulating hierarchical structures.
IEEE Symposium on Information Visualization 2002 (InfoVis 2002), pages 77-84, 2002.

214
F. Young.
http://forrest.psych.unc.edu/.
10-10-2005.

About this document ...

Dimension Reduction Algorithms
for
Data Mining and Visualization

This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)

Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999, Ross Moore, Mathematics Department, Macquarie University, Sydney.

The command line arguments were:
latex2html -no_subdir -split 0 -show_section_numbers /tmp/lyx_tmpdir27699TVm770/lyx_tmpbuf1/alg1.tex

The translation was initiated by richard holbrey on 2006-02-01


Footnotes

... limited1
And even controversial - see the `liefactor' gallery at [111].
...statsoft.com2
Some data miners may find this distinction to be somewhat crude. Hand et al, for example, give the definition: ``Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner [122].''
...Wijk20053
Carl Sagan once said: Humans are good at discerning subtle patterns that are really there, but equally so at imagining them when they are altogether absent (quoted in [104]).
...Cleveland19884
A scale which is too small loses detail; too large a scale, and the eye loses sensitivity to the rate of change.
...R-stats5
Lowess is a version of a locally weighted scatterplot smoothing technique. Each smoothed value is determined by a linear polynomial taking into account the values of data within a particular span of values of the criterion variable, but giving most weight to the central value of the span, less and less weight to more distant values, and zero weight to values outside the span .
... cladistics6
A branch of biology that determines the evolutionary relationships between living things based on derived similarity.
... formula7
Kruskal defined this as STRESS 1.
... data8
Over 50% of the data will, on average, fall at points where the density is less than a hundredth of the maximum value.

next_inactive up previous
richard holbrey
2006-02-01