Richard Holbrey (VOTech/DS6)
The need for visual insight into data has been understood for nearly a century, but many years the field only progressed within the realm of statistics (see eg. Tukey [197]). Attempts to popularise the subject more generally were somewhat limited1 and developments in data mining and visualization owed little to this work: data mining researchers were spurred on by the advance of database technology and the creation of huge commercial datasets in early 1990s [209] (see 1.2); the impetus for a more dedicated visualization effort came in 1987, with the publication of the NSF Report [154], which recognised the need to render the huge datasets resulting from engineering advances in fluid dynamics, brain scanning etc.
Since that time, researchers have expended a good deal of effort developing a series of powerful, interactive applications. Indeed, some analysts are now able to consider the fidelity of their visualizations by direct comparison with the original data [77]. Recently, however, the literature reveals a slightly more introspective mood [200]. The growth of `visual analytics', around questions of homeland security, has emphasised the danger of information overload and the need to harness new technology, such as high resolution and stereo monitors, in order to maintain an overview of the data [191]. Other researchers have stressed the need to grasp the semantics of visualization, and for the creation of ontologies, to ease communication between scientists using, for example, the e-Science grid [99].
Despite these advances, the problem of finding suitable methods for the exploration and analysis of large astronomical datasets remains formidable [148,208]. This document was instigated at the request of members of the VOTech DS6 team in August 2005 with the aim of providing a synthesis of relevant data mining and visualization algorithms, towards Stage 2 of the project [38]. The emphasis is on methods of bridging the gap between accurate representations of the data and the capabilities of current technology and users (see 1.2, below). Methods of dimension reduction of data are therefore prominent. Some mathematical background is given in this area (especially with respect to eigenanalysis) although readers are generally referred to other texts: supporting references have been drawn from Jackson [134], Manly [147], Rao et al [206] and Hastie et al [123].
Data mining (DM) involves the retrospective analysis of large bodies of data, with the intention of recovering new facets which were hitherto unknown. Statisticians, however, remain divided over the need for the formulation of a priori hypotheses. On one hand, researchers have welcomed the chance to try new methods upon existing data sets without such constrictions. On the other, this has meant the abandonment of the probabilistic roots of the profession. In particular, it is true that data which has not been generated by deliberate random sampling cannot be generalised to the population at large. Nevertheless, the sheer size of some data sets indicates a measure of validity.
In this light, Wegman and Solka have defined a number of feasibility
limits for data mining based on current processing power, the cost
of storage and streaming capabilities [206]. For
example, an algorithm was considered feasible if, given the size of
the data set and the complexity of the algorithm applied, it could
be completed in less than a week. At the time of writing (2004), this
was thought to be achievable for most
type problems
without need for special computing resources. For
algorithms, eg. clustering, this limit restricted the size of data
table which could be processed to around
entries ie. `medium'
sized data tables, see Table 1.
|
|
A defining characteristic of DM is that data sets are often so large that -- without special hardware -- it may be impossible to load the complete dataset into memory at once (see [36]). It follows that the objective of DM searches is usually more exploratory than experimental ie. performed to generate hypotheses, rather than to test them. Also, it suggests that methods are needed initially to process the data as a stream, perhaps with some loss of precision, and that, later, a subset of the data will need to be selected.
A further distinction is sometimes made between exploratory data analysis (EDA) and DM. EDA makes use of various techniques designed specifically to identify patterns in multivariate data (or univariate sequences), such as factor analysis, clustering and so on (see Section 3). But the focus of DM is on producing a solution that can generate useful predictions. DM researchers, therefore, have a much greater acceptance of `black box' approaches, such as neural networks. Whilst these are certainly capable of generating valid predictions, it is not possible to identify the nature of the interrelations between the variables on which the predictions are based [183]2.
Scott also points out that exploratory analyses should ideally be non-parametric, despite the fact that assumptions about normality are common. Many parametric techniques are reasonably robust to deviations from normality, but for exploratory work the results can occasionally be very unpredictable [173]. This point has recently been echoed by Gray and Moore [118], who agree that non-parametric methods ``are generally the method of choice in exploratory data analyses...[but] often come at the heaviest computational cost.'' This issue is discussed further in Section 5.3.
The role of graphics in data analysis has long been recognized as important, if not universally. In the 1930s, Fisher wrote: ``Diagrams prove nothing, but bring outstanding features readily to the eye; they are...no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them'' [173]. Despite this realization, however, it was not until the 1960s that Tukey finally made them respectable [196, p.53]; and not until 1987 that visualization was recognized as separate discipline with distinct aims.
Since that time, a good deal of effort has been expended to develop a number of powerful, highly interactive applications: Iris Explorer [1], Amira [2], Spotfire [57], Visage [208] etc). Humans are particularly good at detecting certain patterns and interactivity-based approaches are aimed at engaging our senses in order to aid our cognitive abilities. The hope is, in effect, to let the user `see' something new and from this perspective, interactivity is vital [66,64,175]. Against this view, however, various costs must be set:
The approach adopted here is essentially borne of practical necessity.
Queries of astronomical catalogues generally result in large data
tables which may be hundreds of columns wide (representing the number
of measured variables) and contain
-
rows (the
number of observations), placing these data sets easily within Wegman
and Solka's medium to large categories (
table entries,
see above). In preliminary work up to this point, it was noted that
some software tools -- perhaps the majority? -- work reasonably
well with data sets of the order of MB (`small' to `medium' sized),
but struggle with data tables of larger sizes for two main reasons:
It was thought convenient to treat the task of dimension reduction as two separate issues relating to (a) column reduction, and (b), row reduction. These topics are introduced in Sections 3.2 and 3.3. Readers who are already familiar with this territory are referred to the more succinct report by Fodor [108]. The treatment here is rather more in depth and a number of themes are apparent. The first, and probably the strongest, is that of the analysis of covariance leading to the techniques of principal components and factor analysis (amongst others). Indeed, an understanding of the difficulties and differences between these techniques is useful towards the understanding of the subject (multivariate analysis) as a whole. For this reason, these techniques are described in some detail in Section 4.
Another important theme is that of adopting a cyclic approach to data mining. A misleading term, which occurs in the literature, is knowledge discovery. The implication is that, after some processing step, domain experts should able to interpret newly `mined' data with much greater clarity than was previously possible. Wegman and Solka point out that this only really becomes possible if a considerable amount of effort is expended in cleaning up the data first. Any analysis must therefore be considered as part of a general data flow. Fig. 1 (adapted from [206, fig. 1]) illustrates this flow as a cycle to emphasise that each step in the process should be seen as one of data refinement, rather than as an immediate endpoint and visualization is no exception
. Performing a principal components analysis, for example, may not produce immediately usable results (since the axes are difficult to interpret) but could be used to maximise the linear separation of clusters (for visualization) or to reduce the set of variables under study (selection criteria).
The development of astronomical instruments and procedures for data gathering and storage is beyond the scope of this report. Nevertheless, it is useful to say something about the level of processing which can normally be expected. It is common, for example, for data to be calibrated depending upon the instrument used before the data is stored, but additional steps may be required to achieve the highest possible levels of precision. A number of software tools are able to assist here, such as the Aips++ library [32], and others provide facilities for immediate analysis, notably IDL [47], Aladin [4] and Xmatch [65,5]. Also, there are standardisation issues to do with the adoption of various naming schemes, units and so on, although it is hoped that the creation of the IVOA UCD and VOTable standards [6] may make this aspect easier to handle in future.
One of the most obvious features of Fig. 1 is that curation and preparation of the data forms the bulk of the work. In particular, care is needed to remove errors (noise, missing values etc) and to select appropriate variates for analysis. This may require domain expertise, since it may be necessary to:-
Given the number of different but relevant domains upon which this research might touch, the scope of the present report is restricted to:-
Visualization is a complex research area which has developed over the past few decades by building on theory from information design, computer graphics, human-computer interaction and cognitive science. In view of this conglomeration of ideas, the title (`limits of visual thinking') is intended to reflect that the development of computing algorithms must be matched by an understanding of the psychology of the user. In other words, it is not sufficient for researchers to demonstrate a given algorithmic complexity to ensure a successful visualization, since little is achieved if the display hinders rather than helps. However, rigorous user testing of visualization software is both difficult and rare [85]. In order to get some appreciation for the current limits of hardware and software, therefore, this section gives a brief survey of available tools but begins by revisiting some of the guidelines for good graphic design set out by Tufte (see also [42]) and Bertin [75].
Tufte has recently published a harsh critique of NASA engineers' usage of MicroSoft PowerPoint© instead of producing formal written reports prior to the Columbia disaster in 2003 [192]. This is a new variation of an old complaint: in the past, many authors have often exempted themselves from working with the data at hand and allowed new formats to obscure the end conclusions. Inappropriate scaling and ordering of data are common symptoms (see 2.2.4). Ultimately though, this is an issue of integrity and authors should be encouraged to think through, prepare and attach their own names to their own work [196,195, ch. 2].
Bertin observed that tabulating data is an essential part of constructing a well-defined problem, and may require a good deal of imagination. But a homogeneous table is unlikely to be effective in capturing the eye. The next stage of the process, therefore, involves graphic illustration, but this should leave open the question of the organisation of the table. Indeed it is unlikely that the first effort will be optimal. Exposing contrasts in the data involves some skill at the removal of unnecessary data-ink, juxtaposition and scaling [87]4. This may be automated to some extent, since it is partly based on patterns of sectors in the data or graphic, but still requires (re-)organisation in line with semantic reasoning [75].
Similarly, Tufte has argued that presenting similar graphics side-by-side can enhance the visualization by suggesting motion between the images, as if in animation. This motion indicates greater dimensionality and, hence, helps to reinforce the differences between each `frame' [195, ch. 6]. It is also one of the guiding principles behind the scattergram displays (see 2.4.2) and those of Andrew's Curves (see 2.4.5). Tufte has also proposed that `sparklines' can give intense word-sized contrasts, as shown in Fig. 2, which may be still more effective when used in multiples [193].
A common pitfall of new graphic designs (and 2D/3D computer displays) is to emphasise line-art and to overlook supporting orientation or scaling data. Tufte describes the effect as one of proliferating `chartjunk'. Often, the added weight of this detail tends to impose on the graphic and masks some of the meaning. Two quantities describe this problem:
In [196], Tufte advocates the removal of all non-data ink and any redundant ink. The axes of a scatterplot, for example (see 2.4.2), might be trimmed to show only the range of the data or to show median and quartiles. In this way, the data-ink ratio is maximized and the possibility for gratuitous self-decoration and moiré effects is removed. At the same time, thinned or greyed-out lines and varied line thickness can be used to great effect, if, say, a background grid is really needed.
In computer graphics, it might be supposed that the equivalent to the data-ink ratio would be data pixels (pixels representing data elements) to non-background window pixels; the data density might be data pixels to window pixels. Although providing useful concepts, these quantities may not be entirely appropriate, however, for electronic displays (see below).
Whilst data may not always contain cause and effect type relationships, the organisation of data plots is critical if such relationships are to be exposed. An oft-quoted example is that of the problem of `O'-ring failures on the Challenger space shuttle in 1986. NASA engineers went to some lengths to describe the risk prior to the launch, by using plots of rocket-shaped symbols with various kinds of cross-hatching to indicate particular failures. The rocket symbols, however, contributed nothing to the readability of the graphic and since they were ordered by the testing sequence, the explanatory variable (temperature) was completely obscured.
Tufte's reordering the plot, so that the symbols appear at the correct location along the temperature axis, makes the decision to launch looks a good deal riskier [195, p.49]; by removing the symbols and using a weighted metric to describe an overall damage statistic, this relationship is very much clearer [195, p.45]. More generally in scientific testing, it is necessary to employ experimental controls to provide a baseline for comparison. This baseline then permits us to relate variations in cause with variations in effect. Without such controls, we must assume that there may be many possible causes [195, p. 50-52].
Although some graphical techniques are useful for standardisation, it is useful to consider some of the basic requirements for cleaning data before proceeding. In the first place, the question of missing data needs to be tackled. Many software packages simply ignore data units with missing entries and this seems acceptable where there is sufficient data to do so. Often, however, this may not be possible or may even be unwise if there is some underlying cause which demands further explanation [104,147] (also note the earlier comments about keeping data in Section 1.3.2).
Scatterplots (see below) are undeniably useful for initial assessment, and it would be unusual today if this were not one of the analyst's first steps. Apart from gaining insight into the distribution, the nature and influence of outliers needs to be gauged. One method is to plot regression lines over the scatterplot, which may be locally weighted if the distribution seems uneven (see Fig. 4 and Appendix A.1 for a description of the data). Finding and possibly removing the `boundary' or convex hull of the data or is one way of assessing the influence outliers, see Fig. 3.
![]()
![]()
|
In these plots, outliers do not appear to be particularly distant
and keeping or removing the boundary points does not affect the adjacent
chi-plots. The basic principle of these is to assess the independence
of variables using the expected
-distribution, where dependency
is indicated by deviation from the central band running horizontally
across the centre of each plot. In this case, the lower plots show
a greater level of structure (ie. less independence), since relatively
few points lie in this zone [103]. Note that the presence
of two supposed groups may be more rigorously tested by the use of
canonical variate analysis (see 5.6.1).
The detection of outliers is discussed with reference to self-organised
maps in Section 4.6.
A common requirement before proceeding with many forms of analysis
is to scale the data, so that the range or units of the data do not
figure. Conversion to Z-score (autoscaling) is probably the
most common technique, so that the distribution has unit variance,
although division by the sample range has been shown to be more robust
in many clustering applications [104]. Principal components
analysis and some clustering techniques provide ready made methods
of normalization which are discussed later. For the multivariate case,
autoscaling can be achieved by employing the covariance matrix,
, where the transformation is:
Given the high dimensionality of most astronomical datasets, reduction to 2/3D for visualization purposes seems unavoidable. Also, the preference for many researchers is naturally for 2D graphics, so that (amongst other things) publication drawings can be considered. Various techniques for plotting in 2D are considered in this section.
Where the dimensionality gets very high, however, several problems
arise with reduction to 2D [173,205]. If we consider
the hypercube
, the
diagonals,
,
from the origin to the corners are of the form
.
The angle between the diagonals and the Euclidean axes,
,
is given by
, where:
|
In higher dimensions, the volume of hyperspace also appears distorted,
so that the distribution of data is also counter-intuitive. Since
the area of a circle can be written as
and the volume
of sphere as
, we can generalise the volume of a hypersphere
as
. It is easy to show that the volume of a thin hypershell
relative to that of the hypersphere approaches 1 as
increases.
In other words, most of the volume occurs within the shell. The result
is that if the data were uniformly distributed within the hypersphere,
most data would, in fact, lie within the shell. However, by projecting
the data onto a 2D plane, most of the data will appear to lie near
the origin of a circle, thus giving an unrealistic impression [205].
A typical scattergram (ie. a collection of scatterplots) is shown in Fig. 4, showing some of the SuperCOSMOS data set as depicted in R, with colour used as an additional `dimension' to reflect different classes in the data (here red implies `star', green a `galaxy'). Due to the linear nature of the plotting algorithm, the number of points which can be plotted is almost unbounded (but see below). There is a good deal of redundancy in the display (a poor data-ink ratio) since each smaller graphic is plotted twice, although statisticians seem to prefer this, so there is no particular bias to `x' or `y' axes. Some applications can make use of the plot windows on the diagonal eg. to plot histograms, as is the case here. The blue curvilinear feature in the scatterplots is a locally weighted regression (LOWESS) curve, after Cleveland [164]5.
Nevertheless, a practical limit of 10-15 variates exists for most displays. The use of `coplots' in R or dimensional stacking in XmdvTool [211] can make some relationships more explicit for, say, up to 5 variables.
Scott has observed that where the number of observations is large,
the areas of greatest density lose visual impact and the eye is drawn
to outlying data in the tails of the distribution [173].
One response has been the development of sunflower plots, which make
use of hexagonal bins to form a 2D histogram [101],
Fig. 5. Individual observations are
plotted when there is only
observation per bin as in a conventional
scatter plot. Each bin with from
observations contains a light
sunflower. Other bins contain a dark sunflower. In a light sunflower
each petal represents one observation. In a dark sunflower, each petal
represents
observations. (A dark sunflower with
petals represents
between
and
observations.)
Described by Inselberg over two decades ago [133], many tools are now capable of rendering parallel coordinates. The basic premise is very simple: instead of plotting data by using orthogonal axes, the axes are maintained parallel and the data plotted as lines across the axes. In this space, therefore, a point is mapped to a continuous line (and vice versa). This has the effect that the amount of data-ink increases relative to the background, but may obscure the display. Also, since it is common for the axes to be automatically scaled to the maximum and minimum for each variable, it is usually difficult to assess directly the correlation between any pair (cf. [205]).
Examples using the XmdvTool display are shown in Fig. 6, using a subset of 20 variables from the SuperCOSMOS dataset (1000 rows) [7].
(a)
![]()
(b)
![]()
|
A further problem is apparent when the number of variables increases (beyond, say, 15-20) in that it is difficult to separate the axes or associate axes with the respective variables. In fact, as the distance between axes decreases, the display becomes saturated with `data-ink' and any visible pattern is lost. An interesting response to the saturation problem was proposed by Lederman and implemented in Parvis [41]. Here, the author has created a fading brush technique and is also able to highlight single strands of data to assist the user. As a more innovative use of the display, histograms can also be plotted against the axes to indicate the basic data distribution.
A classic example of the use of glyphs was that of Chernoff faces, Fig. 7.
The assumption behind this approach is that the human eye is particularly sensitive to facial expressions, so that by mapping variables to particular features (eyebrow height, mouth curvature etc), we may be able to distinguish characteristic elements of the data more easily. Many other types of glyph, including stars, weathervanes and profiles, have been proposed. Ward has observed that the success of these is highly dependent upon colour and glyph placement [203], since they may be discrete or overplotted as required.Tufte has observed that despite the quixotic character of Chernoff faces, `ink' may saved by halving the faces, since they are usually symmetrical [195]. One problem which frequently occurs, however, is that of representing the key in glyph plots, eg. see Fig. 8. The use of text in the key creates a fairly large mental load and has the effect of distancing the user, even for this small data set.
![]()
|
An interesting implementation of the `juxtaposition' principle (2.2.2)
is that of Andrew's curves [70], in which each observation
is represented by a combination of the (scaled) variates,
,
into a sine wave of the form :
Several tools provide a greatly enhanced level of interactivity by combining displays and allowing highlights to be broadcast from one to the other, eg. Mirage [52]. One of the first tools to make use of so-called `brushing' of data between displays was Arc (itself based on XlispStat) which - even to this day - remains one of the few tools capable of generating regression planes to describe data [89]. XGobi [185] and XGVis [81] also adopted the brushing technique, adding various methods to display data in parallel coordinate, 3D rotation and making use of tours (animations) particularly with in conjunction with dimensional scaling (see 4.6 below, CViz [120] and HiSee [45]). Another valuable property of these systems is that data can can be excluded to examine the effect of outliers.
Several commercial systems have now been developed in this idiom, such as Spotfire [57]. One system which deserves further mention, however, is Hierarchical Clustering Explorer (HCE) [44], see Fig. 9.
A unique feature of this tool is that 1D and 2D histograms are used to suggest interesting distributions (and hence variates) to the user (but note 3.2.2). However, the use of a clustering algorithm as data are input forbids the processing of `medium' or larger data tables (see Table 1). This issue also affects XmdvTool when using the InterRing [213] or clustering data selection tools.
In cladistics6, trees are common tools for representing the distance between species graphically. To represent such trees within limited computer displays, however, various forms of non-linear presentation have been adopted using `backbone' spanning trees and focus/context motifs, such as the i-Disc [127], or hyperbolic forms, such as HypViewer, see Fig. 10, which shows the current UCD1+ hierarchy [160].
This system was capable of displaying some
Graph-based techniques have also been adopted within the visual programming environments of IRIS Explorer [1] and OpenDX [14] (see also ViSta [214]). Here, customised applications exist as modules of code which can be connected together graphically, so that data appears to flow around the system. For astronomy, VisIVO, which also employs a modular library approach (using VTK [60,172]), is currently being adapted for virtual observatory use [72]. A natural tendency of these systems has been towards the display of 3D, spatial data, making use of hardware rendering.
These tools have been particularly successful at visualizing scalar data and flow problems (see eg. [143] for more recent developments). For more general multidimensional data, there are several challenges. Figure 11 shows an example from VTK using `splatting' with the quasar data [59]. This technique aggregates the data by overlapping areas or volumes associated with each data point (here the u, g and r variables are described by the three axes; and i by the attributed colour scalar.
Higher dimensional (say 5D or 6D) data sets can also be rendered if glyphs, animation or contours are also be employed, although issues of occlusion and human perception (which varies particularly for colour) begin to take precedence [203]. However, the technique of hierarchical `splatting' (see Fig. 12) may offer faster rendering of very large tables of unstructured data [129].Working in 3D (and higher) therefore requires that a number of issues must be faced:-
It has already been observed that only the least complex (eg.
)
schemes are able to cope with `large' data sets as proposed. For this
reason, the simpler schemes here are described more thoroughly here
than might have been expected. Although this list is not exhaustive,
numerous web-pages exist for comparisons elsewhere [39,16,17,18,199].
In addition, it is convenient here to make a number of further observations:
The `curse of dimensionality' (following Bellman (1961) [20]) refers to the exponential growth of hypervolume as a function of dimensionality. In most cases, this issue is manifested as one of data distribution, which is usually very uneven within the hyperspace. Hence data may be abundant in some regions, but inadequate in others, especially if statistical inferences need to be drawn.
For astronomy, however, it is recognized that some variables are of greater interest than others. In particular, error terms, although important, are generally of lesser interest than their respective observational parameters. Hence there may be a hierarchy of data terms, which may be different for each researcher. The proposed UCD1+ terminology also suggests a data hierarchy [95,49] which might be exploited. It was proposed to examine this issue further by investigating current tree-building tools, and developing, if necessary, some new method of configurconfiguringating the display of data (see Section 6.4). The remainder of this section considers methods of table reduction, by column and then row.
The main column (ie. variate) reduction methods are summarised in Fig. 14.
A brief description of the elements in this figure is given below.
These techniques aim to give focus to particular units of data (often called `individuals') and, in this sense, provide a narrowing of the extent of the data. For example, the focus might be used as an input to select rows from a table. The techniques are summarised in Fig. 15 and below.
The techniques of factor analysis, principal components etc have developed over the course of the 20th Century, fuelled by research in fields as diverse as psychology and ballistics. One of the original aims was to extract the line or plane of maximal variance. Such a line, plane or hyperplane -- known as the principal component -- is actually one of regression about which the orthogonal error is minimised, see Fig. 16.
![]()
|
The advantage of this kind of error minimisation is that it allows any one variable to `predict' another (within certain tolerances) and these are then described as predictor and response variables. Also, if this axis is adopted as the main component of interest (typically, by transforming the data to this axis), then the remaining variance (orthogonal to this plane) appears to be procedural in origin. Moreover, the variables under this transformation are now said to be uncorrelated ie. it is no longer true that one variable may be used to predict the other. Only after this step can the variables be considered to be independent and hence this term is avoided in multivariate analysis.
Factor analysis (FA) has often been confused with principal component's analysis (PCA) as the techniques are similar, although the underlying mathematical model is rather different [84]. Given the importance of these techniques to the present discussion, some background theory to FA and PCA is given below.
If we have
factors (usually columns of data, each representing
a separate variable), we may denote the factors as
.
Generally, we may suppose that these factors can be linearly combined
for each of
observations, to give an overall index,
(
), which can be written:
If
is large, however, it is advantageous to select the factors
(and coefficients) of greatest interest -- usually those factors
which display some interesting distribution or the greatest variance.
If some factors are to be omitted because their contribution is thought
to be negligible, then the error term,
, in each of the expressions
of Equation 4 increases. That is to say, this
term acknowledges that some error has been introduced.
In Equation 4, the coefficients
represent a matrix, denoted
. But the usual starting point for
this matrix is the sample covariance matrix,
, which is defined
by the relation:
It turns out that the factors,
, may then be drawn from the
characteristic forms of the
or
matrices, which
arise from eigenanalysis [103]. In fact, if
is
nonsingular (ie. the determinant of
, ie.
is non-zero), it can be shown that:
We may visualize this transformation as a rotation of the axes along
the axes of maximal variation and, if normalised, the columns of
give the direction cosines between the old and new axes. Determinants
may be used to obtain such roots for small problems, but for large
systems, iterative `power' methods are more appropriate. The Lapack
library provides a well-tested and freely available set of such routines,
using eg. single value decomposition [58,68].
It is useful to consider several further extensions to Equation 7 so that we may elucidate the process of PCA and data transformation (including `sphering'; see Equation 1). Jackson [134], for example, employs the identities:
|
The object of PCA is the linear transformation of a group of correlated
variables, such that some set of optimal conditions is achieved. The
most important condition is that the new variables are uncorrelated.
Mathematically, this is ensured by the requirement that
(using Kronecker's delta function:
if
, or 0 otherwise). Even so, these identities suggest many
possible transforms. From the above,
might be written as
(the inverse square root of covariance), so that the data sphering
problem of Equation 1 can be solved in this
way (although for large problems, the Lapack library offers more viable
inversion routines). Hence, we might apply the
-transform:
The choice of the forms of scaling presented above will be influenced
by the type of application and the nature of inferential statistics
required. In practice, however, it is often more appropriate to work
with the correlation matrix,
, rather than that of covariance
(
), although this may be less powerful statistically. The reason
for this is that PCA is sensitive to additional variance which may
be introduced by:
Amongst the many properties of latent roots, we may cite two of special interest:
Another useful technique -- the `scree' plot -- depicts each eigenvalue as a sequence from largest to smallest. Those components which then lie above the `scree' at the base of the slope are thought to be the most useful candidates for further analysis, see Fig. 17.
An alternative stopping point is the Kaiser criterion [26], according to which we only retain factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much information as the equivalent of one original variable, we drop it. It is noted that a more sophisticated test, developed by Bartlett and extended by Anderson, also exists to provide a significance test for the equality of smaller roots. Having chosen to adopt a given number of components, it is also possible to assess the error from the `lost' components by inverting the system represented by Equation 4 and using residual analysis [134, p.34,87].
Although used widely for decades, many authors have given warnings
about the wisdom of interpreting PCA results. Jackson has observed
that the physical importance of a component is not necessarily related
to the size of the characteristic root; indeed, the equivalent variation
in a lesser component might have a greater effect on the human eye
or ear and hence scaling might be a more important factor in visualization.
The use of inferential statistics such as Hotelling
-test
may help, although this would probably not qualify as an exploratory
technique [134, p.55].
Everitt, on the other hand, favours the `labelling' of PCs, albeit
with caution, and advocates the re-scaling of variables (eg. by negating)
if this makes the process easier [103]. Whether or not
labels can be attached, however, a useful graphical technique is to
take the first two or three components and perform pairs plots (or
a scatterplot; see 2.4.2) to identify clusters and
outliers. To facilitate this, PCA `scores' can be obtained by taking
the element-wise product of the
chosen components with
data
terms ie.
for
.
Fig. 18 shows the use of PCA scores in two different problems. In the upper plot (a), the iris data are separated largely by the first principal component and although the groups are not distinct, there is relatively little overlap (cf Section 2.4.3). In the lower plot (b), three components are distinguished by classification into two groups, ie. star or galaxy for the SuperCOSMOS data (see A.1). This difference is highlighted by quadratic regression surfaces through these scores (cf Fig. 4).
(a)
![]()
(b)
![]()
|
A surprising feature of PCA is that, although the vector components are uncorrelated, the individual coefficients are not. On reflection, this is less surprising within each component since the terms are usually constrained by their sum (see Table 3) but it turns out that the coefficients of adjacent components are also often strongly correlated. This would seem to indicate that correlation tables and perhaps (graphically) parallel coordinates may be able to help, especially if components with small roots might be important, since, otherwise, these would likely be rejected.
A more widespread technique of re-interpretation is to attempt to
rotate the PCs until they are aligned with some notionally simpler
system, usually so that the coefficients are as close to zero or one
as possible. If the new system has
components assembled as
,
then the requirements may be stated as:
Instead of rotating the system to maximise (or minimise) the influence
of variables of interest, Mardia et al [150] have
suggested that PCA may be used to remove variables directly. First,
PCA is undertaken and a method selected to find the most appropriate
number of components,
, such as by scree-plot. Then one begins
a process of deselection by examining the least significant PC and
rejecting the variable associated with the largest coefficient. The
variable associated with the largest coefficient of the second least
significant PC is then rejected and so on, until only
variables
remain.
It is noted here that PCA has also been developed to provide an analysis of principal curves, particularly useful where it is suspected that the primary relationships between variables are non-linear. This approach is available in S-PLUS and R pcurve library, based on work by Hastie and De'ath (see [35]), and has also been applied to astronomy data via neural networks (see 5.8.4).
One of the main objections to PCA is the absence an underlying statistical model. It is difficult, therefore, to interpret the meaning of the resulting components with any certainty or to give an estimate of the effect of sampling error. Whilst FA overcomes some of these issues, it introduces new problems. To start with, it must be assumed that the underlying factors actually exist, and that, where there is more than one factor, that the factors are independent of each other and any error terms (see Equation 4) [134].
For many authors, another disquieting feature was the willingness of statisticians to employ rotation techniques to find more readily amenable factors. Since these rotations often appeared to be equally plausible, this suggested a considerable degree of subjectivity. The result, particularly for FA, was a considerable amount of adverse publicity. In 1989, Chatfield and Collins observed that few cases existed where the technique had proven to be useful outside the field of psychology [84]. But in recent years, a significant step-forward has been the development and inclusion of maximum-likelihood procedures in software packages, such as SPSS [61] and R, and applications of FA in science and engineering are much more common (see eg. [119,140]).
In PCA, the fact that the sum of the latent roots and the sum of the original variance is the same (4.3.1) is often interpreted as supposing that PCA accounts for all the variation in the data. FA, on the other hand, is said to explain correlations. In the early 20th Century, Spearman observed that the ratio of off-diagonal covariance was often similar between rows, suggesting the presence of underlying `factors' [147].
The general FA model is stated in Equation 4.
Ideally, the variable
in this model is the
th test score
with mean zero and unit variance; the coefficients
are the
factor loadings for the
th test; and
(
)
are uncorrelated common factors, also with mean zero and unit variance.
It is assumed that the variables(
) are correlated with the
factors, but are otherwise uncorrelated with each other. With this
model:
If the provisional loadings for the analysis are generated by PCA, then the resulting test is one of principal factors. The procedure for performing the analysis has 3 steps:
Modern software tools permit more rigorous techniques, such as confirmatory analysis, as well as the more exploratory methods discussed here [147]. In general, however, it should be noted that that FA is likely to be more successful than PCA if there is reason to believe that there is some underlying structure to the data.
Everitt notes that despite the large number of published algorithms available to do rotations, only a few are in general use. These are summarised in Table 4.
|
Despite the indeterminacy of particular rotations, it is clear that the distribution of points remains invariant. FA simply seeks to describe this distribution as `simply' as possible. In some ways, a more intractable problem is that, unlike PCA, it is difficult to calculate scores using these factors since the relationship essentially has to be inverted. However, some attempt can be made, if it is assumed that the distribution is normal [103, p.76].
As a simple example, Table 5 shows
loadings for factor analysis (with promax rotation) and PCA
(unrotated) of sunspot data contributed by S. Dalla of AstroGrid (see
Appendix A.6; in press). The first factor
has tended to combine variates associated with the magnitude and size
of sunspots; the second, the number of spots and longitude; the third
is bound to the variable `mindistar', the great circle angle associated
with he nearest adjacent sunspot; and the fourth relates mainly to
time along the sun's 11 year cycle (denoted `dt'). It is straightforward
to obtain plots of scores on this factor data to identify outliers,
anomalies or other patterns using eg. parallel coordinates.
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
For data which shows a strong degree of correlation, both FA and PCA should give similar results and the converse is also true: poorly correlated data will give components similar to the original variables, whilst FA will generate meaningless results. Table 6 gives a comparison of the two techniques.
|
In general, the application of PCA or FA depends very much on the problem at hand. If the data contain a large number of relatively anonymous or indistinct variables, such as shapelet coefficients [152,165], then PCA is the more obvious choice. This is also true if it is suspected that many variables are contributing very little variance to the results, in which case, a `weeding-out' of such variables (eg. by association with the smallest eigenvalues) is recommended.
On the other hand, if there are fewer, more distinct variables (as with the sunspot data), then PCA is unlikely to be helpful. The result will probably be a mixing of these variables which will be highly counter-intuitive. In this case, FA with rotation (preferably using maximum likelihood procedures) may give a more satisfactory grouping of variables, which will be all the more plausible if this agrees with the hypothesis under test or some notion of the underlying behaviour of the system. By definition, PCA is entirely arbitrary -- no underlying model can be supported, and it is not possible to attach significance to the results.
This form of analysis dates back to the work of Hotelling in the 1930s,
in which he was interested whether reading ability was related to
mathematical ability in schoolchildren. Realising that the observed
variables: reading power (
) and reading speed (
),
and mathematical power (
) and mathematical speed (
),
naturally fell into two groups, he proposed that:
If a table of distances (or proximity matrix),
, can
be defined between objects (or observations), it is usually possible
to employ multidimensional scaling (MDS) to give a lower dimensional
representation. This immediately suggests that such an algorithm is
since
operations are
needed to build the distance table. The observations will rarely lie
in the same plane, however, and hence these distances will need to
be adjusted if they are to be plotted in the lower dimensional space.
This leads to the concept of a stress function in MDS which
will usually need to be minimised.
In some respects, MDS presents the opposite problem to PCA, although
components still provide the solution. Our starting point, however,
is the matrix
, from which were are attempting to determine
a data matrix,
, but usually in some `reduced' form ie. with
lower dimensionality.
In classical MDS, the distance matrix,
, is usually defined by
the Euclidean distance, so that:
where
contains the first
eigenvectors and
,
the first
eigenvalues. Sometimes, however, the distances are
not Euclidean, so that
is no longer positive definite and negative
eigenvalues will occur. This leads to coordinates (sometimes called
principal coordinates) which are complex.
For plotting purposes, however, the best-fitting
-dimensional
representation is given by the
eigenvectors of
corresponding
to the
largest (positive) eigenvalues. As an example, Fig. 19
shows a projection of sunspot data (from longitude/latitude information)
alongside that for flare activity, which can be cycled as an animated
gif file.
The adequacy of a
-dimensional representation can be judged by
the size of the criterion:
A simple regression procedure for performing MDS can be described as follows [147]:
It should be noted that this algorithm relates to metric scaling ie. some
linear or polynomial relationship exists between
and
.
In non-metric scaling, only the ordering of the data distances is
important. This gives more flexibility to obtain clearer low-dimensional
representations.
Cox and Cox have observed that very little work has been done on diagnostics for MDS, other than the use of stress formulae and Shepard plots (scatterplots of the interpoint distances against the original dissimilarities, eg. see [153]). Some attempt, however, has been made to visualize disparities by use of interactive plots using colour `smearing' and density estimates, though the effects of extreme outliers can be rather devastating [90, p.94]. Nevertheless, Spence and Lewandowsky have proposed a number of more robust parameterisations, using transformations with medians [180]. Dissimilarities may also be grouped, especially for larger data sets and this is the principle behind the implementation of CViz [97].
Non-linear techniques include Sammon and self-organised (or Kohonen) maps (see the demo at [170]; [163] for discussion of outliers, and 5.8 below). Kohonen maps were first introduced in the 1980s and have found many practical applications, but been criticised for their ad hoc approach to convergence and handling of missing or categorical data. Verbeek, however, has proposed various extensions to improve these criteria [201].
Fig. 20, gives some impression of Kohonen maps using the R package `KlaR' [168] with the iris data set (Appendix A.2) and shows six separate plots by starting with different seed parameters.
The starting parameters provide an initial random distribution of the data across a rectangular grid. Each data point is represented by an eight direction edge map, the final configuration of which is dictated through iterations of a stress function. As with many of these component-based and neural network techniques, this is rather difficult understand intuitively. However, a notable feature of this visualization is that, unlike the hierarchical clustering results described earlier, there are only two clearly separated groups, although a third is suggested in most of the plots. Since two of the groups are not linearly separable (see 2.4.3), this display would appear to give an accurate reflection of the data, despite the evident variation in the plots.
Correspondence analysis (CA) is closely related to classical MDS,
except that it is applied to
-squared distances,
following the assumption that most multivariate problems will give
a reasonable approximation to the
-distribution and, consequently,
that the proportions of the data in each row and column provide the
basis for measurement. Everitt has observed that correspondence analysis
(CA) is essentially a technique for displaying multivariate categorical
data by deriving appropriate coordinates to represent both the row
and column variables. In effect, CA can be regarded as either [103, p.104]:
Independent components analysis (ICA) can be thought of as another factor rotation method, where the goal is to find rotations that maximise certain independence criteria. The assumption is that a given signal will be a mixed response and where this is so, blind ICA separation of the signal gives very good results. It is also used for signals that are not supposed to be generated by a mixing for analysis purposes.
The statistical method finds the independent components (aka factors, latent variables or sources) by maximising the statistical independence of the estimated components. Non-Gaussianity, motivated by the central limit theorem, is one method for measuring the independence of the components. Non-Gaussianity can be measured, for instance, by kurtosis or approximations of negentropy. Typical algorithms for ICA use centering, whitening and dimensionality reduction as preprocessing steps in order to simplify and reduce the complexity of the problem for the actual iterative algorithm. Whitening and dimension reduction can be achieved with principal component analysis or singular value decomposition. Algorithms for ICA include infomax, FastICA and JADE, but there are many others also. The ICA method is not able to extract the actual number of source signals, the order of the source signals, nor the signs or the scales of the sources [108,132].
A major problem in application of independent component analysis (ICA) is that the reliability of the estimated independent components is not known. Firstly, the finite sample size induces statistical errors in the estimation. Secondly, as real data never exactly follows the ICA model, the contrast function used in the estimation may have many local minima which are all equally good, or the practical algorithm may not always perform properly, for example getting stuck in local minima with strongly suboptimal values of the contrast function.
One implementation which is relevant here is Icasso (available as a package for MATLAB; see [46]). This tool attempts to provide an exploratory visualization method for investigating the relations between estimates from FastICA, by running the algorithm many times with different initial values or with differently bootstrapped data sets (see 3.3.2). Reliable estimates are said to correspond to tight clusters, and unreliable ones to points which do not belong to any such cluster.
Projection pursuit (PP) is a type of statistical technique which involves finding the most `interesting' possible projections in multidimensional data. Often, projections which deviate more from a Normal distribution are considered to be more interesting. PP is a linear method that, unlike PCA and FA, can incorporate higher than second-order information, and thus is useful for non-Gaussian datasets. It is more computationally intensive than second-order methods.
Given a projection index that defines the most interesting direction, PP looks for the directions that optimise that index. As the Gaussian distribution is the least interesting distribution (having the least structure), projection indices usually measure some aspect of non-Gaussianity. If, however, one uses the second-order maximum variance, with orthogonal projections, PP yields the familiar PCA. A commonly used higher-order projection index is based on the negative Shannon entropy. A commonly used higher-order projection index is based on negative Shannon entropy. The FastICA algorithm for ICA can also used to find projection pursuit directions (see above) [108].
Regression methods can be used for dimension reduction when the goal
is to model a response variable
in terms of a set of
variables. The regression function can be linear or non-linear. In
traditional analyses, it was generally assumed that the explanatory
variables (
) were carefully selected and uncorrelated. In
current data mining applications, however, those assumptions rarely
hold. Variable selection, or dimension reduction, is therefore needed
for such cases. A well-known statistical variable selection method
is step-wise regression, where different models are fitted using different
subsets of the explanatory variables. The results are then compared
by calculating various goodness-of-fit measures, and the subset with
the best measure is chosen as the explanatory variables with the reduced
dimension. A similar approach, selecting the most relevant features
by evaluating random subsets of the original features, is called the
wrapper method in the machine learning community [108,67,207].
This section introduces a number of row-reduction techniques which might be described generally as clustering methods. Conventional cluster analysis (3.3.3) comprises at least three different techniques: hierarchical, partitional (notably by k-means) and spectral. Other algorithms, such as vector quantization (5.4.5), provide mechanisms to perform clustering, or what might be better described as proto-clustering. Density sampling is useful both for modelling the distribution of data and as the basis for a broad class of model-based clustering techniques (Section 5.4). Classification (5.7) may be defined as a supervised form of clustering, implying that some data partitions already exist, either by the efforts of human experts (supervised learning), or by previous processing and statistical testing (unsupervised learning).
An important reason for introducing these various techniques is that
clustering is an NP-complex problem ie. one which can only really
be solved by trying every available option. This is usually impossible,
even with the fastest processors available. As an example, choosing
5 groups from a 100 individuals would require searching through some
possible
partitions. Some compromise must be found and
it is quite likely that initial solutions will not be optimal; steps
or analyses will often need to be repeated before the results are
satisfactory [103].
The essential aim of clustering is to define groups which possess
some average property, but which is distinct from that of other groups.
It is common to employ distance-based metrics since this is probably
the most intuitive, though Euclidean distance is not always the most
obvious choice. Various distance based measures for
-dimensional
problems are defined here for convenience:
Distance-based techniques rely upon the chosen metric to assess the
difference between any two data observations. As there are
pairs in any data set, this immediately suggests
complexity. The complexity of the distance metric also has an impact
(see [114, p.25] and [90, Table 1.1]
for a more complete list). The Mahalanobis distance function increases
with increasing distance between group centres and with decreasing
within-group variation (see 5.6.2),
but is similar to Euclidean distance if the correlation between variables
is slight. It is particularly useful if:
Hierarchical or agglomerative techniques employ a distance metric (such as the above), but there is no clear notion of the number of clusters. Instead, a tree of clusters is formed either from the bottom (`agglomerative') or from the top (`divisive'). Agglomerative clustering starts by considering each data point as a single cluster and proceeds by iteratively merging the two most similar clusters. Divisive clustering algorithms start with one cluster containing all the data and proceed by iteratively splitting the clusters into two disjoint subsets.
The agglomerative and divisive methods differ greatly in their computational
demands. Suppose we have
data points, then the number of possible
mergers an agglomerative algorithm has to consider in the first step
is
(any two data points can be merged to form
a cluster of size two). In total,
mergers have to be made to
construct the complete cluster hierarchy, and in total
possible mergers have to be considered. The number of possible splits
a divisive algorithm has to consider in the first step alone is already
. Therefore, divisive algorithms that consider
every possible split to find the optimal split will already be intractable
for moderately sized data sets. Agglomerative algorithms therefore
scale much better.
The method of agglomeration, however, differs in the (dis)similarity
measure that is used. Most dissimilarity measures,
, between
clusters A and B are defined in terms of a distance measure
between elements of the clusters. Well-known measures are:
Other methods include centroid linkage, median linkage and sum-of-squares error (see Equation 16). The choice of these may be dependent upon the problem at hand: single linkage tends to emphasise outliers, but performs poorly if the clusters are non-spherical, leading to the phenomenon known as `chaining'. Centroid linkage appears to be better if the cluster sizes vary. Above all, it should be realised that hierarchical clustering techniques may give very different results on the same data.
Some estimate of the most appropriate number of clusters can be obtained by monitoring the point at which the difference between pairs gives the maximum difference overall, and this is the approach adopted in HCE [174]. More formal methods (eg. the upper tail rule) are described in [104, p.77], but it should be noted that there is no universally agreed consensus and that `informal and subjective criteria' often predominate.
For partitioning techniques, the number of clusters,
, must be
known (or supposed) in advance. Their main attraction over the hierarchical
methods is that the algorithms are generally much more efficient.
Their main drawback is that assumptions on the shape of the clusters
have to be made in advance. In k-means, for example, there
is an implicit assumption that the data exhibits compact spherical
clusters. Also note that whilst some work has been done on estimating
the number of clusters from the data, this issue remains generally
unresolved.
The k-means algorithm is probably the most common partitioning-type algorithm, but expectation-maximisation is also popular (see 5.2.4). Given an initial k-clustering the k-means algorithm adapts this clustering to reduce the sum-of-squares error criterion given by:
The k-means algorithm alternates between two steps to improve
a given k-clustering. The idea is to treat
in (16)
as a function of both the cluster centres
and the assignment
of data points to the clusters. In the first step we fix the assignment
of data to clusters and minimise the error with respect to the cluster
means
. The optimal cluster centres are given by the mean
of the data assigned to each cluster. After this step the error (as
function of both the assignments and cluster means) equals
.
In the second step we fix the cluster means and minimise the error
with respect to the assignment of the data to the clusters. This amounts
to assigning every point
to the cluster
.
After these two steps we obtain a new k-clustering for which
cannot be larger than for the previous k-clustering
[104,201].
The algorithm is guaranteed to terminate after a finite number of
steps since: (i) there are a finite number of k-clusterings;
and (ii), each step cannot increase the sum-of-squares error. However,
the k-means algorithm does not necessarily terminate with the
k-clustering yielding the minimum sum-of-squares. In practice,
therefore, the analysis is usually repeated with different initial
centres (
) to avoid issues with local minima and with increasing
values of
to find the point at which some overall metric (eg. total
distance from the selected centres) ceases to improve.
A criticism of techniques such as k-means is that they provide
only `hard' estimates of membership to a particular cluster. But an
observation which lies at the boundary between cluster divisions will
probably be more accurately specified by having associated probabilities
of membership to either cluster. this is the basis for fuzzy
clustering; see [104, p.164] and implementation at
[107]. Another variation is the Quality Threshold
(QT) algorithm which defines a maximum radius for clusters and does
not, therefore, require
to be specified in advance and has found
particular favour in genetics research; see eg. [115].
Probabilistic mixture distributions are another important class of partitional clustering methods. A mixture distribution is a distribution that is a weighted sum (weights are positive and sum to one) of several component distributions. The components are restricted to some parametric family and the fit of the model to the data is defined as the likelihood of the data according to the model.
Often the expectation-maximization (EM) algorithm is used to adjust the parameters such that the likelihood is increased. The EM algorithm is reminiscent of the k-means algorithm: it also iterates between a step assigning data to clusters and a step that optimises parameters that characterise the clusters. However, whereas the k-means algorithm assigns each data point to a single cluster, the EM algorithm uses a `soft' assignment: that is, each data point is assigned to each cluster, but in a weighted manner. The EM algorithm shares two important properties with the k-means algorithm:
Spectral clustering is a relatively recent approach which applies an eigenvalue decomposition to the similarity matrix before clustering begins. This becomes, in effect, a dimensional reduction of the space, selected parts of which may be clustered thereafter. A popular application is in image segmentation where different regions of the image may be treated separately. In data mining, the technique has also been applied to the division of documents available on the world wide web into separate genres [27].
Spectral methods are attractive because they make less severe assumptions on the shape of the clusters than partitioning algorithms and can be very fast, depending on the sparsity of the similarity matrix. Moreover, the implementation of most spectral clustering algorithms is quite easy since the main component is a procedure to find a few eigenvectors of a matrix: this is a well studied problem for which highly optimised implementations are available [176,201].
It is useful to give a brief summary of these techniques for comparison.
Agglomerative techniques are the least scalable since they are either
or
depending on
whether or not after each merge all distances between all clusters
have to be computed. The partitioning algorithms discussed here are
for k
clusters (in practice, however, the algorithm will need to be repeated
many times for different values of
and initial centres). In the
assignment step each combination of data point and cluster has to
be considered to find the optimal assignments. Spectral clustering
approaches need the similarity for each of the
pairs of points
and thus take in principle at least
time to
compute the similarity matrix.
Depending on the similarity measure, speed-ups might be possible.
For example, one could use a matrix in which an entry
is non-zero only if
is among the q-nearest neighbours
of
or vice-versa. Efficient techniques exist to find nearest
neighbours among
points in time
for
fixed q. If similarity matrix used in spectral clustering is
sparse, iterations of the power method take an amount of time proportional
to the number of non-zero entries [201].
The assumptions underlying the EM algorithm seem rather restricting, but are at least well-defined, whereas they are often poorly understood for the other techniques. Some classes of problem seem may be better adapted to each. Interestingly, the set of densities that can be implemented using standard component classes can be increased by mapping the data to a space of much larger dimensionality where the new dimensions are (non-linear) functions of the original variables. Efficient parameter estimation of mixtures in the higher dimensional space is possible using the kernel trick which is discussed in [167,139,83,202] and illustrated in Fig. 21.
A detailed account of dealing with missing data is presented in Little and Rubin [145]. This issue causes particular difficulty for clustering techniques since the use of summary statistics is likely to produce a bias in the choice of clusters. Gower's general similarity coefficient appears to be one possibility since the existing data are weighted to provide an initial dissimilarity matrix and missing variables are imputed later. Iterative techniques improve on this mechanism since correlations between other members of the group and the available data can also be accommodated [104].
Statisticians have developed many methods of estimating unknown distribution parameters using functions of the existing data. A bootstrap estimator, for example, samples with replacement (therefore suggesting `bootstrapping', since many other samples are possible). Whilst the original objective of cross-validation (see 3.3.4) was to verify replicability of results and that of `jackknife' was to detect outliers, Efron developed bootstrap for inferential purposes [102,33]. Markov Chain Monte Carlo methods use a random walk to provide samples (without replacement), the most common application being that of numerically calculating multi-dimensional integrals [51].
With relatively small datasets, visual tools such as HCE are able to provide cues with respect to interesting patterns within the distribution of each variable. These cues, backed up by simple statistical parameters (kurtosis, skew, range etc) are therefore able to suggest regions of interest [174]. Histograms (both 1D and 2D) are also used to give an impression of the density of the data, and can be used to great effect when combined with other forms of visualization, eg. in Parvis [41], Fig. 22.
Nevertheless, various authors have cautioned that histograms are unreliable
for detecting modalities in the data, since discontinuities at the
edges of the bins, bin width and the choice of origin can exert strong
unwanted effects, especially in higher dimensions [76,104].
Silverman observed that even for the most `well-behaved' distributions,
such as the normal, the effects can be highly counter-intuitive. In
ten dimensions for example, data in the tails (which might easily
be ignored in lower dimensions) will comprise the bulk of the data8. We are used to the one dimensional case where 90% of the data generally
will generally lie within
standard deviations of the mean:
for ten dimensional data, this situation is reversed [177].
More powerful of distribution modelling are described in Section 5.4.
A further method to assess the nature of distributions arises from
information theory. Entropy reflects the information content
within a given data stream
and is defined as
, such that:
![]()
Two datasets X and Y (100 data points) show a hypothetical dependency |
The concept of entropy may also be extended to vectors of any dimension, where, using the form continuous variables, we have:
Density estimation is a powerful addition to the techniques described in 5.3 for sampling the data and modelling the distribution. Simple examples of density plots are shown in Fig. 24, which attempts to increase the data-ink ratio (see 2.2) of scatterplots by employing sample density plots below the diagonal.
Note that despite the sampling, which is limited to a 1000 points in each density plot, most plots show evidence of multiple modalities ie. clusters. Indeed, various clustering techniques can be built onto density estimates, and some examples are described later in this section.
For the most basic formulation, we note that a probability density
function
specifies that the probability
of
of the variable
lying in the interval between any two points
and
is given by:
The form of this probability expression is valuable for several different kinds of density sampling techniques. For brevity, this following sections (5.4.3 and 5.4.4) consider two basic types -- parametric and non-parametric . A third, which attempts to take the best features from both, is sometimes known as `semi-parametric' (or mixture modelling) and is described in Section 5.5 and eg. Bishop [76], in describing feed-forward neural networks (see also [123]). Finally, a technique which arises from vision research at Leeds is also introduced as a method of density estimation: Section 5.4.5 describes vector quantization.
Parametric methods model the distribution using some pre-defined form
by attempting to optimise the parameters for the model. The simplest
and most widely used is the normal distribution, which is discussed
immediately below, along with various methods of parameterisation.
The main drawback of this method is that the chosen form may not be
capable of providing a good representation (see comments in 1.2.2).
Non-parametric methods (eg. kernel-based methods,
-nearest neighbours)
make fewer assumptions and hence the model must be drawn entirely
from the data. In this case, somewhat paradoxically, the number of
parameters needed grows with the number of data points in the model,
which can therefore become rather unwieldy. Techniques to mitigate
this problem are also discussed.
A useful feature of the density space defined in Equation 20
is that expected values or functions may be included, eg. for
function
, we have the expected value
, where:
In
dimensions, the general multivariate normal probability density
can be written:
Several simplifications of the normal model are possible. For example,
by ignoring the off-diagonal elements of
, the contours of constant
density are aligned with the main axes and, moreover, the components
of
are said to be statistically independent. This implies that
the distribution of
can also be obtained from the product of
the distribution for each component, ie:
The normal distribution has a number of attractive properties, which may be summarised as [76]:
The two principal approaches to parameterisation are maximum
likelihood and Bayesian inference. Both methods tend to produce
similar results, but their conceptual basis is rather different. The
first method seeks find the optimum values for the parameters by maximising
a likelihood function observed from training data. In the Bayesian
approach, the parameters are described by a probability distribution.
These procedures are outlined briefly below:
Maximum likelihood: A density function
is
assumed to depend upon a set of parameters
.
In a classification problem, such a function would exist for each
class, but here we consider only one such function. Given a data set
of
vectors,
, then making
the dependency on
explicit, we write that:
where
can be viewed as a function
of
for fixed
and is known as the likelihood
of
for given
. To maximise the likelihood,
is chosen to maximise
. This corresponds
to the intuitively reasonable idea of choosing the value of
which is most likely to give rise to the observed data. In practice,
it is often more convenient to consider the negative logarithm of
the likelihood, in the form of an error function,
, which should
be minimised:
For most choices of density function, the optimum
would
be chosen by some iterative numerical procedure eg. by gradient descent.
For multivariate normal density, however, it can be shown that the
maximum likelihood estimates of the mean and covariance are the sample
average and sample covariance. This simplicity has a number of in
that the estimator is biased, especially for small
.
Bayesian inference: In this approach, the parameter vector
is itself associated with a prior probability distribution,
which is typically very broad, since little is assumed to be known.
After the data are `observed', the posterior probability density is
much narrower, suggesting that learning has occurred. The algorithm
proceeds by expressing the desired density as:
In the simplest case, a histogram is an example of a non-parametric kernel density estimator. Each data point contributes towards a count within a series of bins placed across the range of the distribution. It is clear that the choice of bin width will play a key role in determining the shape of the histogram: too narrow and the result will appear spiky; too broad and the distribution will begin to look uniform. The choice of origin may also play a determining role in the shape of the histogram, especially `edge' effects may be suspected. The basic histogram may be defined by the operator:
One advantage of the histogram is that it can be constructed sequentially
and the data discarded afterwards, if so desired. A distinct disadvantage,
however, is that the breaks at the boundaries of the histogram bins
cause discontinuities, which make the technique unsuitable for anything
but the most casual use. Another problem occurs in higher dimensions:
taking
intervals in
-dimensional space leads to an exponential
growth in the number of bins (
), inevitably resulting in a
density estimate of zero in most bins.
If the bins are centred over each data point, this leads to a smoother, but not necessarily smooth, plot (see [100] for examples). We may define the estimator as:
Rectangular, triangular and Gaussian kernels are popular, and several others have been applied (see Silverman [177, Table 3.1]). Their efficiencies, based on the mean integrated square error statistic all exceed 90% and it is common to find that particular kernels have been chosen for ease of use in a given situation. In general, any kernel may be used so long as the following holds:
The technique of K-Nearest Neighbours (KNN) is another widely applicable
technique for density estimation, and with only a small modification
may also be applied to classification problems more generally. The
main advantage of KNN is that the difficulty of the choice of
is avoided. It should also be noted, however, that unlike kernel density
estimates, there is no theoretical convergence as
.
From first principles, the probability that a new vector
drawn from an unknown density function
will fall
inside some region
of
-space is:
In applying Equation 26, there are two basic approaches
which can be adopted. If we fix the volume and determine
from
the data, this leads to a kernel density estimate as described in
5.4.4.2 above. If, however, the volume
is varied until it contains a fixed number of
points, we have
the KNN technique. More sophisticated versions of this algorithm,
with the aim of reducing storage and processing requirements, are
described in Hand and Batchelor [121]. Tree-search
techniques may also be employed to speed up the process of finding
the nearest neighbours to a given point [188, p.512]
and examples, using kd-trees, are described immediately below.
Kd-trees were introduced as a method to speed-up many statistical
analyses by Moore in 1991 [157]. An early application
was to KNN which provided
rather than
search times after construction of the tree (an
operation) [158]. The construction of the median-based
tree is outlined in Algorithm 1
and the principle is illustrated in Figure 25.
In addition, Moore also observed that the search path could be optimised
by the storage of additional information at the dividing nodes, such
as the bounding box of data below any given node. Indeed, this only
really required the storage of two arrays of
maximum and minimum
values at each node which then define the hypercube around
that section of the tree. Interestingly, better performance was obtained
(empirically) by dividing along the side of longest dimension rather
than the median, at least in the initial stages of construction. This
tended to promote the creation of larger hypercubes in regions with
no data, allowing empty partitions to be quickly discounted.
The storage of additional data become more firmly established with
the concept of multi-resolution kd-trees (mrkd-trees), in which the
mean and covariance were also stored and where considerably faster
performance was reported for kernel regression [94] and
mixture model clustering (see 5.5) [155].
Gray and Moore also realised that a whole class of conventionally
point-by-point comparison algorithms (such
as n-point correlation functions) could be much speeded-up
by the use of dual kd-trees [117]. This method
considers query nodes as chunks, also defined within a kd-tree, to
compare against an existing tree. By using approximating functions
with additional data stored at nodes (an operation called pruning),
it was found that performance was improved whilst errors could be
bounded to some constant fraction of the estimate.
The use of mrkd-trees for density estimation was explored in some
depth in [118]. The authors observed that non-parametric
methods take the strongest stance in allowing the data to ``speak
for themselves'', but that existing techniques render these methods
intractable for large datasets. The pruning rule for the kd-tree
approximation is essentially
where
is the number of elements defined at the node,
is the centroidal
mass and
is the given kernel (the Epanechnikov
form was preferred). A global error bound is achieved by approximating
the overall log likelihood using local bounds maintained for each
level of the query tree. Further improvements in accuracy can be obtained
by constructing a priority queue of pruned nodes and allowing the
approximation to be `undone' later if so desired. Empirical results
suggested that the dual-tree algorithm approaches
scaling although this was not proved.
A limitation observed in Deng and Moore's earlier work [94] was that the technique does not scale indefinitely into higher dimensions. This was said to arise because data points in 10 dimensions and beyond grow increasingly uniform in terms of distance between the points. Even so, beyond this level, Gray and Moore observed that that their algorithm was polynomial rather than exponential in the number of dimensions [118]. Furthermore, it was anticipated that performance should be improved using so-called ball-tree algorithms, even in very high dimensions [156].
In data compression, vector quantization (VQ) is a reduction technique used in `lossy' data compression in which the basic idea is to code (ie. replace with a key) values from a multi-dimensional vector space into values from a discrete subspace of lower dimension. The lower-space vector requires less storage space and the data is thus compressed. The transformation into the subspace is usually achieved through projection, or by using a codebook [28] (see algorithm below). These forms of compression are not explored further here (but see Khumbah and Wegman [141]), although it is noted that entropy-constrained VQ techniques have gained wide acceptance as effective clustering techniques in signal processing [86].
Instead, this section introduces research undertaken at Leeds in the area of video surveillance [137]. This work is of interest since video generates huge datasets and although the individual frames were analysed in 2D, objects within the scene were tracked in terms of position and trajectory, giving rise to 4D vector data. Semi-parametric methods (see 5.5) were considered for this task, but were rejected due to the limitations of handling large datasets and because the underlying distribution was unknown (see 5.5.5). VQ was essentially developed here as a method of non-parametric density estimation, allowing any number of `prototype' clusters to be derived and updated over an arbitrary period of time. Figure 26 gives a visualization of the movement of objects (pedestrians) in a typical scene resulting from this work.
VQ places a set of
prototypes within a given
-dimensional
space
, referred to as the codebook,
which may be updated over
iterations. The method is outlined
in Algorithm 2.
This procedure follows Kohonen, who showed that such an algorithm
is a gradient descent procedure for the approximation of an optimal
VQ which acts to minimise the expected squared reconstruction error
(see [142]). Moreover, when
, the point density
of prototypes, after the learning phase, approximates the probability
density of training data, and each prototype represents (in a nearest-neighbour
sense) an approximately equal number of training vectors. The cooling
schedule detailed in step 4 was chosen to give a large initial gain
which decreased gradually over the first half of the learning process,
allowing prototypes to move rapidly into roughly the desired distribution.
During the remainder of the learning process, the relatively small
fixed gain allowed fine tuning of the prototype distribution as the
system reached equilibrium.
The remaining model parameters must be determined experimentally. The number of iterations required to achieve a reasonable distribution is dependent on the number of prototypes, the dimensionality of the feature space, and the attributes of the distribution being estimated, and, in Johnson's experiments, was typically in the order of millions of iterations. In cluster analysis, a `natural number' of prototypes can be determined by observing the reconstruction error for increasing numbers of prototypes. In the estimation of dispersed distributions, such an analysis is only useful in determining an absolute minimum number of prototypes; the number chosen is then essentially arbitrary, more prototypes giving a more detailed representation.
Johnson cited three areas where specific action was necessary to improve the performance of the algorithm:
In their review in 2002, Fraley and Raftery observed that although considerable research had been undertaken in the area of conventional clustering techniques (see 5.2), there was little systematic guidance associated with these methods for solving basic practical questions such as how many clusters to select, the type of method or how how outliers should be handled. Moreover, the statistical properties of these methods were generally unknown, precluding any kind of formal inference [110]. At that time, it was a relatively recent discovery that mixture models could provide a principled statistical approach to these problems. The problems of determining the number of clusters and of choosing an appropriate clustering method could be recast as statistical model choice problems, different models compared and more components introduced (if needed) to represent outlying data.
This technique has therefore become firmly established as a clustering method, although it is essentially based on density estimation and exists a compromise between the parametric and non-parametric methods described above. In particular it was hoped to combine: (a) the ability of parametric models to be evaluated quickly for each new input value; with (b), the capacity for non-parametric models to accommodate very general distribution models [76]. Earlier work had also established that in some senses these methods of clustering were complementary. Hierarchical clustering based on maximum-likelihood tends to produce reasonably good partitions when started without any information about the groupings, whereas initialisation is critical for EM-based mixture models (see 5.2.4) because the likelihood surface tends to have multiple modes. EM, however, typically produces improved partitions when started from more reasonable partitions generated elsewhere. By initialising EM with partitions from hierarchical agglomeration and by using approximate Bayes factors with the Bayesian Information Criterion (BIC) approximation, a number of researchers had obtained good results in difficult problem areas such as minefield detection. Fraley and Raftery [109] later extended this method to select the parameterisation of the model as well as the number of clusters simultaneously using BIC.
Given data
with independent multivariate observations
,
the likelihood for a mixture model with
components is:
where
and
are the density
and parameters of the
th component in the mixture and
is the probability that an observation belongs to the
th component
(
). Most commonly
is the multivariate normal density, parameterised by its mean
and covariance matrix
. Some description of the analysis and
breakdown of this form has been give in Section 5.4.3.1;
other extensions using eigenanalysis and other techniques are summarised
in [110].
The EM algorithm, introduced by Dempster et al in 1997 [93]
is a general approach to maximum likelihood formulation which avoids
the difficult problem of solving otherwise highly non-linear coupled
equations [76]. The data are viewed as consisting of
multivariate observations
recoverable from
in which
are said to be observed and
unobserved.
If the
are independent and identically distributed (iid)
according to a probability distribution
with parameters
,
then the complete-data likelihood is:
The EM algorithm alternates between two steps in which the log-likelihood estimates are refined as follows:
Assuming that each
is iid according to a multinomial distribution
of one draw from
categories with probabilities
and that the density of an observation
given
is
given by
,
the resulting complete-data log-likelihood is:
The two most basic issues in conventional cluster analysis -- the clustering method and the number of clusters -- are reduced to that of a single concern in mixture modelling, that is model selection. This will likely involve a trade-off: a simpler model may need more clusters to produce a good representation whilst fewer clusters may suffice for a more complex model. This is equivalent to the problem in conventional techniques with, say, the choice of using spherical or ellipsoidal clusters (see eg. 5.2.3).
One solution is based on Bayesian model selection via Bayes factors
and posterior model probabilities. Prior probabilities are chosen
(possibly to be equal) and the posterior probability of the model
given data
is given by:
It should be noted that EM estimation for mixture models has a number of limitations:
Another limitation of model-based clustering is that in high dimensions, the number of parameters per component in multivariate normal mixtures (which allow the orientation to vary between the clusters) grows as the square of the dimension of the data. Moreover, if the dimension of the data is high relative to the number of observations, the covariance estimates in ellipsoidal models will often be singular, causing the EM algorithm to break down (although other models eg. spherical may still be applicable). Often, some kind of dimension reduction strategy is inevitable: see Mukherjee et al [159] and Murtagh et al [162] for examples using eg. PCA.
A major limiting factor is that model-based clustering as described here does not scale to large datasets. In particular, for efficiency reasons, the hierarchical agglomeration step has a memory requirement which is proportional to the square of the number of groups. (By default, each element is usually assigned to a separate partition.) The most usual approach is take a random sample of the data, and treat this sample as training data in a discriminatory analysis problem (see Section 5.7). Another restriction is that the computation time for EM iteration depends only upon the dimensionality of the data when the data can all be held in memory. If this is not the case, then computation time increases considerably and even if this is acceptable, numerical errors due to machine precision may well be a problem. Again, sampling the data may well be the only remaining possibility [110].
This example follows [103, Sec 6.4] in applying Fraley and Raftery's mclust routines to exoplanet data, but uses an updated dataset of 170 candidate planets around main sequence stars obtained recently (November 2005) from [171] (see Appendix A.4). Only the mass, period and eccentricity variables are used in this analysis. To generate the clusters, maximum-likelihood analysis identifies the model and groupings with the highest BIC number, as shown in Figure 27.
This shows that 5 clusters has, in fact, the highest probability using model ``4''. The available models for cluster shapes are:
Confidence regions can also be placed around the clustered regions as demonstrated in Fig. 29.
It is interesting to note that in Everitt's original working of this example, which contained only 101 exoplanets discovered up to October 2002, model ``4'' was also selected, but that the number of clusters was suggested to be three. This combination also appears to be highly rated in the current BIC classification, but since 5 clusters is slightly higher, this is selected by default (the number of clusters can, however, be enforced in underlying EMclust algorithm).
Canonical Variate Analysis (CVA) is used to examine the interrelationships between a number of populations simultaneously with a goal of objectively representing the interrelationships graphically in two or three dimensions. The axes of variation are chosen to maximize the separation between the populations relative to the variation within each of the populations. In algebraic terms, the first canonical variate is the linear combination of variables that maximizes the ratio of between-groups sums of squares to the within-groups sums of squares for a one-way multivariate analysis of variance of the canonical variate scores (see immediately below for definitions and [34]).
CVA, sometimes known as discriminant function analysis, is related to analysis of variance (ANOVA) techniques, since both employ measures of within-group and inter-group variance. These may be defined as:
If
is low, this indicates that the variation within samples
variation is low compared to the total and is evidence that the samples
do not come from populations with the same mean vector. Wilks' lambda
can also be obtained from the eigenvalues,
,
of
using the identity:
Everitt et al have observed that there are numerous variations
on techniques employing these measures for clustering directly, using
eg. minimisation of
, minimisation of
and maximisation of
, although the first
of these is the most common [104]. One problem with
this method, however, is that the results are scale-dependent, which
presents problems if the data have to be standardised. A further problem
is that it appears to impose a spherical structure upon the generated
clusters, even when the observed clusters are clearly visible by eye.
The problem of scale-dependency provided the motivation to develop several other methods, using modified criteria:
Clustering and classification are opposite sides of the same coin, the main difference being that in classification tasks, we are more concerned with labelling the clusters (and thus, by definition, each observation) in some way. The specification of classes might be through some expert presence, eg. a botanist describing the species present in the iris data, or through some discriminant present in the data eg. ellipticity values to distinguish between stars and galaxies. In either case, some observations may be misclassified and hence the concept of validation is important. This issue is discussed in Section 5.7.2. Given the evident similarity between clustering and classification tasks, this section does not this subject in depth, but we extend some examples discussed earlier in the text to illustrate how clustering procedures may be adapted (Sections 5.7.3 and 5.7.4).
An important technique in classification, especially in data-mining tasks, is that of cross-validation. Several variations can be described [29]:
Equation 34 may be expressed as the following discriminant function:
The function in Equation 37 are examples
of linear discriminants, since they are linear functions of
.
Decision boundaries corresponding to
are then hyperplanar. Linear discriminants are closely related to
neural network models which have a single layer of adaptive weights
[76].
As with clustering, non-parametric methods would seem to be most offer the widest scope for astronomy, since fewer assumptions about the data are made. This section considers the nearest neighbour (Section 5.7.4.2) and tree-based approaches (5.7.4.3) and briefly describes refinements which may be obtained by boosting and bagging techniques (5.7.4.4). Another technique which can be mentioned is that of class cover catch digraphs, although libraries for these are not yet available; see [149] for discussion of the application to data from the Sloan Digital Sky Survey.
The computation of densities from the nearest neighbour (NN) rule was described in Section 5.4.4.3. A classifier can be generated through the use of Bayes' Theorem by modelling the class-conditional densities for each class separately and then combining them with priors to give posterior probabilities. The KNN technique can be adapted as follows:
One of the chief drawbacks of the NN classifier is that it is slow
to execute, since every point is checked, although techniques are
available for improving performance. One approach is to pre-sort the
training sets in some way, eg. by kd-trees (see 5.4.4.4)
or Voronoi cells. Another solution is to choose a subset of the training
data such that classification by the
-NN rule (using the subset)
approximates the Bayes classifier. This can result in significant
speed improvements as
can now be limited to 1 and redundant data
points have been removed from the training set. These data modification
techniques can also improve the performance through removing points
that cause misclassifications (see also: condensing and editing
[123,43]).
Recursive partitioning developed as data mining tool in the 1990s and provides a reasonably fast, although computer-intensive, choice for non-parametric classification. The CART system is available commercially (Salford Systems [56]), although similar packages are available for S-Plus and R [189]. Sutton argues that the technique has several further advantages: (i) little expertise is required to interpret the results; (ii) important variables are identified as a result; and (iii) the method can also be used to impute missing values [184].
The tree is built by the following process: first the single variable is found which `best' splits the data into two groups. The data is separated, and then this process is applied separately to each sub-group, and so on recursively until the subgroups either reach a minimum size or until no improvement can be made. The resultant model is, with certainty, too complex, and care is needed to decide when to stop. The second stage of the procedure consists of using cross-validation to trim back the full tree.
The details of the forward tree creation and pruning algorithms are somewhat complex, but are fully described in Therneau et al [190]. We note, however, that in the tree-growing process, the split selected at each stage is the one that leads to the greatest reduction in the sum of squared differences between the response values for the learning sample cases, corresponding to a particular node and their sample mean (or possibly the median). The decision may be based upon one or linear combinations of variables and may also include categorical data. An impurity function is usually employed which attempts to minimise this error.
Figure 30 displays a tree generated by the rpart algorithm for the sunspot data described earlier in 4.4.4.
Only the number of spots, area of spots, number of flares and magnetic classes (4 categories present; see [53]) are included as variates. Interestingly, the first split lies mainly upon the magnetic class, such that the alpha class is placed left in the hierarchy, whilst the remaining classes are distributed to the right. The other divisions are governed by the criteria shown in the figure, with the number of observations in each class given by the ``n='' term, except the node at ``mag_class=b'', which divides the beta and gamma magnetic groups.It should be noted, however, that Sutton observes several caveats in using recursive partitioning [184, p.317]:
Like recursive tree partitioning (upon which these algorithms depend; see above), these techniques are aimed primarily at reducing the variance associated with prediction from regression. Bagging (a shortened form of `Bootstrap aggregating') is a special case of model averaging approach [79]. The algorithm can be given as [30]:
Boosting is a highly respected method of combining classifiers, which are iteratively created from weighted versions of the learning sample, with the weights adaptively adjusted at each step to give increased weight to cases which were misclassified on the previous step [123]. The final predictions are obtained by weighting the results of the iteratively produced predictors. The technique was originally developed for weak learners ie. a classifier which gives only slightly better results than random guessing. At each stage, the output of the weak learner is then added to the learned function, with some weighting (proportional to how accurate the weak learner is). Then, the data are reweighted: examples that the current learned function gets wrong are ``boosted'' in importance, so that future weak learners will attempt to fix the errors [184].
There are several different boosting algorithms, depending on the exact mathematical form of the strength and weight. One of the most common boosting algorithms is AdaBoost (available for R from [96]). Most boosting algorithms fit into the AnyBoost framework, which shows that boosting performs gradient descent in function space [31].
This section gives a brief overview of neural networks (NNets), which provide many analogues for the dimension reduction and clustering techniques described above. The main difference is that training data will normally have to be provided, so that weights for the NNet can be initialised. The AstroGrid/VOTech project is fortunate to be able to exploit the work of the Astroneural team for astronomical purposes, and this work is briefly reviewed in Section 5.8.4. Section 5.8.2 gives an overview of the history and development of NNet. A simple algorithm is given in 5.8.3. More extensive coverage is available in Bishop [76] and Hastie [123].
Most neural approaches are based on combinations of elementary processors
(neurons), each of which takes a number of inputs and generates a
single output. Associated with each input is a weight, and the output
(in most cases) is then a function of the weighted sum of inputs;
this input function may be discrete or continuous, depending on the
variety of network. If the inputs are denoted by
and the weights by
, then the total input to the
neuron is:
This model saw a lot of use during the earliest phase of development culminating in the concept of the perceptron [76,179]. As these neurons become interconnected, they are intended to mimic actions in the brain. These interconnections may then form a simple system, something which is highly structured or else more difficult non-linearities. Examples may be:
A significant advance in the NNet research came with the development of the back-propagation algorithm which trains strictly layered networks in which it is assumed that at least one layer exists between the input and output (in fact it can be shown that two such `hidden' layers always suffice). Such a NNet is an example of a feed-forward network, in which data are admitted at the inputs and travel towards the outputs, at which the `answer' may be read.
The standard approach to use such networks is to obtain a training
set of data ie. a set of labelled vectors.This is used to teach a
network with some training algorithm, such that the NNet can perform
the association accurately. In `live' mode, the NNet is fed by unknown
patterns and the answers are generated according to what has been
learned. Back-propagation works by comparing the output of the network
to that expected, and computing an error measure based upon the sum
of squared differences. This is then minimised using a gradient descent
method implemented by changing the weights of the NNet. Denoting a
member of the training set by
, the outputs by
and
the desired outputs by
, the error is:
A summary of the procedure is given in Algorithm 3.
It should be noted that, as presented, this algorithm is rather slow
but there is an extensive literature upon speeding-up the algorithm
eg. by the introduction of momentum [124]. Numerous
other extensions can be be listed, notably by Kohonen [142]for
unsupervised learning and Hopfield for optimisation [130].
Early work by the Astroneural project was targeted towards star and galaxy separation in wide field astronomical images. It was found that a feed-forward neural network with two layers was able to generate a robust non-linear PCA reduction of the input images with comparable performance in the classification of stars/galaxies to the SExtractor tool [74][69].
Later, research was focused upon the generation of probabilistic principal surfaces (PPS) in the fields of gene data mining [182] and astronomical visualization [181]. PPS are a nonlinear extension of principal components, such that each node on the PPS is the average of all data points that project near/onto it. From a theoretical standpoint, PPS are a generalisation of the Generative Topographic Mapping. The endpoint is a data reduction from high-dimensional space onto a 3D spherical surface. Having projected the data into the latent sphere, the analysis must localize the most interesting data points (this depends on the specific application). For example, the ones lying farthest away from more dense areas, or the ones lying in the overlapping regions between clusters may be of interest and the system allows the user to select and obtain all the information about the typology of the data.
Typically, however the clusters or outliers detected in this way are
hard to interpret and hence it is usual for the PPS analysis results
to be used as an input into more conventional clustering algorithms
eg. hierarchical clustering. Evaluation of the system using the Great
Observatory Origin Deep Survey catalogue (28405 objects with 21 parameters;
a `sub-medium' sized table of
entries) showed an initial
error rate 60%, which was reduced to 2.15% (best case) after tuning
(timing for analysis or tuning was not given). At the current time
the developers are seeking to port the code (presently MATLAB-based)
into a non-proprietary format.
This report has attempted to give formal definition to the problem of handling large, multidimensional tables of data which frequently result from searches of astronomical catalogues. It has been helpful to employ the nomenclature of Wegman and Solka in describing such tables as `small', `medium', `large' etc, so that we can speak of different classes of problem and the solutions which may be appropriate. This is the approach adopted in Section 6.2, where it will be seen that, with some reservations, the various visualization techniques described may permit access to the next level of table size.
Although such large tables frequently occur, it is also usually the case that many of the included variables are of secondary interest and can thus be immediately discarded or `hidden'. It is somewhat surprising therefore that many software tools give this process little consideration. It is common for example for menus to be displayed accepting all data by default and giving little informative analysis (eg. HCE) or even proceeding without further ado to draw the display (eg. XmdvTool). For medium or large data tables, this is usually a mistake and will generally cause the application to freeze or crash. The possible format for an improved data selection tool is described in Section 6.4.
Table 7 presents the applicability
of various sample algorithms and visualization techniques for datasets
of small to large sizes.
|
(P Coords = parallel coordinates)
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
An important exception to this general rule is with the R package itself (although it may be argued that R is more of a statistical than a visualization tool). The authors write:
There are limitations on the types of data that R handles well. Since all data being manipulated by R are resident in memory, and several copies of the data can be created during execution of a function, R is not well suited to extremely large data sets. Data objects that are more than a (few) hundred megabytes in size can cause R to run out of memory [166].Empirical evidence here suggests that R can import medium-sized tables reasonably well, but that
Even if it were possible to load such large tables, the data are usually too numerous to assess from ordinary plots and histograms or density plots will be needed. (Density plots are essentially 2D histograms accumulated with a bin-width equal to the pixel resolution of the display.) However, various authors have cautioned against the use of histograms in such situations since noise within the data and the origin and orientation of bins can have a profound effects upon the resulting display (see 5.3.2). Density estimation, which applies a smoothing kernel to data samples, is therefore recommended.
Appendix B gives a summary of R routines used in ththiss report and a list of others which were provisionally explored. This list is not exhaustive and it is likely that many other statistical functions would be useful (eg. ANOVA). Besides the size limitations described above, the scripting interface represents presents several disadvantages:
XmdvTool has been released in two versions: a `lite' Java version (which has already been VO-enabled - see [186]) and a larger C++ based version which is able to cope with larger data sets (by employing threading and pre-fetching algorithms [169]) and provides a number of extra utilities (such as column reordering and hierarchical clustering; see 2.4.3). The lite version, however, relies upon a native openGL interface (GL4java) which is no longer available and is only capable of handling very small data sets. As a provisional test, the code has been transferred to make use of the later jogl interface [50], and this shows a greater level of interactivity at least into the small-medium table range for the parallel coordinate display.
It is likely that, with further optimisation, the jogl version could be a credible tool for datasets of medium size, but given the higher-level functionality of the full version, it may be more desirable to place the effort into enabling the full version initially. It is intended that a library of xmlrpc calls can be constructed for Eirik to communicate with the AstroGrid system (eg. Workbench), and hence it is supposed that the conversion of either xmdv versions will be of similar complexity. A faster and simpler alternative would be to adopt the R `one-way' approach: data could be pushed out to XmdvTool in its own ``okc'' ASCII format. This would no doubt be less effective, but a similar approach could be adopted for many other tools, including those only available in binary formats such as CViz and HCE (see 2.4.6).
Given the various problems highlighted throughout this report with respect to the handling of large datasets, it is suggested that a data selection tool in the manner of Dos Santos (see 2.6), may be a useful technique to develop further. However, a tree-based metaphor for data selection (perhaps like that of the hypviewer in Section 2.5) is thought to be more appropriate. It is hoped that this interface may be combined with `sparkline' summaries of the data (see 2.2.2) to maintain an overview of the data.
Another significant concern is that of simple combinatorial complexity in investigating the relationships between variables. Choosing each possible combination of 2 axes from 10 variables, for example, requires searching through 45 possible plots - in 3D, this figure rises to 120 or 210 plots depending on whether 3 or 4 `axes' are selected. This may represent a colossal investment of the users time, especially if the data are densely packed or scattered throughout the volume, since many zoom-type operations will be needed and the possibility of losing the reference frame and getting lost in the data is quite high. It is proposed that, by linking to R (and other statistical libraries) it should be possible to present density plots of each variable of interest alongside `levelplots' of covariance and mutual information statistics (see 5.3.3) so that groupings and relationships in the data are more visibly exposed. Further consideration of the format of this tool will be given in a future document.
This document was generated using the LaTeX2HTML translator Version 2002-2-1 (1.70)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -no_subdir -split 0 -show_section_numbers /tmp/lyx_tmpdir27699TVm770/lyx_tmpbuf1/alg1.tex
The translation was initiated by richard holbrey on 2006-02-01