Richard Holbrey (VOTech/DS6)
The need for visual insight into data has been understood for nearly a century, but many years the field only progressed within the realm of statistics (see eg. Tukey [197]). Attempts to popularise the subject more generally were somewhat limited1 and developments in data mining and visualization owed little to this work: data mining researchers were spurred on by the advance of database technology and the creation of huge commercial datasets in early 1990s [209] (see 1.2); the impetus for a more dedicated visualization effort came in 1987, with the publication of the NSF Report [154], which recognised the need to render the huge datasets resulting from engineering advances in fluid dynamics, brain scanning etc.
Since that time, researchers have expended a good deal of effort developing a series of powerful, interactive applications. Indeed, some analysts are now able to consider the fidelity of their visualizations by direct comparison with the original data [77]. Recently, however, the literature reveals a slightly more introspective mood [200]. The growth of `visual analytics', around questions of homeland security, has emphasised the danger of information overload and the need to harness new technology, such as high resolution and stereo monitors, in order to maintain an overview of the data [191]. Other researchers have stressed the need to grasp the semantics of visualization, and for the creation of ontologies, to ease communication between scientists using, for example, the e-Science grid [99].
Despite these advances, the problem of finding suitable methods for the exploration and analysis of large astronomical datasets remains formidable [148,208]. This document was instigated at the request of members of the VOTech DS6 team in August 2005 with the aim of providing a synthesis of relevant data mining and visualization algorithms, towards Stage 2 of the project [38]. The emphasis is on methods of bridging the gap between accurate representations of the data and the capabilities of current technology and users (see 1.2, below). Methods of dimension reduction of data are therefore prominent. Some mathematical background is given in this area (especially with respect to eigenanalysis) although readers are generally referred to other texts: supporting references have been drawn from Jackson [134], Manly [147], Rao et al [206] and Hastie et al [123].
Data mining (DM) involves the retrospective analysis of large bodies of data, with the intention of recovering new facets which were hitherto unknown. Statisticians, however, remain divided over the need for the formulation of a priori hypotheses. On one hand, researchers have welcomed the chance to try new methods upon existing data sets without such constrictions. On the other, this has meant the abandonment of the probabilistic roots of the profession. In particular, it is true that data which has not been generated by deliberate random sampling cannot be generalised to the population at large. Nevertheless, the sheer size of some data sets indicates a measure of validity.
In this light, Wegman and Solka have defined a number of feasibility
limits for data mining based on current processing power, the cost
of storage and streaming capabilities [206]. For
example, an algorithm was considered feasible if, given the size of
the data set and the complexity of the algorithm applied, it could
be completed in less than a week. At the time of writing (2004), this
was thought to be achievable for most
type problems
without need for special computing resources. For
algorithms, eg. clustering, this limit restricted the size of data
table which could be processed to around
entries ie. `medium'
sized data tables, see Table 1.
|
|
A defining characteristic of DM is that data sets are often so large that -- without special hardware -- it may be impossible to load the complete dataset into memory at once (see [36]). It follows that the objective of DM searches is usually more exploratory than experimental ie. performed to generate hypotheses, rather than to test them. Also, it suggests that methods are needed initially to process the data as a stream, perhaps with some loss of precision, and that, later, a subset of the data will need to be selected.
A further distinction is sometimes made between exploratory data analysis (EDA) and DM. EDA makes use of various techniques designed specifically to identify patterns in multivariate data (or univariate sequences), such as factor analysis, clustering and so on (see Section 3). But the focus of DM is on producing a solution that can generate useful predictions. DM researchers, therefore, have a much greater acceptance of `black box' approaches, such as neural networks. Whilst these are certainly capable of generating valid predictions, it is not possible to identify the nature of the interrelations between the variables on which the predictions are based [183]2.
Scott also points out that exploratory analyses should ideally be non-parametric, despite the fact that assumptions about normality are common. Many parametric techniques are reasonably robust to deviations from normality, but for exploratory work the results can occasionally be very unpredictable [173]. This point has recently been echoed by Gray and Moore [118], who agree that non-parametric methods ``are generally the method of choice in exploratory data analyses...[but] often come at the heaviest computational cost.'' This issue is discussed further in Section 5.3.
The role of graphics in data analysis has long been recognized as important, if not universally. In the 1930s, Fisher wrote: ``Diagrams prove nothing, but bring outstanding features readily to the eye; they are...no substitute for such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in explaining the conclusions founded upon them'' [173]. Despite this realization, however, it was not until the 1960s that Tukey finally made them respectable [196, p.53]; and not until 1987 that visualization was recognized as separate discipline with distinct aims.
Since that time, a good deal of effort has been expended to develop a number of powerful, highly interactive applications: Iris Explorer [1], Amira [2], Spotfire [57], Visage [208] etc). Humans are particularly good at detecting certain patterns and interactivity-based approaches are aimed at engaging our senses in order to aid our cognitive abilities. The hope is, in effect, to let the user `see' something new and from this perspective, interactivity is vital [66,64,175]. Against this view, however, various costs must be set:
The approach adopted here is essentially borne of practical necessity.
Queries of astronomical catalogues generally result in large data
tables which may be hundreds of columns wide (representing the number
of measured variables) and contain
-
rows (the
number of observations), placing these data sets easily within Wegman
and Solka's medium to large categories (
table entries,
see above). In preliminary work up to this point, it was noted that
some software tools -- perhaps the majority? -- work reasonably
well with data sets of the order of MB (`small' to `medium' sized),
but struggle with data tables of larger sizes for two main reasons:
It was thought convenient to treat the task of dimension reduction as two separate issues relating to (a) column reduction, and (b), row reduction. These topics are introduced in Sections 3.2 and 3.3. Readers who are already familiar with this territory are referred to the more succinct report by Fodor [108]. The treatment here is rather more in depth and a number of themes are apparent. The first, and probably the strongest, is that of the analysis of covariance leading to the techniques of principal components and factor analysis (amongst others). Indeed, an understanding of the difficulties and differences between these techniques is useful towards the understanding of the subject (multivariate analysis) as a whole. For this reason, these techniques are described in some detail in Section 4.
Another important theme is that of adopting a cyclic approach to data mining. A misleading term, which occurs in the literature, is knowledge discovery. The implication is that, after some processing step, domain experts should able to interpret newly `mined' data with much greater clarity than was previously possible. Wegman and Solka point out that this only really becomes possible if a considerable amount of effort is expended in cleaning up the data first. Any analysis must therefore be considered as part of a general data flow. Fig. 1 (adapted from [206, fig. 1]) illustrates this flow as a cycle to emphasise that each step in the process should be seen as one of data refinement, rather than as an immediate endpoint and visualization is no exception
. Performing a principal components analysis, for example, may not produce immediately usable results (since the axes are difficult to interpret) but could be used to maximise the linear separation of clusters (for visualization) or to reduce the set of variables under study (selection criteria).
The development of astronomical instruments and procedures for data gathering and storage is beyond the scope of this report. Nevertheless, it is useful to say something about the level of processing which can normally be expected. It is common, for example, for data to be calibrated depending upon the instrument used before the data is stored, but additional steps may be required to achieve the highest possible levels of precision. A number of software tools are able to assist here, such as the Aips++ library [32], and others provide facilities for immediate analysis, notably IDL [47], Aladin [4] and Xmatch [65,5]. Also, there are standardisation issues to do with the adoption of various naming schemes, units and so on, although it is hoped that the creation of the IVOA UCD and VOTable standards [6] may make this aspect easier to handle in future.
One of the most obvious features of Fig. 1 is that curation and preparation of the data forms the bulk of the work. In particular, care is needed to remove errors (noise, missing values etc) and to select appropriate variates for analysis. This may require domain expertise, since it may be necessary to:-
Given the number of different but relevant domains upon which this research might touch, the scope of the present report is restricted to:-
Visualization is a complex research area which has developed over the past few decades by building on theory from information design, computer graphics, human-computer interaction and cognitive science. In view of this conglomeration of ideas, the title (`limits of visual thinking') is intended to reflect that the development of computing algorithms must be matched by an understanding of the psychology of the user. In other words, it is not sufficient for researchers to demonstrate a given algorithmic complexity to ensure a successful visualization, since little is achieved if the display hinders rather than helps. However, rigorous user testing of visualization software is both difficult and rare [85]. In order to get some appreciation for the current limits of hardware and software, therefore, this section gives a brief survey of available tools but begins by revisiting some of the guidelines for good graphic design set out by Tufte (see also [42]) and Bertin [75].
Tufte has recently published a harsh critique of NASA engineers' usage of MicroSoft PowerPoint© instead of producing formal written reports prior to the Columbia disaster in 2003 [192]. This is a new variation of an old complaint: in the past, many authors have often exempted themselves from working with the data at hand and allowed new formats to obscure the end conclusions. Inappropriate scaling and ordering of data are common symptoms (see 2.2.4). Ultimately though, this is an issue of integrity and authors should be encouraged to think through, prepare and attach their own names to their own work [196,195, ch. 2].
Bertin observed that tabulating data is an essential part of constructing a well-defined problem, and may require a good deal of imagination. But a homogeneous table is unlikely to be effective in capturing the eye. The next stage of the process, therefore, involves graphic illustration, but this should leave open the question of the organisation of the table. Indeed it is unlikely that the first effort will be optimal. Exposing contrasts in the data involves some skill at the removal of unnecessary data-ink, juxtaposition and scaling [87]4. This may be automated to some extent, since it is partly based on patterns of sectors in the data or graphic, but still requires (re-)organisation in line with semantic reasoning [75].
Similarly, Tufte has argued that presenting similar graphics side-by-side can enhance the visualization by suggesting motion between the images, as if in animation. This motion indicates greater dimensionality and, hence, helps to reinforce the differences between each `frame' [195, ch. 6]. It is also one of the guiding principles behind the scattergram displays (see 2.4.2) and those of Andrew's Curves (see 2.4.5). Tufte has also proposed that `sparklines' can give intense word-sized contrasts, as shown in Fig. 2, which may be still more effective when used in multiples [193].
A common pitfall of new graphic designs (and 2D/3D computer displays) is to emphasise line-art and to overlook supporting orientation or scaling data. Tufte describes the effect as one of proliferating `chartjunk'. Often, the added weight of this detail tends to impose on the graphic and masks some of the meaning. Two quantities describe this problem:
In [196], Tufte advocates the removal of all non-data ink and any redundant ink. The axes of a scatterplot, for example (see 2.4.2), might be trimmed to show only the range of the data or to show median and quartiles. In this way, the data-ink ratio is maximized and the possibility for gratuitous self-decoration and moiré effects is removed. At the same time, thinned or greyed-out lines and varied line thickness can be used to great effect, if, say, a background grid is really needed.
In computer graphics, it might be supposed that the equivalent to the data-ink ratio would be data pixels (pixels representing data elements) to non-background window pixels; the data density might be data pixels to window pixels. Although providing useful concepts, these quantities may not be entirely appropriate, however, for electronic displays (see below).
Whilst data may not always contain cause and effect type relationships, the organisation of data plots is critical if such relationships are to be exposed. An oft-quoted example is that of the problem of `O'-ring failures on the Challenger space shuttle in 1986. NASA engineers went to some lengths to describe the risk prior to the launch, by using plots of rocket-shaped symbols with various kinds of cross-hatching to indicate particular failures. The rocket symbols, however, contributed nothing to the readability of the graphic and since they were ordered by the testing sequence, the explanatory variable (temperature) was completely obscured.
Tufte's reordering the plot, so that the symbols appear at the correct location along the temperature axis, makes the decision to launch looks a good deal riskier [195, p.49]; by removing the symbols and using a weighted metric to describe an overall damage statistic, this relationship is very much clearer [195, p.45]. More generally in scientific testing, it is necessary to employ experimental controls to provide a baseline for comparison. This baseline then permits us to relate variations in cause with variations in effect. Without such controls, we must assume that there may be many possible causes [195, p. 50-52].
Although some graphical techniques are useful for standardisation, it is useful to consider some of the basic requirements for cleaning data before proceeding. In the first place, the question of missing data needs to be tackled. Many software packages simply ignore data units with missing entries and this seems acceptable where there is sufficient data to do so. Often, however, this may not be possible or may even be unwise if there is some underlying cause which demands further explanation [104,147] (also note the earlier comments about keeping data in Section 1.3.2).
Scatterplots (see below) are undeniably useful for initial assessment, and it would be unusual today if this were not one of the analyst's first steps. Apart from gaining insight into the distribution, the nature and influence of outliers needs to be gauged. One method is to plot regression lines over the scatterplot, which may be locally weighted if the distribution seems uneven (see Fig. 4 and Appendix A.1 for a description of the data). Finding and possibly removing the `boundary' or convex hull of the data or is one way of assessing the influence outliers, see Fig. 3.
![]()
![]()
|
In these plots, outliers do not appear to be particularly distant
and keeping or removing the boundary points does not affect the adjacent
chi-plots. The basic principle of these is to assess the independence
of variables using the expected
-distribution, where dependency
is indicated by deviation from the central band running horizontally
across the centre of each plot. In this case, the lower plots show
a greater level of structure (ie. less independence), since relatively
few points lie in this zone [103]. Note that the presence
of two supposed groups may be more rigorously tested by the use of
canonical variate analysis (see 5.6.1).
The detection of outliers is discussed with reference to self-organised
maps in Section 4.6.
A common requirement before proceeding with many forms of analysis
is to scale the data, so that the range or units of the data do not
figure. Conversion to Z-score (autoscaling) is probably the
most common technique, so that the distribution has unit variance,
although division by the sample range has been shown to be more robust
in many clustering applications [104]. Principal components
analysis and some clustering techniques provide ready made methods
of normalization which are discussed later. For the multivariate case,
autoscaling can be achieved by employing the covariance matrix,
, where the transformation is:
Given the high dimensionality of most astronomical datasets, reduction to 2/3D for visualization purposes seems unavoidable. Also, the preference for many researchers is naturally for 2D graphics, so that (amongst other things) publication drawings can be considered. Various techniques for plotting in 2D are considered in this section.
Where the dimensionality gets very high, however, several problems
arise with reduction to 2D [173,205]. If we consider
the hypercube
, the
diagonals,
,
from the origin to the corners are of the form
.
The angle between the diagonals and the Euclidean axes,
,
is given by
, where:
|
In higher dimensions, the volume of hyperspace also appears distorted,
so that the distribution of data is also counter-intuitive. Since
the area of a circle can be written as
and the volume
of sphere as
, we can generalise the volume of a hypersphere
as
. It is easy to show that the volume of a thin hypershell
relative to that of the hypersphere approaches 1 as
increases.
In other words, most of the volume occurs within the shell. The result
is that if the data were uniformly distributed within the hypersphere,
most data would, in fact, lie within the shell. However, by projecting
the data onto a 2D plane, most of the data will appear to lie near
the origin of a circle, thus giving an unrealistic impression [205].
A typical scattergram (ie. a collection of scatterplots) is shown in Fig. 4, showing some of the SuperCOSMOS data set as depicted in R, with colour used as an additional `dimension' to reflect different classes in the data (here red implies `star', green a `galaxy'). Due to the linear nature of the plotting algorithm, the number of points which can be plotted is almost unbounded (but see below). There is a good deal of redundancy in the display (a poor data-ink ratio) since each smaller graphic is plotted twice, although statisticians seem to prefer this, so there is no particular bias to `x' or `y' axes. Some applications can make use of the plot windows on the diagonal eg. to plot histograms, as is the case here. The blue curvilinear feature in the scatterplots is a locally weighted regression (LOWESS) curve, after Cleveland [164]5.
Nevertheless, a practical limit of 10-15 variates exists for most displays. The use of `coplots' in R or dimensional stacking in XmdvTool [211] can make some relationships more explicit for, say, up to 5 variables.
Scott has observed that where the number of observations is large,
the areas of greatest density lose visual impact and the eye is drawn
to outlying data in the tails of the distribution [173].
One response has been the development of sunflower plots, which make
use of hexagonal bins to form a 2D histogram [101],
Fig. 5. Individual observations are
plotted when there is only
observation per bin as in a conventional
scatter plot. Each bin with from
observations contains a light
sunflower. Other bins contain a dark sunflower. In a light sunflower
each petal represents one observation. In a dark sunflower, each petal
represents
observations. (A dark sunflower with
petals represents
between
and
observations.)
Described by Inselberg over two decades ago [133], many tools are now capable of rendering parallel coordinates. The basic premise is very simple: instead of plotting data by using orthogonal axes, the axes are maintained parallel and the data plotted as lines across the axes. In this space, therefore, a point is mapped to a continuous line (and vice versa). This has the effect that the amount of data-ink increases relative to the background, but may obscure the display. Also, since it is common for the axes to be automatically scaled to the maximum and minimum for each variable, it is usually difficult to assess directly the correlation between any pair (cf. [205]).
Examples using the XmdvTool display are shown in Fig. 6, using a subset of 20 variables from the SuperCOSMOS dataset (1000 rows) [7].
(a)
![]()
(b)
![]()
|
A further problem is apparent when the number of variables increases (beyond, say, 15-20) in that it is difficult to separate the axes or associate axes with the respective variables. In fact, as the distance between axes decreases, the display becomes saturated with `data-ink' and any visible pattern is lost. An interesting response to the saturation problem was proposed by Lederman and implemented in Parvis [41]. Here, the author has created a fading brush technique and is also able to highlight single strands of data to assist the user. As a more innovative use of the display, histograms can also be plotted against the axes to indicate the basic data distribution.
A classic example of the use of glyphs was that of Chernoff faces, Fig. 7.
The assumption behind this approach is that the human eye is particularly sensitive to facial expressions, so that by mapping variables to particular features (eyebrow height, mouth curvature etc), we may be able to distinguish characteristic elements of the data more easily. Many other types of glyph, including stars, weathervanes and profiles, have been proposed. Ward has observed that the success of these is highly dependent upon colour and glyph placement [203], since they may be discrete or overplotted as required.Tufte has observed that despite the quixotic character of Chernoff faces, `ink' may saved by halving the faces, since they are usually symmetrical [195]. One problem which frequently occurs, however, is that of representing the key in glyph plots, eg. see Fig. 8. The use of text in the key creates a fairly large mental load and has the effect of distancing the user, even for this small data set.
![]()
|
An interesting implementation of the `juxtaposition' principle (2.2.2)
is that of Andrew's curves [70], in which each observation
is represented by a combination of the (scaled) variates,
,
into a sine wave of the form :
Several tools provide a greatly enhanced level of interactivity by combining displays and allowing highlights to be broadcast from one to the other, eg. Mirage [52]. One of the first tools to make use of so-called `brushing' of data between displays was Arc (itself based on XlispStat) which - even to this day - remains one of the few tools capable of generating regression planes to describe data [89]. XGobi [185] and XGVis [81] also adopted the brushing technique, adding various methods to display data in parallel coordinate, 3D rotation and making use of tours (animations) particularly with in conjunction with dimensional scaling (see 4.6 below, CViz [120] and HiSee [45]). Another valuable property of these systems is that data can can be excluded to examine the effect of outliers.
Several commercial systems have now been developed in this idiom, such as Spotfire [57]. One system which deserves further mention, however, is Hierarchical Clustering Explorer (HCE) [44], see Fig. 9.
A unique feature of this tool is that 1D and 2D histograms are used to suggest interesting distributions (and hence variates) to the user (but note 3.2.2). However, the use of a clustering algorithm as data are input forbids the processing of `medium' or larger data tables (see Table 1). This issue also affects XmdvTool when using the InterRing [213] or clustering data selection tools.
In cladistics6, trees are common tools for representing the distance between species graphically. To represent such trees within limited computer displays, however, various forms of non-linear presentation have been adopted using `backbone' spanning trees and focus/context motifs, such as the i-Disc [127], or hyperbolic forms, such as HypViewer, see Fig. 10, which shows the current UCD1+ hierarchy [160].
This system was capable of displaying some
Graph-based techniques have also been adopted within the visual programming environments of IRIS Explorer [1] and OpenDX [14] (see also ViSta [214]). Here, customised applications exist as modules of code which can be connected together graphically, so that data appears to flow around the system. For astronomy, VisIVO, which also employs a modular library approach (using VTK [60,172]), is currently being adapted for virtual observatory use [72]. A natural tendency of these systems has been towards the display of 3D, spatial data, making use of hardware rendering.
These tools have been particularly successful at visualizing scalar data and flow problems (see eg. [143] for more recent developments). For more general multidimensional data, there are several challenges. Figure 11 shows an example from VTK using `splatting' with the quasar data [59]. This technique aggregates the data by overlapping areas or volumes associated with each data point (here the u, g and r variables are described by the three axes; and i by the attributed colour scalar.
Higher dimensional (say 5D or 6D) data sets can also be rendered if glyphs, animation or contours are also be employed, although issues of occlusion and human perception (which varies particularly for colour) begin to take precedence [203]. However, the technique of hierarchical `splatting' (see Fig. 12) may offer faster rendering of very large tables of unstructured data [129].Working in 3D (and higher) therefore requires that a number of issues must be faced:-
It has already been observed that only the least complex (eg.
)
schemes are able to cope with `large' data sets as proposed. For this
reason, the simpler schemes here are described more thoroughly here
than might have been expected. Although this list is not exhaustive,
numerous web-pages exist for comparisons elsewhere [39,16,17,18,199].
In addition, it is convenient here to make a number of further observations:
The `curse of dimensionality' (following Bellman (1961) [20]) refers to the exponential growth of hypervolume as a function of dimensionality. In most cases, this issue is manifested as one of data distribution, which is usually very uneven within the hyperspace. Hence data may be abundant in some regions, but inadequate in others, especially if statistical inferences need to be drawn.
For astronomy, however, it is recognized that some variables are of greater interest than others. In particular, error terms, although important, are generally of lesser interest than their respective observational parameters. Hence there may be a hierarchy of data terms, which may be different for each researcher. The proposed UCD1+ terminology also suggests a data hierarchy [95,49] which might be exploited. It was proposed to examine this issue further by investigating current tree-building tools, and developing, if necessary, some new method of configurconfiguringating the display of data (see Section 6.4). The remainder of this section considers methods of table reduction, by column and then row.
The main column (ie. variate) reduction methods are summarised in Fig. 14.
A brief description of the elements in this figure is given below.
These techniques aim to give focus to particular units of data (often called `individuals') and, in this sense, provide a narrowing of the extent of the data. For example, the focus might be used as an input to select rows from a table. The techniques are summarised in Fig. 15 and below.
The techniques of factor analysis, principal components etc have developed over the course of the 20th Century, fuelled by research in fields as diverse as psychology and ballistics. One of the original aims was to extract the line or plane of maximal variance. Such a line, plane or hyperplane -- known as the principal component -- is actually one of regression about which the orthogonal error is minimised, see Fig. 16.
![]()
|
The advantage of this kind of error minimisation is that it allows any one variable to `predict' another (within certain tolerances) and these are then described as predictor and response variables. Also, if this axis is adopted as the main component of interest (typically, by transforming the data to this axis), then the remaining variance (orthogonal to this plane) appears to be procedural in origin. Moreover, the variables under this transformation are now said to be uncorrelated ie. it is no longer true that one variable may be used to predict the other. Only after this step can the variables be considered to be independent and hence this term is avoided in multivariate analysis.
Factor analysis (FA) has often been confused with principal component's analysis (PCA) as the techniques are similar, although the underlying mathematical model is rather different [84]. Given the importance of these techniques to the present discussion, some background theory to FA and PCA is given below.
If we have
factors (usually columns of data, each representing
a separate variable), we may denote the factors as
.
Generally, we may suppose that these factors can be linearly combined
for each of
observations, to give an overall index,
(
), which can be written:
If
is large, however, it is advantageous to select the factors
(and coefficients) of greatest interest -- usually those factors
which display some interesting distribution or the greatest variance.
If some factors are to be omitted because their contribution is thought
to be negligible, then the error term,
, in each of the expressions
of Equation 4 increases. That is to say, this
term acknowledges that some error has been introduced.
In Equation 4, the coefficients
represent a matrix, denoted
. But the usual starting point for
this matrix is the sample covariance matrix,
, which is defined
by the relation:
It turns out that the factors,
, may then be drawn from the
characteristic forms of the
or
matrices, which
arise from eigenanalysis [103]. In fact, if
is
nonsingular (ie. the determinant of
, ie.
is non-zero), it can be shown that:
We may visualize this transformation as a rotation of the axes along
the axes of maximal variation and, if normalised, the columns of
give the direction cosines between the old and new axes. Determinants
may be used to obtain such roots for small problems, but for large
systems, iterative `power' methods are more appropriate. The Lapack
library provides a well-tested and freely available set of such routines,
using eg. single value decomposition [58,68].
It is useful to consider several further extensions to Equation 7 so that we may elucidate the process of PCA and data transformation (including `sphering'; see Equation 1). Jackson [134], for example, employs the identities:
|
The object of PCA is the linear transformation of a group of correlated
variables, such that some set of optimal conditions is achieved. The
most important condition is that the new variables are uncorrelated.
Mathematically, this is ensured by the requirement that
(using Kronecker's delta function:
if
, or 0 otherwise). Even so, these identities suggest many
possible transforms. From the above,
might be written as
(the inverse square root of covariance), so that the data sphering
problem of Equation 1 can be solved in this
way (although for large problems, the Lapack library offers more viable
inversion routines). Hence, we might apply the
-transform:
The choice of the forms of scaling presented above will be influenced
by the type of application and the nature of inferential statistics
required. In practice, however, it is often more appropriate to work
with the correlation matrix,
, rather than that of covariance
(
), although this may be less powerful statistically. The reason
for this is that PCA is sensitive to additional variance which may
be introduced by:
Amongst the many properties of latent roots, we may cite two of special interest:
Another useful technique -- the `scree' plot -- depicts each eigenvalue as a sequence from largest to smallest. Those components which then lie above the `scree' at the base of the slope are thought to be the most useful candidates for further analysis, see Fig. 17.
An alternative stopping point is the Kaiser criterion [26], according to which we only retain factors with eigenvalues greater than 1. In essence this is like saying that, unless a factor extracts at least as much information as the equivalent of one original variable, we drop it. It is noted that a more sophisticated test, developed by Bartlett and extended by Anderson, also exists to provide a significance test for the equality of smaller roots. Having chosen to adopt a given number of components, it is also possible to assess the error from the `lost' components by inverting the system represented by Equation 4 and using residual analysis [134, p.34,87].
Although used widely for decades, many authors have given warnings
about the wisdom of interpreting PCA results. Jackson has observed
that the physical importance of a component is not necessarily related
to the size of the characteristic root; indeed, the equivalent variation
in a lesser component might have a greater effect on the human eye
or ear and hence scaling might be a more important factor in visualization.
The use of inferential statistics such as Hotelling
-test
may help, although this would probably not qualify as an exploratory
technique [134, p.55].
Everitt, on the other hand, favours the `labelling' of PCs, albeit
with caution, and advocates the re-scaling of variables (eg. by negating)
if this makes the process easier [103]. Whether or not
labels can be attached, however, a useful graphical technique is to
take the first two or three components and perform pairs plots (or
a scatterplot; see 2.4.2) to identify clusters and
outliers. To facilitate this, PCA `scores' can be obtained by taking
the element-wise product of the
chosen components with
data
terms ie.
for
.
Fig. 18 shows the use of PCA scores in two different problems. In the upper plot (a), the iris data are separated largely by the first principal component and although the groups are not distinct, there is relatively little overlap (cf Section 2.4.3). In the lower plot (b), three components are distinguished by classification into two groups, ie. star or galaxy for the SuperCOSMOS data (see A.1). This difference is highlighted by quadratic regression surfaces through these scores (cf Fig. 4).
(a)
![]()
(b)
![]()
|
A surprising feature of PCA is that, although the vector components are uncorrelated, the individual coefficients are not. On reflection, this is less surprising within each component since the terms are usually constrained by their sum (see Table 3) but it turns out that the coefficients of adjacent components are also often strongly correlated. This would seem to indicate that correlation tables and perhaps (graphically) parallel coordinates may be able to help, especially if components with small roots might be important, since, otherwise, these would likely be rejected.
A more widespread technique of re-interpretation is to attempt to
rotate the PCs until they are aligned with some notionally simpler
system, usually so that the coefficients are as close to zero or one
as possible. If the new system has
components assembled as
,
then the requirements may be stated as:
Instead of rotating the system to maximise (or minimise) the influence
of variables of interest, Mardia et al [150] have
suggested that PCA may be used to remove variables directly. First,
PCA is undertaken and a method selected to find the most appropriate
number of components,
, such as by scree-plot. Then one begins
a process of deselection by examining the least significant PC and
rejecting the variable associated with the largest coefficient. The
variable associated with the largest coefficient of the second least
significant PC is then rejected and so on, until only
variables
remain.
It is noted here that PCA has also been developed to provide an analysis of principal curves, particularly useful where it is suspected that the primary relationships between variables are non-linear. This approach is available in S-PLUS and R pcurve library, based on work by Hastie and De'ath (see [35]), and has also been applied to astronomy data via neural networks (see 5.8.4).
One of the main objections to PCA is the absence an underlying statistical model. It is difficult, therefore, to interpret the meaning of the resulting components with any certainty or to give an estimate of the effect of sampling error. Whilst FA overcomes some of these issues, it introduces new problems. To start with, it must be assumed that the underlying factors actually exist, and that, where there is more than one factor, that the factors are independent of each other and any error terms (see Equation 4) [134].
For many authors, another disquieting feature was the willingness of statisticians to employ rotation techniques to find more readily amenable factors. Since these rotations often appeared to be equally plausible, this suggested a considerable degree of subjectivity. The result, particularly for FA, was a considerable amount of adverse publicity. In 1989, Chatfield and Collins observed that few cases existed where the technique had proven to be useful outside the field of psychology [84]. But in recent years, a significant step-forward has been the development and inclusion of maximum-likelihood procedures in software packages, such as SPSS [61] and R, and applications of FA in science and engineering are much more common (see eg. [119,140]).
In PCA, the fact that the sum of the latent roots and the sum of the original variance is the same (4.3.1) is often interpreted as supposing that PCA accounts for all the variation in the data. FA, on the other hand, is said to explain correlations. In the early 20th Century, Spearman observed that the ratio of off-diagonal covariance was often similar between rows, suggesting the presence of underlying `factors' [147].
The general FA model is stated in Equation 4.
Ideally, the variable
in this model is the
th test score
with mean zero and unit variance; the coefficients
are the
factor loadings for the
th test; and
(
)
are uncorrelated common factors, also with mean zero and unit variance.
It is assumed that the variables(
) are correlated with the
factors, but are otherwise uncorrelated with each other. With this
model: