Elastohydrodynamic Lubrication

Parallelism

The need for parallelism

The need for parallel computing in EHL calculations is becoming more apparent. The push for solutions to more realistic problems, such as real surface roughness, requires much larger datasets to be calculated. Accurate representations of real surfaces are important because a perfectly smooth contact model ignores the extra pressure required to flatten the bumps as they pass through the centre of the contact. Lubricant film behaviour will differ, and extreme operating conditions will arise at higher pressures or with greater local variation in the solution.

An example of real roughness is shown in the picture to the right. This is a measured rough surface supplied by Shell. The data has a resolution of 256x256 datapoints, but to model it accurately many more mesh points are needed in between.

Real surface roughness picture

Why is parallel EHL a challenging problem?

The main difficulty with parallelising EHL is the highly coupled system of equations defining the problem. Since the deformation at every point is calculated from the pressures at every other point, global communication between all the processors is implied. In a shared memory setting this alone would not pose a problem; however, the memory requirements grow with problem size and make this an unworkable method. For example, the largest domain we have used is 8093x8093 points. Once all the work and solution matrices have been included this works out at a minimum requirement of 27Gb of memory. Whilst we do have access to one machine with that much memory, it is clearly not available for runs this large all the time, and so a distributed memory version of the code is required.
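As a rough check on the memory figure quoted above, the following sketch estimates the storage for one full-grid array and how many such arrays 27Gb corresponds to. It assumes 8-byte double-precision values and reads 27Gb as 27 x 10^9 bytes; neither is stated in the text, and the function names are illustrative only.

```python
# Back-of-envelope memory estimate for a square EHL grid.
# Assumption (not from the text): each stored value is an 8-byte double.
BYTES_PER_VALUE = 8


def array_bytes(points_per_side):
    """Bytes needed to store one full-grid array of doubles."""
    return points_per_side * points_per_side * BYTES_PER_VALUE


def arrays_in_budget(points_per_side, budget_bytes):
    """How many full-grid arrays fit in a given memory budget."""
    return budget_bytes / array_bytes(points_per_side)


if __name__ == "__main__":
    n = 8093          # largest domain quoted in the text
    budget = 27e9     # 27Gb, read here as 27 * 10**9 bytes (assumption)
    print(f"one array: {array_bytes(n) / 1e9:.2f} GB")
    print(f"arrays in budget: {arrays_in_budget(n, budget):.1f}")
```

Under these assumptions a single array takes roughly half a gigabyte, so the quoted total corresponds to around fifty full-grid work and solution arrays, which gives a feel for why the footprint per processor matters.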

The deformation calculation depends not just on the pressure at every point but also on how far away that point is. We are already using multilevel multi-integration and multigrid schemes, so solutions on many grids (and half grids) are required. Communication of data must be optimised for speed; however, an overwhelming consideration has to be minimising the memory footprint per processor to enable the code to scale to as many processors as possible.
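The global coupling described above can be illustrated with a small one-dimensional sketch in which the deformation at each point is a distance-weighted sum over the pressure at every point. The kernel used here is a placeholder chosen only because it decays with distance; it is not the elastic kernel used in the EHL code, and the function name is hypothetical.

```python
def deformation(pressure, x):
    """Direct summation: every output point uses every pressure value.

    The weight h / (dist + h) is a hypothetical, distance-dependent
    kernel for illustration only, NOT the real EHL elastic kernel.
    """
    h = x[1] - x[0]                      # uniform mesh spacing
    n = len(x)
    w = []
    for i in range(n):
        total = 0.0
        for j in range(n):               # loop over ALL pressure points
            dist = abs(x[i] - x[j])
            total += h / (dist + h) * pressure[j]
        w.append(total)
    return w


if __name__ == "__main__":
    x = [-1.0 + 0.25 * k for k in range(9)]
    p = [0.0] * 9
    p[4] = 1.0                           # unit load at the centre point
    for xi, wi in zip(x, deformation(p, x)):
        print(f"x = {xi:+.2f}  w = {wi:.3f}")
```

A single loaded point produces a non-zero deformation everywhere, so a processor that owns only part of the domain still needs pressure data from all the others. The double loop also shows the quadratic cost of direct summation, which the multilevel multi-integration scheme mentioned above is intended to reduce.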

The parallelisation of the code I have undertaken is explained in a paper in the Proceedings of PARA'02, Lecture Notes in Computer Science, vol 2367, pp 521--529. The results given there are for smaller numbers of processors than will be shown in the forthcoming journal paper, currently in preparation. It should be noted that for larger numbers of processors finer coarsest grids need to be used in the parallel solver because of the memory overlaps required between processors. This means a slightly more accurate solution is being computed at the expense of some of the multilevel speed-up. For a strict comparison of like cases, the parallel speed-up is better.

                      Number of processors
Grid  Dimensions      1         2        4        8        16
7     257x257         20.99     11.68    6.78     5.52     10.23
8     513x513         74.92     41.10    23.78    15.17    25.65
9     1025x1025       289.87    155.66   92.37    54.12    46.21
10    2049x2049       1139.83   613.93   343.46   207.70   148.17
11    4097x4097       -         -        -        811.64   538.96
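The speed-up and parallel efficiency implied by the table can be worked out with a short script. This is a sketch that reads the columns headed 1 to 16 as processor counts and the entries as run times (the unit is not stated in the text); the dictionary layout and function names are illustrative only.

```python
# Timings copied from the table above, keyed by grid level and then by
# number of processors. Missing entries ("-") are simply omitted.
TIMES = {
    7:  {1: 20.99, 2: 11.68, 4: 6.78, 8: 5.52, 16: 10.23},
    8:  {1: 74.92, 2: 41.10, 4: 23.78, 8: 15.17, 16: 25.65},
    9:  {1: 289.87, 2: 155.66, 4: 92.37, 8: 54.12, 16: 46.21},
    10: {1: 1139.83, 2: 613.93, 4: 343.46, 8: 207.70, 16: 148.17},
    11: {8: 811.64, 16: 538.96},
}


def speedup(times, procs):
    """Speed-up relative to the smallest processor count with a timing."""
    base = min(times)
    return times[base] / times[procs]


def efficiency(times, procs):
    """Speed-up divided by the increase in processor count over the base."""
    base = min(times)
    return speedup(times, procs) * base / procs


if __name__ == "__main__":
    for grid, times in TIMES.items():
        base = min(times)
        for procs in sorted(times):
            print(f"grid {grid:2d}  p={procs:2d}  (base p={base})  "
                  f"speed-up={speedup(times, procs):5.2f}  "
                  f"efficiency={efficiency(times, procs):4.2f}")
```

On these figures, going from 8 to 16 processors increases the run time on the two smallest grids and reduces it on the larger ones. Grid 11 has no single-processor timing, so its values are measured relative to the 8-processor run.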

Parallel facilities used in these tests

The majority of the parallel development has been done on the machines in the White Rose Grid. These are Snowdon, a 256-processor Intel Pentium 4 distributed memory machine with 128 dual processor nodes, each with 2Gb of memory, connected by Myrinet 2000, and Maxima, a collection of shared memory SUN V880s (8 processors, 16Gb of memory) and a SunFire 6800 (20 processors, 44Gb of memory). Other machines the code has been tested on include an SGI Onyx (12 processors), an SGI Origin 2000 (32 processors), an Itanium cluster and a Quad Xeon PIII 4Gb 'desktop' machine.

