Object-oriented programming and fast computation techniques in Maud, a program for powder diffraction analysis written in Java^TM

Luca Lutterotti¹ and Mauro Bortolotti²

(1) (Present address) Earth and Planetary Science Department, University of California at Berkeley, 94720 - Berkeley, CA, USA, Dipartimento Ingegneria dei Materiali, Università di Trento, via Mesiano, 77, 38050 - Trento, Italy, [email protected] - WWW: http://www.ing.unitn.it/~luttero/ ; and Dipartimento Ingegneria dei Materiali, Università di Trento, via Mesiano, 77, 38050 - Trento, Italy, [email protected]

Introduction

The Maud project [1] started in 1996 as a mere exercise using the one year old Java language [2]. The idea was of writing a full Rietveld program easy to use and develop. But one idea was that the easy to use and develop concept should not lead to an underpowered program. Java at that time was just in its infancy, not ready for a full scientific program, but very promising for the possibility to run it on different platforms; powerful and clean, it was particularly focused on writing interfaces and web applet. Was it ready for a scientific application in need of speed and highly demanding mathematical performances? At the beginning not, still in 1996, with version 1.0 of the jdk (Java Development Kit), there were few bugs preventing a full computation of a powder pattern when the memory requirement was more than a 1 or 2 Mb. But one thing was already clear, writing a moderately complex interface was more easy than with other languages and coming from C/C++ finally you have a language were you don't need to deal with pointers. This means mainly a little bit less speed for the user but much less painful debugging for the developer. The first public version of Maud was released on the web in the early fall of 1998 with the jdk 1.1 that was the first really usable Java version for applications.

[MAUD program interface]

Fig 1: General view of the program interface where several plug-in models are visible. For example in the Instrument view (Hippo19 window), the combo box is open to see some of the different geometries already available; all the other combo boxes in the same frame correspond to other characteristics that support the plug-in structure and can be extended by providing additional models. For the phase frame (Calcite) under the Structure Factors tab panel there are others tree model-extendable features to work on structure solution.

In the early years more time was spent on designing the program and especially the internal structure to focus on the following aspects:

Objects and their arrangement inside the program should reflect the way a researcher is approaching diffraction, from the experiment to the analysis. So this means for example, that there should be an instrument, specified by several other objects like geometry, radiation etc. The root object was identified in an analysis containing basically 4 kinds of objects: samples, instruments, sets of data and phases. Each set of data has associated one instrument; each sample contains some of the phases and has associated some sets of data collected on it and so on.
No compromise on future development and possible analyses. So the structure should be open as much as possible and easily extendable. This should be not too much in contrast with speed an easy development.
Modifiable and extendable by advanced users. In order to accomplish that, the user must be able to write their own objects with algorithms (the controller) and also the interface (the view) to manage them. So instead of a pure separated controller and view model, a mixed one has been adopted in which each class can specify its own controller and view or inherits them from the superclass.

After the initial effort spent on designing the infrastructure the work has been concentrated on implementing the full program to cover as much aspects as possible in diffraction analyses with at least one or a couple of models for each methodology. The program covers now from Rietveld, to ab initio structure solution, to quantitative analysis, line broadening (crystallite size and microstrains), texture and residual stresses. Finally also reflectivity has been added for a specific project to couple it with diffraction. Only powder diffraction was considered.

In the last period of time, the general structure of the program has been fixed and the optimization step has been started. In the same time we started to write the documentation in order to provide help for users and for possible future developers who want to put their hands to customize or extend the program for their needs. This article is a starting point in this effort and we will discuss briefly the general OO (Object Oriented) structure of the program by some examples with code on how some tasks have been accomplished or can be extended, and some special techniques for fast computation in Java.

Plug-in structure and examples

To simplify and generalize the program a basic object has been developed and nearly all the other objects extend this class. This basic object provides the general structure and support for loading/saving, managing parameters, tree of objects and plug-in objects. Using that class as superclass greatly simplify the addition of different models in the program to extend and customize its features. For example, the geometry of an instrument is defined by an object extending the basic "Geometry" class. Each object, extending the "Geometry" class, can be used by Maud as a choice of a different geometry. So actually there is a "GeometryBraggBrentano" class, a "GeometryDebyeScherrer" class and more are available. Every time a new instrument geometry is required that doesn't match one of the already available in Maud, you can extend the one more similar overwriting just the method that need a different computation. A specific example has been setup to show how to include in Maud a new geometry in which the diffraction pattern is obtained by integrating the Laue rings obtained by an Imaging Plate or CCD flat camera in transmission. The example shows how to provide a different Lorentz-Polarization correction for that specific geometry. The coding example with the instructions to compile and include the class in Maud is available via http://www.ing.unitn.it/~luttero/maud/.

Another extendable feature of Maud is concerning the diffraction datafile format. To let Maud recognize and load a new data format it is possible to write a standalone class, compile it and add it to the libraries. A specific example with all the instructions and steps is reported at the same address as for the previous geometry case.

Further examples will be available in the future at that address to cover more extendable features of the program.

Fast computing for crystallography and Java

The following technique were used for the program Maud and we relate them to the Java language, but most of them are very general and have their non Java counterpart so you can apply them also to every other language or program.

Most of the people/developers around think of Java as a poor language in term of computation speed. This is not true and the wrong feeling is related principally to two reasons. The first is that Java at the beginning was just an interpreted language not running natively on the platform. This has changed a lot in the last few years with the advent of JIT (Just In Time) compilers whose job is to compile natively on the fly the java binary code for the specific processor as the various classes are loaded and used by the JVM (Java Virtual Machine). Running natively, there are no reasons why a division, a multiplication or a trigonometric function should be slower in Java than in another language. For the Windows platform for example the JIT are so optimized that in a pure mathematical computation most of the time you reach 90-95% of the speed of the C counterpart. The second fact is related instead to the general slowness of the interfaces written in Java. This is mainly due to the fact that most of these interfaces are coded using the Swing library that is very complete and fancy (changeable look and feel included) for the number of widgets it provide, but the internal design was not focused on speed and it is well known it provides a redundant number of event monitoring tasks and layers even for simple windows. But this has nothing to do with the speed of execution in Java. Interfaces written in pure AWT (the original library for interfaces in Java) are instead pretty fast, but more simple. Some developers, to optimize for speed, just use AWT for most of the interface, adding few Swing components just for the more fancy controls.

The case of the swing library is a typical case that shows how in an object oriented environment you can write very robust and complicated programs/libraries, but it is pretty easy to loose speed. Focusing on speed generally weakens your program and it is not very easy to write complicated OO programs that are both robust and fast. Non-OO languages normally permit to easily write fast programs, but then it is very difficult to create one that is very large and robust.

General strategies for speed optimization (profiling + rewriting, caching and native computation)

Before proceeding, I would like to recall the sentence of a developer (I forget the name and even the exact words) but the important point is that "… The developing of a program is generally divided in two parts: the general implementation to provide all the features the program needs, and the optimization step. In my experience the optimization step always starts too early…"

In the case of the Maud program a similar approach has been used and we realized that every time we try an optimization to gain speed we are likely to loose generality, ease of maintenance but we are especially introducing bugs. Let’s see one example.

[MAUD in action]

Fig 2: Maud in action with the JPVM console (for distribute computing) opened. The Daemon has been started on the local machine (principal machine) and it is ready to accept incoming connections. The same console can be used by another machine running Maud to connect to this principal machine by entering the TCP/IP address and port number as shown in the same window. Pressing the "Connect to Host" button, start the connection and then the connecting machine is ready to accept tasks distributed by the principal machine.

Suppose we have a phase object that contains other objects some of which are atoms or lists of atoms. Then we have another object that is in charge of optimizing the atoms positions from a set of intensities. This object needs at a certain time to compute the structure factors based on the current positions of the atoms. So it asks to the phase to compute the structure factor for a certain (hkl); the method (or object, if you want it to be changeable or customizable) that computes the structure factor asks for the current position and scattering factor to each atom. The phase computes then all the equivalent positions based on the current space group and gives them back to the method responsible of structure factor computation. The resulting OO code is simple, easy to read, maintain, debug and modify, but what about speed? If the structure factor is evaluated once or just a few times, than it is perfect; but suppose you have a genetic algorithm to solve the structure that is asking this computation several times, then your program will become extremely slow.

The first technique to speed up the computation is to cache the more requested data instead of re-computing it all the time. In this case we have different possibilities. First the phase may store the list of all actual equivalent positions for all atoms and use them without the need to recalculate them each time. We have to provide a mechanism to re-compute these positions every time something changes like the space-group, one atom position or an atom is removed or added. Further optimization can be provided in the structure factor computation method to optimize the computation for different space groups etc.

The problem with the last approach is that the refreshing mechanism should be very complete and robust; otherwise something will not be computed properly. But then, if the structure of the computation need to be changed, it becomes very difficult because the refreshing mechanism must be taken into account and the object encapsulation has been partially lost. Indeed if the refreshing mechanism is very good, you can even improve the speed over a normal procedural program, as we will avoid redundant computation for something that has not been changed.

In conclusion, the "computing only when refresh needed" technique is very powerful but prone to bugs and costly from the programming point of view. Another drawback of the caching mechanism is about the large memory consumption for storing the data. In general it is not needed that everything in the program is structured following this model. Most of the programs just spend a consistent fraction of cpu time on few functions and just optimizing these few functions is sufficient to get nearly full speed.

In Maud the last approach has been adopted. In any case a general mechanism for refreshing computation only when needed has been provided by a basic object that all the diffraction objects (phases, atoms, instruments, samples etc.) must extend. Then, additional gain in performance has been achieved (in a version still to be released because of testing) using a profiling technique to identify where the program was spending most of the time. These algorithms have then been optimised in order to maximum the speed.

In Java, profiling is included in the basic jdk of every Java implementation. It can be used in every Java program, even without having access to the source code. In the case of Maud, the program can be started by typing the following on a command line (DOS window or UNIX shell; the following is for the UNIX shell):

>java -mx1024M -cp Maud.jar:lib/miscLib.jar……:lib/ij.jar it.unitn.ing.rista.Maud

We are omitting here all the *.jar libraries contained in the lib folder that should go on the place of the dots sequence. To use profiling you just add the following string (for full reference type java -help and follow directions):

-Xrunhprof:heap=sites,cpu=samples,depth=15,monitor=y,thread=y,doe=y

just before the -mx1024M argument that specify the maximum memory amount that can be used by the program (1Gb). If we perform the computation we need to optimize, at the end a log file is produced reporting statistically where the program is spending most of its time (in percentage and cpu times) and also the entire tree of the calling process of every thread. If debug is turned on during the compilation of the java libraries it returns even the corresponding line numbers in the source code. So it is quite easy to identify where you need to optimize your computation efficiently at the minimum developer effort.

In Maud, for a simple structure refining of one phase with one spectrum, most of the cpu time was dedicated to the computing of the profile function for all peaks to generate the spectrum. The first thing done was to optimize this computation in pure Java at maximum by avoiding any unnecessary redundant computation and optimizing every loop. For example a division is well known to be slower than a multiplication so every division by a constant was substituted by the multiplication with the reciprocal value of that constant. Further speed gain on this particular routine can be done by more radical approaches as reported in the following paragraph and they have been tested or are in use in the last Maud version (available soon).

Special techniques for fast computation

Many techniques have been developed for fast computation on today computers. We will report here some of them that we tested using the Maud program. Some of them require special computers or processors; others are more general.

[results of profiling]

Fig 3: Results of the profiling after using the vector library JNIAltivec. As you can see the method vpseudovoigts is now ranked 15th and acoount only for 0.81% of the cpu time. After this profiling also the getTextureAngles (method to convert angles from instrument definitions to pole figure angles for texture analysis) routine has been implemented in the vector library. The window is split in two; the getTextureAngles method tracked in the first panel has a trace number 17187 that can be used to find the entire trace stack shown in the lower panel. Source code lines in parenthesis.

Vector computation

Some recent processors, as for example the Motorola G4 or the upcoming IBM Power4 include a vector unit (for the G4 is called Altivec [3]) that performs operations in parallel on vector elements instead of scalar values. This is useful only if the computation can be arranged to work on vector elements instead of single values and special coding techniques must be used to take advantage of it. For example in the case of the peak function computation, a special Java native vector library (in C/C++ language) including the Pseudo-Voigt has been developed for the Altivec [4]. The native C/C++ routine specification (usable from Java directly) is:

public static native int vpseudovoigts(float[] x, float[] f, int[] minindex, int[] maxindex, float[] intensity, float[] eta, float[] hwhm, float[] position);

It requires an array of "x" coordinates and return the intensities computed in the array "f". The arrays "intensity", "eta", "hwhm", "position", "minindex" and "maxindex" contain the values, for all Pseduo-Voigt peaks, respectively of the integrated intensity, Gaussian content (from 0, full Gaussian, to 1, full Cauchy), half width at half maximum, center position in x coordinate, starting and final index of the x coordinate for the range over which the function should be computed. The native C/C++ routine divides the x and f arrays in multiple lengths of the vector unit (128 bit unit, 4 floats of 32 bit each for the Altivec) and for each PV peak performs the computation in vector basic operations. A different version of the routine was created at the beginning doing the same computation on only one Pseudo-Voigt at time (vpseudovoigt); but then most of the cpu time was spent to move the "x" and "f" arrays between the native library and the Java Virtual machine. So a further gain in speed has been realized passing all the PV peaks at once (vpseudovoigts). Both routines are available in the library. Special care has been devoted in the native library to avoid moving too much data from the "scalar" to the "vector" unit, as this could be a bottleneck in the vector computation. A gain by more than a factor of two has been gained over a pure native scalar approach and even more respect to the pure Java approach. In the overall economy of the program (only simple Rietveld refinements) a gain in speed between 20 and 30% has been realized. More optimizations can be done by converting other critical routines to vector computation.

Parallel computing

Another powerful approach is to use parallel computing to benefit of multiprocessor machines. Java makes it very easy to write programs for parallel computing. Java has built-in a thread system, so the critical part is just to divide the computation in parallel tasks. After this has been accomplished it is possible to put each task in a different thread. Threads are executed in Java concurrently and if more than one processor is available the Java Virtual machine spreads automatically the different threads over the available processors. Creating and running a thread is very easy, so we just write:

 Thread task[] = new Thread[allTasksWeNeed];
 for (int i = 0; i < allTasksWeNeed; i++) {
 (task[i] = new Thread() {
 public void run() {
 // here goes the computation code for each task
 .....
 }
 }).start();
 }
 boolean allTaskFinished = false; // we have to wait now until all tasks has been completed
 while (!allTaskFinished) {
 allTaskFinished = true;
 for (int i = 0; i < allTasksWeNeed; i++)
 if (task[i].isAlive()) allTaskFinished = false;
 // we sleep for a while before to check again
 try {Thread.currentThread().sleep(100);} catch (InterruptedException e) {}
 }

So in these few lines we create a subclass of Thread, we overwrite the method run() to perform our task and we start the thread with the method start(); this for all the tasks we need to run in parallel. At the end we have to wait and check until all threads have finished their computation.

Distributed computing

Multiprocessor machine (with more than 2 processors) are not widely available, so a less efficient but still powerful approach is to distribute the parallel tasks over networked computers. A well known example of this is the SETI at Home project [5]. The JPVM [6] library has been modified and integrated in Maud to provide the support for distribute computing. The JPVM is a library emulating the PVM distributed system and uses basically messaging to spread the computation over others JVMs. The advantages of the system are that it can be used in a heterogeneous network, both over TCP/IP or a cluster. The drawback is that being a messaging system it performs efficiently only for long independent tasks. Otherwise too much time will be spent on messaging over computation. After some unsuccessful trials where the computation was not distributed efficiently, the final strategy was to use the JPVM directly in the least squares algorithm at the beginning. Maud, differently form other Rietveld programs, uses numerical derivatives. This permits the incorporation of some methodologies for which no analytical expression of the derivative is possible. The disadvantage is that for each refinable parameter at least one time the entire function (or all spectra) needs to be computed, so the speed cannot match in any case the one of a program using analytical derivatives. Using JPVM for derivative computation, each distributed task will be in charge of computing just one derivative and also the least squares matrix can be divided over the network decreasing the memory requirement per computer during such process. In a sufficiently fast network environment this greatly improves the computation speed in the least squares step. The system is still under testing and optimization.

Conclusions

The use of the Java environment has permitted us to write the Maud program with an OO structure that is easily extendable and quite powerful. In the present article we showed some examples of how easy it is to extend some features by writing the desired extension. Also, we have shown how it is possible to optimize the program for speed using different approaches ranging from code optimization, to vector, parallel and distribute computing. Our libraries, which are usable by other programs, are available via the Internet.