HYBRID-PARALLEL FORMULATION OF FUNDAMENTAL QUANTUM-CHEMICAL ALGORITHMS

Hybrid-parallel variants of the Hartree-Fock, Kohn-Sham and Møller-Plesset second-order perturbation theory methods are described. Their efficiency with respect to the serial and MPI-based parallel implementations is measured and briefly analyzed. It is shown that while hybrid parallelization provides increased efficiency in all cases, the magnitude of the effect strongly depends on the features of the particular algorithm.


Introduction
The demand for high-quality computational chemistry results for large molecular systems is growing significantly. As a result, the chemistry community's need for computational power is constantly increasing. This trend is accompanied by the rapid development of high-performance hardware solutions. The most abundant computing systems used for computational chemistry are clusters, which, owing to the specificity of quantum-chemical calculations, provide the best efficiency-to-price ratio.
The multicore architecture is now well established as the most typical hardware solution for cluster nodes. While such systems can be served very well by the message-passing approach usually found in parallel formulations of computational chemistry algorithms, the shared memory and fast synchronization mechanisms available within a node can be exploited to improve the efficiency. This gives rise to so-called hybrid parallelism, where two or more of MPI [2], OpenMP [3], POSIX Threads (PT) [1] and/or General-Purpose computing on Graphics Processing Units (GPGPU) techniques are combined to utilize the computational resources more efficiently than any one of them on its own [7].
We designed such a hybrid approach for Hartree-Fock (HF) [6], Configuration Interaction Singles (CIS) [6], Kohn-Sham (KS) [8], Time-Dependent Density Functional Theory (TD-DFT) [5], and Møller-Plesset second-order perturbation theory (MP2) [6] calculations. The algorithms are implemented within the niedoida [10] computational chemistry package. Building upon an optimized message-passing parallel implementation of these quantum-chemical methods [9], we introduced a further level of hierarchy that employs PT-based parallelism within a batch of tasks scheduled for a given node by the MPI dispatcher.
The paper is structured as follows. The core computational quantum chemistry algorithms are introduced in Section 2. In Section 3 our implementation of the hybrid parallel algorithms is briefly presented. Benchmark results are shown and analyzed in Section 4. Conclusions are drawn in Section 5.

Core Algorithms of Computational Quantum Chemistry
A large part of computational quantum chemistry can be considered to be just a glorified name for simple tensor algebra. Specifically, it is the various transformations of the two-electron integral (electron repulsion integral, ERI) tensor that constitute the rate-determining step of the calculations. The ERI are four-index quantities defined over the atomic orbitals χ. It has to be noted that the sheer size of the ERI tensor and its sparsity pattern prevent standard algebraic treatment. While the actual resource requirements depend strongly on the type of tensor contraction performed by a specific algorithm, we can distinguish two main types: the two-index contraction and the four-index transformation. The former constitutes the core of the Hartree-Fock [11,6] and hybrid Kohn-Sham [4] methods. The latter is crucial in the post-Hartree-Fock methods, for which we used Møller-Plesset second-order perturbation theory (MP2) [6] as a representative example.
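For orientation, the ERI and the two contraction types can be written in one common textbook notation. The index labels, the generic matrix D and the orbital coefficient matrix C are notational assumptions made here and may differ from the symbols used in the original typesetting.

```latex
% ERI over real atomic orbitals \chi (chemists' notation)
(\mu\nu|\lambda\sigma) = \iint
  \chi_\mu(\mathbf{r}_1)\,\chi_\nu(\mathbf{r}_1)\,
  \frac{1}{|\mathbf{r}_1-\mathbf{r}_2|}\,
  \chi_\lambda(\mathbf{r}_2)\,\chi_\sigma(\mathbf{r}_2)\,
  d\mathbf{r}_1\,d\mathbf{r}_2

% two-index contraction with a density-type matrix D (assumed symbol)
G_{\mu\nu} = \sum_{\lambda\sigma} (\mu\nu|\lambda\sigma)\,D_{\lambda\sigma}

% four-index transformation with orbital coefficients C (assumed symbol)
(pq|rs) = \sum_{\mu\nu\lambda\sigma}
  C_{\mu p}\,C_{\nu q}\,C_{\lambda r}\,C_{\sigma s}\,(\mu\nu|\lambda\sigma)
```

Written this way, the two-index contraction reduces the four-index tensor to a matrix, whereas the four-index transformation produces another four-index tensor.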
Apart from the complexity of the ERI tensor transformations, the cost of calculating the integrals is not negligible. Hence, efforts are usually made to cache at least part of the tensor and reuse it in subsequent computations.
The Kohn-Sham method additionally requires the calculation of the exchange-correlation potential matrix. Typically this is done by numerical integration, which introduces additional cost. On the other hand, pure (non-hybrid) DFT does not actually require the full-fledged ERI tensor, because the Coulombic contribution to the total potential energy can be obtained by a density-fitting procedure [12]. Such an approach is usually denoted DF-DFT. This simplification does not hold for hybrid potentials, which include part of the Hartree-Fock potential contribution and therefore involve the ERI tensor.
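The saving offered by density fitting can be made explicit. In one widely used formulation (an assumption here, since the text does not state which fitting variant is applied), the four-index ERI are approximated through three-index and two-index integrals over an auxiliary basis, whose functions are labeled P and Q. The symbols V, d, D and J are likewise illustrative.

```latex
% density-fitting approximation of the ERI (assumed variant)
(\mu\nu|\lambda\sigma) \approx \sum_{PQ} (\mu\nu|P)\,[\mathbf{V}^{-1}]_{PQ}\,(Q|\lambda\sigma),
\qquad V_{PQ} = (P|Q)

% Coulomb matrix built from three-index quantities only
d_P = \sum_{Q} [\mathbf{V}^{-1}]_{PQ} \sum_{\lambda\sigma} (Q|\lambda\sigma)\,D_{\lambda\sigma},
\qquad
J_{\mu\nu} \approx \sum_{P} (\mu\nu|P)\,d_P
```

Under this assumption the Coulomb term never touches a four-index object, which is consistent with the remark that pure DFT can avoid the full ERI tensor.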
The specifics of the methods may have a strong impact on the efficiency of a parallel implementation. Hence, we include in the analysis representative algorithms from all of the above methodologies. To keep the paper concise, we decided to focus on the HF, DFT, DF-DFT and MP2 methods. The structure of the CIS and TD-DFT implementations is analogous to that of HF and DFT, respectively. Hence, they do not require a separate analysis.
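The two contraction types distinguished in this section can be illustrated on a small dense tensor. The following is a minimal NumPy sketch, not the niedoida implementation: it ignores sparsity, integral symmetry and caching, and the function names as well as the arrays `eri`, `dens` and `coeff` are hypothetical.

```python
import numpy as np


def two_index_contraction(eri, dens):
    """Contract the last two ERI indices with a matrix: G[m,n] = sum_ls eri[m,n,l,s] * dens[l,s]."""
    return np.einsum("mnls,ls->mn", eri, dens)


def four_index_transformation(eri, coeff):
    """Transform all four ERI indices with `coeff`, one index at a time.

    Doing four successive quarter transformations costs O(N^5) each,
    instead of O(N^8) for the naive single summation.
    """
    tmp = np.einsum("mnls,mp->pnls", eri, coeff)   # first index
    tmp = np.einsum("pnls,nq->pqls", tmp, coeff)   # second index
    tmp = np.einsum("pqls,lr->pqrs", tmp, coeff)   # third index
    return np.einsum("pqrs,st->pqrt", tmp, coeff)  # fourth index


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 4  # toy basis size; real calculations use hundreds of functions
    eri = rng.random((n, n, n, n))
    dens = rng.random((n, n))
    coeff = rng.random((n, n))
    print(two_index_contraction(eri, dens).shape)       # a matrix
    print(four_index_transformation(eri, coeff).shape)  # a four-index tensor
```

The contraction returns an N×N matrix, while the transformation keeps a full four-index intermediate alive at every step, which hints at why the two classes of methods stress memory so differently.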

Implementation details
The parallel variants of the core quantum-chemical algorithms are formulated by splitting the calculations into tasks which are executed concurrently. To fully exploit the specifics of the typical cluster architecture, the concurrency is introduced at two different levels.
At the cluster level, the parallelism is implemented using MPI. The computational problem is divided into tasks of different sizes. The tasks are stored in a task queue. A node gets the next task from the queue as soon as it completes the previous one. The splitting of the problem into tasks is organized as follows. A fraction 1/f of the original problem is divided into n tasks, where f is the splitting factor and n stands for the number of nodes. Then the procedure is repeated recursively for the remaining part of the problem. The recursion is stopped when the size of the remaining part is smaller than the threshold t. The algorithm is thus parametrized by f and t. For the details of the procedure see Ref. [9].
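The splitting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the problem size is measured in abstract integer work units, each round's share is divided as evenly as possible among the n tasks, and the part left over once the threshold is reached is emitted as one final task. The text does not specify these details (they are deferred to Ref. [9]), and the function name `split_tasks` is hypothetical.

```python
def split_tasks(total, n_nodes, f, t):
    """Return task sizes in queue order for a problem of `total` work units.

    In each round a fraction 1/f of the remaining work is divided into
    `n_nodes` tasks; the rounds stop once the remainder drops below `t`.
    """
    if total < 0 or n_nodes < 1 or f < 1 or t < 1:
        raise ValueError("expected total >= 0, n_nodes >= 1, f >= 1, t >= 1")

    tasks = []
    remaining = total
    while remaining >= t:
        chunk = int(remaining / f)  # fraction 1/f of what is left
        if chunk == 0:
            break
        base, extra = divmod(chunk, n_nodes)
        # Assumption: spread the chunk as evenly as possible over n tasks.
        sizes = [base + 1] * extra + [base] * (n_nodes - extra)
        tasks.extend(size for size in sizes if size > 0)
        remaining -= chunk

    if remaining > 0:
        # Assumption: the part below the threshold becomes one last task.
        tasks.append(remaining)
    return tasks


if __name__ == "__main__":
    print(split_tasks(1000, n_nodes=4, f=2, t=100))
```

With 1000 work units, 4 nodes, f = 2 and t = 100, the sketch yields 17 tasks, starting with four tasks of 125 units and ending with a remainder of 63 units. Large tasks come first and small ones last, which lets the queue balance the load towards the end of the run.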
At the node level, POSIX threads are employed to exploit the multicore architecture. The task assigned to the node by the MPI dispatcher is further divided into subtasks, which are put in a subtask queue. Then a group of threads is woken up. Each thread gets a subtask from the queue and executes it. This is repeated until the subtask queue is empty. Then the next task is requested from the MPI dispatcher. The operation is repeated until the whole computational problem is completed.
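The node-level loop can be sketched with Python's standard `threading` and `queue` modules standing in for POSIX threads. This is a simplified illustration, not the actual implementation: threads are created per task rather than being woken from a persistent pool, the MPI dispatcher is replaced by a plain list of tasks, and the names `run_task_on_node`, `run_node`, `split` and `work` are hypothetical.

```python
import queue
import threading


def run_task_on_node(subtasks, n_threads, work):
    """Process all subtasks of one task with a group of worker threads.

    `work` is a hypothetical callable standing in for the real computation.
    Results are collected in completion order, which is not deterministic.
    """
    subtask_queue = queue.Queue()
    for subtask in subtasks:
        subtask_queue.put(subtask)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                subtask = subtask_queue.get_nowait()
            except queue.Empty:
                return  # queue drained: this thread is done with the task
            value = work(subtask)
            with results_lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def run_node(tasks, n_threads, split, work):
    """Handle tasks one after another, as if each came from the dispatcher.

    `split` is a hypothetical callable that turns a task into its subtasks.
    """
    return [run_task_on_node(split(task), n_threads, work) for task in tasks]


if __name__ == "__main__":
    # Toy example: a task of size k is split into the subtasks 0..k-1.
    print(run_node([3, 5], n_threads=4, split=range, work=lambda x: x * x))
```

Because every thread pulls its next subtask only when it becomes idle, uneven subtask costs are balanced automatically within the node, which mirrors the flexible dispatching described above.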

Results
To assess the usefulness of the proposed approach, a series of calculations was run in parallel mode and their performance relative to single-core execution times was measured.
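Assuming the conventional definition is meant, the relative speedup on p cores is the ratio of the single-core wall time to the parallel wall time. The symbols T_1, T_p, S_p and E_p are introduced here for illustration only, and the parallel efficiency E_p is listed merely for reference.

```latex
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}
```

An ideal parallel run would give S_p = p and E_p = 1.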
All calculations were performed for a molecule of 1,3-diphenylisobenzofuran (see Fig. 1), as a representative case of medium-sized molecular systems. At the HF and MP2 levels of theory the cc-pVTZ atomic basis set was used. Given that the quality of the DFT results depends to a much smaller extent on the size of the basis set, the cc-pVDZ basis set was used for the DFT methods. To allow for performance analysis, we ran all the calculations on two different computer systems. The first was a cluster of 4-core nodes connected via Gigabit Ethernet. Each of the nodes was equipped with a Xeon X3450 CPU running at 2.67 GHz and 8 GB of memory. The other system was a 12-core fat node comprising two 6-core Xeon X5670 CPUs running at 2.93 GHz and 70 GB of memory. To avoid any interference with other processes running on the computer, we elected to use only 10 of the cores for the benchmark jobs.
The multicore fat node results (see Tab. 1) show that the difference between the MPI and thread-based parallelism is not very significant for CPU-bound methods. This is not the case for MP2, which seems to profit greatly from the shared memory available in the multithreaded version.
The speedup data obtained for the cluster (see Tab. 2) to some extent follow the results obtained for the single fat node. However, the quantitative picture differs slightly. First of all, the advantage of the hybrid approach is more pronounced for the CPU-bound methods than in the fat node case. This is attributed mainly to more flexible dispatching of the subtasks, smaller communication overhead and better integral cache utilization compared with purely message-passing parallelism. On the other hand, the MP2 speedup, while significant, is not as large as in the case of the single-node multicore system. This can be explained by the fact that, in the hybrid approach, the size of the shared memory lies between those of the purely multithreaded and purely MPI cases.

Conclusions
The hierarchical hybrid implementation provided increased efficiency in all cases we investigated. However, the magnitude of the effect strongly depends on the features of the particular algorithm. Among the algorithms under study, the most pronounced effect is observed in the case of MP2. In general, the algorithms which can effectively exploit shared memory are those that profit the most.

Table 1
Relative speedup of selected parallel calculations on the 12-core fat node (only 10 cores were used).

Table 2
Relative speedup of selected parallel calculations on the cluster.