DEVELOPMENT AND ACCELERATION OF UNSTRUCTURED MESH-BASED CFD SOLVER

The study was undertaken as part of a larger e ̈ort to establish a common computational §uid dynamics (CFD) code for simulation of internal and external §ows and involves some basic validation studies. The governing equations are solved with ¦nite volume code on unstructured meshes. The computational procedure involves reconstruction of the solution in each control volume and extrapolation of the unknowns to ¦nd the §ow variables on the faces of control volume, solution of Riemann problem for each face of the control volume, and evolution of the time step. The nonlinear CFD solver works in an explicit time-marching fashion, based on a three-step RungeKutta stepping procedure. Convergence to a steady state is accelerated by the use of geometric technique and by the application of Jacobi preconditioning for high-speed §ows, with a separate low Mach number preconditioning method for use with low-speed §ows. The CFD code is implemented on graphics processing units (GPUs). Speedup of solution on GPUs with respect to solution on central processing units (CPU) is compared with the use of di ̈erent meshes and di ̈erent methods of distribution of input data into blocks. The results obtained provide promising perspective for designing a GPU-based software framework for applications in CFD.


INTRODUCTION
Fluid §ow and heat transfer occur in nature and engineering.There are numerous naturally occurring phenomena as well as technical and technological applications in which §uid §ows and heat transfer play an important role.
The methods of CFD are extensively applied in design and optimization of industrial devices to get more insight into three-dimensional (3D) unsteady §ows through §uid or gas passages.Accurate prediction of compressible §ows still remains a challenging task despite a lot of work in this area.The quality of CFD calculations of the §ows strongly depends on the proper prediction of §ow physics (shock waves, rarefaction waves, and recirculation regions).Investigations of heat transfer, skin friction, secondary §ows, §ow separation, and reattachment e¨ects demand reliable numerical methods, accurate programming, and robust working practices.
The stagnation in the clock-speed of CPU has led to signi¦cant interest in parallel architectures that o¨er increasing computational power by using many separate processing units.Modern graphics hardware contains such an architecture in the form of the GPUs.The GPU platforms including GPU clusters make it possible to achieve speedups of an order of magnitude over a standard CPU in many CFD applications and are growing in popularity [1].
Speed and accuracy are the key factors in the evaluation of CFD solver performance.In CFD applications, the increasing demands for accuracy and simulation capabilities produce an exponential growth of the required computational resources.High performance computing (HPC) resources are widely used in engineering applications.
The use of GPUs is a cost e¨ective way of improving the performance in CFD applications [2].Taking advantage of any multicore architecture requires programs to be written for parallel execution.For CFD, this has traditionally meant splitting the §ow domain into several parts (domain decomposition) that are solved independently on each processor node in a cluster, with the §ow properties at boundaries being communicated between the nodes after each time step (processor balancing).This is also the process adopted for GPUs, but the GPU introduces several additional constraints that make the stream programming paradigm particularly useful [3].
Although GPU has attractive characteristics for massively parallel computations, it has not been implemented in CFD for a long time due to the complex programming techniques.Developers must have special knowledge about computer graphics which is unfamiliar for general CFD researchers.Thanks to the CUDA (Compute Uni¦ed Device Architecture) library provided by NVIDIA, researchers are free from the restrictions of computer hardware knowledge and need to concentrate on CFD algorithms and CUDA programming language.
Depending on the complexity of the CFD problem to represent and solve, structured or unstructured meshes are used.Computational algorithms are more e©ciently implemented on structured meshes and data structures to handle the mesh are easy to implement [4,5].However, structured meshes present poor accuracy if the problem to be solved has complex internal or external boundaries.On the other hand, unstructured meshes present more §exibility and higher accuracy to represent problems that have complex geometries and boundaries [6].
However, the data structures to handle it are not easy to implement and also, explicit neighboring information should be stored.
Much of the e¨orts in running CFD codes on GPUs has been directed toward the case of CFD solvers based on structured and block-structured meshes [3, 7 12].These solvers are easily to implement on GPUs due to their regular memory access pattern.There are various examples of implementation of CFD solvers on structured meshes for simulation of §ows of viscous incompressible §uid [1315].
Unstructured mesh based analysis methods on HPC systems with shared memory and distributed memory have been largely studied.However, shared and distributed memory systems are fundamentally di¨erent from GPUs.A GPU is a SIMT (Single Instruction Multiple Thread) engine, whereas shared and distributed memory systems are MPMD (Multiple Program Multiple Data) engines.The common aspect of these parallel engines is that in both of them, the mesh application is limited by memory latency.Achieving good performance for unstructured mesh based CFD solvers on GPUs is more di©cult due to their data dependent and irregular memory access patterns [1618].
The most of the work done so far has either been for relatively small codes written from scratch or for a small portion of a large existing code.However, GPU support is available in mathematical packages (MATLAB) and commercial CFD solvers (ANSYS CFX and ANSYS Fluent).
Multigrid and preconditioning techniques are widely used to increase performance of the CFD code.A key issue is e¨ective implementation of these techniques on unstructured meshes.
The present work is undertaken as a part of a larger e¨ort to establish a common CFD code for simulation of §ows in aerospace and mechanical applications and involves some basic validation studies.Up to now, a few researches on fully 3D compressible NavierStokes GPU solver for engineering applications have been reported.The motivation of this paper is to assess the in-house compressible CFD code and to demonstrate successful design of a highly parallel computation system based on GPUs and validate the speedup factor compared with CPU.The code is programmed following the standard of CUDA C language.Single precision arithmetic is kept through the entire residual computations with the help of latest GPU hardware and careful design of CFD code.The results obtained are generally in reasonable agreement with the available experimental and computational data reported in literature.The parallelization methods are studied and speedup factor by GPU cards is measured.
The pressure is calculated as The vector of conservative variables, Q, and the §ux vectors, F x , F y , and F z , have the following form: The components of viscous stress tensor and components of heat §ux vector are found from the relations: Here, t is the time; ρ is the density; v x , v y , and v z are the velocity components in the coordinate directions x, y, and z; p is the pressure; e is the total energy per unit mass; T is the temperature; and γ is the ratio of speci¦c heat capacities.The governing equations written in the form (1) are suitable for both laminar and turbulent §ows.In simulation of turbulent §ows, the e¨ective viscosity, µ e , is calculated as a sum of molecular viscosity, µ, and eddy viscosity, µ t , and the e¨ective thermal conductivity, λ e , is expressed in terms of viscosity and Prandtl number.They are expressed as where c p is the speci¦c heat capacity at constant pressure.Molecular and turbulent Prandtl numbers are Pr = 0.72 and Pr t = 0.9 for air.The Sutherland£s law is used to obtain molecular viscosity as a function of temperature: where µ * = 1.68 • 10 −5 kg/(m•s); T * = 273 K; and S 0 = 110.5K for air.

NUMERICAL METHOD
The nonlinear CFD solver works in an explicit time-marching fashion, based on a RungeKutta stepping procedure.The §ux vector is split into the inviscid and viscous components.The governing equations are solved with upwind ¦nite di¨erence scheme for inviscid §uxes and with central di¨erence scheme of the second order for viscous §uxes.The unstructured CFD code developed uses an edge-based data structure to give the §exibility to run on meshes composed of a variety of cell types [19].
In conservative variables, the equation describing an unsteady viscous compressible gas §ow is written as where Q is the vector of conservative variables; F (Q) is the vector of inviscid §uxes; and G(QU, ∇Q) is the vector of viscous §uxes.Integrating Eq. ( 2) over the control volume V with boundary ∂V , whose orientation is speci¦ed by the outer unit normal n, and applying the Gauss Ostrogradsky theorem, gives The governing equation (3) solved by the CFD code is of the form: where Q is the §ow variables vector averaged over the control volume.The §ow residual is where L(Q) denotes all the spatial di¨erencing terms and S(Q) denotes the terms from boundary conditions and possible source terms.
Equation ( 4) is written in the form: where L(Q n i ) is the di¨erential operator.The subscript i refers to the control volume and the superscript n refers to the time layer.
The three-step RungeKutta method is used for discretization of Eq. ( 5) in time: Here, . The inviscid §ux is found from the relation: where the subscripts L and R refer to cells on the left and on the right edges of the control volume.The matrix A is presented in the form A = R˜L where ˜is the diagonal matrix composed from the Jacobian eigenvalues and R and L are the matrices composed from its right and left eigenvectors, respectively.The time step is found from the estimation of the inviscid and viscous §uxes: where CFL is the CourantFriedrichsLewy number and β ∼ 0.5.The time step -t 1 i is calculated on the basis of the spectral radius of the Jacobian of the discrete inviscid operator and the time step -t 2 i is found on the basis of the quasi-linear form of viscous §uxes written in primitive variables and the spectral radius of the Jacobian of the discrete viscous operator.
Convergence to a steady state is accelerated by the use of multigrid techniques [20] and by the application of block-Jacobi preconditioning for high-speed §ows, with a separate low Mach number preconditioning method for use with lowspeed §ows [21,22].The sequence of meshes is created using an edge-collapsing algorithm.
The computational procedure is implemented as a computer code in C/C++ programming language.Parallelization of the computational procedure is performed by a message passing interface (MPI).The CUDA technology is used to implement GPU version of the code.

PRECONDITIONING
Numerical methods for compressible gas equations that perform well at moderately subsonic and supersonic §ow velocities become low e¨ective or unsuitable as applied to §ows at low Mach numbers (M < 0.2), which is manifested by slower convergence of time marching to a steady state and by the loss of accuracy of the resulting steady-state solutions.The slower convergence of time marching is explained by the fact that the sti¨ness of the compressible Euler and NavierStokes equations increases as M → 0 (this feature is exhibited at the di¨erential level).The sti¨ness is characterized by the ratio of the maximum to minimum eigenvalues of the Jacobian (the ratio of the maximum to minimum propagation velocities of perturbations).The integration time step is determined by the velocity of the fastest wave (acoustic waves, λ = |u + c|), while the time required for reaching a steady state depends on the velocity of the slowest wave (convective waves, λ = |u|).In viscous problems and turbulent §ow computations on stretched meshes in boundary layers, the time step is restricted by the acoustic solution modes and by the mesh size in the normal direction to the wall.
Preconditioning makes it possible to eliminate the sti¨ness of the original system and to accelerate the convergence of time marching to a steady state.Additionally, subsonic §ows can be computed more accurately by applying a modi¦ed discretization of convective §uxes in the preconditioned equations.In the general case, preconditioning changes the form of the underlying equations and the properties of di¨erence schemes because it introduces arti¦cial viscosity.
A preconditioning method is developed.It makes it possible to construct a universal numerical procedure for computing inviscid and viscous compressible §ows in a wide range of Mach numbers (from essentially subsonic to transonic and supersonic §ow velocities).The preconditioning matrix is constructed by applying the approach proposed in [23].This approach relies on physical variables (one of which is temperature).Its features include a speci¦c form of writing §uxes, the computation of a dissipative term in the course of ¦nding the §uxes through control volume faces and a speci¦c representation of matrices in the diagonalization of the inviscid §ux Jacobian of the preconditioned system.

MULTIGRID METHOD
The multigrid method is not a ¦xed technique but rather a stencil and a buildup construction whose implementation e©ciency depends on the adaptation of its components to the problem in question.
The multigrid cycle consists of the following steps: presmoothing, residual calculation at the current mesh level, restriction and correction of the residual on the coarse mesh, prolongation and interpolation of the error to a ¦ne mesh, correction of the ¦ne mesh solution using the correction interpolated from the coarse mesh (coarse mesh correction), and postsmoothing for the suppression of the high-frequency error components appearing after the interpolation to the ¦ne mesh.The computations terminate when a prescribed accuracy is achieved.
The smoothing method (smoother) damps the high-frequency error modes and the element of the multigrid method appears that is most dependent on the type of the problem.The role of the smoother is not so much to reduce the total error but rather to smooth it (suppress the high frequencies) so that the error admits a good approximation on the coarse mesh.Standard smoothers are linear iterative methods (Jacobi, GaussSeidel, and incomplete factorization methods).
The quality of the multigrid method is determined by the chosen sequence of meshes and interpolation operator.Quality criteria include the convergence factor which shows how fast the method converges (how many iteration steps are required to achieve the prescribed level of the residual) and the complexity of the restriction and prolongation operators which determines the number of operations and the amount of storage required for each iteration step.
The implementation of the multigrid approach is illustrated in Fig. 1  (V-cycle).The iterations begin with the mesh of the highest resolution (level 1).The number of mesh levels is n level .The arrows indicate the transmission of data from one mesh level to another.The horizontal arrows show that the computations (smoothing) occurs at a single mesh level.The CFL number is estimated by executing n start iteration steps (one iteration step is used by default).The parameter n pre speci¦es the number of iterations used for presmoothing at each mesh level (in the transition from a ¦ne to a coarse mesh).The parameter n post de¦nes the number of iterations used for postsmoothing (in the transition from a coarse to a ¦ne mesh).The parameter n crs speci¦es the number of iterations executed on the coarsest mesh.A multigrid method is proposed for solving the Euler and NavierStokes equations on an unstructured mesh.The multigrid technique producing a sequence of meshes via collapsing faces is implemented using a structure associated with mesh faces (edge weights).The discretization of the equations on a sequence of nested meshes is easy to implement, since the ¦nite-volume method developed has no constrains on the number of cells, their faces, or/and the cell shape (for hybrid meshes, the cell shape in the original mesh changes in the transition to the next mesh level).

PARALLEL IMPLEMENTATION
The performance of critical portion of the CFD solver consists of a loop which repeatedly computes the time derivatives of the conserved variables.The conserved variables are then updated using an explicit RungeKutta time-stepping procedure.The most expensive computation consists of accumulating §ux contributions across each face when computing the time derivatives.Therefore, the performance of the CUDA kernel which implements this computation is crucial in determining whether or not high performance is achieved.For simplicity, in some cases, discussion is restricted to two-dimensional (2D) cases.
The ¦nite-volume mesh is generated from input data with the appropriate setting of initial and boundary conditions.The computation steps required by the problem considered are classi¦ed into two groups: computations associated to faces and edges and computations associated to volumes.The numerical scheme exhibits a high degree of data parallelism because the computation at each edge/volume is independent with respect to the computation performed at the rest of edges/volumes.Moreover, the explicit scheme presents a high arithmetic intensity and the computation exhibits a high degree of locality.
The implementation is split between CPU and GPU.Pre-and postprocessing steps are done on the CPU, leaving only the computation itself to be performed on the GPU.For example, the CPU constructs the mesh and evaluates the face areas, face normals, and cell volumes.The initialization of the §ow¦eld is also done on the CPU.Each time step of the computation then involves a series of kernels on the GPU which evaluate the cell face §uxes, sum the §uxes into the cell, calculate the change in properties at each node, smooth the variables, and apply boundary conditions.Each kernel operates on all the nodes (no distinction is made between boundary nodes and interior nodes).This causes di©culties if an e©cient code is to be obtained.For example, the change in a §ow property at a node is formed by averaging the §ux sums of the adjacent cells (in 2D case, for mesh with quadrangle cells, four cells surround an interior node, but only two at a boundary node).This problem is overcome using dependent texturing.The indices of the cells required to update a node are precomputed on the CPU and loaded into GPU texture memory.For a given node, the kernel obtains the indices required and then looks up the relevant §ux sums which are stored in a separate GPU texture.This avoids branching within the kernel.
The implementation of the ¦nite-volume method using a global memory and register ¦le is illustrated in Fig. 2 for the scheme of the 1st order.To increase the order of accuracy of the numerical scheme, an approach based on piecewiselinear distributions of the parameters over a control volume is applied.Each time layer calculations are performed in two stages.Two kernels are used for the parallel implementation of the ¦nite-volume method on GPU, one of which calculates the §ow through the faces of control volumes (stage 1), and the other one provides §ow variable calculations on the next time layer (stage 2).On the ¦rst stage, §ow variables in the centers of control volumes are stored in global memory (array Q).One thread is used to calculate the §uxes through the faces of control volume.Each thread uses the §ow variables vector in adjacent control volumes, i and i + 1. Fluxes through cell faces are stored in array F .On the second stage, a set of threads corresponding to the same number of control volumes is launched to calculate the §ow variables vector on a new time level.The §uxes through the faces i − 1/2 and i + 1/2 are used and the solution is computed in the control volume i.The solution is then stored in the array Q.
The most costly stage in the algorithm is edge-based calculations involving two calculations for each face communicating two cells.This contribution is computed independently for each face and is added to the partial sums associated to each cell.For each control volume, the local time step is computed.The computation for each volume does not depend on the computation for the rest of volumes and, therefore, this stage is performed in parallel.The minimum of all the local time steps previously obtained for each volume is computed.The (n + 1)th state of each control volume is approximated from the nth state using the data computed in the previous phases.This stage is also completed in parallel.

TEST CASES
A series of model simulations in a wide range of Mach numbers are used to analyze the convergence rate and accuracy of steady-state solutions of the original and preconditioned gasdynamics equations.These equations are integrated until a steady-state solution is reached.The results computed are in a good agreement with the computational and experimental data available in the literature.The results obtained with di¨erent methods and procedures are compared and conclusions related to advantages and disadvantages of di¨erent approaches are made.

Flow Through a Channel with a Bump
The §ow through a plane channel with a bump is considered.The length-toheight ratio in the channel is L/H = 4 and the maximum height of the bump (which is a circular arc) is 0.1H (the maximum bump is 10% of the channel width).The computations are performed on a mesh of 120 × 20 cells (Fig. 3) with 60 nodes placed on the bump surface.
Flows through a channel with a bump are computed, for example, in [24,25].Speci¦cally, the implicit Euler time di¨erencing and the BeamWarming scheme for discretizing inviscid §uxes are used in [25].The computations in [24] are based on Godunov£s method and are performed in a wide range of Mach numbers.
The velocity (U = 3.47 m/s), pressure (p = 10 5 Pa), and temperature (T = 300 K) are applied to the inlet cross section of the channel, while mild boundary conditions (free out §ow) are speci¦ed in the outlet cross section.The inlet  Figure 4 displays level lines of the velocity magnitude.In contrast to the solution of the original equations, a velocity distribution symmetric about the vertical axis is obtained in the case of preconditioning.
The convergence rate of the time marching procedure is shown in Fig. 5.The original equations are solved in conservative variables, while the preconditioned equations are computed in physical variables.Curves 1 and 2 depict the residuals (in physical variables) caused by discretizing the momentum equation, while curves 3 show the residuals caused by discretizing the pressure equation.In the case of preconditioning, the prescribed residual is obtained after about 3500 iteration steps.For the original equations, the residuals with respect to velocity and pressure are two orders of magnitude higher and the convergence rate nearly ceases to vary after 4000 time steps.
To test performance accuracy of the method in a wide range of Mach numbers, the computations are performed in subsonic, transonic, and supersonic regimes.More speci¦cally, the §ow is computed in a channel with a 10 percent bump (as in the underlying version) on a mesh of 144 × 32 cells at M = 0.5 (subsonic) and 0.675 (transonic) and in a channel with a 4 percent bump on a mesh of 220 × 60 cells at M = 1.65 (supersonic).
For subsonic and supersonic regimes, Fig. 6 shows the level lines of the velocity magnitude at various inlet Mach numbers.For relatively low inlet Mach numbers, the §ow is nearly symmetric the vertical axis 6a).weak asymmetry of the §ow is associated with the leading and trailing edges of the bump (a horseshoe vortex of weak intensity develops behind the bump).To eliminate these shortcomings, the §ow characteristics near the corner points are computed by interpolating the §ow parameters from interior nodes of the computational domain [24].At high inlet Mach numbers, shock waves develop and interact in the §ow (Fig. 6b).The inclination angles of the shocks and the level lines agree well with the numerical data presented in [24].
Figure 7 presents the Mach number distributions on the upper (curves 1) and lower (curves 2) walls of the channel in various §ow regimes.These distributions agree well with numerical data from [24] (as in the case of velocity magnitude level lines, weak di¨erences are observed on the lower wall of the channel near the corner points).
The numerical results obtained in the test problem suggest that the numerical method developed has a su©cient accuracy for resolving characteristic features of incompressible and compressed §ows.Due to the preconditioning procedure, the convergence rate of time marching is made of the Mach At low Mach numbers, the CPU time required for solving the preconditioned equations is more by about 15% (due to an increase in the number of arithmetic operations) than in the case of the original equations.

Flow over Airfoil
Transonic §ow of viscous compressible gas over RAE2822 airfoil is considered at the angle of attack of 2.4 • .The length of the computational domain is 50L (15L before the airfoil and 25L behind the airfoil), and the width of the computational domain is 20L where L is the cord length (L = 1 m).Unstructured triangle mesh contains 38265 cells (Fig. 8).A number of time steps is 10 4 .
Free stream boundary conditions (Mach number is ¦xed at M ∞ = 0.725) are used on the inlet boundary.The parameters speci¦ed correspond to those used in [26].No-slip and no-penetration boundary conditions are applied to airfoil.Airfoil is treated as adiabatic boundary.Zero-gradient boundary conditions are applied to the outlet boundary.Slip boundary conditions are speci¦ed on the top and bottom boundaries.In the computations, four mesh levels and the mesh generation technique based on cells collapsing in the direction of the shortest face (semicoarsening method), which makes it possible to capture the boundary layer on the airfoil, are used.Reynolds number is 1.2 • 10 6 and SpalartAllmaras model is applied.
The computations are performed on an unstructured triangular mesh consisting of 11 298 nodes (version 1a) and on a hybrid mesh consisting of 19 126 nodes (version 1b).A structured mesh is used near the airfoil.In the case of a hybrid mesh, the number of cells is 24 339 in the mesh of level 1 (10 692 triangles and 13 647 quadrilaterals), 11484 in the mesh of level 2 (5527 triangles and 5957 quadrilaterals), 5528 in the mesh of level 3 (2795 triangles and 2733 quadrilaterals), and 2894 in the mesh of level 4 (1599 triangles and 1295 quadrilaterals).For the meshes of levels 14, the domain -y/L < 5 • 10 −6 contained 50, 25, 13, and 6 cells, respectively.
The aspect ratio for mesh levels 24 is 1 : 2 away from the airfoil, while near the airfoil, it is somewhat smaller than 1 : 2, since some of the quadrilaterals turned into triangles when a sequence of nested meshes is generated.The maximum aspect ratio is 1 for the mesh of level 1 (1 for both triangles and quadrilaterals), 0.47 for the mesh of level 2 (0.54 for triangles and 0.44 for quadrilaterals), 0.47 for the mesh of level 3 (0.51 for triangles and 0.45 for quadrilaterals), and 0.52 for the mesh of level 4 (0.57 for triangles and 0.48 for quadrilaterals).
Contours of the Mach number are presented in Fig. 9. Pressure coe©cient distributions over airfoil are shown in Fig. 10a.The results computed are com-   pared with those presented in [26].Integration pressure and wall shear stresses over airfoil surface give drag and lift coe©cients (C x = 0.8388 and C y = 0.0197) which are in a good agreement with the experimental data.Distribution of the friction coe©cient over airfoil, shown in Fig. 10b, is in a good agreement with the data presented in [26].
Table 1 demonstrates the conver- gence of the iterative process on the initial segment of the residual (from 10 0 to 10 −4 ) and in its entire range (from 10 0 to 10 −8 ) (here, the residual norm is normalized by its initial value).The numerator and the denominator correspond to scalar preconditioning and Jacobi block preconditioning, respectively.The number of iterations for pre-and postsmoothing is set equal to 1. Five smoothing iterations are executed on the coarsest mesh.Standard C++ routines were used to monitor CPU and GPU time.
Figure 11 shows the convergence factor of the multigrid method as a function of the number of mesh levels for version 1a.As the number of mesh levels rises from 1 to 4, the numbers of multigrid cycles decrease from 800 to 220.For n = 1, the residual reaches an approximately constant level R ∼ 10 −5 after 1000 multigrid cycles (a further increase in the number of iterations does not reduce the residual), while for n = 4, the residual monotonically decreases to R ∼ 10 −8 beyond a small initial segment.

Speedup Factor
The GPU version of the CFD code is used and validated for a variety of benchmark test cases.Numerical calculations are performed with unstructured inhouse ¦nite-volume CFD code.An equivalent solver is made in C++ to be run in a CPU for benchmarking purposes.
Laminar §ow calculations of §at plate boundary layer are performed on one core of AMD Phenom 2.3 GHz and one module of Tesla S1070 platform consisting of 240 cores with 1.44 GHz.The mesh consists of 111 670 nodes.Computational time of one iteration is 462.6 s on CPU and 11.5 s on GPU and speedup is 40.2.
Computational time of solution of one iteration on CPU and GPU and speedup are shown in Table 2. Calculations are performed on one core of AMD Phenom 2.3 GHz and one module of Tesla S1070, consisting of 240 cores with 1.44 GHz.Speedup factor changes from 22 (problems A4 and A5) to 60 (problem T1).The §at plate turbulent boundary layer problem is solved on various meshes.The turbulent §ow calculations are based on CPU Xeon X5670 2.93 GHz and one module of Tesla S2050 platform.The computational time and speedup of calculations are shown in Table 3 for one iteration.Increasing a number of nodes from 10 5 to 10 7 , speedup increases on 10%.
Speedup of solution on GPUs with respect to the solution on CPU is compared.Performance measurements show that numerical schemes developed achieve 20 to 50 speedup on GPU hardware compared to CPU reference implementation.The results obtained provide promising perspective for designing a GPU-based software framework for applications in CFD.

CONCLUDING REMARKS
A numerical method was developed for computing steady inviscid and viscous compressible gas in a wide range of Mach numbers.The accuracy and convergence rate of the method are independent of the Mach number.The original and preconditioned equations are discretized by applying the ¦nite-volume method on an unstructured mesh.An explicit scheme is used for time di¨erencing, while the inviscid and viscous §uxes are discretized with the help of second-order accurate schemes.Preconditioning is switched on depending on the local Mach number or the local pressure ¦eld (speci¦cally, the preconditioned equations are always solved for the incompressible §uid model).
A multigrid method was developed for solving the Euler and NavierStokes equations on unstructured and hybrid meshes.The method relies on a modi-¦ed form of the restriction operator and involves the generation of a sequence of nested meshes via collapsing faces (semicoarsening method).The method for generating meshes of di¨erent levels accurately takes into account the features of the problem (the boundary layer on the wall), preserves the topology of the original mesh, and produces high quality meshes in the near-wall domain (reasonable stretching and obliqueness of the cells).The capabilities of the approach are demonstrated by computing inviscid and viscous compressible §ows around an airfoil on meshes of various types and dimensions.The multigrid method, in conjunction with Jacobi block preconditioning, produces a prescribed level of the residual after considerably fewer iteration steps than in the case of scalar preconditioning.
Modern GPUs provide architectures and new programming models that enable to harness their large processing power and to design CFD simulations at both high performance and low cost.Possibilities of the use of GPUs for the simulation of external and internal §ows on unstructured meshes are discussed.The ¦nite-volume method is applied to solve 3D unsteady compressible Euler and NavierStokes equations on unstructured meshes with high resolution numerical schemes.The CUDA technology is used for programming implementation of parallel computational algorithms.Solutions of some benchmark test cases on GPUs are reported and the results computed are compared with experimental and computational data.Approaches to optimization of the CFD code related to the use of di¨erent types of memory are considered.In the calculations presented, the shared memory is not used.These opportunities are implemented in the code; however, impact of shared memory on speedup of CFD calculations will be investigated in future work.

Figure 2
Figure 2 Flux calculation (a) and calculation of §ow variables vector on a new time layer (b)

Figure 3
Figure 3 Computational mesh

Figure 4
Figure 4 Contours of the velocity magnitude in the case of the original (a) and preconditioned (b) equations

Figure 5
Figure 5 Convergence rate of the solutions of the original (a) and preconditioned (b) equations

Figure 6
Figure 6 Contours of the velocity magnitude at M = 0.5 (a) and 1.65 (b)

Figure 8
Figure 8 Mesh over airfoil (a) and mesh near airfoil (b)

Figure 9
Figure 9 Contours of Mach number.

Figure 10
Figure 10 Comparison of the computed (solid curves) pressure (a) and friction (b) coe©cients as distributions over airfoil with the data [26] (symbols)

Figure 11
Figure 11  Residual as a function of the number of multigrid cycles in the cases of one mesh (1) and of four nested meshes(2)

Table 1
Convergence of iteration process

Table 2
Computational time and speedup

Table 3
Time and speedup for §at plate problem