Performance analysis and optimization of a CFD application

Zhang, Wentao

Permalink

https://hdl.handle.net/2142/88072

Description

Title

Performance analysis and optimization of a CFD application

Author(s)

Zhang, Wentao

Issue Date

2015-07-20

Advisor

Bodony, Daniel J.

Department of Study

Mechanical Science & Engineering

Discipline

Mechanical Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Keyword(s)

Performance optimization
computational fluid dynamics (CFD)
Intel Xeon Phi

Abstract

This thesis documents the analysis and optimization of PlasComCM, a high-order finite-difference computational fluid dynamics (CFD) application. Performance bottlenecks were identified using performance tools and hardware counters. The analysis showed that the large number of memory accesses and the lack of vectorization prevented PlasComCM from reaching optimal serial performance on an x86-based CPU. Optimization techniques, including pointer dereferencing, loop transformations, and Fortran SIMD directives, were applied to the 10 most time-consuming subroutines to remove obstacles to vectorization and to improve serial performance. Details of these techniques are presented and their impact on performance is discussed. The optimization efforts yielded a 63% reduction in the number of memory loads and a serial speedup of 2.02. Using the optimized serial program as the code base, further investigation focused on the analysis and optimization of parallel heterogeneous execution on a dual-socket node fitted with an Intel Xeon Phi MIC card. To reduce the overhead of host-accelerator copies in heterogeneous execution, the data layout of the halo region was changed from a "star" shape to a "box" shape. This change agglomerates small communications and creates a larger work granularity. Preliminary results of running PlasComCM on Intel Xeon Phi coprocessors in symmetric mode are also presented. For a particular problem size, using two Sandy Bridge sockets plus one Xeon Phi card reduced wall-clock time by 20% compared with using the two Sandy Bridge sockets alone.
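The serial optimizations named in the abstract (pointer dereferencing, loop transformation, SIMD directives) were applied to Fortran subroutines in PlasComCM. That code is not reproduced in this record. The following is a minimal C sketch of the same general ideas, built under stated assumptions. The `State` structure and the `velocity_*` and `ddx_*` kernels are hypothetical stand-ins, not PlasComCM code. `#pragma omp simd` plays the role of a Fortran SIMD directive such as `!$omp simd`.

The sketch shows two techniques. First, dereferencing structure members once into local `restrict` pointers removes the aliasing doubts that can block vectorization. Second, interchanging loops makes the innermost loop run over contiguous memory. In C's row-major layout that is the last index; in Fortran's column-major layout it would be the first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical solver state: field arrays reached through pointer members,
   loosely analogous to pointer components of a Fortran derived type. */
typedef struct {
    double *rho;   /* density */
    double *rhou;  /* x-momentum */
    double *u;     /* output: x-velocity */
    size_t n;      /* number of grid points */
} State;

/* Before: every array access goes through the structure. Nothing tells the
   compiler that s->u does not overlap s->rho or s->rhou, so it must either
   insert run-time overlap checks or leave the loop scalar. */
void velocity_naive(State *s) {
    for (size_t i = 0; i < s->n; ++i)
        s->u[i] = s->rhou[i] / s->rho[i];
}

/* After: dereference the structure once into local restrict-qualified
   pointers (the caller promises the arrays do not overlap) and request
   vectorization explicitly, the C counterpart of a Fortran !$omp simd. */
void velocity_hoisted(State *s) {
    const size_t n = s->n;
    const double *restrict rho = s->rho;
    const double *restrict rhou = s->rhou;
    double *restrict u = s->u;
#pragma omp simd
    for (size_t i = 0; i < n; ++i)
        u[i] = rhou[i] / rho[i];
}

/* Hypothetical central difference d/dx on an nx-by-ny row-major grid,
   interior points only: dfdx(i,j) = (f(i+1,j) - f(i-1,j)) * inv2dx.
   Before: the inner loop runs over j, so successive iterations are
   nx elements apart in memory. */
void ddx_strided(size_t nx, size_t ny, const double *f, double *dfdx,
                 double inv2dx) {
    for (size_t i = 1; i + 1 < nx; ++i)
        for (size_t j = 0; j < ny; ++j)
            dfdx[j * nx + i] = (f[j * nx + i + 1] - f[j * nx + i - 1]) * inv2dx;
}

/* After loop interchange: the inner loop runs over i, the contiguous
   index in C, so loads and stores are unit-stride and easy to vectorize.
   f and dfdx must not overlap. */
void ddx_unit_stride(size_t nx, size_t ny, const double *f, double *dfdx,
                     double inv2dx) {
    if (nx < 3) return;               /* fewer than three points: no interior */
    const size_t last = nx - 1;       /* loop bound in OpenMP canonical form */
    for (size_t j = 0; j < ny; ++j) {
        const double *restrict row = f + j * nx;
        double *restrict drow = dfdx + j * nx;
#pragma omp simd
        for (size_t i = 1; i < last; ++i)
            drow[i] = (row[i + 1] - row[i - 1]) * inv2dx;
    }
}
```

In each pair, the two versions compute identical results. They differ only in what the compiler can prove about memory access. For that reason, gains of this kind are normally confirmed with compiler vectorization reports and hardware counters rather than assumed.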

Graduation Semester

2015-08

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Mechanical Science and Engineering