Research article

Using long vector extensions for MPI reductions

Published: 01 March 2022

Abstract

Modern CPU designs, with their deep memory hierarchies and SIMD/vectorization capabilities, have a more significant impact on algorithmic efficiency than the modest frequency increases observed in recent years. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization a critical software component for increasing efficiency and closing the gap to peak performance.

In this paper, we investigate the impact of vectorizing MPI reduction operations. We propose an implementation of the predefined MPI reduction operations based on vector intrinsics (AVX and SVE) to improve their time-to-solution. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on a variety of architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector-extension-optimized reduction operations significantly reduce the completion time of collective reductions. With these optimizations, we achieve higher memory bandwidth and increased efficiency for local computations, which directly benefits the overall cost of collective reductions and the applications that rely on them.
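
To make the idea concrete, the sketch below shows what an element-wise MPI_SUM-style reduction over two double buffers might look like when written with AVX-512 intrinsics, in the spirit of the intrinsics-based predefined operations described above. It is a minimal illustration and not the paper's actual code: the function name, signature, and tail handling are our assumptions, and only standard AVX-512F intrinsics (_mm512_loadu_pd, _mm512_add_pd, _mm512_storeu_pd) are used.

/* Minimal sketch (assumed, not the authors' implementation): element-wise
 * "inout[i] += in[i]" over doubles, 8 lanes (512 bits) per iteration,
 * with a scalar loop for the tail. Compile with, e.g., gcc -O2 -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

static void sum_double_avx512(const double *in, double *inout, size_t count)
{
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m512d a = _mm512_loadu_pd(in + i);       /* 8 doubles from in    */
        __m512d b = _mm512_loadu_pd(inout + i);    /* 8 doubles from inout */
        _mm512_storeu_pd(inout + i, _mm512_add_pd(a, b));
    }
    for (; i < count; ++i)                         /* remaining < 8 elements */
        inout[i] += in[i];
}

One natural design choice, assumed here, is to select such a routine per datatype/operation pair at runtime when the CPU reports AVX-512 support, and to fall back to a narrower vector or scalar path otherwise.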

Highlights

Design and investigation of vector-based reduction operations for MPI reductions.

Implementation using Intel AVX and Arm SVE to demonstrate the efficiency of our vectorized reduction operations (see the illustrative SVE sketch after these highlights).

Experiments with MPI benchmarks, a performance tool, and HPC and deep-learning applications.

Experiments with different architectures (x86 and AArch64) and processors including Intel Xeon Gold, AMD Zen 2, and Arm A64FX.
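
A corresponding vector-length-agnostic form of the same element-wise sum, written with Arm SVE intrinsics (as referenced in the second highlight), is sketched below. Again, this is an assumed illustration rather than the authors' code; it uses only standard ACLE intrinsics (svcntd, svwhilelt_b64_u64, svld1_f64, svadd_f64_x, svst1_f64).

/* Minimal sketch (assumed, not the authors' implementation): the same
 * element-wise sum with SVE. The loop never hard-codes a vector width;
 * svcntd() returns the number of 64-bit lanes at run time and the
 * predicate from svwhilelt masks off the tail.
 * Compile with, e.g., gcc -O2 -march=armv8-a+sve. */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

static void sum_double_sve(const double *in, double *inout, size_t count)
{
    for (size_t i = 0; i < count; i += svcntd()) {
        svbool_t    pg = svwhilelt_b64_u64((uint64_t)i, (uint64_t)count);
        svfloat64_t a  = svld1_f64(pg, in + i);
        svfloat64_t b  = svld1_f64(pg, inout + i);
        svst1_f64(pg, inout + i, svadd_f64_x(pg, a, b));
    }
}

Because the predicate handles partial vectors, the same binary runs unchanged across SVE implementations of different widths, including the 512-bit vectors of the A64FX used in the experiments.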

