Research article

Using long vector extensions for MPI reductions

Published: 01 March 2022

Abstract

Modern CPU designs, with their deep memory hierarchies and SIMD/vectorization capabilities, have a more significant impact on algorithmic efficiency than the modest frequency increases observed in recent years. The recent introduction of wide vector instruction set extensions (AVX and SVE) has made vectorization a critical software component for increasing efficiency and closing the gap to peak performance.

In this paper, we investigate the impact of vectorizing MPI reduction operations. We propose an implementation of the predefined MPI reduction operations based on vector intrinsics (AVX and SVE) to improve their time-to-solution. The evaluation of the resulting software stack under different scenarios demonstrates that the approach is not only efficient but also generalizable to many vector architectures. Experiments conducted on a variety of architectures (Intel Xeon Gold, AMD Zen 2, and Arm A64FX) show that the proposed vector-extension-optimized reduction operations significantly reduce the completion time of collective reductions. With these optimizations, we achieve higher memory bandwidth and increased efficiency for local computations, which directly benefits the overall cost of collective reductions and the applications that rely on them.
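
To make the idea concrete, the sketch below shows what an element-wise MPI_SUM-style reduction over two double buffers might look like when written with AVX-512 intrinsics, in the spirit of the intrinsics-based predefined operations described above. It is a minimal illustration and not the paper's actual code: the function name, signature, and tail handling are our assumptions, and only standard AVX-512F intrinsics (_mm512_loadu_pd, _mm512_add_pd, _mm512_storeu_pd) are used.

/* Minimal sketch (assumed, not the authors' implementation): element-wise
 * "inout[i] += in[i]" over doubles, 8 lanes (512 bits) per iteration,
 * with a scalar loop for the tail. Compile with, e.g., gcc -O2 -mavx512f. */
#include <immintrin.h>
#include <stddef.h>

static void sum_double_avx512(const double *in, double *inout, size_t count)
{
    size_t i = 0;
    for (; i + 8 <= count; i += 8) {
        __m512d a = _mm512_loadu_pd(in + i);       /* 8 doubles from in    */
        __m512d b = _mm512_loadu_pd(inout + i);    /* 8 doubles from inout */
        _mm512_storeu_pd(inout + i, _mm512_add_pd(a, b));
    }
    for (; i < count; ++i)                         /* remaining < 8 elements */
        inout[i] += in[i];
}

One natural design choice, assumed here, is to select such a routine per datatype/operation pair at runtime when the CPU reports AVX-512 support, and to fall back to a narrower vector or scalar path otherwise.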

Highlights

Design and investigation of vector-based reduction operations for MPI reductions.

Implementation using Intel AVX and Arm SVE to demonstrate the efficiency of our vectorized reduction operations (see the illustrative SVE sketch after these highlights).

Experiments with MPI benchmarks, a performance tool, and HPC and deep-learning applications.

Experiments with different architectures (x86 and AArch64) and processors including Intel Xeon Gold, AMD Zen 2, and Arm A64FX.
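
A corresponding vector-length-agnostic form of the same element-wise sum, written with Arm SVE intrinsics (as referenced in the second highlight), is sketched below. Again, this is an assumed illustration rather than the authors' code; it uses only standard ACLE intrinsics (svcntd, svwhilelt_b64_u64, svld1_f64, svadd_f64_x, svst1_f64).

/* Minimal sketch (assumed, not the authors' implementation): the same
 * element-wise sum with SVE. The loop never hard-codes a vector width;
 * svcntd() returns the number of 64-bit lanes at run time and the
 * predicate from svwhilelt masks off the tail.
 * Compile with, e.g., gcc -O2 -march=armv8-a+sve. */
#include <arm_sve.h>
#include <stddef.h>
#include <stdint.h>

static void sum_double_sve(const double *in, double *inout, size_t count)
{
    for (size_t i = 0; i < count; i += svcntd()) {
        svbool_t    pg = svwhilelt_b64_u64((uint64_t)i, (uint64_t)count);
        svfloat64_t a  = svld1_f64(pg, in + i);
        svfloat64_t b  = svld1_f64(pg, inout + i);
        svst1_f64(pg, inout + i, svadd_f64_x(pg, a, b));
    }
}

Because the predicate handles partial vectors, the same binary runs unchanged across SVE implementations of different widths, including the 512-bit vectors of the A64FX used in the experiments.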

