Comment: | eecg.toronto.edu wiki reference |
---|---|
SHA1: | b9bda3f6f081f25cd67c069c74fac602 |
User & Date: | martin_vahi on 2017-05-12 03:16:17 |
2017-05-17 07:34 | additional wiki references check-in: 54983c9b9b user: martin_vahi tags: trunk | |
2017-05-12 03:16 | eecg.toronto.edu wiki reference check-in: b9bda3f6f0 user: martin_vahi tags: trunk | |
2017-05-10 12:47 | wiki references check-in: cf7f03865c user: martin_vahi tags: trunk | |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/COMMENTS.txt version [39a0921e6a].
The origin: http://www.eecg.toronto.edu/parallel/publications.html (archival copy: https://archive.is/rZUhA ) ftp://ftp.cs.toronto.edu/pub/parallel Many, but possibly not all, of the HTML files that wget downloaded were modified so that they reference the files in ./manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_eduwww.eecg.toronto.edu instead of containing URLs to the original FTP site. |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/ABSTRACTS version [f912e52af2].
--------------------------------------------------------------------- File: Ravi_Stumm_ICPP95.ps.Z Title: Hierarchical Ring Topologies and the effect of their Bisection Bandwidth Constraints Authors: G. Ravindran and M. Stumm Where : Proc. Intl. Conf. on Parallel Processing, pp.I/51-55, 1995 Keywords: Multiprocessor architectures, Interconnection networks, Hierarchical rings, Bisection bandwidth Abstract: Ring-based hierarchical networks are interesting alternatives to popular direct networks such as 2D meshes or tori. They allow for simple router designs, wider communications paths, and faster networks than their direct network counterparts. However, they have a constant bisection bandwidth, regardless of system size. In this paper, we present the results of a simulation study to determine how large hierarchical ring networks can become before their performance deteriorates due to their bisection bandwidth constraint. We show that a system with a maximum of 128 processors can sustain most memory access behaviors, but that larger systems can be sustained, only if their bisection bandwidth is increased. --------------------------------------------------------------------- File: Ravi_Stumm_JIEICE96.ps.Z Title : A Comparison of Blocking and Non-blocking Packet Switching Techniques in Hierarchical Ring Networks Authors: G. Ravindran and M. Stumm Where : IEICE Trans. Inf. & Syst., vol. E79-D, No. 8, August 1996 keywords: Networks, Switching, Wormhole, Virtual Cut-through, Hierarchical Ring Networks, Slotted Rings Abstract : This paper presents the results of a simulation study of blocking and non-blocking switching for hierarchical ring networks. The switching techniques include wormhole, virtual cut-through, and slotted ring. We conclude that slotted ring network performs better than the more popular wormhole and virtual cut-through networks. 
We also show that the size of the node buffers is an important parameter and that choosing them too large can hurt performance in some cases. Slotted rings have the advantage that the choice of buffer size is easier in that larger than necessary buffers do not hurt performance and hence a single choice of buffer size performs well for all system configurations. In contrast, the optimal buffer size for virtual cut-through and wormhole switching nodes varies depending on the system configuration and the level in the hierarchy in which the switching node lies. --------------------------------------------------------------------- File: Zhou_Brecht_SM91.ps.Z Title: Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors Where: Appears in: Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May (1991), pp. 133-142. Authors: Songnian Zhou and Timothy Brecht Keywords: NUMA, Scheduling, multiprocessor performance Abstract: Large-scale Non-Uniform Memory Access (NUMA) multiprocessors are gaining increased attention due to their potential for achieving high performance through the replication of relatively simple components. Because of the complexity of such systems, scheduling algorithms for parallel applications are crucial in realizing the performance potential of these systems. In particular, scheduling methods must consider the scale of the system, with the increased likelihood of creating bottlenecks, along with the NUMA characteristics of the system, and the benefits to be gained by placing threads close to their code and data. We propose a class of scheduling algorithms based on processor pools. A processor pool is a software construct for organizing and managing a large number of processors by dividing them into groups called pools. The parallel threads of a job are run in a single processor pool, unless there are performance advantages for a job to span multiple pools. Several jobs may share one pool. 
Our simulation experiments show that processor pool-based scheduling may effectively reduce the average job response time. The performance improvements attained by using processor pools increase with the average parallelism of the jobs, the load level of the system, the differentials in memory access costs, and the likelihood of having system bottlenecks. As the system size increases, while maintaining the workload composition and intensity, we observed that processor pools can be used to provide significant performance improvements. We therefore conclude that processor pool-based scheduling may be an effective and efficient technique for scalable systems. --------------------------------------------------------------------- File: Brecht_SEDMS93.ps.Z Title: On the Importance of Parallel Application Placement in NUMA Multiprocessors Authors: Timothy Brecht Where: Proceedings of the Fourth Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, September, 1993. Keywords: NUMA, multiprocessor scheduling, multiprocessor performance Abstract: The thesis of this paper is that scheduling decisions in large-scale, shared-memory, NUMA (Non-Uniform Memory Access) multiprocessors must consider not only how many processors, but also which processors to allocate to each application. We call the problem of assigning parallel processes of an application to processors application placement. We explore the importance of placement decisions by measuring the execution time of several parallel applications using different placements on a shared-memory NUMA multiprocessor. The results of these experiments lead us to conclude that, as expected, in small-scale mildly NUMA multiprocessors, placement decisions have only a minor effect on the execution time of parallel applications. 
However, the results also show that placement decisions in large-scale multiprocessors are critical and localization that considers the architectural clusters inherent in these systems is essential. Our experiments also show that the importance of placement decisions increases substantially with the size and NUMAness of the system and that the placement of individual processes of an application within the subset of chosen processors also significantly impacts performance. --------------------------------------------------------------------- File: Kumar_Kulkarni_ICPP91.ps.Z (does not contain figures) Title: Generalized Unimodular Loop Transformations for Distributed Memory Multiprocessors Authors: K G Kumar*, D Kulkarni+ and A Basu Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India * Now at IBM TJ Watson, York Town Heights, NY 10598 + Now at Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4 Where: International Conference of Parallel Processing -91 Keywords: Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors. Abstract In this paper, we present a generalized unimodular loop transformation as a simple, systematic and elegant method for partitioning the iteration spaces of nested loops for execution on distributed memory multiprocessors. We present a methodology for deriving the transformations that internalize multiple dependences in a multidimensional iteration space without resulting in a deadlocking situation. We then derive the general expression for the bounds of the transformed loops in terms of the bounds of the original space and the transformation matrix elements. 
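The Kumar, Kulkarni and Basu abstract above concerns unimodular loop transformations and the dependences they must respect. As a rough illustration only, the following Python sketch shows the classic two-dimensional unimodular framework (integer matrix with determinant +1 or -1, legality checked by lexicographic positivity of the transformed dependence vectors). It is not the generalized method of the paper; the dependence vectors in DEPS and all function names are hypothetical choices made for this example.

```python
"""Classic 2-D unimodular loop transformations (illustrative sketch only)."""
from itertools import product


def matmul(a, b):
    """Product of two 2x2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def apply(t, v):
    """Apply the 2x2 matrix t to the 2-vector v."""
    return tuple(t[i][0] * v[0] + t[i][1] * v[1] for i in range(2))


def is_unimodular(t):
    """An integer matrix is unimodular when its determinant is +1 or -1."""
    return t[0][0] * t[1][1] - t[0][1] * t[1][0] in (1, -1)


def lex_positive(v):
    """True when the first nonzero component of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False


def is_legal(t, deps):
    """Legal when t is unimodular and keeps every dependence lexicographically positive."""
    return is_unimodular(t) and all(lex_positive(apply(t, d)) for d in deps)


def transformed_points(t, n, m):
    """Image of the n x m rectangular iteration space under t."""
    return {apply(t, p) for p in product(range(n), range(m))}


SKEW = ((1, 0), (1, 1))          # inner index becomes i + j
INTERCHANGE = ((0, 1), (1, 0))   # swap the two loops
DEPS = [(1, -1), (0, 1)]         # hypothetical dependence vectors

# Skew first, then interchange: the composed matrix is INTERCHANGE * SKEW.
SKEW_THEN_INTERCHANGE = matmul(INTERCHANGE, SKEW)
```

With these assumed dependences, interchange alone is illegal because (1, -1) maps to (-1, 1), while skewing first makes the later interchange legal. Because the matrices are unimodular, every point of the original iteration space maps to a distinct integer point of the transformed space.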
------------------------------------------------------------------- File: Kumar_Kulkarni_ICS92.ps.Z Title: Deriving Good Transformations for Mapping Nested Loops on Hierarchical Parallel Machines in Polynomial Time Authors: K G Kumar*, D Kulkarni+ and A Basu Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India * IBM TJ Watson, York Town Heights, NY 10598 + Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4 Where: International Conference on Supercomputing 92 Keywords: Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors. We present a computationally efficient method for deriving the most appropriate transformation and mapping of a nested loop for a given hierarchical parallel machine. This method is in the context of our systematic and general theory of unimodular loop transformations for the problem of iteration space partitioning \cite{kandk6}. Finding an optimal mapping or an optimal associated unimodular transformation is NP-complete. We present a polynomial time method for obtaining a `good' transformation using a simple parameterized model of the hierarchical machine. We outline a systematic methodology for obtaining the most appropriate mapping. ------------------------------------------------------------------- File: Li_Tandri_et_ICPP93.ps.Z Title: LOCALITY AND LOOP SCHEDULING ON NUMA MULTIPROCESSORS Authors: Hui Li, Sudarsan Tandri Michael Stumm, and Kenneth C. Sevcik Where: International Conference on Parallel Processing 93 Keywords: NUMA multiprocessors, Locality, Scheduling Abstract: An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. While most existing dynamic scheduling algorithms manage load imbalances well, they fail to take locality into account and therefore perform poorly on parallel systems with non-uniform memory access times. 
In this paper, we propose a new loop scheduling algorithm, Locality-based Dynamic Scheduling (LDS), that exploits locality, and dynamically balances the load. -------------------------------------------------------------- File: Sandhu_et_al_PPOPP.ps.Z Title: The shared regions approach to software cache coherence on multiprocessors Where: Appears in: Proceedings of the 1993 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May (1993). Authors: Harjinder Sandhu, Benjamin Gamsa and Songnian Zhou Keywords: NUMA, cache coherence, multiprocessor performance Abstract: The effective management of caches is critical to the performance of applications on shared-memory multiprocessors. In this paper, we discuss a technique for software cache coherence that is based upon the integration of a program-level abstraction for shared data with software cache management. The program-level abstraction, called {\it Shared Regions}, explicitly relates synchronization objects with the data they protect. Cache coherence algorithms are presented which use the information provided by shared region primitives, and ensure that shared regions are always cacheable by the processors accessing them. Measurements and experiments of the Shared Region approach on a shared-memory multiprocessor are shown. Comparisons with other software based coherence strategies, including a user-controlled strategy and an operating system-based strategy, show that this approach is able to deliver better performance, with relatively low corresponding overhead and only a small increase in the programming effort. Compared to a compiler-based coherence strategy, the Shared Regions approach still performs better than a compiler that can achieve 90\% accuracy in allowing caching, as long as the regions are a few hundred bytes or larger, or they are re-used a few times in the cache. 
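The Shared Regions abstract above rests on one idea: the synchronization object and the data it protects are tied together, so the runtime knows which data to make coherent at acquire and release time. The Python sketch below only illustrates that binding. The class name, method names, and the recorded "coherence" labels are hypothetical and are not the primitives of the paper; Python cannot manage hardware caches, so the cache actions are merely logged as strings.

```python
import threading
from contextlib import contextmanager


class SharedRegion:
    """Hypothetical region object: a lock bound to the data it protects."""

    def __init__(self, data):
        self._lock = threading.Lock()
        self._data = data
        # Stand-in for the cache management a real runtime would perform.
        self.coherence_events = []

    @contextmanager
    def write_access(self):
        """Acquire the region, hand out its data, then release the region."""
        with self._lock:
            self.coherence_events.append("acquire: invalidate stale cached copy")
            try:
                yield self._data
            finally:
                self.coherence_events.append("release: write back modified data")


def increment_many(region, times):
    """Update the shared counter only while holding the region."""
    for _ in range(times):
        with region.write_access() as data:
            data["count"] += 1


def run_demo(workers=4, times=1000):
    """Run several threads against one region and return the region."""
    region = SharedRegion({"count": 0})
    threads = [
        threading.Thread(target=increment_many, args=(region, times))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return region
```

Because every access goes through the region, the place where coherence actions must happen is explicit in the program, which is the property the abstract attributes to the approach.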
------------------------------------------------------------------- File: Wilton_Vranesic_SPDP.ps.Z Title: Architectural Support for Block Transfers in a Shared-Memory Multiprocessor Authors: Steven J.E. Wilton and Zvonko G. Vranesic To appear in the Fifth IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, December 1993 Keywords: Shared-memory multiprocessor, block transfer support Abstract: This paper examines how the performance of a shared-memory multiprocessor can be improved by including hardware support for block transfers. A system similar to the Hector multiprocessor developed at the University of Toronto is used as a base architecture. It is shown that such hardware support can improve the performance of initialization code by as much as 50%, but that the amount of improvement depends on the memory access behavior of the program and the way in which the operating system issues block transfer requests. ---------------------------------------------------------------------- File: Sevcik_Zhou_PERF93.ps.Z Title: Performance Benefits and Limitations of Large NUMA Multiprocessors Authors: Kenneth C. Sevcik and Songnian Zhou Where: appeared in the Proceedings of Performance '93 , Rome, Italy, September 27 to October 1, 1993, pp. 183-204, Elsevier Science Publ. Abstract: Please see the ps file. ---------------------------------------------------------------------- File: Harz_Sevcik_SC93.ps.Z Title: Hot Spot Analysis in Large Scale Shared Memory Multiprocessors Authors: Karim Harzallah and Kenneth C. Sevcik Where: will appear in the Proceedings of the Supercomputing '93 Conference, November, 1993, Portland, Oregon. Abstract: Please see the ps file. ----------------------------------------------------------------------- File: Sevcik_JPE.ps.Z Title: Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems Authors: Kenneth C. 
Sevcik Where: This paper will appear in a special issue of the journal "Performance Evaluation" on the performance evaluation of parallel systems in late 1993 or early 1994. Abstract: Please see the ps file. ----------------------------------------------------------------------- File : Holliday_Stumm_IEEETC.ps.Z Title: Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors Authors: Mark Holliday Dept. of Computer Science, Duke University, Durham, NC 27706 Michael Stumm Dept. of Electrical and Computer Engineering University of Toronto, Toronto, Canada M5S 1A4 Date: November 1992; revised April 1993 Published: Technical Report CS-1992-18, Duke University Accepted for publication in IEEE Transactions on Computers Keywords: communication locality; hierarchical ring-based networks; hot spots; large scale parallel systems; memory banks; performance evaluation; prefetching; shared memory multiprocessors; simulation. Abstract: This paper investigates the performance of word-packet, slotted unidirectional ring-based hierarchical direct networks in the context of large-scale shared memory multiprocessors. Slotted unidirectional rings are attractive because their electrical characteristics and simple interfaces allow for fast cycle times and large bandwidths. For large-scale systems, it is necessary to use multiple rings for increased aggregate bandwidth. Hierarchies are attractive because the topology ensures unique paths between nodes, simple node interfaces and simple inter-ring connections. To ensure that a realistic region of the design space is examined, the architecture of the network used in the Hector prototype is adopted as the initial design point. A simulator of that architecture has been developed and validated with measurements from the prototype. The system and workload parameterization reflects conditions expected in the near future. The results of our study show the importance of system balance on performance. 
Large-scale systems inherently have large communication delays for distant accesses, so processor efficiency will be low, unless the processors can operate with multiple outstanding transactions using techniques such as prefetching, asynchronous writes and multiple hardware contexts. However with multiple outstanding transactions and only one memory bank per processing module, memory quickly saturates. Memory saturation can be alleviated by having multiple memory banks per processing module, but this shifts the bottleneck to the ring subsystem. While the topology of the ring hierarchy affects performance, the ring subsystem will inherently limit the throughput of the system. Hence increasing the number of outstanding transactions per processor beyond a certain point only has a limiting effect on performance, since it causes some of the rings to become congested. An adaptive maximum number of outstanding transactions appears necessary to adjust for the appropriate tradeoff between concurrency and contention as the communication locality changes. We show the relationships between processor, ring and memory speeds, and their effects on performance. Using block transfers for prefetching seems unlikely to be advantageous in that the improvement in the cache hit ratio needed to compensate for the increased network utilization is substantial. ------------------------------------------------------------------------- File : Curran_Stumm_CS.ps.Z Title: A Comparison of basic CPU Scheduling Algorithms for Multiprocessor Unix Authors: Stephen Curran and Michael Stumm Department of Electrical and Computer Engineering University of Toronto, Toronto, Canada M5S 1A4 Published: Computer Systems, 3(4), Oct., 1990, pp. 551--579. Abstract: In this paper, we present the results of a simulation study comparing three basic algorithms that schedule independent tasks in multiprocessor versions of Unix. 
Two of these algorithms, namely Central Queue and Initial Placement, are obvious extensions to the standard uniprocessor scheduling algorithm and are in use in a number of multiprocessor systems. A third algorithm, Take, is a variation on Initial Placement, where processors are allowed to raid the task queues of the other processors. Our simulation results show the difference between the performance of the three algorithms to be small when scheduling a typical Unix workload running on a small, bus-based, shared memory multiprocessor. They also show that the Take algorithm performs best for those multiprocessors on which tasks incur overhead each time they migrate. In particular, the Take algorithm appears to be more stable than the other two algorithms under extreme conditions. ----------------------------------------------------------------------- File: Stumm_Unrau_Krieger_USENIX92.ps.Z Title: HIERARCHICAL CLUSTERING: A STRUCTURE FOR SCALABLE MULTIPROCESSOR OPERATING SYSTEM DESIGN Authors: Michael Stumm, Ron Unrau, and Orran Krieger Where: Extended version of Clustering Micro-Kernels for Scalability, Proc.\ of the Usenix Workshop on Micro-Kernels and Other Kernel Architectures, April, 1992. Abstract: Please see the ps file. ---------------------------------------------------------------------- File: Stumm_Vranesic_White_IPPS93.ps.Z Title: EXPERIENCE WITH THE HECTOR MULTIPROCESSOR Authors: Michael Stumm, Zvonko Vranesic, Ron White Where: Extended version of paper with same title in Proc.\ Intl.\ Parallel Processing Symposium Parallel Systems Fair, 1993, pp.\ 9--16. Abstract: Please see the ps file. ---------------------------------------------------------------------- File: Krieger_Stumm_Unrau_USENIX92.ps.Z Title: The Alloc Stream Facility: A redesign of application-level Stream I/O Authors: O. Krieger, M. Stumm, and R. 
Unrau Where: Extended version of ``Exploiting the advantages of mapped files for stream I/O'' in Proc.\ of the Winter 1992 Usenix Conference, January, 1992. Abstract: This paper describes the design and implementation of a new application level I/O facility, called the Alloc Stream Facility. The Alloc Stream Facility has several key advantages. First, performance is substantially improved as a result of a)~the structure of the facility that allows it to take advantage of system specific features like mapped files, and b)~a reduction in data copying and the number of I/O system calls. Second, the facility is designed for multi-threaded applications running on multiprocessors and allows for a high degree of concurrency. Finally, the facility can support a variety of I/O interfaces, including stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows applications to freely intermix calls to the different interfaces, resulting in improved code reusability. We show that on several Unix workstation platforms the performance of Unix applications using the Alloc Stream Facility can be substantially better than when the applications use the original I/O facilities. ---------------------------------------------------------------------- File: Krieger_Stumm_DAGS93.ps.Z Title: HFS: A Flexible File System for large-scale Multiprocessors Authors: Orran Krieger and Michael Stumm Where: Proceedings of the 1993 DAGS/PC Symposium Abstract: The {H{\sc urricane}} File System (HFS) is a new file system being developed for large-scale shared memory multiprocessors with distributed disks. The main goal of this file system is scalability; that is, the file system is designed to handle demands that are expected to grow linearly with the number of processors in the system. To achieve this goal, HFS is designed using a new structuring technique called Hierarchical Clustering. 
HFS is also designed to be flexible in supporting a variety of policies for managing file data and for managing file system state. This flexibility is necessary to support in a scalable fashion the diverse workloads we expect for a multiprocessor file system. ---------------------------------------------------------------------- File: Krieger_etal_ICPP93.ps.Z Title: A fair fast scalable reader-writer lock Authors: O. Krieger, M. Stumm, R. Unrau, and J. Hanna, Where: Proc. Intl. Conf. on Parallel Processing, 1993. Abstract: A reader-writer lock allows either multiple readers to inspect shared data or a single writer exclusive access to that data. On shared memory multiprocessors, the cost of acquiring and releasing these locks can have a large impact on the performance of parallel applications. Other researchers have shown how to implement scalable locks, that is, locks that can become contended without resulting in memory or interconnection network contention. This paper describes a new algorithm for a reader-writer lock that, while being scalable in the contended case, has a low overhead in the uncontended case. This is important because most parallel applications are written so that locks are typically uncontended. The only atomic operation required by this algorithm is fetch_and_store and hence it can be used on most current multiprocessor systems. Experimental results are provided. ---------------------------------------------------------------------- File: Kulkarni_Stumm_Tutorial.ps.Z Title: Loop and Data Transformations: A tutorial Authors: Dattatraya Kulkarni and Michael Stumm Where: Internal document, a tutorial guide. Abstract: Hierarchically structured machines appear to be becoming the dominant parallel computing structure. These systems have non-uniform access times. We address the problem of restructuring a possibly sequential program to execute efficiently on such parallel machines. 
This restructuring involves transforming and partitioning the loop structures and the data so as to improve {\it parallelism}, {\it static} and {\it dynamic locality}, and {\it load balance}. The objective of this paper is to present previous and ongoing work on loop and data transformations and motivate a {\it unified} framework for restructuring a sequence of loops and data so as to execute efficiently on parallel machines with several levels of hierarchy. ---------------------------------------------------------------------- File: Baru_Zilio_PADS93.ps.Z Title: Data reorganization in parallel database systems Author: Chaitanya Baru & Daniel C. Zilio Where : Proc. of the IEEE Workshop on Advances in Parallel and Distributed Systems, Princeton, NJ, pp.102-107, Oct. 1993. Abstract: Parallel database systems are suitable for use in applications with high capacity and high performance and availability requirements. The trend in such systems is to provide efficient on-line capability for performing various system administration functions such as index creation and maintenance, backup/restore, reorganization, and gathering of statistics. For some of these functions, the on-line capability can be efficiently supported by the use of ``incremental algorithms'', i.e., algorithms that achieve the function in several, relatively small (i.e., less time-consuming) steps, rather than in a single, large step. Incremental algorithms ensure that only small parts of the database become inaccessible for short durations as opposed to non-incremental algorithms which may lock large portions of the database or the entire database for a longer duration. In this paper, we discuss issues in providing concurrent data reorganization capability using incremental algorithms in parallel database systems. 
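The Baru and Zilio abstract above contrasts incremental algorithms, which work in many small steps, with non-incremental ones that lock large portions of the database at once. The Python sketch below is a minimal illustration of that contrast under assumed names: rows are moved between two in-memory lists in fixed-size batches, and only the batch currently being moved counts as locked. The function names, the list-based "partitions", and the batch size are all hypothetical and do not come from the paper.

```python
def reorganize_incrementally(source, target, batch_size):
    """Move every row from source to target, batch_size rows per step.

    Yields the rows that are locked (inaccessible) during each step, so a
    caller can observe that only a small part is unavailable at any time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    while source:
        locked = source[:batch_size]   # only this batch is unavailable
        del source[:batch_size]
        target.extend(locked)
        yield locked                   # step done; the batch is accessible again


def run_demo(rows=10, batch_size=3):
    """Reorganize a small hypothetical partition and report what happened."""
    source = list(range(rows))
    target = []
    steps = 0
    largest_lock = 0
    for locked in reorganize_incrementally(source, target, batch_size):
        steps += 1
        largest_lock = max(largest_lock, len(locked))
    return source, target, steps, largest_lock
```

A non-incremental version would correspond to a batch size equal to the whole partition: one step, with every row locked for its full duration.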
---------------------------------------------------------------------- File: Kulkarni_Stumm_292.ps.Z Title: Computational Alignment: A new, unified program transformation for local and global optimization Authors: Dattatraya Kulkarni and Michael Stumm Where: CSRI Tech report 292, ISSN 0834-1648 Abstract: {\small {\em Computational Alignment} is a new class of program transformations suitable for both local and global optimization. Computational Alignment transforms all of the computations of a {\em portion} of the loop body in order to align them to other computations either in the same loop or in another loop. It extends along a new dimension and is significantly more powerful than linear transformations because $i)$ it can transform subsets of dependences and references; $ii)$ it is sensitive to the location of data in that it can move the computation relative to data; $iii)$ it applies to imperfect loop nests; and $iv)$ it is the first loop transformation that can change {\it access vectors}. Linear transformations are just a special case of Computational Alignment. Computational Alignment is highly suitable for global optimization because it can transform given loops to access data in similar ways. Two important subclasses of Computational Alignment are presented as well, namely, {\em Freeing} and {\em Isomerizing} Computational Alignment.} ------------------------------------------------------------- File: Brecht_PhD_303.ps.Z Title: Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors Authors: Timothy B. Brecht Where: Ph.D. Dissertation - CSRI Technical Report CSRI-303 Abstract: The invention, acceptance, and proliferation of multiprocessors are primarily a result of the quest to increase computer system performance. The most promising features of multiprocessors are their potential to solve problems faster than previously possible and to solve larger problems than previously possible. 
Large-scale multiprocessors offer the additional advantage of being able to execute multiple parallel applications simultaneously. The execution time of a parallel application is directly related to the number of processors it is allocated and, in shared-memory non-uniform memory access time (NUMA) multiprocessors, which processors it is allocated. As a result, efficient and effective scheduling becomes critical to overall system performance. In fact, it is likely to be a contributing factor in ultimately determining the success or failure of shared-memory NUMA multiprocessors. The subjects of this dissertation are the problems of processor allocation and application placement. The processor allocation problem involves determining the number of processors to allocate to each of several simultaneously executing parallel applications and possibly dynamically adjusting those allocations to improve overall system performance. The performance metric used is mean response time. We show that by differentiating between applications based on the amount of remaining work they have to execute, performance can be improved significantly. Then we propose techniques for estimating an application's expected remaining work along with policies for using these estimates to make improved processor allocation decisions. An experimental evaluation demonstrates the promise of this approach. The placement problem involves determining which of the many processors to assign to each application. Using experiments conducted on a representative system, we demonstrate that in large-scale NUMA multiprocessors the execution time of parallel applications is significantly affected by the placement of the application. This motivates the need for new techniques designed explicitly for NUMA multiprocessors. 
We introduce such a technique, called processor pool-based scheduling, that is designed to localize the execution of parallel applications within a NUMA architecture and to isolate different parallel applications from each other. An experimental evaluation of this scheduling method shows that it can be used to significantly reduce mean response time over methods that do not consider the placement of parallel applications. ------------------------------------------------------------------- File: Gamsa_MASc.ps.Z Title: Region-Oriented Main Memory Management in Shared-Memory NUMA Multiprocessors Authors: Benjamin Gamsa Where: M.Sc. Thesis Abstract: In Non-Uniform Memory Access time (NUMA) multiprocessors, distribution of the memory modules facilitates architectural scaling, but creates complications for the programmers who must be concerned with the physical distribution of their data in order to obtain good performance. In order to reduce the impact of remote accesses, in this thesis we propose that data be partitioned into Shared Regions that reflect the granularity of data sharing in programs, and that special synchronization calls be added to enforce proper ordering of accesses to the shared data as well as to manage replication and consistency transparently to the programmer. Results from measurements on a 16-processor NUMA multiprocessor and from a model of the system indicate that the Shared Regions approach is successful in obtaining the necessary locality critical to performance, while incurring only minimal overhead. Data distribution methods are also observed to have a significant impact on the performance of the system, especially in the larger multiprocessors modeled. ------------------------------------------------------------------- File: Unrau_PhD.ps.Z Title: Scalable Memory Management through Hierarchical Symmetric Multiprocessing Authors: Ronald C. Unrau Where: Ph.D. 
Dissertation Abstract: This dissertation examines scalability issues in the design of operating systems for large-scale, shared-memory multiprocessors. In particular, the thesis focuses on structuring issues as they relate to memory management. From a set of simple, well-known queuing network formulas, we derive a set of properties that describe sufficient conditions for an operating system to scale. From these properties we first develop a set of guidelines for designing scalable systems, and then develop a new structuring philosophy for shared-memory multiprocessor operating systems, called Hierarchical Symmetric Multiprocessing (HSM). HSM manages the system resources in clusters, using tight coupling within a cluster, and loose coupling across clusters. Distributed systems principles are applied by distributing and replicating system services and data objects to increase locality, increase concurrency, and avoid centralized bottlenecks, thus making the system scalable. However, tight coupling is used within a cluster, so the system performs well for local interactions. HSM maximizes locality, which is key to good performance in large systems, and systems based on HSM can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. Finally, HSM leads to a modular system composed from easy-to-design and hence efficient building blocks. Memory management is a particularly challenging service to implement within the HSM framework, because it must provide the applications with an integrated and coherent view of a single system, while distributing and replicating services in order to fully exploit the hardware potential. We describe in detail the implementation of an HSM structured memory management subsystem, and evaluate the performance of our implementation on Hector, a prototype scalable shared memory multiprocessor. 
------------------------------------------------------------------- File: Wu_MASc.ps.Z Title: Processor Scheduling in Multiprogrammed Shared Memory NUMA Multiprocessors Authors: Chee-Shong Wu Where: M.Sc. Thesis Abstract: In a multiprogrammed multiprocessor, the scheduler is not only responsible for deciding when to activate an application and when to suspend it, but is also responsible for determining how many processors to allocate to each application. In a scalable Non-Uniform Memory Access (NUMA) multiprocessor, it must further resolve the problem of which processor(s) to allocate to which application since the memory reference times are not the same for all processor-memory pairs. In this thesis, we study the problem of how to characterize parallel applications and how to apply this knowledge in scheduling for NUMA systems. We also study the performance of several scheduling algorithms in a NUMA environment. These algorithms differ in the frequency of reallocations. We propose two policies, the Static policy and the Immediate Start Static policy, that utilize application characteristics when making scheduling decisions. The performance of these two policies is compared with that of the Dynamic policy, on a NUMA multiprocessor, Hector. --------------------------------------------------------------------- File: Parsons_Sevcik_IPPS95.ps.Z Title: Multiprocessor Scheduling for High-Variability Service Time Distributions Where: IPPS '95 Workshop on Job Scheduling Strategies for Parallel Processing, reprinted in Springer-Verlag Lecture Notes in Computer Science, Vol 949, pages 127--145. Authors: Eric W. Parsons and Kenneth C. Sevcik Keywords: Scheduling, multiprocessor performance Abstract: Many disciplines have been proposed for scheduling and processor allocation in multiprogrammed multiprocessors for parallel processing. These have been, for the most part, designed and evaluated for workloads having relatively low variability in service demand. 
But with reports that variability in service demands at high performance computing centers can actually be quite high, these disciplines must be reevaluated. In this paper, we examine the performance of two well-known static scheduling disciplines, and propose preemptive versions of these that offer much better mean response times when the variability in service demand is high. We argue that, in systems in which dynamic repartitioning in applications is expensive or impossible, these preemptive disciplines are well suited for handling high variability in service demand. --------------------------------------------------------------------- File: Okrieg_PhD.ps.Z Title: HFS: A flexible file system for shared-memory multiprocessors Where: PhD Dissertation, Department of Electrical and Computer Engineering, University of Toronto Authors: Orran Krieger Keywords: File System, I/O, Hurricane, Hector NUMA multiprocessor Abstract: The Hurricane File System (HFS) is designed for large-scale, shared-memory multiprocessors. Its architecture is based on the principle that a file system must support a wide variety of file structures, file system policies and I/O interfaces to maximize performance for a wide variety of applications. HFS uses a novel, object-oriented building-block approach to provide the flexibility needed to support this variety of file structures, policies, and I/O interfaces. File structures can be defined in HFS that optimize for sequential or random access, read-only, write-only or read/write access, sparse or dense data, large or small file sizes, and different degrees of application concurrency. Policies that can be defined on a per-file or per-open instance basis include locking policies, prefetching policies, compression/decompression policies and file cache management policies. In contrast, most existing file systems have been designed to support a single file structure and a small set of policies. 
We have implemented large portions of HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. Also, we show that HFS is able to deliver the full I/O bandwidth of the disks on our system to the applications. --------------------------------------------------------------------- File: Vranesic_etal_IEEEC.ps.Z Title: Hector -- A hierarchically structured shared memory multiprocessor Where: IEEE Computer, 24(1): 72-80, January, 1991. Authors: Z. Vranesic, M. Stumm, D. Lewis and R. White Keywords: Shared memory multiprocessors, slotted rings, NUMA, Scalability. Abstract: Please see the ps file. --------------------------------------------------------------------- File: Kulkarni_etal_317.ps.Z Title: A Generalized Theory of Linear Loop Transformations Where: CSRI Tech Report 317 Authors: D. Kulkarni, M. Stumm, R. Unrau Keywords: Computational Alignment, Computation Decomposition, Linear loop transformations, SPMD code generation. Abstract: In this paper we present a new theory of linear loop transformations called {\em Computation Decomposition and Alignment\/} (CDA). A CDA transformation has two components: {\em Computation Decomposition\/} first decomposes the computations in the loop into computations of finer granularity, from iterations to instances of subexpressions. {\em Computation Alignment\/} then linearly transforms each of these sets of computations, possibly by using a different transformation for each set. This framework subsumes all existing linear transformation frameworks in that it reduces to a conventional linear loop transformation when the smallest granularity is an iteration, and it reduces to some of the more recently extended frameworks when the smallest granularity is a statement instance. 
The possibility of being able to align computations at arbitrary granularities adds a new dimension to performance optimization on high performance computing platforms. We describe Computation Decomposition and Alignment and provide examples of CDA transformations. We present some heuristics to derive appropriate CDA transformations, given a desired optimization objective. We present the results of experiments run on the KSR1 multiprocessor and various RS6000 and Sparc platforms that demonstrate that CDA can result in substantial performance improvements. --------------------------------------------------------------------- File: Kulkarni_Stumm_ACJ95.ps.Z Title: Linear Loop Transformations in Optimizing Compilers for Parallel Machines Where: To appear in the Australian Computer Journal Authors: D. Kulkarni, M. Stumm Keywords: Linear loop transformations Abstract: We present the linear loop transformation framework which is the formal basis for state-of-the-art optimization techniques in restructuring compilers for parallel machines. The framework unifies most existing transformations and provides a systematic set of code generation techniques for arbitrary compound loop transformations. The algebraic representation of the loop structure and its transformation give way to quantitative techniques for optimizing performance on parallel machines. We discuss in detail the techniques for generating the transformed loop and deriving the desired linear transformation. --------------------------------------------------------------------- File: Manjikian_Abdelrahman_315.ps.Z Title: Fusion of Loops for Parallelism and Locality Where: CSRI Tech Report 315 Authors: N. Manjikian and T. Abdelrahman Keywords: Loop fusion, cache performance, locality, NUMA Abstract: Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. 
Even when legal, fusion may introduce loop-carried dependences which reduce parallelism. In addition, performance losses result from cache conflicts in fused loops. We present new, systematic techniques which: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) allow parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We evaluate our techniques on a 56-processor KSR2 multiprocessor, and show performance improvements of up to 20% for representative loop nest sequences. The results also indicate a performance tradeoff as more processors are used, suggesting a careful evaluation of the profitability of fusion. <!---------------------------------------------------------------------> <HR> <A NAME="Kulkarni_Stumm_LCR95">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Kulkarni_Stumm_LCR95.ps.Z">CDA Loop Transformations</A> <P> <B>Authors:</B> Dattatraya Kulkarni and Michael Stumm <P> <B>Where:</B> Proceedings of the Third Workshop on Languages, Compilers and Run-Time Systems for Scalable Computers, Troy, NY, May 1995, Kluwer Academic. <P> <B>Abstract:</B> <P> In this paper we present a new loop transformation technique called {\em Computation Decomposition and Alignment\/} (CDA). {\em Computation Decomposition\/} first decomposes the iteration space into finer computation spaces. {\em Computation Alignment\/} then linearly transforms each computation space independently. CDA is a general framework in that linear transformations and their recent extensions are just special cases of CDA. CDA's fine grained loop restructuring can incur considerable computational effort, but can exploit optimization opportunities that earlier frameworks cannot. We present four optimization contexts in which CDA can be useful. Our initial experiments demonstrate that CDA adds a new dimension to performance optimization. 
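The fusion-preventing dependence mentioned in the loop fusion abstract (CSRI Tech Report 315) above can be made concrete with a small, generic sketch. It is an illustration of loop shifting in general, not the algorithm from the report; the function names and the arrays `a`, `b`, `c` are assumptions. The second loop reads `a[i + 1]`, so fusing the two loops iteration for iteration would read a value before it is written; shifting the second statement by one iteration makes the fused loop produce the same result.

```python
def unfused(b):
    """Two separate loops: the second reads a[i + 1] written by the first."""
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = b[i] + 1
    for i in range(n - 1):
        c[i] = a[i + 1] * 2
    return a, c


def fused_with_shift(b):
    """One loop: the second statement is shifted by one iteration.

    At iteration i the loop performs iteration i of the first loop and
    iteration i - 1 of the second, so a[i] is always written before use.
    """
    n = len(b)
    a = [0] * n
    c = [0] * n
    for i in range(n):
        a[i] = b[i] + 1
        if i >= 1:
            c[i - 1] = a[i] * 2
    return a, c


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5]
    print(unfused(data) == fused_with_shift(data))
```

The fused version touches each element of `a` once while it is still likely to be in cache, which is the locality benefit the abstract refers to.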
<!---------------------------------------------------------------------> <HR> <A NAME="Kulkarni_Stumm_EuroPar95">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Kulkarni_Stumm_EuroPar95.ps.Z">Implementing Flexible Computation Rules with Subexpression-level Loop Transformations</A> <P> <B>Authors:</B> Dattatraya Kulkarni, Michael Stumm and Ronald C. Unrau <P> <B>Where:</B>Proceedings of the Euro-Par95, Stockholm, Aug 28-31, 1995. <P> <B>Abstract:</B> <P> Computation Decomposition and Alignment (CDA) is a new loop transformation framework that extends the linear loop transformation framework and the more recently proposed Computation Alignment frameworks by linearly transforming computations at the granularity of subexpressions. It can be applied to achieve a number of optimization objectives, including the removal of data alignment constraints, the elimination of ownership tests, the reduction of cache conflicts, and improvements in data access locality. In this paper we show how CDA can be used to effectively implement flexible computation rules with the objective of minimizing communication and, whenever possible, eliminating intrinsics that test whether computations need to be executed or not. We describe CDA, show how it can be used to implement flexible computation rules, and present an algorithm for deriving appropriate CDA transformations. <!---------------------------------------------------------------------> <HR> <A NAME="Unrau_etal_EuroPar95">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Unrau_etal_EuroPar95.ps.Z">On the Scalability of Demand-Driven Parallel Systems </A> <P> <B>Authors:</B> Ronald C. Unrau and Michael Stumm and Orran Krieger <P> <B>Where:</B>Proceedings of the Euro-Par95, Stockholm, Aug 28-31, 1995. <P> <B>Abstract:</B> <P> Demand-driven systems follow the model where customers enter the system, request some service, and then depart. 
Examples are databases, transaction processing systems and operating systems, which form the system software layer between the applications and the hardware. Achieving scalability at the system software layer is critical for the scalability of the system as a whole, and yet this layer has largely been ignored. In this paper, we characterize the scalability of the system software layer of demand-driven parallel systems based on fundamental metrics of quantitative system performance analysis. We develop a set of sufficient conditions so that if a system satisfies these conditions, then the system is scalable. We further argue that in practice these conditions are also necessary. In the remainder of the paper, we use the necessary and sufficient conditions to develop a set of practical design guidelines, to study the effect of application workloads, and to examine the scalability behavior of a system with only a limited number of processors. <!---------------------------------------------------------------------> <HR> <A NAME="Parsons_etal_IWOOS95">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Parsons_etal_IWOOS95.ps.Z">(De-)Clustering Objects for Multiprocessor System Software </A> <P> <B>Authors:</B> Eric Parsons, Ben Gamsa, Orran Krieger, Michael Stumm <P> <B>Where:</B> IWOOS95 (Fourth International Workshop on Object Orientation in Operating Systems 95) <P> <B>Abstract:</B> <P> Designing system software for large-scale shared-memory multiprocessors is challenging because of the level of performance demanded by the application workload and the distributed nature of the system. Adopting an object-oriented approach for our system, we have developed a framework for de-clustering objects, where each object may migrate, replicate, and distribute all or part of its data across the system memory using the policies that will best meet the locality requirements for that data. 
The mechanism for object invocation hides the internal structure of an object, allowing a request to be made directly to the most suitable part of the object on a per-processor basis without any knowledge of how the object is de-clustered. Method invocation is very efficient, both within and across address spaces, involving no remote memory accesses in the common case. We describe the design and implementation of this framework in Tornado, our multiprocessor operating system. <!---------------------------------------------------------------------> <HR> <A NAME="Ben_etal_OOPSLAW94">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Ben_etal_OOPSLAW94.ps.Z">The Importance of Performance-Oriented Flexibility in System Software for Large-Scale Shared-Memory Multiprocessors </A> <P> <B>Authors:</B> Orran Krieger, Ben Gamsa, Karen Reid, Paul Lu, Eric Parsons and Michael Stumm <P> <B>Where:</B> OOPSLA Workshop on Flexible System Software. October 1994. <P> <B>Abstract:</B> <P> See paper for abstract. <!---------------------------------------------------------------------> <HR> <A NAME="Orran_etal_SPDPW95">.</A> <HR> <B>Title:</B> <A HREF="ftp://ftp.cs.toronto.edu/pub/parallel/Orran_etal_SPDPW95.ps.Z"> Exploiting Mapped Files for Parallel I/O </A> <P> <B>Authors:</B> Orran Krieger, Karen Reid and Michael Stumm <P> <B>Where:</B> SPDP Workshop on Modeling and Specification of I/O (MSIO), October 1995 <P> <B>Abstract:</B> <P> Harnessing the full I/O capabilities of a large-scale multiprocessor is difficult and requires a great deal of cooperation between the application programmer, the compiler and the operating (/file) system. Hence, the parallel I/O interface used by the application to communicate with the system is crucial in achieving good performance. We present a set of properties we believe that a good I/O interface should have and consider current parallel I/O interfaces from the perspective of these properties. 
We describe the advantages and disadvantages of mapped-file I/O and argue that if properly implemented it can be a good basis for a parallel I/O interface that can fulfill the suggested properties. To demonstrate that such an implementation is feasible, we describe methodology used in our previous work on the Hurricane operating system and in our current work on the Tornado operating system to implement mapped files. |
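As a small illustration of the mapped-file I/O style discussed in the last abstract above, the sketch below uses Python's standard `mmap` module on an ordinary file. It says nothing about how Hurricane or Tornado implement mapped files; the helper names `uppercase_in_place` and `demo` and the sample data are assumptions. It only shows the programming model: file contents are read and modified through memory accesses instead of explicit read and write calls.

```python
import mmap
import os
import tempfile


def uppercase_in_place(path):
    """Rewrite a non-empty file's bytes to upper case through a memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:  # length 0 maps the whole file
            m[:] = m[:].upper()  # slice access instead of read()/write()
            m.flush()


def demo():
    """Create a temporary file, modify it via the mapping, return its bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"hurricane and tornado")
        uppercase_in_place(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```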
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Baru_Zilio_PADS93.ps.Z version [562e569ca3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Ben_etal_OOPSLAW94.ps.Z version [62f95e96a8].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Brecht_PhD_303.ps.Z version [c786620163].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Brecht_SEDMS93.ps.Z version [a46f67e4c9].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Curran_Stumm_CS.ps.Z version [1e41b84548].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Gamsa_MASc.ps.Z version [384073afb3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Gamsa_etal_ICPP94.ps.Z version [5c90cfb8a3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Harz_Sevcik_SC93.ps.Z version [e8dfae8a65].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Holliday_Stumm_IEEETC.ps.Z version [5500d6f174].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_Stumm_DAGS93.ps.Z version [fec8d531f0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_Stumm_Unrau_USENIX92.ps.Z version [a50dc80555].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_etal_ICPP93.ps.Z version [be37bd6a4b].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_etal_IEEEComp94.ps.Z version [5727cfd3f1].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_292.ps.Z version [12fc51f1ea].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_ACJ95.ps.Z version [60d0400ff3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_CDA.ps.Z version [e3cb470a06].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_LCR95.ps.Z version [b25e953c9e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_Tutorial.ps version [0f96759886].
more than 10,000 changes
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_Tutorial.ps.Z version [00c2672e37].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_Unrau_EuroPar95.ps.Z version [333b693b11].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_etal_317.ps.Z version [43337cf4e0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kumar_Kulkarni_ICPP91.ps.Z version [1044ec917d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kumar_Kulkarni_ICS92.ps.Z version [d380050e5e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Li_Tandri_et.ps.Z version [048f032c7c].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Manjikian_Abdelrahaman_315.ps.Z version [cd2fdcd627].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/New_Kulkarni_Stumm_Tutorial.ps.Z version [bb636c1af0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/ABSTRACTS.Z version [83ff0aee29].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Baru_Zilio_PADS93.ps.Z version [ef4d38005d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Ben_etal_OOPSLAW94.ps.Z version [071bfc4652].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Brecht_PhD_303.ps.Z version [26f316a9a8].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Brecht_SEDMS93.ps.Z version [1fb4aa619c].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Curran_Stumm_CS.ps.Z version [1126490152].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Gamsa_MASc.ps.Z version [ecab87ac53].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Gamsa_etal_ICPP94.ps.Z version [37e7a07dbe].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Harz_Sevcik_SC93.ps.Z version [cdaa8a7875].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Holliday_Stumm_IEEETC.ps.Z version [9eac90ad8c].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Krieger_Stumm_DAGS93.ps.Z version [a8661224c1].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Krieger_Stumm_Unrau_USENIX92.ps.Z version [a50dc80555].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Krieger_etal_ICPP93.ps.Z version [a93baad145].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Krieger_etal_IEEEComp94.ps.Z version [ce1ab5a5a0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_292.ps.Z version [d4c59465ba].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_ACJ95.ps.Z version [6c3b324030].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_CDA.ps.Z version [e3cb470a06].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_LCR95.ps.Z version [162ba80033].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_Tutorial.ps.Z version [6828d2a140].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_Stumm_Unrau_EuroPar95.ps.Z version [9dac4bd560].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kulkarni_etal_317.ps.Z version [34eccfe707].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kumar_Kulkarni_ICPP91.ps.Z version [3940104ad0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Kumar_Kulkarni_ICS92.ps.Z version [53a23b5e0b].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Li_Tandri_et.ps.Z version [0386572be6].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Manjikian_Abdelrahaman_315.ps.Z version [78d886e478].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/New_Kulkarni_Stumm_Tutorial.ps.Z version [bb636c1af0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Okrieg_PhD.ps.Z version [09970b958c].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Orran_etal_SPDPW95.ps.Z version [19877a720d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Parsons_Sevcik_IPPS95.ps.Z version [b4abb772ee].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Parsons_etal_IWOOS95.ps.Z version [ddc5af20a0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/README.Z version [2cf48fe915].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Sandhu_et_al_PPOPP.ps.Z version [7d2dc2e38a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Sevcik_JPE.ps.Z version [4c05ad6c60].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Sevcik_Zhou_PERF93.ps.Z version [32a44a59ed].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Stumm_Unrau_Krieger_USENIX92.ps.Z version [0eb56da9b7].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Stumm_Vranesic_White_IPPS93.ps.Z version [3e88aab020].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Tandri_Abdel_PDPTA.ps version [2089955127].
%!PS-Adobe-2.0 %%Creator: dvips 5.512 Copyright 1986, 1993 Radical Eye Software %%Title: pdpta.dvi %%CreationDate: Thu Nov 23 17:27:55 1995 %%Pages: 10 %%PageOrder: Ascend %%BoundingBox: 0 0 612 792 %%DocumentFonts: Times-Bold Times-Roman Times-Italic Courier %%EndComments %DVIPSCommandLine: dvips -o pdpta.ps pdpta.dvi %DVIPSSource: TeX output 1995.08.11:1234 %%BeginProcSet: tex.pro /TeXDict 250 dict def TeXDict begin /N{def}def /B{bind def}N /S{exch}N /X{S N} B /TR{translate}N /isls false N /vsize 11 72 mul N /@rigin{isls{[0 -1 1 0 0 0] concat}if 72 Resolution div 72 VResolution div neg scale isls{Resolution hsize -72 div mul 0 TR}if Resolution VResolution vsize -72 div 1 add mul TR matrix currentmatrix dup dup 4 get round 4 exch put dup dup 5 get round 5 exch put setmatrix}N /@landscape{/isls true N}B /@manualfeed{statusdict /manualfeed true put}B /@copies{/#copies X}B /FMat[1 0 0 -1 0 0]N /FBB[0 0 0 0]N /nn 0 N /IE 0 N /ctr 0 N /df-tail{/nn 8 dict N nn begin /FontType 3 N /FontMatrix fntrx N /FontBBox FBB N string /base X array /BitMaps X /BuildChar{ CharBuilder}N /Encoding IE N end dup{/foo setfont}2 array copy cvx N load 0 nn put /ctr 0 N[}B /df{/sf 1 N /fntrx FMat N df-tail}B /dfs{div /sf X /fntrx[sf 0 0 sf neg 0 0]N df-tail}B /E{pop nn dup definefont setfont}B /ch-width{ch-data dup length 5 sub get}B /ch-height{ch-data dup length 4 sub get}B /ch-xoff{128 ch-data dup length 3 sub get sub}B /ch-yoff{ch-data dup length 2 sub get 127 sub}B /ch-dx{ch-data dup length 1 sub get}B /ch-image{ch-data dup type /stringtype ne{ctr get /ctr ctr 1 add N}if}B /id 0 N /rw 0 N /rc 0 N /gp 0 N /cp 0 N /G 0 N /sf 0 N /CharBuilder{save 3 1 roll S dup /base get 2 index get S /BitMaps get S get /ch-data X pop /ctr 0 N ch-dx 0 ch-xoff ch-yoff ch-height sub ch-xoff ch-width add ch-yoff setcachedevice ch-width ch-height true[1 0 0 -1 -.1 ch-xoff sub ch-yoff .1 add]{ch-image}imagemask restore}B /D{/cc X dup type /stringtype ne{]}if nn /base get cc ctr put nn /BitMaps get S ctr S sf 1 
ne{dup dup length 1 sub dup 2 index S get sf div put}if put /ctr ctr 1 add N} B /I{cc 1 add D}B /bop{userdict /bop-hook known{bop-hook}if /SI save N @rigin 0 0 moveto /V matrix currentmatrix dup 1 get dup mul exch 0 get dup mul add .99 lt{/QV}{/RV}ifelse load def pop pop}N /eop{SI restore showpage userdict /eop-hook known{eop-hook}if}N /@start{userdict /start-hook known{start-hook} if pop /VResolution X /Resolution X 1000 div /DVImag X /IE 256 array N 0 1 255 {IE S 1 string dup 0 3 index put cvn put}for 65781.76 div /vsize X 65781.76 div /hsize X}N /p{show}N /RMat[1 0 0 -1 0 0]N /BDot 260 string N /rulex 0 N /ruley 0 N /v{/ruley X /rulex X V}B /V{}B /RV statusdict begin /product where{ pop product dup length 7 ge{0 7 getinterval dup(Display)eq exch 0 4 getinterval(NeXT)eq or}{pop false}ifelse}{false}ifelse end{{gsave TR -.1 -.1 TR 1 1 scale rulex ruley false RMat{BDot}imagemask grestore}}{{gsave TR -.1 -.1 TR rulex ruley scale 1 1 false RMat{BDot}imagemask grestore}}ifelse B /QV{ gsave transform round exch round exch itransform moveto rulex 0 rlineto 0 ruley neg rlineto rulex neg 0 rlineto fill grestore}B /a{moveto}B /delta 0 N /tail{dup /delta X 0 rmoveto}B /M{S p delta add tail}B /b{S p tail}B /c{-4 M} B /d{-3 M}B /e{-2 M}B /f{-1 M}B /g{0 M}B /h{1 M}B /i{2 M}B /j{3 M}B /k{4 M}B /w{0 rmoveto}B /l{p -4 w}B /m{p -3 w}B /n{p -2 w}B /o{p -1 w}B /q{p 1 w}B /r{ p 2 w}B /s{p 3 w}B /t{p 4 w}B /x{0 S rmoveto}B /y{3 2 roll p a}B /bos{/SS save N}B /eos{SS restore}B end %%EndProcSet %%BeginProcSet: texps.pro TeXDict begin /rf{findfont dup length 1 add dict begin{1 index /FID ne 2 index /UniqueID ne and{def}{pop pop}ifelse}forall[1 index 0 6 -1 roll exec 0 exch 5 -1 roll VResolution Resolution div mul neg 0 0]/Metrics exch def dict begin Encoding{exch dup type /integertype ne{pop pop 1 sub dup 0 le{pop}{[}ifelse}{ FontMatrix 0 get div Metrics 0 get div def}ifelse}forall Metrics /Metrics currentdict end def[2 index currentdict end definefont 3 -1 roll makefont /setfont load]cvx 
def}def /ObliqueSlant{dup sin S cos div neg}B /SlantFont{4 index mul add}def /ExtendFont{3 -1 roll mul exch}def /ReEncodeFont{/Encoding exch def}def end %%EndProcSet %%BeginProcSet: special.pro TeXDict begin /SDict 200 dict N SDict begin /@SpecialDefaults{/hs 612 N /vs 792 N /ho 0 N /vo 0 N /hsc 1 N /vsc 1 N /ang 0 N /CLIP 0 N /rwiSeen false N /rhiSeen false N /letter{}N /note{}N /a4{}N /legal{}N}B /@scaleunit 100 N /@hscale{@scaleunit div /hsc X}B /@vscale{@scaleunit div /vsc X}B /@hsize{/hs X /CLIP 1 N}B /@vsize{/vs X /CLIP 1 N}B /@clip{/CLIP 2 N}B /@hoffset{/ho X}B /@voffset{/vo X}B /@angle{/ang X}B /@rwi{10 div /rwi X /rwiSeen true N}B /@rhi {10 div /rhi X /rhiSeen true N}B /@llx{/llx X}B /@lly{/lly X}B /@urx{/urx X}B /@ury{/ury X}B /magscale true def end /@MacSetUp{userdict /md known{userdict /md get type /dicttype eq{userdict begin md length 10 add md maxlength ge{/md md dup length 20 add dict copy def}if end md begin /letter{}N /note{}N /legal{ }N /od{txpose 1 0 mtx defaultmatrix dtransform S atan/pa X newpath clippath mark{transform{itransform moveto}}{transform{itransform lineto}}{6 -2 roll transform 6 -2 roll transform 6 -2 roll transform{itransform 6 2 roll itransform 6 2 roll itransform 6 2 roll curveto}}{{closepath}}pathforall newpath counttomark array astore /gc xdf pop ct 39 0 put 10 fz 0 fs 2 F/|______Courier fnt invertflag{PaintBlack}if}N /txpose{pxs pys scale ppr aload pop por{noflips{pop S neg S TR pop 1 -1 scale}if xflip yflip and{pop S neg S TR 180 rotate 1 -1 scale ppr 3 get ppr 1 get neg sub neg ppr 2 get ppr 0 get neg sub neg TR}if xflip yflip not and{pop S neg S TR pop 180 rotate ppr 3 get ppr 1 get neg sub neg 0 TR}if yflip xflip not and{ppr 1 get neg ppr 0 get neg TR}if}{noflips{TR pop pop 270 rotate 1 -1 scale}if xflip yflip and{TR pop pop 90 rotate 1 -1 scale ppr 3 get ppr 1 get neg sub neg ppr 2 get ppr 0 get neg sub neg TR}if xflip yflip not and{TR pop pop 90 rotate ppr 3 get ppr 1 get neg sub neg 0 TR}if yflip xflip not and{TR pop 
pop 270 rotate ppr 2 get ppr 0 get neg sub neg 0 S TR}if}ifelse scaleby96{ppr aload pop 4 -1 roll add 2 div 3 1 roll add 2 div 2 copy TR .96 dup scale neg S neg S TR}if}N /cp{pop pop showpage pm restore}N end}if}if}N /normalscale{Resolution 72 div VResolution 72 div neg scale magscale{DVImag dup scale}if 0 setgray}N /psfts{S 65781.76 div N}N /startTexFig{/psf$SavedState save N userdict maxlength dict begin /magscale false def normalscale currentpoint TR /psf$ury psfts /psf$urx psfts /psf$lly psfts /psf$llx psfts /psf$y psfts /psf$x psfts currentpoint /psf$cy X /psf$cx X /psf$sx psf$x psf$urx psf$llx sub div N /psf$sy psf$y psf$ury psf$lly sub div N psf$sx psf$sy scale psf$cx psf$sx div psf$llx sub psf$cy psf$sy div psf$ury sub TR /showpage{}N /erasepage{}N /copypage{}N /p 3 def @MacSetUp}N /doclip{psf$llx psf$lly psf$urx psf$ury currentpoint 6 2 roll newpath 4 copy 4 2 roll moveto 6 -1 roll S lineto S lineto S lineto closepath clip newpath moveto}N /endTexFig{end psf$SavedState restore}N /@beginspecial{ SDict begin /SpecialSave save N gsave normalscale currentpoint TR @SpecialDefaults count /ocount X /dcount countdictstack N}N /@setspecial{CLIP 1 eq{newpath 0 0 moveto hs 0 rlineto 0 vs rlineto hs neg 0 rlineto closepath clip}if ho vo TR hsc vsc scale ang rotate rwiSeen{rwi urx llx sub div rhiSeen{ rhi ury lly sub div}{dup}ifelse scale llx neg lly neg TR}{rhiSeen{rhi ury lly sub div dup scale llx neg lly neg TR}if}ifelse CLIP 2 eq{newpath llx lly moveto urx lly lineto urx ury lineto llx ury lineto closepath clip}if /showpage{}N /erasepage{}N /copypage{}N newpath}N /@endspecial{count ocount sub{pop}repeat countdictstack dcount sub{end}repeat grestore SpecialSave restore end}N /@defspecial{SDict begin}N /@fedspecial{end}B /li{lineto}B /rl{ rlineto}B /rc{rcurveto}B /np{/SaveX currentpoint /SaveY X N 1 setlinecap newpath}N /st{stroke SaveX SaveY moveto}N /fil{fill SaveX SaveY moveto}N /ellipse{/endangle X /startangle X /yrad X /xrad X /savematrix matrix currentmatrix N 
TR xrad yrad scale 0 0 1 startangle endangle arc savematrix setmatrix}N end %%EndProcSet TeXDict begin 40258431 52099146 1000 300 300 (/stumm/a0/tandri/pdpta/pdpta.dvi) @start /Fa 175[27 7[27 1[27 70[{}3 45.833332 /Courier rf /Fb 80[25 25 51[20 23 23 33 23 23 13 18 15 23 23 23 23 36 13 23 1[13 23 23 15 20 23 20 23 20 3[15 1[15 2[33 2[33 28 25 30 1[25 33 33 41 28 33 1[15 33 1[25 28 33 30 30 33 5[13 3[23 23 4[23 2[11 15 11 1[23 15 15 3[23 2[15 33[{}60 45.833332 /Times-Roman rf /Fc 81[29 51[23 26 2[26 29 16 23 23 2[29 29 42 16 2[16 29 29 16 26 29 26 29 29 13[29 2[36 42 1[48 6[36 1[42 39 1[36 11[29 29 29 29 29 2[15 19 45[{}36 58.333336 /Times-Italic rf /Fd 134[30 2[30 30 30 30 30 1[30 30 30 30 30 30 1[30 30 30 30 30 30 30 30 30 12[30 6[30 3[30 2[30 30 30 30 30 30 14[30 4[30 30 1[30 30 30 40[{}36 50.000000 /Courier rf /Fe 134[22 22 33 1[25 14 19 19 25 25 25 25 36 14 22 1[14 25 25 14 22 25 22 25 25 9[41 2[28 25 30 1[30 36 1[41 28 33 22 17 36 2[30 36 33 1[30 7[25 4[25 25 25 25 2[12 17 5[17 39[{}47 50.000000 /Times-Italic rf /Ff 1 1 df<FFFFF0FFFFF014027D881B>0 D E /Fg 4 117 df<1F0006000600060006000C000C000C00 0C0018F01B181C08180838183018301830306030603160616062C022C03C10177E9614>104 D<0300038003000000000000000000000000001C002400460046008C000C001800180018003100 3100320032001C0009177F960C>I<383C0044C6004702004602008E06000C06000C06000C0C00 180C00180C40181840181880300880300F00120E7F8D15>110 D<030003000600060006000600 FFC00C000C000C001800180018001800300030803080310031001E000A147F930D>116 D E /Fh 3 3 df<FFFFFFFCFFFFFFFC1E027C8C27>0 D<70F8F8F87005057C8E0E>I<C00003E0 000770000E38001C1C00380E00700700E00381C001C38000E700007E00003C00003C00007E0000 E70001C3800381C00700E00E00701C003838001C70000EE00007C000031818799727>I E /Fi 4 62 dfj 16 111 dfk 134[21 1[30 1[21 12 16 14 1[21 21 21 32 12 2[12 21 21 14 18 21 18 21 18 3[14 1[14 17[14 5[28 8[12 21 21 5[21 21 1[12 10 14 45[{}32 41.666668 /Times-Roman rf /Fl 203[15 15 15 15 49[{}4 29.166668 /Times-Roman rf /Fm 203[17 17 17 17 17 48[{}5 
33.333332 /Times-Roman rf /Fn 138[39 23 27 31 1[39 35 39 59 20 39 1[20 1[35 23 31 39 31 39 35 9[71 4[51 1[43 6[27 2[43 1[51 51 11[35 35 35 35 35 35 35 49[{}32 70.833336 /Times-Bold rf /Fo 69[22 8[25 1[28 28 3[22 47[22 25 25 36 25 25 14 19 17 25 25 25 25 39 14 25 14 14 25 25 17 22 25 22 25 22 3[17 1[17 30 2[47 36 36 30 28 33 1[28 36 36 44 30 36 19 17 36 36 28 30 36 33 33 36 3[28 1[14 14 25 25 25 25 25 25 25 25 25 25 1[12 17 12 2[17 17 17 39[{}75 50.000000 /Times-Roman rf /Fp 139[17 19 22 14[22 28 25 31[36 65[{}7 50.000000 /Times-Bold rf /Fq 2 104 dfr 134[29 2[29 29 16 23 19 1[29 29 29 45 16 29 1[16 29 29 19 26 29 26 29 26 11[42 36 32 5[52 7[36 42 39 1[42 54 5[16 4[29 29 2[29 2[15 19 15 44[{}37 58.333336 /Times-Roman rf /Fs 134[42 3[46 28 32 37 1[46 42 46 69 23 2[23 46 42 1[37 46 37 46 42 13[46 2[51 2[78 8[60 60 67[{}23 83.333336 /Times-Bold rf end %%EndProlog %%BeginSetup %%Feature: *Resolution 300dpi TeXDict begin %%EndSetup %%Page: 1 1 1 0 bop 80 177 a Fs(Computation)19 b(and)i(Data)e(Partitioning)g(on)h (Scalable)341 280 y(Shar)o(ed)g(Memory)g(Multipr)o(ocessors)403 451 y Fr(Sudarsan)15 b(T)l(andri)29 b(and)g(T)l(arek)14 b(S.)g(Abdelrahman) 316 526 y(Department)h(of)g(Electrical)f(and)h(Computer)g(Engineering)284 601 y(The)f(University)h(of)g(T)l(oronto,)f(T)l(oronto,)g(Canada,)f(M5S)i (1A4)478 675 y(e-mail:)g Fq(f)p Fr(tandri,tsa)p Fq(g)p Fr(@eecg.toronto.edu) 833 865 y Fp(Abstract)217 945 y Fo(In)g(this)h(paper)f(we)h(identify)f(the)h (factors)f(that)h(af)o(fect)f(the)h(derivation)e(of)i(com-)217 999 y(putation)10 b(and)h(data)g(partitions)g(on)g(scalable)g(shared)g (memory)g(multiprocessors)217 1053 y(\(SSMMs\).)18 b(W)l(e)12 b(show)h(that)f(these)h(factors)f(necessitate)i(an)e(SSMM-conscious)217 1107 y(approach.)17 b(In)10 b(addition)g(to)g(remote)g(memory)f(access,)k (which)d(is)h(the)f(sole)h(factor)217 1161 y(on)19 b(distributed)g(memory)f (multiprocessors,)k(cache)d(af)o(\256nity)m(,)i(memory)e(con-)217 1216 y(tention)12 
b(and)h(false)g(sharing)f(are)h(important)f(factors)g(that) h(must)g(be)g(considered.)217 1270 y(Experimental)g(evidence)h(is)g (presented)g(to)g(demonstrate)f(the)h(impact)f(of)h(these)217 1324 y(factors)i(on)g(performance)g(using)g(three)h(applications)f(on)h(the)f (KSR1)h(and)f(the)217 1378 y(Hector)c(multiprocessors.)4 1540 y Fn(1)71 b(Intr)o(oduction)4 1667 y Fo(Scalable)12 b(shared)g(memory)f (multiprocessors)g(\(SSMMs\))g(are)h(becoming)f(increasingly)h(popular)f(and) h(a)4 1721 y(viable)e(alternative)f(to)h(distributed)f(memory)g (multiprocessors)h(\(DMMs\).)17 b(The)11 b(Stanford)e(DASH)g([20],)4 1775 y(FLASH)i([14)o(],)h(the)f(KSR1)f([24],)h(T)m(oronto')m(s)f(Hector)h ([26)o(],)h(NUMAchine)f([1)o(],)h(and)f(the)f(Cray)h(T3D)h([23)o(])4 1830 y(are)d(some)g(SSMMs)g(currently)e(in)i(use)g(or)f(under)g(development.) 17 b(Processors)9 b(in)f(a)h(SSMM)g(share)g(a)g(single)4 1884 y(coherent)f(address)g(space.)17 b(However)n(,)9 b(shared)f(memory)g(is)g (physically)g(distributed)g(to)f(allow)h(scalability)l(,)4 1938 y(as)17 b(shown)f(in)g(Figure)f(1.)29 b(This)17 b(distribution)e(of)g (shared)i(memory)e(results)h(in)g(non-uniform)e(memory)4 1992 y(access)f(latencies,)g(depending)f(on)f(the)h(distance)h(between)f(a)g (processor)f(and)h(memory)m(.)17 b(Consequently)m(,)4 2046 y(careful)12 b(placement)g(and)g(management)g(of)g(data)h(is)g(essential)g (for)e(scaling)i(performance.)77 2122 y(W)l(e)i(believe)f(that)g(data)g (distribution)732 2104 y Fm(1)764 2122 y Fo(is)g(a)h(good)f(paradigm)f(for)h (managing)f(data)i(in)f(data-parallel)4 2176 y(applications)h(on)g(SSMMs)g ([3)o(,)h(21].)25 b(The)16 b(division)e(of)h(array)f(data)h(allows)g(a)g (compiler)f(to)h(place)g(data)4 2230 y(in)g(the)g(physical)f(memory)g(of)h (the)g(processor)f(that)h(uses)h(it)e(the)h(most,)h(and)f(also)g(allows)g (the)g(compiler)4 2284 y(to)k(partition)f(the)h(computations)g(of)f(parallel) h(loops.)38 b(W)l(e)19 b(have)g(experimented)g(with)f(programmer)4 2339 y(speci\256ed)12 
b(data)f(distributions)g(on)g(the)h(Hector)f (multiprocessor)f(and)i(have)g(found)e(them)h(to)h(be)f(ef)o(fective)4 2393 y(in)e(improving)e(performance.)16 b(However)n(,)10 b(the)e(task)h(of)g (selecting)g(a)g(good)f(data)h(distribution)f(requires)g(the)4 2447 y(programmer)i(to)h(understand)g(both)f(the)i(parallel)e(machine)h (architecture)g(and)g(the)g(data)g(access)i(patterns)4 2501 y(in)19 b(the)f(program.)37 b(Porting)17 b(programs)h(to)h(various)g (machines)g(and)f(tuning)h(them)f(for)g(performance)4 2555 y(becomes)g(a)f(tedious)g(and)g(laborious)g(process.)33 b(Consequently)m(,)19 b(it)e(is)h(desirable)f(to)g(derive)f(data)i(and)4 2609 y(computation)h (partitions)g(automatically)h(using)g(a)g(compiler)m(.)40 b(The)21 b(objective)e(of)h(this)g(paper)g(is)g(to)4 2664 y(describe)13 b(the)f(factors)g(that)g(af)o(fect)g(the)g(derivation)g(of)g(computation)f (and)i(data)f(partitions)g(on)g(SSMMs.)77 2739 y(On)19 b(DMMs,)k(the)c(main)g (factor)f(that)i(af)o(fects)f(the)g(performance)f(of)h(an)g(application)g(is) g(the)g(cost)4 2793 y(of)d(interprocessor)f(communication.)28 b(Consequently)m(,)17 b(scalable)g(performance)e(can)h(be)g(achieved)g(by)p 4 2838 737 2 v 62 2869 a Fl(1)79 2884 y Fk(In)10 b(this)f(paper)i(we)g(use)f (the)g(terms)h(data)g(distributi)o(ons)c(and)k(data)f(partitions)f (interchangeably)m(.)p eop %%Page: 2 2 2 1 bop 175 533 a @beginspecial 114 @llx 408 @lly 476 @urx 553 @ury 3600 @rwi @setspecial %%BeginDocument: numaarch1.ps /arrowHeight 10 def /arrowWidth 5 def /IdrawDict 51 dict def IdrawDict begin /reencodeISO { dup dup findfont dup length dict begin { 1 index /FID ne { def }{ pop pop } ifelse } forall /Encoding ISOLatin1Encoding def currentdict end definefont } def /ISOLatin1Encoding [ /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef 
/space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright /parenleft/parenright/asterisk/plus/comma/minus/period/slash /zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon /less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N /O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright /asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m /n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/dotlessi/grave/acute/circumflex/tilde/macron/breve /dotaccent/dieresis/.notdef/ring/cedilla/.notdef/hungarumlaut /ogonek/caron/space/exclamdown/cent/sterling/currency/yen/brokenbar /section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot /hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior /acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine /guillemotright/onequarter/onehalf/threequarters/questiondown /Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla /Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex /Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis /multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute /Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis /aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave /iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex /otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis /yacute/thorn/ydieresis ] def /Times-Roman reencodeISO def /none null def /numGraphicParameters 17 def /stringLimit 65535 def /Begin { save numGraphicParameters dict begin } def /End { end restore } def /SetB { dup type /nulltype eq { pop false /brushRightArrow idef false /brushLeftArrow idef true /brushNone idef } { /brushDashOffset idef /brushDashArray idef 0 ne /brushRightArrow idef 0 ne /brushLeftArrow idef /brushWidth idef false /brushNone idef } ifelse } 
def /SetCFg { /fgblue idef /fggreen idef /fgred idef } def /SetCBg { /bgblue idef /bggreen idef /bgred idef } def /SetF { /printSize idef /printFont idef } def /SetP { dup type /nulltype eq { pop true /patternNone idef } { dup -1 eq { /patternGrayLevel idef /patternString idef } { /patternGrayLevel idef } ifelse false /patternNone idef } ifelse } def /BSpl { 0 begin storexyn newpath n 1 gt { 0 0 0 0 0 0 1 1 true subspline n 2 gt { 0 0 0 0 1 1 2 2 false subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 2 copy false subspline } if n 2 sub dup n 1 sub dup 2 copy 2 copy false subspline patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Circ { newpath 0 360 arc patternNone not { ifill } if brushNone not { istroke } if } def /CBSpl { 0 begin dup 2 gt { storexyn newpath n 1 sub dup 0 0 1 1 2 2 true subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 0 0 false subspline n 2 sub dup n 1 sub dup 0 0 1 1 false subspline patternNone not { ifill } if brushNone not { istroke } if } { Poly } ifelse end } dup 0 4 dict put def /Elli { 0 begin newpath 4 2 roll translate scale 0 0 1 0 360 arc patternNone not { ifill } if brushNone not { istroke } if end } dup 0 1 dict put def /Line { 0 begin 2 storexyn newpath x 0 get y 0 get moveto x 1 get y 1 get lineto brushNone not { istroke } if 0 0 1 1 leftarrow 0 0 1 1 rightarrow end } dup 0 4 dict put def /MLine { 0 begin storexyn newpath n 1 gt { x 0 get y 0 get moveto 1 1 n 1 sub { /i exch def x i get y i get lineto } for patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Poly { 3 1 roll newpath 
[Figure 1 (diagram). Labels: Procr, Cache, Mem, Local memory, Remote memory, Interconnection Network.]

Figure 1: Scalable shared-memory multiprocessor architecture.
partitioning data and computations in a way that minimizes interprocessor communications. On SSMMs, processors communicate through shared memory, and the cost of interprocessor communications (i.e., remote memory access) is relatively low. We show that cache affinity, memory contention and false sharing are additional factors that must be considered in the selection of data distributions. Furthermore, the presence of a single shared address space allows flexibility in the selection of a computation partition. Specifically, we show that relaxing the commonly used owner-computes rule [15] has performance advantages. We present experimental results to support our conclusions using three applications on two SSMMs, the Hector and the KSR1 multiprocessors.

The remainder of this paper is organized as follows. Section 2 presents an overview of data distributions. Section 3 describes the factors that affect the selection of computation and data partitions on SSMMs. Section 4 gives experimental evidence of the impact of cache affinity and false sharing on the choice of data partitions. Section 5 presents results to show that the flexibility in selecting the computation partitioning can be used to improve performance. Section 6 reviews related work. Finally, Section 7 presents concluding remarks and directions for future work.

2 Data Distributions
Data distribution [15, 16] is achieved by specifying a partitioning scheme for each array in the program and by specifying a processor geometry to which array partitions map. A processor geometry is an $n$-dimensional Cartesian grid of virtual processors $(V_0, V_1, \cdots, V_{n-1})$, where $V_i$ is the number of processors in the $i^{th}$ dimension of the grid, and $V_0 \times V_1 \times \cdots \times V_{n-1} = P$, the total number of processors. A partitioning scheme assigns a partitioning attribute to each dimension of the array. There are four partitioning attributes. The Block attribute divides the corresponding dimension of the array into approximately equal size blocks such that a processor owns a contiguous range of that dimension of the array. The Cyclic attribute divides the corresponding array dimension by distributing the array elements in this dimension to processors in a round-robin fashion. The BlockCyclic attribute first groups array elements in the corresponding dimension into contiguous blocks of a given size, and then assigns the blocks to processors in a round-robin fashion. The block size, called the block-cyclic factor, is supplied by the programmer. Finally, the * attribute is used to indicate that the corresponding dimension of the array is not distributed. The processor geometry onto which the array is mapped determines the number of processors assigned to each distributed dimension of the array. For example, distributing a two dimensional array using the (Block,Block) attributes onto a two dimensional processor geometry of (2,4) distributes the array onto 8 processors, assigning 2 processors to the first dimension and 4 processors to the second dimension.

3 Performance Factors

The main factor that affects the performance of a parallel application on a DMM is the relatively high cost of interprocessor communication. For example, the latency for a remote memory access on the CM5 multiprocessor is approximately 2560 processor cycles^2. This necessitates the selection of computation and data partitions that minimize the cost of communication. In contrast, on SSMMs, processors communicate through shared memory and the cost of remote memory access is relatively small. For example, the cost of a remote read operation on the KSR1 is approximately 170 processor cycles [24]. Consequently, other factors come into play in the selection of computation and data partitions. In this section we
elaborate on these factors and on how they affect performance, and consequently, affect the choice of data and computation partitions.

3.1 Cache Affinity

Caches are used in SSMMs to reduce effective memory access time and reduce contention in the interconnection network. Data is transferred between cache and memory in units of a cache line, typically a multiple of the processor word size. Spatial reuse occurs when other words on the same line are used by the processor before the line is flushed from the cache. Analogously, temporal reuse occurs when data on a cache line is used again before the line is evicted from the cache. The performance of an application depends to a large extent on the ability of the caches to exploit spatial and temporal reuse. In some cases, this may be difficult because of the limited capacity and associativity of caches. Data brought into a cache by a reference or a prefetch may be evicted before being used or reused, because of either a capacity or a conflict miss caused by a subsequent reference. Cache misses on SSMMs adversely affect performance, since evicted data must be retrieved from its home memory, which may be remote to the processor. Caches play a less important role in DMMs
because cache misses result exclusively in local memory accesses, which are inexpensive relative to interprocessor communications.

3.2 False Sharing

In SSMMs data on the same cache line may be shared by more than one processor, and the line may exist in more than one processor's cache at the same time. Hardware is used to maintain the consistency of the multiple copies of the line, typically using a write-invalidate protocol [24, 14]. True sharing occurs when two or more processors access the same data on a cache line, and it reflects necessary data communications in an application. On the other hand, false sharing occurs when two processors access different pieces of data on the same cache line. If processors write to the same cache line, the cache consistency hardware causes the cache line to be transferred back and forth between processors, leading to a "ping-pong" effect [8]. False sharing causes extensive invalidation traffic and can considerably degrade performance. False sharing is non-existent on DMMs.

3.3 Memory Contention

Memory contention occurs when many processors access data in a single memory module at the same time. Since the communication protocol in SSMMs is receiver-initiated, and transfers data in units of relatively small cache lines, a large number of requests to the
same memory can overflow memory buffers and cause excessive delays in memory response time [13]. Contention has been considered less of a performance bottleneck on DMMs because a sender-initiated communication protocol is employed, and because programmers typically communicate data in large infrequent messages. Applications on DMMs also use collective communications [15] that further reduce contention.

^2 Calculated based on the elapsed time for a send-reply message of 128 bytes [19].

3.4 Overhead of Parallelism

In DMMs, synchronization is achieved through data communication. However, on SSMMs, synchronization is explicit and is independent of data communication. The resulting overhead can become a performance bottleneck [27], and must be minimized. The performance of an application is also affected by the overhead involved in parallelizing loops, manifested in the form of computation partitioning tests [25]. These tests can be eliminated in some cases by compiler analysis, but when this is not possible, they can degrade performance. This overhead, though also present in the case of DMMs, is not considered significant there because of the predominantly high cost of remote memory access.

4 Impact on Data Distribution

In this section we use two applications, Multigrid and Tred2, to illustrate the impact
of cache affinity and false sharing on the choice of a data distribution. The KSR1 system is used because of its large cache size, and because of the presence of monitoring hardware that enables the measurement of the number of non-local memory accesses and the number of cache misses for a processor.

The KSR1 is a Cache Only Memory Architecture (COMA) configured as a hierarchy of slotted rings with processing cells on the leaf-level rings. The local portion of shared memory associated with a processor is organized as a cache. There is no home location for data; rather, data may exist in more than one local memory. The hardware maintains the consistency of possible multiple copies of the data.

The KSR1 implicitly implements the owner-computes rule, since data written by a processor must exclusively reside in the processor's local portion of the shared memory. Hardware automatically migrates data to the processor that requests the data in units of subpages. Hence, the computation partitioning of a loop dictates the residence of a data item and hence the distribution of the arrays in the loop. Data which is read by the processors may exist in multiple local memories, and read requests to this data from different processors may be satisfied from different portions of the shared memory.

4.1 Cache-Conscious Data Distribution

The Multigrid application from the NAS suite of benchmarks illustrates how data distributions must be cache-conscious. Multigrid is a three dimensional solver calculating the potential field on a cubical grid. We focus on the subroutine psinv, which uses two 3-dimensional arrays $U$ and $R$. The subroutine mainly performs the following computation inside a triply nested loop: $U(i, j, k) \mathrel{+}= \alpha(R(f(i), g(j), h(k)))$, where $f(i) = i - 1$, $i$ or $i + 1$, as are the functions $g$ and $h$. The loop nest is fully parallel. The application has nearest neighbor communications along all three dimensions, which is typical of many scientific applications.

In this application, we choose not to parallelize the innermost loop to avoid cache line false sharing and cache interference; successive iterations of this loop access successive elements on the same cache line. Hence we use a two dimensional grid for the processor geometry. Since the application has nearest neighbor communications, Block distribution performs the best. The restriction of the innermost loop to be sequential requires the arrays to be distributed with (*,Block,Block) since the arrays are assumed to be stored using column major ordering. With 16 processors, it is possible to choose one of the (16,1), (8,2), (4,4), (2,8) and (1,16) processor geometries. The choice of the processor geometry affects the number of processors that execute each parallel loop. For example, a processor geometry of (8,2) implies 8 processors assigned to the inner parallel loop and 2 processors assigned to the outer parallel loop.

[Figure 2 (plot). x-axis: Processor Geometry - 16 Processors, with geometries (16,1), (8,2), (4,4), (2,8), (1,16); y-axis: Normalized Execution Time (w.r.t (16,1)), ticks from 96 to 104; curves for data sizes 160x160x160, 144x144x144 and 64x64x64.]

Figure 2: Normalized Execution time of Multigrid.

Figure 2 shows the execution time of the application for various processor geometries with the (*,Block,Block) distribution for the arrays on the KSR1 with 16 processors, normalized with respect to the (16,1) processor geometry. For a small data size (64x64x64), execution time is minimized by a distribution with an equal number of processors in each dimension, i.e., (4,4). This is the same distribution scheme suggested in the Syracuse High Performance Fortran applications suite^3 for DMMs. However, when the data size is large, the processor geometry (4,4) no longer performs the best. The execution time is minimized with a processor geometry of (8,2).

The impact of processor geometry on performance is due to cache affinity, as can be deduced from Figures 3 and 4. Figure 3 shows the measured number of cache lines accessed from
remote memory modules, normalized with respect to the processor geometry (16,1). The number of remote memory accesses is minimal when the processor geometry is (4,4) for all data sizes. Figure 4 shows the average measured number of cache misses from a processor cache, again normalized with respect to the processor geometry (16,1). When the data size is small (64x64x64), the data used by a processor fits into the 256k processor cache and the misses from the cache in this case reflect remote memory accesses that occur in the parallel program. Hence, the predominant factor affecting performance is interprocessor communication, and the best performance is attained using the (4,4) geometry.

However, when the arrays are relatively large (144x144x144), the cache capacity is no longer sufficient to hold data from successive iterations of the outer parallel loop, and the number of cache misses increases. When the number of processors assigned to the outer loop increases, the number of misses from the cache also increases. The (4,4) processor geometry minimizes the amount of remote memory access, but the (16,1) processor geometry minimizes the number of cache misses. The distribution with (8,2) processor geometry strikes a balance between the cost of remote memory access and the cost of cache misses, resulting in the best
y(overall)c(performance,)g(in)g(spite)g(of)g (higher)g(interprocessor)g(communication)f(cost.)4 2638 y Fc(4.2)58 b(False)14 b(Sharing)g(Conscious)g(Data)h(Distribution)4 2736 y Fo(The)d(programs)f Fd(Tred2)h Fo(\(which)f(is)h(part)f(of)g(Eispack\),)i Fd(mdg)p Fo(,)f(and)g Fd(trfd)f Fo(\(which)g(are)h(both)f(part)h(of)f(the)4 2790 y(Perfect)f(Club)h(Benchmark)f(Suite\))g(exhibit)h(parallelism)f(which)h (result)f(in)h(considerable)g(false)g(sharing.)p 4 2835 737 2 v 62 2865 a Fl(3)79 2880 y Fk(http://www)m(.npac.syr)n(.edu/hpfa/)c(.)p eop %%Page: 6 6 6 5 bop 47 586 a @beginspecial 50 @llx 50 @lly 230 @urx 176 @ury 2057 @rwi @setspecial %%BeginDocument: spmiss.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def 
/D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.050 0.050 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (30) Rshow LTa 600 568 M 2817 0 V LTb 600 568 M 63 0 V 2754 0 R -63 0 V 540 568 M (40) Rshow LTa 600 885 M 2817 0 V LTb 600 885 M 63 0 V 2754 0 R -63 0 V 540 885 M (50) Rshow LTa 600 1202 M 2817 0 V LTb 600 1202 M 63 0 V 2754 0 R -63 0 V -2814 0 R (60) Rshow LTa 600 1518 M 2817 0 V LTb 600 1518 M 63 0 V 2754 0 R -63 0 V -2814 0 R (70) Rshow LTa 600 1835 M 2817 0 V LTb 600 1835 M 63 0 V 2754 0 R -63 0 V -2814 0 R (80) Rshow LTa 600 2152 M 2817 0 V LTb 600 2152 M 63 0 V 2754 0 R -63 0 V -2814 0 R (90) Rshow LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (\(16,1\)) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(8,2\)) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(4,4\)) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(2,8\)) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(1,16\)) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1360 M currentpoint gsave translate 90 
rotate 0 0 M (Normalized subpage misses \(w.r.t. \(16,1\)\)) Cshow grestore 2008 51 M (Processor Geometry - 16 Processors ) Cshow LT1 1654 2106 M (160x160x160) Rshow 1714 2106 M 180 0 V 600 2469 M 1304 1154 L 2009 495 L 705 646 V 3417 2453 L 1774 2106 A 600 2469 A 1304 1154 A 2009 495 A 2713 1141 A 3417 2453 A LT2 1654 2006 M (144x144x144) Rshow 1714 2006 M 180 0 V 600 2469 M 1304 1072 L 2009 473 L 705 567 V 3417 2387 L 1774 2006 B 600 2469 B 1304 1072 B 2009 473 B 2713 1040 B 3417 2387 B LT4 1654 1906 M (64x64x64) Rshow 1714 1906 M 180 0 V 600 2469 M 1304 1113 L 2009 543 L 705 649 V 3417 2444 L 1774 1906 T 600 2469 T 1304 1113 T 2009 543 T 2713 1192 T 3417 2444 T stroke grestore end showpage %%EndDocument @endspecial 124 640 a Fo(Figure)12 b(3.)18 b(Remote)12 b(Memory)g(Access.) 899 586 y @beginspecial 50 @llx 50 @lly 230 @urx 176 @ury 2057 @rwi @setspecial %%BeginDocument: datac.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL 
} def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.050 0.050 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (90) Rshow LTa 600 528 M 2817 0 V LTb 600 528 M 63 0 V 2754 0 R -63 0 V 540 528 M (100) Rshow LTa 600 806 M 2817 0 V LTb 600 806 M 63 0 V 2754 0 R -63 0 V 540 806 M (110) Rshow LTa 600 1083 M 2817 0 V LTb 600 1083 M 63 0 V 2754 0 R -63 0 V -2814 0 R (120) Rshow LTa 600 1360 M 2817 0 V LTb 600 1360 M 63 0 V 2754 0 R -63 0 V -2814 0 R (130) Rshow LTa 600 1637 M 2817 0 V LTb 600 1637 M 63 0 V 2754 0 R -63 0 V -2814 0 R (140) Rshow LTa 600 1915 M 2817 0 V LTb 600 1915 M 63 0 V 2754 0 R -63 0 V -2814 0 R (150) Rshow LTa 600 2192 M 2817 0 V LTb 600 2192 M 63 0 V 2754 0 R -63 0 V -2814 0 R (160) Rshow LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (170) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (\(16,1\)) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(8,2\)) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 
0 2155 R 0 -63 V 0 -2255 R (\(4,4\)) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(2,8\)) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(1,16\)) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1360 M currentpoint gsave translate 90 rotate 0 0 M (Normalized cache misses \(w.r.t \(16,1\)\)) Cshow grestore 2008 51 M (Processor Geometry - 16 Processors ) Cshow LT1 1654 2106 M (160x160x160) Rshow 1714 2106 M 180 0 V 600 528 M 1304 817 L 705 626 V 705 721 V 704 250 V 1774 2106 A 600 528 A 1304 817 A 2009 1443 A 2713 2164 A 3417 2414 A LT2 1654 2006 M (144x144x144) Rshow 1714 2006 M 180 0 V 600 528 M 1304 678 L 705 488 V 705 610 V 704 333 V 1774 2006 B 600 528 B 1304 678 B 2009 1166 B 2713 1776 B 3417 2109 B LT4 1654 1906 M (64x64x64) Rshow 1714 1906 M 180 0 V 600 528 M 1304 329 L 705 -36 V 705 133 V 704 97 V 1774 1906 T 600 528 T 1304 329 T 2009 293 T 2713 426 T 3417 523 T stroke grestore end showpage %%EndDocument @endspecial 1085 640 a(Figure)g(4.)18 b(Cache)13 b(Misses.)47 1297 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: tred2.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def 
/LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (1e+06) Rshow LTb 600 2469 M 63 0 V 2754 0 R -63 0 V 540 2469 M (4e+06) Rshow LTa 600 1360 M 2817 0 V LTb 600 1360 M 31 0 V 2786 0 R -31 0 V LTa 600 2009 M 2817 0 V LTb 600 2009 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 31 0 V 2786 0 R -31 0 V LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 
0 63 V 0 2155 R 0 -63 V 0 -2255 R (8) Cshow LTa 2361 251 M 0 2218 V LTb 2361 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (10) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (12) Cshow LTa 3065 251 M 0 2218 V LTb 3065 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (14) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (16) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1260 M currentpoint gsave translate 90 rotate 0 0 M (Execution Time \(Micro Seconds\)) Cshow grestore 2008 51 M (Number of Processors) Cshow LT0 3054 2306 M ("Cyclic") Rshow 3114 2306 M 180 0 V 776 910 M 176 318 V 352 533 V 705 -387 V 704 -149 V 704 707 V 3174 2306 D 776 910 D 952 1228 D 1304 1761 D 2009 1374 D 2713 1225 D 3417 1932 D LT1 3054 2206 M ("BlockCyclic") Rshow 3114 2206 M 180 0 V 776 942 M 952 817 L 1304 662 L 705 -34 V 704 110 V 704 1480 V 3174 2206 A 776 942 A 952 817 A 1304 662 A 2009 628 A 2713 738 A 3417 2218 A stroke grestore end showpage %%EndDocument @endspecial 140 1351 a(Figure)f(5.)18 b(Ef)o(fect)12 b(of)g(False)h (Sharing.)899 1297 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: tred2c.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 
0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V -2814 0 R (40000) Rshow LTa 600 251 M 2817 0 V LTb 600 251 M 31 0 V 2786 0 R -31 0 V LTa 600 497 M 2817 0 V LTb 600 497 M 31 0 V 2786 0 R -31 0 V LTa 600 697 M 2817 0 V LTb 600 697 M 31 0 V 2786 0 R -31 0 V LTa 600 867 M 2817 0 V LTb 600 867 M 31 0 V 2786 0 R -31 0 V LTa 600 1014 M 2817 0 V LTb 600 1014 M 31 0 V 2786 0 R -31 0 V LTa 600 1144 M 2817 0 V LTb 600 1144 M 31 0 V 2786 0 R -31 0 V LTa 600 1260 M 2817 0 V LTb 600 1260 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100000) Rshow LTa 600 2023 M 2817 0 V LTb 600 2023 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 31 0 V 2786 0 R -31 0 V LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 
V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (8) Cshow LTa 2361 251 M 0 2218 V LTb 2361 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (10) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (12) Cshow LTa 3065 251 M 0 2218 V LTb 3065 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (14) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (16) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 400 1660 M currentpoint gsave translate 90 rotate 0 0 M (Cache Misses) Cshow grestore 2008 51 M (Number of Processors) Cshow LT0 3054 2306 M ("Cyclic") Rshow 3114 2306 M 180 0 V 123 -856 R -704 -3 V -704 172 V -705 449 V 952 1885 L 776 2180 L 3174 2306 D 3417 1450 D 2713 1447 D 2009 1619 D 1304 2068 D 952 1885 D 776 2180 D LT1 3054 2206 M ("BlockCyclic") Rshow 3114 2206 M 180 0 V 3417 1029 M 2713 642 L 2009 475 L -705 601 V 952 1596 L 776 2166 L 3174 2206 A 3417 1029 A 2713 642 A 2009 475 A 1304 1076 A 952 1596 A 776 2166 A stroke grestore end showpage %%EndDocument @endspecial 1085 1351 a(Figure)f(6.)18 b(Cache)13 b(Misses.)4 1510 y(These)18 b(programs)f(have)g(triangular)f(iteration)g(spaces)i(which)f (necessitate)h(cyclical)f(distribution)f(for)4 1564 y(load)c(balancing.)19 b(The)13 b(choice)f(of)g(this)h(distribution)e(combined)h(with)g(the)h (storage)f(order)g(of)g(the)g(arrays)4 1619 y(cause)17 b(more)f(than)g(one)g (processor)g(to)g(share)g(the)g(same)h(cache)f(line,)i(leading)e(to)g(false)g (sharing.)29 b(The)4 1673 y(impact)17 b(of)f(this)h(false)h(sharing)e(is)i (shown)f(in)g(Figure)f(5)h(for)f(the)h Fd(Tred2)g Fo(application)f(on)h(the)g (KSR1)4 1727 y(multiprocessor)m(.)37 b(The)19 b(\256gure)f(shows)h(the)g 
(execution)f(time)g(of)g(the)h(application)f(for)g Fd(Cyclic)g Fo(and)4 1781 y Fd(BlockCyclic)12 b Fo(distributions)g(using)i(1)f(to)g(16)g (processors.)20 b(The)14 b(use)g(of)e(the)h Fd(Cyclic)g Fo(distribution)4 1835 y(results)f(in)g(a)h(lar)o(ge)f(number)f(of)h(cache)h(misses,)g(as)g (can)f(be)g(seen)h(in)f(Figure)g(6.)18 b(The)13 b(resulting)e(overhead)4 1889 y(causes)20 b(execution)f(time)g(to)g(increase)g(as)g(the)g(number)g(of) f(processors)i(increases.)39 b(The)19 b(arrays)g(are)4 1944 y(distributed)c(using)g(a)g Fd(BlockCyclic)f Fo(distribution,)h(where)g(the)g (size)h(of)f(the)g(block)g(is)h(equal)f(to)g(the)4 1998 y(size)22 b(of)e(the)h(cache)g(line,)i(which)e(ef)o(fectively)f(eliminates)h(false)g (sharing.)43 b(When)21 b(the)g(number)f(of)4 2052 y(processors)14 b(is)g(small,)g(the)f(load)h(is)g(relatively)e(well-balanced,)i(and)f(the)h (elimination)e(of)h(false)h(sharing)4 2106 y(improves)h(performance.)25 b(However)n(,)15 b(as)h(the)f(number)f(of)h(processors)g(increases,)i(the)e (load)f(becomes)4 2160 y(increasingly)f(imbalanced,)h(and)f(the)h(negative)f (impact)g(of)g(this)g(load)g(imbalance)h(begins)f(to)g(outweigh)4 2214 y(the)h(bene\256ts)h(of)e(eliminating)h(false)g(sharing.)24 b(A)14 b(compiler)g(for)f(SSMM)h(must)g(consider)h(this)f(tradeof)o(f)4 2269 y(between)f(load)f(imbalance)g(and)g(false)h(sharing)f(when)g (determining)g(data)g(distributions.)4 2450 y Fn(5)71 b(Impact)19 b(on)f(Computation)i(Partitioning)4 2577 y Fo(The)12 b(owner)o(-computes)f (rule)g(has)h(been)f(the)h(computation)f(partitioner)f(of)h(choice)g(for)g (compiling)g(HPF-)4 2631 y(type)17 b(languages)g(on)g(DMMs)h([16].)32 b(The)17 b(owner)o(-computes)f(rule)h(maps)g(a)g(statement)h(such)f(that)g (the)4 2686 y(the)h(computation)e(is)i(executed)g(on)g(the)f(processor)h(on)f (which)h(the)f(data)h(element)f(that)h(is)g(written)e(is)4 2740 y(local.)27 b(All)15 b(the)g(data)g(elements)g(that)g(are)g(required)f (to)h(compute)g(the)g(result)g(\(which)g(may)g(be)g(remote\))4 2794 
y(are)h(communicated)f(to)h(the)g(processor)m(.)29 b(A)16 b(strict)g(rule)f(such)i(as)f(owner)o(-computes)f(is)h(not)g(necessary)4 2848 y(on)h(a)f(SSMM)h(because)g(message)h(passing)f(code)g(is)g(not)f (generated)g(at)h(compile)f(time)g([3].)30 b(In)17 b(some)p eop %%Page: 7 7 7 6 bop 482 311 a @beginspecial 127 @llx 520 @lly 393 @urx 632 @ury 2160 @rwi @setspecial %%BeginDocument: adi.idraw /arrowhead { 0 begin transform originalCTM itransform /taily exch def /tailx exch def transform originalCTM itransform /tipy exch def /tipx exch def /dy tipy taily sub def /dx tipx tailx sub def /angle dx 0 ne dy 0 ne or { dy dx atan } { 90 } ifelse def gsave originalCTM setmatrix tipx tipy translate angle rotate newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto patternNone not { originalCTM setmatrix /padtip arrowHeight 2 exp 0.25 arrowWidth 2 exp mul add sqrt brushWidth mul arrowWidth div def /padtail brushWidth 2 div def tipx tipy translate angle rotate padtip 0 translate arrowHeight padtip add padtail add arrowHeight div dup scale arrowheadpath ifill } if brushNone not { originalCTM setmatrix tipx tipy translate angle rotate arrowheadpath istroke } if grestore end } dup 0 9 dict put def /arrowheadpath { newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto } def /leftarrow { 0 begin y exch get /taily exch def x exch get /tailx exch def y exch get /tipy exch def x exch get /tipx exch def brushLeftArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def /rightarrow { 0 begin y exch get /tipy exch def x exch get /tipx exch def y exch get /taily exch def x exch get /tailx exch def brushRightArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def /arrowHeight 10 def /arrowWidth 5 def /IdrawDict 51 dict def IdrawDict begin /reencodeISO { dup dup findfont dup length dict begin { 1 index /FID ne { def }{ pop pop } ifelse } forall /Encoding 
ISOLatin1Encoding def currentdict end definefont } def /ISOLatin1Encoding [ /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright /parenleft/parenright/asterisk/plus/comma/minus/period/slash /zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon /less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N /O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright /asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m /n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/dotlessi/grave/acute/circumflex/tilde/macron/breve /dotaccent/dieresis/.notdef/ring/cedilla/.notdef/hungarumlaut /ogonek/caron/space/exclamdown/cent/sterling/currency/yen/brokenbar /section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot /hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior /acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine /guillemotright/onequarter/onehalf/threequarters/questiondown /Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla /Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex /Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis /multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute /Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis /aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave /iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex /otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis /yacute/thorn/ydieresis ] def /Helvetica reencodeISO def /none null def /numGraphicParameters 17 def /stringLimit 65535 def /Begin 
{ save numGraphicParameters dict begin } def /End { end restore } def /SetB { dup type /nulltype eq { pop false /brushRightArrow idef false /brushLeftArrow idef true /brushNone idef } { /brushDashOffset idef /brushDashArray idef 0 ne /brushRightArrow idef 0 ne /brushLeftArrow idef /brushWidth idef false /brushNone idef } ifelse } def /SetCFg { /fgblue idef /fggreen idef /fgred idef } def /SetCBg { /bgblue idef /bggreen idef /bgred idef } def /SetF { /printSize idef /printFont idef } def /SetP { dup type /nulltype eq { pop true /patternNone idef } { dup -1 eq { /patternGrayLevel idef /patternString idef } { /patternGrayLevel idef } ifelse false /patternNone idef } ifelse } def /BSpl { 0 begin storexyn newpath n 1 gt { 0 0 0 0 0 0 1 1 true subspline n 2 gt { 0 0 0 0 1 1 2 2 false subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 2 copy false subspline } if n 2 sub dup n 1 sub dup 2 copy 2 copy false subspline patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Circ { newpath 0 360 arc patternNone not { ifill } if brushNone not { istroke } if } def /CBSpl { 0 begin dup 2 gt { storexyn newpath n 1 sub dup 0 0 1 1 2 2 true subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 0 0 false subspline n 2 sub dup n 1 sub dup 0 0 1 1 false subspline patternNone not { ifill } if brushNone not { istroke } if } { Poly } ifelse end } dup 0 4 dict put def /Elli { 0 begin newpath 4 2 roll translate scale 0 0 1 0 360 arc patternNone not { ifill } if brushNone not { istroke } if end } dup 0 1 dict put def /Line { 0 begin 2 storexyn newpath x 0 get y 0 get moveto x 1 get y 1 get lineto brushNone not { istroke } if 0 0 1 1 leftarrow 0 0 1 1 rightarrow end } dup 0 4 dict put def /MLine 
{ 0 begin storexyn newpath n 1 gt { x 0 get y 0 get moveto 1 1 n 1 sub { /i exch def x i get y i get lineto } for patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Poly { 3 1 roll newpath moveto -1 add { lineto } repeat closepath patternNone not { ifill } if brushNone not { istroke } if } def /Rect { 0 begin /t exch def /r exch def /b exch def /l exch def newpath l b moveto l t lineto r t lineto r b lineto closepath patternNone not { ifill } if brushNone not { istroke } if end } dup 0 4 dict put def /Text { ishow } def /idef { dup where { pop pop pop } { exch def } ifelse } def /ifill { 0 begin gsave patternGrayLevel -1 ne { fgred bgred fgred sub patternGrayLevel mul add fggreen bggreen fggreen sub patternGrayLevel mul add fgblue bgblue fgblue sub patternGrayLevel mul add setrgbcolor eofill } { eoclip originalCTM setmatrix pathbbox /t exch def /r exch def /b exch def /l exch def /w r l sub ceiling cvi def /h t b sub ceiling cvi def /imageByteWidth w 8 div ceiling cvi def /imageHeight h def bgred bggreen bgblue setrgbcolor eofill fgred fggreen fgblue setrgbcolor w 0 gt h 0 gt and { l w add b translate w neg h scale w h true [w 0 0 h neg 0 h] { patternproc } imagemask } if } ifelse grestore end } dup 0 8 dict put def /istroke { gsave brushDashOffset -1 eq { [] 0 setdash 1 setgray } { brushDashArray brushDashOffset setdash fgred fggreen fgblue setrgbcolor } ifelse brushWidth setlinewidth originalCTM setmatrix stroke grestore } def /ishow { 0 begin gsave fgred fggreen fgblue setrgbcolor /fontDict printFont printSize scalefont dup setfont def /descender fontDict begin 0 [FontBBox] 1 get FontMatrix end transform exch pop def /vertoffset 1 printSize sub descender sub def { 0 vertoffset moveto show /vertoffset vertoffset printSize sub def } forall grestore end } dup 0 3 dict put def /patternproc { 0 begin /patternByteLength 
patternString length def /patternHeight patternByteLength 8 mul sqrt cvi def /patternWidth patternHeight def /patternByteWidth patternWidth 8 idiv def /imageByteMaxLength imageByteWidth imageHeight mul stringLimit patternByteWidth sub min def /imageMaxHeight imageByteMaxLength imageByteWidth idiv patternHeight idiv patternHeight mul patternHeight max def /imageHeight imageHeight imageMaxHeight sub store /imageString imageByteWidth imageMaxHeight mul patternByteWidth add string def 0 1 imageMaxHeight 1 sub { /y exch def /patternRow y patternByteWidth mul patternByteLength mod def /patternRowString patternString patternRow patternByteWidth getinterval def /imageRow y imageByteWidth mul def 0 patternByteWidth imageByteWidth 1 sub { /x exch def imageString imageRow x add patternRowString putinterval } for } for imageString end } dup 0 12 dict put def /min { dup 3 2 roll dup 4 3 roll lt { exch } if pop } def /max { dup 3 2 roll dup 4 3 roll gt { exch } if pop } def /midpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 x1 add 2 div y0 y1 add 2 div end } dup 0 4 dict put def /thirdpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 2 mul x1 add 3 div y0 2 mul y1 add 3 div end } dup 0 4 dict put def /subspline { 0 begin /movetoNeeded exch def y exch get /y3 exch def x exch get /x3 exch def y exch get /y2 exch def x exch get /x2 exch def y exch get /y1 exch def x exch get /x1 exch def y exch get /y0 exch def x exch get /x0 exch def x1 y1 x2 y2 thirdpoint /p1y exch def /p1x exch def x2 y2 x1 y1 thirdpoint /p2y exch def /p2x exch def x1 y1 x0 y0 thirdpoint p1x p1y midpoint /p0y exch def /p0x exch def x2 y2 x3 y3 thirdpoint p2x p2y midpoint /p3y exch def /p3x exch def movetoNeeded { p0x p0y moveto } if p1x p1y p2x p2y p3x p3y curveto end } dup 0 17 dict put def /storexyn { /n exch def /y n array def /x n array def n 1 sub -1 0 { /i exch def y i 3 2 roll put x i 3 2 roll put } for } def /SSten { fgred fggreen fgblue setrgbcolor dup true 
exch 1 0 0 -1 0 6 -1 roll matrix astore } def /FSten { dup 3 -1 roll dup 4 1 roll exch newpath 0 0 moveto dup 0 exch lineto exch dup 3 1 roll exch lineto 0 lineto closepath bgred bggreen bgblue setrgbcolor eofill SSten } def /Rast { exch dup 3 1 roll 1 0 0 -1 0 6 -1 roll matrix astore } def Begin [ 0.799705 0 0 0.799705 0 0 ] concat /originalCTM matrix currentmatrix def Begin %I Pict Begin %I Pict Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 119 643 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 224 787 ] concat [ (Phase 1) ] Text End End %I eop Begin %I Pict Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 87 619 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 6.12303e-17 1 -1 6.12303e-17 176.5 714.5 ] concat [ (Phase2) ] Text End End %I eop Begin %I Pict [ 1 0 0 1 -96 48 ] concat Begin %I Pict [ 1 0 0 1 -8 0 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 344 715 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 224 280 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -8 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 344 683 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 224 248 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 336 667 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 216 232 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict [ 1 0 0 1 0 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 336 635 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 216 200 ] concat 79 419 175 443 Rect End End %I eop End %I eop End %I eop Begin %I Pict [ 1 0 0 
1 15 -1 ] concat Begin %I Pict [ 1 0 0 1 176 192 ] concat Begin %I Pict Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -168 -72 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -200 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -280 104 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 72 -24 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 48 -48 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 24 -72 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -192 -96 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 
1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -120 -120 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -144 -144 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -224 -16 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -248 -40 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -176 -64 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -304 80 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -352 32 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n 
[ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -328 56 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop End %I eop Begin %I Pict [ 1 0 0 1 160 0 ] concat Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 87 619 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 6.12303e-17 1 -1 6.12303e-17 176.5 714.5 ] concat [ (Phase2) ] Text End End %I eop Begin %I Pict [ 1 0 0 1 160 0 ] concat Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 119 643 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 224 787 ] concat [ (Phase 1) ] Text End End %I eop End %I eop Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 160.25 662.724 ] concat [ (\(a\) Row Block Distribution.) ] Text End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 323.75 664 ] concat [ (\(b\) Our Proposed Distribution.) 
] Text End End %I eop showpage end %%EndDocument @endspecial 269 391 a Fo(Figure)11 b(7:)18 b(Data)13 b(Distribution)e(used)i (to)f(alleviate)g(Memory)g(Contention.)503 1045 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: 256.adi.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub 
exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (1000) Rshow LTa 600 585 M 2817 0 V LTb 600 585 M 31 0 V 2786 0 R -31 0 V LTa 600 780 M 2817 0 V LTb 600 780 M 31 0 V 2786 0 R -31 0 V LTa 600 919 M 2817 0 V LTb 600 919 M 31 0 V 2786 0 R -31 0 V LTa 600 1026 M 2817 0 V LTb 600 1026 M 31 0 V 2786 0 R -31 0 V LTa 600 1114 M 2817 0 V LTb 600 1114 M 31 0 V 2786 0 R -31 0 V LTa 600 1188 M 2817 0 V LTb 600 1188 M 31 0 V 2786 0 R -31 0 V LTa 600 1253 M 2817 0 V LTb 600 1253 M 31 0 V 2786 0 R -31 0 V LTa 600 1309 M 2817 0 V LTb 600 1309 M 31 0 V 2786 0 R -31 0 V LTa 600 1360 M 2817 0 V LTb 600 1360 M 63 0 V 2754 0 R -63 0 V -2814 0 R (10000) Rshow LTa 600 1694 M 2817 0 V LTb 600 1694 M 31 0 V 2786 0 R -31 0 V LTa 600 1889 M 2817 0 V LTb 600 1889 M 31 0 V 2786 0 R -31 0 V LTa 600 2028 M 2817 0 V LTb 600 2028 M 31 0 V 2786 0 R -31 0 V LTa 600 2135 M 2817 0 V LTb 600 2135 M 31 0 V 2786 0 R -31 0 V LTa 600 2223 M 2817 0 V LTb 600 2223 M 31 0 V 2786 0 R -31 0 V LTa 600 2297 M 2817 0 V LTb 600 2297 M 31 0 V 2786 0 R -31 0 V LTa 600 2362 M 2817 0 V LTb 600 2362 M 31 0 V 2786 0 R -31 0 V LTa 600 2418 M 2817 0 V LTb 600 2418 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100000) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 
0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (8) Cshow LTa 2361 251 M 0 2218 V LTb 2361 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (10) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (12) Cshow LTa 3065 251 M 0 2218 V LTb 3065 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (14) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (16) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 220 1260 M currentpoint gsave translate 90 rotate 0 0 M (Execution Time \(milli sec\)) Cshow grestore 2008 51 M (Number of Processors) Cshow LT0 2009 1253 M (Owner Computes\(Block, Block\)) Rshow 2069 1253 M 180 0 V 952 2200 M 352 -24 V 705 86 V 1408 89 V 2129 1253 D 952 2200 D 1304 2176 D 2009 2262 D 3417 2351 D LT1 2009 1153 M (Sequential) Rshow 2069 1153 M 180 0 V 776 2146 M 2129 1153 A 776 2146 A LT2 2009 1053 M (No Distributions) Rshow 2069 1053 M 180 0 V 3417 2056 M -704 3 V -704 6 V -705 8 V -352 61 V 2129 1053 B 3417 2056 B 2713 2059 B 2009 2065 B 1304 2073 B 952 2134 B LT3 2009 953 M (\(*,Cyclic\)) Rshow 2069 953 M 180 0 V 952 2090 M 352 -33 V 705 -48 V 704 25 V 704 18 V 2129 953 C 952 2090 C 1304 2057 C 2009 2009 C 2713 2034 C 3417 2052 C LT4 2009 853 M (\(*,Block\)) Rshow 2069 853 M 180 0 V 952 2070 M 352 -53 V 705 -45 V 704 23 V 704 9 V 2129 853 T 952 2070 T 1304 2017 T 2009 1972 T 2713 1995 T 3417 2004 T LT5 2009 753 M (Owner Computes\(*,Cyclic\)) Rshow 2069 753 M 180 0 V 952 2080 M 352 -252 V 705 -78 V 352 216 V 352 20 V 704 128 V 2129 753 S 952 2080 S 1304 1828 S 2009 1750 S 2361 1966 S 2713 1986 S 3417 2114 S LT6 2009 653 M (Owner Computes\(*,Block\)) Rshow 2069 653 M 180 0 V 3417 1820 M 2713 1611 L -704 -55 V -705 118 V 952 1954 L 2129 653 D 3417 1820 D 2713 1611 D 2009 1556 D 1304 1674 D 952 1954 D LT7 2009 553 M (\(Block,Block\)) Rshow 2069 553 M 180 0 V 1168 547 R -704 222 V -704 85 V -705 249 V 952 1951 L 2129 553 A 3417 1100 A 2713 1322 A 2009 1407 A 1304 1656 A 952 1951 A stroke grestore end 
showpage %%EndDocument @endspecial 532 1111 a(Figure)f(8:)18 b(ADI)12 b(Performance)f(\(256x256\).) 4 1246 y(cases)18 b(adhering)d(to)i(owner)o(-computes)e(rule)h(can)h(incur)e (severe)i(synchronization)f(or)g(ownership)f(test)4 1300 y(overhead)c(which)f (exceeds)h(the)g(cost)g(of)f(accessing)i(remote)e(memory)m(.)17 b(W)l(e)11 b(use)g(the)g(Altering)f(Direction)4 1355 y(Integration)i(\()p Fd(ADI)p Fo(\))f(to)i(illustrate)f(that)h(the)g(shared)f(address)i(space)f (provides)g(\257exibility)e(in)i(the)g(choice)4 1409 y(of)i(computation)f (partitions,)h(reducing)f(contention)g(and)h(synchronization)f(overhead,)i (and)e(resulting)4 1463 y(in)e(signi\256cant)g(performance)g(improvements.)77 1539 y(W)l(e)k(use)f(the)g(Hector)n(,)h(a)f(Non-Uniform)e(Memory)i(Access)h (multiprocessor)n(,)f(as)g(an)h(experimental)4 1593 y(platform.)21 b(Hector)13 b(consists)h(of)f(4)h(sets)g(of)f(processor)o(-memory)g(pairs)g (connected)h(by)f(a)h(bus)g(to)f(form)g(a)4 1647 y(station;)g(4)g(stations)g (are)f(connected)h(by)g(a)g(local)f(ring)g(to)h(form)f(a)h(cluster;)f(4)h (local)g(rings)f(are)h(connected)4 1701 y(by)g(a)g(global)g(ring.)19 b(W)l(e)14 b(use)f(a)g(system)h(with)e(one)h(cluster)m(.)21 b(Each)13 b(processor)o(-memory)f(pair)g(consists)i(of)4 1755 y(a)f(Motorola)f(MC88100)h(CPU,)g(a)g(16)f(KB)h(instruction)f(cache,)i(a)f (16)f(KB)h(data)g(cache)g(and)g(4)f(MB)i(of)e(the)4 1809 y(globally)i (addressable)g(memory)m(.)23 b(The)15 b(hardware)e(provides)h(no)g(support)g (for)f(cache)i(coherence.)23 b(The)4 1864 y(coherence)12 b(of)f(data)h(is)g (maintained)f(by)h(software)f(at)h(cache)g(line)f(granularity)g([10)o(].)18 b(Data)12 b(distributions)4 1918 y(are)g(implemented)g(using)g(the)h(array)e (allocation)h(techniques)h(described)f(in)g([21,)h(3].)4 2079 y Fc(5.1)58 b(Contention)15 b(and)f(Synchr)n(onization)h(Conscious)e (Distribution)4 2177 y Fo(The)21 b Fd(ADI)e Fo(program)g(has)h(two)g(phases)h (with)e(parallelism)g(along)h(orthogonal)f(dimensions)h(in)f(each)4 2231 
y(phase.)k(It)13 b(operates)g(on)h(three)f(2-dimensional)g(arrays)g Fj(A)p Fo(,)h Fj(B)i Fo(and)e Fj(X)t Fo(.)22 b(A)14 b(single)g(iteration)e (of)i(an)f(outer)4 2285 y(sequentially)d(iterated)f(loop)h(consists)g(of)g(a) g(forward)f(and)g(a)i(backward)e(sweep)i(phase)f(along)g(the)f(rows)h(of)4 2339 y(three)f(arrays,)h(followed)e(by)h(another)g(forward)e(and)i(backward)g (sweep)h(phase)f(along)g(the)g(columns)g(of)g(the)4 2394 y(arrays)g([18].)17 b(This)10 b(application)f(is)h(typical)f(of)g(other)g(programs)g(such)g(as)h Fd(2D-FFT)f Fo(and)h Fd(Erlebacher)4 2448 y Fo(that)i(have)h(parallelism)f (in)g(orthogonal)f(directions)h(in)g(dif)o(ferent)f(phases)j(of)e(the)g (program.)77 2523 y(The)k(best)g(data)f(distribution)f(scheme)i(for)e Fd(ADI)h Fo(remains)g(an)g(issue)h(of)f(debate)h([18)o(,)g(4].)26 b(The)16 b(two)4 2578 y(proposed)h(schemes)g(partition)f(arrays)g(along)h(a)f (single)h(dimension,)h(either)e(in)h(blocks)g(or)f(cyclically)m(.)4 2632 y(These)f(distributions,)g(in)e(conjunction)h(with)g(the)g(owner)o (-computes)f(rule)h(result)f(in)h(a)h(wavefront)e(type)4 2686 y(computation,)e(leading)g(to)g(heavy)g(synchronization)g(overhead)g(in)g (one)g(of)f(the)i(phases)g(of)e(the)h(program.)4 2740 y(Figure)i(7\(a\))g (shows)i(a)f Fd(Block)f Fo(distribution)g(of)g(the)h(rows)g(of)f(the)h (arrays.)23 b(W)n(ith)13 b(such)h(a)g(distribution,)4 2794 y(during)g(the)h(\256rst)g(phase)g(of)g(the)g(program)e(all)i(the)g (processors)g(access)i(data)e(that)f(is)i(local)f(and)g(require)4 2848 y(no)g(communication.)25 b(During)14 b(the)h(second)g(phase,)h(however)n (,)g(the)f(parallelism)f(is)h(orthogonal)f(to)h(the)p eop %%Page: 8 8 8 7 bop 35 1 a Fo(T)m(able)12 b(1:)18 b(Performance)11 b(Bottlenecks)i(for)e (various)h(data)h(and)f(computation)g(partitioning)e(for)i(ADI.)p 244 33 1363 2 v 243 84 2 51 v 252 84 V 277 69 a Fb(Data)g(Distribution)p 620 84 V 77 w(Compute)f(Rule)p 993 84 V 135 w(Performance)i(Bottleneck)p 1597 84 V 1606 84 V 244 85 1363 2 v 243 136 2 51 v 252 136 
V 387 121 a(None)p 620 136 V 247 w(Relaxed)p 993 136 V 228 w(Memory)e(Contention)p 1597 136 V 1606 136 V 243 187 V 252 187 V 344 172 a(\(*,)h(Block\))p 620 187 V 117 w(Owner)o(-Computes)p 993 187 V 125 w(High)f(Synchronization)p 1597 187 V 1606 187 V 243 238 V 252 238 V 344 223 a(\(*,)h(Block\))p 620 238 V 204 w(Relaxed)p 993 238 V 228 w(Memory)f(Contention)p 1597 238 V 1606 238 V 243 289 V 252 289 V 339 273 a(\(*,)h(Cyclic\))p 620 289 V 112 w(Owner)o(-Computes)p 993 289 V 125 w(High)f(Synchronization)p 1597 289 V 1606 289 V 243 340 V 252 340 V 339 324 a(\(*,)h(Cyclic\))p 620 340 V 199 w(Relaxed)p 993 340 V 228 w(Memory)f(Contention)p 1597 340 V 1606 340 V 243 390 V 252 390 V 301 375 a(\(Block,)h(Block\))p 620 390 V 74 w(Owner)o(-Computes)p 993 390 V 180 w(Ownership)f(tests)p 1597 390 V 1606 390 V 243 441 V 252 441 V 301 426 a(\(Block,)h(Block\))p 620 441 V 161 w(Relaxed)p 993 441 V 137 w(High)f(Remote)g(Memory)g(Access)p 1597 441 V 1606 441 V 244 443 1363 2 v 4 606 a Fo(direction)17 b(of)g(distribution.)32 b(Strict)16 b(adherence)h(to)g(the)h(owner)o (-computes)e(rule)h(implies)g(ordering)f(of)4 660 y(the)d(computations)f(by)g (processors)h(on)f(the)g(corresponding)g(chunk)g(of)g(the)h(columns)f(they)g (own.)19 b(Thus,)4 715 y(processor)10 b Fj(i)g Fo(has)h(to)f(wait)f(for)h (processor)g Fj(i)c Fh(\000)g Fo(1)j(to)h(\256nish)g(the)g(computation)f(on)h (its)h(chunk)f(of)f(the)h(column)4 769 y(before)g(proceeding.)17 b(A)10 b(lar)o(ger)f(number)h(of)f(synchronizations)h(are)g(required)f(to)h (maintain)g(the)g(ordering)4 823 y(involved)i(in)g(the)g(wavefront)g (computation.)77 899 y(The)i(synchronization)e(overhead)h(can)g(be)g (eliminated)f(by)h(relaxing)f(the)h(owner)o(-computes)f(rule)h(in)4 953 y(the)18 b(second)g(phase)h(and)f(allowing)f(the)h(processor)g(to)f (write)h(the)f(results)i(to)e(remote)h(memory)f(mod-)4 1007 y(ules.)24 b(This)15 b(eliminates)f(synchronization)f(overhead)h(at)g(the)g (expense)g(of)g(increased)g(remote)g(memory)4 1061 y(accesses.)26 
b(However)n(,)15 b(the)g(use)g(of)f(this)g(relaxed)g(compute)g(rule)g(with)g (the)g Fd(\(*,Block\))g Fo(distribution)4 1115 y(results)9 b(in)g(heavy)g(contention.)17 b(Each)9 b(processor)g(is)g(responsible)g(for)f (computing)g(a)i(column,)f(and)g(hence,)4 1169 y(each)14 b(processor)g (accesses)h(every)e(memory)g(module)g(in)g(sequence.)23 b(Thus,)15 b(a)e(given)h(memory)e(module)4 1224 y(is)h(accessed)h(by)e(every)g (processor)g(at)h(the)f(same)h(time,)f(leading)g(to)h(contention.)77 1299 y(The)k(data)e(distribution)g(scheme)h(depicted)f(in)h(Figure)f(7\(b\)) 1149 1281 y Fm(4)1182 1299 y Fo(eliminates)h(contention)f(and)h(results)4 1353 y(in)21 b(the)g(best)h(possible)f(performance)f(with)h(the)g(relaxed)g (compute)g(rule.)44 b(W)n(ith)21 b(this)g(distribution,)4 1408 y(processors)13 b(access)g(data)g(from)e(remote)g(memory)h(modules)g(in)g (both)g(phases)h(of)f(the)g(program.)17 b(In)12 b(both)4 1462 y(phases,)h(processors)f(start)g(working)e(on)i(the)f(columns)h(assigned)g (to)g(them)f(by)g(accessing)i(data)f(that)f(is)h(in)4 1516 y(dif)o(ferent)f(memory)f(modules)i(thus)g(avoiding)f(contention.)18 b(There)12 b(is)g(no)f(wavefront)g(type)h(parallelism,)4 1570 y(and)h(hence)f(no)g(overhead)g(involved)g(due)h(to)f(synchronization.)77 1646 y(The)19 b(use)f(of)f(owner)o(-computes)g(rule)h(with)f(the)h (distribution)f(of)g(Figure)g(7\(b\))g(will)h(not)f(result)h(in)4 1700 y(good)f(performance.)31 b(Either)17 b(ownership)g(tests)h(must)f(be)g (introduced)f(in)h(the)g(body)g(of)g(the)g(loops)g(to)4 1754 y(enforce)c(the)g(owner)o(-computes)f(rule,)i(or)e(the)i(loops)f(must)g(be)g (rewritten)g(with)f(additional)h(strip-mined)4 1808 y(controlling)f(loops)h (to)g(schedule)h(the)f(computations)f(on)h(sub-blocks)g(of)g(the)g(array)m(.) 
20 b(The)14 b(former)d(leads)4 1862 y(to)h(overhead)g(and)h(the)f(latter)g (introduces)g(synchronization)g(similar)f(to)i(the)f(wavefront)f (computation.)77 1938 y(The)j(result)g(of)f(executing)g(the)h Fd(ADI)f Fo(application)g(on)h(the)f(Hector)g(multiprocessor)g(for)g(a)h (data)f(size)4 1992 y(of)18 b(256x256)f(with)h(various)g(data)g (distributions)g(and)g(compute)g(rules)g(is)g(shown)g(in)g(Figure)g(8.)35 b(The)4 2046 y Fd(\(Block,Block\))17 b Fo(data)h(distribution)f(that)h (relaxes)g(the)g(owner)o(-computes)f(rule)g(outperforms)g(all)4 2101 y(data)d(distribution)e(schemes)i(that)f(adhere)g(to)g(the)g(rule.)21 b(The)14 b(\256gure)f(also)g(indicates)h(that)f(the)g(overhead)4 2155 y(due)j(to)g(the)f(ownership)h(tests)g(when)g(using)g(the)f(owner)o (-computes)g(rule)h(with)f(a)h Fd(\(Block,Block\))4 2209 y Fo(distribution)d(degrades)h(performance.)21 b(It)14 b(is)g(also)g(clear)g (that)f(the)h(use)g(of)g(data)g(distribution)e(improves)4 2263 y(performance)i(over)h(the)g(use)h(of)f(operating)f(system)i(policies)f(to)g (manage)g(data)h(\(the)e(no)i(distributions)4 2317 y(curve\).)35 b(The)19 b(performance)e(bottlenecks)h(of)g(various)g(distributions)g(for)f Fd(ADI)h Fo(are)g(summarized)g(in)4 2371 y(T)m(able)13 b(1.)p 4 2406 737 2 v 62 2437 a Fl(4)79 2452 y Fb(This)f(is)h(equivalent)g(to)f (!HPF$)i(PROCESSORS)i(PROCS\(N\))g(with)c(!HPF$)i(DISTRIBUTE)h(B\(BLOCK,)4 2503 y(BLOCK\),)10 b(X\(BLOCK,)g(BLOCK\))g(ON)e(PROCS)j(in)d Fa(HPF)p Fb(.)i(In)f(the)f(current)h Fa(HPF)h Fb(speci\256cation,)f(this)f (distribution)4 2554 y(is)18 b(not)g(valid;)k(the)c(rank)h(of)g(each)g (distributee)f(must)f(equal)i(the)g(rank)f(of)h(the)g(named)f(processor)h (grid)f([16].)4 2604 y(Distributions)7 b(in)i(which)g(this)g(is)f(not)h(the)g (case)h(introduce)f(additional)g(complexity)e(on)j(DMMs)e([17].)16 b(In)10 b(contrast,)4 2655 y(SSMMs)h(provide)g(the)g(\257exibility)f(to)h (implement)f(these)h(distributions.)p eop %%Page: 9 9 9 8 bop 4 -21 a Fn(6)71 b(Related)19 b(W)l(ork)4 106 y 
Fo(Several)12 b(researchers)g(have)g(focused)g(on)g(the)g(problem)f(of)g(deriving)g(data)h (distributions)f(automatically)4 160 y(for)g(DMMs.)20 b(Li)12 b(and)g(Chen)h([22)o(],)f(Gupta)g(and)g(Banerjee)h([12)o(],)f(Zima)h(et)f (al.)h([9)o(])f(and)g(Garcia)g(et)g(al.)h([11)o(])4 215 y(follow)e(the)h (approach)g(of)f(\256nding)h(the)f(alignment)h(constraints)g(between)g(dif)o (ferent)e(dimensions)i(of)g(the)4 269 y(arrays)g(and)g(derive)g(a)g(data)g (distribution)f(that)h(minimizes)g(interprocessor)g(communication.)17 b(T)m(o)12 b(avoid)4 323 y(a)f(heuristic)g(approach,)g(Bixby)g(et)g(al.)h([7) o(])f(formulate)e(a)j(0-1)e(integer)g(programming)g(problem)g(for)g(deriv-)4 377 y(ing)k(data)g(distributions.)21 b(Their)14 b(approach)g(relies)g(on)f (the)h(assumption)g(that)g(a)g(good)f(data)h(distribution)4 431 y(for)h(the)i(entire)e(program)g(can)i(be)f(found)f(by)i(mer)o(ging)e (the)h(data)g(distributions)g(of)f(smaller)h(segments)4 485 y(of)g(the)g(program.)27 b(They)17 b(minimize)e(the)h(interprocessor)f (communication)g(using)h(the)g(\252performance)4 540 y(estimator)r(\272)c (developed)h(by)g(Balasundaram)g(et)g(al.)g([6)o(].)20 b(Anderson)12 b([5])g(presents)i(an)e(algebraic)h(frame-)4 594 y(work)g(for)g(determining)f (data)h(and)h(computation)e(partitions)h(by)g(minimizing)g(communication)f (across)4 648 y(processors.)28 b(Data)16 b(transformations)e(are)i(then)f (applied)h(so)f(that)h(the)f(processors)h(access)h(contiguous)4 702 y(data)g(regions)f(to)h(reduce)g(false)g(sharing.)31 b(This)17 b(technique)g(is)g(oblivious)f(to)h(SSMM)g(speci\256c)g(issues)4 756 y(such)c(as)g(contention)f(and)g(cache)h(af)o(\256nity)m(.)4 938 y Fn(7)71 b(Concluding)19 b(Remarks)4 1065 y Fo(Although)9 b(lar)o(ge)g(SSMMs)i(are)e(built)g(based)h(on)g(an)g(architecture)e(with)i (distributed)f(memory)m(,)g(the)h(shared)4 1119 y(memory)15 b(paradigm)g(introduces)g(performance)g(issues)i(that)e(are)h(dif)o(ferent)f (from)f(those)i(encountered)4 1173 y(in)e(DMMs.)24 b(The)14 
b(high)f(cost)i(of)e(interprocessor)g(communication)g(in)h(distributed)f (memory)f(multipro-)4 1227 y(cessors)18 b(makes)e(the)h(minimization)e(of)h (communication)g(the)g(predominant)g(issue)h(in)f(selecting)h(data)4 1282 y(distributions)h(and)i(in)e(partitioning)g(computations.)38 b(On)19 b(SSMMs,)j(a)d(methodology)f(for)h(selecting)4 1336 y(data)14 b(distributions)g(must)g(also)g(consider)g(cache)h(af)o(\256nity)m (,)f(memory)f(contention)h(and)g(false)g(sharing)g(in)4 1390 y(addition)d(to)g(the)g(cost)h(of)f(interprocessor)g(communication.)17 b(Furthermore,)10 b(the)h(single)h(shared)f(address)4 1444 y(space)j(present)f(in)g(SSMMs)g(provides)g(\257exibility)f(in)h(the)g (selection)g(of)f(computation)h(partitions.)19 b(This)4 1498 y(should)e(be)f(exploited)g(in)g(applications)h(in)f(which)g(owner)o (-computes)g(results)g(in)h(poor)f(performance.)4 1552 y(The)f Fe(Jasmine)g Fo(compiler)f(project)g([2)o(])h(is)g(investigating)f(the)g (issues)i(discussed)f(in)g(this)f(paper)h(through)4 1607 y(the)d(development) g(of)g(a)h(framework)e(for)g(automatically)h(deriving)f(data)i(distributions) e(on)i(SSMMs.)4 1777 y Fn(Refer)o(ences)29 1896 y Fo([1])24 b(T)l(.S.)17 b(Abdelrahman)f(et)g(al.)29 b(An)16 b(overview)f(of)h(the)g (NUMAchine)h(multiprocessor)e(project.)28 b(In)112 1943 y Fe(Pr)n(oc.)13 b(of)g(the)f(Canadian)g(Super)n(computing)g(Conf.)p Fo(,)h(pages)g (283\261295,)f(1994.)29 2032 y([2])24 b(T)l(.S.)12 b(Abdelrahman,)f(N.)h (Manjikian,)g(and)f(S.)g(T)m(andri.)16 b(The)11 b(Jasmine)h(Compiler.)k(In)10 b(preparation.)29 2120 y([3])24 b(T)l(.S.)e(Abdelrahman)e(and)h(T)l(.N.)h(W)l (ong.)41 b(Distributed)20 b(array)g(data)h(management)g(on)f(NUMA)112 2167 y(multiprocessors.)d(In)12 b Fe(Pr)n(oc.)i(of)e(SHPCC)p Fo(,)i(pages)f(551\261559,)f(1994.)29 2256 y([4])24 b(S.P)-6 b(.)10 b(Amarasinghe,)g(J.M.)h(Anderson,)f(M.S.)h(Lam,)g(and)e(A.W)-5 b(.)11 b(Lim.)j(An)9 b(overview)g(of)g(a)h(compiler)112 2303 y(for)k(scalable)j(parallel)e(machines.)27 b(In)15 b 
Fe(Languages)h(and)f (Compilers)i(for)f(Parallel)g(Computing)p Fo(,)112 2350 y(pages)c (253\261272.)h(Springer)o(-V)-6 b(erlag)10 b(LNCS-768,)j(1993.)29 2438 y([5])24 b(J.M.)13 b(Anderson.)j(Demonstration)11 b(of)g(automatic)g (data)h(and)f(computation)g(decomposition)g(tech-)112 2485 y(niques.)f(In)e Fe(Pr)n(oc.)g(of)g(the)g(W)-5 b(orkshop)8 b(on)g(Automatic)g(Data)g(Layout)g(and)g(Performance)g(Pr)n(ediction)p Fo(,)112 2532 y(1995.)29 2620 y([6])24 b(V)-6 b(.)12 b(Balasundaram,)i(G.)f (Fox,)g(K.)g(Kennedy)m(,)g(and)f(U.)i(Kremer)m(.)k(A)13 b(static)g (performance)e(estimator)112 2667 y(to)h(guide)g(data)g(partitioning)f (decisions.)19 b(In)12 b Fe(Pr)n(oc.)i(of)e(PPOPP)p Fo(,)j(pages)e (213\261223,)f(1991.)29 2756 y([7])24 b(R.)9 b(Bixby)m(,)h(K.)g(Kennedy)m(,)f (and)h(U.)f(Kremer)m(.)j(Automatic)d(data)g(layout)f(using)h(0-1)g(integer)f (program-)112 2803 y(ming.)15 b(In)c Fe(Pr)n(oc.)i(of)e(the)g(Int'l)f(Conf.)i (on)f(Parallel)g(Ar)n(chitectur)n(es)i(and)e(Compilation)g(T)-5 b(echniques)p Fo(,)112 2850 y(pages)12 b(111\261122,)h(1994.)p eop %%Page: 10 10 10 9 bop 29 -27 a Fo([8])24 b(W)-5 b(.J.)19 b(Bolosky)f(and)g(M.L.)h(Scott.) 32 b(False)18 b(sharing)f(and)h(its)g(ef)o(fect)f(on)g(shared)h(memory)f (multi-)112 20 y(processors.)k(In)13 b Fe(Pr)n(oc.)j(of)d(4th)g(Symp.)h(on)g (Experiences)h(with)e(Distributed)h(and)f(Multipr)n(ocessor)112 67 y(Systems)p Fo(,)g(pages)g(57\26171,)f(1993.)29 155 y([9])24 b(B.M.)14 b(Chapman,)g(T)l(.)h(Fahringer)n(,)e(and)g(H.)h(Zima.)21 b(Automatic)12 b(support)h(for)f(data)i(distribution)e(on)112 202 y(distributed)g(memory)g(multiprocessor)g(systems.)22 b(In)12 b Fe(Languages)h(and)g(Compilers)h(for)f(Parallel)112 249 y(Computing)p Fo(,)f(pages)h(184\261199.)f(Springer)o(-V)-6 b(erlag)11 b(LNCS-768,)h(1993.) 
4 337 y([10])24 b(B.)11 b(Gamsa.)k(Region-oriented)9 b(main)h(memory)f (management)h(in)g(shared-memory)f(NUMA)i(mul-)112 384 y(tiprocessors.)19 b(Master)r(')m(s)13 b(thesis,)h(Department)d(of)i(Computer)f(Science,)h (University)f(of)g(T)m(oronto,)112 431 y(T)m(oronto,)f(CANADA,)i(1992.)4 519 y([11])24 b(J.)f(Garcia,)h(E.)f(A)-5 b(yguade,)26 b(and)c(J.)h(Labarta.) 44 b(A)22 b(novel)g(approach)g(towards)g(automatic)g(data)112 566 y(distribution.)33 b(In)18 b Fe(Pr)n(oc.)i(of)e(the)h(W)-5 b(orkshop)20 b(on)e(Automatic)g(Data)g(Layout)g(and)h(Performance)112 613 y(Pr)n(ediction)p Fo(,)13 b(1995.)4 701 y([12])24 b(M.)16 b(Gupta)f(and)h(P)-6 b(.)17 b(Banerjee.)27 b(Automatic)15 b(data)g (partitioning)g(on)g(distributed)g(memory)g(multi-)112 748 y(processors.)j Fe(IEEE)c(T)m(rans.)f(on)f(Parallel)h(and)f(Distributed)h (Systems)p Fo(,)g(3\(2\):179\261193,)e(1992.)4 836 y([13])24 b(K.)15 b(Harzallah)g(and)g(K.C.)h(Sevcik.)25 b(Hot)15 b(spot)g(analysis)g (in)g(lar)o(ge)g(scale)h(shared)f(memory)f(multi-)112 883 y(processors.)k(In) 12 b Fe(Pr)n(oc.)i(of)e(Super)n(computing'93)p Fo(,)g(pages)h(895\261905.)f (ACM,)i(1993.)4 971 y([14])24 b(M.)11 b(Heinrich)f(et)h(al.)16 b(The)11 b(Stanford)f(FLASH)g(Multiprocessor.)16 b(In)10 b Fe(Pr)n(oc.)i(of)f(the)g(21st)g(Int'l)e(Symp.)112 1018 y(on)j(Computer)g(Ar)n (chitectur)n(e)p Fo(,)j(pages)e(302\261313,)f(1994.)4 1106 y([15])24 b(S.)15 b(Hiranandani,)i(K.)f(Kennedy)m(,)g(and)g(C.)g(T)m(seng.)27 b(Compiler)15 b(optimizations)g(for)g(Fortran)f(D)i(on)112 1153 y(MIMD)f(distributed-memory)e(machines.)25 b(In)15 b Fe(Pr)n(oc.)h(of)f (Super)n(computing'91)p Fo(,)g(pages)h(86\261100,)112 1200 y(Albuquerque,)c(NM,)h(1991.)4 1288 y([16])24 b(HPF)l(.)33 b(High)17 b(Performance)g(Fortran)g(Language)i(Speci\256cation)e(\(High)g (Performance)g(Fortran)112 1335 y(Forum\).)f(T)m(echnical)d(report)e (CRPC-TR92225,)i(Rice)g(University)m(,)f(1994.)4 1424 y([17])24 b(C.)13 b(Koelbel.)18 b(HPF)12 b(constraints.)18 b(Personal)12 
b(Communications,)g(1995.)4 1512 y([18])24 b(U.)14 b(Kremer)m(.)23 b(Automatic)14 b(data)g(layout)g(for)g(distributed-memory)e(multiprocessors.) 23 b(T)m(echnical)112 1559 y(report)11 b(CRPC-TR93229-S,)h(Center)h(for)e (Research)i(on)f(Parallel)g(Computation,)g(1993.)4 1647 y([19])24 b(T)l(.T)l(.)15 b(Kwan,)f(B.K.)h(T)m(otty)m(,)e(and)g(D.A.)h(Reed.)21 b(Communication)12 b(and)i(computation)e(performance)112 1694 y(of)f(the)i(CM5.)19 b(In)12 b Fe(Pr)n(oc.)h(of)g(Super)n(computing'93)p Fo(,)f(pages)h(192\261201.)f(ACM,)h(1993.)4 1782 y([20])24 b(D.)15 b(Lenoski)h(et)f(al.)26 b(The)15 b(Stanford)f(DASH)h(multiprocessor)m (.)25 b Fe(IEEE)16 b(Computer)p Fo(,)h(25\(3\):63\26179,)112 1829 y(1992.)4 1917 y([21])24 b(H.)12 b(Li)g(and)g(K.C.)h(Sevcik.)k (Numacros:)g(Data)12 b(parallel)f(programming)f(on)i(NUMA)g(multiproces-)112 1964 y(sors.)h(In)c Fe(Pr)n(oc.)i(of)e(4th)g(Symp.)h(on)g(Experiences)g(with) g(Distributed)f(and)g(Multipr)n(ocessor)j(Systems)p Fo(,)112 2011 y(pages)g(247\261263,)h(1993.)4 2099 y([22])24 b(J.)11 b(Li)h(and)f(M.)h(Chen.)k(Compiling)10 b(communication-ef)o(\256cient)f (programs)i(for)f(massively)h(parallel)112 2146 y(machines.)18 b Fe(Journal)12 b(of)h(Parallel)f(and)h(Distributed)f(Computing)p Fo(,)g(2\(3\):361\261376,)f(1991.)4 2234 y([23])24 b(Cray)14 b(Research.)25 b(The)16 b(Cray)e(Research)h(Massively)h(Parallel)e(Processor) g(System)h(-)f(Cray)h(T3D.)112 2281 y(T)m(echnical)d(report)f(80922,)i (Munchen,)g(Germany)m(,)f(1993.)4 2369 y([24])24 b(Kendall)12 b(Square)f(Research.)19 b Fe(KSR1)13 b(Principles)h(of)e(Operation)p Fo(.)18 b(W)l(altham,)13 b(MA,)g(1991.)4 2457 y([25])24 b(J.)12 b(T)m(orres,)g(E.)h(A)-5 b(yguade,)13 b(J.)g(Labarta,)f(and)g(M.)h(V)-6 b(alero.)17 b(Align)12 b(and)g(distribute-based)f(linear)g(loop)112 2504 y(transformations.)16 b(In)11 b Fe(Languages)g(and)h(Compilers)h(for)f (Parallel)g(Computing)p Fo(,)g(pages)g(321\261339.)112 2551 y(Springer)o(-V)-6 b(erlag)10 b(LNCS-768,)j(1993.)4 2639 y([26])24 b(Z.)17 
b(V)m(ranesic,)h(M.)f(Stumm,)f(R.)i(White,)f(and)f(D.)h(Lewis.)30 b(The)16 b(Hector)g(Multiprocessor.)29 b Fe(IEEE)112 2686 y(Computer)p Fo(,)13 b(24\(1\):72\26180,)e(1991.)4 2774 y([27])24 b(R.W)-5 b(.)12 b(W)n(isniewski,)g(L.I.)g(Kontothanassis,)h(and)e(M.L.)h(Scott.)k (High)11 b(performance)e(synchroniza-)112 2821 y(tion)i(algorithms)h(for)g (multiprogrammed)e(multiprocessors.)18 b(In)12 b Fe(Pr)n(oc.)h(of)g(PPOPP)p Fo(,)h(1995.)p eop %%Trailer end userdict /end-hook known{end-hook}if %%EOF |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Tandri_Abdel_PDPTA95.ps version [2089955127].
%!PS-Adobe-2.0 %%Creator: dvips 5.512 Copyright 1986, 1993 Radical Eye Software %%Title: pdpta.dvi %%CreationDate: Thu Nov 23 17:27:55 1995 %%Pages: 10 %%PageOrder: Ascend %%BoundingBox: 0 0 612 792 %%DocumentFonts: Times-Bold Times-Roman Times-Italic Courier %%EndComments %DVIPSCommandLine: dvips -o pdpta.ps pdpta.dvi %DVIPSSource: TeX output 1995.08.11:1234 %%BeginProcSet: tex.pro /TeXDict 250 dict def TeXDict begin /N{def}def /B{bind def}N /S{exch}N /X{S N} B /TR{translate}N /isls false N /vsize 11 72 mul N /@rigin{isls{[0 -1 1 0 0 0] concat}if 72 Resolution div 72 VResolution div neg scale isls{Resolution hsize -72 div mul 0 TR}if Resolution VResolution vsize -72 div 1 add mul TR matrix currentmatrix dup dup 4 get round 4 exch put dup dup 5 get round 5 exch put setmatrix}N /@landscape{/isls true N}B /@manualfeed{statusdict /manualfeed true put}B /@copies{/#copies X}B /FMat[1 0 0 -1 0 0]N /FBB[0 0 0 0]N /nn 0 N /IE 0 N /ctr 0 N /df-tail{/nn 8 dict N nn begin /FontType 3 N /FontMatrix fntrx N /FontBBox FBB N string /base X array /BitMaps X /BuildChar{ CharBuilder}N /Encoding IE N end dup{/foo setfont}2 array copy cvx N load 0 nn put /ctr 0 N[}B /df{/sf 1 N /fntrx FMat N df-tail}B /dfs{div /sf X /fntrx[sf 0 0 sf neg 0 0]N df-tail}B /E{pop nn dup definefont setfont}B /ch-width{ch-data dup length 5 sub get}B /ch-height{ch-data dup length 4 sub get}B /ch-xoff{128 ch-data dup length 3 sub get sub}B /ch-yoff{ch-data dup length 2 sub get 127 sub}B /ch-dx{ch-data dup length 1 sub get}B /ch-image{ch-data dup type /stringtype ne{ctr get /ctr ctr 1 add N}if}B /id 0 N /rw 0 N /rc 0 N /gp 0 N /cp 0 N /G 0 N /sf 0 N /CharBuilder{save 3 1 roll S dup /base get 2 index get S /BitMaps get S get /ch-data X pop /ctr 0 N ch-dx 0 ch-xoff ch-yoff ch-height sub ch-xoff ch-width add ch-yoff setcachedevice ch-width ch-height true[1 0 0 -1 -.1 ch-xoff sub ch-yoff .1 add]{ch-image}imagemask restore}B /D{/cc X dup type /stringtype ne{]}if nn /base get cc ctr put nn /BitMaps get S ctr S sf 1 
ne{dup dup length 1 sub dup 2 index S get sf div put}if put /ctr ctr 1 add N} B /I{cc 1 add D}B /bop{userdict /bop-hook known{bop-hook}if /SI save N @rigin 0 0 moveto /V matrix currentmatrix dup 1 get dup mul exch 0 get dup mul add .99 lt{/QV}{/RV}ifelse load def pop pop}N /eop{SI restore showpage userdict /eop-hook known{eop-hook}if}N /@start{userdict /start-hook known{start-hook} if pop /VResolution X /Resolution X 1000 div /DVImag X /IE 256 array N 0 1 255 {IE S 1 string dup 0 3 index put cvn put}for 65781.76 div /vsize X 65781.76 div /hsize X}N /p{show}N /RMat[1 0 0 -1 0 0]N /BDot 260 string N /rulex 0 N /ruley 0 N /v{/ruley X /rulex X V}B /V{}B /RV statusdict begin /product where{ pop product dup length 7 ge{0 7 getinterval dup(Display)eq exch 0 4 getinterval(NeXT)eq or}{pop false}ifelse}{false}ifelse end{{gsave TR -.1 -.1 TR 1 1 scale rulex ruley false RMat{BDot}imagemask grestore}}{{gsave TR -.1 -.1 TR rulex ruley scale 1 1 false RMat{BDot}imagemask grestore}}ifelse B /QV{ gsave transform round exch round exch itransform moveto rulex 0 rlineto 0 ruley neg rlineto rulex neg 0 rlineto fill grestore}B /a{moveto}B /delta 0 N /tail{dup /delta X 0 rmoveto}B /M{S p delta add tail}B /b{S p tail}B /c{-4 M} B /d{-3 M}B /e{-2 M}B /f{-1 M}B /g{0 M}B /h{1 M}B /i{2 M}B /j{3 M}B /k{4 M}B /w{0 rmoveto}B /l{p -4 w}B /m{p -3 w}B /n{p -2 w}B /o{p -1 w}B /q{p 1 w}B /r{ p 2 w}B /s{p 3 w}B /t{p 4 w}B /x{0 S rmoveto}B /y{3 2 roll p a}B /bos{/SS save N}B /eos{SS restore}B end %%EndProcSet %%BeginProcSet: texps.pro TeXDict begin /rf{findfont dup length 1 add dict begin{1 index /FID ne 2 index /UniqueID ne and{def}{pop pop}ifelse}forall[1 index 0 6 -1 roll exec 0 exch 5 -1 roll VResolution Resolution div mul neg 0 0]/Metrics exch def dict begin Encoding{exch dup type /integertype ne{pop pop 1 sub dup 0 le{pop}{[}ifelse}{ FontMatrix 0 get div Metrics 0 get div def}ifelse}forall Metrics /Metrics currentdict end def[2 index currentdict end definefont 3 -1 roll makefont /setfont load]cvx 
def}def /ObliqueSlant{dup sin S cos div neg}B /SlantFont{4 index mul add}def /ExtendFont{3 -1 roll mul exch}def /ReEncodeFont{/Encoding exch def}def end %%EndProcSet %%BeginProcSet: special.pro TeXDict begin /SDict 200 dict N SDict begin /@SpecialDefaults{/hs 612 N /vs 792 N /ho 0 N /vo 0 N /hsc 1 N /vsc 1 N /ang 0 N /CLIP 0 N /rwiSeen false N /rhiSeen false N /letter{}N /note{}N /a4{}N /legal{}N}B /@scaleunit 100 N /@hscale{@scaleunit div /hsc X}B /@vscale{@scaleunit div /vsc X}B /@hsize{/hs X /CLIP 1 N}B /@vsize{/vs X /CLIP 1 N}B /@clip{/CLIP 2 N}B /@hoffset{/ho X}B /@voffset{/vo X}B /@angle{/ang X}B /@rwi{10 div /rwi X /rwiSeen true N}B /@rhi {10 div /rhi X /rhiSeen true N}B /@llx{/llx X}B /@lly{/lly X}B /@urx{/urx X}B /@ury{/ury X}B /magscale true def end /@MacSetUp{userdict /md known{userdict /md get type /dicttype eq{userdict begin md length 10 add md maxlength ge{/md md dup length 20 add dict copy def}if end md begin /letter{}N /note{}N /legal{ }N /od{txpose 1 0 mtx defaultmatrix dtransform S atan/pa X newpath clippath mark{transform{itransform moveto}}{transform{itransform lineto}}{6 -2 roll transform 6 -2 roll transform 6 -2 roll transform{itransform 6 2 roll itransform 6 2 roll itransform 6 2 roll curveto}}{{closepath}}pathforall newpath counttomark array astore /gc xdf pop ct 39 0 put 10 fz 0 fs 2 F/|______Courier fnt invertflag{PaintBlack}if}N /txpose{pxs pys scale ppr aload pop por{noflips{pop S neg S TR pop 1 -1 scale}if xflip yflip and{pop S neg S TR 180 rotate 1 -1 scale ppr 3 get ppr 1 get neg sub neg ppr 2 get ppr 0 get neg sub neg TR}if xflip yflip not and{pop S neg S TR pop 180 rotate ppr 3 get ppr 1 get neg sub neg 0 TR}if yflip xflip not and{ppr 1 get neg ppr 0 get neg TR}if}{noflips{TR pop pop 270 rotate 1 -1 scale}if xflip yflip and{TR pop pop 90 rotate 1 -1 scale ppr 3 get ppr 1 get neg sub neg ppr 2 get ppr 0 get neg sub neg TR}if xflip yflip not and{TR pop pop 90 rotate ppr 3 get ppr 1 get neg sub neg 0 TR}if yflip xflip not and{TR pop 
pop 270 rotate ppr 2 get ppr 0 get neg sub neg 0 S TR}if}ifelse scaleby96{ppr aload pop 4 -1 roll add 2 div 3 1 roll add 2 div 2 copy TR .96 dup scale neg S neg S TR}if}N /cp{pop pop showpage pm restore}N end}if}if}N /normalscale{Resolution 72 div VResolution 72 div neg scale magscale{DVImag dup scale}if 0 setgray}N /psfts{S 65781.76 div N}N /startTexFig{/psf$SavedState save N userdict maxlength dict begin /magscale false def normalscale currentpoint TR /psf$ury psfts /psf$urx psfts /psf$lly psfts /psf$llx psfts /psf$y psfts /psf$x psfts currentpoint /psf$cy X /psf$cx X /psf$sx psf$x psf$urx psf$llx sub div N /psf$sy psf$y psf$ury psf$lly sub div N psf$sx psf$sy scale psf$cx psf$sx div psf$llx sub psf$cy psf$sy div psf$ury sub TR /showpage{}N /erasepage{}N /copypage{}N /p 3 def @MacSetUp}N /doclip{psf$llx psf$lly psf$urx psf$ury currentpoint 6 2 roll newpath 4 copy 4 2 roll moveto 6 -1 roll S lineto S lineto S lineto closepath clip newpath moveto}N /endTexFig{end psf$SavedState restore}N /@beginspecial{ SDict begin /SpecialSave save N gsave normalscale currentpoint TR @SpecialDefaults count /ocount X /dcount countdictstack N}N /@setspecial{CLIP 1 eq{newpath 0 0 moveto hs 0 rlineto 0 vs rlineto hs neg 0 rlineto closepath clip}if ho vo TR hsc vsc scale ang rotate rwiSeen{rwi urx llx sub div rhiSeen{ rhi ury lly sub div}{dup}ifelse scale llx neg lly neg TR}{rhiSeen{rhi ury lly sub div dup scale llx neg lly neg TR}if}ifelse CLIP 2 eq{newpath llx lly moveto urx lly lineto urx ury lineto llx ury lineto closepath clip}if /showpage{}N /erasepage{}N /copypage{}N newpath}N /@endspecial{count ocount sub{pop}repeat countdictstack dcount sub{end}repeat grestore SpecialSave restore end}N /@defspecial{SDict begin}N /@fedspecial{end}B /li{lineto}B /rl{ rlineto}B /rc{rcurveto}B /np{/SaveX currentpoint /SaveY X N 1 setlinecap newpath}N /st{stroke SaveX SaveY moveto}N /fil{fill SaveX SaveY moveto}N /ellipse{/endangle X /startangle X /yrad X /xrad X /savematrix matrix currentmatrix N 
TR xrad yrad scale 0 0 1 startangle endangle arc savematrix setmatrix}N end %%EndProcSet TeXDict begin 40258431 52099146 1000 300 300 (/stumm/a0/tandri/pdpta/pdpta.dvi) @start /Fa 175[27 7[27 1[27 70[{}3 45.833332 /Courier rf /Fb 80[25 25 51[20 23 23 33 23 23 13 18 15 23 23 23 23 36 13 23 1[13 23 23 15 20 23 20 23 20 3[15 1[15 2[33 2[33 28 25 30 1[25 33 33 41 28 33 1[15 33 1[25 28 33 30 30 33 5[13 3[23 23 4[23 2[11 15 11 1[23 15 15 3[23 2[15 33[{}60 45.833332 /Times-Roman rf /Fc 81[29 51[23 26 2[26 29 16 23 23 2[29 29 42 16 2[16 29 29 16 26 29 26 29 29 13[29 2[36 42 1[48 6[36 1[42 39 1[36 11[29 29 29 29 29 2[15 19 45[{}36 58.333336 /Times-Italic rf /Fd 134[30 2[30 30 30 30 30 1[30 30 30 30 30 30 1[30 30 30 30 30 30 30 30 30 12[30 6[30 3[30 2[30 30 30 30 30 30 14[30 4[30 30 1[30 30 30 40[{}36 50.000000 /Courier rf /Fe 134[22 22 33 1[25 14 19 19 25 25 25 25 36 14 22 1[14 25 25 14 22 25 22 25 25 9[41 2[28 25 30 1[30 36 1[41 28 33 22 17 36 2[30 36 33 1[30 7[25 4[25 25 25 25 2[12 17 5[17 39[{}47 50.000000 /Times-Italic rf /Ff 1 1 df<FFFFF0FFFFF014027D881B>0 D E /Fg 4 117 df<1F0006000600060006000C000C000C00 0C0018F01B181C08180838183018301830306030603160616062C022C03C10177E9614>104 D<0300038003000000000000000000000000001C002400460046008C000C001800180018003100 3100320032001C0009177F960C>I<383C0044C6004702004602008E06000C06000C06000C0C00 180C00180C40181840181880300880300F00120E7F8D15>110 D<030003000600060006000600 FFC00C000C000C001800180018001800300030803080310031001E000A147F930D>116 D E /Fh 3 3 df<FFFFFFFCFFFFFFFC1E027C8C27>0 D<70F8F8F87005057C8E0E>I<C00003E0 000770000E38001C1C00380E00700700E00381C001C38000E700007E00003C00003C00007E0000 E70001C3800381C00700E00E00701C003838001C70000EE00007C000031818799727>I E /Fi 4 62 df<00200040008001000300060004000C000C001800180030003000300070006000 60006000E000E000E000E000E000E000E000E000E000E000E000E000E000E00060006000600070 00300030003000180018000C000C0004000600030001000080004000200B327CA413>40 
D<800040002000100018000C000400060006000300030001800180018001C000C000C000C000E0 00E000E000E000E000E000E000E000E000E000E000E000E000E000C000C000C001C00180018001 80030003000600060004000C00180010002000400080000B327DA413>I<000180000001800000 018000000180000001800000018000000180000001800000018000000180000001800000018000 00018000000180000001800000018000FFFFFFFEFFFFFFFE000180000001800000018000000180 000001800000018000000180000001800000018000000180000001800000018000000180000001 800000018000000180001F227D9C26>43 D<FFFFFFFEFFFFFFFE00000000000000000000000000 00000000000000000000000000000000000000FFFFFFFEFFFFFFFE1F0C7D9126>61 D E /Fj 16 111 dfk 134[21 1[30 1[21 12 16 14 1[21 21 21 32 12 2[12 21 21 14 18 21 18 21 18 3[14 1[14 17[14 5[28 8[12 21 21 5[21 21 1[12 10 14 45[{}32 41.666668 /Times-Roman rf /Fl 203[15 15 15 15 49[{}4 29.166668 /Times-Roman rf /Fm 203[17 17 17 17 17 48[{}5 33.333332 /Times-Roman rf /Fn 138[39 23 27 31 1[39 35 39 59 20 39 1[20 1[35 23 31 39 31 39 35 9[71 4[51 1[43 6[27 2[43 1[51 51 11[35 35 35 35 35 35 35 49[{}32 70.833336 /Times-Bold rf /Fo 69[22 8[25 1[28 28 3[22 47[22 25 25 36 25 25 14 19 17 25 25 25 25 39 14 25 14 14 25 25 17 22 25 22 25 22 3[17 1[17 30 2[47 36 36 30 28 33 1[28 36 36 44 30 36 19 17 36 36 28 30 36 33 33 36 3[28 1[14 14 25 25 25 25 25 25 25 25 25 25 1[12 17 12 2[17 17 17 39[{}75 50.000000 /Times-Roman rf /Fp 139[17 19 22 14[22 28 25 31[36 65[{}7 50.000000 /Times-Bold rf /Fq 2 104 dfr 134[29 2[29 29 16 23 19 1[29 29 29 45 16 29 1[16 29 29 19 26 29 26 29 26 11[42 36 32 5[52 7[36 42 39 1[42 54 5[16 4[29 29 2[29 2[15 19 15 44[{}37 58.333336 /Times-Roman rf /Fs 134[42 3[46 28 32 37 1[46 42 46 69 23 2[23 46 42 1[37 46 37 46 42 13[46 2[51 2[78 8[60 60 67[{}23 83.333336 /Times-Bold rf end %%EndProlog %%BeginSetup %%Feature: *Resolution 300dpi TeXDict begin %%EndSetup %%Page: 1 1 1 0 bop 80 177 a Fs(Computation)19 b(and)i(Data)e(Partitioning)g(on)h (Scalable)341 280 y(Shar)o(ed)g(Memory)g(Multipr)o(ocessors)403 451 y Fr(Sudarsan)15 b(T)l(andri)29 
b(and)g(T)l(arek)14 b(S.)g(Abdelrahman) 316 526 y(Department)h(of)g(Electrical)f(and)h(Computer)g(Engineering)284 601 y(The)f(University)h(of)g(T)l(oronto,)f(T)l(oronto,)g(Canada,)f(M5S)i (1A4)478 675 y(e-mail:)g Fq(f)p Fr(tandri,tsa)p Fq(g)p Fr(@eecg.toronto.edu) 833 865 y Fp(Abstract)217 945 y Fo(In)g(this)h(paper)f(we)h(identify)f(the)h (factors)f(that)h(af)o(fect)f(the)h(derivation)e(of)i(com-)217 999 y(putation)10 b(and)h(data)g(partitions)g(on)g(scalable)g(shared)g (memory)g(multiprocessors)217 1053 y(\(SSMMs\).)18 b(W)l(e)12 b(show)h(that)f(these)h(factors)f(necessitate)i(an)e(SSMM-conscious)217 1107 y(approach.)17 b(In)10 b(addition)g(to)g(remote)g(memory)f(access,)k (which)d(is)h(the)f(sole)h(factor)217 1161 y(on)19 b(distributed)g(memory)f (multiprocessors,)k(cache)d(af)o(\256nity)m(,)i(memory)e(con-)217 1216 y(tention)12 b(and)h(false)g(sharing)f(are)h(important)f(factors)g(that) h(must)g(be)g(considered.)217 1270 y(Experimental)g(evidence)h(is)g (presented)g(to)g(demonstrate)f(the)h(impact)f(of)h(these)217 1324 y(factors)i(on)g(performance)g(using)g(three)h(applications)f(on)h(the)f (KSR1)h(and)f(the)217 1378 y(Hector)c(multiprocessors.)4 1540 y Fn(1)71 b(Intr)o(oduction)4 1667 y Fo(Scalable)12 b(shared)g(memory)f (multiprocessors)g(\(SSMMs\))g(are)h(becoming)f(increasingly)h(popular)f(and) h(a)4 1721 y(viable)e(alternative)f(to)h(distributed)f(memory)g (multiprocessors)h(\(DMMs\).)17 b(The)11 b(Stanford)e(DASH)g([20],)4 1775 y(FLASH)i([14)o(],)h(the)f(KSR1)f([24],)h(T)m(oronto')m(s)f(Hector)h ([26)o(],)h(NUMAchine)f([1)o(],)h(and)f(the)f(Cray)h(T3D)h([23)o(])4 1830 y(are)d(some)g(SSMMs)g(currently)e(in)i(use)g(or)f(under)g(development.) 
17 b(Processors)9 b(in)f(a)h(SSMM)g(share)g(a)g(single)4 1884 y(coherent)f(address)g(space.)17 b(However)n(,)9 b(shared)f(memory)g(is)g (physically)g(distributed)g(to)f(allow)h(scalability)l(,)4 1938 y(as)17 b(shown)f(in)g(Figure)f(1.)29 b(This)17 b(distribution)e(of)g (shared)i(memory)e(results)h(in)g(non-uniform)e(memory)4 1992 y(access)f(latencies,)g(depending)f(on)f(the)h(distance)h(between)f(a)g (processor)f(and)h(memory)m(.)17 b(Consequently)m(,)4 2046 y(careful)12 b(placement)g(and)g(management)g(of)g(data)h(is)g(essential)g (for)e(scaling)i(performance.)77 2122 y(W)l(e)i(believe)f(that)g(data)g (distribution)732 2104 y Fm(1)764 2122 y Fo(is)g(a)h(good)f(paradigm)f(for)h (managing)f(data)i(in)f(data-parallel)4 2176 y(applications)h(on)g(SSMMs)g ([3)o(,)h(21].)25 b(The)16 b(division)e(of)h(array)f(data)h(allows)g(a)g (compiler)f(to)h(place)g(data)4 2230 y(in)g(the)g(physical)f(memory)g(of)h (the)g(processor)f(that)h(uses)h(it)e(the)h(most,)h(and)f(also)g(allows)g (the)g(compiler)4 2284 y(to)k(partition)f(the)h(computations)g(of)f(parallel) h(loops.)38 b(W)l(e)19 b(have)g(experimented)g(with)f(programmer)4 2339 y(speci\256ed)12 b(data)f(distributions)g(on)g(the)h(Hector)f (multiprocessor)f(and)i(have)g(found)e(them)h(to)h(be)f(ef)o(fective)4 2393 y(in)e(improving)e(performance.)16 b(However)n(,)10 b(the)e(task)h(of)g (selecting)g(a)g(good)f(data)h(distribution)f(requires)g(the)4 2447 y(programmer)i(to)h(understand)g(both)f(the)i(parallel)e(machine)h (architecture)g(and)g(the)g(data)g(access)i(patterns)4 2501 y(in)19 b(the)f(program.)37 b(Porting)17 b(programs)h(to)h(various)g (machines)g(and)f(tuning)h(them)f(for)g(performance)4 2555 y(becomes)g(a)f(tedious)g(and)g(laborious)g(process.)33 b(Consequently)m(,)19 b(it)e(is)h(desirable)f(to)g(derive)f(data)i(and)4 2609 y(computation)h (partitions)g(automatically)h(using)g(a)g(compiler)m(.)40 b(The)21 b(objective)e(of)h(this)g(paper)g(is)g(to)4 2664 y(describe)13 
b(the)f(factors)g(that)g(af)o(fect)g(the)g(derivation)g(of)g(computation)f (and)i(data)f(partitions)g(on)g(SSMMs.)77 2739 y(On)19 b(DMMs,)k(the)c(main)g (factor)f(that)i(af)o(fects)f(the)g(performance)f(of)h(an)g(application)g(is) g(the)g(cost)4 2793 y(of)d(interprocessor)f(communication.)28 b(Consequently)m(,)17 b(scalable)g(performance)e(can)h(be)g(achieved)g(by)p 4 2838 737 2 v 62 2869 a Fl(1)79 2884 y Fk(In)10 b(this)f(paper)i(we)g(use)f (the)g(terms)h(data)g(distributi)o(ons)c(and)k(data)f(partitions)f (interchangeably)m(.)p eop %%Page: 2 2 2 1 bop 175 533 a @beginspecial 114 @llx 408 @lly 476 @urx 553 @ury 3600 @rwi @setspecial %%BeginDocument: numaarch1.ps /arrowHeight 10 def /arrowWidth 5 def /IdrawDict 51 dict def IdrawDict begin /reencodeISO { dup dup findfont dup length dict begin { 1 index /FID ne { def }{ pop pop } ifelse } forall /Encoding ISOLatin1Encoding def currentdict end definefont } def /ISOLatin1Encoding [ /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright /parenleft/parenright/asterisk/plus/comma/minus/period/slash /zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon /less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N /O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright /asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m /n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/dotlessi/grave/acute/circumflex/tilde/macron/breve /dotaccent/dieresis/.notdef/ring/cedilla/.notdef/hungarumlaut /ogonek/caron/space/exclamdown/cent/sterling/currency/yen/brokenbar 
/section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot /hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior /acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine /guillemotright/onequarter/onehalf/threequarters/questiondown /Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla /Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex /Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis /multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute /Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis /aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave /iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex /otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis /yacute/thorn/ydieresis ] def /Times-Roman reencodeISO def /none null def /numGraphicParameters 17 def /stringLimit 65535 def /Begin { save numGraphicParameters dict begin } def /End { end restore } def /SetB { dup type /nulltype eq { pop false /brushRightArrow idef false /brushLeftArrow idef true /brushNone idef } { /brushDashOffset idef /brushDashArray idef 0 ne /brushRightArrow idef 0 ne /brushLeftArrow idef /brushWidth idef false /brushNone idef } ifelse } def /SetCFg { /fgblue idef /fggreen idef /fgred idef } def /SetCBg { /bgblue idef /bggreen idef /bgred idef } def /SetF { /printSize idef /printFont idef } def /SetP { dup type /nulltype eq { pop true /patternNone idef } { dup -1 eq { /patternGrayLevel idef /patternString idef } { /patternGrayLevel idef } ifelse false /patternNone idef } ifelse } def /BSpl { 0 begin storexyn newpath n 1 gt { 0 0 0 0 0 0 1 1 true subspline n 2 gt { 0 0 0 0 1 1 2 2 false subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 2 copy false subspline } if n 2 sub dup n 1 sub dup 2 copy 2 copy false subspline patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone 
not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Circ { newpath 0 360 arc patternNone not { ifill } if brushNone not { istroke } if } def /CBSpl { 0 begin dup 2 gt { storexyn newpath n 1 sub dup 0 0 1 1 2 2 true subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 0 0 false subspline n 2 sub dup n 1 sub dup 0 0 1 1 false subspline patternNone not { ifill } if brushNone not { istroke } if } { Poly } ifelse end } dup 0 4 dict put def /Elli { 0 begin newpath 4 2 roll translate scale 0 0 1 0 360 arc patternNone not { ifill } if brushNone not { istroke } if end } dup 0 1 dict put def /Line { 0 begin 2 storexyn newpath x 0 get y 0 get moveto x 1 get y 1 get lineto brushNone not { istroke } if 0 0 1 1 leftarrow 0 0 1 1 rightarrow end } dup 0 4 dict put def /MLine { 0 begin storexyn newpath n 1 gt { x 0 get y 0 get moveto 1 1 n 1 sub { /i exch def x i get y i get lineto } for patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Poly { 3 1 roll newpath moveto -1 add { lineto } repeat closepath patternNone not { ifill } if brushNone not { istroke } if } def /Rect { 0 begin /t exch def /r exch def /b exch def /l exch def newpath l b moveto l t lineto r t lineto r b lineto closepath patternNone not { ifill } if brushNone not { istroke } if end } dup 0 4 dict put def /Text { ishow } def /idef { dup where { pop pop pop } { exch def } ifelse } def /ifill { 0 begin gsave patternGrayLevel -1 ne { fgred bgred fgred sub patternGrayLevel mul add fggreen bggreen fggreen sub patternGrayLevel mul add fgblue bgblue fgblue sub patternGrayLevel mul add setrgbcolor eofill } { eoclip originalCTM setmatrix pathbbox /t exch def /r exch def /b exch def /l exch def /w r l sub ceiling cvi def /h t b sub ceiling cvi def 
/imageByteWidth w 8 div ceiling cvi def /imageHeight h def bgred bggreen bgblue setrgbcolor eofill fgred fggreen fgblue setrgbcolor w 0 gt h 0 gt and { l b translate w h scale w h true [w 0 0 h neg 0 h] { patternproc } imagemask } if } ifelse grestore end } dup 0 8 dict put def /istroke { gsave brushDashOffset -1 eq { [] 0 setdash 1 setgray } { brushDashArray brushDashOffset setdash fgred fggreen fgblue setrgbcolor } ifelse brushWidth setlinewidth originalCTM setmatrix stroke grestore } def /ishow { 0 begin gsave fgred fggreen fgblue setrgbcolor /fontDict printFont printSize scalefont dup setfont def /descender fontDict begin 0 [FontBBox] 1 get FontMatrix end transform exch pop def /vertoffset 1 printSize sub descender sub def { 0 vertoffset moveto show /vertoffset vertoffset printSize sub def } forall grestore end } dup 0 3 dict put def /patternproc { 0 begin /patternByteLength patternString length def /patternHeight patternByteLength 8 mul sqrt cvi def /patternWidth patternHeight def /patternByteWidth patternWidth 8 idiv def /imageByteMaxLength imageByteWidth imageHeight mul stringLimit patternByteWidth sub min def /imageMaxHeight imageByteMaxLength imageByteWidth idiv patternHeight idiv patternHeight mul patternHeight max def /imageHeight imageHeight imageMaxHeight sub store /imageString imageByteWidth imageMaxHeight mul patternByteWidth add string def 0 1 imageMaxHeight 1 sub { /y exch def /patternRow y patternByteWidth mul patternByteLength mod def /patternRowString patternString patternRow patternByteWidth getinterval def /imageRow y imageByteWidth mul def 0 patternByteWidth imageByteWidth 1 sub { /x exch def imageString imageRow x add patternRowString putinterval } for } for imageString end } dup 0 12 dict put def /min { dup 3 2 roll dup 4 3 roll lt { exch } if pop } def /max { dup 3 2 roll dup 4 3 roll gt { exch } if pop } def /midpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 x1 add 2 div y0 y1 add 2 div end } dup 0 4 dict put def 
/thirdpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 2 mul x1 add 3 div y0 2 mul y1 add 3 div end } dup 0 4 dict put def /subspline { 0 begin /movetoNeeded exch def y exch get /y3 exch def x exch get /x3 exch def y exch get /y2 exch def x exch get /x2 exch def y exch get /y1 exch def x exch get /x1 exch def y exch get /y0 exch def x exch get /x0 exch def x1 y1 x2 y2 thirdpoint /p1y exch def /p1x exch def x2 y2 x1 y1 thirdpoint /p2y exch def /p2x exch def x1 y1 x0 y0 thirdpoint p1x p1y midpoint /p0y exch def /p0x exch def x2 y2 x3 y3 thirdpoint p2x p2y midpoint /p3y exch def /p3x exch def movetoNeeded { p0x p0y moveto } if p1x p1y p2x p2y p3x p3y curveto end } dup 0 17 dict put def /storexyn { /n exch def /y n array def /x n array def n 1 sub -1 0 { /i exch def y i 3 2 roll put x i 3 2 roll put } for } def /SSten { fgred fggreen fgblue setrgbcolor dup true exch 1 0 0 -1 0 6 -1 roll matrix astore } def /FSten { dup 3 -1 roll dup 4 1 roll exch newpath 0 0 moveto dup 0 exch lineto exch dup 3 1 roll exch lineto 0 lineto closepath bgred bggreen bgblue setrgbcolor eofill SSten } def /Rast { exch dup 3 1 roll 1 0 0 -1 0 6 -1 roll matrix astore } def /arrowhead { 0 begin transform originalCTM itransform /taily exch def /tailx exch def transform originalCTM itransform /tipy exch def /tipx exch def /dy tipy taily sub def /dx tipx tailx sub def /angle dx 0 ne dy 0 ne or { dy dx atan } { 90 } ifelse def gsave originalCTM setmatrix tipx tipy translate angle rotate newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto patternNone not { originalCTM setmatrix /padtip arrowHeight 2 exp 0.25 arrowWidth 2 exp mul add sqrt brushWidth mul arrowWidth div def /padtail brushWidth 2 div def tipx tipy translate angle rotate padtip 0 translate arrowHeight padtip add padtail add arrowHeight div dup scale arrowheadpath ifill } if brushNone not { originalCTM setmatrix tipx tipy translate angle rotate arrowheadpath istroke } 
if grestore end } dup 0 9 dict put def /arrowheadpath { newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto } def /leftarrow { 0 begin y exch get /taily exch def x exch get /tailx exch def y exch get /tipy exch def x exch get /tipx exch def brushLeftArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def /rightarrow { 0 begin y exch get /tipy exch def x exch get /tipx exch def y exch get /taily exch def x exch get /tailx exch def brushRightArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def Begin [ 0.799705 0 0 0.799705 0 0 ] concat /originalCTM matrix currentmatrix def Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 433.125 504.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 433.125 552.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 265.125 504.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 265.125 552.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 137.125 504.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 137.125 552.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 
SetCBg 0 SetP [ 0.125 -0 -0 0.125 321.125 600.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 489.125 600.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 193.125 600.625 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I Elli 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 144 410 ] concat 453 529 448 32 Elli End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0.75 SetP [ 0.125 -0 -0 0.125 486.625 598.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl none SetB %I b n 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.084375 -0 -0 0.084375 505.272 606.534 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0.75 SetP [ 0.125 -0 -0 0.125 318.625 598.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl none SetB %I b n 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.084375 -0 -0 0.084375 337.272 606.534 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0.75 SetP [ 0.125 -0 -0 0.125 190.625 598.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl none SetB %I b n 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.084375 -0 -0 0.084375 209.272 606.534 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 
267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0.75 SetP [ 0.125 -0 -0 0.125 134.625 502.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl none SetB %I b n 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.084375 -0 -0 0.084375 153.272 510.534 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.125 -0 -0 0.125 262.625 502.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.125 -0 -0 0.125 430.625 502.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I Elli 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 362.875 611.625 ] concat 617 99 16 16 Elli End Begin %I Elli 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 378.875 611.625 ] concat 617 99 16 16 Elli End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 133.5 439.5 ] concat 117 369 181 369 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 261.5 439.5 ] concat 117 369 181 369 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 429.5 439.5 ] concat 117 369 181 369 Line End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 179.5 532 ] concat [ (Procr) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 307.5 532 ] concat [ (Procr) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 475.5 532 ] concat [ (Procr) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 235.5 628 ] concat [ (Mem) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 363.5 628 ] concat [ (Mem) 
] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 531.5 628 ] concat [ (Mem) ] Text End Begin %I Elli 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.125 -0 -0 0.125 346.875 611.625 ] concat 617 99 16 16 Elli End Begin %I BSpl 1 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 19.5 393.5 ] concat 457 333 473 381 441 365 457 413 4 BSpl End Begin %I BSpl 1 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 147.5 393.5 ] concat 457 333 473 381 441 365 457 413 4 BSpl End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 360 555 ] concat [ (Remote) (memory) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 232 555 ] concat [ (Local) (memory) ] Text End Begin %I BSpl 1 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 0.5 -0 -0 0.5 315.5 393.5 ] concat 457 333 473 381 441 365 457 413 4 BSpl End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 528 555 ] concat [ (Remote) (memory) ] Text End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0.75 SetP [ 0.125 -0 -0 0.125 134.625 550.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl none SetB %I b n 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.084375 -0 -0 0.084375 153.272 558.534 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.125 -0 -0 0.125 262.625 550.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I CBSpl 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.125 -0 -0 0.125 430.625 550.125 ] concat 267 271 267 335 331 335 587 335 651 335 651 271 651 143 651 79 587 79 331 79 267 79 267 143 12 CBSpl End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 177 580 ] concat [ (Cache) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 305 
580 ] concat [ (Cache) ] Text End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 473 580 ] concat [ (Cache) ] Text End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 60 364 ] concat 132 180 132 196 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 188 364 ] concat 132 180 132 196 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 356 364 ] concat 132 180 132 196 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 49 237 ] concat 143 355 143 435 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 49 237 ] concat 439 355 439 427 Line End Begin %I Line 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 49 237 ] concat 271 355 271 427 Line End Begin %I Elli 1 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 1 SetP [ 0.5 -0 -0 0.5 141.5 407.5 ] concat 453 529 448 32 Elli End Begin %I Text 0 0 0 SetCFg Times-Roman 12 SetF [ 1 0 0 1 306.5 676 ] concat [ (Interconnection Network) ] Text End End %I eop showpage end %%EndDocument @endspecial 295 598 a Fo(Figure)11 b(1:)18 b(Scalable)12 b(shared-memory)f (multiprocessor)h(architecture.)4 719 y(partitioning)g(data)i(and)g (computations)f(in)g(a)h(way)f(that)h(minimizes)f(interprocessor)g (communications.)4 773 y(On)f(SSMMs,)h(processors)f(communicate)f(through)g (shared)g(memory)m(,)h(and)f(the)h(cost)g(of)f(interprocessor)4 827 y(communications)h(\(i.e.,)i(remote)e(memory)f(access\))j(is)f (relatively)e(inexpensive.)19 b(W)l(e)13 b(show)g(that)f(cache)4 881 y(af)o(\256nity)m(,)i(memory)f(contention)h(and)g(false)g(sharing)g(are)g (additional)g(factors)g(that)f(must)i(be)f(considered)4 935 y(in)i(the)g(selection)g(of)f(data)h(distributions.)28 b(Furthermore,)16 b(the)g(presence)g(of)f(a)h(single)g(shared)g(address)4 989 y(space)i(allows)g(\257exibility)f(in)g(the)h(selection)g(of)f(a)h (computation)e(partition.)33 b(Speci\256cally)m(,)19 b(we)f(show)4 1044 
y(that)h(relaxing)g(the)h(commonly)e(used)i(owner)o(-computes)f(rule)g ([15)o(])h(has)g(performance)e(advantages.)4 1098 y(W)l(e)d(present)g (experimental)f(results)i(to)e(support)h(our)f(conclusions)i(using)f(three)f (applications)h(on)g(two)4 1152 y(SSMMs,)e(the)g(Hector)f(and)g(the)g(KSR1)h (multiprocessors.)77 1228 y(The)g(remainder)f(of)g(this)g(paper)g(is)h(or)o (ganized)f(as)h(follows.)18 b(Section)12 b(2)h(presents)f(an)h(overview)f (data)4 1282 y(distributions.)35 b(Section)18 b(3)g(describes)h(the)f (factors)f(that)h(impact)g(on)g(the)h(selection)f(of)g(computation)4 1336 y(and)i(data)g(partitions)f(on)g(SSMMs.)41 b(Section)19 b(4)h(gives)g(experimental)f(evidence)h(of)f(the)h(impact)f(of)4 1390 y(cache)e(af)o(\256nity)e(and)h(false)g(sharing)g(on)g(the)g(choice)h (of)e(data)h(partitions.)29 b(Section)16 b(5)g(presents)h(results)4 1444 y(to)d(show)g(that)g(the)g(\257exibility)f(in)h(selecting)h(the)f (computation)f(partitioning)g(can)h(be)g(used)h(to)f(improve)4 1498 y(performance.)j(Section)9 b(6)i(reviews)f(related)g(work.)17 b(Finally)m(,)11 b(Section)e(7)i(presents)f(concluding)g(remarks)4 1553 y(and)j(directions)e(for)h(future)f(work.)4 1734 y Fn(2)71 b(Data)19 b(Distributions)4 1861 y Fo(Data)10 b(distribution)f([15)o(,)i(16]) e(is)i(achieved)f(by)g(specifying)f(a)h(partitioning)f(scheme)h(for)f(each)i (array)e(in)h(the)4 1915 y(program)h(and)i(by)f(specifying)g(a)g(processor)h (geometry)e(to)i(which)f(array)g(partitions)f(map.)18 b(A)13 b(processor)4 1970 y(geometry)g(is)i(an)f Fj(n)p Fo(-dimensional)f(Cartesian) h(grid)f(of)h(virtual)f(processors)h Fi(\()p Fj(V)1385 1977 y Fm(0)1404 1970 y Fj(;)8 b(V)1454 1977 y Fm(1)1473 1970 y Fj(;)g Fh(\001)g(\001)g(\001)g Fj(;)g(V)1611 1977 y Fg(n)p Ff(\000)p Fm(1)1679 1970 y Fi(\))p Fo(,)15 b(where)4 2024 y Fj(V)32 2031 y Fg(i)63 2024 y Fo(is)i(the)f(number)g(of)g(processors)h(in)f (the)g Fj(i)793 2006 y Fg(th)844 2024 y Fo(dimension)g(of)g(the)h(grid,)g (and)f Fj(V)1430 2031 y Fm(0)1463 2024 y Fh(\002)d 
Fj(V)1543 2031 y Fm(1)1575 2024 y Fh(\002)g(\001)8 b(\001)g(\001)14 b(\002)f Fj(V)1779 2031 y Fg(n)p Ff(\000)p Fm(1)4 2078 y Fo(=)i Fj(P)7 b Fo(,)16 b(the)f(total)f(number)h(of)f(processors.)26 b(A)15 b(partitioning)e(scheme)i(assigns)h(a)f Fe(partitioning)f(attribute)4 2132 y Fo(to)k(each)g(dimension)g(the)g(array)m(.)34 b(There)18 b(are)g(four)f(partitioning)f(attributes.)35 b(The)18 b Fd(Block)g Fo(attribute)4 2186 y(divides)f(the)g(corresponding)g(dimension)g(of)f(the)h (array)g(in)g(approximately)f(equal)h(size)h(blocks)f(such)4 2240 y(that)j(a)g(processor)g(owns)h(a)f(contiguous)g(range)g(of)f(that)h (dimension)g(of)g(the)g(array)m(.)41 b(The)20 b Fd(Cyclic)4 2295 y Fo(attribute)11 b(divides)h(the)h(corresponding)e(array)g(dimension)h (by)g(distributing)f(the)h(array)f(elements)i(in)f(this)4 2349 y(dimension)g(to)g(processors)h(in)f(a)h(round-robin)d(fashion.)18 b(The)13 b Fd(BlockCyclic)f Fo(attribute)f(\256rst)h(groups)4 2403 y(array)f(elements)h(in)f(the)g(corresponding)g(dimension)g(in)g (contiguous)g(blocks)h(of)f(a)h(given)f(size,)h(and)g(then)4 2457 y(assigns)k(the)f(blocks)f(to)h(processors)g(in)g(a)g(round-robin)d (fashion.)26 b(The)15 b(block)f(size,)j(called)d(the)h Fe(block-)4 2511 y(cyclic)10 b(factor)p Fo(,)h(is)e(supplied)h(by)f(the)h(programmer)m(.) 
16 b(Finally)m(,)9 b(the)h Fd(*)f Fo(attribute)g(is)h(used)g(to)f(indicate)g (that)h(the)4 2565 y(corresponding)f(dimension)g(of)g(the)h(array)f(is)h(not) f(distributed.)17 b(The)10 b(processor)g(geometry)f(on)g(which)h(the)4 2620 y(array)h(is)i(mapped)e(determines)h(the)g(number)f(of)g(processors)i (assigned)f(to)g(each)g(distributed)f(dimension)4 2674 y(of)h(the)f(array)m (.)18 b(For)11 b(example,)h(distributing)e(a)i(two)g(dimensional)f(array)h (using)f(the)h Fd(\(Block,Block\))4 2728 y Fo(attributes)g(onto)h(a)g(two)f (dimensional)h(processor)f(geometry)g(of)h(\(2,4\),)f(distributes)h(the)f (array)h(on)f(to)h(the)4 2782 y(8)k(processors,)i(assigning)f(2)f(processors) g(to)g(the)g(\256rst)g(dimension)g(and)g(4)g(processors)g(to)g(the)g(second)4 2836 y(dimension.)p eop %%Page: 3 3 3 2 bop 4 -21 a Fn(3)71 b(Performance)21 b(Factors)4 106 y Fo(The)15 b(main)g(factor)f(that)g(af)o(fects)h(the)f(performance)g(of)g(a)h (parallel)f(application)g(on)h(a)g(DMM)g(is)g(the)g(rel-)4 160 y(atively)i(high)f(cost)h(of)g(interprocessor)f(communication.)30 b(For)17 b(example,)h(the)f(latency)f(for)g(a)h(remote)4 215 y(memory)e(access)i(on)e(the)h(CM5)g(multiprocessor)f(is)h(approximately)e (2560)h(processor)h(cycles)1699 196 y Fm(2)1718 215 y Fo(.)28 b(This)4 269 y(necessitates)16 b(the)e(selection)h(of)f(computation)f(and)h (data)h(partitions)f(that)g(minimize)f(the)i(cost)f(of)g(com-)4 323 y(munication.)27 b(In)15 b(contrast,)h(on)f(SSMMs,)i(processors)f (communicate)e(through)h(shared)g(memory)g(and)4 377 y(the)j(cost)h(of)f (remote)f(memory)h(access)h(is)g(relatively)e(small.)36 b(For)17 b(example,)j(the)f(cost)f(of)g(a)g(remote)4 431 y(read)11 b(operation)g(on)g (the)h(KSR1)f(is)h(approximately)e(170)h(processor)g(cycles)h([24].)17 b(Consequently)m(,)12 b(other)4 485 y(factors)h(come)f(into)h(play)g(in)f (the)h(selection)g(of)f(computation)g(and)h(data)g(partitions.)19 b(In)13 b(this)g(section)g(we)4 540 y(elaborate)h(on)g(these)g(factors)g(and) 
g(on)g(how)g(they)g(af)o(fect)g(performance,)f(and)h(consequently)m(,)h(af)o (fect)f(the)4 594 y(choice)f(of)f(data)g(and)g(computation)g(partitions.)4 755 y Fc(3.1)58 b(Cache)14 b(Af\256nity)4 853 y Fo(Caches)j(are)e(used)h(in)f (SSMMs)h(to)g(reduce)f(ef)o(fective)g(memory)f(access)j(time)e(and)h(reduce)f (contention)4 907 y(in)e(the)h(interconnection)e(network.)21 b(Data)14 b(is)g(transferred)e(between)i(cache)g(and)f(memory)g(in)g(units)g (of)h(a)4 961 y Fe(cache)g(line)p Fo(,)h(typically)e(a)h(multiple)f(of)g(the) h(processor)g(word)f(size.)24 b Fe(Spatial)13 b(r)n(euse)i Fo(occurs)e(when)h(other)4 1015 y(words)h(on)g(the)g(same)g(line)g(are)g (used)g(by)g(the)g(processor)g(before)f(the)h(line)g(is)g(\257ushed)g(from)f (the)h(cache.)4 1070 y(Analogously)m(,)g Fe(temporal)f(r)n(euse)i Fo(occurs)e(when)g(data)h(on)f(a)g(cache)h(line)f(is)h(used)g(again)f(before) g(the)g(line)4 1124 y(is)i(evicted)g(from)e(the)i(cache.)29 b(The)16 b(performance)e(of)i(an)f(application)h(depends)g(to)f(a)h(lar)o(ge) f(extent)h(on)4 1178 y(the)g(ability)g(of)g(the)g(caches)h(to)f(exploit)g (spatial)h(and)f(temporal)f(reuse.)31 b(In)16 b(some)g(cases,)j(this)d(may)h (be)4 1232 y(dif)o(\256cult)9 b(because)i(of)f(the)g(limited)f(capacity)h (and)g(associativity)g(of)g(caches.)18 b(Data)10 b(brought)f(into)h(a)g (cache)4 1286 y(by)16 b(a)g(reference)f(or)h(a)g(prefetch)f(may)h(be)g (evicted)f(before)h(being)f(used)h(or)g(reused,)h(because)g(of)e(either)4 1340 y(a)i(capacity)g(or)g(a)g(con\257ict)f(miss)i(caused)f(by)g(a)g (subsequent)h(reference.)31 b(Cache)18 b(misses)f(on)g(SSMMs)4 1395 y(adversely)f(af)o(fect)g(performance,)g(since)h(evicted)f(data)g(must)g (be)g(retrieved)f(from)g(its)i(home)e(memory)m(,)4 1449 y(which)k(may)g(be)g (remote)f(to)h(the)f(processor)m(.)38 b(Caches)20 b(play)f(less)h(of)e(an)h (important)f(role)g(in)h(DMMs)4 1503 y(because)g(cache)f(misses)h(result)e (exclusively)h(in)f(local)h(memory)f(accesses,)k(which)d(are)g(inexpensive)4 1557 y(relative)12 
b(to)g(interprocessor)g(communications.)4 1718 y Fc(3.2)58 b(False)14 b(Sharing)4 1816 y Fo(In)g(SSMMs)h(data)f(on)h (the)f(same)h(cache)g(line)f(may)g(be)h(shared)f(by)h(more)e(than)i(one)f (processor)n(,)h(and)g(the)4 1870 y(line)j(may)g(exist)g(in)g(more)g(than)g (one)g(processor)r(')m(s)g(cache)h(at)f(the)g(same)h(time.)35 b(Hardware)18 b(is)g(used)h(to)4 1925 y(maintain)13 b(the)f(consistency)i(of) e(the)h(multiple)g(copies)g(of)f(the)h(line,)h(typically)e(using)h(a)g (write-invalidate)4 1979 y(protocol)e([24,)h(14].)18 b Fe(T)m(rue)12 b(sharing)g Fo(occurs)g(when)g(two)g(or)f(more)g(processors)i(access)g(the)f (same)g(data)g(on)4 2033 y(a)k(cache)f(line,)i(and)e(it)g(re\257ects)g (necessary)h(data)f(communications)g(in)g(an)g(application.)27 b(On)15 b(the)g(other)4 2087 y(hand,)h Fe(false)e(sharing)h Fo(occurs)f(when)h(two)f(processors)h(access)h(dif)o(ferent)d(pieces)i(of)f (data)h(on)f(the)g(same)4 2141 y(cache)e(line.)18 b(If)11 b(processors)h (write)g(to)f(the)h(same)g(cache)g(line,)g(the)g(cache)g(consistency)h (hardware)e(causes)4 2195 y(the)j(cache)g(line)g(to)g(be)g(transferred)f (back)h(and)g(forth)f(between)h(processors)g(leading)g(to)g(a)g (\252ping-pong\272)4 2250 y(ef)o(fect)h([8)o(].)27 b(False)16 b(sharing)f(causes)h(extensive)g(invalidation)e(traf)o(\256c)g(and)i(can)f (considerably)g(degrade)4 2304 y(performance.)i(False)c(sharing)f(is)h (non-existent)e(on)i(DMMs.)4 2465 y Fc(3.3)58 b(Memory)14 b(Contention)4 2563 y Fo(Memory)i(contention)g(occurs)g(when)g(many)g(processors)h(access)h (data)e(in)g(a)g(single)h(memory)e(module)4 2617 y(at)j(the)g(same)h(time.)35 b(Since)18 b(the)g(communication)f(protocol)g(in)h(SSMMs)g(is)g(receiver)o (-initiated,)h(and)4 2671 y(transfers)i(data)g(in)f(units)h(of)g(relatively)f (small)h(cache)g(lines,)j(a)d(lar)o(ge)g(number)f(of)h(requests)g(to)g(the)4 2725 y(same)12 b(memory)f(can)h(over\257ow)f(memory)g(buf)o(fers)g(and)h (cause)g(excessive)h(delays)f(in)g(memory)e(response)4 2780 y(time)20 b([13].)42 
b(Contention)20 b(has)h(been)g(considered)g(less)g(of)f (a)h(performance)e(bottleneck)h(on)h(DMMs)p 4 2825 737 2 v 62 2855 a Fl(2)79 2870 y Fk(Calculated)10 b(based)h(on)f(the)g(elapsed)h (time)f(for)g(a)g(send-reply)g(message)i(of)e(128)g(bytes)g([19)o(].)p eop %%Page: 4 4 4 3 bop 4 -27 a Fo(because)16 b(a)g(sender)o(-initiated)e(communication)h (protocol)f(is)i(employed,)g(and)g(because)g(programmers)4 27 y(typically)f(communicate)f(data)i(in)f(lar)o(ge)g(infrequent)f(messages.) 28 b(Applications)15 b(on)g(DMMs)h(also)f(use)4 82 y(collective)d (communications)g([15)o(])g(that)h(further)e(reduce)h(contention.)4 243 y Fc(3.4)58 b(Over)o(head)14 b(of)g(Parallelism)4 341 y Fo(In)g(DMMs,)i(synchronization)e(is)h(achieved)f(through)g(data)g (communication.)24 b(However)n(,)15 b(on)g(SSMMs,)4 395 y(synchronization)9 b(is)h(explicit)e(and)i(is)g(independent)f(of)f(data)i(communication.)16 b(The)10 b(resulting)f(overhead)4 449 y(can)14 b(become)f(a)h(performance)e (bottleneck)h([27)o(],)h(and)f(must)h(be)f(minimized.)21 b(The)14 b(performance)e(of)h(an)4 503 y(application)e(is)h(also)h(af)o(fected)e(by)h (the)f(overhead)h(involved)f(in)h(parallelizing)e(loops,)j(manifested)e(in)h (the)4 557 y(form)h(of)h(computation)f(partitioning)f(tests)j([25)o(].)23 b(These)15 b(tests)g(can)f(be)g(eliminated)g(in)f(some)i(cases)g(by)4 612 y(compiler)g(analysis,)i(but)d(when)i(not)f(possible,)h(can)g(degrade)f (performance.)26 b(This)15 b(overhead)g(though)4 666 y(also)d(present)g(in)f (the)h(case)g(of)f(DMMs,)j(is)e(not)f(considered)h(signi\256cant)f(because)h (of)g(the)f(predominantly)4 720 y(high)h(cost)h(of)f(remote)g(memory)f (access.)4 902 y Fn(4)71 b(Impact)19 b(on)f(Data)h(Distribution)4 1029 y Fo(In)e(this)g(section)g(we)g(use)g(two)g(applications,)h Fd(Multigrid)e Fo(and)h Fd(Tred2)p Fo(,)h(to)f(illustrate)f(the)h(impact)4 1083 y(of)f(cache)h(af)o(\256nity)f(and)h(false)g(sharing)f(on)h(the)f (choice)h(of)f(a)h(data)g(distribution.)30 b(The)17 b(KSR1)f(system)4 1137 
y(is)f(used)f(because)h(of)f(its)h(lar)o(ge)f(cache)g(size,)i(and)e (because)h(of)f(the)g(presence)h(of)f(monitoring)e(hardware)4 1191 y(that)i(enables)h(the)g(measurement)f(of)g(the)g(number)g(of)f (non-local)h(memory)g(accesses)i(and)e(the)g(number)4 1245 y(of)e(cache)h(misses)h(for)d(a)i(processor)m(.)77 1321 y(The)j(KSR1)e(is)h (a)g(Cache)g(only)g(Memory)f(Architecture)g(\(COMA\))g(con\256gured)g(as)h(a) g(hierarchy)f(of)4 1375 y(slotted)c(rings)g(with)g(processing)g(cells)h(on)f (the)g(leaf-level)f(rings.)18 b(The)10 b(local)g(portion)g(of)f(shared)i (memory)4 1429 y(associated)g(with)e(a)i(processor)e(is)i(or)o(ganized)e(as)i (a)f(cache.)18 b(There)10 b(is)g(no)g(home)g(location)f(for)g(data,)i(rather) n(,)4 1483 y(data)k(may)g(exist)f(in)h(more)f(than)h(one)f(local)h(memory)m (.)24 b(The)16 b(hardware)e(maintains)g(the)h(consistency)g(of)4 1538 y(possible)e(multiple)e(copies)i(of)f(the)g(data.)77 1613 y(The)e(KSR1)g(implicitly)e(implements)i(the)f(owner)o(-computes)g(rule,)h (since)g(data)g(written)f(by)g(a)h(proces-)4 1667 y(sor)j(must)f(exclusively) g(reside)h(in)f(the)h(processor)r(')m(s)f(local)g(portion)g(of)g(the)g (shared)h(memory)m(.)k(Hardware)4 1722 y(automatically)j(migrates)g(data)h (to)g(the)f(processor)h(that)f(requests)h(the)g(data)f(in)h(units)g(of)f Fe(subpages)p Fo(.)4 1776 y(Hence,)13 b(the)f(computation)g(partitioning)e (of)i(a)g(loop)g(dictates)h(the)f(residence)g(of)g(a)g(data)h(item)e(and)i (hence)4 1830 y(the)k(distribution)f(of)h(the)g(arrays)g(in)g(the)g(loop.)33 b(Data)17 b(which)g(is)h(read)f(by)g(the)g(processors)h(may)f(exist)4 1884 y(in)e(multiple)e(local)i(memories,)g(and)g(read)f(requests)h(to)g(this) g(data)f(from)g(dif)o(ferent)f(processors)i(may)g(be)4 1938 y(satis\256ed)e(from)e(dif)o(ferent)g(portions)h(of)g(the)g(shared)h(memory)m (.)4 2099 y Fc(4.1)58 b(Cache-Conscious)13 b(Data)i(Distribution)4 2197 y Fo(The)j Fd(Multigrid)e Fo(application)g(from)g(the)h(NAS)f(suite)h 
(of)g(benchmarks)f(illustrates)h(how)g(data)g(dis-)4 2252 y(tributions)d (must)h(be)g(cache-conscious.)27 b Fd(Multigrid)14 b Fo(is)h(a)g(three)g (dimensional)f(solver)h(calculating)4 2306 y(the)j(potential)f(\256eld)h(on)f (a)h(cubical)g(grid.)34 b(W)l(e)18 b(focus)g(on)f(the)h(subroutine)f Fd(psinv)h Fo(which)f(uses)i(two)4 2360 y(3-dimensional)13 b(arrays)h Fj(U)20 b Fo(and)14 b Fj(R)p Fo(.)25 b(The)14 b(subroutine)g (mainly)f(performs)h(the)g(following)f(computation)4 2414 y(inside)i(a)h (triply)e(nested)h(loop:)23 b Fj(U)5 b Fi(\()p Fj(i;)23 b(j;)h(k)r Fi(\))15 b(+)j(=)33 b Fj(\013)p Fi(\()15 b Fj(R)p Fi(\()p Fj(f)5 b Fi(\()p Fj(i)p Fi(\))p Fj(;)24 b(g)r Fi(\()p Fj(j)s Fi(\))p Fj(;)f(h)p Fi(\()p Fj(k)r Fi(\)\)\))p Fo(,)16 b(where)f Fj(f)5 b Fi(\()p Fj(i)p Fi(\))15 b Fo(=)h Fj(i)c Fh(\000)g Fo(1,)4 2468 y Fj(i)18 b Fo(or)g Fj(i)13 b Fi(+)i Fo(1,)20 b(as)e(are)g(the)g (functions)g Fj(g)i Fo(and)e Fj(h)p Fo(.)36 b(The)18 b(loop)g(nest)g(is)h (fully)e(parallel.)35 b(The)18 b(application)4 2522 y(has)e(nearest)g (neighbor)e(communications)h(along)g(all)g(three)g(dimensions,)i(which)e(is)h (typical)f(of)g(many)4 2577 y(scienti\256c)d(applications.)77 2652 y(In)d(this)g(application,)g(we)g(choose)h(not)e(to)h(parallelize)f(the) h(innermost)g(loop)f(to)h(avoid)g(cache)g(line)g(false)4 2706 y(sharing)k(and)g(cache)h(interference;)e(successive)j(iterations)e(of)f (this)i(loop)f(access)h(successive)h(elements)4 2761 y(on)h(the)f(same)i (cache)f(line.)28 b(Hence)16 b(we)g(use)g(a)g(two)g(dimensional)f(grid)g(for) g(the)h(processor)f(geometry)m(.)4 2815 y(Since)10 b(the)g(application)g(has) h(nearest)f(neighbor)f(communications,)i Fd(Block)f Fo(distribution)f (performs)g(the)4 2869 y(best.)18 b(The)10 b(restriction)e(of)h(the)h (innermost)f(loop)g(to)g(be)h(sequential)f(requires)g(the)g(arrays)h(to)f(be) h(distributed)p eop %%Page: 5 5 5 4 bop 503 532 a @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: mg.ps /gnudict 40 dict def gnudict 
begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 
scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 473 M 2817 0 V LTb 600 473 M 63 0 V 2754 0 R -63 0 V 540 473 M (96) Rshow LTa 600 916 M 2817 0 V LTb 600 916 M 63 0 V 2754 0 R -63 0 V 540 916 M (98) Rshow LTa 600 1360 M 2817 0 V LTb 600 1360 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100) Rshow LTa 600 1804 M 2817 0 V LTb 600 1804 M 63 0 V 2754 0 R -63 0 V -2814 0 R (102) Rshow LTa 600 2247 M 2817 0 V LTb 600 2247 M 63 0 V 2754 0 R -63 0 V -2814 0 R (104) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (\(16,1\)) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(8,2\)) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(4,4\)) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(2,8\)) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(1,16\)) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1360 M currentpoint gsave translate 90 rotate 0 0 M (Normalized Execution Time \(w.r.t \(16,1\)\)) Cshow grestore 2008 51 M (Processor Geometry - 16 Processors ) Cshow LT1 1654 2106 M (160x160x160) Rshow 1714 2106 M 180 0 V 1523 341 R 2713 2048 L 2009 1826 L 1304 939 L 600 1360 L 1774 2106 A 3417 2447 A 2713 2048 A 2009 1826 A 1304 939 A 600 1360 A LT2 1654 2006 M (144x144x144) Rshow 1714 2006 M 180 0 V 1523 241 R 2713 1715 L -705 -67 V 1304 850 L 600 1360 L 1774 2006 B 3417 2247 B 2713 1715 B 2009 1648 B 1304 850 B 600 1360 B LT4 1654 1906 M (64x64x64) Rshow 1714 1906 M 180 0 V 1523 -590 R 2713 1160 L 2009 340 L 1304 495 L 600 1360 L 1774 1906 T 3417 1316 T 2713 1160 T 2009 340 T 1304 495 T 600 1360 T stroke grestore end showpage %%EndDocument @endspecial 377 612 a Fo(Figure)11 b(2:)18 b(Normalized)11 b(Execution)i(time)f(of)g Fd(Multigrid)p Fo(.)4 733 y(with)17 b Fd(\(*,Block,Block\))f Fo(since)h(the)g(arrays)g(are)g(assumed)h(to)f(be)h (stored)f(using)g(column)f(major)4 
787 y(ordering.)31 b(W)n(ith)16 b(16)h(processors,)h(it)f(is)g(possible)g(to)g(choose)g(one)g(of)f(the)h (\(16,1\),)h(\(8,2\),)f(\(4,4\),)h(\(2,8\))4 841 y(and)d(\(1,16\))g (processor)g(geometries.)27 b(The)15 b(choice)h(of)e(the)i(processor)f (geometry)f(af)o(fects)h(the)g(number)4 895 y(of)h(processors)g(that)g (execute)g(each)g(parallel)f(loop.)29 b(For)15 b(example,)i(a)f(processor)g (geometry)f(of)h(\(8,2\),)4 949 y(implies)11 b(8)g(processors)h(assigned)g (to)e(the)i(inner)e(parallel)h(loop)g(and)g(2)g(processors)g(assigned)h(to)f (the)g(outer)4 1004 y(parallel)h(loop.)77 1079 y(Figure)17 b(2)g(shows)h(the)f(execution)g(time)g(of)g(the)h(application)e(for)h (various)g(processor)g(geometries)4 1133 y(with)e(the)g Fd(\(*,Block,Block\)) e Fo(distribution)h(for)g(the)h(arrays)g(on)g(the)g(KSR1)f(with)h(16)g (processors,)4 1188 y(normalized)d(with)h(respect)g(to)g(the)g(\(16,1\))f (processor)h(geometry)m(.)19 b(For)12 b(a)h(small)g(data)g(size)h (\(64x64x64\),)4 1242 y(execution)22 b(time)f(is)h(minimized)e(by)i(a)g (distribution)e(with)h(equal)h(number)f(of)g(processors)h(in)f(each)4 1296 y(dimension,)15 b(i.e.,)i(\(4,4\).)24 b(This)16 b(is)f(the)f(same)h (distribution)e(scheme)j(suggested)f(in)f(the)h(Syracuse)f(High)4 1350 y(Performance)9 b(Fortran)h(applications)g(suite)772 1332 y Fm(3)801 1350 y Fo(for)g(DMMs.)19 b(However)n(,)11 b(when)f(the)h(data)g (size)g(is)g(lar)o(ge,)g(the)4 1404 y(processor)h(geometry)f(\(4,4\))h(no)g (longer)f(performs)g(the)h(best.)19 b(The)12 b(execution)g(time)g(is)g (minimized)f(with)4 1458 y(a)i(processor)f(geometry)g(of)g(\(8,2\).)77 1534 y(The)20 b(impact)e(of)h(processor)g(geometry)f(on)g(performance)g(is)h (due)g(to)g(cache)g(af)o(\256nity)m(,)h(as)g(can)f(be)4 1588 y(deduced)12 b(from)f(Figures)h(3)g(and)g(4.)19 b(Figure)11 b(3)h(shows)h(the)f(measured)g(number)g(of)f(cache)i(lines)f(accessed)4 1642 y(from)17 b(remote)g(memory)f(modules,)j(normalized)e(with)g(respect)h (to)f(the)h(processor)f(geometry)g(\(16,1\).)4 1697 
y(The)h(number)e(of)h (remote)f(memory)g(accesses)j(is)e(minimal)g(when)g(the)g(processor)g (geometry)f(is)h(\(4,4\))4 1751 y(for)h(all)g(data)h(sizes.)38 b(Figure)17 b(4)i(shows)g(the)g(average)f(measured)h(number)e(of)i(cache)g (misses)g(from)f(a)4 1805 y(processor)c(cache,)h(again)e(normalized)g(with)g (respect)h(to)g(the)f(processor)h(geometry)f(\(16,1\).)21 b(When)14 b(the)4 1859 y(data)e(size)h(is)f(small)g(\(64x64x64\),)g(the)g(data)g(used)g (by)g(a)h(processor)f(\256ts)g(into)f(the)h(256k)g(processor)g(cache)4 1913 y(and)19 b(the)g(misses)h(from)e(the)h(cache)h(in)f(this)g(case)h (re\257ect)f(remote)f(memory)g(accesses)j(that)e(occur)g(in)4 1967 y(the)13 b(parallel)g(program.)19 b(Hence,)14 b(the)f(predominant)f (factor)h(af)o(fecting)f(performance)g(is)h(interprocessor)4 2022 y(communication,)f(and)g(the)h(best)f(performance)g(is)g(attained)g (using)h(the)f(\(4,4\))g(geometry)m(.)77 2097 y(However)n(,)17 b(when)f(the)g(arrays)g(are)g(relatively)f(lar)o(ge)h(\(144x144x144\),)g(the) g(cache)g(capacity)h(is)f(no)4 2151 y(longer)g(suf)o(\256cient)h(to)g(hold)f (data)i(from)d(successive)k(iterations)d(of)h(the)g(outer)f(parallel)h(loop,) h(and)f(the)4 2206 y(number)10 b(of)h(cache)g(misses)h(increases.)19 b(When)11 b(the)g(number)f(of)h(processors)g(assigned)h(to)f(the)g(outer)f (loop)4 2260 y(increases,)j(the)f(number)f(of)h(misses)h(from)d(the)i(cache)h (also)f(increases.)19 b(The)12 b(\(4,4\))g(processor)g(geometry)4 2314 y(minimizes)d(the)f(amount)h(of)f(remote)g(memory)g(access,)k(but)c(the) h(\(16,1\))f(processor)h(geometry)f(minimizes)4 2368 y(the)k(amount)f(of)g (cache)h(misses.)19 b(The)12 b(distribution)e(with)i(\(8,2\))f(processor)g (geometry)g(strikes)h(a)g(balance)4 2422 y(between)17 b(the)g(cost)g(of)g (remote)f(memory)g(access)i(and)f(the)g(cost)g(of)g(cache)g(misses,)i (resulting)e(in)f(best)4 2476 y(overall)c(performance,)g(in)g(spite)g(of)g (higher)g(interprocessor)g(communication)f(cost.)4 2638 y Fc(4.2)58 b(False)14 
b(Sharing)g(Conscious)g(Data)h(Distribution)4 2736 y Fo(The)d(programs)f Fd(Tred2)h Fo(\(which)f(is)h(part)f(of)g(Eispack\),)i Fd(mdg)p Fo(,)f(and)g Fd(trfd)f Fo(\(which)g(are)h(both)f(part)h(of)f(the)4 2790 y(Perfect)f(Club)h(Benchmark)f(Suite\))g(exhibit)h(parallelism)f(which)h (result)f(in)h(considerable)g(false)g(sharing.)p 4 2835 737 2 v 62 2865 a Fl(3)79 2880 y Fk(http://www)m(.npac.syr)n(.edu/hpfa/)c(.)p eop %%Page: 6 6 6 5 bop 47 586 a @beginspecial 50 @llx 50 @lly 230 @urx 176 @ury 2057 @rwi @setspecial %%BeginDocument: spmiss.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A 
{ stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.050 0.050 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (30) Rshow LTa 600 568 M 2817 0 V LTb 600 568 M 63 0 V 2754 0 R -63 0 V 540 568 M (40) Rshow LTa 600 885 M 2817 0 V LTb 600 885 M 63 0 V 2754 0 R -63 0 V 540 885 M (50) Rshow LTa 600 1202 M 2817 0 V LTb 600 1202 M 63 0 V 2754 0 R -63 0 V -2814 0 R (60) Rshow LTa 600 1518 M 2817 0 V LTb 600 1518 M 63 0 V 2754 0 R -63 0 V -2814 0 R (70) Rshow LTa 600 1835 M 2817 0 V LTb 600 1835 M 63 0 V 2754 0 R -63 0 V -2814 0 R (80) Rshow LTa 600 2152 M 2817 0 V LTb 600 2152 M 63 0 V 2754 0 R -63 0 V -2814 0 R (90) Rshow LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (\(16,1\)) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(8,2\)) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(4,4\)) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(2,8\)) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(1,16\)) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1360 M currentpoint gsave translate 90 rotate 0 0 M (Normalized subpage misses \(w.r.t. 
\(16,1\)\)) Cshow grestore 2008 51 M (Processor Geometry - 16 Processors ) Cshow LT1 1654 2106 M (160x160x160) Rshow 1714 2106 M 180 0 V 600 2469 M 1304 1154 L 2009 495 L 705 646 V 3417 2453 L 1774 2106 A 600 2469 A 1304 1154 A 2009 495 A 2713 1141 A 3417 2453 A LT2 1654 2006 M (144x144x144) Rshow 1714 2006 M 180 0 V 600 2469 M 1304 1072 L 2009 473 L 705 567 V 3417 2387 L 1774 2006 B 600 2469 B 1304 1072 B 2009 473 B 2713 1040 B 3417 2387 B LT4 1654 1906 M (64x64x64) Rshow 1714 1906 M 180 0 V 600 2469 M 1304 1113 L 2009 543 L 705 649 V 3417 2444 L 1774 1906 T 600 2469 T 1304 1113 T 2009 543 T 2713 1192 T 3417 2444 T stroke grestore end showpage %%EndDocument @endspecial 124 640 a Fo(Figure)12 b(3.)18 b(Remote)12 b(Memory)g(Access.) 899 586 y @beginspecial 50 @llx 50 @lly 230 @urx 176 @ury 2057 @rwi @setspecial %%BeginDocument: datac.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 
1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.050 0.050 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (90) Rshow LTa 600 528 M 2817 0 V LTb 600 528 M 63 0 V 2754 0 R -63 0 V 540 528 M (100) Rshow LTa 600 806 M 2817 0 V LTb 600 806 M 63 0 V 2754 0 R -63 0 V 540 806 M (110) Rshow LTa 600 1083 M 2817 0 V LTb 600 1083 M 63 0 V 2754 0 R -63 0 V -2814 0 R (120) Rshow LTa 600 1360 M 2817 0 V LTb 600 1360 M 63 0 V 2754 0 R -63 0 V -2814 0 R (130) Rshow LTa 600 1637 M 2817 0 V LTb 600 1637 M 63 0 V 2754 0 R -63 0 V -2814 0 R (140) Rshow LTa 600 1915 M 2817 0 V LTb 600 1915 M 63 0 V 2754 0 R -63 0 V -2814 0 R (150) Rshow LTa 600 2192 M 2817 0 V LTb 600 2192 M 63 0 V 2754 0 R -63 0 V -2814 0 R (160) Rshow LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (170) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (\(16,1\)) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(8,2\)) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(4,4\)) Cshow LTa 
2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(2,8\)) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (\(1,16\)) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1360 M currentpoint gsave translate 90 rotate 0 0 M (Normalized cache misses \(w.r.t \(16,1\)\)) Cshow grestore 2008 51 M (Processor Geometry - 16 Processors ) Cshow LT1 1654 2106 M (160x160x160) Rshow 1714 2106 M 180 0 V 600 528 M 1304 817 L 705 626 V 705 721 V 704 250 V 1774 2106 A 600 528 A 1304 817 A 2009 1443 A 2713 2164 A 3417 2414 A LT2 1654 2006 M (144x144x144) Rshow 1714 2006 M 180 0 V 600 528 M 1304 678 L 705 488 V 705 610 V 704 333 V 1774 2006 B 600 528 B 1304 678 B 2009 1166 B 2713 1776 B 3417 2109 B LT4 1654 1906 M (64x64x64) Rshow 1714 1906 M 180 0 V 600 528 M 1304 329 L 705 -36 V 705 133 V 704 97 V 1774 1906 T 600 528 T 1304 329 T 2009 293 T 2713 426 T 3417 523 T stroke grestore end showpage %%EndDocument @endspecial 1085 640 a(Figure)g(4.)18 b(Cache)13 b(Misses.)47 1297 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: tred2.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 
0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (1e+06) Rshow LTb 600 2469 M 63 0 V 2754 0 R -63 0 V 540 2469 M (4e+06) Rshow LTa 600 1360 M 2817 0 V LTb 600 1360 M 31 0 V 2786 0 R -31 0 V LTa 600 2009 M 2817 0 V LTb 600 2009 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 31 0 V 2786 0 R -31 0 V LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (8) Cshow LTa 
2361 251 M 0 2218 V LTb 2361 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (10) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (12) Cshow LTa 3065 251 M 0 2218 V LTb 3065 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (14) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (16) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 340 1260 M currentpoint gsave translate 90 rotate 0 0 M (Execution Time \(Micro Seconds\)) Cshow grestore 2008 51 M (Number of Processors) Cshow LT0 3054 2306 M ("Cyclic") Rshow 3114 2306 M 180 0 V 776 910 M 176 318 V 352 533 V 705 -387 V 704 -149 V 704 707 V 3174 2306 D 776 910 D 952 1228 D 1304 1761 D 2009 1374 D 2713 1225 D 3417 1932 D LT1 3054 2206 M ("BlockCyclic") Rshow 3114 2206 M 180 0 V 776 942 M 952 817 L 1304 662 L 705 -34 V 704 110 V 704 1480 V 3174 2206 A 776 942 A 952 817 A 1304 662 A 2009 628 A 2713 738 A 3417 2218 A stroke grestore end showpage %%EndDocument @endspecial 140 1351 a(Figure)f(5.)18 b(Ef)o(fect)12 b(of)g(False)h (Sharing.)899 1297 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: tred2c.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } 
def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V -2814 0 R (40000) Rshow LTa 600 251 M 2817 0 V LTb 600 251 M 31 0 V 2786 0 R -31 0 V LTa 600 497 M 2817 0 V LTb 600 497 M 31 0 V 2786 0 R -31 0 V LTa 600 697 M 2817 0 V LTb 600 697 M 31 0 V 2786 0 R -31 0 V LTa 600 867 M 2817 0 V LTb 600 867 M 31 0 V 2786 0 R -31 0 V LTa 600 1014 M 2817 0 V LTb 600 1014 M 31 0 V 2786 0 R -31 0 V LTa 600 1144 M 2817 0 V LTb 600 1144 M 31 0 V 2786 0 R -31 0 V LTa 600 1260 M 2817 0 V LTb 600 1260 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100000) Rshow LTa 600 2023 M 2817 0 V LTb 600 2023 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 31 0 V 2786 0 R -31 0 V LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 
952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 0 2218 V LTb 2009 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (8) Cshow LTa 2361 251 M 0 2218 V LTb 2361 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (10) Cshow LTa 2713 251 M 0 2218 V LTb 2713 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (12) Cshow LTa 3065 251 M 0 2218 V LTb 3065 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (14) Cshow LTa 3417 251 M 0 2218 V LTb 3417 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (16) Cshow 600 251 M 2817 0 V 0 2218 V -2817 0 V 600 251 L 400 1660 M currentpoint gsave translate 90 rotate 0 0 M (Cache Misses) Cshow grestore 2008 51 M (Number of Processors) Cshow LT0 3054 2306 M ("Cyclic") Rshow 3114 2306 M 180 0 V 123 -856 R -704 -3 V -704 172 V -705 449 V 952 1885 L 776 2180 L 3174 2306 D 3417 1450 D 2713 1447 D 2009 1619 D 1304 2068 D 952 1885 D 776 2180 D LT1 3054 2206 M ("BlockCyclic") Rshow 3114 2206 M 180 0 V 3417 1029 M 2713 642 L 2009 475 L -705 601 V 952 1596 L 776 2166 L 3174 2206 A 3417 1029 A 2713 642 A 2009 475 A 1304 1076 A 952 1596 A 776 2166 A stroke grestore end showpage %%EndDocument @endspecial 1085 1351 a(Figure)f(6.)18 b(Cache)13 b(Misses.)4 1510 y(These)18 b(programs)f(have)g(triangular)f(iteration)g(spaces)i(which)f (necessitate)h(cyclical)f(distribution)f(for)4 1564 y(load)c(balancing.)19 b(The)13 b(choice)f(of)g(this)h(distribution)e(combined)h(with)g(the)h (storage)f(order)g(of)g(the)g(arrays)4 1619 y(cause)17 b(more)f(than)g(one)g (processor)g(to)g(share)g(the)g(same)h(cache)f(line,)i(leading)e(to)g(false)g (sharing.)29 b(The)4 1673 y(impact)17 b(of)f(this)h(false)h(sharing)e(is)i (shown)f(in)g(Figure)f(5)h(for)f(the)h Fd(Tred2)g Fo(application)f(on)h(the)g (KSR1)4 1727 y(multiprocessor)m(.)37 b(The)19 b(\256gure)f(shows)h(the)g 
(execution)f(time)g(of)g(the)h(application)f(for)g Fd(Cyclic)g Fo(and)4 1781 y Fd(BlockCyclic)12 b Fo(distributions)g(using)i(1)f(to)g(16)g (processors.)20 b(The)14 b(use)g(of)e(the)h Fd(Cyclic)g Fo(distribution)4 1835 y(results)f(in)g(a)h(lar)o(ge)f(number)f(of)h(cache)h(misses,)g(as)g (can)f(be)g(seen)h(in)f(Figure)g(6.)18 b(The)13 b(resulting)e(overhead)4 1889 y(causes)20 b(execution)f(time)g(to)g(increase)g(as)g(the)g(number)g(of) f(processors)i(increases.)39 b(The)19 b(arrays)g(are)4 1944 y(distributed)c(using)g(a)g Fd(BlockCyclic)f Fo(distribution,)h(where)g(the)g (size)h(of)f(the)g(block)g(is)h(equal)f(to)g(the)4 1998 y(size)22 b(of)e(the)h(cache)g(line,)i(which)e(ef)o(fectively)f(eliminates)h(false)g (sharing.)43 b(When)21 b(the)g(number)f(of)4 2052 y(processors)14 b(is)g(small,)g(the)f(load)h(is)g(relatively)e(well-balanced,)i(and)f(the)h (elimination)e(of)h(false)h(sharing)4 2106 y(improves)h(performance.)25 b(However)n(,)15 b(as)h(the)f(number)f(of)h(processors)g(increases,)i(the)e (load)f(becomes)4 2160 y(increasingly)f(imbalanced,)h(and)f(the)h(negative)f (impact)g(of)g(this)g(load)g(imbalance)h(begins)f(to)g(outweigh)4 2214 y(the)h(bene\256ts)h(of)e(eliminating)h(false)g(sharing.)24 b(A)14 b(compiler)g(for)f(SSMM)h(must)g(consider)h(this)f(tradeof)o(f)4 2269 y(between)f(load)f(imbalance)g(and)g(false)h(sharing)f(when)g (determining)g(data)g(distributions.)4 2450 y Fn(5)71 b(Impact)19 b(on)f(Computation)i(Partitioning)4 2577 y Fo(The)12 b(owner)o(-computes)f (rule)g(has)h(been)f(the)h(computation)f(partitioner)f(of)h(choice)g(for)g (compiling)g(HPF-)4 2631 y(type)17 b(languages)g(on)g(DMMs)h([16].)32 b(The)17 b(owner)o(-computes)f(rule)h(maps)g(a)g(statement)h(such)f(that)4 2686 y(the)h(computation)e(is)i(executed)g(on)g(the)f(processor)h(on)f (which)h(the)f(data)h(element)f(that)h(is)g(written)e(is)4 2740 y(local.)27 b(All)15 b(the)g(data)g(elements)g(that)g(are)g(required)f (to)h(compute)g(the)g(result)g(\(which)g(may)g(be)g(remote\))4 2794 
y(are)h(communicated)f(to)h(the)g(processor)m(.)29 b(A)16 b(strict)g(rule)f(such)i(as)f(owner)o(-computes)f(is)h(not)g(necessary)4 2848 y(on)h(a)f(SSMM)h(because)g(message)h(passing)f(code)g(is)g(not)f (generated)g(at)h(compile)f(time)g([3].)30 b(In)17 b(some)p eop %%Page: 7 7 7 6 bop 482 311 a @beginspecial 127 @llx 520 @lly 393 @urx 632 @ury 2160 @rwi @setspecial %%BeginDocument: adi.idraw /arrowhead { 0 begin transform originalCTM itransform /taily exch def /tailx exch def transform originalCTM itransform /tipy exch def /tipx exch def /dy tipy taily sub def /dx tipx tailx sub def /angle dx 0 ne dy 0 ne or { dy dx atan } { 90 } ifelse def gsave originalCTM setmatrix tipx tipy translate angle rotate newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto patternNone not { originalCTM setmatrix /padtip arrowHeight 2 exp 0.25 arrowWidth 2 exp mul add sqrt brushWidth mul arrowWidth div def /padtail brushWidth 2 div def tipx tipy translate angle rotate padtip 0 translate arrowHeight padtip add padtail add arrowHeight div dup scale arrowheadpath ifill } if brushNone not { originalCTM setmatrix tipx tipy translate angle rotate arrowheadpath istroke } if grestore end } dup 0 9 dict put def /arrowheadpath { newpath arrowHeight neg arrowWidth 2 div moveto 0 0 lineto arrowHeight neg arrowWidth 2 div neg lineto } def /leftarrow { 0 begin y exch get /taily exch def x exch get /tailx exch def y exch get /tipy exch def x exch get /tipx exch def brushLeftArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def /rightarrow { 0 begin y exch get /tipy exch def x exch get /tipx exch def y exch get /taily exch def x exch get /tailx exch def brushRightArrow { tipx tipy tailx taily arrowhead } if end } dup 0 4 dict put def /arrowHeight 10 def /arrowWidth 5 def /IdrawDict 51 dict def IdrawDict begin /reencodeISO { dup dup findfont dup length dict begin { 1 index /FID ne { def }{ pop pop } ifelse } forall /Encoding 
ISOLatin1Encoding def currentdict end definefont } def /ISOLatin1Encoding [ /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /space/exclam/quotedbl/numbersign/dollar/percent/ampersand/quoteright /parenleft/parenright/asterisk/plus/comma/minus/period/slash /zero/one/two/three/four/five/six/seven/eight/nine/colon/semicolon /less/equal/greater/question/at/A/B/C/D/E/F/G/H/I/J/K/L/M/N /O/P/Q/R/S/T/U/V/W/X/Y/Z/bracketleft/backslash/bracketright /asciicircum/underscore/quoteleft/a/b/c/d/e/f/g/h/i/j/k/l/m /n/o/p/q/r/s/t/u/v/w/x/y/z/braceleft/bar/braceright/asciitilde /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef/.notdef /.notdef/dotlessi/grave/acute/circumflex/tilde/macron/breve /dotaccent/dieresis/.notdef/ring/cedilla/.notdef/hungarumlaut /ogonek/caron/space/exclamdown/cent/sterling/currency/yen/brokenbar /section/dieresis/copyright/ordfeminine/guillemotleft/logicalnot /hyphen/registered/macron/degree/plusminus/twosuperior/threesuperior /acute/mu/paragraph/periodcentered/cedilla/onesuperior/ordmasculine /guillemotright/onequarter/onehalf/threequarters/questiondown /Agrave/Aacute/Acircumflex/Atilde/Adieresis/Aring/AE/Ccedilla /Egrave/Eacute/Ecircumflex/Edieresis/Igrave/Iacute/Icircumflex /Idieresis/Eth/Ntilde/Ograve/Oacute/Ocircumflex/Otilde/Odieresis /multiply/Oslash/Ugrave/Uacute/Ucircumflex/Udieresis/Yacute /Thorn/germandbls/agrave/aacute/acircumflex/atilde/adieresis /aring/ae/ccedilla/egrave/eacute/ecircumflex/edieresis/igrave /iacute/icircumflex/idieresis/eth/ntilde/ograve/oacute/ocircumflex /otilde/odieresis/divide/oslash/ugrave/uacute/ucircumflex/udieresis /yacute/thorn/ydieresis ] def /Helvetica reencodeISO def /none null def /numGraphicParameters 17 def /stringLimit 65535 def /Begin 
{ save numGraphicParameters dict begin } def /End { end restore } def /SetB { dup type /nulltype eq { pop false /brushRightArrow idef false /brushLeftArrow idef true /brushNone idef } { /brushDashOffset idef /brushDashArray idef 0 ne /brushRightArrow idef 0 ne /brushLeftArrow idef /brushWidth idef false /brushNone idef } ifelse } def /SetCFg { /fgblue idef /fggreen idef /fgred idef } def /SetCBg { /bgblue idef /bggreen idef /bgred idef } def /SetF { /printSize idef /printFont idef } def /SetP { dup type /nulltype eq { pop true /patternNone idef } { dup -1 eq { /patternGrayLevel idef /patternString idef } { /patternGrayLevel idef } ifelse false /patternNone idef } ifelse } def /BSpl { 0 begin storexyn newpath n 1 gt { 0 0 0 0 0 0 1 1 true subspline n 2 gt { 0 0 0 0 1 1 2 2 false subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 2 copy false subspline } if n 2 sub dup n 1 sub dup 2 copy 2 copy false subspline patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Circ { newpath 0 360 arc patternNone not { ifill } if brushNone not { istroke } if } def /CBSpl { 0 begin dup 2 gt { storexyn newpath n 1 sub dup 0 0 1 1 2 2 true subspline 1 1 n 3 sub { /i exch def i 1 sub dup i dup i 1 add dup i 2 add dup false subspline } for n 3 sub dup n 2 sub dup n 1 sub dup 0 0 false subspline n 2 sub dup n 1 sub dup 0 0 1 1 false subspline patternNone not { ifill } if brushNone not { istroke } if } { Poly } ifelse end } dup 0 4 dict put def /Elli { 0 begin newpath 4 2 roll translate scale 0 0 1 0 360 arc patternNone not { ifill } if brushNone not { istroke } if end } dup 0 1 dict put def /Line { 0 begin 2 storexyn newpath x 0 get y 0 get moveto x 1 get y 1 get lineto brushNone not { istroke } if 0 0 1 1 leftarrow 0 0 1 1 rightarrow end } dup 0 4 dict put def /MLine 
{ 0 begin storexyn newpath n 1 gt { x 0 get y 0 get moveto 1 1 n 1 sub { /i exch def x i get y i get lineto } for patternNone not brushLeftArrow not brushRightArrow not and and { ifill } if brushNone not { istroke } if 0 0 1 1 leftarrow n 2 sub dup n 1 sub dup rightarrow } if end } dup 0 4 dict put def /Poly { 3 1 roll newpath moveto -1 add { lineto } repeat closepath patternNone not { ifill } if brushNone not { istroke } if } def /Rect { 0 begin /t exch def /r exch def /b exch def /l exch def newpath l b moveto l t lineto r t lineto r b lineto closepath patternNone not { ifill } if brushNone not { istroke } if end } dup 0 4 dict put def /Text { ishow } def /idef { dup where { pop pop pop } { exch def } ifelse } def /ifill { 0 begin gsave patternGrayLevel -1 ne { fgred bgred fgred sub patternGrayLevel mul add fggreen bggreen fggreen sub patternGrayLevel mul add fgblue bgblue fgblue sub patternGrayLevel mul add setrgbcolor eofill } { eoclip originalCTM setmatrix pathbbox /t exch def /r exch def /b exch def /l exch def /w r l sub ceiling cvi def /h t b sub ceiling cvi def /imageByteWidth w 8 div ceiling cvi def /imageHeight h def bgred bggreen bgblue setrgbcolor eofill fgred fggreen fgblue setrgbcolor w 0 gt h 0 gt and { l w add b translate w neg h scale w h true [w 0 0 h neg 0 h] { patternproc } imagemask } if } ifelse grestore end } dup 0 8 dict put def /istroke { gsave brushDashOffset -1 eq { [] 0 setdash 1 setgray } { brushDashArray brushDashOffset setdash fgred fggreen fgblue setrgbcolor } ifelse brushWidth setlinewidth originalCTM setmatrix stroke grestore } def /ishow { 0 begin gsave fgred fggreen fgblue setrgbcolor /fontDict printFont printSize scalefont dup setfont def /descender fontDict begin 0 [FontBBox] 1 get FontMatrix end transform exch pop def /vertoffset 1 printSize sub descender sub def { 0 vertoffset moveto show /vertoffset vertoffset printSize sub def } forall grestore end } dup 0 3 dict put def /patternproc { 0 begin /patternByteLength 
patternString length def /patternHeight patternByteLength 8 mul sqrt cvi def /patternWidth patternHeight def /patternByteWidth patternWidth 8 idiv def /imageByteMaxLength imageByteWidth imageHeight mul stringLimit patternByteWidth sub min def /imageMaxHeight imageByteMaxLength imageByteWidth idiv patternHeight idiv patternHeight mul patternHeight max def /imageHeight imageHeight imageMaxHeight sub store /imageString imageByteWidth imageMaxHeight mul patternByteWidth add string def 0 1 imageMaxHeight 1 sub { /y exch def /patternRow y patternByteWidth mul patternByteLength mod def /patternRowString patternString patternRow patternByteWidth getinterval def /imageRow y imageByteWidth mul def 0 patternByteWidth imageByteWidth 1 sub { /x exch def imageString imageRow x add patternRowString putinterval } for } for imageString end } dup 0 12 dict put def /min { dup 3 2 roll dup 4 3 roll lt { exch } if pop } def /max { dup 3 2 roll dup 4 3 roll gt { exch } if pop } def /midpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 x1 add 2 div y0 y1 add 2 div end } dup 0 4 dict put def /thirdpoint { 0 begin /y1 exch def /x1 exch def /y0 exch def /x0 exch def x0 2 mul x1 add 3 div y0 2 mul y1 add 3 div end } dup 0 4 dict put def /subspline { 0 begin /movetoNeeded exch def y exch get /y3 exch def x exch get /x3 exch def y exch get /y2 exch def x exch get /x2 exch def y exch get /y1 exch def x exch get /x1 exch def y exch get /y0 exch def x exch get /x0 exch def x1 y1 x2 y2 thirdpoint /p1y exch def /p1x exch def x2 y2 x1 y1 thirdpoint /p2y exch def /p2x exch def x1 y1 x0 y0 thirdpoint p1x p1y midpoint /p0y exch def /p0x exch def x2 y2 x3 y3 thirdpoint p2x p2y midpoint /p3y exch def /p3x exch def movetoNeeded { p0x p0y moveto } if p1x p1y p2x p2y p3x p3y curveto end } dup 0 17 dict put def /storexyn { /n exch def /y n array def /x n array def n 1 sub -1 0 { /i exch def y i 3 2 roll put x i 3 2 roll put } for } def /SSten { fgred fggreen fgblue setrgbcolor dup true 
exch 1 0 0 -1 0 6 -1 roll matrix astore } def /FSten { dup 3 -1 roll dup 4 1 roll exch newpath 0 0 moveto dup 0 exch lineto exch dup 3 1 roll exch lineto 0 lineto closepath bgred bggreen bgblue setrgbcolor eofill SSten } def /Rast { exch dup 3 1 roll 1 0 0 -1 0 6 -1 roll matrix astore } def Begin [ 0.799705 0 0 0.799705 0 0 ] concat /originalCTM matrix currentmatrix def Begin %I Pict Begin %I Pict Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 119 643 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 224 787 ] concat [ (Phase 1) ] Text End End %I eop Begin %I Pict Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 87 619 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 6.12303e-17 1 -1 6.12303e-17 176.5 714.5 ] concat [ (Phase2) ] Text End End %I eop Begin %I Pict [ 1 0 0 1 -96 48 ] concat Begin %I Pict [ 1 0 0 1 -8 0 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 344 715 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 224 280 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -8 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 344 683 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 224 248 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 336 667 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 216 232 ] concat 79 419 175 443 Rect End End %I eop Begin %I Pict [ 1 0 0 1 0 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 336 635 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 216 200 ] concat 79 419 175 443 Rect End End %I eop End %I eop End %I eop Begin %I Pict [ 1 0 0 
1 15 -1 ] concat Begin %I Pict [ 1 0 0 1 176 192 ] concat Begin %I Pict Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -168 -72 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -200 8 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -280 104 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 72 -24 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 48 -48 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 24 -72 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 184 571 ] concat [ (P0) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 92 144 ] concat 87 411 111 435 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -192 -96 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 
1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -120 -120 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -144 -144 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 376 643 ] concat [ (P1) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 271 483 295 507 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -224 -16 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -248 -40 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -176 -64 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 432 563 ] concat [ (P2) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 327 403 351 427 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -304 80 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -352 32 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n 
[ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop Begin %I Pict [ 1 0 0 1 -328 56 ] concat Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 536 467 ] concat [ (P3) ] Text End Begin %I Rect 0 0 0 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg none SetP %I p n [ 1 -0 -0 1 100 144 ] concat 431 307 455 331 Rect End End %I eop End %I eop Begin %I Pict [ 1 0 0 1 160 0 ] concat Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 87 619 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 6.12303e-17 1 -1 6.12303e-17 176.5 714.5 ] concat [ (Phase2) ] Text End End %I eop Begin %I Pict [ 1 0 0 1 160 0 ] concat Begin %I Line 0 0 1 [] 0 SetB 0 0 0 SetCFg 1 1 1 SetCBg 0 SetP [ 1 -0 -0 1 97 141 ] concat 87 643 119 643 Line End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 224 787 ] concat [ (Phase 1) ] Text End End %I eop End %I eop Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 160.25 662.724 ] concat [ (\(a\) Row Block Distribution.) ] Text End Begin %I Text 0 0 0 SetCFg Helvetica 12 SetF [ 1 0 0 1 323.75 664 ] concat [ (\(b\) Our Proposed Distribution.) 
] Text End End %I eop showpage end %%EndDocument @endspecial 269 391 a Fo(Figure)11 b(7:)18 b(Data)13 b(Distribution)e(used)i (to)f(alleviate)g(Memory)g(Contention.)503 1045 y @beginspecial 50 @llx 50 @lly 410 @urx 302 @ury 2057 @rwi @setspecial %%BeginDocument: 256.adi.ps /gnudict 40 dict def gnudict begin /Color false def /Solid false def /gnulinewidth 5.000 def /vshift -33 def /dl {10 mul} def /hpt 31.5 def /vpt 31.5 def /M {moveto} bind def /L {lineto} bind def /R {rmoveto} bind def /V {rlineto} bind def /vpt2 vpt 2 mul def /hpt2 hpt 2 mul def /Lshow { currentpoint stroke M 0 vshift R show } def /Rshow { currentpoint stroke M dup stringwidth pop neg vshift R show } def /Cshow { currentpoint stroke M dup stringwidth pop -2 div vshift R show } def /DL { Color {setrgbcolor Solid {pop []} if 0 setdash } {pop pop pop Solid {pop []} if 0 setdash} ifelse } def /BL { stroke gnulinewidth 2 mul setlinewidth } def /AL { stroke gnulinewidth 2 div setlinewidth } def /PL { stroke gnulinewidth setlinewidth } def /LTb { BL [] 0 0 0 DL } def /LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def /LT0 { PL [] 0 1 0 DL } def /LT1 { PL [4 dl 2 dl] 0 0 1 DL } def /LT2 { PL [2 dl 3 dl] 1 0 0 DL } def /LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def /LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def /LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def /LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def /LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def /LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def /P { stroke [] 0 setdash currentlinewidth 2 div sub M 0 currentlinewidth V stroke } def /D { stroke [] 0 setdash 2 copy vpt add M hpt neg vpt neg V hpt vpt neg V hpt vpt V hpt neg vpt V closepath stroke P } def /A { stroke [] 0 setdash vpt sub M 0 vpt2 V currentpoint stroke M hpt neg vpt neg R hpt2 0 V stroke } def /B { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M 0 vpt2 neg V hpt2 0 V 0 vpt2 V hpt2 neg 0 V closepath stroke P } def /C { stroke [] 0 setdash exch hpt sub 
exch vpt add M hpt2 vpt2 neg V currentpoint stroke M hpt2 neg 0 R hpt2 vpt2 V stroke } def /T { stroke [] 0 setdash 2 copy vpt 1.12 mul add M hpt neg vpt -1.62 mul V hpt 2 mul 0 V hpt neg vpt 1.62 mul V closepath stroke P } def /S { 2 copy A C} def end gnudict begin gsave 50 50 translate 0.100 0.100 scale 0 setgray /Times-Roman findfont 100 scalefont setfont newpath LTa 600 251 M 0 2218 V LTb LTa 600 251 M 2817 0 V LTb 600 251 M 63 0 V 2754 0 R -63 0 V 540 251 M (1000) Rshow LTa 600 585 M 2817 0 V LTb 600 585 M 31 0 V 2786 0 R -31 0 V LTa 600 780 M 2817 0 V LTb 600 780 M 31 0 V 2786 0 R -31 0 V LTa 600 919 M 2817 0 V LTb 600 919 M 31 0 V 2786 0 R -31 0 V LTa 600 1026 M 2817 0 V LTb 600 1026 M 31 0 V 2786 0 R -31 0 V LTa 600 1114 M 2817 0 V LTb 600 1114 M 31 0 V 2786 0 R -31 0 V LTa 600 1188 M 2817 0 V LTb 600 1188 M 31 0 V 2786 0 R -31 0 V LTa 600 1253 M 2817 0 V LTb 600 1253 M 31 0 V 2786 0 R -31 0 V LTa 600 1309 M 2817 0 V LTb 600 1309 M 31 0 V 2786 0 R -31 0 V LTa 600 1360 M 2817 0 V LTb 600 1360 M 63 0 V 2754 0 R -63 0 V -2814 0 R (10000) Rshow LTa 600 1694 M 2817 0 V LTb 600 1694 M 31 0 V 2786 0 R -31 0 V LTa 600 1889 M 2817 0 V LTb 600 1889 M 31 0 V 2786 0 R -31 0 V LTa 600 2028 M 2817 0 V LTb 600 2028 M 31 0 V 2786 0 R -31 0 V LTa 600 2135 M 2817 0 V LTb 600 2135 M 31 0 V 2786 0 R -31 0 V LTa 600 2223 M 2817 0 V LTb 600 2223 M 31 0 V 2786 0 R -31 0 V LTa 600 2297 M 2817 0 V LTb 600 2297 M 31 0 V 2786 0 R -31 0 V LTa 600 2362 M 2817 0 V LTb 600 2362 M 31 0 V 2786 0 R -31 0 V LTa 600 2418 M 2817 0 V LTb 600 2418 M 31 0 V 2786 0 R -31 0 V LTa 600 2469 M 2817 0 V LTb 600 2469 M 63 0 V 2754 0 R -63 0 V -2814 0 R (100000) Rshow LTa 600 251 M 0 2218 V LTb 600 251 M 0 63 V 0 2155 R 0 -63 V 600 151 M (0) Cshow LTa 952 251 M 0 2218 V LTb 952 251 M 0 63 V 0 2155 R 0 -63 V 952 151 M (2) Cshow LTa 1304 251 M 0 2218 V LTb 1304 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (4) Cshow LTa 1656 251 M 0 2218 V LTb 1656 251 M 0 63 V 0 2155 R 0 -63 V 0 -2255 R (6) Cshow LTa 2009 251 M 
[Plot: execution time (milli sec) versus number of processors. Curves: Owner Computes(Block, Block); Sequential; No Distributions; (*,Cyclic); (*,Block); Owner Computes(*,Cyclic); Owner Computes(*,Block); (Block,Block).]

Figure 8: ADI Performance (256x256).

cases adhering to owner-computes rule can incur severe synchronization or ownership test overhead which exceeds the cost of accessing remote memory. We use the Altering Direction Integration (ADI) to illustrate that the shared address space provides flexibility in the choice of computation partitions, reducing contention and synchronization overhead, and resulting in significant performance improvements.

We use the Hector, a Non-Uniform Memory Access multiprocessor, as an experimental platform. Hector consists of 4 sets of processor-memory pairs connected by a bus to form a station; 4 stations are connected by a local ring to form a cluster; 4 local rings are connected by a global ring. We use a system with one cluster. Each processor-memory pair consists of a Motorola MC88100 CPU, a 16 KB instruction cache, a 16 KB data cache and 4 MB of the globally addressable memory. The hardware provides no support for cache coherence. The coherence of data is maintained by software at cache line granularity [10]. Data distributions are implemented using the array allocation techniques described in [21, 3].

5.1 Contention and Synchronization Conscious Distribution

The ADI program has two phases with parallelism along orthogonal dimensions in each phase. It operates on three 2-dimensional arrays A, B and X. A single iteration of an outer sequentially iterated loop consists of a forward and a backward sweep phase along the rows of three arrays, followed by another forward and backward sweep phase along the columns of the arrays [18]. This application is typical of other programs such as 2D-FFT and Erlebacher that have parallelism in orthogonal directions in different phases of the program.

The best data distribution scheme for ADI remains an issue of debate [18, 4]. The two proposed schemes partition arrays along a single dimension, either in blocks or cyclically. These distributions, in conjunction with the owner-computes rule result in a wavefront type computation, leading to heavy synchronization overhead in one of the phases of the program. Figure 7(a) shows a Block distribution of the rows of the arrays. With such a distribution, during the first phase of the program all the processors access data that is local and require no communication. During the second phase, however, the parallelism is orthogonal to the direction of distribution. Strict adherence to the owner-computes rule implies ordering of the computations by processors on the corresponding chunk of the columns they own. Thus, processor i has to wait for processor i - 1 to finish the computation on its chunk of the column before proceeding. A larger number of synchronizations are required to maintain the ordering involved in the wavefront computation.

Table 1: Performance Bottlenecks for various data and computation partitioning for ADI.

| Data Distribution | Compute Rule | Performance Bottleneck |
---|---|---|
| None | Relaxed | Memory Contention |
| (*, Block) | Owner-Computes | High Synchronization |
| (*, Block) | Relaxed | Memory Contention |
| (*, Cyclic) | Owner-Computes | High Synchronization |
| (*, Cyclic) | Relaxed | Memory Contention |
| (Block, Block) | Owner-Computes | Ownership tests |
| (Block, Block) | Relaxed | High Remote Memory Access |

The synchronization overhead can be eliminated by relaxing the owner-computes rule in the second phase and allowing the processor to write the results to remote memory modules. This eliminates synchronization overhead at the expense of increased remote memory accesses. However, the use of this relaxed compute rule with the (*,Block) distribution results in heavy contention. Each processor is responsible for computing a column, and hence, each processor accesses every memory module in sequence. Thus, a given memory module is accessed by every processor at the same time, leading to contention.

The data distribution scheme depicted in Figure 7(b)^4 eliminates contention and results in the best possible performance with the relaxed compute rule. With this distribution, processors access data from remote memory modules in both phases of the program. In both phases, processors start working on the columns assigned to them by accessing data that is in different memory modules thus avoiding contention. There is no wavefront type parallelism, and hence no overhead involved due to synchronization.

The use of owner-computes rule with the distribution of Figure 7(b) will not result in good performance. Either ownership tests must be introduced in the body of the loops to enforce the owner-computes rule, or the loops must be rewritten with additional strip-mined controlling loops to schedule the computations on sub-blocks of the array. The former leads to overhead and the latter introduces synchronization similar to the wavefront computation.

The result of executing the ADI application on the Hector multiprocessor for a data size of 256x256 with various data distributions and compute rules is shown in Figure 8. The (Block,Block) data distribution that relaxes the owner-computes rule outperforms all data distribution schemes that adhere to the rule. The figure also indicates that the overhead due to the ownership tests when using the owner-computes rule with a (Block,Block) distribution degrades performance. It is also clear that the use of data distribution improves performance over the use of operating system policies to manage data (the no distributions curve). The performance bottlenecks of various distributions for ADI are summarized in Table 1.

^4 This is equivalent to !HPF$ PROCESSORS PROCS(N) with !HPF$ DISTRIBUTE B(BLOCK, BLOCK), X(BLOCK, BLOCK) ON PROCS in HPF. In the current HPF specification, this distribution is not valid; the rank of each distributee must equal the rank of the named processor grid [16]. Distributions in which this is not the case introduce additional complexity on DMMs [17]. In contrast, SSMMs provide the flexibility to implement these distributions.

6 Related Work

Several researchers have focused on the problem of deriving data distributions automatically for DMMs. Li and Chen [22], Gupta and Banerjee [12], Zima et al. [9] and Garcia et al. [11] follow the approach of finding the alignment constraints between different dimensions of the arrays and derive a data distribution that minimizes interprocessor communication. To avoid a heuristic approach, Bixby et al. [7] formulate a 0-1 integer programming problem for deriving data distributions. Their approach relies on the assumption that a good data distribution for the entire program can be found by merging the data distributions of smaller segments of the program. They minimize the interprocessor communication using the "performance estimator" developed by Balasundaram et al. [6]. Anderson [5] presents an algebraic framework for determining data and computation partitions by minimizing communication across processors. Data transformations are then applied so that the processors access contiguous data regions to reduce false sharing. This technique is oblivious to SSMM specific issues such as contention and cache affinity.

7 Concluding Remarks

Although large SSMMs are built based on an architecture with distributed memory, the shared memory paradigm introduces performance issues that are different from those encountered in DMMs. The high cost of interprocessor communication in distributed memory multiprocessors makes the minimization of communication the predominant issue in selecting data distributions and in partitioning computations. On SSMMs, a methodology for selecting data distributions must also consider cache affinity, memory contention and false sharing in addition to the cost of interprocessor communication. Furthermore, the single shared address space present in SSMMs provides flexibility in the selection of computation partitions. This should be exploited in applications in which owner-computes results in poor performance. The Jasmine compiler project [2] is investigating the issues discussed in this paper through the development of a framework for automatically deriving data distributions on SSMMs.

References

[1] T.S. Abdelrahman et al. An overview of the NUMAchine multiprocessor project. In Proc. of the Canadian Supercomputing Conf., pages 283-295, 1994.
[2] T.S. Abdelrahman, N. Manjikian, and S. Tandri. The Jasmine Compiler. In preparation.
[3] T.S. Abdelrahman and T.N. Wong. Distributed array data management on NUMA multiprocessors. In Proc. of SHPCC, pages 551-559, 1994.
[4] S.P. Amarasinghe, J.M. Anderson, M.S. Lam, and A.W. Lim. An overview of a compiler for scalable parallel machines. In Languages and Compilers for Parallel Computing, pages 253-272. Springer-Verlag LNCS-768, 1993.
[5] J.M. Anderson. Demonstration of automatic data and computation decomposition techniques. In Proc. of the Workshop on Automatic Data Layout and Performance Prediction, 1995.
[6] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer. A static performance estimator to guide data partitioning decisions. In Proc. of PPOPP, pages 213-223, 1991.
[7] R. Bixby, K. Kennedy, and U. Kremer. Automatic data layout using 0-1 integer programming. In Proc. of the Int'l Conf. on Parallel Architectures and Compilation Techniques, pages 111-122, 1994.
[8] W.J. Bolosky and M.L. Scott. False sharing and its effect on shared memory multiprocessors. In Proc. of 4th Symp. on Experiences with Distributed and Multiprocessor Systems, pages 57-71, 1993.
[9] B.M. Chapman, T. Fahringer, and H. Zima. Automatic support for data distribution on distributed memory multiprocessor systems. In Languages and Compilers for Parallel Computing, pages 184-199. Springer-Verlag LNCS-768, 1993.
[10] B. Gamsa. Region-oriented main memory management in shared-memory NUMA multiprocessors. Master's thesis, Department of Computer Science, University of Toronto, Toronto, CANADA, 1992.
[11] J. Garcia, E. Ayguade, and J. Labarta. A novel approach towards automatic data distribution. In Proc. of the Workshop on Automatic Data Layout and Performance Prediction, 1995.
[12] M. Gupta and P. Banerjee. Automatic data partitioning on distributed memory multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 3(2):179-193, 1992.
[13] K. Harzallah and K.C. Sevcik. Hot spot analysis in large scale shared memory multiprocessors. In Proc. of Supercomputing'93, pages 895-905. ACM, 1993.
[14] M. Heinrich et al. The Stanford FLASH Multiprocessor. In Proc. of the 21st Int'l Symp. on Computer Architecture, pages 302-313, 1994.
[15] S. Hiranandani, K. Kennedy, and C. Tseng. Compiler optimizations for Fortran D on MIMD distributed-memory machines. In Proc. of Supercomputing'91, pages 86-100, Albuquerque, NM, 1991.
[16] HPF. High Performance Fortran Language Specification (High Performance Fortran Forum). Technical report CRPC-TR92225, Rice University, 1994.
[17] C. Koelbel. HPF constraints. Personal Communications, 1995.
[18] U. Kremer. Automatic data layout for distributed-memory multiprocessors. Technical report CRPC-TR93229-S, Center for Research on Parallel Computation, 1993.
[19] T.T. Kwan, B.K. Totty, and D.A. Reed. Communication and computation performance of the CM5. In Proc. of Supercomputing'93, pages 192-201. ACM, 1993.
[20] D. Lenoski et al. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, 1992.
[21] H. Li and K.C. Sevcik. Numacros: Data parallel programming on NUMA multiprocessors. In Proc. of 4th Symp. on Experiences with Distributed and Multiprocessor Systems, pages 247-263, 1993.
[22] J. Li and M. Chen. Compiling communication-efficient programs for massively parallel machines. Journal of Parallel and Distributed Computing, 2(3):361-376, 1991.
[23] Cray Research. The Cray Research Massively Parallel Processor System - Cray T3D. Technical report 80922, Munchen, Germany, 1993.
[24] Kendall Square Research. KSR1 Principles of Operation. Waltham, MA, 1991.
[25] J. Torres, E. Ayguade, J. Labarta, and M. Valero. Align and distribute-based linear loop transformations. In Languages and Compilers for Parallel Computing, pages 321-339. Springer-Verlag LNCS-768, 1993.
[26] Z. Vranesic, M. Stumm, R. White, and D. Lewis. The Hector Multiprocessor. IEEE Computer, 24(1):72-80, 1991.
[27] R.W. Wisniewski, L.I. Kontothanassis, and M.L. Scott. High performance synchronization algorithms for multiprogrammed multiprocessors. In Proc. of PPOPP, 1995.
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Unrau_PhD.ps.Z version [32bddefae3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Unrau_etal_EuroPar95.ps.Z version [9505eb5632].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Unrau_etal_JSC94.ps.Z version [a726b2c85a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Unrau_etal_OSDI94.ps.Z version [83f3c82777].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Vranesic_etal_IEEEC.ps.Z version [1d9c37a6ec].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Wilton_Vranesic_SPDP.ps.Z version [dc31e62471].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Wu_MASc.ps.Z version [528d8fad81].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/Zhou_Brecht_SM91.ps.Z version [3646cac530].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/depth-guide.ps.Z version [2e30273645].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/headerize version [882320c08b].
#!/bin/sh
#
# to run (if called headerize): headerize <text> < in_file > out_file
# e.g.: if heading is: "a heading", source file is sfile.ps, destination
# file is dfile.ps, then use:
#
# headerize a heading < sfile.ps > dfile.ps
# or
# headerize "a heading" < sfile.ps > dfile.ps
#
gawk -v MYHEADING="$*" '
BEGIN{
    begin=1
}
begin==1 && $1 !~ /^%.*/ {
    begin=0
    printf "save\n"
    printf "gsave\n"
    printf "/Times-Italic findfont 9 scalefont setfont\n"
    printf "72 750 moveto (%s) show\n", MYHEADING
    printf "grestore\n"
    printf "restore\n"
}
{
    print
}
'
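The effect of the headerize script above on a file can be sketched as follows. This is a minimal illustration, not part of the check-in: the `add_heading` function name and the three-line sample input are made up, and plain `awk` is used in place of `gawk` on the assumption that the program relies on no gawk-only features.

```shell
#!/bin/sh
# Hypothetical stand-in for the headerize script; uses plain awk
# instead of gawk (an assumption, see the note above).
add_heading() {
    awk -v MYHEADING="$*" '
    BEGIN { begin = 1 }
    # First line whose first field does not start with "%": the leading
    # comment block is over, so emit the heading-drawing PostScript once.
    begin == 1 && $1 !~ /^%/ {
        begin = 0
        printf "save\n"
        printf "gsave\n"
        printf "/Times-Italic findfont 9 scalefont setfont\n"
        printf "72 750 moveto (%s) show\n", MYHEADING
        printf "grestore\n"
        printf "restore\n"
    }
    # Every input line is copied through unchanged.
    { print }
    '
}

# Made-up sample input: two PostScript comment lines and one operator.
output=$(printf '%%!PS-Adobe-2.0\n%%%%Title: sample\nshowpage\n' | add_heading a heading)
printf '%s\n' "$output"
```

In this sketch the six heading lines land between the `%%Title` comment and `showpage`, which is the point of the `begin` flag: the insertion happens once, just before the first line that is not a `%` comment.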
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/ldreport.ps.Z version [a667289c44].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/OLD/titlepage.ps.Z version [97f6c1729a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Okrieg_PhD.ps.Z version [2a337a5117].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Orran_etal_SPDPW95.ps.Z version [6417b8ad38].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Parsons_Sevcik_IPPS95.ps.Z version [10a026c65e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Parsons_etal_IWOOS95.ps.Z version [65a84877c7].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/README.Z version [2cf48fe915].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Ravi_Stumm_ICPP95.ps.Z version [9e21d5d59f].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Ravi_Stumm_JIEICE96.ps.Z version [4f687d2453].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sandhu_et_al_PPOPP.ps.Z version [7cc4c5d88e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sevcik_JPE.ps.Z version [450977ab1f].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sevcik_Zhou_PERF93.ps.Z version [5ff752a92b].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Stumm_Unrau_Krieger_USENIX92.ps.Z version [e4c619b8e2].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Stumm_Vranesic_White_IPPS93.ps.Z version [8754f4d7f3].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Tandri_Abdel_PDPTA95.ps.Z version [bfeee3e72d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Unrau_PhD.ps.Z version [16101bfa79].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Unrau_etal_EuroPar95.ps.Z version [e35f3da997].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Unrau_etal_JSC94.ps.Z version [f41221d04e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Unrau_etal_OSDI94.ps.Z version [1bf1a8082f].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Vranesic_etal_IEEEC.ps.Z version [5a576842b8].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Wilton_Vranesic_SPDP.ps.Z version [1d8313eac2].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Wu_MASc.ps.Z version [077988d1fe].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Zhou_Brecht_SM91.ps.Z version [e3e1148cb8].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/depth-guide.ps.Z version [2e30273645].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/headerize version [882320c08b].
#!/bin/sh
#
# to run (if called headerize): headerize <text> < in_file > out_file
# e.g.: if heading is: "a heading", source file is sfile.ps, destination
# file is dfile.ps, then use:
#
# headerize a heading < sfile.ps > dfile.ps
# or
# headerize "a heading" < sfile.ps > dfile.ps
#
gawk -v MYHEADING="$*" '
BEGIN{
    begin=1
}
begin==1 && $1 !~ /^%.*/ {
    begin=0
    printf "save\n"
    printf "gsave\n"
    printf "/Times-Italic findfont 9 scalefont setfont\n"
    printf "72 750 moveto (%s) show\n", MYHEADING
    printf "grestore\n"
    printf "restore\n"
}
{
    print
}
'
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/ldreport.ps version [8895be96a1].
more than 10,000 changes
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/titlepage.ps.Z version [97f6c1729a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/arch_button.gif version [02eea96c8d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/comments.gif version [12d694f7ba].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/comp_button.gif version [f9a5424e0a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/data_button.gif version [511ec86037].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/journal.gif version [ac0b61628e].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/music1.gif version [b184bb7d75].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/newban_t.gif version [83c41abb45].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/os_button.gif version [1a8a14d7d4].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/people_button.gif version [c1ff15f4b0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/perf_button.gif version [df68c2fce4].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/proj_button.gif version [d41126a7f0].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/publ_button.gif version [58b6234da6].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/EECG/RESEARCH/ParallelSys/images/sch_button.gif version [b394c1e472].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/Welcome.html version [1af9f0f1ff].
<!-- RCS $Id: Welcome.html,v 1.2 1994/11/02 16:13:18 caranci Exp caranci --> <HTML> <body bgcolor="#ffffff" text="#000000" link="#0000ff" vlink="#aaaaff" alink="#0077FF"> </body> <FONT SIZE=4> <HEAD> <TITLE>Parallel Systems Group: Home page</TITLE> </HEAD> <BODY> <H1><img align="center" src="../EECG/RESEARCH/ParallelSys/images/newban_t.gif"></A> Parallel Systems Group</H1> <HR> The Parallel Systems Group comprises researchers from the <A HREF="http://www.utoronto.ca/uoft.html">University of Toronto</A> working on all aspects of parallel systems, including computer architecture, operating systems, compilers, performance evaluation and applications. <P> Previous projects include the <A HREF="hector.html">Hector</A> shared memory multiprocessor, and the <A HREF="hurricane.html">Hurricane</A> multiprocessor operating system. <P> The group is currently building the <A HREF="parallel/NUMA.Welcome.html">NUMAchine</A> multiprocessor, the <A HREF="tornado.html">Tornado</A> operating system, and the <A HREF="../~tsa/jasmine.html">Jasmine</A> compiler. 
<P> <BR> <BR> <BR> <center> <A HREF="publications.html"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/publ_button.gif"></A> <P> <A HREF="people.html"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/people_button.gif"></A> <P> <A HREF="parallel/projects.html"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/proj_button.gif"></A> <P> <BR> <BR> <BR> <H2>Other Resources</H2> <H3>University of Toronto Resources</H3> <UL> <LI><A HREF="http://www.hprc.utoronto.ca"> University of Toronto High Performance Computing Research Center </A> <LI><A HREF="http://www.eecg.toronto.edu/EECG/EECGhome.html"> University of Toronto Electrical Engineering Computer Group</A> <LI><A HREF="http://www.cdf.toronto.edu"> University of Toronto Department of Computer Science</A> </UL> <H3>Computing Resources</H3> <UL> <LI><A HREF="http://www.ccsf.caltech.edu/other_sites.html"> Supercomputing Web pages</A> <LI><A HREF="http://www.cs.cmu.edu/afs/cs.cmu.edu/project/scandal/public/www/research-groups.html"> Supercomputing & Parallel Computing Research Groups</A> <LI><A HREF="http://www.cs.dartmouth.edu/pario.html"> Parallel I/O archive at Dartmouth</A> </UL> <EM> <!-- <HR> These pages will look best if displayed with the <A HREF="http://home.mcom.com/home/faq_docs/faq_client.html"> Mosaic Netscape</A> web browser. Check it out!<BR> <HR> --> This is still a work in progress... Mail suggestions to:<BR> <A HREF="mailto:kulki@cs.toronto.edu"> kulki </A> or <A HREF="mailto:okrieg@eecg.toronto.edu"> Orran </A> </EM> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/hector-sys-raw.gif version [0359f1f3c2].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/hector.html version [10c005e5a3].
<TITLE>Hector</TITLE> <H1>Hector</H1> <P> Under Construction. <P> <A HREF="pubs_abs.html#Stumm_Vranesic_White_IPPS93"> Hector</A> is a shared memory multiprocessor based on a hierarchy of unidirectional slotted rings. The main objective was a simple architecture that is size and generation scalable. The machine was built from scratch with off-the-shelf processors. Please see <a href="publications.html">publications</a> for details such as performance. <HR> <h3> Hector Processor Board </h3> <IMG ALIGN=LEFT HSPACE=15 SRC="hectorboard.gif"> <BR> Each board contains: <ul> <li> MC88100 cpu <li> 4 MB of memory <li> 16 KB of data cache <li> 16 KB of instruction cache </ul> <BR CLEAR=ALL> <HR> <h3> Hector System </h3> <IMG ALIGN=RIGHT SRC="hector-sys-raw.gif"> <BR> This system contains: <ul> <li> 16 MC88100 cpus <li> 16 x 4 MB memory <li> ring interconnect </ul> <BR CLEAR=ALL> <HR> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/hectorboard.gif version [76b1a3a213].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/hurricane.html version [0c2caec185].
<TITLE>Hurricane</TITLE> <!-- Changed by: Orran Y. Krieger, 2-Oct-1995 --> <H1>Hurricane</H1> <P> Under Construction <P> The <A HREF="pubs_abs.html#Unrau_etal_JSC94">Hurricane</A> operating system is a hierarchically clustered operating system implemented on the Hector multiprocessor. <P> Hierarchical clustering manages the system resources in clusters, using tight coupling within a cluster, and loose coupling across clusters. Distributed systems principles are applied by distributing and replicating system services and data objects to increase locality, increase concurrency, and to avoid centralized bottlenecks, thus making the system scalable. However, tight coupling is used within a cluster, so the system performs well for local interactions. Hierarchical clustering maximizes locality which is key to good performance in large systems, and systems based on hierarchical clustering can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. Finally, hierarchical clustering leads to a modular system composed from easy-to-design and hence efficient building blocks. <P> All the papers are available from <A HREF="publications.html#os"> here.</A> </UL> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/images/comments.gif version [12d694f7ba].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/images/homeblue.gif version [a77b950d99].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/images/redline.GIF version [59a7418809].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/NUMA.Welcome.html version [2a20af86f3].
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE> NUMAchine Home Page </TITLE> <META NAME="GENERATOR" CONTENT="Mozilla/3.0Gold (X11; I; SunOS 5.5 sun4m) [Netscape]"> </HEAD> <BODY BACKGROUND="images/maple_back.gif"> <CENTER><P><IMG SRC="numahw/NUMAchine-med.gif" > <clear=left><BR> <BR> </P></CENTER> <H2 ALIGN=CENTER>The NUMAchine Multiprocessor Project</H2> <P>The <I>NUMAchine</I> project at the <A HREF="http://www.utoronto.ca/uoft.html">University of Toronto</A> is a major research project aimed at developing a shared-memory multiprocessor architecture and software support for easy and efficient use of this architecture. Members of both the <A HREF="http://www.ece.toronto.edu/">Department of Electrical and Computer Engineering</A> and the <A HREF="http://www.cs.toronto.edu">Department of Computer Science</A> are collaborating on this project.</P> <P>A key objective is to develop a high-performance architecture that is modular, cost-effective and scalable. At the present time, a prototype machine is being designed and built, and the system software is being developed. Follow the links below for more information. 
</P> <TABLE BORDER=0 CELLSPACING=1 CELLPADDING=0 WIDTH="100%"> <TR ALIGN=LEFT VALIGN=CENTER> <TD WIDTH="65%"> <P> <IMG SRC="images/computer.gif" ALIGN=CENTER HSPACE=5> <A HREF="numahw/numahw.html">Hardware description with photographs</A> </P> <P> <IMG SRC="images/archiv.gif" ALIGN=CENTER HSPACE=5> <A HREF="numadocs.html">Papers and technical documentation</A> </P> <P> <IMG SRC="images/disk.gif" HSPACE=5 ALIGN=CENTER> System software: </P> <UL> <P> <IMG SRC="images/wh_ball.gif" HSPACE=5 HEIGHT=16 WIDTH=17 ALIGN=BOTTOM> <A HREF="tornado.html">The Tornado Operating System</A><BR> <BR> </P> <P> <IMG SRC="images/wh_ball.gif" HSPACE=5 HEIGHT=16 WIDTH=17 ALIGN=BOTTOM> <A HREF="../../~tsa/jasmine.html">The Jasmine Compiler</A> </P> </UL> <TD> <P ALIGN=CENTER>Click on the figures below for the<BR> NUMAchine architecture and a hardware photo.<BR> <A HREF="images/NUMAfig.gif"> <IMG ALIGN=LEFT WIDTH=80 HEIGHT=80 SRC="images/NUMAfig.gif"> </A> <A HREF="numahw/pictures/dbgstn.jpg"> <IMG ALIGN=RIGHT WIDTH=80 HEIGHT=80 SRC="numahw/numa2.gif"> </A> </P> </TD> </TABLE> <P> <HR WIDTH="100%"></P> <P>Major funding from:<BR> <IMG HSPACE=5 VSPACE=5 SRC="images/NSERC.gif" ALIGN=CENTER> <A HREF="http://www.nserc.ca">Natural Sciences and Engineering Research Council of Canada (NSERC)</A> </P> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/hector-sys-raw.gif version [0359f1f3c2].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/hector.html version [501d4060d2].
<TITLE>Hector</TITLE> <H1>Hector</H1> <P> Under Construction. <P> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/pubs_abs.html#Stumm_Vranesic_White_IPPS93"> Hector</A> is a shared memory multiprocessor based on a hierarchy of unidirectional slotted rings. The main objective was a simple architecture that is size and generation scalable. The machine was built from scratch with off-the-shelf processors. Please see <a href="http://www.eecg.toronto.edu/parallel/parallel/publications.html">publications</a> for details such as performance. <HR> <h3> Hector Processor Board </h3> <IMG ALIGN=LEFT HSPACE=15 SRC="hectorboard.gif"> <BR> Each board contains: <ul> <li> MC88100 cpu <li> 4 MB of memory <li> 16 KB of data cache <li> 16 KB of instruction cache </ul> <BR CLEAR=ALL> <HR> <h3> Hector System </h3> <IMG ALIGN=RIGHT SRC="hector-sys-raw.gif"> <BR> This system contains: <ul> <li> 16 MC88100 cpus <li> 16 x 4 MB memory <li> ring interconnect </ul> <BR CLEAR=ALL> <HR> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/hectorboard.gif version [76b1a3a213].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/hurricane.html version [f9112aeaaa].
<TITLE>Hurricane</TITLE> <!-- Changed by: Orran Y. Krieger, 2-Oct-1995 --> <H1>Hurricane</H1> <P> Under Construction <P> The <A HREF="http://www.eecg.toronto.edu/parallel/parallel/pubs_abs.html#Unrau_etal_JSC94">Hurricane</A> operating system is a hierarchically clustered operating system implemented on the Hector multiprocessor. <P> Hierarchical clustering manages the system resources in clusters, using tight coupling within a cluster, and loose coupling across clusters. Distributed systems principles are applied by distributing and replicating system services and data objects to increase locality, increase concurrency, and to avoid centralized bottlenecks, thus making the system scalable. However, tight coupling is used within a cluster, so the system performs well for local interactions. Hierarchical clustering maximizes locality which is key to good performance in large systems, and systems based on hierarchical clustering can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. Finally, hierarchical clustering leads to a modular system composed from easy-to-design and hence efficient building blocks. <P> All the papers are available from <A HREF="http://www.eecg.toronto.edu/parallel/parallel/publications.html#os"> here.</A> </UL> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/NSERC.gif version [1f87abec52].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/NUMAchine-small.gif version [132567b0e6].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/NUMAfig.gif version [384e3cb6db].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/archiv.gif version [97177c2404].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/computer.gif version [c6c2ee2011].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/disk.gif version [3ae69e93dc].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/maple_back.gif version [058528c833].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/torn-small.gif version [deb0bec641].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/images/wh_ball.gif version [0952c81ba5].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numachine.html version [2a20af86f3].
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE> NUMAchine Home Page </TITLE> <META NAME="GENERATOR" CONTENT="Mozilla/3.0Gold (X11; I; SunOS 5.5 sun4m) [Netscape]"> </HEAD> <BODY BACKGROUND="images/maple_back.gif"> <CENTER><P><IMG SRC="numahw/NUMAchine-med.gif" > <clear=left><BR> <BR> </P></CENTER> <H2 ALIGN=CENTER>The NUMAchine Multiprocessor Project</H2> <P>The <I>NUMAchine</I> project at the <A HREF="http://www.utoronto.ca/uoft.html">University of Toronto</A> is a major research project aimed at developing a shared-memory multiprocessor architecture and software support for easy and efficient use of this architecture. Members of both the <A HREF="http://www.ece.toronto.edu/">Department of Electrical and Computer Engineering</A> and the <A HREF="http://www.cs.toronto.edu">Department of Computer Science</A> are collaborating on this project.</P> <P>A key objective is to develop a high-performance architecture that is modular, cost-effective and scalable. At the present time, a prototype machine is being designed and built, and the system software is being developed. Follow the links below for more information. 
</P> <TABLE BORDER=0 CELLSPACING=1 CELLPADDING=0 WIDTH="100%"> <TR ALIGN=LEFT VALIGN=CENTER> <TD WIDTH="65%"> <P> <IMG SRC="images/computer.gif" ALIGN=CENTER HSPACE=5> <A HREF="numahw/numahw.html">Hardware description with photographs</A> </P> <P> <IMG SRC="images/archiv.gif" ALIGN=CENTER HSPACE=5> <A HREF="numadocs.html">Papers and technical documentation</A> </P> <P> <IMG SRC="images/disk.gif" HSPACE=5 ALIGN=CENTER> System software: </P> <UL> <P> <IMG SRC="images/wh_ball.gif" HSPACE=5 HEIGHT=16 WIDTH=17 ALIGN=BOTTOM> <A HREF="tornado.html">The Tornado Operating System</A><BR> <BR> </P> <P> <IMG SRC="images/wh_ball.gif" HSPACE=5 HEIGHT=16 WIDTH=17 ALIGN=BOTTOM> <A HREF="../../~tsa/jasmine.html">The Jasmine Compiler</A> </P> </UL> <TD> <P ALIGN=CENTER>Click on the figures below for the<BR> NUMAchine architecture and a hardware photo.<BR> <A HREF="images/NUMAfig.gif"> <IMG ALIGN=LEFT WIDTH=80 HEIGHT=80 SRC="images/NUMAfig.gif"> </A> <A HREF="numahw/pictures/dbgstn.jpg"> <IMG ALIGN=RIGHT WIDTH=80 HEIGHT=80 SRC="numahw/numa2.gif"> </A> </P> </TD> </TABLE> <P> <HR WIDTH="100%"></P> <P>Major funding from:<BR> <IMG HSPACE=5 VSPACE=5 SRC="images/NSERC.gif" ALIGN=CENTER> <A HREF="http://www.nserc.ca">Natural Sciences and Engineering Research Council of Canada (NSERC)</A> </P> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numadocs.html version [05e65e3777].
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE> Documentation on the NUMAchine Multiprocessor </TITLE> <META NAME="GENERATOR" CONTENT="Mozilla/3.0Gold (X11; I; SunOS 5.5 sun4m) [Netscape]"> </HEAD> <BODY BACKGROUND="images/maple_back.gif"> <CENTER><P><IMG SRC="numahw/NUMAchine-med.gif" > <clear=left><BR> <BR> </P></CENTER> <H2 ALIGN=CENTER>Documentation on the NUMAchine Multiprocessor</H2> <P> <HR WIDTH="100%"><BR> <FONT SIZE=+1>Technical Report</FONT></P> <P>We have written a technical report that describes the NUMAchine architecture, outlines important aspects of its cache coherence protocol, and provides simulation results for parallel execution of a number of benchmark programs. </P> <UL> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/numachin.hier/numachin.html">Hierarchical HTML version of NUMAchine Technical Report</A><BR> <BR> </LI> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/numachin.flat/numachin.html">Monolithic HTML version of NUMAchine Technical Report (103 Kbytes)</A><BR> <BR> </LI> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/techreport.ps">PostScript version of NUMAchine Technical Report (451 Kbytes)</A><BR> <BR> </LI> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/techreport.pdf">PDF version of NUMAchine Technical Report (175 Kbytes)</A><BR> <I><B>Note: some figures do not come out properly in the PDF.<BR> Grab the PostScript version instead for now.</B></I></LI> </UL> <P> <HR WIDTH="100%"><BR> <FONT SIZE=+1>Papers</FONT></P> <UL> <li>R. Grindley, T. Abdelrahman, S. Brown, S. Caranci, D. DeVries, B. Gamsa, A. Grbic, M. Gusat, R. Ho, O. Krieger, G. Lemieux, K. Loveless, N. Manjikian, P. McHardy, S. Srbljic, M. Stumm, Z. Vranesic and Z. 
Zilic, "The NUMAchine Multiprocessor", <i>Proceedings of the 2000 International Conference on Parallel Processing</i>, Toronto, August 2000.<br> <a href="http://www.eecg.toronto.edu/parallel/parallel/docs/icpp00.pdf">full paper (PDF, 109k)</a> <!--A pdf version of the paper is available in ~grbic/icpp00.pdf. Please make a copy of it in the webpage directory.--> <p> <LI>A. Grbic, S. Brown, S. Caranci, R. Grindley, M. Gusat, G. Lemieux, K. Loveless, N. Manjikian, S. Srbljic, M. Stumm, Z. Vranesic, and Z. Zilic, "Design and Implementation of the NUMAchine Multiprocessor," <I>Proceedings of the 35th IEEE Design Automation Conference</I>, San Francisco, June 1998.<BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/dac98.ps">full paper (PostScript, 160 Kbytes)</A><BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/dac98.pdf">full paper (PDF, 41 Kbytes)</A></LI><BR> <LI>S. Brown, N. Manjikian, Z. Vranesic, S. Caranci, A. Grbic, R. Grindley, M. Gusat, K. Loveless, Z. Zilic, and S. Srbljic, "Experience in Designing a Large-scale Multiprocessor using Field-Programmable Devices and Advanced CAD Tools," <I>Proceedings of the 33rd IEEE Design Automation Conference</I>, Las Vegas, June 1996.<BR> <A HREF="http://www.eecg.toronto.edu/~brown/DAC96.html">abstract<BR></A> <A HREF="http://www.eecg.toronto.edu/~brown/dac96.ps">full paper (PostScript, 177 Kbytes)</A><BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/dac96.pdf">full paper (PDF, 63 Kbytes)</A></LI><BR> <LI> Z. Zilic, G. Lemieux, K. Loveless, S. Brown, and Z. Vranesic, "Designing for High Speed-Performance in CPLDs and FPGAs," <I>Proc. 3rd Canadian Workshop on Field-Programmable Devices (FPD'95): Technology, Tools, and Applications</I>, Montreal, Canada, pp. 108-113, May 1995.</A><BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/fpd95.pdf"> full paper (PDF, 31 Kbytes)</A> </LI><BR> <LI> T. Abdelrahman, S. Brown, T. Mowry, K. Sevcik, M. Stumm, Z. Vranesic, S. 
Zhou, A. Elkateeb, M. Gusat, P. Pereira, B. Gamsa, R. Grindley, O. Krieger, G. Lemieux, K. Loveless, N. Manjikian, G. Ravindran, S. Srbljic, Z. Zilic, "An Overview of the NUMAchine Multiprocessor Project," <I>Proceedings of the 8th Canadian Supercomputing Conference</I>, June 1994.</A><BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/overview.ps">full paper (PostScript, 224 Kbytes)</A><BR> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/overview.pdf">full paper (PDF, 200 Kbytes)</A></LI><BR> </LI> </UL> <P> <HR WIDTH="100%"><BR> <A NAME="systemmanuals"> <FONT SIZE=+1>System Manuals</FONT></P></A> <P>We are developing the system-level programming documentation to provide details on the NUMAchine address space and describe various special functions controlled by system software.</P> <UL> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/sys_prog_manual.pdf"> NUMAchine Principles of Operations for System Programmers (PDF, 228 Kbytes)<BR> </A><B><I>DISCLAIMER: this is a preliminary document and is subject to change at any time.</I></B></LI> </UL> <P> Also, the hardware reference manual describes all of those nitty-gritty details that the software types don't really care about. Those who work closely with the hardware should be familiar with this manual.</P> <UL> <LI><A HREF="http://www.eecg.toronto.edu/parallel/parallel/docs/hw_maintenance_manual.pdf"> NUMAchine Hardware Reference and Maintenance Manual (PDF, 539 Kbytes)<BR> </A><B><I>DISCLAIMER: this is a preliminary document and is subject to change at any time.</I></B></LI> </UL> <P> <HR WIDTH="100%"><BR> <FONT SIZE=+1>NUMAchine-related Theses</FONT></P> <UL> The following theses offer greater insight into the details of the NUMAchine hardware. However, note that the content of the theses is dated, and changes to the hardware have been made for various reasons (integration, economics, correctness, etc.). 
Consequently, the information below does not accurately document the state of the NUMAchine hardware as it is today. Instead, consult either the <A HREF="http://www.eecg.toronto.edu/parallel/numadocs.html#systemmanuals"> <I>System Programming Manual</I></A> or <A HREF="http://www.eecg.toronto.edu/parallel/numadocs.html#systemmanuals"> <I>Hardware Reference and Maintenance Manual</I></A>, as these will be kept as current as possible.<BR><BR> <LI>Eddy Ah Pin, "Hardware Performance Monitoring in Memory of NUMAchine Multiprocessor," <I>Undergraduate Thesis,</I> University of Toronto, 1997. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/ahpin.pdf">PDF, 146k</A></LI><BR> <LI><A HREF="http://www.eecg.toronto.edu/~grbic">Alex Grbic</A>, "Hierarchical Directory Controllers in the NUMAchine Multiprocessor," <I>M.A.Sc. Thesis,</I> University of Toronto, 1996. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/grbic.pdf">PDF, 4120k</A></LI><BR> <LI><A HREF="http://www.eecg.toronto.edu/~grbic">Alex Grbic</A>, "Assessment of Cache Coherence Protocols in Shared-Memory Multiprocessors," <I>Ph.D. Thesis,</I> University of Toronto, 2003. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/grbic_phd.pdf">PDF, 1064k</A></LI><BR> <LI><A HREF="http://www.eecg.toronto.edu/~grindley">Robin Grindley</A>, "The NUMAchine Multiprocessor: Design and Analysis," <I>Ph.D. Thesis,</I> University of Toronto, 1999. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/grindley.pdf">PDF, 1776k</A></LI><BR> <LI><A HREF="http://www.eecg.toronto.edu/~lemieux">Guy Lemieux</A>, "Hardware Performance Monitoring in Multiprocessors," <I>M.A.Sc. Thesis,</I> University of Toronto, 1996. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/lemieux.pdf">PDF, 219k</A></LI><BR> <LI><A HREF="http://www.eecg.toronto.edu/~kelvin">Kelvin Loveless</A>, "The Implementation of Flexible Interconnect in the NUMAchine Multiprocessor," <I>M.A.Sc. 
Thesis,</I> University of Toronto, 1996. <BR><A HREF="http://www.eecg.toronto.edu/parallel/parallel/theses/loveless.pdf">PDF, 848k</A></LI><BR> <li>Karl Schabas, "The Implementation of Basic Monitoring Functions on the NUMAchine Multiprocessor", <i>Undergraduate Thesis</i>, University of Toronto, 2000.<br> <a href="http://www.eecg.toronto.edu/parallel/parallel/theses/schabas.pdf">PDF, 281K</A> <!-- A pdf version of the paper is available in ~grbic/schabas.pdf. Also make a copy of it in the webpage directory.--> <p> </UL> <HR WIDTH="100%"></P> <P><A HREF="NUMA.Welcome.html">Back to NUMAchine Home Page...</A></P> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/NUMAchine-med.gif version [663c7b0c5d].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/NUMAchine.arch.gif version [3cceeda173].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/mem.gif version [235779c2b4].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/ni.gif version [e6b2be8207].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/numa2.gif version [fa41f30a77].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/numahw.html version [a421b6b289].
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <TITLE>Hardware Development for the NUMAchine Multiprocessor</TITLE> <META NAME="GENERATOR" CONTENT="Mozilla/3.0Gold (X11; I; SunOS 4.1.3_U1 sun4m) [Netscape]"> <META NAME="Author" CONTENT="Naraig Manjikian"> </HEAD> <BODY background = "../images/maple_back.gif"> <CENTER><P><IMG SRC="NUMAchine-med.gif" HEIGHT=89 WIDTH=502></P></CENTER> <H2 ALIGN=CENTER>Hardware Development for the NUMAchine Multiprocessor<BR> at the University of Toronto</H2> <P>A 64-processor prototype of the <A HREF="../NUMA.Welcome.html">NUMAchine multiprocessor</A> architecture (illustrated below) is under construction in the <A HREF="http://www.ece.toronto.edu">Dept. of Electrical /Computer Engineering</A> at the <A HREF="http://www.toronto.edu">Univ. of Toronto</A>. </P> <CENTER><P><IMG SRC="NUMAchine.arch.gif" BORDER=2 HEIGHT=236 WIDTH=548></P></CENTER> <P>The implementation of each <I><A HREF="numahw.html#station">station</A></I> is based on the FutureBus+ physical standard, but NUMAchine utilizes a custom synchronous bus protocol.</P> <P>A number of printed circuit boards have been designed and fabricated:</P> <UL> <LI><I><A HREF="numahw.html#processor board">processor board</A></I> with a MIPS R4400 microprocessor and 1 MByte of SRAM cache<BR> <BR> </LI> <LI><A HREF="numahw.html#memory board"><I>memory board</I> </A>containing 32-128 MBytes of DRAM and 8 MBytes of SRAM for the cache coherence directory<BR> <BR> </LI> <LI><I><A HREF="numahw.html#network interface board">network interface board</A></I> to link a station to the ring hierarchy; this board also contains<BR> 8 MBytes of DRAM to cache remote data<BR> <BR> </LI> <LI><A HREF="numahw.html#clock generator board"><I>clock generator</I> </A>generates up to 18 differential ECL clocks<BR> <BR> </LI> <LI><A HREF="numahw.html#bus arbiter board"><I>bus arbiter</I> </A>a centralized bus arbiter controls access to the NUMAchine station bus<BR> <BR> </LI> </UL> <P><B>Status:</B> 
<EM>The I/O board has been fabricated and is working... pictures pending. Also, a number of circuit boards which implement the global ring for the top level of the interconnection network have been fabricated and are being tested.</EM></P> <P>All boards utilize field-programmable devices (FPDs) from the <A HREF="http://www.altera.com">Altera Corporation</A> for much of the control circuitry, such as the system interface for the <A HREF="http://www.mips.com">MIPS R4400 microprocessor</A>, the directory controller on the memory board, and the ring controller on the network interface board. Field-programmable devices provide shorter design cycles and cost-effectiveness (although good performance requires <A HREF="http://www.eecg.toronto.edu/~brown/DAC96.html">careful design</A>). In addition, FPDs provide flexibility to implement new protocols to support future research.<BR> <BR> <BR> <BR> </P> <CENTER><P><B><I>The NUMAchine Hardware Development Group</I></B> </P></CENTER> <CENTER><TABLE CELLSPACING=20 CELLPADDING=0 > <TR> <TD VALIGN=TOP> <LI>Prof. Zvonko G. Vranesic (<I>project leader</I>)</LI> <LI>Prof. Stephen D. Brown</LI> <LI>Prof. Michael Stumm</LI> <LI>Steve Caranci</LI> <LI>Alex Grbic</LI> <LI>Guy Lemieux</LI> <LI>Paul McHardy</LI> <LI>Peter Pereira</LI> </TD> <TD VALIGN=TOP>Major contributors who have moved on: <LI>Dr. Robin Grindley</LI> <LI>Mitch Gusat</LI> <LI>Dr. Orran Krieger</LI> <LI>Kelvin Loveless</LI> <LI>Dr. Naraig Manjikian</LI> <LI>Dr. Sinisa Srbljic</LI> <LI>Michael van Dam</LI> <LI>Dr. Zeljko Zilic</LI> <P>Summer students:</P> <LI>Eddy Ah Pin</LI> <LI>Terry Borer</LI> <LI>Jackson Fung</LI> <LI>Emanuel Istrate</LI> <LI>Daniel Levner</LI> <LI>Karl Schabas</LI> <LI>Deshanand P. 
Singh</LI> </TD> </TR> </TABLE></CENTER> <P> <HR WIDTH="100%"></P> <H2>Photographs of NUMAchine Hardware</H2> <P><A NAME="station"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="pictures/dbgstn.jpg"> <IMG SRC="numa2.gif" HEIGHT=225 WIDTH=262 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>A Fully-populated<BR> NUMAchine Station</FONT></I></B></P></CENTER> <P>The bus physical backplane is at the bottom of the photograph. The boards plug vertically into the backplane.</P> <P>From left to right:</P> <LI>bus arbiter board</LI> <LI>4 processor boards</LI> <LI>2 memory boards</LI> <LI>network interface board</LI> <P>The power supply is visible directly beneath the bus backplane. A clock generation and distribution board (not visible) is located underneath the backplane.</P> </TD> </TR> </TABLE> <P><A NAME="processor board"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/pictures/proc.jpg"> <IMG SRC="proc.gif" HEIGHT=378 WIDTH=308 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>The Processor Board</FONT></I></B></P></CENTER> <P>At the top of the board are LED displays and connectors for diagnostics, EPROM to program the Altera FPDs, and EPROM with boot code for the R4400.</P> <P>The MIPS R4400 microprocessor with heat sink is at the center of the board, surrounded by SRAM cache chips.</P> <P>Directly below the R4400 is a row of Altera field-programmable devices which serve as the system interface for the R4400. Below these chips is a row of FIFO buffers to and from the NUMAchine station bus. 
Finally, below the FIFOs is a row of FutureBus+ BTL chips for listening to and driving the NUMAchine station bus.</P> <P>Click on the picture to see the latest version of the processor board, revision 3, in detail.</P> <P>The connector to the NUMAchine station bus is at the bottom of the board.</P> <P><A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/procr.pic.ps">Block diagram (PostScript, 95 Kbytes)</A><P> </TD> </TR> </TABLE> <P><A NAME="memory board"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/pictures/mem.jpg"> <IMG SRC="mem.gif" HEIGHT=450 WIDTH=450 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>The Memory Board</FONT></I></B></P></CENTER> <P>DRAM SIMMs occupy the left side of the board. The top right-hand corner is occupied by a bank of SRAM chips used in maintaining the directory for the cache coherence protocol. </P> <P>At the right-hand center of the board are the Altera FPDs which contain the control circuitry for the cache coherence protocol. There is also an Altera FPD at the top center of the board to control the DRAM array.</P> <P>FIFO buffers and BTL interface chips connect the memory board to the NUMAchine station bus through the connector at the bottom of the board.</P> <P>Click on the picture to see the latest version of the memory board, revision 2, in detail. Hardware monitoring, which was not present in the original revision, has been added in the Altera FLEX10K30 device. 
The patchwires were necessary to correct an FPGA programming problem, and have been eliminated with a final respin of the board.</P> <P><A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/mem.pic.ps">Block diagram (PostScript, 99 Kbytes)</A><P> </TD> </TR> </TABLE> <P><A NAME="network interface board"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/pictures/nic.jpg"> <IMG SRC="ni.gif" HEIGHT=445 WIDTH=313 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>The Network Interface Board</FONT></I></B></P></CENTER> <P>The ring connectors are visible in the top corners of the board. The buffers for the ring interconnect occupy the space between the connectors.</P> <P>The DRAM chips for the remote data cache occupy a small area on the underside of the board.</P> <P>The Altera FPDs containing the control circuitry for the cache coherence protocol, the rings, and the remote data cache are clearly visible in their sockets.</P> <P>Pipelining for the wide data paths on this board requires the large number of buffer chips which occupy much of the board.</P> <P>FIFO buffers and BTL chips are located at the bottom left and bottom right, as well as along the bottom edge of the board, directly above the connector to the NUMAchine station bus.</P> <P>Click on the picture to see the latest version of the network interface board, revision 2, in detail. You will notice that many of the discrete buffers have been replaced with Altera FLEX6016 FPGAs. 
Also, the SDRAM has been moved to the top surface.</P> <P><A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/ni.pic.ps">Block diagram (PostScript, 97 Kbytes)</A><P> </TD> </TR> </TABLE> <P><A NAME="clock generator board"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/pictures/clock.jpg"> <IMG SRC="pictures/clock_small.jpg" WIDTH=200 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>The Clock Generator Board</FONT></I></B></P></CENTER> <P>The clock generator board can be programmed to a wide range of frequencies by the red DIP switch block. The differential ECL master clock is generated by the chip at the top centre of the board and split 2:1 by the small chip in the centre. The left and right chips are 9:1 fanout replicators, giving a total of 18 ECL clock signals. We distribute the clocks to the NUMAchine backplane via twisted-pair cables. Of course, we must take care that the cables are all the same length to minimize skew mismatch between the signals.</P> </TD> </TR> </TABLE> <P><A NAME="bus arbiter board"></A></P> <TABLE BORDER=1 CELLPADDING=10 > <TR> <TD> <A HREF="http://www.eecg.toronto.edu/parallel/parallel/numahw/pictures/arb.jpg"> <IMG SRC="pictures/arb_small.jpg" WIDTH=300 ALIGN=TEXTTOP></A></TD> <TD> <CENTER><P><B><I><FONT SIZE=+1>The Bus Arbiter Board</FONT></I></B></P></CENTER> <P>The bus arbiter board is a centralized, synchronous arbiter that controls access to the NUMAchine bus. Since this was one of the first boards we made, a few miscellaneous test circuits were also added to experiment with high-speed signalling using Altera devices. These test circuits use the DIP switches to test different functions. Also, a NUMAchine station RESET switch is located on this board, just below the DIP switches.</P> <P>The bus arbiter function has been added to the latest version of the I/O Board. 
Unfortunately, we do not have scans of that board ready yet for display.</P> </TD> </TR> </TABLE> <P><A HREF="http://www.eecg.toronto.edu/parallel/NUMA.Welcome.html">Back to NUMAchine Home Page...</A></P> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/pictures/arb_small.jpg version [c4c911d4ec].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/pictures/clock_small.jpg version [86e2cf518a].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/pictures/dbgstn.jpg version [a2d59ea0b6].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/numahw/proc.gif version [0868fe3031].
cannot compute difference between binary files
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/projects.html version [3b9c6d187b].
<!-- RCS $Id: projects.html,v 1.3 1994/11/02 20:30:13 caranci Exp caranci --> <HTML> <body bgcolor="#ffffff" text="#000000" link="#0000ff" vlink="#aaaaff" alink="#0077FF"> </body> <FONT SIZE=4> <HEAD> <TITLE>Current UofT EECG Projects</TITLE> </HEAD> <BODY> <H1>Current Projects</H1> <DL> <DT><a href="hector.html"> Hector</A> <DD>Hector is a scalable shared memory multiprocessor with an interconnect of hierarchical rings. <DT><a href="hurricane.html"> Hurricane</A> <DD> Hurricane is a hierarchically clustered operating system implemented on the Hector multiprocessor. <DT><a href="numachine.html"> <IMG SRC="images/NUMAchine-small.gif" ALT = "NUMAchine"></A> <DD> <A href="numachine.html">NUMAchine</A> is a next-generation implementation of the basic Hector multi-processor architecture. Features include: hardware cache-coherency, network cache (a lockup-free tertiary cache), efficient multicast mechanism, and hardware performance monitoring support. <DT><a href="tornado.html"> <IMG SRC="images/torn-small.gif" border=0 vspace=5 hspace=5 ALT = "Tornado"></A> </A> <DD> <A href="tornado.html">Tornado</A> is the operating system being implemented for the NUMAchine multiprocessor. It is a multiuser, NUMA-aware, performance-oriented microkernel operating system. Most services are provided by servers and application-level run-time libraries. Tornado has a highly modular structure and is implemented in C++. </DL> <HR> <STRONG>This is still a work in progress...<BR> Please forward any comments, suggestions or questions to:</STRONG><BR> <a href="mailto:caranci@eecg.toronto.edu"><i>caranci@eecg.toronto.edu</i></a> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/parallel/tornado.html version [6421902085].
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title>K42/Tornado Web Page Redirection</title> <META HTTP-EQUIV="Refresh" CONTENT="1; URL=http://www.eecg.toronto.edu/~tornado"> </head> <body> <h1>K42/Tornado Web Page Redirection</h1> <p> The University of Toronto K42/Tornado web page has moved to <a href="http://www.eecg.toronto.edu/~tornado/">http://www.eecg.toronto.edu/~tornado/</a>. If your browser doesn't automatically redirect to its new location, click the above link. </p> <hr> <address><a href="mailto:tamda@eecg.toronto.edu">David Kar-Fai Tam</a></address> <!-- Created: Tue Oct 7 17:23:32 EDT 2003 --> <!-- hhmts start --> Last modified: Tue Oct 7 17:36:05 EDT 2003 <!-- hhmts end --> </body> </html> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/people.html version [3499bb98b9].
<HTML> <body bgcolor="#ffffff" text="#000000" link="#0000ff" vlink="#aaaaff" alink="#0077FF"> </body> <FONT SIZE=4> <HEAD> <TITLE>People</TITLE> <!-- Changed by: Orran Y. Krieger, 18-Apr-1996 --> </HEAD> <center> <H1> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/music1.gif"></A> <P> People</H1> </center> <BODY> <img align=left src="images/comments.gif"> Please mail changes and additions to <A HREF="mailto:kulki@cs.toronto.edu">Kulki</A> or <A HREF="mailto:okrieg@eecg.toronto.edu">Orran</A> <p> <br> <H2>Faculty</H2> <UL> <LI> <A href="http://www.eecg.toronto.edu/~tsa/Welcome.html">T. Abdelrahman</a> <LI> <A href="../~brown/Welcome.html">S. Brown </a> <LI> <A href="../~corinna.html">C. Lee </a> <LI> <A href="http://www.eecg.toronto.edu/~tcm/Welcome.html">T. Mowry </a> <LI> <A href="http://www.cs.toronto.edu/~kcs">K. Sevcik </a> <LI> <A href="../~stumm/Welcome.html">M. Stumm </a> <LI> <A href="../~zvonko/Welcome.html">Z. Vranesic </a> <LI> <A href="http://www.eecg.toronto.edu/~zhou/Welcome.html">S. Zhou </a> </UL> <P> <H2>Students</H2> <UL> <LI> <A href="http://www.eecg.toronto.edu/~bernecky">R. Bernecky</a> <LI> <A href="../~charlesc.html">C. Chan</a> <LI> <A href="http://www.eecg.toronto.edu/~demke">A. Demke</a> <LI> <A href="http://www.eecg.toronto.edu/~devrier">D. De Vries</a> <LI> <A href="../~dunc/index.html">D. Elliott</a> <LI> <A href="http://www.eecg.toronto.edu/~farkas/">K. Farkas</a> <LI> <A href="../~ben/Welcome.html">B. Gamsa </a> <LI> <A href="http://www.eecg.toronto.edu/~grindley/Welcome.html">R. Grindley </a> <LI> <A href="http://www.eecg.toronto.edu/~grbic">A. Grbic </a> <LI> <A href="http://www.eecg.toronto.edu/~gusat">R. Ho </a> <LI> <A href="http://www.eecg.toronto.edu/~shuynh/va/Welcome.html">S. Huynh </a> <LI> <A href="http://www.eecg.toronto.edu/~hora/Welcome.html">M. Gusat </a> <LI> <A href="http://www.cs.toronto.edu/~karim">K. Harzallah</a> <LI> <A href="http://www.eecg.toronto.edu/~jaseemud">M. 
Jaseemuddin</a> <LI> <A href="http://www.cs.toronto.edu/~kulki">D. Kulkarni</a> <LI> <A href="http://www.cs.toronto.edu/~lamma">M. Lam</a> <LI> <A href="http://www.eecg.toronto.edu/~lemieux/Welcome.html">G. Lemieux </a> <LI> <A href="http://www.cs.toronto.edu/~paullu">P. Lu</A> <LI> <A href="http://www.cs.toronto.edu/~luk">C. Luk</A> <LI> <A href="http://www.eecg.toronto.edu/~kma">K. Ma </a> <LI> <A href="http://www.cs.toronto.edu/~maione">I. Maione </a> <LI> <A href="http://www.eecg.toronto.edu/~nmanjiki/Welcome.html">N. Manjikian </a> <LI> <A href="http://www.cs.toronto.edu/~neto">D. Neto </a> <LI> <A href="http://www.cs.toronto.edu/~eparsons">E. Parsons</A> <LI> <A href="http://www.cs.toronto.edu/~phan">G. Phan</A> <LI> <A href="http://www.eecg.toronto.edu/~gravin/Welcome.html">G. Ravindran </a> <LI> <A href="http://www.eecg.toronto.edu/~reid/Welcome.html">K. Reid</A> <LI> <A href="http://www.eecg.toronto.edu/~reza">R. Solymaani</A> <LI> <A href="../~saghir.html">M. Saghir</A> <LI> <A href="../~steffan.html">G. Steffan</A> <LI> <A href="http://www.eecg.toronto.edu/~stoodla">M. Stoodley</A> <LI> <A href="http://www.eecg.toronto.edu/~tandri/Welcome.html">S. Tandri</a> <LI> <A href="http://www.eecg.toronto.edu/~zeljko/Welcome.html">Z. Zilic </a> </UL> <P> <H2>Staff</H2> <UL> <LI> <A href="http://www.eecg.toronto.edu/~caranci/Welcome.html">S. Caranci</A> <LI> <A href="http://www.eecg.toronto.edu/~okrieg/Welcome.html">O. Krieger </a> <LI> <A href="http://www.eecg.toronto.edu/~kelvin/Welcome.html">K. Loveless </a> <LI> <A href="../~peterp/Welcome.html">P. Pereira </a> </UL> <H2>Graduates</H2> <UL> <LI> <A href="http://www.cs.ualberta.ca/~unrau">R. Unrau</A> <LI> <A href="http://www.cs.yorku.ca/People/brecht">T. Brecht</A> <LI> <A href="http://www.eecg.toronto.edu/~okrieg/Welcome.html">O. Krieger </a> <LI> <A href="http://www.cs.yorku.ca/People/hsandhu">H. Sandhu</A> <LI> <A href="http://www.eecg.toronto.edu/~hui">H. 
Li</A> </UL> <A HREF="Welcome.html"> <img align="left" src="images/homeblue.gif"> Return to Parallel Systems Home </A> </BODY> </HTML> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/publications.html version [7c0b03c097].
|
<TITLE>Publications</TITLE> <A NAME="BEG"> </A> <!-- Changed by: Orran Y. Krieger, 12-Nov-1995 --> <body bgcolor="#ffffff" text="#000000" link="#0000ff" vlink="#aaaaff" alink="#0077FF"> </body> <FONT SIZE=4> <BR> <center> <H1> <img align="left" src="../EECG/RESEARCH/ParallelSys/images/journal.gif"> Publications</H1> </center> <BR> <BR> <P> Most of these papers can also be <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel">accessed via ftp</A>. A full, un-sorted, <A HREF="pubs_abs.html">list of publications with abstracts</A> is also available. <P> <em> <img align="left" src="../EECG/RESEARCH/ParallelSys/images/comments.gif"><A HREF="mailto:kulki@cs.toronto.edu"> Please send your suggestions and comments.</A> <p> <BR> <em> Group members: Whenever you have a paper for public eyes please mail <strong> kulki@eecg.toronto.edu </strong> a text with author and source information, text abstract, and a postscript file with embedded source information. </em> <P> <BR> <BR> <BR> <center> <A HREF="publications.html#ca"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/arch_button.gif"></A> <P> <A HREF="publications.html#os"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/os_button.gif"></A> <P> <A HREF="publications.html#compilers"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/comp_button.gif"></A> <P> <A HREF="publications.html#scheduling"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/sch_button.gif"></A> <P> <A HREF="publications.html#pe"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/perf_button.gif"></A> <P> <A HREF="publications.html#db"> <img align="center" src="../EECG/RESEARCH/ParallelSys/images/data_button.gif"></A> </center> <P> <BR> <BR> <BR> <BR> <BR> <P> <A HREF="Welcome.html"> <img align="left" src="images/homeblue.gif"> Return to Parallel Systems Home </A> <P> <img src="images/redline.GIF"> <p> <A NAME="ca"><H2>Computer Architecture</H2></A> <UL> <LI> <A 
HREF="pubs_abs.html#Ravi_Stumm_JIEICE96"> A Comparison of Blocking and Non-blocking Packet Switching Techniques in Hierarchical Ring Networks </A> <br> IEICE Trans 1996 <LI> <A HREF="pubs_abs.html#Ravi_Stumm_ICPP95"> Hierarchical Ring Topologies and the effect of their Bisection Bandwidth Constraints </A> <br> ICPP 1995 <LI> <A HREF="pubs_abs.html#Wilton_Vranesic_SPDP">Architectural Support for Block Transfers in a Shared-Memory Multiprocessor</A> <br> SPDP 1993 <LI> <A HREF="pubs_abs.html#Stumm_Vranesic_White_IPPS93">Experience with the Hector Multiprocessor</A> <br> IPPS 1993 <LI> <A HREF="pubs_abs.html#Vranesic_etal_IEEEC">Hector -- A hierarchically structured shared memory multiprocessor</A> <br> IEEE Computer 1991 <LI> <A HREF="pubs_abs.html#Holliday_Stumm_IEEETC">Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors </A> <br> IEEE Trans. Computer 1992 </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <img src="images/redline.GIF"> <p> <A NAME="os"><H2>Operating Systems</H2></A> <UL> <LI> <A HREF="pubs_abs.html#Orran_etal_SPDPW95"> Exploiting Mapped Files for Parallel I/O </A> <br> 1995 SPDP Workshop on Modeling and Specification of I/O (MSIO) <LI> <A HREF="pubs_abs.html#Parsons_etal_IWOOS95">(De-)Clustering Objects for Multiprocessor System Software </A> <br> IWOOS95 Workshop <LI> <A HREF="pubs_abs.html#Unrau_etal_EuroPar95">On the Scalability of Demand-Driven Parallel Systems</A> <br> EuroPar 95 <LI> <A HREF="pubs_abs.html#Ben_etal_OOPSLAW94">The Importance of Performance-Oriented Flexibility in System Software for Large-Scale Shared-Memory Multiprocessors </A> <br> OOPSLA 94 Workshop on Flexible System Software <LI> <A HREF="pubs_abs.html#Unrau_etal_OSDI94"> Experiences with Locking in a NUMA Multiprocessor Operating System Kernel </A> <br> OSDI 1994 <LI> <A HREF="pubs_abs.html#Unrau_etal_JSC94">Hierarchical clustering: A structure for scalable multiprocessor operating system design</A> <br> Journal of 
Supercomputing 1995 <LI> <A HREF="pubs_abs.html#Okrieg_PhD">HFS: A flexible file system for shared-memory multiprocessors</A> <br> PhD thesis, 1994 <LI> <A HREF="pubs_abs.html#Krieger_etal_IEEEComp94">The Alloc Stream Facility: A redesign of application-level Stream I/O</A> <br> IEEE Computer 1994 <LI> <A HREF="pubs_abs.html#Gamsa_etal_ICPP94">Optimizing IPC Performance for Shared-Memory Multiprocessors</A> <br> ICPP 1994 <LI> <A HREF="pubs_abs.html#Sandhu_et_al_PPOPP">The shared regions approach to software cache coherence on multiprocessors</A> <br> PPoPP 1993 <LI> <A HREF="pubs_abs.html#Krieger_Stumm_DAGS93">HFS: A Flexible File System for Large-Scale Multiprocessors</A> <br> DAGS 1993 <LI> <A HREF="pubs_abs.html#Krieger_etal_ICPP93">A Fair Fast Scalable Reader-Writer Lock</A> <br> ICPP 1993 <LI> <A HREF="pubs_abs.html#Stumm_Unrau_Krieger_USENIX92">Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design</A> <br> USENIX 1992 <LI> <A HREF="pubs_abs.html#Gamsa_MASc">Region-Oriented Main Memory Management in Shared-Memory NUMA Multiprocessors</A> <br> MASc thesis, 1992 <LI> <A HREF="pubs_abs.html#Unrau_PhD">Scalable Memory Management through Hierarchical Symmetric Multiprocessing</A> <br> PhD thesis, 1992 </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <img src="images/redline.GIF"> <p> <A NAME="compilers"><H2>Compilers</H2></A> <UL> <LI> <A HREF="pubs_abs.html#Tandri_Abdel_PDPTA95">Computation and Data Partitioning on Scalable Shared Memory Multiprocessors</A> PDPTA, November 1995 <LI> <A HREF="pubs_abs.html#Kulkarni_Stumm_Tut">Loop and Data Transformations:Tutorial</A> CSRI Tech Report 337, June 1993 <LI> <A HREF="pubs_abs.html#Li_Tandri_et">Locality and Loop Scheduling on Numa Multiprocessors</A> <br> ICPP 92 <LI> <A HREF="pubs_abs.html#Manjikian_Abdelrahaman_315">Fusion of Loops for Parallelism and Locality</A> <br> Tech Report <LI> <A HREF="pubs_abs.html#Kulkarni_Stumm_LCR95">CDA Loop Transformations</A> 
<br> 3rd LCR Workshop <LI> <A HREF="pubs_abs.html#Kulkarni_Stumm_Unrau_EuroPar95">Implementing Flexible Computation Rules with Subexpression-level Loop Transformations</A> <br> EuroPar 95 <LI> <A HREF="pubs_abs.html#Kulkarni_etal_317">A Generalized Theory of Linear Loop Transformations</A> <br> Tech Report <LI> <A HREF="pubs_abs.html#Kulkarni_Stumm_292">Computational Alignment: A new, unified program transformation for local and global optimization</A> <br> Tech Report <LI> <A HREF="pubs_abs.html#Kulkarni_Stumm_ACJ95">Linear Loop Transformations in Optimizing Compilers for Parallel Machines</A> <br> Australian Computer Journal <LI> <A HREF="pubs_abs.html#Kumar_Kulkarni_ICS92">Deriving Good Transformations for Mapping Nested Loops on Hierarchical Parallel Machines in Polynomial Time</A> <br> ICS 92 <LI> <A HREF="pubs_abs.html#Kumar_Kulkarni_ICPP91">Generalized Unimodular Loop Transformations for Distributed Memory Multiprocessors</A> (does not contain figures) <br> ICPP 91 </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <img src="images/redline.GIF"> <p> <A NAME="scheduling"><H2>Scheduling</H2></A> <UL> <LI> <A HREF="pubs_abs.html#Zhou_Brecht_SM91">Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors</A> <br> Sigmetrics 91 <LI> <A HREF="pubs_abs.html#Brecht_SEDMS93">On the Importance of Parallel Application Placement in NUMA Multiprocessors</A> <br> SEDM 93 <LI> <A HREF="pubs_abs.html#Sevcik_JPE">Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems</A> <br> (Journal of) Performance Evaluation 94 <LI> <A HREF="pubs_abs.html#Curran_Stumm_CS">A Comparison of basic CPU Scheduling Algorithms for Multiprocessor Unix</A> <br> (Journal) Computer Systems 90 <LI> <A HREF="pubs_abs.html#Brecht_PhD_303">Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors</A> <br> PhD thesis <LI> <A HREF="pubs_abs.html#Wu_MASc">Processor Scheduling in Multiprogrammed Shared Memory NUMA 
Multiprocessors</A> </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <img src="images/redline.GIF"> <p> <A NAME="pe"><H2>Performance Evaluation</H2></A> <UL> <LI> <A HREF="pubs_abs.html#Sevcik_Zhou_PERF93">Performance Benefits and Limitations of Large NUMA Multiprocessors</A> <br> Performance 93 <LI> <A HREF="pubs_abs.html#Harz_Sevcik_SC93">Hot Spot Analysis in Large Scale Shared Memory Multiprocessors</A> <br> Supercomputing 93 <LI> <A HREF="pubs_abs.html#Holliday_Stumm_IEEETC">Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors </A> <br> IEEE Trans. Computer 1992 <LI> <A HREF="pubs_abs.html#Parsons_Sevcik_IPPS95">Multiprocessor Scheduling for High-Variability Service Time Distributions </A> <br> IPPS Workshop on Job Scheduling 95 </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <img src="images/redline.GIF"> <p> <A NAME="db"><H2>Database Systems</H2></A> <UL> <LI> <A HREF="pubs_abs.html#Baru_Zilio_PADS93">Data reorganization in parallel database systems</A> <br> IEEE Workshop PADS 93 </UL> <P> <A HREF="publications.html#BEG" > Return to the LIST</A> <P> <A HREF="Welcome.html"> <img align="left" src="images/homeblue.gif"> Return to Parallel Systems Home </A> |
Added wiki_references/2017/software/eecg_toronto_edu/2017_05_12_wget_copy_of_http_www_eecg_toronto_edu_parallel_publications_html/bonnet/www.eecg.toronto.edu/parallel/pubs_abs.html version [aa83589fa9].
|
|
<!---------------------------------------------------------------------> <HR><A NAME="Ravi_Stumm_JIEICE96">.</A><HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Ravi_Stumm_JIEICE96.ps.Z"> A Comparison of Blocking and Non-blocking Packet Switching Techniques in Hierarchical Ring Networks </A> <P> <B>Authors:</B> G. Ravindran and M. Stumm <P> <B>Where:</B> IEICE Trans. Inf. & Syst., vol. E79-D, No. 8, August 1996 <P> <B>Keywords:</B> Networks, Switching, Wormhole, Virtual Cut-through, Hierarchical Ring Networks, Slotted Rings <P> <B>Abstract:</B> This paper presents the results of a simulation study of blocking and non-blocking switching for hierarchical ring networks. The switching techniques include wormhole, virtual cut-through, and slotted ring. We conclude that slotted ring network performs better than the more popular wormhole and virtual cut-through networks. We also show that the size of the node buffers is an important parameter and that choosing them too large can hurt performance in some cases. Slotted rings have the advantage that the choice of buffer size is easier in that larger than necessary buffers do not hurt performance and hence a single choice of buffer size performs well for all system configurations. In contrast, the optimal buffer size for virtual cut-through and wormhole switching nodes varies depending on the system configuration and the level in the hierarchy in which the switching node lies. <P> <!---------------------------------------------------------------------> <HR><A NAME="Ravi_Stumm_ICPP95">.</A><HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Ravi_Stumm_ICPP95.ps.Z"> Hierarchical Ring Topologies and the effect of their Bisection Bandwidth Constraints</A> <P> <B>Authors:</B> G. Ravindran and M. Stumm <P> <B>Where:</B> Proc. Intl. Conf. 
on Parallel Processing, pp.I/51-55, 1995 <P> <B>Keywords:</B> Multiprocessor architectures, Interconnection networks, Hierarchical rings, Bisection bandwidth <P> <B>Abstract:</B> Ring-based hierarchical networks are interesting alternatives to popular direct networks such as 2D meshes or tori. They allow for simple router designs, wider communications paths, and faster networks than their direct network counterparts. However, they have a constant bisection bandwidth, regardless of system size. In this paper, we present the results of a simulation study to determine how large hierarchical ring networks can become before their performance deteriorates due to their bisection bandwidth constraint. We show that a system with a maximum of 128 processors can sustain most memory access behaviors, but that larger systems can be sustained only if their bisection bandwidth is increased. <P> <!---------------------------------------------------------------------> <HR><A NAME="Zhou_Brecht_SM91">.</A><HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Zhou_Brecht_SM91.ps.Z">Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors</A> <P> <B>Authors:</B> Songnian Zhou and Timothy Brecht <P> <B>Where:</B> Appears in: Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May (1991), pp. 133-142. <P> <B>Keywords:</B> NUMA, Scheduling, multiprocessor performance <P> <B>Abstract:</B> <P> Large-scale Non-Uniform Memory Access (NUMA) multiprocessors are gaining increased attention due to their potential for achieving high performance through the replication of relatively simple components. Because of the complexity of such systems, scheduling algorithms for parallel applications are crucial in realizing the performance potential of these systems. 
In particular, scheduling methods must consider the scale of the system, with the increased likelihood of creating bottlenecks, along with the NUMA characteristics of the system, and the benefits to be gained by placing threads close to their code and data. <P> We propose a class of scheduling algorithms based on processor pools. A processor pool is a software construct for organizing and managing a large number of processors by dividing them into groups called pools. The parallel threads of a job are run in a single processor pool, unless there are performance advantages for a job to span multiple pools. Several jobs may share one pool. Our simulation experiments show that processor pool-based scheduling may effectively reduce the average job response time. The performance improvements attained by using processor pools increase with the average parallelism of the jobs, the load level of the system, the differentials in memory access costs, and the likelihood of having system bottlenecks. As the system size increases, while maintaining the workload composition and intensity, we observed that processor pools can be used to provide significant performance improvements. We therefore conclude that processor pool-based scheduling may be an effective and efficient technique for scalable systems. <!---------------------------------------------------------------------> <HR><A NAME="Brecht_SEDMS93">.</A><HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Brecht_SEDMS93.ps.Z">On the Importance of Parallel Application Placement in NUMA Multiprocessors</A> <P> <B>Authors:</B> Timothy Brecht <P> <B>Where:</B> Proceedings of the Fourth Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, September, 1993. 
<P> <B>Keywords:</B> NUMA, multiprocessor scheduling, multiprocessor performance <P> <B>Abstract:</B> <P> The thesis of this paper is that scheduling decisions in large-scale, shared-memory, NUMA (Non-Uniform Memory Access) multiprocessors must consider not only how many processors, but also which processors to allocate to each application. We call the problem of assigning parallel processes of an application to processors application placement. <P> We explore the importance of placement decisions by measuring the execution time of several parallel applications using different placements on a shared-memory NUMA multiprocessor. The results of these experiments lead us to conclude that, as expected, in small-scale mildly NUMA multiprocessors, placement decisions have only a minor effect on the execution time of parallel applications. However, the results also show that placement decisions in large-scale multiprocessors are critical and localization that considers the architectural clusters inherent in these systems is essential. Our experiments also show that the importance of placement decisions increases substantially with the size and NUMAness of the system and that the placement of individual processes of an application within the subset of chosen processors also significantly impacts performance. 
<!---------------------------------------------------------------------> <HR> <A NAME="Kumar_Kulkarni_ICPP91">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kumar_Kulkarni_ICPP91.ps.Z">Generalized Unimodular Loop Transformations for Distributed Memory Multiprocessors</A> (does not contain figures) <P> <B>Authors:</B> K G Kumar*, D Kulkarni+ and A Basu <BLOCKQUOTE> Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India<BR> * Now at IBM TJ Watson, Yorktown Heights, NY 10598<BR> + Now at Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4<BR> </BLOCKQUOTE> <P> <B>Where:</B> International Conference on Parallel Processing 91 <P> <B>Keywords:</B> Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors. <P> <B>Abstract:</B> <P> In this paper, we present a generalized unimodular loop transformation as a simple, systematic and elegant method for partitioning the iteration spaces of nested loops for execution on distributed memory multiprocessors. We present a methodology for deriving the transformations that internalize multiple dependences in a multidimensional iteration space without resulting in a deadlocking situation. We then derive the general expression for the bounds of the transformed loops in terms of the bounds of the original space and the transformation matrix elements. 
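<P> As a rough illustration of what "internalizing" a dependence can mean, the following minimal sketch applies a 2x2 unimodular matrix to a small two-deep iteration space. The matrix <CODE>T</CODE>, the dependence vector <CODE>d</CODE>, the space size <CODE>N</CODE>, and the helper names are assumptions chosen for this example; they are not taken from the paper, and the paper's generalized method is not reproduced here.

```python
# Illustrative sketch only: T, d, N and the helpers are hypothetical
# choices for this example, not the method or notation of the paper.

def det2(m):
    """Determinant of a 2x2 integer matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def apply(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])

# Assumed loop nest: for i in 0..N-1, for j in 0..N-1, where iteration
# (i, j) depends on iteration (i-1, j-1), i.e. dependence vector d = (1, 1).
N = 4
d = (1, 1)

# An integer matrix is unimodular when its determinant is +1 or -1.
T = [[1, -1],
     [0, 1]]
assert det2(T) in (1, -1)

original = [(i, j) for i in range(N) for j in range(N)]
transformed = [apply(T, p) for p in original]

# Unimodular maps are bijections on integer points: no iteration is
# lost or duplicated by the change of loop indices.
assert len(set(transformed)) == len(original)

# In the new space the dependence has no component along the outer index.
d_new = apply(T, d)

# Partition the iterations by the new outer index u = i - j.
partitions = {}
for p in transformed:
    partitions.setdefault(p[0], []).append(p)

# Every dependent pair of iterations lands in the same partition, so the
# partitions could be assigned to different processors without any
# communication for this particular dependence.
for (i, j) in original:
    if i + d[0] < N and j + d[1] < N:
        src = apply(T, (i, j))
        dst = apply(T, (i + d[0], j + d[1]))
        assert src[0] == dst[0]
```

<P> Here the transformed dependence is carried only by the inner index, which is the sense in which this sketch "internalizes" it. Handling several dependences at once, and deriving the transformed loop bounds, is the subject of the paper itself.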
<!---------------------------------------------------------------------> <HR> <A NAME="Kumar_Kulkarni_ICS92">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kumar_Kulkarni_ICS92.ps.Z">Deriving Good Transformations for Mapping Nested Loops on Hierarchical Parallel Machines in Polynomial Time</A> <P> <B>Authors:</B> K G Kumar*, D Kulkarni+ and A Basu <BLOCKQUOTE> Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India<BR> * Now at IBM TJ Watson, Yorktown Heights, NY 10598<BR> + Now at Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4<BR> </BLOCKQUOTE> <P> <B>Where:</B> International Conference on Supercomputing 92 <P> <B>Keywords:</B> Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors. <P> <B>Abstract:</B> <P> We present a computationally efficient method for deriving the most appropriate transformation and mapping of a nested loop for a given hierarchical parallel machine. This method is in the context of our systematic and general theory of unimodular loop transformations for the problem of iteration space partitioning. Finding an optimal mapping or an optimal associated unimodular transformation is NP-complete. We present a polynomial time method for obtaining a "good" transformation using a simple parameterized model of the hierarchical machine. We outline a systematic methodology for obtaining the most appropriate mapping. <!---------------------------------------------------------------------> <HR> <A NAME="Li_Tandri_et">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Li_Tandri_et.ps.Z">Locality and Loop Scheduling on NUMA Multiprocessors</A> <P> <B>Authors:</B> Hui Li, Sudarsan Tandri, Michael Stumm, and Kenneth C. 
Sevcik <P> <B>Where:</B> International Conference on Parallel Processing 93 <P> <B>Keywords:</B> NUMA multiprocessors, Locality, Scheduling <P> <B>Abstract:</B> <P> An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. While most existing dynamic scheduling algorithms manage load imbalances well, they fail to take locality into account and therefore perform poorly on parallel systems with non-uniform memory access times. In this paper, we propose a new loop scheduling algorithm, Locality-based Dynamic Scheduling (LDS), that exploits locality, and dynamically balances the load. <!---------------------------------------------------------------------> <HR> <A NAME="Sandhu_et_al_PPOPP">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sandhu_et_al_PPOPP.ps.Z">The shared regions approach to software cache coherence on multiprocessors</A> <P> <B>Authors:</B> Harjinder Sandhu, Benjamin Gamsa and Songnian Zhou <P> <B>Where:</B> Proceedings of the 1993 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May (1993). <P> <B>Keywords:</B> NUMA, cache coherence, multiprocessor performance <P> <B>Abstract:</B> <P> The effective management of caches is critical to the performance of applications on shared-memory multiprocessors. In this paper, we discuss a technique for software cache coherence that is based upon the integration of a program-level abstraction for shared data with software cache management. The program-level abstraction, called <EM>Shared Regions</EM>, explicitly relates synchronization objects with the data they protect. Cache coherence algorithms are presented which use the information provided by shared region primitives, and ensure that shared regions are always cacheable by the processors accessing them. Measurements and experiments of the Shared Region approach on a shared-memory multiprocessor are shown. 
Comparisons with other software based coherence strategies, including a user-controlled strategy and an operating system-based strategy, show that this approach is able to deliver better performance, with relatively low corresponding overhead and only a small increase in the programming effort. Compared to a compiler-based coherence strategy, the Shared Regions approach still performs better than a compiler that can achieve 90% accuracy in allowing caching, as long as the regions are a few hundred bytes or larger, or they are re-used a few times in the cache. <!---------------------------------------------------------------------> <HR> <A NAME="Wilton_Vranesic_SPDP">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Wilton_Vranesic_SPDP.ps.Z">Architectural Support for Block Transfers in a Shared-Memory Multiprocessor</A> <P> <B>Authors:</B> Steven J.E. Wilton and Zvonko G. Vranesic <P> <B>Where:</B> Fifth IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, December 1993 <P> <B>Keywords:</B> Shared-memory multiprocessor, block transfer support <P> <B>Abstract:</B> <P> This paper examines how the performance of a shared-memory multiprocessor can be improved by including hardware support for block transfers. A system similar to the Hector multiprocessor developed at the University of Toronto is used as a base architecture. It is shown that such hardware support can improve the performance of initialization code by as much as 50%, but that the amount of improvement depends on the memory access behavior of the program and the way in which the operating system issues block transfer requests. 
<!---------------------------------------------------------------------> <HR> <A NAME="Sevcik_Zhou_PERF93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sevcik_Zhou_PERF93.ps.Z">Performance Benefits and Limitations of Large NUMA Multiprocessors</A> <P> <B>Authors:</B> Kenneth C. Sevcik and Songnian Zhou <P> <B>Where:</B> Proceedings of Performance '93 , Rome, Italy, September 27 to October 1, 1993, pp. 183-204, Elsevier Science Publ. <P> <B>Abstract:</B> Please see the postscript file. <!---------------------------------------------------------------------> <HR> <A NAME="Harz_Sevcik_SC93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Harz_Sevcik_SC93.ps.Z">Hot Spot Analysis in Large Scale Shared Memory Multiprocessors</A> <P> <B>Authors:</B> Karim Harzallah and Kenneth C. Sevcik <P> <B>Where:</B> Proceedings of the Supercomputing '93 Conference, November, 1993, Portland, Oregon. <P> <B>Abstract:</B> Please see the postscript file. <!---------------------------------------------------------------------> <HR> <A NAME="Sevcik_JPE">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Sevcik_JPE.ps.Z">Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems</A> <P> <B>Authors:</B> Kenneth C. Sevcik <P> <B>Where:</B> (Journal of) Performance Evaluation, vol. 19 (1994), pp. 107-140 (Special issue on the performance evaluation of parallel systems) <P> <B>Abstract:</B> Please see the postscript file. 
<!---------------------------------------------------------------------> <HR> <A NAME="Holliday_Stumm_IEEETC">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Holliday_Stumm_IEEETC.ps.Z">Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors</A> <P> <B>Authors:</B> <BR> Mark Holliday<BR> Dept. of Computer Science, Duke University, Durham, NC 27706 <P> Michael Stumm<BR> Dept. of Electrical and Computer Engineering<BR> University of Toronto, Toronto, Canada M5S 1A4 <P> <B>Date:</B> November 1992; revised April 1993 <P> <B>Where:</B> Technical Report CS-1992-18, Duke University<BR> IEEE Transactions on Computers <P> <B>Keywords:</B> communication locality; hierarchical ring-based networks; hot spots; large scale parallel systems; memory banks; performance evaluation; prefetching; shared memory multiprocessors; simulation. <P> <B>Abstract:</B> <P> This paper investigates the performance of word-packet, slotted unidirectional ring-based hierarchical direct networks in the context of large-scale shared memory multiprocessors. Slotted unidirectional rings are attractive because their electrical characteristics and simple interfaces allow for fast cycle times and large bandwidths. For large-scale systems, it is necessary to use multiple rings for increased aggregate bandwidth. Hierarchies are attractive because the topology ensures unique paths between nodes, simple node interfaces and simple inter-ring connections. <P> To ensure that a realistic region of the design space is examined, the architecture of the network used in the Hector prototype is adopted as the initial design point. A simulator of that architecture has been developed and validated with measurements from the prototype. The system and workload parameterization reflects conditions expected in the near future. <P> The results of our study show the importance of system balance on performance. 
Large-scale systems inherently have large communication delays for distant accesses, so processor efficiency will be low, unless the processors can operate with multiple outstanding transactions using techniques such as prefetching, asynchronous writes and multiple hardware contexts. However, with multiple outstanding transactions and only one memory bank per processing module, memory quickly saturates. Memory saturation can be alleviated by having multiple memory banks per processing module, but this shifts the bottleneck to the ring subsystem. While the topology of the ring hierarchy affects performance, the ring subsystem will inherently limit the throughput of the system. Hence increasing the number of outstanding transactions per processor beyond a certain point only has a limiting effect on performance, since it causes some of the rings to become congested. An adaptive maximum number of outstanding transactions appears necessary to adjust for the appropriate tradeoff between concurrency and contention as the communication locality changes. We show the relationships between processor, ring and memory speeds, and their effects on performance. Using block transfers for prefetching seems unlikely to be advantageous in that the improvement in the cache hit ratio needed to compensate for the increased network utilization is substantial. <!---------------------------------------------------------------------> <HR> <A NAME="Curran_Stumm_CS">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Curran_Stumm_CS.ps.Z">A Comparison of basic CPU Scheduling Algorithms for Multiprocessor Unix</A> <P> <B>Authors:</B> Stephen Curran and Michael Stumm <P> <B>Where:</B> Computer Systems, 3(4), Oct., 1990, pp. 551-579. <P> <B>Abstract:</B> <P> In this paper, we present the results of a simulation study comparing three basic algorithms that schedule independent tasks in multiprocessor versions of Unix. 
Two of these algorithms, namely Central Queue and Initial Placement, are obvious extensions to the standard uniprocessor scheduling algorithm and are in use in a number of multiprocessor systems. A third algorithm, Take, is a variation on Initial Placement, where processors are allowed to raid the task queues of the other processors. Our simulation results show the difference between the performance of the three algorithms to be small when scheduling a typical Unix workload running on a small, bus-based, shared memory multiprocessor. They also show that the Take algorithm performs best for those multiprocessors on which tasks incur overhead each time they migrate. In particular, the Take algorithm appears to be more stable than the other two algorithms under extreme conditions. <!---------------------------------------------------------------------> <HR> <A NAME="Stumm_Unrau_Krieger_USENIX92">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Stumm_Unrau_Krieger _USENIX92.ps.Z">Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design</A> <P> <B>Authors:</B> Michael Stumm, Ron Unrau, and Orran Krieger <P> <B>Where:</B> Extended version of Clustering Micro-Kernels for Scalability, Proc. of the Usenix Workshop on Micro-Kernels and Other Kernel Architectures, April, 1992. <P> <B>Abstract:</B> Please see the postscript file. <P> <!---------------------------------------------------------------------> <HR> <A NAME="Stumm_Vranesic_White_IPPS93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Stumm_Vranesic_White_IPPS93.ps.Z">Experience with the Hector Multiprocessor</A> <P> <B>Authors:</B> Michael Stumm, Zvonko Vranesic, Ron White <P> <B>Where:</B> Extended version of paper with same title in Proc. Intl. Parallel Processing Symposium Parallel Systems Fair, 1993, pp. 9-16. <P> <B>Abstract:</B> Please see the postscript file. 
<P> <!---------------------------------------------------------------------> <HR> <A NAME="Krieger_etal_IEEEComp94">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_etal_IEEEComp94.ps.Z">The Alloc Stream Facility: A redesign of application-level Stream I/O</A> <P> <B>Authors:</B> O. Krieger, M. Stumm, and R. Unrau <P> <B>Where:</B> IEEE Computer, 27(3), March, 1994, pp. 75-83. <P> <B>Abstract:</B> <P> This paper introduces a new application level I/O facility called the Alloc Stream Facility (ASF). ASF has several key advantages. First, performance is substantially improved as a result of a) the structure of the facility that allows it to take advantage of system specific features like mapped files, and b) a reduction in data copying and the number of I/O system calls. Second, the facility is designed for multi-threaded applications running on multiprocessors and allows for a high degree of concurrency. Finally, the facility can support a variety of I/O interfaces, including stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows applications to freely intermix calls to the different interfaces, resulting in improved code re-usability. We show that on several Unix workstation platforms, I/O intensive applications perform substantially better when linked to ASF instead of the native facilities -- in the best case, up to twice as good. Modifying the applications to use a new interface provided with ASF can improve performance even more. 
<P> <!---------------------------------------------------------------------> <HR> <A NAME="Krieger_Stumm_DAGS93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_Stumm_DAGS93.ps.Z">HFS: A Flexible File System for Large-Scale Multiprocessors</A> <P> <B>Authors:</B> Orran Krieger and Michael Stumm <P> <B>Where:</B> Proceedings of the 1993 DAGS/PC Symposium <P> <B>Abstract:</B> <P> The Hurricane File System (HFS) is a new file system being developed for large-scale shared memory multiprocessors with distributed disks. The main goal of this file system is scalability; that is, the file system is designed to handle demands that are expected to grow linearly with the number of processors in the system. To achieve this goal, HFS is designed using a new structuring technique called Hierarchical Clustering. HFS is also designed to be flexible in supporting a variety of policies for managing file data and for managing file system state. This flexibility is necessary to support in a scalable fashion the diverse workloads we expect for a multiprocessor file system. <!---------------------------------------------------------------------> <HR> <A NAME="Krieger_etal_ICPP93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Krieger_etal_ICPP93.ps.Z">A Fair Fast Scalable Reader-Writer Lock</A> <P> <B>Authors:</B> O. Krieger, M. Stumm, R. Unrau, and J. Hanna <P> <B>Where:</B> Proc. Intl. Conf. on Parallel Processing, 1993. <P> <B>Abstract:</B> <P> A reader-writer lock allows either multiple readers to inspect shared data or a single writer exclusive access to that data. On shared memory multiprocessors, the cost of acquiring and releasing these locks can have a large impact on the performance of parallel applications. 
Other researchers have shown how to implement scalable locks, that is, locks that can become contended without resulting in memory or interconnection network contention. This paper describes a new algorithm for a reader-writer lock that, while being scalable in the contended case, has a low overhead in the uncontended case. This is important because most parallel applications are written so that locks are typically uncontended. The only atomic operation required by this algorithm is fetch_and_store and hence it can be used on most current multiprocessor systems. Experimental results are provided. <!---------------------------------------------------------------------> <HR> <A NAME="Kulkarni_Stumm_Tut">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Kulkarni_Stumm_Tutorial.ps.Z">Loop and Data Transformations: A tutorial</A> <P> <B>Authors:</B> Dattatraya Kulkarni and Michael Stumm <P> <B>Where:</B> CSRI Tech Report 337, University of Toronto, June 1993. <P> <B>Abstract:</B> <P> Hierarchically structured machines appear to be becoming the dominant parallel computing structure. These systems have non-uniform access times. We address the problem of restructuring a possibly sequential program to execute efficiently on such parallel machines. This restructuring involves transforming and partitioning the loop structures and the data so as to improve <EM>parallelism</EM>, <EM>static</EM> and <EM>dynamic locality</EM>, and <EM>load balance</EM>. The objective of this paper is to present previous and ongoing work on loop and data transformations and motivate a <EM>unified</EM> framework for restructuring a sequence of loops and data so as to execute efficiently on parallel machines with several levels of hierarchy. 
<!---------------------------------------------------------------------> <HR> <A NAME="Baru_Zilio_PADS93">.</A> <HR> <B>Title:</B> <A HREF="./../../manually_copied_ftp_colon_doubleslash_ftp_cs_toronto_edu/parallel/Baru_Zilio_PADS93.ps.Z">Data reorganization in parallel database systems</A> <P> <B>Authors:</B> Chaitanya Baru & Daniel C. Zilio <P> <B>Where:</B> Proc. of the IEEE Workshop on Advances in Parallel and Distributed Systems, Princeton, NJ, pp.102-107, Oct. 1993. <P> <B>Abstract:</B> <P> Parallel database systems are suitable for use in applications with high capacity and high performance and availability requirements. The trend in such systems is to provide efficient < |