I disagree with the characterization that Chapel's parallelization features copied OpenMP without improving upon it:
* Chapel's support for task parallelism predates OpenMP's (~2004 vs. ~2007, where Wikipedia cites Chapel's tasks as being inspiration for OpenMP's, along with Cilk and X10). Chapel's tasks are also arguably more general-purpose (akin to threads) in terms of their ability to synchronize, support data-driven producer/consumer patterns, etc.
* Chapel's forall loops are similar to OpenMP's loop-based parallelization pragmas, though OpenMP wasn't a source of inspiration in their design. Where OpenMP pragmas select from a menu of parallelization strategies baked into the specification and implementation, Chapel's forall loops invoke user-defined parallel iterators that permit abstracting a particular parallel pattern (say, multidimensional tiled iteration or tree traversal) into a named subroutine. These iterators can optionally be made methods of data structures and/or placed within libraries, and can be re-used across a program or multiple programs. One such library, DynamicIters, was community-contributed and specifically inspired by OpenMP's dynamic and guided scheduling strategies.
* Chapel supports parallel zippered iteration, in which two or more data structures and/or parallel iterators can be traversed in a coordinated manner.
* Chapel's parallelism can span multiple compute nodes via its shared namespace, which obviates the need for explicit communication; whereas OpenMP is limited to a single compute node or process unless mixed with MPI, SHMEM, or the like (and even then, OpenMP doesn't gain a cross-node view of parallel computation).
* In Chapel, parallelism can be expressed implicitly, for example, by passing an array argument to a subroutine or operator that is expecting a scalar (e.g., `var B = sin(A);` or `var C = A + B;`).
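To make the last two bullets concrete, here's a minimal sketch (array names and sizes are just illustrative) showing parallel zippered iteration and implicit parallelism via promotion:

```chapel
var A, B: [1..10] real;

// parallel zippered iteration: A and B are traversed in a coordinated manner
forall (a, b) in zip(A, B) do
  a = 2 * b;

// implicit parallelism via promotion: sin() is applied to each element of A
var C = sin(A);
```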
High bandwidth may mean the need to consult some very large but immutable data structure. As a trivial example, multiplying two matrices requires accessing each matrix fully multiple times over, but neither of them is altered in the process, so it can safely be done in parallel. Recording the result of a (naive) matrix multiplication can also be done without programmatic coordination, because each element is only updated once, independently from others.
This is very unlike, say, a database engine, where mutations occur all the time and may come from multiple threads.
Rust specifically makes it hard, if not impossible, to clobber shared mutable state, e.g., to produce a dangling pointer. But our matrix-multiplication example doesn't have that problem, so it won't benefit much from being implemented in Rust. Maybe this applies to more classes of HPC problems.
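The matrix-multiplication argument above can be sketched in Chapel (a naive illustration with an assumed size `n`): each element of `C` is written exactly once, by exactly one task, while `A` and `B` are only read, so no coordination is needed.

```chapel
config const n = 64;
var A, B: [1..n, 1..n] real;
var C: [1..n, 1..n] real;

// each (i, j) pair is handled by some task; C[i, j] is updated
// independently of all other elements, and A and B are never mutated
forall (i, j) in {1..n, 1..n} do
  for k in 1..n do
    C[i, j] += A[i, k] * B[k, j];
```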
HPC infrastructure is not like what you're used to. It is very high bandwidth, but latency depends on where your data lives. There are many more layers that complicate things, and each layer has a very different I/O speed.
How you handle the data can also be very different. Just see how libraries like this work: they take advantage of burst buffers and try to minimize what's being pulled from storage. There's a lot of memory management in the code people write to do all this complex stuff, so that you aren't waiting around for disks... or worse, tape.
In the "maybe we don't need it" you open up with this:
> Another explanation might be that HPC doesn’t really need new languages; that Fortran, C, and C++ are somehow optimal choices for HPC. But this is hard to take very seriously given some of the languages’ demerits
It's honestly hard to think of a less specific claim than "some of [their] demerits"; this is clearly preaching-to-the-choir territory. Hints of substance appear later, but the text is merely reminding the reader of something they are expected to already know.
Moving on, the summary for the "ten myths" series starts with:
> I wrote a series of eight blog posts entitled “Myths About Scalable Parallel Programming Languages” [...] In it, I described discouraging attitudes that our team encountered when talking about developing Chapel, and then gave my personal rebuttals to them.
So it appears to be a text about the trouble of trying to break through with a new "HPC" language, and the reader is again expected to already know the (potentially very good) technical reasons for why one would want to create a new one.
Good point on my alluding to demerits of Fortran, C, and C++ without stating them, and thanks for clarifying your criticism. Using the four factors that I focused on as attractive features in new languages:
Productivity: For me, while Fortran has some nice features for HPC (multidimensional arrays), lots about its design feels very old-fashioned to my (not particularly young) eyes. C and C++ are more "my generation" of programming language, so are familiar and comfortable, yet they still seem verbose, convoluted, and less readable (more symbolically oriented) as compared to Python, Julia, or Swift, which are more what I'm looking for in terms of productivity these days. Of the three, C++ has clearly made the biggest strides in recent years to improve productivity, with some successes in my opinion, though I've also had a hard time keeping up with all the changes.
Safety: I consider C and C++ to be fairly unsafe languages compared to more modern alternatives. I don't have enough experience with Fortran to have a particularly informed opinion, but feel as though I've been aware of patterns in the past that have felt unsafe. Here again, I think using modern C++ in a certain style (e.g., smart pointers) probably makes nice strides w.r.t. safety, but I'd still consider there to be a gap between it and Python/Rust (as does my colleague in this post: https://chapel-lang.org/blog/posts/memory-safety/)
Portability: Modulo the degree to which various compilers keep up with the latest standards in Fortran and C++, I'd consider all three languages to be quite portable.
Performance: There's no question that these are high-performing languages in the sequential computing setting. In HPC, while Fortran or C++ and MPI are often considered the gold standard, it's a standard that can be beat if your language maps more natively to the network's capabilities, or knows how to optimize for distributed memory computing rather than relying on the programmer to do it themselves.
With respect to the "10 myths" series, while the focus of the series was about combatting prevalent negative attitudes about new languages in the HPC community, I think there's a lot of content along the way that rationalizes the value of creating new languages in my rebuttals. That said, I fully realize that it's a long read, particularly in its updated "Redux" form.
@yubblegum: I'm unfairly biased towards Chapel (positively), so won't try to characterize HN's opinion on it. But I did want to note that while Chapel's original and main reason for being is HPC, now that everyone lives in a parallel-computing world, users also benefit from using Chapel in desktop environments where they want to do multicore and/or GPU programming. One such example is covered in this interview with an atmospheric science researcher for whom it has replaced Python as his go-to desktop language: https://chapel-lang.org/blog/posts/7qs-dias/
Thank you Brad! I was in fact wondering about GPU use myself. Does it work with Apple's M# GPUs?
Btw, I was looking at the docs for GPU support [1], and some unsolicited feedback from a potential user is that the setup process needs to become less painful. For example, yesterday I installed it via brew but then hit the setup page for GPU support and noted I now needed to build from source.
(Back in the day, one reason some of Sun's efforts to extend Java's fiefdom faltered was the friction of setup for (iirc) things like Applets, etc. I think Chapel deserves a far wider audience.)
@yubblegum: I'm afraid we don't have an update on support for Apple GPUs since last year's comment. While it comes up from time to time, nobody has opened an issue for it yet (please feel encouraged to!), and it isn't something we've had the chance to prioritize, as a lot of our recent work has focused on improving tooling support and addressing user requests.
That doesn't seem extreme to me, as I generally feel similarly. If you (or other readers) are genuinely interested in using Chapel with Metal, please open an issue on our GitHub repository capturing your request, as that would be valuable to us.
Just to make sure it didn’t get lost, note that it is possible to develop GPU code in Chapel on a MacBook using the cpu-as-device mode Engin mentions above, and then deploy it on NVIDIA GPUs on production systems by recompiling. This is how I develop/debug GPU computations in Chapel.
These are great questions, and ones we’re very curious about as well. I don’t believe that our current Chapel team has much experience programming NNs and LLMs, having focused on other areas. That said, I’m also not aware of any intrinsic barriers to implementing such algorithms in a portable way within Chapel, potentially calling out to vendor-optimized implementations when available and appropriate.
If you, or others, would be interested in exploring this topic, we’d be very interested in either partnering with you or supporting your efforts.
Chapel was designed for the high performance computing community where programmers often want full control over mapping their computations to their hardware resources without needing to rely on techniques like virtualization or runtime load balancing, which can obscure key details. That said, higher-level abstractions can be (and have been) written in Chapel to insulate many computations from these system-level details, such as distributed arrays and iterators. Users of these higher-level features need not worry about the details of the underlying locales. We refer to this as Chapel's support for multiresolution programming.
That said, other communities may obviously prefer different approaches due to differing needs and constraints.
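A small sketch of what "multiresolution" means in practice (assuming a recent Chapel with the `blockDist` factory API), contrasting explicit locale control with a higher-level abstraction that hides it:

```chapel
use BlockDist;

// low-level: explicitly place work on each locale (compute node)
coforall loc in Locales do
  on loc do
    writeln("hello from ", loc.name);

// high-level: a distributed array insulates the user from locale details
var A = blockDist.createArray(1..100, int);
forall a in A do  // runs in parallel, with affinity to each element's locale
  a = 1;
```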
What you primarily want in HPC is control over where your data is stored. That is subtly different from where your computations are performed. E.g., an HPC computation may use N heterogeneous devices and require fine-grained control over how data is communicated between those devices. The examples with "locales" are too blunt to handle such scenarios.
We agree that the placement of data is important for HPC programmers to control. Locales are the means of controlling such placement in Chapel, whether directly (as in this article’s simple examples) or via abstractions like distributed arrays (whose implementations rely on locales).
Once the data is created, computations can be executed with affinity to a specific variable in a data-driven manner using patterns like `on myVar do foo(myVar, anotherVar)`. Alternatively, an abstraction can hide such details from the user's concern and control the affinity within its implementation, as the parallel iterator implementing `forall elem in MyDistributedArray` does.
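As a minimal sketch of that data-driven pattern (`foo` and `myVar` are just illustrative names):

```chapel
var myVar: int;

proc foo(ref x: int, y: int) {
  x = y;
}

// the on-clause runs foo() on whichever locale myVar's memory lives on,
// so the computation moves to the data rather than the other way around
on myVar do foo(myVar, 42);
```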
According to the article, locales control where the code is running, not where the data is stored. Maybe that is implied in some cases, such that if you create data in one locale that is also where it is stored, but it tells you nothing about how data created in one locale and accessed in another is handled (or even whether that's allowed). The other Chapel features you mention, which I don't know, may fill in the gaps. My only point of contention is that the locale feature is poorly thought out and not a good way to address HPC needs.
Locales do control where the data is stored. For example:
    var HostArr: [1..10] int;   // allocated in host memory
    on here.gpus[0] {
      // now we are on a GPU sublocale...
      var DevArr: [1..10] int;  // allocated in device memory
      ...
    }
In the near term, we are planning to publish our 2nd GPU blog post where we will discuss how to move data between device and host.
@ColonelPhantom: Thanks very much for your questions. The following are answers I'm relaying from Engin Kayraklioglu, who heads up the Chapel GPU effort:
Re Intel support: That's definitely in our plans. However, there are also many other areas where we are actively working to add more features, fix bugs, and improve performance. When prioritizing, we typically make decisions based on what our current and potential users might need in the language. Frankly, we are not seeing a big push for Intel GPU support so far, so currently it is not near the top of our priorities. If you (or other readers) have cases where the lack of Intel support might be a blocker for trying out Chapel and/or its GPU support, definitely let us know.
Re implicit serialization: To clarify, the serialization based on order-dependence is not implicit. Users should use a `for` loop if their loop is order-dependent and `foreach` (or `forall`) if their loop is order-independent. In other words, the Chapel compiler doesn't make decisions about order-dependence. In particular, for GPU execution, a `for` loop will never turn into a GPU kernel.
There are, however, some cases where a `foreach` does not turn into a kernel. You may be referring to those cases, but that's not related to order-dependence. Some Chapel features cannot execute on a GPU; if your `foreach` loop's body uses any of those features, then it will not be launched as a kernel even though `foreach` signals order-independence. A subset of the features that make an order-independent loop GPU-ineligible are that way because we haven't gotten a chance to properly address them yet. Another subset will remain blockers for a longer time, maybe forever. For example, your `foreach` loop could be calling an external host function.
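A small sketch of the distinction described above (array and size are just illustrative):

```chapel
config const n = 1024;
var A: [1..n] int;

// order-dependent: each iteration reads the previous one's result,
// so a 'for' loop is required; this will never become a GPU kernel
for i in 2..n do
  A[i] = A[i-1] + 1;

// order-independent: 'foreach' declares the iterations independent,
// making the loop eligible to be launched as a GPU kernel
foreach i in 1..n do
  A[i] = i * i;
```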
Sorry for what now appears to be a double-post. Engin had just registered for HN, hadn't seen his reply going through, so asked me to relay it.
Re-reading this Q+A this morning, I also wanted to clarify one thing, which is that when a 'foreach' or 'forall' does end up being executed on the CPU, that doesn't mean it has been serialized. 'foreach' loops on the CPU are candidates for vectorization while 'forall' loops typically result in multicore task-parallelism with each task also being a candidate for vectorization.