I've been using Kotlin coroutines for a year now, and while I can say I've felt pretty comfortable working with them this whole time, it's only now that I decided to do a proper deep dive into stuff that I didn't understand much before - namely, coroutine context, exception propagation and coroutine cancellation.
So what exactly is a CoroutineContext? It's described in the official docs as an indexed set of Element instances, which is as vague as you can possibly get.
It sounds like a context can hold anything, but after some investigation it turns out that in practice they're referring to only four types of elements: Job, CoroutineDispatcher, CoroutineName and CoroutineExceptionHandler.
It makes me question why they didn't just document it as:
Coroutine context is a thing that can hold 1-4 of these four things
instead of resorting to this generic verbiage. The only other source I've found that explains it like that is the Android docs.
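To make the "set of elements" idea concrete, here is a minimal sketch that builds a context with the `+` operator and reads the elements back by key. The function name `buildSampleContext` and the name "sample" are made up for illustration. One detail to be aware of: the dispatcher is looked up under the `ContinuationInterceptor` key.

```kotlin
import kotlinx.coroutines.CoroutineExceptionHandler
import kotlinx.coroutines.CoroutineName
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlin.coroutines.ContinuationInterceptor
import kotlin.coroutines.CoroutineContext

// Hypothetical helper: combines one element of each commonly used type.
fun buildSampleContext(): CoroutineContext {
    val handler = CoroutineExceptionHandler { _, e -> println("caught $e") }
    return Job() + Dispatchers.Default + CoroutineName("sample") + handler
}

fun main() {
    val context = buildSampleContext()

    // Each element is fetched by its key, much like a map lookup.
    println(context[Job] != null)
    println(context[CoroutineName]?.name)
    println(context[ContinuationInterceptor] === Dispatchers.Default)
    println(context[CoroutineExceptionHandler] != null)

    // Adding an element whose key is already present replaces the old one.
    val renamed = context + CoroutineName("renamed")
    println(renamed[CoroutineName]?.name)
}
```

Since every element type has its own key, a context can hold at most one element of each type, which is where the "1-4 of these four things" reading comes from.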
Even more opaque than that was:
When you launch a new coroutine, it will inherit the four above mentioned elements from its parent and set them as its own coroutineContext. Any element you pass in explicitly replaces the inherited element of the same type. If you pass in a new dispatcher, only the dispatcher is replaced, and everything else just gets reused from the parent.
One exception to this is a Job element, which is always created anew and overridden - as it's basically a handle to the coroutine, each must get a new one.
This makes sense, but only after you figure out there's a finite number of types that can be passed as context. Otherwise you might be led to believe that, for example, setting a parent Job explicitly might override its CoroutineDispatcher, which is luckily just not true.
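The inheritance rules can be checked with a small experiment. This is only a sketch: `ChildView`, `describe` and `inspectChildren` are names invented for this example. The first child overrides just the name, the second one passes an explicit Job.

```kotlin
import kotlinx.coroutines.CoroutineName
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlin.coroutines.ContinuationInterceptor

// Hypothetical record of what a child coroutine sees in its own context.
data class ChildView(
    val name: String?,
    val dispatcherInherited: Boolean,
    val jobIsNew: Boolean,
    val jobIsChildOfParent: Boolean,
)

fun inspectChildren(): List<ChildView> = runBlocking(CoroutineName("parent")) {
    val parentJob = coroutineContext[Job]!!
    val parentDispatcher = coroutineContext[ContinuationInterceptor]
    val views = mutableListOf<ChildView>()

    // Reads the context of whichever coroutine calls it.
    fun CoroutineScope.describe(): ChildView {
        val job = coroutineContext[Job]!!
        return ChildView(
            name = coroutineContext[CoroutineName]?.name,
            dispatcherInherited = coroutineContext[ContinuationInterceptor] === parentDispatcher,
            jobIsNew = job !== parentJob,
            jobIsChildOfParent = job in parentJob.children,
        )
    }

    // Overrides only the name: dispatcher is inherited, Job is created anew.
    launch(CoroutineName("child")) { views += describe() }.join()

    // Passes an explicit Job: name and dispatcher are still inherited,
    // but the new coroutine is no longer a child of the parent's Job.
    launch(Job()) { views += describe() }.join()

    views
}

fun main() {
    inspectChildren().forEach(::println)
}
```

In both cases the coroutine gets a fresh Job of its own. Passing a Job only changes which Job becomes its parent, and the dispatcher stays untouched.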
This one was a doozy. In principle, it sounds simple - exceptions in coroutines are always propagated upwards until someone handles them, either a parent coroutine or some plain old try/catch block. If there's no one to catch it, everything crashes. Seems reasonable.
However, things get complicated pretty quickly when you inevitably get to deal with coroutineScope or coroutineContext in your code, which modify error propagation rules. Here are some which I found confusing at first:
There's a concept of using the supervisorScope builder instead of the regular coroutineScope, which allows child coroutines to fail without cancelling the scope or their siblings. Even though it sounds like with this you can crash child coroutines as much as you want, that's not entirely true. The exception still has to be handled somewhere: a failed child of supervisorScope reports it to the CoroutineExceptionHandler in its context, and if there isn't one, it ends up in the thread's uncaught exception handler, which on Android is still gonna drag the whole app down with it.
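A minimal sketch of that behavior, with a handler attached so the failure has somewhere to go. The function name `runSupervised` and the event strings are made up for this example.

```kotlin
import kotlinx.coroutines.CoroutineExceptionHandler
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.supervisorScope

fun runSupervised(): List<String> = runBlocking {
    val events = mutableListOf<String>()
    val handler = CoroutineExceptionHandler { _, e -> events += "handled: ${e.message}" }

    supervisorScope {
        // Fails immediately. The handler in its own context receives the exception.
        launch(handler) { throw IllegalStateException("child failed") }

        // Keeps running, because a supervisor does not cancel siblings.
        launch {
            delay(50)
            events += "sibling finished"
        }
    }

    events += "scope finished"
    events
}

fun main() {
    runSupervised().forEach(::println)
}
```

Dropping the `handler` argument keeps the sibling alive just the same, but the exception then goes to the thread's uncaught exception handler instead.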
Getting back to coroutineContext - if you pass in a specific Job or a SupervisorJob as a context, like in launch(SupervisorJob()), it doesn't mean it's gonna do work inside of that Job - it just means it's gonna use that Job as a parent Job for a Job it's anyway gonna create. Because remember, every new coroutine has to create a new Job.
This means launch(SupervisorJob()) won't actually run the coroutine in the context of a SupervisorJob, but in a regular Job that launch creates as its child, so coroutines started inside it still cancel each other on failure. That's a really subtle way to create bugs. For this case you should use supervisorScope instead. It will create a proper supervisor scope in which all child jobs will be able to run (and crash) independently.
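Here is a sketch of that pitfall next to the supervisorScope fix. The helper `launchPair` and the function `siblingSurvives` are hypothetical names, and the empty handler is only there to keep the demo quiet.

```kotlin
import kotlinx.coroutines.CoroutineExceptionHandler
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.supervisorScope

// Hypothetical helper: starts one failing child and one slow sibling.
fun CoroutineScope.launchPair(onSiblingDone: () -> Unit) {
    launch { throw IllegalStateException("boom") }
    launch {
        delay(50)
        onSiblingDone()
    }
}

fun siblingSurvives(useSupervisorScope: Boolean): Boolean = runBlocking {
    var siblingFinished = false
    // Swallows the failure so the demo does not print a stack trace.
    val quiet = CoroutineExceptionHandler { _, _ -> }

    val outer = launch(SupervisorJob() + quiet) {
        if (useSupervisorScope) {
            // Children get a supervisor as their direct parent.
            supervisorScope { launchPair { siblingFinished = true } }
        } else {
            // Children get the regular Job of this coroutine as their parent.
            launchPair { siblingFinished = true }
        }
    }
    outer.join()
    siblingFinished
}

fun main() {
    println(siblingSurvives(useSupervisorScope = false)) // false
    println(siblingSurvives(useSupervisorScope = true))  // true
}
```

In the first call the SupervisorJob only shields whatever sits above `outer`. The two children still share a regular Job, so one failure cancels the other.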
Then we have a CoroutineExceptionHandler parameter that we can send to coroutineContext, which sounds like it's supposed to catch coroutine exceptions and prevent them from being propagated upwards. But it doesn't appear to do so - as the coroutine dies its default behavior is still to report it to a parent. How come?
This is explained nicely in the docs, but it still felt unintuitive to me:
All children coroutines delegate handling of their exceptions to their parent coroutine, which also delegates to the parent, and so on until the root, so the CoroutineExceptionHandler installed in their context is never used.
This again means that adding CoroutineExceptionHandler to a child coroutine, which you can do, doesn't change its exception handling behavior at all. You must specify it all the way up at the root coroutine (which is probably the most reasonable place to put it anyway). But:
While we're on the topic of root coroutines, let me just paste this snippet from the documentation:
Coroutines running with SupervisorJob do not propagate exceptions to their parent and are treated like root coroutines.
Again, really unintuitive and with misleading naming, but it appears to be correct in the tests I ran. It basically means CoroutineExceptionHandler should work fine in any coroutine that is a direct child of a SupervisorJob, regardless of whether it's technically a child job running as part of some other job.
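Both rules can be seen side by side in a small test. This is a sketch with made-up names (`handlerUsed`, `rootHandler`, `childHandler`). It uses a detached scope so the failure doesn't cancel `runBlocking` itself.

```kotlin
import kotlinx.coroutines.CoroutineExceptionHandler
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Job
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.supervisorScope

// Reports which handler received the exception thrown by the inner coroutine.
fun handlerUsed(underSupervisor: Boolean): String? = runBlocking {
    var used: String? = null
    val rootHandler = CoroutineExceptionHandler { _, _ -> used = "root" }
    val childHandler = CoroutineExceptionHandler { _, _ -> used = "child" }

    val scope = CoroutineScope(Job() + rootHandler)
    scope.launch {
        if (underSupervisor) {
            // Direct child of a supervisor: treated like a root coroutine,
            // so the handler in its own context is used.
            supervisorScope {
                launch(childHandler) { throw IllegalStateException("boom") }
            }
        } else {
            // Regular child: delegates to its parent, so childHandler is ignored.
            launch(childHandler) { throw IllegalStateException("boom") }
        }
    }.join()
    used
}

fun main() {
    println(handlerUsed(underSupervisor = false)) // root
    println(handlerUsed(underSupervisor = true))  // child
}
```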
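This one is well-known but I'm just adding it here for completeness: using coroutineScope.async instead of coroutineScope.launch doesn't report the exception as unhandled. It gets stored in the returned Deferred instead, and if you want to see it, you have to call await() on it, which suspends the calling coroutine (it doesn't block the thread) until the result is ready and rethrows any exception that happened. One caveat: when async runs as a child of a regular Job, its failure still cancels the parent, so this only fully applies to root coroutines and to children of a supervisor.
In general, you should use launch most of the time, unless you specifically need to await the result, which is the use case for async { }.await().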
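A short sketch of the async case, assuming the coroutine is started as a root coroutine in its own scope. The function name `awaitRethrows` is invented for this example.

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.SupervisorJob
import kotlinx.coroutines.async
import kotlinx.coroutines.runBlocking

fun awaitRethrows(): String = runBlocking {
    val scope = CoroutineScope(SupervisorJob())

    // The exception is captured in the Deferred, nothing is reported yet.
    val deferred = scope.async<Int> { throw IllegalStateException("boom") }

    try {
        // await() suspends until completion and rethrows the stored exception.
        deferred.await()
        "no exception"
    } catch (e: IllegalStateException) {
        "caught: ${e.message}"
    }
}

fun main() {
    println(awaitRethrows())
}
```

If nobody ever calls `await()` on a root `async` like this one, the exception is silently dropped, which is another reason to prefer `launch` when you don't need a result.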
Basically, if there's a TLDR to take out of all this, I'd sum it like this:
An uncaught exception in a Kotlin coroutine will crash your app unless its parent is a SupervisorJob with a CoroutineExceptionHandler attached to it.
As for the cancellation, 9 out of 10 docs mention this upfront:
Coroutine cancellation is cooperative.
This also sounds straightforward at first - it basically means that coroutines have to cooperate in order to be cancelled. If a parent coroutine scope is closing, and as such it's supposed to cancel its children, it would only be able to do so at points where the children allow it. This is a pretty nice solution - it allows the coroutines to prolong the parent scope to do their work until the end, instead of stopping them forcibly midway, potentially leaving the system in an invalid half-finished state.
By checking the CoroutineScope.isActive flag or calling ensureActive() at the appropriate place, we can mark certain parts of our suspendable functions as "safe to cancel here", preventing any lost work.
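As a rough sketch of such a "safe to cancel here" marker, consider a loop that does no suspending calls at all. The function name `countUntilCancelled` and the 100 ms wait are arbitrary choices for this example.

```kotlin
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.ensureActive
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun countUntilCancelled(): Long = runBlocking {
    var iterations = 0L
    val job = launch(Dispatchers.Default) {
        while (true) {
            // Safe point: throws CancellationException once the job is cancelled.
            ensureActive()
            // Stand-in for a chunk of non-suspending work.
            iterations++
        }
    }
    delay(100)
    // Requests cancellation and waits until the loop has actually stopped.
    job.cancelAndJoin()
    iterations
}

fun main() {
    println("stopped after ${countUntilCancelled()} iterations")
}
```

Without the `ensureActive()` call the loop would never notice the cancellation, and `cancelAndJoin()` would wait forever.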
However, a thought occurred to me. On Android, we're using viewModelScope (which is just a CoroutineScope backed by a SupervisorJob and tied to the ViewModel's lifecycle) to keep all work related to that screen under a single closeable coroutine scope. And we're relying on it to cancel all work happening there once the screen is dismissed. But we've never really implemented any code to actually allow the child coroutines to be canceled. Are we leaving some work running even after its UI is gone and creating memory leaks this way?
Luckily, this appears to work fine, at least somewhat accidentally, because all of the suspending functions in kotlinx.coroutines.* are cancellable! Basically, every time you call collect, emit or delay, they check for cancellation and create a cancellation point.
And since you're using them almost everywhere, looks like cancellation and memory cleanup work just as expected out of the box.
With that in mind, we just need to keep track of where we're using functions which can be cancellation points - for example, if you're doing some local state changes, do them either before or after emit, not scattered across both places, as you might end up without the parts happening after it if the coroutine scope closes.
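The effect is easy to reproduce with `delay` standing in for any cancellable call. In this sketch (the function name `stepsReached` and the step labels are made up) the code after the suspension point never runs.

```kotlin
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun stepsReached(): List<String> = runBlocking {
    val steps = mutableListOf<String>()
    val job = launch {
        steps += "before delay"
        // Cancellation point: the coroutine may never resume from here.
        delay(1_000)
        steps += "after delay"
    }
    // Give the child time to reach the delay, then cancel it.
    delay(50)
    job.cancelAndJoin()
    steps
}

fun main() {
    println(stepsReached()) // [before delay]
}
```

If the state change after the call must not be lost, it either has to move before the cancellation point or into a `finally` block.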
Also, in case you're running some heavy long-running code that doesn't use kotlinx.coroutines.* functions, you might want to manually ensure it's cancelled properly. But you're probably just working on a CRUD app where all coroutines are terminated by Flow.collect, so that's probably not an issue.
To a person with limited threading experience like me, a bunch of this seemed really confusing and unintuitive at first. The official docs really are exhaustive, but there are still a bit too many concepts to distill for what is advertised as just "light-weight threads".
However, there's a lot of content on the side from Roman Elizarov, Kotlin project lead, which helps shed light on the intricacies I outlined in this article. And obviously you can always just run the code yourself to see how it behaves.
The good thing in all of this is, even though there's a lot of complexity involved, you can get a pretty good experience out of it by just sticking with defaults. So overall, my minor rants aside, things are looking (and working) pretty well.