Am I correct in understanding that the main benefit is that JXCore allows you to run x number of tasks in parallel where x is the number of CPUs?
The problem is solved by running multiple node processes, which is standard deployment for node (i.e. if a machine has 8 cores then 8 node processes are started).
One issue with running multiple threads is that many developers use fail fast as a best practice when building node applications. In other words, uncaught exceptions cause the node process to fail, die, and restart. So it's perfectly acceptable practice to write your application to accept failure as a given (similar to how Netflix uses its Simian Army).
That said, how well is each thread isolated from everything else? Does an uncaught exception kill just the thread and its state, or does it kill the process? More specifically, when is the main process killed, and what is contained within the thread?
It's actually pretty silly to write node.js apps in a crash-only manner. Since they often handle thousands of connections concurrently, a failure in one client's processing is pretty horrendous if it brings down a whole server (even if the server thread gets immediately restarted). This project doesn't try to solve that either though.
In the tradeoff between a short post and a long elaboration, I erred on the side of being too terse.
Node applications can be split between stateful servers (ex. chat) and stateless (ex. API for mobile clients).
It's the stateless servers where some developers write fault tolerant / fail fast / fast restart applications. Doing so in a stateful server would be counterproductive.
Also, this is not to imply that developers are writing sloppy code that fails constantly, or that they fail to implement proper error handling. What I was stating is that unhandled exceptions are unexpected, but when they do occur they indicate something is seriously wrong. Importantly, they leave the application in an unknown state, which is difficult to recover from. In such situations, a robust approach is to let node fail, restart fast, and have clean state.
The reasons this is a sound approach are:
1. If the failure is due to a memory leak, then the graphs will highlight said leak clearly
2. The unhandled exception indicates that something is very wrong. A server restart is easily seen in the logs and is a warning that deeper inspection is necessary
3. Recovering state after an unhandled exception is difficult. In a stateless server it's better to just restart from a clean state. This assumes the engineers wrote the application to work from a clean state (i.e. after a restart there is no need to recreate state)
4. A fault tolerant architecture is good practice as disks can fail, CPUs can fail, network connections can fail, etc. In a cluster failure is expected and applications are architected to continue operation in the face of failure
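The fail fast / restart approach above usually boils down to a handler like this. A minimal sketch, assuming some external process manager (forever, systemd, a cluster master, etc.) performs the restart; the log wording is illustrative.

```javascript
// Fail fast: treat an uncaught exception as evidence that application
// state is unknown. Log it and exit; never try to resume normal
// operation. A supervisor restarts the process with clean state.
process.on('uncaughtException', (err) => {
  console.error('uncaught exception, state unknown, exiting:', err.stack);
  process.exit(1);
});
```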
I actually think I understood you, but I'm saying that in that case where statelessness should be an advantage, node.js is actually a much less fault tolerant environment when you compare it to most other web application servers. Most other web app servers offer
(1) request isolation (so most failures in one request can't break other requests) and
(2) a way to catch all exceptions/errors in a single request (and domains don't accomplish this, unless you know what to expect errors from, or wrap everything).
Since node.js doesn't offer those features, it's not even as fault tolerant as PHP was 15 years ago. I'm a huge fan of node.js, but one of the hardest things to do on a large application with a large number of users is to keep an instance of the server from restarting and dropping all the other in-progress requests. If you write your node.js code to be crash-only (like one might do with erlang) your clients are going to have a terrible time.
We had the problem you're describing for a while, but have since figured out how to avoid processes going down and interrupting other requests. Essentially, you attach a global domain, and when that domain catches an error you stop accepting new connections (obviously you have to be load-balancing between processes) and start a countdown. Some reasonable amount of time later (I think we wait 30 seconds?) you assume that any in-progress request is done and restart the process. We've found this to be very successful.
We do this too, attaching req and res objects to a domain, as well as databases and other network related objects (like smtp clients, etc). This is a huge improvement, but I'm still seeing occasional uncaught error events in our logs on a very large codebase and only in production. Some of them are just ECONNRESET events with no details given, so their origins are REALLY hard to track down. Have you got some magic for catching everything without explicitly having to find all objects that could be emitting? I'd love to hear it if so...
As soon as possible during startup, create a domain and enter it. Because entered domains form a stack, this will be a fallback if an error occurs at a place that isn't covered by any other domains.
Thanks for the comments.
>JXCore allows you to run x number of tasks in parallel where x is the number of CPUs?
You may have 8 cores but configure JXcore to use 64 threads.
> The problem is solved by running multiple node processes, which is standard deployment for node
You can still run multiple node processes, but with 2 threads each. This will improve the responsiveness of each process by balancing the load on the native side. That means if anything happens on one of the V8 threads (GC, etc.), the other one will handle the load.
> That said, how well is each thread isolated from everything else.
Totally isolated. We have already started updating native C/C++ modules for isolated multithreading.
>Does an uncaught exception kill just the thread and its state, or does it kill the process?
In this early beta release, the exception is thrown into the main thread (when it's uncaught by the thread), but the coming beta (once internal monitoring is implemented) will optionally reset the sub-thread itself.
You may simply consider each thread as a separate node.js host.
A small diagram on the home page showing x cores * y threads would be interesting / helpful. Such a diagram would also highlight that this adds to the multi-process approach (i.e. I don't have to give up my multi-process approach, but get to add multiple threads to it).