Although the nbody code runs asynchronously and is not tightly coupled to the render loop, the two do synchronize. The user specifies a desired speed scalar, such as 1000 times faster than realtime, and for every frame nbody is asked to advance the simulation by (1/fps) s * 1000, i.e. the frame time multiplied by the speed scalar.
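To make the arithmetic concrete, here is a minimal sketch of the per-frame request. The function and parameter names are made up for illustration and are not taken from the actual code base.

```python
def sim_seconds_per_frame(speed_scalar: float, fps: float) -> float:
    """Simulated seconds nbody is asked to advance for one rendered frame.

    Hypothetical helper: frame time (1/fps seconds) times the speed scalar.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    return speed_scalar / fps


# 1000x realtime at 50 fps: each frame asks for 20 simulated seconds.
print(sim_seconds_per_frame(1000, 50))  # 20.0
```

So the lower the frame rate, the bigger the chunk of simulated time requested per frame.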
The system is really designed for situations where the nbody step is slower than the render step, not the other way around, so currently a new nbody step is only launched once every n frames. When the nbody step takes multiple frames, as is commonly the case, this works well, but when the nbody step takes much less time than that, it is not optimal, because the nbody code then sits idle until the next launch.
This is why it can run faster with vsync off: frames are rendered more frequently, and therefore nbody steps are also launched more frequently.
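A toy model of that scheduling shows the effect of the frame rate. This is only a sketch under assumptions: the names, the value of n, the step duration, and the rule that a launch is skipped while the previous step is still running are all invented for the example and are not the real scheduler.

```python
def count_launches(fps: int, frames_per_launch: int, step_seconds: float,
                   duration_seconds: float = 1.0) -> tuple[int, float]:
    """Return (launches, busy_fraction) for a toy 'launch every n frames' loop.

    Assumption: a launch only happens on every n-th frame, and only if the
    previous step has already finished.
    """
    frame_time = 1.0 / fps
    total_frames = int(round(duration_seconds * fps))
    launches = 0
    busy_until = 0.0
    for frame in range(total_frames):
        now = frame * frame_time
        if frame % frames_per_launch == 0 and now >= busy_until:
            launches += 1
            busy_until = now + step_seconds
    # Fraction of wall-clock time the nbody code spent computing.
    return launches, launches * step_seconds / duration_seconds


# A fast 1 ms step, launched every 2nd frame:
print(count_launches(60, 2, 0.001)[0])   # vsync at 60 fps  -> 30 launches
print(count_launches(240, 2, 0.001)[0])  # vsync off, 240 fps -> 120 launches
```

In this model the 60 fps case keeps the nbody code busy for only about 3% of the time, which is the "not optimal" situation described above. Rendering more frames per second simply raises the number of launch opportunities.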
Generally, a very low tolerance will force the nbody code to take multiple smaller substeps internally, which makes it more accurate but also slower, so it uses the available computation time better. The question is then whether that is what you want, since there is such a thing as "accurate enough", and you may not always want "even more accurate" at the cost of 100% CPU utilization.
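To illustrate the tolerance trade-off, here is a small stand-in example. It is not the actual nbody integrator: it uses a leapfrog scheme on a single harmonic oscillator and a simple "double the substep count until two resolutions agree" rule, both chosen purely for illustration.

```python
import math


def leapfrog(x: float, v: float, dt: float, n: int) -> tuple[float, float]:
    """Advance a unit harmonic oscillator (a = -x) by n kick-drift-kick steps."""
    for _ in range(n):
        v -= 0.5 * dt * x
        x += dt * v
        v -= 0.5 * dt * x
    return x, v


def advance(x: float, v: float, total_dt: float, tolerance: float,
            max_substeps: int = 1 << 20) -> tuple[float, float, int]:
    """Double the substep count until two successive results agree within tolerance.

    Hypothetical error control, assumed for this sketch only.
    """
    n = 1
    coarse = leapfrog(x, v, total_dt, n)
    while n < max_substeps:
        n *= 2
        fine = leapfrog(x, v, total_dt / n, n)
        if (abs(fine[0] - coarse[0]) <= tolerance
                and abs(fine[1] - coarse[1]) <= tolerance):
            return fine[0], fine[1], n
        coarse = fine
    return coarse[0], coarse[1], n


# Same requested step, two tolerances. Exact answer is (cos 1, -sin 1).
for tol in (1e-2, 1e-6):
    x, v, n = advance(1.0, 0.0, 1.0, tol)
    print(tol, n, abs(x - math.cos(1.0)))
```

The tighter tolerance ends up using many more substeps for the same requested time step, and the error shrinks accordingly. Whether the extra work is worth it depends on what counts as "accurate enough" for you.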
As for native vs. managed: managed means the C# implementation of the nbody code, while native is the C++ implementation (not user-friendly names, I know), which still has managed collision-resolution parts, though. Native is generally some 3-5 times faster and should be the default mode. The nbody code is currently being re-re-rewritten to be even more pure C++, with still better performance, while dropping managed mode entirely.
Currently we are not using the GPU for computations. A long time ago we did use OpenCL for the core of nbody, but with support for multiple platforms, and now even mobile, the pure CPU implementation won out. This is likely to change eventually, but we will not provide GPU computation any time soon.
I hope some of the above made sense. If not, don't hesitate to ask again :-)