IMHO the basis of optimization is finding what is actually worth the cost of optimizing. Don't assume you know it without measuring; intuition is often wrong. In the groovebox2 example above, with a lot of GUI, it turns out the GUI cost is negligible compared to the DSP. You can benchmark objects with [cputime] and [realtime] reports to evaluate their cost. A slow module doesn't matter much if it is rarely used; on the other hand, a small gain on an object that is polled very often can save much more CPU. See the C sketch below for what those two objects actually measure.
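For those curious, here is a minimal C sketch of the same idea, outside of Pd: [cputime] corresponds to CPU time consumed by the process, [realtime] to wall-clock time elapsed. The do_work() function and the loop inside it are just placeholders standing in for whatever object or abstraction you are benchmarking, not anything from Pd itself.

```c
/* Rough sketch of what [cputime] vs [realtime] measure.
   do_work() is a placeholder for the thing being benchmarked. */
#include <stdio.h>
#include <time.h>

static void do_work(void)
{
    volatile double x = 0;
    for (long i = 0; i < 10000000; i++)
        x += i * 0.5;          /* stand-in for the object under test */
}

int main(void)
{
    clock_t cpu0 = clock();                  /* CPU time, like [cputime]   */
    struct timespec rt0, rt1;
    clock_gettime(CLOCK_MONOTONIC, &rt0);    /* wall clock, like [realtime] */

    do_work();

    clock_t cpu1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &rt1);

    double cpu_ms  = 1000.0 * (cpu1 - cpu0) / CLOCKS_PER_SEC;
    double real_ms = 1000.0 * (rt1.tv_sec - rt0.tv_sec)
                   + (rt1.tv_nsec - rt0.tv_nsec) / 1e6;
    printf("cpu: %.3f ms, real: %.3f ms\n", cpu_ms, real_ms);
    return 0;
}
```

In a patch, the equivalent is banging [cputime] / [realtime] right before the work starts and reading them right after it finishes.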
What makes things more complex in Pure Data is that the DSP and the control messages are interleaved and computed in the same thread. That means each time some processing takes too long to complete (e.g. network, file I/O, anything that can block), the DSP is rendered too late and you get a click. So it's not only the average CPU cost of each object and/or algorithm that matters, but also the instantaneous cost. That somewhat ruins what i said in the first paragraph above! That's why tracking each individual small cost can save xruns, and why optimization is such a popular topic.
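To make the "instantaneous cost" point concrete: with Pd's default block size of 64 samples at 44.1 kHz, a DSP block is due roughly every 1.45 ms (ignoring the extra slack from Pd's audio buffer / delay setting). Here is a tiny C sketch of that arithmetic; the 0.3 ms and 5 ms figures are made-up illustrative numbers, not measurements.

```c
/* Per-block time budget: if DSP + control work in one tick exceeds it
   (and the audio buffer can't absorb the spike), you hear a click.
   The costs below are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double samplerate = 44100.0;
    double blocksize  = 64.0;                             /* Pd's default */
    double budget_ms  = 1000.0 * blocksize / samplerate;  /* ~1.45 ms */

    double avg_dsp_ms    = 0.3;   /* cheap on average...               */
    double worst_ctrl_ms = 5.0;   /* ...but one file read blocks 5 ms  */

    printf("budget per block: %.2f ms\n", budget_ms);
    printf("average tick:     %.2f ms -> fine\n", avg_dsp_ms);
    printf("worst-case tick:  %.2f ms -> %s\n",
           avg_dsp_ms + worst_ctrl_ms,
           (avg_dsp_ms + worst_ctrl_ms) > budget_ms ? "click/xrun" : "fine");
    return 0;
}
```

So a patch that is cheap on average can still click because of one rare but long operation landing in the audio thread.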
Is profiling something usable by a non-programmer? TBH i think it is not. To understand the measurements (and what's going on, i.e. the main loop, scheduling, clocks...) you need to look under the hood. Even for a programmer, it is not easy. You can sometimes get wrong results when the tests are not really representative (and thus draw bad conclusions). It requires a bit of experience (trial and error).
Sometimes i forget that not everybody really wants to spend all that time for nothing!