Data from the .asm files:
For the version with intrinsics, the core loop does:
1 CMPSP, 1 DADDSP, 2 FADDSP, 2 FSUBSP, 4 ADD and 9 MV
For the version without intrinsics, the core loop does:
4 MPYSP, 3 FADDSP, 3 FSUBSP, 5 ADD and 7 MV
So the version with intrinsics uses 19 ops per pass through the core loop, while the no-intrinsics version uses 22 ops, yet both take the same time.
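
For anyone who wants to double-check the totals, here is a minimal sketch in plain C that just adds up the per-instruction counts listed above. The array and function names are made up for illustration and are not taken from the .asm files; the counts themselves are copied from the two lists.

```c
#include <stddef.h>

/* Per-loop instruction counts copied from the lists above.
   The names here are hypothetical, chosen only for this sketch. */
static const int with_intrinsics[]    = {1, 1, 2, 2, 4, 9}; /* CMPSP, DADDSP, FADDSP, FSUBSP, ADD, MV */
static const int without_intrinsics[] = {4, 3, 3, 5, 7};    /* MPYSP, FADDSP, FSUBSP, ADD, MV */

/* Sum n instruction counts to get the total ops in one pass of the core loop. */
static int total_ops(const int *counts, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; ++i) {
        total += counts[i];
    }
    return total;
}
```

Calling total_ops(with_intrinsics, 6) and total_ops(without_intrinsics, 5) gives the two totals being compared. This only counts instructions; it says nothing about how they are scheduled across the functional units.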