I have not fully debugged the precision issue.
I certainly do have the basic routines working, but the precision issue is another matter. However, I am slowly working through some of the preliminaries. What I definitely won't be able to handle myself is explaining why specific sets of Chebyshev coefficients deliver the precision they do. I am vastly too crappy at math for something as esoteric as that.
Right now I'm working to make sure I have everything completely uniform, and I'm not dealing with range issues. Every set of Chebyshev coefficients is intentionally designed to cover a certain range, typically pi/4 (45 degrees), pi/2 (90 degrees) or pi (180 degrees). The more you restrict your range, the fewer coefficients you need to get a specific level of precision. At least that's the theory! And indeed, if you push input angles beyond pi/4 into a routine with pi/4 coefficients, the precision goes down the tubes very, very fast.
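To put rough numbers on how fast precision falls off outside the design range, here is a minimal sketch. It uses a degree-7 Taylor polynomial as a stand-in, NOT any of the Chebyshev coefficient sets discussed here (a Chebyshev fit spreads its error differently), and the names sin_poly7 and abs_error are made up for the illustration.

```python
import math

# Degree-7 Taylor polynomial for sine, evaluated with Horner's rule.
# ASSUMPTION: Taylor coefficients are only a stand-in here; they are not
# the Chebyshev coefficient sets from the post.
def sin_poly7(x):
    x2 = x * x
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))))

def abs_error(x):
    # Absolute difference from the library sine.
    return abs(sin_poly7(x) - math.sin(x))

# Same polynomial, three different input ranges.
for label, x in (("pi/4", math.pi / 4), ("pi/2", math.pi / 2), ("pi", math.pi)):
    print(f"{label:5s} error = {abs_error(x):.3e}")
```

With this stand-in, the error at pi/2 is a few hundred times worse than at pi/4, and at pi the result is not usable at all.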
It appears all the 7-coefficient sets are for pi/4. The usual scam is to check for angles greater than pi/4, and when you find one, compute the cosine of (pi/2 - angle) instead of the sine of the angle (or the sine of (pi/2 - angle) instead of the cosine). That's no problem when you're computing the sine or cosine of one value, but when you're computing 4 in parallel, well... the question becomes "how do I compute sines of some of my 4 input angles and cosines of others (an arbitrary mix)?"
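For a single value, the pi/4 trick can be sketched as below. This is a scalar illustration only, assuming inputs already reduced to 0..pi/2. The polynomials sin_small and cos_small are Taylor stand-ins for whatever pi/4 coefficient set is actually in use, and all function names are hypothetical.

```python
import math

QUARTER_PI = math.pi / 4
HALF_PI = math.pi / 2

# ASSUMPTION: Taylor stand-ins for a real pi/4 coefficient set.
def sin_small(x):
    # Only meant for 0 <= x <= pi/4.
    x2 = x * x
    return x * (1 + x2 * (-1 / 6 + x2 * (1 / 120 + x2 * (-1 / 5040 + x2 * (1 / 362880)))))

def cos_small(x):
    # Only meant for 0 <= x <= pi/4.
    x2 = x * x
    return 1 + x2 * (-1 / 2 + x2 * (1 / 24 + x2 * (-1 / 720 + x2 * (1 / 40320))))

def sin_0_to_half_pi(x):
    # Above pi/4, use sin(x) = cos(pi/2 - x) so the polynomial stays in range.
    if x > QUARTER_PI:
        return cos_small(HALF_PI - x)
    return sin_small(x)

def cos_0_to_half_pi(x):
    # Above pi/4, use cos(x) = sin(pi/2 - x) for the same reason.
    if x > QUARTER_PI:
        return sin_small(HALF_PI - x)
    return cos_small(x)

print(sin_0_to_half_pi(math.pi / 3), math.sin(math.pi / 3))
```

The branch is the part that does not carry over to 4 values in parallel: each lane may need to go a different way.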
That's where I'll be headed soon enough, even if I don't adopt any pi/4 coefficient sets, because I want to at least provide functions to compute:
sin_sin_sin_sin()
cos_cos_cos_cos()
sin_sin_cos_cos()
cos_cos_sin_sin()
sin_cos_sin_cos()
cos_sin_cos_sin()
However, one way to handle input angles beyond the design limit for the coefficients is to have the ability to flip the computation of any of the input angles from sine to cosine or cosine to sine. Given those screwball techniques I adopted in the earlier range reduction portions, this should not be too difficult. I'll just have to load both sine and cosine versions of each coefficient, then select one or the other based upon a mask that is all 0s for sine and all 1s for cosine.
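The per-lane selection can be modelled in scalar code as a bitwise select on the 64-bit patterns of the doubles, which is the kind of operation a conditional-move instruction like vpcmov performs across a whole register. This is only a sketch: the helper names are hypothetical, and the two coefficient values (-1/6 and -1/2) are placeholder Taylor terms, not real Chebyshev coefficients.

```python
import struct

ALL_ONES = 0xFFFFFFFFFFFFFFFF  # mask value meaning "cosine" for a lane
ALL_ZEROS = 0x0                # mask value meaning "sine" for a lane

def to_bits(x):
    # Reinterpret a double as a 64-bit unsigned integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def from_bits(b):
    # Reinterpret a 64-bit unsigned integer as a double.
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def select_lanes(sine_vals, cosine_vals, masks):
    # Per lane: keep the sine bits where the mask is 0, the cosine bits where it is 1.
    out = []
    for s, c, m in zip(sine_vals, cosine_vals, masks):
        picked = (to_bits(s) & ~m & ALL_ONES) | (to_bits(c) & m)
        out.append(from_bits(picked))
    return out

# ASSUMPTION: placeholder coefficients, the same value broadcast to all 4 lanes.
sine_coeff = [-1.0 / 6.0] * 4
cosine_coeff = [-0.5] * 4

# Mask layout for a sin_cos_sin_cos() style call.
masks = [ALL_ZEROS, ALL_ONES, ALL_ZEROS, ALL_ONES]
mixed = select_lanes(sine_coeff, cosine_coeff, masks)
print(mixed)
```

Because the mask is all 0s or all 1s per lane, the select copies one value or the other bit for bit, so no lane's coefficient loses any precision.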
I must say this. Those vcmppd and vpcmov instructions are absolutely brilliant. Without them, I'd be totally screwed. With them, I can do just about anything I need to customize what happens to each of the 4 elements in each ymm register.