Intrinsic compiled to 2 assembler instructions instead of 1
-
Tuesday, July 10, 2012 5:44
I'm using intrinsics for speed optimization.
The following intrinsic should compile to 1 assembler instruction, but the compiler always generates two. I tried all kinds of variations, but nothing works.
The problem seems to be the short pointer.
(With int pointers and _mm_insert_epi32, the intrinsic is compiled into 1 instruction.)
Is this a compiler issue, or is there another way to get it to compile to 1 instruction?
Thanks,
Jan
unsigned short *ad1;
__m128i si0;
si0 = _mm_insert_epi16(si0, *(ad1), 2);
Gets compiled to:
movzx ecx, WORD PTR [edi]
pinsrw xmm0, ecx, 2
Instead of
pinsrw xmm0, WORD PTR [edi], 2
All replies
-
Tuesday, July 10, 2012 8:45
_mm_insert_epi16 expects an int, but you supply a short. It seems the compiler makes up for that by putting your short into a 32-bit register first, and then performing the operation.
What happens if you supply your 16 bit value to _mm_insert_epi16 in an actual int? I suspect the movzx will disappear because there is no longer a mismatch.
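A minimal sketch of that suggestion, for anyone who wants to try it. The helper names insert_lane2_via_int and lane2 are made up for illustration and are not from the thread. The explicit int holds the same value that the implicit promotion of *ad1 would produce, so the result is identical; whether the movzx actually goes away depends on the compiler and is not something this sketch can promise.

```cpp
#include <emmintrin.h>  // SSE2: _mm_insert_epi16, _mm_extract_epi16
#include <cassert>

// Hypothetical helper (not from the thread): widen the 16-bit source to an
// int explicitly, then insert it into 16-bit lane 2 of v.
inline __m128i insert_lane2_via_int(__m128i v, const unsigned short* ad1) {
    assert(ad1 != nullptr);
    const int widened = *ad1;                // same value as the implicit promotion
    return _mm_insert_epi16(v, widened, 2);  // lane index must be a compile-time constant
}

// Hypothetical helper: read lane 2 back (zero-extended) so the result can be checked.
inline int lane2(__m128i v) {
    return _mm_extract_epi16(v, 2);
}
```

Comparing the assembly listing for this helper against the original one-liner shows whether the explicit int makes any difference on a given compiler version.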
- Edited by Bruno van Dooren Tuesday, July 10, 2012 8:46
-
Tuesday, July 10, 2012 10:12
Try this workaround:
si0 = _mm_insert_epi16(si0, (__int32&)*ad1, 2);
Now it should generate one instruction and actually use only the low half of (__int32&)*ad1.
For better investigation, report to https://connect.microsoft.com/VisualStudio/Feedback.
- Edited by Viorel_MVP Tuesday, July 10, 2012 19:15
- Proposed as answer by Helen Zhao (Moderator) Wednesday, July 11, 2012 3:21
- Voted as helpful by Helen Zhao (Moderator) Tuesday, July 17, 2012 2:46
-
Thursday, July 12, 2012 12:22
With
si0 = _mm_insert_epi16(si0, (__int32&)*ad1, 2);
This results in
mov ecx, DWORD PTR [esi]
pinsrw xmm0, ecx, 2
So still 2 instructions, and also incorrect as now a DWORD is read from a WORD address.
- Edited by JanVliet Thursday, July 12, 2012 12:32
-
Thursday, July 12, 2012 12:26
I can't supply an int, as the value is read from a short array.
To me it looks like a bug in _mm_insert_epi16; it should have been made to insert shorts, not ints.
- Edited by JanVliet Thursday, July 12, 2012 12:33
-
Thursday, July 12, 2012 12:33
> I'm using intrinsics for speed optimization.
> The following intrinsic should compile to 1 assembler instruction, but the compiler always generates two. I tried all kinds of variations, but nothing works.
Jan,
Is the 1-instruction form perhaps slower (on some CPUs) than the 2-instruction form?
If you know it's not, I suggest that you check it against the current VS2012 beta and, if the issue persists, submit a bug report on it at
http://connect.microsoft.com/VisualStudio
Dave
-
Thursday, July 12, 2012 12:59
Hi Dave,
I don't know how much slower the 2 instructions are versus 1 instruction.
At the least, the 2-instruction form requires more instruction cache and more instruction decoding.
In micro-op terms they might end up the same, and on modern processors with a micro-op cache it might not make a difference; I'm just speculating here.
We target a lot of different processors, so optimal instruction encoding is needed.
The code mix is too complicated, so we will not go as far as using assembler instead of intrinsics.
Also, I'm reporting this for VS2010 as this is our current development platform.
-
Thursday, July 12, 2012 13:21
> I don't know how much slower the 2 instructions are versus 1 instruction.
Unless you know for sure, you might be wasting someone's time in reporting this.
Intuitively one would expect that the 1-instruction form ought to be faster, but you can never be sure on modern processors.
Is there any difference if you set the compiler switch to optimize for size rather than speed?
> Also, I'm reporting this for VS2010 as this is our current development platform.
Microsoft are unlikely to make changes for things that are non-critical in VS2010 now that VS2012 is almost finished.
Dave
-
Thursday, July 12, 2012 13:58
I would be happy to be surprised by the single instruction version being significantly faster.
Anyway, knowing whether the single-instruction form is faster would be interesting theoretically, but in practice it will be of little help with VS2010, as it now looks.
Optimizing for size has no effect on the issue.
Thanks for the input, we can close this one.
-
Thursday, July 12, 2012 16:11
> I would be happy to be surprised by the single instruction version being significantly faster.
> Anyway, knowing whether the single-instruction form is faster would be interesting theoretically, but in practice it will be of little help with VS2010, as it now looks.
FWIW, the code generated with VS2012 is almost identical, it just uses a different register in my test:
movzx eax, WORD PTR [eax]
pinsrw xmm0, eax, 2
I've tried playing with different optimisation settings, but they've made no difference.
I'm afraid I have no idea which would be the most efficient though!
Dave
-
Thursday, July 12, 2012 18:23
> With
> si0 = _mm_insert_epi16(si0, (__int32&)*ad1, 2);
> This results in
> mov ecx, DWORD PTR [esi]
> pinsrw xmm0, ecx, 2
> So still 2 instructions, and also incorrect as now a DWORD is read from a WORD address.
Unfortunately the workaround seems to help only in case of 64-bit compilation.
-
Friday, July 13, 2012 2:54
Thanks for reporting this issue. As someone pointed out, the intrinsic expects its second operand to be of type int, so the compiler inserts a conversion before calling the intrinsic. One workaround is:
si0 = _mm_insert_epi16(si0, *(int *)ad1, 2);
This will generate 1 instruction:
pinsrw xmm0, DWORD PTR [eax], 2
Though it uses "DWORD PTR", pinsrw will only insert the low word from [eax] into xmm0.
We will address this issue in the future.
Charles
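The "only the low word is inserted" part can be checked at the intrinsic level with a small sketch like the one below. The helper names are hypothetical, and it passes a plain int value rather than casting a pointer, so it says nothing about whether the wider memory read in the workaround is safe; it only shows that the upper 16 bits of the int operand are discarded.

```cpp
#include <emmintrin.h>  // SSE2: _mm_set1_epi16, _mm_insert_epi16, _mm_extract_epi16

// Hypothetical helper: insert a full 32-bit value into 16-bit lane 2 and
// return what actually ended up in that lane (zero-extended).
inline int low_word_after_insert(int value) {
    __m128i v = _mm_set1_epi16(-1);     // every lane starts as 0xFFFF
    v = _mm_insert_epi16(v, value, 2);  // only bits 0..15 of value are used
    return _mm_extract_epi16(v, 2);
}

// Hypothetical helper: confirm the lanes on either side of lane 2 still
// hold 0xFFFF, i.e. the upper half of value did not spill into them.
inline bool neighbours_untouched(int value) {
    __m128i v = _mm_set1_epi16(-1);
    v = _mm_insert_epi16(v, value, 2);
    return _mm_extract_epi16(v, 1) == 0xFFFF &&
           _mm_extract_epi16(v, 3) == 0xFFFF;
}
```

For example, low_word_after_insert(0x12345678) yields 0x5678, and the neighbouring lanes are left alone.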
-
Friday, July 13, 2012 16:41
Both `(__int32&)*ad1` and `*(int *)ad1` violate the aliasing rules defined in the C++ standard and consequently invoke undefined behavior. As such, I don't think they're appropriate workarounds, even if they happen to generate valid machine code with the current VC++ compilers...
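For comparison, here is a sketch of a load that stays within the rules: it copies exactly two bytes with std::memcpy, so nothing is read through a pointer of the wrong type or past the end of the 16-bit object. The names load_u16 and insert_lane2_strict are made up for this example. It makes no promise about how many instructions the compiler emits; for an actual unsigned short array, plain *ad1 as in the original question is equally well-defined, and memcpy mainly matters when the source is a raw byte buffer.

```cpp
#include <emmintrin.h>  // SSE2: _mm_insert_epi16
#include <cstdint>
#include <cstring>

// Hypothetical helper: read one 16-bit value by copying exactly
// sizeof(std::uint16_t) bytes from p into a local of the right type.
inline std::uint16_t load_u16(const void* p) {
    std::uint16_t w;
    std::memcpy(&w, p, sizeof w);
    return w;
}

// Hypothetical helper: well-defined replacement for the pointer-cast
// workarounds; inserts the 16-bit value at ad1 into lane 2 of v.
inline __m128i insert_lane2_strict(__m128i v, const unsigned short* ad1) {
    return _mm_insert_epi16(v, load_u16(ad1), 2);
}
```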

