Integrating SSE2 with .NET via C++/CLI and SSE2 Intrinsics

Hi there,
I'm embarking on the adventurous course of trying to accelerate a simulation application written entirely in C#.NET using SSE2. So far I've spent a few days looking at the feasibility of using SSE2 in a .NET application. From what I've read it looks pretty much impossible, but I would like to post here and see what you guys think. I am creating managed double[] arrays in C#.NET, passing them to C++/CLI (pinning the pointers in the process) and from there to unmanaged code that performs SIMD instructions via SSE2 intrinsics. All well and good; however, the pinned pointers to the arrays are not 16-byte aligned, so I get an access violation exception when running my code in x86 mode. Curiously, running in x64 there is no problem. The speed results are not fantastic, but the x64 version gets a 2x speed improvement over .NET for double-precision data computed by SSE2 intrinsics. This is enough to make the venture worth it...
I don't suppose there is any way to declare a managed array as 16-byte aligned, or to marshal the array to be 16-byte aligned (without a full copy, please!), or any other solution to the above, is there?
Alternatively, is Microsoft planning some vector primitives for use directly in .NET? (Rather like the Mono SIMD primitives, which are so widely acclaimed.)
Thanks!
Answers

Wait a sec, I think I've found a viable workaround. I tried doing what I suggested above, buffering the input via a structure of aligned doubles, but it was slow (actually slower than .NET at performing the same computation. Yikes).
So instead I searched through the MSDN documentation on intrinsics and found two functions that let you load/store unaligned data to __m128d registers. The following now works on x86/x64 and gives the right answers. I take a performance hit, but it's still way faster than .NET for the same execution. (Timings are around 160ms for .NET, 80ms for SIMD if the data is 16-byte aligned, and 93ms if I use the unaligned workaround below.) The SSE2 intrinsic _mm_loadu_pd() loads from a pointer to an unaligned double array into a __m128d register; similarly, _mm_storeu_pd() stores a __m128d register out through a pointer to an unaligned double array. I must say, overall I'm pretty impressed by how the JIT optimises code; it certainly gives you a run for your money. I also realise this SIMD implementation below is a bit basic, and other implementations could be optimised for latencies, but it seems to work, which is great as this is just a proof of concept. Any comments/suggestions? Thanks for your help, guys. Andrew

// Performs the square root of the array using SSE2 intrinsics.
// NOTE: input and output are pinned pointers to managed double arrays.
void SSE2IntrinsicsDoSquareRoot(double * input, double * output, int arraySize,
                                double & maxValue, double & minValue)
{
    __m128d coeff = _mm_set1_pd(2.8);       // coeff[0, 1] = 2.8
    __m128d tmp;
    __m128d min128 = _mm_set1_pd(FLT_MAX);  // min128[0, 1] = FLT_MAX
    __m128d max128 = _mm_set1_pd(FLT_MIN);  // max128[0, 1] = FLT_MIN
    __m128d source;

    // Loop over N elements of input[] and output[]
    // in blocks of 2 (128 bits)
    for (register int i = 0; i < arraySize; i += 2)
    {
        // Load the unaligned data into __m128d register 'source'
        // source[0, 1] = input[0], input[1]
        // This incurs a perf hit of approx ~15% but is still way faster than managed code
        source = _mm_loadu_pd(input);

        // Perform the computation
        source = _mm_mul_pd(source, coeff);  // source = source * coeff
        tmp    = _mm_sqrt_pd(source);        // tmp = sqrt(source)
        min128 = _mm_min_pd(tmp, min128);    // min128 = elementwise min(tmp, min128)
        max128 = _mm_max_pd(tmp, max128);    // max128 = elementwise max(tmp, max128)

        // Retrieve the output data from the register and
        // store it through the unaligned pointer
        _mm_storeu_pd(output, tmp);

        output += 2;
        input  += 2;
    }

    // Union of a 16-byte-aligned double array (size 2) and an SIMD register
    union u
    {
        __m128d m;
        __declspec(align(16)) double f[2];
    } x;

    // Extract minimum and maximum values from min128 and max128
    x.m = min128;
    minValue = min(x.f[0], x.f[1]);
    x.m = max128;
    maxValue = max(x.f[0], x.f[1]);
}
 Proposed as answer by Reed Copsey, Jr, MVP Thursday, January 28, 2010 4:38 PM
 Marked as answer by Wesley Yao Tuesday, February 02, 2010 3:36 AM
All replies

Unfortunately, I haven't seen a way around this.
The best option I've found is to use a native array instead of a managed array. You can make a managed wrapper for your native array type in C++/CLI, so C# can create and access the elements, but then you do your processing in completely native code on your 16-byte-aligned array.
This seems to work reasonably well, but it does add some overhead in the wrapping. However, I've had pretty good luck with this in general.
As for Microsoft's plans for a SIMD alternative: nothing has been announced (at least not that I've seen).
Reed Copsey, Jr.  http://reedcopsey.com 
I don't think it is so curious that it works fine on a system that has a 64-bit bus width, because doubles are 64-bit types. There is always this alignment/packing issue when the bus width doubles.
When you pass an odd number of doubles... is that when this issue crops up? (Obviously, an even number of doubles will always be 16-byte aligned, and an odd number will not be. Is it possible to pad the number to be even?)
In VB, at least... you can use structures for this type of thing; you have control over the packing and alignment with structures. I do not see an equivalent for arrays, but... they have always been problematic, because they really are not the same thing as structures, and high-level languages want to pass the array size along with the array.
I would try to create a structure that passes a size, taking the example from the various structures that Windows uses. For example, DEVMODE is a structure that has common elements as well as vendor-defined elements, so... its size varies.
 Edited by jinzai Wednesday, January 27, 2010 6:20 PM (sticky 'i' key)

Hi everybody, thanks for your replies. Well, I thought it was curious! I'm new to SIMD programming.
Anyway, I was thinking last night, and yes, I thought perhaps I will have to do as Reed said and create the data in C++ as an aligned array. After all, most of the processing is done there. However, how could I wrap this as efficiently as possible (without mem copies)? I suppose if I created a generic to wrap an array of type T and put an indexed property on it (and a get-enumerator), would that do? A bit of a hack, but a workaround nonetheless. For jinzai: although I am primarily targeting x64, I would rather be able to make an x86 implementation too. In both cases the number of doubles is even (50,000, which is both even and divisible by 16).
Another idea I had was to pass the data in two stages. Here, I'll post some code below so you can see my test application (all it does is square-root an array, returning the result and max/min during the operation; pointless, but it proves the concept). I've commented in the SIMD implementation where I think I could load data from my managed array into a 16-byte-aligned struct on the stack (in a register?) and then into SIMD registers. Any comments/suggestions on how I could hack this to work on x86/x64 would be appreciated (preferably without creating arrays in C++, but I will if I have to). Go easy, I'm a bit green with SIMD! Thanks!! Andrew

Here's the C# implementation:

// Outer loop is just used for micro-benchmark purposes
for (int loops = 0; loops < numLoops; loops++)
{
    // Loop over the input double[] array
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        // Square root * 2.8 and store in the result
        resultArray[i] = Math.Sqrt(initialArray[i] * 2.8);

        // Keep track of min/max
        min = Math.Min(resultArray[i], min);
        max = Math.Max(resultArray[i], max);
    }
}
And the equivalent C# to C++/CLI to SIMD version:

//
// C# implementation: calls the SSE code via C++/CLI
//
// Previously declared ...
// double [] initialArray = new double[ARRAY_SIZE];
// double [] resultArray  = new double[ARRAY_SIZE];

// Create an instance of my C++/CLI wrapper for the SSE code
SSECppCliTest.SSESquareRootWrapper rooter = new SSECppCliTest.SSESquareRootWrapper();

// Outer loop for micro-benchmark purposes only
for (int loops = 0; loops < numLoops; loops++)
{
    // Perform the square root (passing in arrays and out max/min)
    rooter.DoSquareRoot(initialArray, resultArray, ref max, ref min);
}

//
// C++/CLI and SIMD code
//
// Here's where my SSESquareRootWrapper.DoSquareRoot call goes
void SSESquareRootWrapper::DoSquareRoot(array<double> ^ inputArray, array<double> ^ outputArray,
                                        double % maxVal, double % minVal)
{
    // Pin the input/output
    pin_ptr<double> ptrInput  = &inputArray[0];
    pin_ptr<double> ptrOutput = &outputArray[0];

    double minValue, maxValue;

    // Call the native function to do the SSE implementation
    SSE2IntrinsicsDoSquareRoot((double*)ptrInput, (double*)ptrOutput,
                               inputArray->Length, maxValue, minValue);

    maxVal = maxValue;
    minVal = minValue;
}

#pragma unmanaged

// And here's my unmanaged SSE implementation of the square root
void SSE2IntrinsicsDoSquareRoot(double * input, double * output, int arraySize,
                                double & maxValue, double & minValue)
{
    __m128d coeff = _mm_set1_pd(2.8);       // coeff[0, 1] = 2.8
    __m128d tmp;
    __m128d min128 = _mm_set1_pd(FLT_MAX);  // min128[0, 1] = FLT_MAX
    __m128d max128 = _mm_set1_pd(FLT_MIN);  // max128[0, 1] = FLT_MIN

    // Here we are accessing our pinned pointer to a managed double
    // array and loading it in an SIMD register; this seems to be fine (so far)
    __m128d* pSource = (__m128d*) input;    // pSource = (input[0], input[1])
    __m128d* pDest   = (__m128d*) output;   // pDest = (output[0], output[1])

    // Loop over N elements of input[] and output[]
    // in blocks of 2 (128 bits)
    for (register int i = 0; i < arraySize; i += 2)
    {
        //
        // NOTE:
        // At this point here I get an access violation exception
        // when calling in 32-bit mode.
        //
        // Another interesting feature: if I run in 32-bit mode
        // via the debugger, it works, but without debugging, no.
        //
        // Perhaps here I could do a little hack and load my
        // pinned pointer to managed double into a 16-byte-
        // aligned structure before passing to the SIMD function?
        //
        tmp    = _mm_mul_pd(*pSource, coeff);   // tmp = *pSource * coeff
        *pDest = _mm_sqrt_pd(tmp);              // *pDest = sqrt(tmp)
        min128 = _mm_min_pd(*pDest, min128);    // min128 = elementwise min(*pDest, min128)
        max128 = _mm_max_pd(*pDest, max128);    // max128 = elementwise max(*pDest, max128)

        pSource++;  // pSource = (input[i], input[i+1])
        pDest++;    // pDest = (output[i], output[i+1])
    }

    // Extract minimum and maximum values from min128 and max128
    union u
    {
        __m128d m;
        double f[2];
    } x;

    x.m = min128;
    minValue = min(x.f[0], x.f[1]);
    x.m = max128;
    maxValue = max(x.f[0], x.f[1]);
}

#pragma managed



What's hilarious is I was just googling on this subject and I found my own post! So, err, not a lot out there in internet land on this topic.
I think theoretically it would be possible to create a C++/CLI class (or structure) that internally used SSE; for instance, Vector3, Vector4, Matrix4x4, etc. These are the sorts of constructs I need. It would work better with structs, as I believe you can specifically set the alignment on a structure.
The only problem is that .NET works across CPU architectures, whereas SSE is architecture-specific; it goes against the grain of what .NET is about. For the sort of work I'm doing (mainly research) it's not really a problem, but for a mass-released piece of code it would be.
You could have multiple implementations of the internals of, say, a Vector4 and select the right one at runtime depending on SSE availability, but that would probably introduce a performance hit that would make your optimisations irrelevant. Alternatively, there must be some way of JIT-compiling "the right Vector3". If only I knew how! ...
For instance, there are ways in .NET to dynamically compile an assembly at runtime. What if you had in your SIMD library (C++/CLI) some way of checking SSE availability and then dynamically compiling the right code to expose Vector3/Vector4 with SSE, etc.?
Just thinking out loud. If anyone knows  please post! :D 
Right, I did a bit more investigation into this.
I created Vector3 and Matrix3x3 classes in C++/CLI, each of which had a pointer to an array for the values (double *), allocated using _mm_malloc to align the data. I implemented the * operator on the Matrix class to multiply a matrix by a vector, returning a vector. Inside that I called a C function (in a separate file) that did the operation using SSE2. The SSE code was as follows:
double * Matrix3x3_Vector3_Mul64_SSE2(double * mat3x3d, double * vec3d)
{
    /*****************************************
     * Matrix format, mat3x3d = M11 M12 M13 (elements 0,1,2)
     *                          M21 M22 M23 (elements 3,4,5)
     *                          M31 M32 M33 (elements 6,7,8)
     *
     * Vector3 format, vec3d = X (element 0)
     *                         Y (element 1)
     *                         Z (element 2)
     *****************************************/

    // 16-byte-aligned version of "new double[4]". Last element is redundant
    double * pResult = (double*)_mm_malloc(4 * sizeof(double), 16);

    __m128d xy     = _mm_load_pd(vec3d);        // Contains vec3 X, Y
    __m128d zz     = _mm_load1_pd(vec3d + 2);   // Contains vec3 Z, Z
    __m128d m11m12 = _mm_load_pd(mat3x3d);      // Contains mat3x3 M11, M12
    __m128d m13m21 = _mm_load_pd(mat3x3d + 2);  // Contains mat3x3 M13, M21
    __m128d m22m23 = _mm_load_pd(mat3x3d + 4);  // Contains mat3x3 M22, M23
    __m128d m31m32 = _mm_load_pd(mat3x3d + 6);  // Contains mat3x3 M31, M32

    SimdDouble temp;

    // temp = [X, X] * [m11, m21] = [X.m11, X.m21]
    // unpacklo takes X from xy and stores it in one __m128d register as [X, X]
    // shuffle takes argument 0x02 = binary 0000 0010: get m11m12.0 and m13m21.1
    temp.m = _mm_mul_pd(_mm_unpacklo_pd(xy, xy), _mm_shuffle_pd(m11m12, m13m21, 0x02));

    // temp = [X.m11, X.m21] + [Y, Y] * [m12, m22]
    //      = [X.m11, X.m21] + [Y.m12, Y.m22]
    // unpackhi takes Y from xy and stores it in one __m128d register as [Y, Y]
    // shuffle takes argument 0x01 = binary 0000 0001: get m11m12.1 and m22m23.0
    temp.m = _mm_add_pd(temp.m, _mm_mul_pd(_mm_unpackhi_pd(xy, xy), _mm_shuffle_pd(m11m12, m22m23, 0x01)));

    // temp = [X.m11, X.m21] + [Y.m12, Y.m22] + [Z, Z] * [m13, m23]
    //      = [X.m11, X.m21] + [Y.m12, Y.m22] + [Z.m13, Z.m23]
    // zz is already stored as [Z, Z]
    // shuffle takes argument 0x02 = binary 0000 0010: get m13m21.0 and m22m23.1
    temp.m = _mm_add_pd(temp.m, _mm_mul_pd(zz, _mm_shuffle_pd(m13m21, m22m23, 0x02)));

    // temp now contains the first two rows of the result,
    // so store temp in pResult[0, 1]
    // pResult[0] = X.m11 + Y.m12 + Z.m13
    // pResult[1] = X.m21 + Y.m22 + Z.m23
    pResult[0] = temp.f[0];
    pResult[1] = temp.f[1];

    // Compute the last row of the result...
    // temp = [X.m31, Y.m32]
    temp.m = _mm_mul_pd(m31m32, xy);

    // No point using SIMD to compute the M33 term as there is no optimisation to make.
    // Compute directly:
    // pResult[2] = X.m31 + Y.m32 + (Z.m33)
    pResult[2] = temp.f[0] + temp.f[1] + (vec3d[Z] * mat3x3d[M33]);
    /* pResult[2] = _mm_add_sd(_mm_add_sd(t4, t6), t5); */

    return pResult;
}
I then created Vector3 and Matrix3x3 classes in pure C#, implemented the same operation (Vec = Mat * Vec), and wrote a speed test to compare the two. The result: the C++/CLI SIMD performance is appalling compared to vanilla C#. Here are the results of 10,000,000 matrix/vector multiplies. Yes, that is 20 times slower for the SIMD implementation vs vanilla C#.
C# Time:   00:00:02.2298309
SIMD Time: 00:00:51.1679630
Speedup:   0.0435786529160834
256663389.878927 (checksum, = Sum{X+Y+Z}i)
256663389.878927 (checksum, = Sum{X+Y+Z}i)
Although I'm not 100% sure of the reason for the terrible performance, I can hazard a guess that it's not due to the SIMD implementation per se, but possibly due to crossing the boundary from managed to native code so often.
This sort of method, creating primitives in C++/CLI that can be called from C# and internally use SIMD, is not going to work. I think the soonest something like this can be done is when (if?) Microsoft implements some SIMD primitives in the JIT, rather like how Mono.SIMD works, where various primitives cause SIMD code to be inserted directly into the output by the JIT compiler.
In conclusion, I think the above method of creating primitives is useless, but if you wanted to, say, operate on an entire 1024x1024 image doing some operation, then SIMD will still help. The time is being spent (I think) crossing the managed/native boundary very regularly. This needs to be minimised.
That's why the first example I posted provides a decent speed improvement, while the second does not. In the first example a large array is passed to C++ where it is then operated on in its entirety, with plenty of maths operations to keep the CPU busy (and outweigh any marshalling issues).
Hope this thread has been informative. A little disappointing, but at least I know a little more about when I can and cannot use SIMD.
I have used Reed Copsey, Jr's suggestion; it works OK as long as you use the blitting functions to read the arrays. Here's an example: a memory-aligned double array.
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;

namespace AlignedArray
{
    public class AlignedDoubleArray : IDisposable
    {
        private byte[] buffer;
        private GCHandle bufferHandle;
        private IntPtr bufferPointer;
        private readonly int length;

        public AlignedDoubleArray(int length, int byteAlignment)
        {
            this.length = length;
            buffer = new byte[length * sizeof(double) + byteAlignment];
            bufferHandle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
            long ptr = bufferHandle.AddrOfPinnedObject().ToInt64();
            // round up ptr to nearest 'byteAlignment' boundary
            ptr = (ptr + byteAlignment - 1) & ~(byteAlignment - 1);
            bufferPointer = new IntPtr(ptr);
        }

        ~AlignedDoubleArray()
        {
            Dispose(false);
        }

        protected void Dispose(bool disposing)
        {
            if (bufferHandle.IsAllocated)
            {
                bufferHandle.Free();
                buffer = null;
            }
        }

        #region IDisposable Members

        public void Dispose()
        {
            Dispose(true);
        }

        #endregion

        public double this[int index]
        {
            get { unsafe { return GetPointer()[index]; } }
            set { unsafe { GetPointer()[index] = value; } }
        }

        public int Length
        {
            get { return length; }
        }

        public void Write(int index, double[] src, int srcIndex, int count)
        {
            if (index < 0 || index >= length)
                throw new IndexOutOfRangeException();
            if ((index + count) > length)
                count = Math.Max(0, length - index);
            System.Runtime.InteropServices.Marshal.Copy(
                src, srcIndex,
                new IntPtr(bufferPointer.ToInt64() + index * sizeof(double)), count);
        }

        public void Read(int index, double[] dest, int dstIndex, int count)
        {
            if (index < 0 || index >= length)
                throw new IndexOutOfRangeException();
            if ((index + count) > length)
                count = Math.Max(0, length - index);
            System.Runtime.InteropServices.Marshal.Copy(
                new IntPtr(bufferPointer.ToInt64() + index * sizeof(double)),
                dest, dstIndex, count);
        }

        public double[] GetManagedArray()
        {
            return GetManagedArray(0, length);
        }

        public double[] GetManagedArray(int index, int count)
        {
            double[] result = new double[count];
            Read(index, result, 0, count);
            return result;
        }

        public override string ToString()
        {
            StringBuilder sb = new StringBuilder();
            sb.Append('[');
            for (int t = 0; t < length; t++)
            {
                sb.Append(this[t].ToString());
                if (t < (length - 1))
                    sb.Append(',');
            }
            sb.Append(']');
            return sb.ToString();
        }

        public unsafe double* GetPointer(int index)
        {
            return GetPointer() + index;
        }

        public unsafe double* GetPointer()
        {
            return ((double*)bufferPointer.ToPointer());
        }
    }
}