Integrating SSE2 with .NET via C++/CLI and SSE2 Intrinsics

    Question

  • Hi there, 

    I'm embarking on the adventurous course of trying to accelerate a simulation application written entirely in C#.NET using SSE2. So far I've spent a few days looking at the feasibility of using SSE2 in a .NET application. 

    From what I've read it looks pretty much impossible, but I would like to post here and see what you guys think. 

    I am creating managed double[] arrays in C#.NET, passing to C++/CLI (pinning the pointers in the process) and from there to unmanaged code that performs SIMD instructions via SSE2 intrinsics. All very well and good, however the pinned pointers to arrays are not 16-byte aligned, so I get an access violation exception when running my code in x86 mode. 

    Curiously, running in x64 there is no problem. 

    The speed results are not fantastic, but the x64 version gets a 2x speed improvement over .NET for double-precision data computed via SSE2 intrinsics. This is enough to make this venture worth it...

    I don't suppose there is any way to declare a managed array as 16-byte aligned, or to marshal the array into a 16-byte aligned one (without a full copy, please!), or any other solution to the above, is there?

    Alternatively, is Microsoft planning some vector primitives for use directly in .NET? (Rather like the Mono.Simd primitives, which are so widely acclaimed.)

    Thanks!
    Wednesday, January 27, 2010 5:59 PM

Answers

  • Wait a sec, I think I've found a viable workaround. 

    I tried doing what I suggested above, buffering the input through a structure of aligned doubles, but it was slow (actually slower than .NET at the same computation. Yikes). 

    So instead I searched through the MSDN documentation on intrinsics and found two functions that let you load/get unaligned data to __m128d registers. 

    The following now works on x86/x64 and gives the right answers. I take a performance hit, but it's still way faster than .NET for the same execution. 

    (Timings are around 160ms for .NET, 80ms for SIMD if the data is 16-byte aligned, 93ms if I use the unaligned workaround below.) 

    The SSE2 intrinsic _mm_loadu_pd() loads two doubles from an unaligned pointer into a __m128d register; similarly, _mm_storeu_pd() stores a __m128d register out through an unaligned double pointer. 
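    As a minimal standalone sketch (plain C++, no C++/CLI, assuming an x86/x64 compiler where SSE2 is available), this shows the unaligned load/store pair tolerating a pointer that is only 8-byte aligned, which is exactly the situation a pinned managed double[] can produce:

    ```cpp
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cassert>

    int main()
    {
        // Deliberately misalign: storage is 16-byte aligned, so storage + 1
        // is 8-byte aligned but NOT 16-byte aligned, mimicking a pinned
        // managed array that happens to land off a 16-byte boundary.
        alignas(16) double storage[6] = {};
        double* p = storage + 1;
        p[0] = 4.0; p[1] = 9.0;

        // _mm_load_pd(p) could fault here; _mm_loadu_pd tolerates the misalignment.
        __m128d v = _mm_loadu_pd(p);
        v = _mm_sqrt_pd(v);          // [2.0, 3.0]
        _mm_storeu_pd(p + 2, v);     // unaligned store back out

        assert(p[2] == 2.0 && p[3] == 3.0);
        return 0;
    }
    ```

    sqrt(4.0) and sqrt(9.0) are exact in IEEE double, so the equality comparison is safe here.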

    I must say overall I'm pretty impressed by how the JIT optimises code, it certainly gives you a run for your money. I also realise this SIMD implementation below is a bit basic and other implementations could be optimised for latencies but it seems to work, which is great as this is just a proof of concept. 

    Any comments / Suggestions? Thanks for your help guys. 

    Andrew


    --------------------------------------------------------------

    // Computes output[i] = sqrt(input[i] * 2.8) over the array using SSE2
    // intrinsics, tracking min/max of the results.
    // NOTE: input and output are pinned pointers to managed double arrays. 
    void SSE2IntrinsicsDoSquareRoot(double * input, double * output, int arraySize, double & maxValue, double & minValue)
    {        
    	__m128d coeff = _mm_set1_pd(2.8);      // coeff[0, 1] = 2.8
    	__m128d tmp;
    
    	__m128d min128 = _mm_set1_pd(DBL_MAX);   // start the running min at the largest double
    	__m128d max128 = _mm_set1_pd(-DBL_MAX);  // start the running max at the lowest double
    		
    	__m128d source;
    
    	// Loop over N elements of input[] and output[]
    	// in blocks of 2 (128bits)
    	for (register int i = 0; i < arraySize; i += 2)
    	{
    		// Load the unaligned data into the __m128d register 'source'
    		// source[0, 1] = input[0], input[1]
    		// This incurs a perf hit of approx 15%, but it is still way faster than managed code
    		source = _mm_loadu_pd(input);
    
    		// Perform the computation
    		source = _mm_mul_pd(source, coeff);      // source = source * coeff
    		tmp = _mm_sqrt_pd(source);               // tmp = sqrt(source)
    
    		min128 = _mm_min_pd(tmp, min128);   // per-lane running minimum
    		max128 = _mm_max_pd(tmp, max128);   // per-lane running maximum
    		
    		// Retrieve the output data from the register and 
    		// store in unaligned pointer
    		_mm_storeu_pd(output, tmp);
    		
    		output += 2;
    		input += 2;
    	}	
    
    	// Union of an SIMD register and a 16-byte aligned double array (size 2)
    	union u
    	{
    		__m128d m;
    		__declspec(align(16)) double f[2];
    	}x;
    
    	// extract minimum and maximum values from min128 and max128
    	x.m = min128;
    	minValue = min(x.f[0], x.f[1]);
    
    	x.m = max128;
    	maxValue = max(x.f[0], x.f[1]);
    }
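    Incidentally, the final min/max extraction via the union can also be done without bouncing through memory, using a lane swap plus the scalar min/max intrinsics. A small sketch of that alternative (not from the original post):

    ```cpp
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cassert>

    // Horizontal min of a __m128d entirely in registers:
    // swap the two lanes, take the element-wise min, read back lane 0.
    static double hmin(__m128d v)
    {
        __m128d swapped = _mm_shuffle_pd(v, v, 0x1);   // [v1, v0]
        return _mm_cvtsd_f64(_mm_min_sd(v, swapped));  // min(v0, v1)
    }

    static double hmax(__m128d v)
    {
        __m128d swapped = _mm_shuffle_pd(v, v, 0x1);
        return _mm_cvtsd_f64(_mm_max_sd(v, swapped));
    }

    int main()
    {
        __m128d v = _mm_set_pd(7.5, -2.25);  // lanes: [-2.25, 7.5]
        assert(hmin(v) == -2.25);
        assert(hmax(v) == 7.5);
        return 0;
    }
    ```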

    • Proposed as answer by Reed Copsey, JrMVP Thursday, January 28, 2010 4:38 PM
    • Marked as answer by Wesley Yao Tuesday, February 02, 2010 3:36 AM
    Thursday, January 28, 2010 11:30 AM

All replies

  • Unfortunately, I haven't seen a way around this.

    The best option I've found is to use a native array instead of a managed array.  You can make a managed wrapper for your native array type in C++/CLI, so C# can create and access the elements, but then you do your processing in completely native code on your 16-byte aligned array.
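    A minimal sketch of the native side of that idea (assuming an x86/x64 compiler with SSE2; the C++/CLI wrapper class that would own the pointer and expose an indexer to C# is omitted here):

    ```cpp
    #include <emmintrin.h>   // _mm_malloc/_mm_free and SSE2 types
    #include <cassert>
    #include <cstdint>

    int main()
    {
        const int n = 8;
        // Native 16-byte aligned buffer; a C++/CLI wrapper would hold this
        // pointer as a member and free it in its destructor/finalizer.
        double* data = static_cast<double*>(_mm_malloc(n * sizeof(double), 16));
        assert(reinterpret_cast<std::uintptr_t>(data) % 16 == 0);

        for (int i = 0; i < n; ++i) data[i] = i * 1.5;

        // Aligned loads are now legal on every even element pair.
        __m128d v = _mm_load_pd(data + 2);   // [3.0, 4.5]
        assert(_mm_cvtsd_f64(v) == 3.0);

        _mm_free(data);
        return 0;
    }
    ```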

    This seems to work reasonably well, but it does add some overhead in the wrapping.  However, I've had pretty good luck with this in general.

    As for Microsoft's plans with a SIMD alternative - there hasn't been anything announced (at least not that I've seen).

    Reed Copsey, Jr. - http://reedcopsey.com
    Wednesday, January 27, 2010 6:06 PM
  • I don't think it is so curious that it works fine on a system that has a 64-bit bus width, because doubles are 64-bit types. There is always this alignment/packing issue when the bus width doubles.

    When you pass an odd number of doubles...is that when this issue crops up? (Obviously, an even number of doubles will always be 16-byte aligned, and an odd number will not be. Is it possible to pad the number to be even?)

    In VB, at least...you can use structures for this type of thing...you have control over the packing and alignment with structures. I do not see an equivalent with arrays, but...they have always been problematic, because they really are not the same thing as structures, and high-level languages want to pass array size along with the array.

    I would try to create a structure that passed a size, taking the example from the various structures that Windows uses. For example, DEVMODE is a structure that has common elements as well as vendor defined elements, so...its size varies.
    • Edited by jinzai Wednesday, January 27, 2010 6:20 PM sticky 'i' key
    Wednesday, January 27, 2010 6:18 PM
  • Hi everybody, thanks for your replies, 

    Well I thought it was curious! I'm new to SIMD programming. 

    Anyway, I was thinking last night, and yes, perhaps I will have to do as Reed said: create the data in C++ as an aligned array. After all, most of the processing is done there. But how could I wrap this as efficiently as possible (without mem copies)? 

    I suppose if I created a generic class to wrap an array of type T and put an indexed property on it (and a GetEnumerator), would that do? 

    A bit of a hack, but a workaround nonetheless. 

    For jinzai: although I am primarily targeting x64, I would rather make an x86 implementation too. In both cases the number of doubles is even (50,000, which is divisible by 16).

    Another idea I had was to pass the data in two stages. Here, I'll post some code below so you can see my test application (All it does is square root an array, returning the result and max/min during the operation. Pointless but proves the concept). 

    I've commented in the SIMD implementation where I think I could load data from my managed array into a 16-byte aligned struct on the stack (in a register?) and then into SIMD registers. 

    Any comments/suggestions on how I could hack this to work on x86/64 would be appreciated (pref without creating arrays in C++, but I will if I have to). Go easy, I'm a bit green with SIMD!

    Thanks!!

    Andrew
    -----------------------------------------------------

    Here's the C# implementation:

    // Outer loop is just used for micro-benchmark purposes
    for (int loops = 0; loops < numLoops; loops++)
    {
        // Loop over the input double[] array
        for (int i = 0; i < ARRAY_SIZE; i++)
        {
            // Square root * 2.8 and store in the result
            resultArray[i] = Math.Sqrt(initialArray[i] * 2.8);
          
            // Keep track of min/max
            min = Math.Min(resultArray[i], min);
            max = Math.Max(resultArray[i], max);
        }
    }
       

    And the equivalent C# to C++/CLI to SIMD version:
    //-------------------------------------------------------------
    // C# Implementation - calls SSE Code via C++/CLI
    //-------------------------------------------------------------
    
    // Previously declared ... 
    // double [] initialArray = new double[ARRAY_SIZE];
    // double [] resultArray = new double[ARRAY_SIZE];
    
    // Create an instance of my C++/CLI wrapper to the SSE code
    SSECppCliTest.SSESquareRootWrapper rooter = new SSECppCliTest.SSESquareRootWrapper();
    
    // Outer loop for micro benchmark purposes only
    for (int loops = 0; loops < numLoops; loops++)
    {
        // Perform the square root (passing in arrays and out max/min)
        rooter.DoSquareRoot(initialArray, resultArray, ref max, ref min);
    }
    
    
    //-------------------------------------------------------------
    // C++/CLI and SIMD code
    //-------------------------------------------------------------
    
    // Here's where my SSESquareRootWrapper.DoSquareRoot call goes
    void SSESquareRootWrapper::DoSquareRoot(
    	array<double> ^ inputArray, 
    	array<double> ^ outputArray, 
    	double % maxVal, 
    	double % minVal)
    {		
            // Pin the input/output
    	pin_ptr<double> ptrInput = &inputArray[0];
    	pin_ptr<double> ptrOutput = &outputArray[0];
    
    	double minValue, maxValue;
    
            // Call the native function to do the SSE implementation
    	SSE2IntrinsicsDoSquareRoot((double*)ptrInput, (double*)ptrOutput, inputArray->Length, maxValue, minValue);
    
    	maxVal = maxValue;
    	minVal = minValue;
    }
    
    
    #pragma unmanaged
    
    // And here's my unmanaged SSE implementation of the
    // square root 
    void SSE2IntrinsicsDoSquareRoot(double * input, double * output, int arraySize, double & maxValue, double & minValue)
    {
    	__m128d coeff = _mm_set1_pd(2.8);      // coeff[0, 1] = 2.8
    	__m128d tmp;
    
    	__m128d min128 = _mm_set1_pd(DBL_MAX);   // start the running min at the largest double
    	__m128d max128 = _mm_set1_pd(-DBL_MAX);  // start the running max at the lowest double
    			
            // Here we reinterpret our pinned pointers to the managed
            // double arrays as pointers to SIMD registers.
            // This seems to be fine (so far)
    	__m128d* pSource = (__m128d*) input; // pSource = (input[0], input[1])
    	__m128d* pDest = (__m128d*) output;  // pDest = (output[0], output[1])
    
    	// Loop over N elements of input[] and output[]
    	// in blocks of 2 (128bits)
    	for (register int i = 0; i < arraySize; i += 2)
    	{
    		//
    		// NOTE:
    		// At this point I get an access violation exception
    		// when calling in 32-bit mode.
    		//
    		// Another interesting feature: if I run in 32-bit mode
    		// under the debugger it works, but without debugging, no.
    		//
    		// Perhaps here I could do a little hack and load my
    		// pinned pointer to managed doubles into a 16-byte
    		// aligned structure before passing to the SIMD function?
    		//
    		tmp = _mm_mul_pd(*pSource, coeff);      // tmp = *pSource * coeff
    		*pDest = _mm_sqrt_pd(tmp);              // *pDest = sqrt(tmp)
    
    		min128 = _mm_min_pd(*pDest, min128);   // per-lane running minimum
    		max128 = _mm_max_pd(*pDest, max128);   // per-lane running maximum
    
    		pSource++;	// pSource = (input[i], input[i+1])
    		pDest++;	// pDest = (output[i], output[i+1])
    	}
    
    	// extract minimum and maximum values from min128 and max128
    	union u
    	{
    		__m128d m;
    		double f[2];
    	} x;
    
    	x.m = min128;
    	minValue = min(x.f[0], x.f[1]);
    
    	x.m = max128;
    	maxValue = max(x.f[0], x.f[1]);
    }
    #pragma managed

    Thursday, January 28, 2010 9:47 AM
  • Hi Andy,

    I'm also interested by this area...

    Have you found a better solution? Do you think it would be possible to create a kind of 'float4' structure with operators by using SSE?

    Regards

    ViewOn01
    Wednesday, March 03, 2010 8:47 AM
    What's hilarious is I was just googling this subject and I found -- my own post! So, err, not a lot out there in internet land on this topic. 

    I think theoretically it would be possible to create a C++/CLI class (or structure) that internally used SSE, for instance Vector3, Vector4, Matrix4x4, etc... These are the sorts of constructs I need. It would work better with structs, as you can specifically set the alignment on structures, I believe.

    The only problem is that .NET works across CPU architectures whereas SSE is architecture specific - it goes against the grain of what .NET is about. For the sort of work I'm doing (mainly research) it's not really a problem. But for a mass-released piece of code it would be.

    You could have multiple implementations of the internals of, say, a Vector4 and select the right one at runtime depending on SSE availability, but that would probably introduce a performance hit that would make your optimisations irrelevant. Alternatively, there must be some way of JIT-compiling "the right Vector3". If only I knew how! ...

    For instance, there are ways in .NET to dynamically compile an assembly at runtime. What if you had in your SIMD library (C++/CLI) some way of checking the SSE availability and then dynamically compiling the right code to expose Vector3/Vector4 with SSE, etc.? 

    Just thinking out loud. If anyone knows - please post! :D
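    For what it's worth, the runtime-selection idea could be sketched as a function pointer resolved once at startup. This is only an illustration: the SSE2 body is stubbed with scalar code so the sketch stays portable, and cpu_has_sse2() is a hypothetical stand-in for a real CPUID query (e.g. __builtin_cpu_supports("sse2") on GCC/Clang):

    ```cpp
    #include <cmath>
    #include <cassert>

    static double sqrt_scalar(double x) { return std::sqrt(x); }
    static double sqrt_sse2(double x)   { return std::sqrt(x); } // stand-in for an intrinsic body

    static bool cpu_has_sse2() { return true; }  // assumption: replace with a real CPUID check

    // Resolved once at startup; every later call is a plain indirect call,
    // so there is no per-call feature-detection branch.
    static double (*sqrt_impl)(double) = cpu_has_sse2() ? sqrt_sse2 : sqrt_scalar;

    int main()
    {
        assert(sqrt_impl(9.0) == 3.0);
        return 0;
    }
    ```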

    Tuesday, April 20, 2010 8:17 PM
  • Right I did a bit more investigation into this. 


    I created a Vector3 and Matrix3x3 class in C++/CLI, each of which had a pointer to an array for the values (double *), allocated using _mm_malloc to align the data. 

    I implemented the * operator on the Matrix class to multiply a matrix by a vector, returning a vector. Inside that I called a C function (in a separate file) that did the operation using SSE2. The SSE code was as follows: 

     

    double * Matrix3x3_Vector3_Mul64_SSE2(double * mat3x3d, double * vec3d)
    {
      /*****************************************
       * Matrix format, mat3x3d = |M11 M12 M13| (elements 0,1,2)
       *                          |M21 M22 M23| (elements 3,4,5)
       *                          |M31 M32 M33| (elements 6,7,8)
       *
       * Vector3 format, vec3d =  |X| (element 0)
       *                          |Y| (element 1)
       *                          |Z| (element 2)
      *******************************************/
    
    	// 16-byte aligned version of "new double[4]". Last element is redundant
    	double * pResult = (double*)_mm_malloc(4 * sizeof(double), 16);
    
    	__m128d xy = _mm_load_pd(vec3d);		// Contains vec3 X, Y
    	__m128d zz = _mm_load1_pd(vec3d + 2);	// Contains vec3 Z, Z
    
    	__m128d m11m12 = _mm_load_pd(mat3x3d);		// Contains mat3x3 M11, M12
    	__m128d m13m21 = _mm_load_pd(mat3x3d + 2);	// Contains mat3x3 M13, M21
    	__m128d m22m23 = _mm_load_pd(mat3x3d + 4);	// Contains mat3x3 M22, M23
    	__m128d m31m32 = _mm_load_pd(mat3x3d + 6);	// Contains mat3x3 M31, M32
    
    	// Union of an SSE register and two doubles, for reading lanes back out
    	union SimdDouble { __m128d m; double f[2]; } temp;
    
    	// temp = [X, X] * [m11, m21] = [X.m11, X.m21]
    	//  unpacklo takes X from xy and stores it in one __m128d register as [X, X]
    	//  shuffle argument 0x02 = binary 0000 0010 means take m11m12.0 and m13m21.1
    	temp.m = _mm_mul_pd(_mm_unpacklo_pd(xy, xy),
    						_mm_shuffle_pd(m11m12, m13m21, 0x02));
    
    	// temp = [X.m11, X.m21] + [Y, Y] * [m12, m22]
    	//   = [X.m11 + Y.m12, X.m21 + Y.m22]
    	//  unpackhi takes Y from xy and stores it in one __m128d register as [Y, Y]
    	//  shuffle argument 0x01 = binary 0000 0001 means take m11m12.1 and m22m23.0
    	temp.m = _mm_add_pd(temp.m, _mm_mul_pd(
    									_mm_unpackhi_pd(xy, xy),
    									_mm_shuffle_pd(m11m12, m22m23, 0x01)));
    
    	// temp = [X.m11 + Y.m12, X.m21 + Y.m22] + [Z, Z] * [m13, m23]
    	//   = [X.m11 + Y.m12 + Z.m13, X.m21 + Y.m22 + Z.m23]
    	//  zz is already stored as [Z, Z]
    	//  shuffle argument 0x02 = binary 0000 0010 means take m13m21.0 and m22m23.1
    	temp.m = _mm_add_pd(temp.m, _mm_mul_pd(
    									zz,
    									_mm_shuffle_pd(m13m21, m22m23, 0x02)));
    
    	// temp now contains the first two elements of the result vector,
    	// so store temp in pResult[0, 1]
    	//  result[0] = X.m11 + Y.m12 + Z.m13
    	//  result[1] = X.m21 + Y.m22 + Z.m23
    	pResult[0] = temp.f[0];
    	pResult[1] = temp.f[1];
    	
    	// Compute the last element of the result vector ... 

    	// temp = [X.m31, Y.m32]
    	temp.m = _mm_mul_pd(m31m32, xy);

    	// No point using SIMD to add in M33 as there is no optimisation to make.
    	// Compute directly:
    	//  result[2] = X.m31 + Y.m32 + Z.m33
    	pResult[2] = temp.f[0] + temp.f[1] + (vec3d[2] * mat3x3d[8]);
    
    	return pResult;
    }
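    As a sanity check on the lane bookkeeping above, here is a plain scalar reference for the same row-major 3x3 * vec3 product (not from the original post), useful for comparing the SSE2 version's output element by element:

    ```cpp
    #include <cassert>

    // Scalar reference: out = m * v, with m stored row-major as 9 doubles.
    static void mat3_vec3_mul(const double m[9], const double v[3], double out[3])
    {
        for (int r = 0; r < 3; ++r)
            out[r] = m[3*r + 0] * v[0] + m[3*r + 1] * v[1] + m[3*r + 2] * v[2];
    }

    int main()
    {
        const double m[9] = { 1, 2, 3,
                              4, 5, 6,
                              7, 8, 9 };
        const double v[3] = { 1, 0, -1 };
        double out[3];
        mat3_vec3_mul(m, v, out);

        // Row 0: 1*1 + 2*0 + 3*(-1) = -2, and likewise for the other rows.
        assert(out[0] == -2.0 && out[1] == -2.0 && out[2] == -2.0);
        return 0;
    }
    ```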


    I then created some Vector3 & Matrix3x3 classes in pure C#, implemented the same operation (Vec = Mat * Vec). And a speed test to compare the two. 

    The result - the C++/CLI SIMD performance is appalling compared to vanilla C#. Here are the results of 10,000,000 matrix/vector multiplies. Yes, that is roughly 20 times slower for the SIMD implementation versus vanilla C#...

    C# Time:	00:00:02.2298309
    SIMD Time:	00:00:51.1679630
    Speedup:	0.0435786529160834
    
    -256663389.878927 (checksum, = Sum{X+Y+Z}i)
    -256663389.878927 (checksum, = Sum{X+Y+Z}i)

    Although I'm not 100% sure of the reason for the terrible performance, I can hazard a guess that it's not due to the SIMD implementation per se, but possibly the frequent crossing of the boundary from managed to native code.

    This sort of method, creating small primitives in C++/CLI that can be called from C# and internally use SIMD, is not going to work. I think the soonest something like this can be done is when (if?) Microsoft implements some SIMD primitives in the JIT, rather like how Mono.Simd works, where various primitives cause the JIT compiler to emit SIMD code directly into the output.

    In conclusion, I think the above method of creating primitives is useless, but if you wanted to say operate on an entire 1024x1024 image doing some operation, then SIMD will still help. The time is being spent (I think) crossing the managed/native boundary very regularly. This needs to be minimised. 

    That's why the first example I posted provides a decent speed improvement, while the second does not. In the first example a large array is passed to C++, where it is then operated on in its entirety with plenty of maths operations to keep the CPU busy (and outweigh any marshalling costs). 


    Hope this thread has been informative. A little disappointing, but at least I know a little more about when I can and cannot use SIMD. 

    Friday, April 23, 2010 2:37 PM
    I have used Reed Copsey, Jr.'s suggestion; it works OK as long as you use the blitting functions to read the arrays. Here's an example: a memory-aligned double array.

    using System;
    using System.Collections.Generic;
    using System.Runtime.InteropServices;
    using System.Text;
    
    namespace AlignedArray
    {
      public class AlignedDoubleArray : IDisposable
      {
        private byte[] buffer;
        private GCHandle bufferHandle;
        private IntPtr bufferPointer;
        private readonly int length;
    
        public AlignedDoubleArray(int length, int byteAlignment)
        {
          this.length = length;
          buffer = new byte[length * sizeof(double) + byteAlignment];
          bufferHandle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
          long ptr = bufferHandle.AddrOfPinnedObject().ToInt64();
          // round up ptr to nearest 'byteAlignment' boundary
          ptr = (ptr + byteAlignment - 1) & ~(byteAlignment - 1);
          bufferPointer = new IntPtr(ptr);
        }
    
        ~AlignedDoubleArray()
        {
          Dispose(false);
        }
    
        protected void Dispose(bool disposing)
        {
          if(bufferHandle.IsAllocated)
          {
            bufferHandle.Free();
            buffer = null;
          }
        }
    
        #region IDisposable Members
    
        public void Dispose()
        {
          Dispose(true);
          GC.SuppressFinalize(this);  // handle already freed; finalizer not needed
        }
    
        #endregion
    
        public double this[int index]
        {
          get
          {
            unsafe
            {
              return GetPointer()[index];
            }
          }
          set
          {
            unsafe
            {
              GetPointer()[index] = value;
            }
          }
        }
    
        public int Length
        {
          get
          {
            return length;
          }
        }
    
        public void Write(int index,double[] src,int srcIndex,int count)
        {
          if(index<0 || index>=length) throw new IndexOutOfRangeException();
          if ((index + count) > length) count = Math.Max(0, length - index);
    
          System.Runtime.InteropServices.Marshal.Copy(
            src, 
            srcIndex,
            new IntPtr(bufferPointer.ToInt64() + index*sizeof (double)),
            count);
        }
    
        public void Read(int index, double[] dest, int dstIndex, int count)
        {
          if (index < 0 || index >= length) throw new IndexOutOfRangeException();
          if ((index + count) > length) count = Math.Max(0, length - index);
    
          System.Runtime.InteropServices.Marshal.Copy(
            new IntPtr(bufferPointer.ToInt64() + index*sizeof (double)),
            dest,
            dstIndex,
            count);
        }
    
        public double[] GetManagedArray()
        {
          return GetManagedArray(0, length);
        }
    
        public double[] GetManagedArray(int index,int count)
        {
          double[] result = new double[count];
          Read(index, result, 0, count);
          return result;
        }
    
        public override string ToString()
        {
          StringBuilder sb = new StringBuilder();
          sb.Append('[');
          for(int t=0;t<length;t++)
          {
            sb.Append(this[t].ToString());
            if (t < (length - 1)) sb.Append(',');
          }
          sb.Append(']');
          return sb.ToString();
        }
    
        public unsafe double* GetPointer(int index)
        {
          return GetPointer() + index;
        }
    
        public unsafe double* GetPointer()
        {
          return ((double*) bufferPointer.ToPointer());
        }
      }
    }
    
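    The constructor's pointer arithmetic is the classic power-of-two round-up: over-allocate by byteAlignment bytes, then round the pinned address up to the next boundary. The same trick in a tiny standalone C++ sketch (not from the original post):

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Round p up to the next multiple of align (align must be a power of two).
    static std::uintptr_t align_up(std::uintptr_t p, std::uintptr_t align)
    {
        return (p + align - 1) & ~(align - 1);
    }

    int main()
    {
        assert(align_up(0x1000, 16) == 0x1000);  // already aligned: unchanged
        assert(align_up(0x1001, 16) == 0x1010);  // rounds up, never down
        assert(align_up(0x100F, 16) == 0x1010);
        // Worst case consumes align - 1 extra bytes, which is why the C#
        // constructor allocates length * sizeof(double) + byteAlignment bytes.
        return 0;
    }
    ```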


    Sunday, June 27, 2010 7:26 PM