Global and static memory, structures, and parallelizing rand()

  • Question

  • I am trying to port a simple raytracer to C++ AMP (and by simple, I mean maybe simpler than a driving test).

    Currently the raytracer supports raytraced shadows and a few other things, but not transparency or reflection.  C++ AMP doesn't support recursion, so I'm not going to worry about those right now.

    Presently there are a few things that are holding me up from moving forward:

    1.  Rand.  It seems to be very difficult to parallelize random numbers because there's no support for static variables, etc.

    2.  No global variables.  In raytracing, you have to store a lot of information, such as camera, light, and world attributes, rendering options, buffers, etc.  I'm concerned that passing all those parameters down through so many functions will kill performance.

    Saturday, April 20, 2013 7:15 PM

All replies

  • 1. I would say it's relatively easy to get a pRNG going on the GPU / in AMP; see something like Marc Olano's GPUTEA paper or Ranlim32 from Numerical Recipes (I've put both in AMP in a pretty short time and they're quite reasonable). Note that you really, really don't want to use something that relies on shared global state / statics, as is the norm for stuff like rand(). You want something that is light on state and can be attached to individual threads / SIMD lanes, as opposed to the whole dispatch in general. Note also that I'm not saying anything about the properties of the pRNs generated, as that is a wholly different kettle of fish (it is pretty hard to prove stuff for the serial case, it becomes harder for the parallel case, and it is not exactly an exhaustively researched problem yet). They will probably work fine for your use case.

    2. You have to store similar state for rasterization, and GPUs have been doing it for ages (they are super duper rasterizers, after all! :) ). Have your data structures live in global memory (GPU VRAM), maybe wrapped in an array_view or something. That way they will be visible to all tiles and you will be able to read them with ease (and also update them, but for writes you of course have to take hazards into account). The acceleration structure, if you use something like that, might be somewhat less straightforward depending on what you choose, but you can leave that for the last step.

    Sunday, April 21, 2013 4:28 AM
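
    For what "state light" can look like, here is a minimal sketch in the spirit of the TEA approach mentioned above: the Tiny Encryption Algorithm used as a stateless hash of (pixel index, sample index). It is written as plain C++ so it compiles anywhere; in C++ AMP the two functions would presumably be marked restrict(amp) and use unsigned int. The names tea_hash and random01, the key constants, and the default of 8 rounds are illustrative assumptions, not taken from the paper or from anyone's code in this thread.

```cpp
#include <cstdint>

// Tiny Encryption Algorithm used as a stateless hash. The result depends only
// on the two input words, so every thread can call it with its own index and
// no shared or static state is needed. More rounds cost more time but mix
// the bits better.
inline std::uint32_t tea_hash(std::uint32_t v0, std::uint32_t v1, unsigned rounds = 8)
{
    // Arbitrary 128-bit key (assumption: any fixed key works for this purpose).
    const std::uint32_t k0 = 0xA341316Cu, k1 = 0xC8013EA4u;
    const std::uint32_t k2 = 0xAD90777Du, k3 = 0x7E95761Eu;
    const std::uint32_t delta = 0x9E3779B9u;  // standard TEA round constant

    std::uint32_t sum = 0;
    for (unsigned i = 0; i < rounds; ++i) {
        sum += delta;
        v0 += ((v1 << 4) + k0) ^ (v1 + sum) ^ ((v1 >> 5) + k1);
        v1 += ((v0 << 4) + k2) ^ (v0 + sum) ^ ((v0 >> 5) + k3);
    }
    return v0;
}

// Uniform float in [0, 1) for a given pixel and sample number.
inline float random01(std::uint32_t pixel_index, std::uint32_t sample_index)
{
    // Keep the top 24 bits so the value is exactly representable as a float.
    return (tea_hash(pixel_index, sample_index) >> 8) * (1.0f / 16777216.0f);
}
```

    Inside the kernel, each thread would call random01(its_linear_index, n) for its n-th random number, so nothing has to be shared between threads.
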
  • I'm not doing an acceleration structure, or even polygons at the moment.  I'm using a density function.  If the density function returns >0 then I've hit solid, and I plot a pixel.  I figure it will be useful for things like metaballs, mathematical shapes, maybe the mandelbulb, but right now I'm just doing a heightmap defined by a plane minus the bits in a bitmap.  Simple stuff, so I can concentrate on getting it to work on the GPU.

    The rand() thing isn't a showstopper, but it's a consideration.  I was previously using xorshift random numbers, and they probably won't parallelize out of the box.  I could always use a lookup table in this case.

    I am however having trouble wrapping my head around how to pass structures down to AMP functions.  They don't seem to be captured by default, and I haven't messed with capture lists yet, so I don't know what's possible.

    You may be right, I may have to make an array_view of an array containing the structures.  Assuming that's possible.

    Sunday, April 21, 2013 5:42 PM
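
    One way xorshift might be made to work in parallel is to give every pixel its own 32-bit state instead of one shared static. The sketch below is plain C++ under that assumption; make_seeds and the index-scrambling step are hypothetical choices for illustration. In C++ AMP the seed vector would presumably be wrapped in an array_view<unsigned int, 1>, with each thread reading and writing only its own element.

```cpp
#include <cstdint>
#include <vector>

// One xorshift32 step (the 13/17/5 shift triple). The state is passed in by
// reference rather than living in a static variable, so each thread can own
// a separate state.
inline std::uint32_t xorshift32(std::uint32_t& state)
{
    state ^= state << 13;
    state ^= state >> 17;
    state ^= state << 5;
    return state;
}

// Build one seed per pixel on the host. xorshift32 must never be seeded with
// 0 (it would stay at 0 forever), and consecutive seeds such as 1, 2, 3 give
// similar-looking early outputs, so the index is scrambled first.
inline std::vector<std::uint32_t> make_seeds(std::uint32_t pixel_count)
{
    std::vector<std::uint32_t> seeds(pixel_count);
    for (std::uint32_t i = 0; i < pixel_count; ++i) {
        std::uint32_t h = (i + 1u) * 2654435761u;  // multiplicative hash of the index
        h ^= h >> 16;
        seeds[i] = h ? h : 1u;  // defensive guard against a zero state
    }
    return seeds;
}
```

    The cost is one extra unsigned int per pixel, which is small next to the image buffers a raytracer already keeps.
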
  • >>Have your data-structures live in global-memory (GPU VRAM), maybe wrapped in an array_view or something

    I'm not sure how that is possible since global variables aren't allowed.

    Sunday, April 21, 2013 5:45 PM
  • Well, you can have an array_view over many things actually (the naming might be a bit unfortunate in that it makes it seem more restricted than it actually is), as long as you make sure you meet the restrict(amp) restrictions. Something like a bog standard linked list, which uses pointers, is something you wouldn't be able to wrap; luckily, you don't really need that for world / camera / other properties. You can wrap an array of conforming structures just fine (note that AoS is an anti-pattern, though it is fine if you're just looking to quickly prototype stuff). Could you perhaps post some token stuff that you'd want to pass through? Maybe we can find ways to do it nicely. As for pRNG, this might be useful; it's the paper I mentioned earlier: http://www.csee.umbc.edu/~olano/papers/GPUTEA.pdf. Very straightforward to implement in AMP.
    Sunday, April 21, 2013 6:40 PM
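
    To make the AoS remark concrete, here is a small plain C++ sketch of the two layouts. SphereAoS, SpheresSoA and to_soa are hypothetical names for illustration only. On the accelerator, the AoS version would presumably be a single array_view over the structs, while each field of the SoA version would get its own array_view<float, 1>.

```cpp
#include <cstddef>
#include <vector>

// Array of structures (AoS): convenient to write, but neighbouring threads
// that each read `radius` touch addresses sizeof(SphereAoS) bytes apart.
struct SphereAoS {
    float x, y, z;
    float radius;
};

// Structure of arrays (SoA): every field is stored contiguously, so
// neighbouring threads reading radius[i] and radius[i + 1] hit adjacent memory.
struct SpheresSoA {
    std::vector<float> x, y, z, radius;
};

// Host-side conversion from the convenient layout to the GPU-friendly one.
inline SpheresSoA to_soa(const std::vector<SphereAoS>& spheres)
{
    SpheresSoA out;
    out.x.reserve(spheres.size());
    out.y.reserve(spheres.size());
    out.z.reserve(spheres.size());
    out.radius.reserve(spheres.size());
    for (const SphereAoS& s : spheres) {
        out.x.push_back(s.x);
        out.y.push_back(s.y);
        out.z.push_back(s.z);
        out.radius.push_back(s.radius);
    }
    return out;
}
```

    For a handful of camera and light parameters the layout hardly matters; it starts to matter for large per-object or per-pixel arrays.
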
  • Some kind of global variable that resides on the accelerator would be very useful in a future version.  This is turning out to be one of those cases where global variables make sense and help avoid ridiculous looking code.

    Also, I'm wondering why global consts cause an error.

    Tuesday, April 23, 2013 9:28 PM
  • Sadly, I am giving up on this.  It just seems to be impossible to implement at this time.

    Wednesday, April 24, 2013 7:54 PM
  • Hi Dan. If you need constants that are accessible by all the threads launched by a parallel_for_each, you can use constant memory (http://blogs.msdn.com/b/nativeconcurrency/archive/2012/01/11/using-constant-memory-in-c-amp.aspx).  In C++ AMP, all variables captured by value are put into constant memory.

    Wednesday, April 24, 2013 11:35 PM
  • Thanks for the info.

    The biggest problem I was having was this...

    I have a render() function.  That's where the parallel_for_each is.

    That calls a raytrace function.

    That calls a raytraceshadow function.

    They each call a density function.

    The density function calls a texture_lookup function that does a lookup in a bitmap.

    Basically, I need to pass parameters and a pointer to an image down through all that.  Without parameters that are global in scope, I have to pass them as parameters to each function.

    Without pointers, everything has to be copied each time.

    There is some indication in the documentation that you can pass some parameters by reference within the confines of AMP, but I've had no success with it.

    Thursday, April 25, 2013 8:07 PM
  • You cannot have pointers or references as data members or capture-list members (nor indirections deeper than one, so no ptr-to-ptr or ref-to-ptr or whatnot), but you can have them just fine as function parameters, so there is no need to pass by value all the time.  Note that there are two types of containers that do not obey the second rule, namely concurrency::array and concurrency::graphics::texture, as these need to be captured by reference. That being said, what I would do in this case is consider using (stateful) functors as opposed to globals (it is quite difficult to make a good case for globals in general, but that's a wholly different kettle of fish). Or put together a struct to hold all the stuff you want to pass down, and then just feed that down the hierarchy. If the params you're passing are immutable (seems like it), you can assemble the struct on the host side and just pass it through a constant buffer (which is to say, capture it by value in the restrict(amp) lambda).
    Wednesday, May 1, 2013 1:43 AM
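
    A rough sketch of the "struct fed down the hierarchy" idea, written as plain C++ so it runs anywhere. SceneParams, its fields, and raytrace_down are hypothetical; density and texture_lookup borrow the names used earlier in the thread, with made-up signatures. The std::vector stands in for the bitmap, which in C++ AMP would presumably be an array_view captured by value in the lambda and then passed to helper functions by const reference, each helper being marked restrict(amp).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Immutable settings assembled once on the host.
struct SceneParams {
    int   map_width;
    int   map_height;
    float plane_height;  // height of the base plane
    float carve_depth;   // how far a bitmap value of 1 cuts into the plane
};

// Bitmap lookup with clamped coordinates. Both arguments arrive by reference,
// so nothing is copied per call.
inline float texture_lookup(const std::vector<float>& bitmap, const SceneParams& p, int u, int v)
{
    assert(bitmap.size() == static_cast<std::size_t>(p.map_width) * p.map_height);
    u = u < 0 ? 0 : (u >= p.map_width  ? p.map_width  - 1 : u);
    v = v < 0 ? 0 : (v >= p.map_height ? p.map_height - 1 : v);
    return bitmap[static_cast<std::size_t>(v) * p.map_width + u];
}

// Heightmap as "a plane minus the bits in a bitmap": > 0 means solid.
inline float density(const std::vector<float>& bitmap, const SceneParams& p,
                     float x, float y, float z)
{
    float surface = p.plane_height
                  - p.carve_depth * texture_lookup(bitmap, p, static_cast<int>(x), static_cast<int>(z));
    return surface - y;
}

// March straight down from y_start; returns the step index of the first solid
// sample, or -1 if nothing was hit. Iterative, so no recursion is needed.
inline int raytrace_down(const std::vector<float>& bitmap, const SceneParams& p,
                         float x, float z, float y_start, float step, int max_steps)
{
    float y = y_start;
    for (int i = 0; i < max_steps; ++i) {
        if (density(bitmap, p, x, y, z) > 0.0f) return i;
        y -= step;
    }
    return -1;
}
```

    Every level takes the same two parameters, so adding a rendering option later means touching the struct rather than every signature in the chain.
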
  • Not sure how you'd use dynamically allocated memory within a struct.   I could see doing that with fixed size arrays.  Will have to get up to speed on functors.
    Wednesday, May 8, 2013 7:43 PM
  • There is no dynamic allocation in C++ AMP (per se). Whilst it can be worked around, the process is somewhat involved and not something you'd care about for quick prototyping.

    Thursday, May 9, 2013 4:32 AM