array_view goes back to its old value when using parallel_for_each (p-f-e)
-
Tuesday, July 17, 2012 09:20
Hi.
I have hit the wall again. Please help. Please look at the following code.
    float data[1];
    array_view<float,1> dm(1, data);
    data[0] = 2;                  // dm[0]=2 at this point, 100% sure about it
    data[0] += 2;                 // dm[0]=4 at this point
    printf(" dm[0]=%f ", dm[0]);  // dm[0]=4 at this point <--- of course, this is fine

Then look at the following.
    float data[1];
    array_view<float,1> dm(1, data);
    data[0] = 2;                  // dm[0]=2 at this point, 100% sure about it
    parallel_for_each(dm.extent, [=](index<1> idx) restrict(amp)
    {
        float f = dm[0];          // dummy, it doesn't matter; just to use dm in the p-f-e block
    });
    data[0] += 2;                 // dm[0]=4 at this point
    printf(" dm[0]=%f ", dm[0]);  // dm[0]=2 at this point <--- this is weird to me

Obviously the p-f-e is the key. When I use "dm" (the array_view) in the p-f-e, it gets synchronized (synchronization happens at first use), and then the weird thing (for me) happens.
Can anyone explain how this happens, and/or teach me how to avoid it?
Thank you.
All replies
-
Tuesday, July 17, 2012 11:52
Hi,
The problem is in the statement:
data[0]+=2;//dm[0]=4 at this point
That isn't true because you're modifying data[0], not dm[0]. There are two copies of the data here, data[] and dm[], and they don't have the same value until a synchronization. That occurs explicitly with a call to dm.synchronize(), or when you reference dm[] in a statement in the kernel or host code.
During execution of the p-f-e, data[0] is finally copied to dm[0] because of implicit synchronization. Next, data[0] is modified (=4) in the host, but dm[0] is still 2. In the printf, dm[] is referenced, which performs an implicit dm.synchronize(). That copies dm[0] to data[0], overwriting the 4 back with the 2.
One solution is to always use the array_view object in your host code. Another is to explicitly call dm.synchronize() after the p-f-e, then use either the array_view or the host buffer variable.
Hope I explained this correctly, and that it helps.
Ken
- Proposed as answer by DanielMoth (Microsoft Employee, Owner), Tuesday, July 17, 2012 16:21
- Marked as answer by DanielMoth (Microsoft Employee, Owner), Monday, July 23, 2012 19:32
-
Tuesday, July 17, 2012 12:58
Hi Ken, and thank you.
"There are two copies of the data here, data[] and dm[], and they don't have the same value until a synchronization. That occurs explicitly with a call to dm.synchronize(), or when you reference dm[] in a statement in the kernel or host code."
"In the printf, dm[] is referenced, which performs an implicit dm.synchronize()."
Yes, I know that.
"data[0]+=2;//dm[0]=4 at this point
That isn't true because you're modifying data[0] not dm[0]."
If it isn't true, it all makes sense. But the Visual C++ 2012 RC debugger says it is true. It says dm[0]=4.
-
Tuesday, July 17, 2012 13:26
Yep, you're right. The debugger shows dm[0] = 4. But, it does make sense because there are two copies of dm[]'s buffer: one on the GPU, and another on the host. The debugger shows dm[0] = 4 because that's the copy on the host, because you're in host code. But, dm[0] = 2 on the GPU still. Then, when you do the printf, implicit synchronization occurs, and the host buffer is blown away with the copy on the GPU. Then, when you step past the printf and display dm[0] in the debugger, it's back to 2.
Ken
-
Tuesday, July 17, 2012 13:41
Thank you again, Ken.
Your explanation does make sense. It is possible that the value on the GPU is still 2.
Then why didn't it happen in the code without the p-f-e (the first code block of my first post)?
-
Tuesday, July 17, 2012 14:02
Yep, in the first code block, there isn't a GPU copy at all, because there's no p-f-e to fork off a copy from the host. So, there's nothing to get out of sync. When the p-f-e is called in the 2nd code block, dm[] gets allocated on the GPU, and a copy is performed from the host to the GPU. That's why you may find some p-f-e's taking a little longer than you'd expect, because they need to do the copy. (You can force the copy with a kludgy call to a p-f-e that looks just like the p-f-e you wrote in your example.) Remember, p-f-e can take an accelerator_view, so which GPU to use doesn't get selected until the p-f-e() function is called.
Ken
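
The explanation in the replies so far can be sketched as a small model in standard C++. This is only an illustration under stated assumptions: `ToyView`, `run_kernel`, and `read` are hypothetical names that do not exist in C++ AMP, and the flags merely imitate the behaviour Ken describes (a GPU copy created on the first p-f-e, and an implicit GPU-to-host copy on the next host access). It is not how the runtime is actually implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for an array_view over host memory. This is NOT the C++ AMP
// runtime; it only mimics the caching behaviour described in this thread.
class ToyView {
public:
    ToyView(float* host, std::size_t n) : host_(host), n_(n) {}

    // Models a parallel_for_each that captures the view as writable.
    template <typename Kernel>
    void run_kernel(Kernel kernel) {
        if (!device_allocated_) {
            // First use on the "GPU": allocate and copy host -> device.
            device_.assign(host_, host_ + n_);
            device_allocated_ = true;
        }
        kernel(device_);
        // The view was writable in the kernel, so the model assumes the
        // device copy may now be newer than the host buffer.
        device_may_be_newer_ = true;
    }

    // Models reading dm[i] in host code.
    float read(std::size_t i) {
        assert(i < n_);
        if (device_may_be_newer_) {
            // Implicit synchronization: device -> host, overwriting
            // whatever the host buffer held.
            for (std::size_t k = 0; k < n_; ++k) host_[k] = device_[k];
            device_may_be_newer_ = false;
        }
        return host_[i];
    }

private:
    float* host_;
    std::size_t n_;
    std::vector<float> device_;        // simulated GPU copy
    bool device_allocated_ = false;
    bool device_may_be_newer_ = false;
};

// First code block of the question: no kernel, so no device copy exists.
float without_kernel() {
    float data[1];
    ToyView dm(data, 1);
    data[0] = 2;
    data[0] += 2;
    return dm.read(0);
}

// Second code block: a kernel touches the view before the host update.
float with_kernel() {
    float data[1];
    ToyView dm(data, 1);
    data[0] = 2;
    dm.run_kernel([](std::vector<float>& d) { float f = d[0]; (void)f; });
    data[0] += 2;        // changes only the host buffer
    return dm.read(0);   // the stale device copy overwrites the host buffer
}
```

In this model `without_kernel()` returns 4 and `with_kernel()` returns 2, matching the two outputs reported in the question.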
-
Tuesday, July 17, 2012 14:49
I see. That is very useful information that I hadn't noticed. Thank you.
I am still having trouble avoiding this.
Please look at the following code if you have more time.

    #include <amp.h>
    #include <stdio.h>
    using namespace concurrency;

    #define SIZE 3

    int main()
    {
        float data[SIZE] = {0, 0, 0};
        float data2[SIZE] = {2, 2, 2};
        array_view<float,1> av(SIZE, data);
        array_view<float,1> av2(SIZE, data2);
        for (int i = 0; i < 4; i++)
        {
            extent<1> e(SIZE);
            parallel_for_each(e, [=](index<1> idx) restrict(amp)
            {
                av[0] = av2[0];
            });
            av2.synchronize();
            for (int k = 0; k < SIZE; k++)
                data2[k] += 2;  // think of this as an outer function
        }
        printf("av[0]=%f av2[0]=%f", av[0], av2[0]);
        // Prints av[0]=2 av2[0]=4.
        // It is supposed to be av[0]=8 av2[0]=8.
        // av[0] is updated only once, in the first p-f-e (1 out of 3 times).
        return 0;
    }

This code is a much simpler version of my project. I can't reveal my code for NDA reasons.
The array_view av2[] in the code takes the values 2 and 4 in turns, like ping-pong.
You mentioned a solution that uses array_view everywhere, but my project already existed before I started with C++ AMP, and there are a ton of buffers I would have to rewrite as array_view. Hmmmm.
I just want to copy av2[] to av[] (av2[] is updated by updating data2[] in other sub-functions).
- Edited by HotInCool, Tuesday, July 17, 2012 14:51
-
Tuesday, July 17, 2012 16:04
The problem is
data2[k]+=2;//think this as outer function
av2 won't be aware of this modification.
What you can do is to use refresh, which could notify av2 that the memory (data2) it wraps has been updated.
    for (int k = 0; k < SIZE; k++)
        data2[k] += 2;
    av2.refresh();

Also, there is no need for "av2.synchronize()", since av2 is not modified by the parallel_for_each. If av2 is read-only in the parallel_for_each, you should consider making it array_view<const float, 1>. Also, if av's original values are not read in the parallel_for_each, it's good practice to call av.discard_data() before calling parallel_for_each.
We will have a blog series that provides guidelines on using array_view, please stay tuned.
Thanks,
Weirong
- Edited by Zhu, Weirong, Tuesday, July 17, 2012 16:07
- Proposed as answer by DanielMoth (Microsoft Employee, Owner), Wednesday, July 18, 2012 02:46
- Marked as answer by DanielMoth (Microsoft Employee, Owner), Monday, July 23, 2012 19:32
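
To see why refresh() removes the ping-pong, the loop from the question can be replayed against a toy model in standard C++. Everything here is an assumption made for illustration: `ToyView`, `on_device`, `read`, and `run_loop` are hypothetical names, and the two flags only approximate the caching rules discussed in this thread. The real runtime may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for array_view<float,1>; NOT the real C++ AMP runtime.
struct ToyView {
    float* host;
    std::size_t n;
    std::vector<float> device;   // simulated GPU copy
    bool device_valid = false;   // model believes the device copy is current
    bool device_newer = false;   // device copy may hold kernel writes

    ToyView(float* h, std::size_t count) : host(h), n(count) {}

    // Called when a "kernel" uses the view: upload only if the cached
    // device copy is considered stale.
    std::vector<float>& on_device() {
        if (!device_valid) {
            device.assign(host, host + n);
            device_valid = true;
        }
        device_newer = true;     // captured as writable
        return device;
    }

    // Models synchronize() and the implicit sync on host access.
    void synchronize() {
        if (device_newer) {
            for (std::size_t k = 0; k < n; ++k) host[k] = device[k];
            device_newer = false;
        }
    }

    // Models refresh(): the host buffer changed behind the view's back,
    // so the cached device copy is treated as stale.
    void refresh() {
        device_valid = false;
        device_newer = false;
    }

    float read(std::size_t i) {
        assert(i < n);
        synchronize();
        return host[i];
    }
};

struct Result { float av0; float av2_0; };

// Replays the loop from the question, optionally calling refresh()
// after the host-side update of data2.
Result run_loop(bool call_refresh) {
    const std::size_t SIZE = 3;
    float data[SIZE] = {0, 0, 0};
    float data2[SIZE] = {2, 2, 2};
    ToyView av(data, SIZE), av2(data2, SIZE);

    for (int i = 0; i < 4; ++i) {
        // "Kernel": av[0] = av2[0]
        std::vector<float>& src = av2.on_device();
        std::vector<float>& dst = av.on_device();
        dst[0] = src[0];

        av2.synchronize();
        for (std::size_t k = 0; k < SIZE; ++k) data2[k] += 2;  // host-side update
        if (call_refresh) av2.refresh();
    }
    return Result{av.read(0), av2.read(0)};
}
```

In this model `run_loop(false)` gives av[0]=2 and av2[0]=4 (the values from the question), while `run_loop(true)` gives av[0]=8 and av2[0]=10. Here av2[0] ends at 10 rather than 8 because data2 starts at 2 and is incremented four times.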
-
Tuesday, July 17, 2012 16:17
There are two ways to avoid the confusion of array_view<>:
(1) Use array<> instead of array_view<>. In this case, you have to manually copy between the array<> and a host buffer.
(2) try to place all code that manipulates data2 currently in the host into C++ AMP code on the GPU.
Let's try solution (2).
And, let's assume that you really want to modify av[] and av2[] in separate, global synchronization steps, and you only want one thread to copy av2[0] to av[0]. So, your code would look like this:
    #include <amp.h>
    #include <stdio.h>
    using namespace concurrency;

    #define SIZE 3

    int main()
    {
        float data[SIZE] = {0, 0, 0};
        float data2[SIZE] = {2, 2, 2};
        array_view<float,1> av(SIZE, data);
        array_view<float,1> av2(SIZE, data2);
        for (int i = 0; i < 4; i++)
        {
            // Step 1: a single thread copies av2[0] to av[0].
            extent<1> e1(1);
            parallel_for_each(e1, [=](index<1> idx) restrict(amp)
            {
                int j = idx[0];
                av[j] = av2[j];
            });
            // Step 2: av2 is updated on the GPU instead of through data2 on the host.
            extent<1> e(SIZE);
            parallel_for_each(e, [=](index<1> idx) restrict(amp)
            {
                int j = idx[0];
                av2[j] += 2;
            });
        }
        // The results are read through the array_views, not through data[] or data2[].
        printf("av[0]=%f av2[0]=%f", av[0], av2[0]);
        return 0;
    }

If you can, try to turn all the host code that modifies data2[] into functions that run on the GPU and modify av2[] instead.
Ken
-
Wednesday, July 18, 2012 22:52
Thank you, both of you. I am not running away; I am struggling with rewriting my project, hoping I can speed up my code.
I knew that array needs manual CPU-GPU copies. I thought array_view reflected the values automatically, but it does not.
Yes, I knew about const and discard_data(). I will report the result when I finish. It is a big code base, so it may take a few days.
- Edited by HotInCool, Wednesday, July 18, 2012 22:54
-
Thursday, July 19, 2012 00:40
Hi HotInCool,
Have you considered using array_view::refresh after you are done working on "data2", so that you don't need to rewrite that portion of the host code? Of course, depending on what you are doing, it might be beneficial to do that portion of the work on the GPU too, as Ken suggested.
Thanks,
Weirong
-
Thursday, July 19, 2012 01:06
Thank you. I meant that I am rewriting other things. Yes, I have to use refresh() to get the same calculation result as the CPU code.
The refresh() method invokes a memory transfer, right? So my code is slower for now.
(host) ---copy--> (gpu) needs refresh()
(host) <---copy-- (gpu) needs synchronize()
Am I right?
So I have decided to rewrite most of the for-loops in the project as p-f-e, to avoid memory transfers as much as possible.
(It's a balance issue. I think I can't get a good result with p-f-e alone; the point is how the CPU and the GPU work at the same time with minimal memory transfer.) I'm working on it now, and I hope to report a good result.
- Edited by HotInCool, Thursday, July 19, 2012 01:18
-
Friday, July 27, 2012 06:08
Hi, first of all, thank you, Ken and Zhu.
I made it. Actually, I couldn't get a speed gain, but I picked up a lot of tips from you two. I found two keys.
1. Let the CPU work while the GPU is working, and let them work TOGETHER (and use synchronize_async() instead of synchronize()).
2. TESTING the app is the ONLY way to achieve a speed gain. Theory can't help a lot.
Anyway, thank you!