|
Page 2 of 2 |
|
Posted: Mon, 12th Aug 2013 15:03 Post subject: |
|
 |
Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
|
|
Back to top |
|
 |
|
|
Back to top |
|
 |
|
Posted: Mon, 12th Aug 2013 16:54 Post subject: |
|
 |
Yupp, that was what put me off your gpu solution as you can't be sure at 1000 concurrent calcs, that it won't miss a previous prime in another thread, except for having a really big n to start with...
|
|
Back to top |
|
 |
|
Posted: Mon, 12th Aug 2013 17:47 Post subject: |
|
 |
I'm convinced the GPU version should be much faster than it is. I haven't used AMP much but it seems to have a lot of overhead which is why my solution is so slow. Just initializing the array_view takes 40+ms !? I'm not sure whats going on there.
|
|
Back to top |
|
 |
[mrt]
[Admin] Code Monkey
Posts: 1342
|
Posted: Mon, 12th Aug 2013 20:40 Post subject: |
|
 |
|
|
Back to top |
|
 |
|
Posted: Mon, 12th Aug 2013 21:22 Post subject: |
|
 |
Nice one!
But you are right, running it in Visual Studio yields much worse results. I can't get it under 15ms, which I why I abandoned multithreaded approach. Running it directly it's much faster, 3.1ms. :/ Good thing to know.
|
|
Back to top |
|
 |
|
Posted: Mon, 12th Aug 2013 23:58 Post subject: |
|
 |
BearishSun wrote: | It's as clear as you can get it, optimized code is usually pretty ugly.
But in short, it's using vector CPU instructions (i.e. SIMD - Single instruction multiple data), where I can perform various operations on 8 values at once, at the cost of a single operation.
First bit before tmain is emulating abs() function because no such function exists for vector registers.
Then in tmain I populate the first few prime numbers because of the way algorithm works. Nothing special. I also keep a inverse of each prime number because division is a bottleneck in the algorithm, and caching the division saves about 0.5ms.
Then in the main loop I do a bunch of operations to determine if number is divisible by another number. There is no vector equivalent of a modulo operator, and also there is no support for integer division so I do it the hard way. Divide using floating point, then round to nearest integer and then see if the difference between original and rounded is close to 0, and if it is I determine it is divisible and it's not a prime.
And at the end I unpack everything from vector registers back into scalar part of the code. I could have done a loop here instead of doing each unpack manually but I don't trust the compiler to unroll it properly. |


|
|
Back to top |
|
 |
Page 2 of 2 |
All times are GMT + 1 Hour |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
Powered by phpBB 2.0.8 © 2001, 2002 phpBB Group
|
|
 |
|