Fastest Prime Number Generator on NFO! FIGHT!
Page 2 of 2 Goto page Previous  1, 2
paxsali
Banned



Posts: 18352

PostPosted: Mon, 12th Aug 2013 15:03    Post subject:
⁢⁢


Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 12th Aug 2013 15:16    Post subject:
Very Happy

Yup, I just add the first 8 to ensure that N is never divisible by any of the last 7 primes, since I can't check those yet. Didn't really do any math behind it, just guesstimated that 8 first primes should be enough.

EDIT: Gave it some more thought. First 8 primes is overkill. It's enough if I start at N = 8 (first 4 primes), and I can ensure that nothing in a single iteration is divisible by each other.
Back to top
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Mon, 12th Aug 2013 16:54    Post subject:
Yupp, that was what put me off your gpu solution as you can't be sure at 1000 concurrent calcs, that it won't miss a previous prime in another thread, except for having a really big n to start with...


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 12th Aug 2013 17:47    Post subject:
I'm convinced the GPU version should be much faster than it is. I haven't used AMP much but it seems to have a lot of overhead which is why my solution is so slow. Just initializing the array_view takes 40+ms !? I'm not sure whats going on there.
Back to top
[mrt]
[Admin] Code Monkey



Posts: 1342

PostPosted: Mon, 12th Aug 2013 20:40    Post subject:
Aye Bearish!

Your proggy did it in 4.25ms (2.2k/sec) on my 2500k box.

Both apps were run from Windows directly, not through the VS2012 Run command (that slows it down somewhat - wierd).

I'm up! Very Happy

Here is a bench for 10 million primes.

Code:

Looking for 10000000 Primes...
Found 10000152 primes in 14077.582ms (710 k/sec)
First 10: 2 3 5 7 11 13 17 19 23 29
Last 10: 177084823 177084827 177084829 177084833 177084847 177084911 177084917 1
77084949 177084977 177084979
Press any key to continue . . .


Here is for 10k

Code:

Looking for 10000 Primes...
Found 10274 primes in 3.601ms (2852 k/sec)
First 10: 2 3 5 7 11 13 17 19 23 29
Last 10: 104677 104681 104683 104693 104701 104707 104711 104717 104723 104729
Press any key to continue . . .


Here's the code. Optimized threading and updated the algorithm to use prime devisors only.

http://pastebin.com/x57SMVnt

I tried implementing it through the AVX/SSE extensions but the results were a disappointment. I expected at least a 2-4-fold improvement, but since integer math isn't supported yet the floats still eat away too much compared to bare-bone integers. So I dropped it.

Now all cores are running full throttle. It scales like mad.

With 10k primes I get ~2800k primes/sec, with 10 million it only drops to 730k/sec.

Sweet Very Happy

EDiT:
I noticed that on small runs (like 10k) the results vary on the first few runs. The app finishes too quickly, but when it can run for a bit the threads can finally dig-in and it really gets a boost. Possibly the Intel Turbo-Boost kicks in or it just takes a while for the CPU to throttle to full power...


teey
Back to top
BearishSun




Posts: 4484

PostPosted: Mon, 12th Aug 2013 21:22    Post subject:
Nice one!

But you are right, running it in Visual Studio yields much worse results. I can't get it under 15ms, which I why I abandoned multithreaded approach. Running it directly it's much faster, 3.1ms. :/ Good thing to know.
Back to top
WhiteBarbarian




Posts: 6011
Location: Russia
PostPosted: Mon, 12th Aug 2013 23:58    Post subject:
BearishSun wrote:
Scratch that, 1.17ms. Optimized out the sqrt.

http://pastebin.com/2CcHkN83

BearishSun wrote:
It's as clear as you can get it, optimized code is usually pretty ugly.

But in short, it's using vector CPU instructions (i.e. SIMD - Single instruction multiple data), where I can perform various operations on 8 values at once, at the cost of a single operation.

First bit before tmain is emulating abs() function because no such function exists for vector registers.

Then in tmain I populate the first few prime numbers because of the way algorithm works. Nothing special. I also keep a inverse of each prime number because division is a bottleneck in the algorithm, and caching the division saves about 0.5ms.

Then in the main loop I do a bunch of operations to determine if number is divisible by another number. There is no vector equivalent of a modulo operator, and also there is no support for integer division so I do it the hard way. Divide using floating point, then round to nearest integer and then see if the difference between original and rounded is close to 0, and if it is I determine it is divisible and it's not a prime.

And at the end I unpack everything from vector registers back into scalar part of the code. I could have done a loop here instead of doing each unpack manually but I don't trust the compiler to unroll it properly.




Back to top
Page 2 of 2 All times are GMT + 1 Hour
NFOHump.com Forum Index - Programmers Corner Goto page Previous  1, 2
Signature/Avatar nuking: none (can be changed in your profile)  


Display posts from previous:   

Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB 2.0.8 © 2001, 2002 phpBB Group