NFOHump.com :: View topic - Fastest Prime Number Generator on NFO! FIGHT!

Page 2 of 2

Goto page Previous 1, 2

paxsali
Banned

Posts: 18338

Posted: Mon, 12th Aug 2013 15:03 Post subject:	Quick Quote Quote

⁢⁢ Last edited by paxsali on Thu, 4th Jul 2024 22:05; edited 1 time in total

Profile
PM

BearishSun

Posts: 4484

Posted: Mon, 12th Aug 2013 15:16 Post subject:	Quick Quote Quote

Yup, I just add the first 8 to ensure that N is never divisible by any of the last 7 primes, since I can't check those yet. Didn't really do any math behind it, just guesstimated that 8 first primes should be enough. EDIT: Gave it some more thought. First 8 primes is overkill. It's enough if I start at N = 8 (first 4 primes), and I can ensure that nothing in a single iteration is divisible by each other.

Profile
PM

PumpAction
[Schmadmin]

Posts: 26759

Posted: Mon, 12th Aug 2013 16:54 Post subject:	Quick Quote Quote

Yupp, that was what put me off your gpu solution as you can't be sure at 1000 concurrent calcs, that it won't miss a previous prime in another thread, except for having a really big n to start with... => NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse

Profile
PM

BearishSun

Posts: 4484

Posted: Mon, 12th Aug 2013 17:47 Post subject:	Quick Quote Quote

I'm convinced the GPU version should be much faster than it is. I haven't used AMP much but it seems to have a lot of overhead which is why my solution is so slow. Just initializing the array_view takes 40+ms !? I'm not sure whats going on there.

Profile
PM

[mrt]
[Admin] Code Monkey

Posts: 1342

Posted: Mon, 12th Aug 2013 20:40 Post subject:

Aye Bearish!

Your proggy did it in 4.25ms (2.2k/sec) on my 2500k box.

Both apps were run from Windows directly, not through the VS2012 Run command (that slows it down somewhat - wierd).

I'm up! Very Happy

Here is a bench for 10 million primes.

Code:

Looking for 10000000 Primes...
Found 10000152 primes in 14077.582ms (710 k/sec)
First 10: 2 3 5 7 11 13 17 19 23 29
Last 10: 177084823 177084827 177084829 177084833 177084847 177084911 177084917 1
77084949 177084977 177084979
Press any key to continue . . .

Here is for 10k

Code:

Looking for 10000 Primes...
Found 10274 primes in 3.601ms (2852 k/sec)
First 10: 2 3 5 7 11 13 17 19 23 29
Last 10: 104677 104681 104683 104693 104701 104707 104711 104717 104723 104729
Press any key to continue . . .

Here's the code. Optimized threading and updated the algorithm to use prime devisors only.

http://pastebin.com/x57SMVnt

I tried implementing it through the AVX/SSE extensions but the results were a disappointment. I expected at least a 2-4-fold improvement, but since integer math isn't supported yet the floats still eat away too much compared to bare-bone integers. So I dropped it.

Now all cores are running full throttle. It scales like mad.

With 10k primes I get ~2800k primes/sec, with 10 million it only drops to 730k/sec.

Sweet Very Happy

EDiT:
I noticed that on small runs (like 10k) the results vary on the first few runs. The app finishes too quickly, but when it can run for a bit the threads can finally dig-in and it really gets a boost. Possibly the Intel Turbo-Boost kicks in or it just takes a while for the CPU to throttle to full power...

teey

Profile
PM

BearishSun

Posts: 4484

Posted: Mon, 12th Aug 2013 21:22 Post subject:	Quick Quote Quote

Nice one! But you are right, running it in Visual Studio yields much worse results. I can't get it under 15ms, which I why I abandoned multithreaded approach. Running it directly it's much faster, 3.1ms. :/ Good thing to know.

Profile
PM

WhiteBarbarian

Posts: 6011
Location: Russia

Posted: Mon, 12th Aug 2013 23:58 Post subject:

BearishSun wrote:

Scratch that, 1.17ms. Optimized out the sqrt.

http://pastebin.com/2CcHkN83

BearishSun wrote:

It's as clear as you can get it, optimized code is usually pretty ugly.

But in short, it's using vector CPU instructions (i.e. SIMD - Single instruction multiple data), where I can perform various operations on 8 values at once, at the cost of a single operation.

First bit before tmain is emulating abs() function because no such function exists for vector registers.

Then in tmain I populate the first few prime numbers because of the way algorithm works. Nothing special. I also keep a inverse of each prime number because division is a bottleneck in the algorithm, and caching the division saves about 0.5ms.

Then in the main loop I do a bunch of operations to determine if number is divisible by another number. There is no vector equivalent of a modulo operator, and also there is no support for integer division so I do it the hard way. Divide using floating point, then round to nearest integer and then see if the difference between original and rounded is close to 0, and if it is I determine it is divisible and it's not a prime.

And at the end I unpack everything from vector registers back into scalar part of the code. I could have done a loop here instead of doing each unpack manually but I don't trust the compiler to unroll it properly.

Profile
PM

Page 2 of 2

All times are GMT + 1 Hour