Generating some normally distributed random numbers in C++
Atropa




Posts: 878

PostPosted: Fri, 28th Nov 2014 18:13    Post subject: Generating some normally distributed random numbers in C++
Hello.

I am solving some equations and need a lot of normally distributed random numbers. They don't have to be super duper random, since the equations are supposed to be mostly driven by a deterministic flow. Writing in C++, I thought std::normal_distribution might be a nice try. I seem to be making an error when using it, since it is very slow. Here is my code.
Code:
#include <stdlib.h>
#include <iostream>
#include <time.h>
#include <random>

using namespace std;

int main()
{
    // Cheap engine + standard-normal distribution from <random>
    minstd_rand0 eng(time(0));
    normal_distribution<double> normal(0.0, 1.0);

    int N = 1024;
    long long int nvol = N * N;
    double *q1 = new double[nvol + 2 * N];
    double *q2 = new double[nvol + 2 * N];

    long starttime = time(0);
    for (int k = 0; k < 1000; k++) {
        // Refill both fields with fresh Gaussian numbers each time step
        for (long long i = 0; i < nvol; i++) {
            q1[i] = normal(eng);
            q2[i] = normal(eng);
        }

        if (k % 100 == 0) {
            cout << time(0) - starttime << endl;
        }
    }

    delete[] q1;
    delete[] q2;
    return 0;
}

 

It takes approximately 40 seconds for 100 iterations of k when I run this code. For comparison, MATLAB on my laptop seems to be able to do 1000 such iterations in under 40 seconds. How do I reach those speeds?
[mrt]
[Admin] Code Monkey



Posts: 1342

PostPosted: Fri, 28th Nov 2014 19:50    Post subject:
Hi,

Well, let me ask you the obvious question: are you running the Release build or the Debug build? I'm asking because my i5-2500 gets through 100 iterations of k in 9 seconds on a single core.

Here is the output:

Code:

0
9
18
28
37
46
56
65
74
83
Press any key to continue . . .


Another thing you could do is see whether you can distribute the calculation across multiple cores and speed it up considerably, or go even further and use SIMD instructions.

Good luck!


teey
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Fri, 28th Nov 2014 19:58    Post subject:
In C#, if I use a manual way to calculate the normal distribution, 10 iterations of k take 36 seconds (laptop with a mobile i5)
Code:

            Stopwatch watch = new Stopwatch();
            List<Double> rndm = new List<Double>();
            watch.Start();
            for (int k = 0; k< 100; k++){
                for (int i = 0; i < 1024*1024; i++)
                {
                    Random rand = new Random();
                    double x = rand.NextDouble(), y = rand.NextDouble();
                    rndm.Add(Math.Sqrt(-2.0 * Math.Log(x)) * Math.Sin(2.0 * Math.PI * y));
                }
                Console.WriteLine(k+": "+watch.ElapsedMilliseconds);
            }
                       
            Console.ReadLine();


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Atropa




Posts: 878

PostPosted: Fri, 28th Nov 2014 20:29    Post subject:
[mrt] wrote:
Hi,

Well, let me ask you the obvious question: are you running the Release build or the Debug build? I'm asking because my i5-2500 gets through 100 iterations of k in 9 seconds on a single core.

Here is the output:

Code:

0
9
18
28
37
46
56
65
74
83
Press any key to continue . . .


Another thing you could do is see whether you can distribute the calculation across multiple cores and speed it up considerably, or go even further and use SIMD instructions.

Good luck!

I am sort of new to all this, since until recently I managed with MATLAB, so bear with me. g++ -v gives me:
gcc version 4.9.1 20140922 (Red Hat 4.9.1-10) (GCC)
I am not sure if that is what you're asking about? Are there any tricks I should know about when compiling?
About parallelizing: a Stack Exchange post told me that generating random numbers should in principle be fast enough that it shouldn't be necessary for only 2 million numbers. Furthermore, I couldn't figure out how to do it with OpenMP :/.


PumpAction wrote:
In C#, if I use a manual way to calculate the normal distribution, 10 iterations of k take 36 seconds (laptop with a mobile i5)
Code:

            Stopwatch watch = new Stopwatch();
            List<Double> rndm = new List<Double>();
            watch.Start();
            for (int k = 0; k< 100; k++){
                for (int i = 0; i < 1024*1024; i++)
                {
                    Random rand = new Random();
                    double x = rand.NextDouble(), y = rand.NextDouble();
                    rndm.Add(Math.Sqrt(-2.0 * Math.Log(x)) * Math.Sin(2.0 * Math.PI * y));
                }
                Console.WriteLine(k+": "+watch.ElapsedMilliseconds);
            }
                       
            Console.ReadLine();

So that is a step in the wrong direction?

Edit: Windows Task Manager reports MATLAB using ~30% in the CPU tab on an i7-3520M. I suppose that is one core, and still it is quite a lot faster than even mrt's result. Hmm.
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Fri, 28th Nov 2014 20:40    Post subject:
2 million numbers?
100 * 1,024 * 1,024 = 100 * 1,048,576 = 104,857,600


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Atropa




Posts: 878

PostPosted: Fri, 28th Nov 2014 20:43    Post subject:
PumpAction wrote:
2 million numbers?
100 * 1,024 * 1,024 = 100 * 1,048,576 = 104,857,600


Sorry, that's because k is a marching time parameter. After each k I do stuff with some different vectors and the random numbers, and then have to generate new ones. So per time step I need 2*1024^2 random numbers.


Last edited by Atropa on Fri, 28th Nov 2014 20:44; edited 1 time in total
WhiteBarbarian




Posts: 6011
Location: Russia
PostPosted: Fri, 28th Nov 2014 20:43    Post subject:
@Atropa

If you are compiling from the shell, use the -O3 optimization flag.
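
For example, something like this (exact flags depend on your setup, and "droplets.cpp" here is just a placeholder for whatever your source file is called):
Code:

g++ -O3 -march=native droplets.cpp -o droplets

-march=native is optional, but it lets GCC use whatever instruction set your CPU supports.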




Last edited by WhiteBarbarian on Fri, 28th Nov 2014 20:47; edited 1 time in total
Atropa




Posts: 878

PostPosted: Fri, 28th Nov 2014 20:47    Post subject:
WhiteBarbarian wrote:
@Atropa

If you are compiling from the shell, use the -O3 optimization flag.


And now MATLAB is only twice as fast.
PumpAction
[Schmadmin]



Posts: 26759

PostPosted: Fri, 28th Nov 2014 21:00    Post subject:
The rest is assembler magic Cool Face


=> NFOrce GIF plugin <= - Ryzen 3800X, 16GB DDR4-3200, Sapphire 5700XT Pulse
Atropa




Posts: 878

PostPosted: Fri, 28th Nov 2014 21:13    Post subject:
MATLAB is really a high-level, all-purpose language. Isn't it a bit weird that it is so much faster?
[mrt]
[Admin] Code Monkey



Posts: 1342

PostPosted: Sun, 30th Nov 2014 12:37    Post subject:
Not really, as its built-in routines are thoroughly optimized. It is, in the end, meant for mathematical calculations. The algorithm is probably split across multiple cores and could use some other magic and approximations that the C++ standard library doesn't.

The only thing slow in MATLAB is the script itself, not the function calls Smile

Anyway, different compilers will give you different speeds. Usually GCC is not the best when it comes to optimizations; MSVC blows it out of the water most of the time, at least, and Intel's compiler is even better as far as I know.

Quote:
Edit: Windows Task Manager reports MATLAB using ~30% in the CPU tab on an i7-3520M. I suppose that is one core, and still it is quite a lot faster than even mrt's result. Hmm.


As I said: probably a more optimized algorithm, and it could be using SSE instructions, which the standard library isn't.


teey
Atropa




Posts: 878

PostPosted: Sun, 30th Nov 2014 16:36    Post subject:
[mrt] wrote:
Not really, as its built-in routines are thoroughly optimized. It is, in the end, meant for mathematical calculations. The algorithm is probably split across multiple cores and could use some other magic and approximations that the C++ standard library doesn't.

The only thing slow in MATLAB is the script itself, not the function calls Smile

Anyway, different compilers will give you different speeds. Usually GCC is not the best when it comes to optimizations; MSVC blows it out of the water most of the time, at least, and Intel's compiler is even better as far as I know.

Quote:
Edit: Windows Task Manager reports MATLAB using ~30% in the CPU tab on an i7-3520M. I suppose that is one core, and still it is quite a lot faster than even mrt's result. Hmm.


As I said: probably a more optimized algorithm, and it could be using SSE instructions, which the standard library isn't.


Aha. I used g++ but will try ICC and MSVC if I can get my sysadmin to install/update them. Is it easy to use the SSE instructions? I really need to make this as fast as possible, since it can mean the difference between running the code for 1 month instead of 2.
BearishSun




Posts: 4484

PostPosted: Sun, 30th Nov 2014 17:52    Post subject:
I pulled my random generator from my old raytracer project; back in the day I used it for generating all kinds of sampling data for rendering. I don't remember how good it is or whether there are any limitations, so you should probably check the random data before going forward with it. It generates floats rather than doubles like your example, but you said precision wasn't important, and you can convert to double later.

It takes under 2 seconds to do all 1000 iterations of k. If your CPU has AVX or better extensions you can update the code and probably make it run 2-4x faster (theoretically).

This works on MSVC; it will probably not work on other compilers as-is, since I assume they handle the intrinsics and alignment declarations differently.

Code:

#include "stdafx.h"
#include <stdlib.h>
#include <iostream>
#include <time.h>
#include <random>
//
using namespace std;

__declspec(align(16)) __m128i CurrentSeed;
#define RAND_MAX_SSE 2147483647

void RTSRand_SSE(unsigned int Seed)
{
   CurrentSeed = _mm_set_epi32(Seed, Seed + 1, Seed, Seed + 1);
}

// Returns a set of 4 random float numbers between 0.0 and 1.0
void RTRand_SSE(__m128& Result)
{
   __declspec(align(16)) __m128i cur_seed_split;
   __declspec(align(16)) __m128i multiplier;
   __declspec(align(16)) __m128i adder;
   __declspec(align(16)) __m128i mod_mask;
   __declspec(align(16)) static const unsigned int mult[4] = { 214013, 17405, 214013, 69069 };
   __declspec(align(16)) static const unsigned int gadd[4] = { 2531011, 10395331, 13737667, 1 };
   __declspec(align(16)) static const unsigned int mask[4] = { 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF, 0x7FFFFFFF };

   adder = _mm_load_si128((__m128i*) gadd);
   multiplier = _mm_load_si128((__m128i*) mult);
   mod_mask = _mm_load_si128((__m128i*) mask);
   cur_seed_split = _mm_shuffle_epi32(CurrentSeed, _MM_SHUFFLE(2, 3, 0, 1));

   CurrentSeed = _mm_mul_epu32(CurrentSeed, multiplier);
   multiplier = _mm_shuffle_epi32(multiplier, _MM_SHUFFLE(2, 3, 0, 1));
   cur_seed_split = _mm_mul_epu32(cur_seed_split, multiplier);

   cur_seed_split = _mm_and_si128(cur_seed_split, mod_mask);
   cur_seed_split = _mm_shuffle_epi32(cur_seed_split, _MM_SHUFFLE(2, 3, 0, 1));
   CurrentSeed = _mm_or_si128(CurrentSeed, cur_seed_split);
   CurrentSeed = _mm_add_epi32(CurrentSeed, adder);
   CurrentSeed = _mm_and_si128(CurrentSeed, mod_mask);

   Result = _mm_div_ps(_mm_cvtepi32_ps(CurrentSeed), _mm_set_ps1((float)RAND_MAX_SSE));
}

int _tmain(int argc, _TCHAR* argv[])
{
   RTSRand_SSE(345662);

   int N = 1024;
   long long int nvol = N*N;
   // Note the parentheses: allocate (nvol + 2*N) floats, not nvol floats plus 2*N bytes
   float *q1 = (float*)_aligned_malloc(sizeof(float) * (nvol + 2 * N), 16);
   float *q2 = (float*)_aligned_malloc(sizeof(float) * (nvol + 2 * N), 16);

   //
   int i;
   long starttime = time(0);
   for (int k = 0; k < 1000; k++){
      for (i = 0; i < nvol; i += 4)
      {
         __m128 output;
         RTRand_SSE(output);
         _mm_store_ps(&q1[i], output);

         RTRand_SSE(output);
         _mm_store_ps(&q2[i], output);
      }

      if (k % 100 == 0){
         cout << time(0) - starttime << endl;
      }
   }

   _aligned_free(q1);
   _aligned_free(q2);
   return 0;
}
BearishSun




Posts: 4484

PostPosted: Sun, 30th Nov 2014 18:02    Post subject:
(Didn't see that you needed a normal distribution; mine isn't one, sorry. You could apply a transform on top of the generated data, although you're on your own for that one.)
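
If it helps, the standard trick for that transform is Box-Muller, which turns two uniform numbers in (0, 1] into two independent standard-normal ones. A minimal sketch of just that step (untested against my generator above; note u1 has to stay strictly above zero or the log blows up):
Code:

#include <cmath>
#include <utility>

// Box-Muller: maps two uniforms u1, u2 in (0, 1] to two independent
// standard-normal samples. u1 must be > 0 so log(u1) stays finite.
std::pair<float, float> BoxMuller(float u1, float u2)
{
   const float two_pi = 6.2831853f;
   const float r = std::sqrt(-2.0f * std::log(u1));
   return { r * std::cos(two_pi * u2), r * std::sin(two_pi * u2) };
}

You would call it on consecutive pairs of the values the generator writes into q1/q2.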
[mrt]
[Admin] Code Monkey



Posts: 1342

PostPosted: Fri, 12th Dec 2014 20:35    Post subject:
Atropa wrote:
Is it easy to use the SSE instructions?


My view is this: when you vectorize the problem nicely, you are already 3/4 of the way there. SIMD is pesky to write, as documentation is somewhat harder to find, especially if it's your first stab at it, but if you are reasonably good in C you catch on quickly.

As for accelerating a problem, I think the difficulty ladder goes something like this:

Easiest: a single-core solution with lookups, approximations and whatever thingamajigs you can think of

A bit harder: a multi-core solution with threads (you have to be careful with synchronization, as it can bog you down quickly)

Even harder: vectorization with SIMD instructions (SSE, AVX, ...), as @BearishSun so neatly demonstrated (nice job!)

Hardest: all of the above in One-Neat-Package Smile

@Atropa What are you working on that you need such speed, if you don't mind me asking?


teey
tainted4ever
VIP Member



Posts: 11336

PostPosted: Fri, 12th Dec 2014 21:29    Post subject:
No one here uses Clang? Sad


Sense Amid Madness, Wit Amidst Folly
WhiteBarbarian




Posts: 6011
Location: Russia
PostPosted: Fri, 12th Dec 2014 23:12    Post subject:
Anybody on Yosemite who uses Brew is using Clang indirectly.


LeoNatan
☢ NFOHump Despot ☢



Posts: 73238
Location: Ramat HaSharon, Israel 🇮🇱
PostPosted: Sat, 13th Dec 2014 00:15    Post subject:
tainted4ever wrote:
No one here uses Clang? Sad

Atropa




Posts: 878

PostPosted: Sun, 14th Dec 2014 09:51    Post subject:
BearishSun wrote:
(Didn't see that you needed a normal distribution; mine isn't one, sorry. You could apply a transform on top of the generated data, although you're on your own for that one.)


Thank you for the code. Unfortunately I haven't had the time to go any further with it. I will have to read up on how to make an efficient transform from uniform to Gaussian distribution when I get the time (hopefully over Christmas).

[mrt] wrote:
Atropa wrote:
Is it easy to use the SSE instructions?


My view is this: when you vectorize the problem nicely, you are already 3/4 of the way there. SIMD is pesky to write, as documentation is somewhat harder to find, especially if it's your first stab at it, but if you are reasonably good in C you catch on quickly.

As for accelerating a problem, I think the difficulty ladder goes something like this:

Easiest: a single-core solution with lookups, approximations and whatever thingamajigs you can think of

A bit harder: a multi-core solution with threads (you have to be careful with synchronization, as it can bog you down quickly)

Even harder: vectorization with SIMD instructions (SSE, AVX, ...), as @BearishSun so neatly demonstrated (nice job!)

Hardest: all of the above in One-Neat-Package Smile

@Atropa What are you working on that you need such speed, if you don't mind me asking?


I'm not reasonably good in C Sad. I am mostly a complete newbie, but I find it somewhat easy to learn as long as I have a clear idea of what I want to do. For now I am using the easiest solution, with lookups and only generating uniform random numbers. This saves around 1/3 of the time, I think. I will take a stab at the multicore solution later. Earlier I had some problems with state-dependent generators producing the same "random" number on different cores, but I guess I could just give each core its own generator and it should work.
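
Something like this is roughly what I have in mind for later (a rough, untested sketch assuming OpenMP, with one engine per thread and the seeds offset by the thread number so the threads don't all produce the same sequence):
Code:

#include <omp.h>
#include <random>

// Fill out[0..n-1] with standard-normal numbers, one engine per thread.
void fill_gaussian(double *out, long long n, unsigned int base_seed)
{
    #pragma omp parallel
    {
        // Each thread gets its own engine and distribution, so there is no
        // shared state; the seed offset keeps the streams different.
        std::mt19937 eng(base_seed + omp_get_thread_num());
        std::normal_distribution<double> normal(0.0, 1.0);

        #pragma omp for
        for (long long i = 0; i < n; i++)
            out[i] = normal(eng);
    }
}

Compiled with something like g++ -O3 -fopenmp.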

The random number generator is used in a study of droplet growth I am doing (sort of like this: http://en.wikipedia.org/wiki/Ostwald_ripening, but not completely). We have this nice model (a partial differential equation) which can be used to simulate all sorts of phenomena. To include thermal fluctuations in the model, a usual method is to add a stochastic term to the dynamics (the random numbers; since it's thermal, it has to be Gaussian). Essentially, what we want to figure out is how the amplitude (sort of a temperature) of the noise (the random numbers) affects the dynamics of the droplet growth. The only reasonable limit for the model is one big droplet in the end (see the picture on the wiki page), but getting to this limit can be rather time consuming, since the dynamics can become very, very slow, hence long simulations. When doing long simulations it is crucial to have each time step be as fast as possible, which is why I need a speedy random number generator Smile
[mrt]
[Admin] Code Monkey



Posts: 1342

PostPosted: Sun, 14th Dec 2014 18:14    Post subject:
@Atropa Sounds like quite a hobby! Smile

Have a look at this, maybe you can use it somewhat: http://http.developer.nvidia.com/GPUGems3/gpugems3_ch37.html

CUDA acceleration might get you somewhere. There are examples with code for all kinds of generators; have a look. Although, if you want really fast results, have you thought about a hardware solution?
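
For reference, the cuRAND library that ships with CUDA can fill a GPU buffer with normally distributed numbers directly. A very rough host-API sketch (assumes a CUDA-capable GPU and the cuRAND library; error checking left out):
Code:

#include <cuda_runtime.h>
#include <curand.h>

int main()
{
   const size_t n = 2 * 1024 * 1024;   // one time step's worth of samples
   float *d_samples = 0;
   cudaMalloc((void**)&d_samples, n * sizeof(float));

   curandGenerator_t gen;
   curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
   curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
   curandGenerateNormal(gen, d_samples, n, 0.0f, 1.0f);   // mean 0, stddev 1

   // Copy back with cudaMemcpy if the rest of the code stays on the CPU,
   // or keep the data on the GPU if the solver moves there too.

   curandDestroyGenerator(gen);
   cudaFree(d_samples);
   return 0;
}

Whether it pays off depends on how much of the rest of the time step you can keep on the GPU as well.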


teey