m x n matrix with randomly assigned 0/1

2012-08-29 / Luis

Today Scott Chamberlain tweeted asking for a better/faster solution to building an m x n matrix with randomly assigned 0/1. He already had a working version:

r <- 1000
c <- 1000
m0 <- matrix(0, r, c)
apply(m0, c(1,2), function(x) sample(c(0,1),1))

Now, I’m the first to acknowledge that I’ve never got the ‘apply’ family of functions and that—thanks Hadley—if I need to do something like that I go for the plyr package. Nevertheless, if you think about it, the starting point is a sugary version of a loop; I like loops because they are explicit and clear, but they can be very slow in R. The loop could be written as:

m00 <- matrix(0, r, c)
for(i in 1:r) {
  for(j in 1:c) {
    m00[i, j] <- sample(c(0,1),1)
  }
}

In contrast, my first idea was to generate a bunch of uniformly distributed [0, 1) random numbers and round them to the closest integer, which is a more ‘matricy’ way of thinking:

m1 <- round(matrix(runif(r*c), r, c)))

and it happens to be a lot faster. My second idea, and Edmund Hart beat me to that one, was to use something like:

m2 <- matrix(rbinom(r*c,1,0.5),r,c)

which is a nice option, generating r*c random numbers following a binomial distribution, which also has the advantage of allowing for different probabilities, rather than 0.5 as in m1. This is also faster than m1, although probably one would not notice the difference until working with fairly big matrices. When Scott pointed out the speed bump he got me thinking about order of the operations; would this be better?

m3 <- matrix(round(runif(r*c)), r, c)

In terms of speed, m3 fits between m1 and m2, so the order can make a difference.

David Smith and Rafael Maia came up with a different approach, using sample() which I had not considered at all. m4 has the advantage that one could use any number of values to randomize.

m4 <- matrix(sample(0:1,r*c, replace=TRUE),r,c)

Any of these options can be timed using the system.time() function as in system.time(m3 = matrix(round(runif(r*c)), r, c)). It’s interesting how different people come up with alternative strategies (reflecting different ways of thinking about the problem) using exactly the same language. I’m not sure what should be the ‘obvious’ solution; but for almost any purpose any of them (with the exception of the explicit loop) will work just fine.

Gratuitous picture: training horses in the beach (Photo: Luis).

P.S. 2012-08-29, 22:33. Another way that would work (although 4 times slower than m1) would be m5 = matrix(ifelse(runif(r*c) < 0.5, 0, 1)). Still, it takes less than 0.5 seconds for a 1,000,000 elements matrix.
P.S. 2012-09-04. Scott Chamberlain wrote his own blog post discussing the different suggestions. Even cooler, Dirk Eddelbuettel implemented even faster creation of random 1/0 matrices using Rcpp, with inline C++ code. Much faster in a crazy overkill sort of way.

programming, r, rblogs, teaching

9 Comments

Ryan
2012-08-29 at 13:17

Interesting post.
m4 would be the usual approach I would use, but if efficiency is an absolute must this code is marginally quicker:

system.time((function(row, col){
temp <- sample(0:1, size = row*col, replace = TRUE)
dim(temp) <- c(row, col)
return(temp)
})(10000, 10000)
)

For the 10000^2 case above, creating a {0, 1} vector and then making it two-dimensional (converting it to a matrix) is around 15% faster than the m4 function.
- Luis (Post author)
  2012-08-29 at 13:27
  
  Thanks for the code. I’m still thinking if there is any other very different implementation.
Stephen Henderson
2012-08-29 at 20:33

binmatfun = function(mn)
{
mat=sample(0:1, mn^2, replace=T)
dim(mat)=c(mn ,mn)
}
- Luis (Post author)
  2012-08-29 at 21:41
  
  Yes, converting to a matrix takes some time as your function and the previous commenter’s try to reduce. One option would be to work only with vectors and totally sidestep this time; the feasibility of this approach will depend on what Scott was thinking of doing later. Thanks for the idea.
Berend Hasselman
2012-08-29 at 23:14

If you change the second example with the double for loop to this

m00 <- matrix(0, r, c)
for(j in 1:c){
m00[,j] <- sample(c(0,1),replace=TRUE,r)
}

where each column is filled in one go, you'll see that this is also quite quick.
So your conclusion about the loop only seems to hold if you fill the matrix a single element at a time.
- Luis (Post author)
  2012-08-30 at 08:35
  
  But then you are moving into matrix operations, so then why not do m1, m2, m3, m4 or m5? At the extreme I could have something like:
  
  for(i in 1){
  m2 <- matrix(rbinom(r*c,1,0.5),r,c) } It is a single pass loop but, in my view, if you are going to go for a matrix approach just do it the whole way. It is faster, more compact and, many times, very readable.
Carrie
2013-02-02 at 01:43

Thank you! This helped me tremendously. I wanted to make random 0/1 matrices with a fixed number of 1s. This is what I ended up doing.

# Specify the number of 1s needed in the matrix.
y<-100

# Make a vector with the appropriate number of 0s and 1s.
x<-c(rep(1, y), rep(0, r*c-y))

m6 <- matrix(sample(x,r*c, replace=FALSE),r,c)
sum(m6)==y #Check that it worked- this should be true.

Is there a better way?
- Luis (Post author)
  2013-02-02 at 08:05
  
  Hi Carrie, I think you nailed it and works very fast.
- Carrie
  2013-02-07 at 17:04
  
  It turns out that I actually needed matrices that had at least one 1 in each row and column. Here’s what I came up with and put in all in a loop. So far it seems to work.
  
  # Stand-alone example
  y<-20 #Number of 1s needed in the matrix
  r<-10 #Rows
  c<-10 #Columns
  
  # Make a matrix and fill it with NAs
  mrand<-matrix(NA, r, c)
  randcolc
  for(j in 1:r){
  mrand[j,randcol[j]]<-1
  }
  coltots<-.colSums(mrand, r, c, na.rm=TRUE) #sum each column, ignoring NAs. Save as a vector.
  fix<-grep(0, coltots) #Find which columns have 0 sums and save as a vector.
  randrow<-sample(1:r, length(fix), replace=TRUE) #Randomly selects as many row numbers as there are columns with sum 0.
  for(k in 1:length(fix)){ #loop through the columns that need a 1.
  mrand[randrow[k], fix[k]]<-1
  }
  msum<-sum(mrand, na.rm=TRUE)
  vec<-c(rep(1, y-msum), rep(0, r*c-y)) #Make a vector with the remaining number of 1s to be filled in and the rest 0s
  fill<-which(is.na(mrand)==TRUE) #Make a vector of the index numbers of each element of the matrix that needs to be fill (i.e. is currently NA)
  fillvalues<-sample(vec, length(fill), replace=FALSE) #randomly sample values from vec without replacement.
  for(l in 1:length(fill)){
  mrand[fill[l]]<-fillvalues[l] #Put the randomly sampled 1s and 0s in the places in the matrix that were NA.
  }
  #mrand
  sum(mrand)==y #Does the matrix have the right number of 1s?
  
  Any suggestions for improvement?