Categories

# Functions with multiple results in tidyverse

I have continued playing with the tidyverse for different parts of a couple of projects.

Often I need to apply a function by groups of observations; sometimes, that function returns more than a single number. It could be something like for each group fit a distribution and return the distribution parameters. Or, simpler for the purposes of this exploration, calculate and return a bunch of numbers.

```describe_c <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(c(mean = mn, sdev = dev, count = n, coefvar = cv))
}
```

If I have a data frame called `field_data`, with family codes (trees with the same parents, codes have been changed to protect the innocent) and stem diameters (in mm), I could do the following in base R:

```# This line produces an annoying list
summary_one <- with(field_data, tapply(stem, family, FUN = describe_v))

# This puts together a matrix by joining
# the list results using rbind()
summary_one <- do.call(rbind, summary_one)

# To continue processing it might be better to convert
# to a data frame
summary_one <- data.frame(summary_one)
```

And if I need to do this for several variables, I will need to merge each of these matrices in a data frame.

Continuing with my experimentation with the tidyverse, I was wondering how to get the above going with dplyr et al. After failing a few times I asked the question in Twitter and got a number of helpful replies.

One of the keys is that `dplyr` can store a list result from a function. Modifying my toy function is pretty straightforward, and now looks like:

```describe_list <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(list(c(mean = mn, sdev = dev, count = n, coefvar = cv)))
}
```

And we can check the contents of `summary_two` to see we have a list in which each element contains 4 values:

```head(summary_two)
#  A tibble: 6 x 2
#   family     model
#
# 1      A
# 2      B
# 3      C
# 4      D
# 5      E
# 6      F
```

We still need to extract the elements of each element of the list and assign them to a variable name. Using `map` from the `purrr` package is pretty straightforward in this case, and we can extract the values either using their names or their position in the element.

```summary_two %>%
mutate(mn = map_dbl(model,'mean'),
sd = map_dbl(model,'sdev'),
n = map_dbl(model,'count'),

#  A tibble: 6 x 6
#   family     model       mn       sd     n       cv
#
# 1      A  190.8306 23.71290   425 12.42615
# 2      B  190.1111 25.46554   396 13.39508
# 3      C  188.2646 27.39215   461 14.54981
# 4      D  189.2668 25.16330   431 13.29514
# 5      E  183.5238 19.70182    21 10.73530
# 6      F  183.1250 28.82377    24 15.73994
```

I'm still playing with ideas to be lazier at extraction time. An almost abhorrent idea is to provide the output as character for posterior type conversion, as in:

```describe_char <- function(x) {
mn <- mean(x, na.rm = TRUE)
dev <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))
cv <- dev/mn*100
return(paste(mn, dev, n, cv, sep = ':'))
}

field_data %>%
group_by(family) %>%
summarise(model = describe_char(stem)) -> summary_three

# A tibble: 6 x 2
#   family                                                  model
#
# 1      A 190.830588235294:23.7128956613006:425:12.4261502731746
# 2      B 190.111111111111:25.4655444116168:396:13.3950847284951
# 3      C  188.26464208243:27.3921487349435:461:14.5498105390125
# 4      D 189.266821345708:25.1632953227626:431:13.2951434085746
# 5      E   183.52380952381:19.7018249094317:21:10.7352963959021
# 6      F           183.125:28.8237711378767:24:15.7399432834822

summary_three %>%
separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>% head

# A tibble: 6 x 5
#   family               mn               sd     n               cv
#
# 1      A 190.830588235294 23.7128956613006   425 12.4261502731746
# 2      B 190.111111111111 25.4655444116168   396 13.3950847284951
# 3      C  188.26464208243 27.3921487349435   461 14.5498105390125
# 4      D 189.266821345708 25.1632953227626   431 13.2951434085746
# 5      E  183.52380952381 19.7018249094317    21 10.7352963959021
# 6      F          183.125 28.8237711378767    24 15.7399432834822
```

And we can get all the way there with:

```summary_three %>%
separate(model, c('mn', 'sd', 'n', 'cv'), sep = ':') %>%
mutate_at(c('mn', 'sd', 'n', 'cv'), as.numeric) %>% head

# A tibble: 6 x 5
#   family       mn       sd     n       cv
#
# 1      A 190.8306 23.71290   425 12.42615
# 2      B 190.1111 25.46554   396 13.39508
# 3      C 188.2646 27.39215   461 14.54981
# 4      D 189.2668 25.16330   431 13.29514
# 5      E 183.5238 19.70182    21 10.73530
# 6      F 183.1250 28.82377    24 15.73994
```

Which I assume has all sort of potential negative side-effects, but looks really cool.

In case you want to play with the problem, here is a tiny example of field data.