Custom Functions in dplyr pipelines

In plyr, you can apply your own function to each group like this:
> tmp<-data.frame(a=rep(c(1,2,3),each=4),b=1:12)

> tmp
   a  b
1  1  1
2  1  2
3  1  3
4  1  4
5  2  5
6  2  6
7  2  7
8  2  8
9  3  9
10 3 10
11 3 11
12 3 12


> ddply(tmp,.(a),function(x) c(bb=sum(x$b)))
  a bb
1 1 10
2 2 26
3 3 42
For this trivial example, you'd probably use summarise(), but when you need to manipulate several columns in more complex ways, writing your own function can be more convenient.

If you try to translate this into dplyr in a naive and direct way, you will get a silently wrong result:

> tmp %>% group_by(a) %>% (function(x) data.frame(bb=sum(x$b)))
  bb
1 78
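WTF, it didn't group! The pipe calls the function once on the whole grouped data frame, so x$b is the entire column and the sum comes out as 78.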
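
To see what the function actually receives, here is a minimal sketch (assuming dplyr is loaded, and rebuilding the same tmp as above; rows_seen is just an illustrative name). It counts the rows handed to the anonymous function:

```r
library(dplyr)

# Same example data as above
tmp <- data.frame(a = rep(c(1, 2, 3), each = 4), b = 1:12)

# The pipe calls the anonymous function once, passing the whole
# grouped data frame; the groups are only metadata attached to it.
rows_seen <- tmp %>% group_by(a) %>% (function(x) nrow(x))
rows_seen
```

If the function were applied per group, it would see 4 rows at a time; instead it sees all 12 at once.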

So how can you do this? You have to wrap your function rather awkwardly in do():

> tmp %>% group_by(a) %>% do((function(x) data.frame(bb=sum(x$b)))(.))
Source: local data frame [3 x 2]
Groups: a [3]
      a    bb
  (dbl) (int)
1     1    10
2     2    26
3     3    42
Also note that a function called this way inside do() must return a data frame. This will fail:
> tmp %>% group_by(a) %>% do((function(x) c(bb=sum(x$b)))(.))
Error: Results are not data frames at positions: 1, 2, 3
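
One way to make this less awkward is to give the function a name and make sure it returns a data frame. The following is only a sketch: sum_b and sum_b_vec are hypothetical helper names, and tmp is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same example data as above
tmp <- data.frame(a = rep(c(1, 2, 3), each = 4), b = 1:12)

# Hypothetical helper: takes one group's rows, returns a one-row data frame
sum_b <- function(x) data.frame(bb = sum(x$b))

# Inside do(), the dot stands for the current group's data
res <- tmp %>% group_by(a) %>% do(sum_b(.))

# Hypothetical variant: keep the plyr-style named vector, but convert it
# to a data frame before returning so do() accepts it
sum_b_vec <- function(x) as.data.frame(as.list(c(bb = sum(x$b))))
res_vec <- tmp %>% group_by(a) %>% do(sum_b_vec(.))
```

Both res and res_vec should have one row per value of a, with the per-group sums in bb.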
