First sketch at a sketch join #758
Conversation
eps: Double,
seed: Int,
reducers: Option[Int])
(implicit serialization: K => Array[Byte],
Can we use a type that does not extend function here? When you have an implicit function in scope, it can be used for implicit conversions, which will give K all the methods of Array[Byte] (there are a lot: http://www.scala-lang.org/api/current/index.html#scala.Array) and might be confusing.
trait Encoder[K, T] {
  def apply(k: K): T
}
or we can punt, add this to bijection, and make bijection a dep of scalding.
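To make the concern concrete, here is a minimal, self-contained sketch (UserId and userIdToBytes are made-up names, not anything in this PR): an implicit value of function type also acts as an implicit view, so K quietly picks up Array[Byte]'s members.

object ImplicitLeakSketch {
  case class UserId(value: String)

  // the implicit serialization function, shaped like the one in the signature above
  implicit val userIdToBytes: UserId => Array[Byte] =
    (u: UserId) => u.value.getBytes("UTF-8")

  val id = UserId("abc")
  // Both lines compile only because the compiler inserts userIdToBytes(id);
  // UserId has silently gained Array[Byte]'s members.
  val n: Int = id.length   // really userIdToBytes(id).length
  val b: Byte = id(0)      // really userIdToBytes(id).apply(0)
}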
added: twitter/bijection#158
I was just copying what SketchMap did. It seems like importing bijection provides something this can use for String, at least, though TBH I'm not sure exactly what's going on there. I'm +1 on an explicit Encoder typeclass though, and I think having Scalding depend on bijection is inevitable and fine.
... BTW since really this is about hashing, it could also be an argument for a Hashable typeclass in scalding, which I know has come up before.
That is probably cleaner, or Hashable in Algebird.
There was a proposal, but I could never get something that looked really good. Something like:
trait Hashable32[K] {
  def hash(k: K): Int
}

trait Hashable64[K] extends Hashable32[K] {
  def hash2(k: K): Long
  def hash(k: K): Int = {
    val l = hash2(k)
    ((l >>> 32) ^ l).toInt
  }
}

trait Hashable128[K] extends Hashable64[K] {
  def hash4(k: K): (Long, Long)
  def hash2(k: K) = {
    val h = hash4(k)
    h._1 ^ h._2
  }
}
might do. I went overboard last time (shocker).
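As a hypothetical illustration of how the proposal would be used (assuming the Hashable32/Hashable64 traits sketched above): an instance only supplies the widest hash it knows, and the narrower ones come for free. The bit-mix below is just a placeholder, not a real hash function choice.

object HashableSketch {
  // hypothetical instance: only hash2 is supplied, hash(k): Int is inherited
  implicit val longHashable64: Hashable64[Long] = new Hashable64[Long] {
    def hash2(k: Long): Long = {
      var h = k
      h ^= (h >>> 33)
      h *= 0xff51afd7ed558ccdL // arbitrary odd constant, illustration only
      h ^= (h >>> 33)
      h
    }
  }

  // hypothetical call site: a 32-bit bucket from whatever Hashable is in scope
  def bucketOf[K](k: K, numBuckets: Int)(implicit h: Hashable32[K]): Int =
    ((h.hash(k) % numBuckets) + numBuckets) % numBuckets
}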
That looks pretty good, though I might use hash64 and hash128 for the method names, for consistency with the trait names?
So... does this go in algebird or in bijection?
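For concreteness, a sketch of the same proposal with that renaming applied (a purely mechanical rename of the code above, nothing else intended):

trait Hashable32[K] {
  def hash(k: K): Int
}

trait Hashable64[K] extends Hashable32[K] {
  def hash64(k: K): Long
  def hash(k: K): Int = {
    val l = hash64(k)
    ((l >>> 32) ^ l).toInt
  }
}

trait Hashable128[K] extends Hashable64[K] {
  def hash128(k: K): (Long, Long)
  def hash64(k: K): Long = {
    val (hi, lo) = hash128(k)
    hi ^ lo
  }
}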
I guess algebird. algebird-hash? and then depend on that in algebird-core?
Sounds right. Do we want to block this PR on that? My vote is that we get this into 0.9.0 as is (once I've addressed other feedback), then try to do algebird-hash "right" for the next round of releases.
Nope. Let's get this in with (K => Array[Byte]) and fix it all later.
As for tests, it would be great to have some driven by a given distribution. Testing the case where you have a lot of keys missing on the left that appear on the right would be good. Also the case where every key appears exactly once. Lastly, cases where we have a key that is half the total count.
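A rough sketch of generators for those three cases (the helper names here are made up, not from the PR's test code):

object SkewTestDistributions {
  val rng = new java.util.Random(123)

  // every key appears exactly once
  def uniqueKeys(n: Int): List[(Int, Int)] =
    (0 until n).map { k => (k, rng.nextInt) }.toList

  // one key accounts for half of the total count
  def halfOnOneKey(n: Int, hotKey: Int = -1): List[(Int, Int)] =
    List.fill(n / 2)((hotKey, rng.nextInt)) :::
      (0 until n / 2).map { k => (k, rng.nextInt) }.toList

  // the left side is missing many keys that do appear on the right
  def manyKeysMissingOnLeft(n: Int): (List[(Int, Int)], List[(Int, Int)]) =
    (uniqueKeys(n), (0 until 2 * n).map { k => (k, rng.nextInt) }.toList)
}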
By the way: Dmitriy explained why bucketing on rand causes problems without a seed: if reducer R1 thinks it has all the input it needs, it runs and finishes. Now, mapper M1 restarts because R2 failed, but this means we bucket differently, and the proof that this algorithm works breaks down (and indeed, it does not work). (@dvryaboy wrote something similar last time.)
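A minimal sketch of that determinism requirement (an illustration of the constraint, not the PR's actual bucketing code): the replica chosen for a record has to come out the same if the mapper attempt is re-run, so it should be a pure function of a fixed seed and the record rather than of an unseeded Random.

import scala.util.hashing.MurmurHash3

// illustrative only: a replica assignment that survives mapper re-execution,
// because it depends only on the fixed seed and the record itself
def replicaFor[K, W](seed: Int, key: K, value: W, numReplicas: Int): Int = {
  val h = MurmurHash3.mix(MurmurHash3.mix(seed, key.hashCode), value.hashCode)
  ((h % numReplicas) + numReplicas) % numReplicas
}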
//the most of any one reducer we want to try to take up with a single key
val maxReducerFraction = 0.1

private def flatMapWithReplicas[V](pipe: TypedPipe[(K,V)])(fn: Int => Iterable[Int]) =
this V parameter is confusing since there is also a V on the class. Are they distinct? I think so. Can you make this W or something not used in the class?
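A sketch of what that rename might look like; the return type and body here are stand-ins for illustration, not the PR's implementation.

import com.twitter.scalding.typed.TypedPipe

class ReplicaSketch[K, V](numReducers: Int) {
  // method-level type parameter named W so it cannot shadow the class-level V
  private def flatMapWithReplicas[W](pipe: TypedPipe[(K, W)])(fn: Int => Iterable[Int]): TypedPipe[((Int, K), W)] =
    pipe.flatMap { case (k, w) => fn(numReducers).map { i => ((i, k), w) } }
}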
Think this is probably in decent shape, but still needs a good test.
Yep. As soon as we have tests, it's good. This is really great. Hopefully this + composable joins + optimizer in matrix 2 + bug fix for fields skewJoin will mean 0.9.0 is a really good release.
import Dsl._

val rng = new java.util.Random
def generateInput(size: Int, max: Int, dist: (Int) => Int): List[(Int,Int)] = {
dist is unused for now, right?
I don't think it is super essential that we use dist. We could just run tests where size ~ max and where max << size (so we can expect keys to be repeated many times).
I'm comfortable merging if you remove the unused dist parameter. What do you think? This test (being random) seems like the kind where, if it is wrong, we will eventually stumble over it.
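One hedged sketch of how the dist parameter could actually be exercised, treating it as a map from a uniform draw to a (possibly skewed) key; the exact shape is a guess, not what the PR ends up doing:

object SketchJoinTestInput {
  val rng = new java.util.Random(42)

  def generateInput(size: Int, max: Int, dist: Int => Int): List[(Int, Int)] =
    (0 until size).map { _ =>
      val key = math.abs(dist(rng.nextInt(max))) % max
      (key, rng.nextInt)
    }.toList

  // size ~ max, roughly uniform keys
  val uniform = generateInput(1000, 1000, identity)
  // max << size: every key repeated many times, skewed toward small keys
  val skewed = generateInput(10000, 100, x => x * x / 100)
}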
I was just pushing WIP here for the tests. I actually would like to add some with different dists. Also: this test was cargo-culted somewhat from the skew join tests, and what I discovered is that because ++ does not actually mutate the Buffer objects, it's always just comparing List() == List(). When I switch it to ++=, I get a failure with run but a pass with runHadoop. So I want to at least fix that before merging. Separately, we should fix the skew join test. If you have any ideas about why local mode would fail, lemme know. Is the ValuePipe stuff known not to work there?
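For reference, a tiny standalone illustration of the ++ vs ++= difference being described:

object BufferAppendSketch extends App {
  import scala.collection.mutable.Buffer

  val results = Buffer[Int]()
  results ++ List(1, 2, 3)   // builds a new collection; `results` stays empty
  results ++= List(1, 2, 3)  // mutates in place; `results` is now Buffer(1, 2, 3)
  assert(results == Buffer(1, 2, 3))
}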
Okay. Good catch. Actually, I was concerned about how your code works with both .run and .runHadoop. They are sharing a mutable buffer, right?
ValuePipe is believed to work (it is just a type wrapper on something that was the result of a total aggregation down to a single, or possibly empty, value).
Yeah, it's not great (again, cargo-culted from the skew join tests) - the results just get combined into the same buffer, so if you have a problem in one but not the other, it shows up, but it can be hard to tell what's going on.
@johnynek see avibryant@099e69a for a more minimal example of the problem I'm seeing in local mode. Looks like something pretty deep in HashJoin...
Looks like a cascading bug. I guess we need to report it to @cwensel.
See latest commit - works ok if you just use cross.
@johnynek not totally clear it's a cascading bug - it could be in the typed hashJoin stuff, which cross() bypasses.
I think we should be good to merge this PR now though, at least.
//if the frequency is 0, maxReplicas.ceil will be 0 so we will filter out this key entirely
//if it's < maxPerReducer, the ceil will round maxReplicas up to 1 to ensure we still see it
val replicas = fn(maxReplicas.ceil.toInt.min(numReducers))
replicas.toList.map{i => (i,v._1) -> v._2}
why toList here?
Good catch, this was needed at one point but no longer.
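To make the replica math in the diff above easier to follow, here is a self-contained restatement of the logic the two comments describe; taking maxPerReducer to be totalCount * maxReducerFraction is my assumption from the field shown earlier, not something spelled out in the diff.

// illustrative restatement, not the PR's code
def replicaCount(frequency: Double, totalCount: Double, numReducers: Int): Int = {
  val maxReducerFraction = 0.1                        // as in the diff above
  val maxPerReducer = totalCount * maxReducerFraction // assumed definition
  val maxReplicas = frequency / maxPerReducer
  // frequency == 0                 => ceil is 0, the key is filtered out entirely
  // 0 < frequency < maxPerReducer  => ceil rounds up to 1, so the key is kept
  math.min(math.ceil(maxReplicas).toInt, numReducers)
}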
@avibryant about leftCross: if it is not a cascading bug, it seems strange that the job works in hadoop mode but not local mode. leftCross is just calling hashJoin with a particular joiner. I'll look at it after the merge.
First sketch at a sketch join
Note: no test yet.
This join is appropriate where you have a large, skewed left-hand side that you want to join to a small right-hand side: for example, joining page views against pages, where the most popular pages each make up a significant fraction of the total. It builds a CountMinSketch of the keys in the left-hand side, then uses that to decide how many replicas of each item in the right side to produce, and then does a normal join.
The semantics are similar to hash joins: you have to go directly from (TypedPipe[K,V], TypedPipe[K,V2]) to some TypedPipe[K,R], with no guarantee of ever seeing the entirety of a group at once (and in fact, we reuse the hashInner2 and hashLeft2 joiner functions).

Rough usage:

left
  .groupBy{_.foo}
  .sketch()
  .join(right.groupBy{_.bar})
  .withReducers(20)
  .values
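To make the description above concrete, here is a hedged, in-memory sketch of the idea, with a plain count map standing in for the CountMinSketch and Seqs standing in for TypedPipes; the names, the 0.1 default (mirroring the maxReducerFraction field in the diff), and the exact replica formula are illustrative, not the PR's API.

object SketchJoinIdea {
  def sketchJoin[K, V, V2](
    left: Seq[(K, V)],
    right: Seq[(K, V2)],
    numReducers: Int,
    maxReducerFraction: Double = 0.1): Seq[(K, (V, V2))] = {

    // 1. "sketch" the left-hand-side key frequencies (a CMS in the real thing)
    val counts: Map[K, Long] = left.groupBy(_._1).map { case (k, vs) => (k, vs.size.toLong) }
    val maxPerReducer = left.size * maxReducerFraction

    // 2. frequent keys get several replicas; keys unseen on the left get none and drop out
    def replicas(k: K): Int = {
      val freq = counts.getOrElse(k, 0L).toDouble
      math.min(math.ceil(freq / maxPerReducer).toInt, numReducers)
    }
    val replicatedRight: Map[(K, Int), Seq[V2]] =
      right
        .flatMap { case (k, v2) => (0 until replicas(k)).map { r => ((k, r), v2) } }
        .groupBy(_._1)
        .map { case (kr, kvs) => (kr, kvs.map(_._2)) }

    // 3. scatter each left record to one of its key's replicas, then join locally
    val rng = new java.util.Random(7)
    left.flatMap { case (k, v) =>
      val kr = (k, rng.nextInt(replicas(k)))
      replicatedRight.getOrElse(kr, Nil).map { v2 => (k, (v, v2)) }
    }
  }
}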