I am testing out IndexedRDD and noticing some performance problems that I wouldn't expect based on the README and what I saw in your Spark Summit presentation. My use case for IndexedRDD is near-real-time denormalization for a data warehouse using Spark Streaming. My hope was that I could join a small subset of data (the most recent group of changed records) against an IndexedRDD and thereby avoid a full scan of the RDD, processing the records much faster.
I have tested this with both real and generated data. When doing an inner join with a small dataset (about 100 records) on the left-hand side and a large dataset (100,000,000 records) on the right-hand side, the vanilla Spark RDD performs as fast as or faster than the IndexedRDD, even when both datasets are cached and share a partitioner beforehand to avoid the cost of a repartition.
In the few runs of this test with generated data (code follows), the IndexedRDD implementation was 15-20% slower. Digging into the code, it looks like it doesn't do any actual pruning of the partitions to scan; instead it zips all partitions from both sides, even when some partitions on one side of the join are empty. I know that PartitionPruningRDD can inform the scheduler that only a subset of partitions needs to be processed, but I am curious whether I am misunderstanding some details and applying the wrong tool for the job.
As mentioned, here is the code that I used to generate the results:
I may try adding in the PartitionPruningRDD and see what performance this gives me, but would love to get some feedback on my experiment.
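To make concrete why pruning should help here: with a hash partitioner, the join keys on the small side determine exactly which partitions of the large side could contain a match, so only those partitions need tasks. A standalone sketch of that pruning logic (shown in Python for brevity; the function names are mine, and Python's `hash()` stands in for the JVM `hashCode` that Spark's HashPartitioner actually uses):

```python
def partition_of(key, num_partitions):
    # Mirrors the idea of Spark's HashPartitioner.getPartition: a
    # non-negative mod of the key's hash picks the partition.
    # (Python's hash() stands in for the JVM hashCode here; this is an
    # illustration of the technique, not Spark or IndexedRDD code.)
    return hash(key) % num_partitions

def candidate_partitions(small_keys, num_partitions):
    # Given the small side's join keys, return the set of large-side
    # partitions that could possibly hold a match. Every partition not
    # in this set can be skipped entirely by the join.
    return {partition_of(k, num_partitions) for k in small_keys}
```

With ~100 keys on the small side and, say, 1,000 partitions on the large side, at most 100 partitions can hold matches. In Spark the equivalent step would be wrapping the large RDD with `PartitionPruningRDD.create(largeRdd, candidates.contains)` so the scheduler only launches tasks for those partitions.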