// create the RDD
scala> val data = sc.parallelize(List(("sess-1","read"), ("sess-1","meet"),
         ("sess-1","walk"), ("sess-2","watch"), ("sess-2","sleep"),
         ("sess-2","run"), ("sess-2","drive")))
data: org.apache.spark.rdd.RDD[(String, String)] =
  ParallelCollectionRDD[211] at parallelize at <console>:26

// groupByKey returns an Iterable[String] (backed by a CompactBuffer)
scala> val dataCB = data.groupByKey()
dataCB: org.apache.spark.rdd.RDD[(String, Iterable[String])] =
  ShuffledRDD[212] at groupByKey at <console>:30

// map each CompactBuffer to a List
scala> val tx = dataCB.map { case (col1, col2) => (col1, col2.toList) }.collect
tx: Array[(String, List[String])] = Array((sess-1,List(read, meet,
  walk)), (sess-2,List(watch, sleep, run, drive)))

// groupByKey and the map to List can also be achieved in one statement
scala> val dataCB = data.groupByKey().map { case (col1, col2)
         => (col1, col2.toList) }.collect
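For a quick sanity check outside the shell, the same grouping can be sketched on a plain Scala collection, with `groupBy` on a `List` playing the role of `groupByKey` on an RDD (the object and value names here are illustrative, and unlike Spark this runs in a single JVM with no shuffle):

```scala
// Plain-Scala analogue of the RDD pipeline above; no SparkContext needed.
object GroupByDemo {
  val data = List(
    ("sess-1", "read"), ("sess-1", "meet"), ("sess-1", "walk"),
    ("sess-2", "watch"), ("sess-2", "sleep"), ("sess-2", "run"), ("sess-2", "drive"))

  // group by the session key, then keep only the values as a List,
  // mirroring groupByKey().map { case (k, vs) => (k, vs.toList) }
  val grouped: Map[String, List[String]] =
    data.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

  def main(args: Array[String]): Unit =
    grouped.toList.sortBy(_._1).foreach(println)
}
```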
Thanks for the answer ... what I asked is slightly different, and I figured it out doing some R&D, which I am posting below –