scala - Spark: Reading S3 file exception with Spark 1.5.2 prebuilt with hadoop-2.6
I am trying to read an existing file in a Spark-based application. Here is the snippet:
sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "mykey")
sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "mysecret")
val a = sc.textFile("s3://mybucket/tnrealtime/output/2016/01/27/22/45/00/a.txt").map { line => line.split(",") }
val b = a.collect // error-producing statement

I am getting this exception:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://snapdeal-personalization-dev-us-west-2/tnrealtime/output/2016/01/27/22/45/00/a.txt
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:909)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:908)
    at com.snapdeal.pears.trending.TrendingDecay$.load(TrendingDecay.scala:68)

Strangely, when I tried the same snippet in spark-shell, I got a different error:
java.io.IOException: No FileSystem for scheme: s3
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)

Can someone help me understand the issue?
I'm not sure what your scenario is, but when I run Spark locally and want to access files on S3, I specify the key and secret in the S3 path, like this:
sc.textFile("s3://mykey:mysecret@mybucket/tnrealtime/output/2016/01/27/22/45/00/a.txt")

Maybe that will work for you as well.
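The two ways of passing credentials that appear in this thread (configuration properties in the question, credentials embedded in the path in the answer) can be sketched as plain string helpers. This is only an illustration: the object and method names below are hypothetical, the property names simply follow the pattern in the question's snippet, and nothing here checks whether the chosen scheme is actually available on the classpath.

```scala
// Hypothetical helper object; these names are illustrative and are not part of Spark or Hadoop.
object S3Credentials {

  // Configuration properties in the pattern used by the question's snippet:
  // fs.<scheme>.awsAccessKeyId and fs.<scheme>.awsSecretAccessKey.
  def configProps(accessKey: String, secretKey: String, scheme: String = "s3"): Map[String, String] =
    Map(
      s"fs.${scheme}.awsAccessKeyId" -> accessKey,
      s"fs.${scheme}.awsSecretAccessKey" -> secretKey
    )

  // Path with the credentials embedded, in the form shown in the answer:
  // <scheme>://<key>:<secret>@<bucket>/<objectKey>
  // Assumption: the key and secret contain no characters that need escaping in a URI.
  def pathWithCredentials(
      accessKey: String,
      secretKey: String,
      bucket: String,
      objectKey: String,
      scheme: String = "s3"): String =
    s"${scheme}://${accessKey}:${secretKey}@${bucket}/${objectKey.stripPrefix("/")}"
}

// With a live SparkContext `sc` (assumed to exist, not created here), usage might look like:
//   S3Credentials.configProps("mykey", "mysecret")
//     .foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }
//   sc.textFile(S3Credentials.pathWithCredentials(
//     "mykey", "mysecret", "mybucket", "tnrealtime/output/2016/01/27/22/45/00/a.txt"))
```

Keeping the scheme as a parameter makes it easy to try the same credentials against a different filesystem scheme without retyping the property names.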