Accessing data in Google Cloud Storage for Apache Spark SQL
I have about 30 GB of data in Cloud Storage that I want to query using Apache Hive on a Dataproc cluster. What is the best strategy for accessing this data? Is the best approach to copy the data to the master via gsutil and access it there, or can I access it in Cloud Storage directly? If the latter, how do I specify the location in the Spark SQL CLI? Can I specify

location 'gs://<bucketname>'

when I run

create external table

?
You should be able to create an external table that points directly at your data in Cloud Storage. This should work for both Hive and Spark SQL, and in many cases it is the best strategy.
Here is an example based on a public dataset in Cloud Storage.
create external table natality_csv (
  source_year bigint, year bigint, month bigint, day bigint, wday bigint,
  state string, is_male boolean, child_race bigint, weight_pounds float,
  plurality bigint, apgar_1min bigint, apgar_5min bigint,
  mother_residence_state string, mother_race bigint, mother_age bigint,
  gestation_weeks bigint, lmp string, mother_married boolean,
  mother_birth_state string, cigarette_use boolean, cigarettes_per_day bigint,
  alcohol_use boolean, drinks_per_week bigint, weight_gain_pounds bigint,
  born_alive_alive bigint, born_alive_dead bigint, born_dead bigint,
  ever_born bigint, father_race bigint, father_age bigint, record_weight bigint
)
row format delimited fields terminated by ','
location 'gs://public-datasets/natality/csv'
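Once the external table exists, you should be able to query it straight from the spark-sql shell (or from Hive) on the cluster without copying anything to the master. As a rough sketch, a sanity-check query against the table defined above might look like this (the column names come from that definition; adjust to whatever schema your own data has):

select state, count(*) as num_records
from natality_csv
group by state
order by num_records desc
limit 10;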
Admittedly, based on the comments on the question, I am not sure if I am missing part of what you are asking.