Accessing data in Google Cloud Storage for Apache Spark SQL


I have about 30 GB of data in Cloud Storage that I want to query using Apache Hive on a Dataproc cluster. What is the best strategy for accessing this data? Is the best approach to copy the data to the master node via gsutil and access it there, or can I access it in Cloud Storage directly? If the latter, how do I specify the location in the Spark CLI? Can I specify

location 'gs://<bucketname>'  

when I run

create external table  

?

You should be able to create an external table that points directly at your data in Cloud Storage. This should work with both Hive and Spark SQL, and in many cases it is the best strategy.

Here is an example based on a public dataset in Cloud Storage.

create external table natality_csv (
  source_year bigint, year bigint, month bigint, day bigint, wday bigint,
  state string, is_male boolean, child_race bigint, weight_pounds float,
  plurality bigint, apgar_1min bigint, apgar_5min bigint,
  mother_residence_state string, mother_race bigint, mother_age bigint,
  gestation_weeks bigint, lmp string, mother_married boolean,
  mother_birth_state string, cigarette_use boolean, cigarettes_per_day bigint,
  alcohol_use boolean, drinks_per_week bigint, weight_gain_pounds bigint,
  born_alive_alive bigint, born_alive_dead bigint, born_dead bigint,
  ever_born bigint, father_race bigint, father_age bigint,
  record_weight bigint
)
row format delimited fields terminated by ','
location 'gs://public-datasets/natality/csv'
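As for the Spark CLI part of the question: Spark SQL (when built with Hive support, as it is on Dataproc) accepts the same DDL, including the location 'gs://...' clause, so you can run the statement above from the spark-sql shell or via spark.sql(...) in spark-shell/pyspark and then query the table in place. A minimal sketch, assuming the external table above has already been created and registered in the Hive metastore:

-- run inside the spark-sql shell (or pass to spark.sql(...))
-- sanity-check that Spark can read the files directly from Cloud Storage
select count(*) from natality_csv;

-- example aggregation over the externally stored CSV data
select year, count(*) as births
from natality_csv
group by year
order by year;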

Admittedly, based on the comments on the question, I am not sure whether I am missing part of what you are asking.
