Creating a Hadoop Cluster on Google Cloud Platform
First, you need to sign up for the Google Cloud Platform.
With this URL, you can get a two-month free trial with a $300 credit for one year.
After that, you can start using the platform.
- Create a Google Dataproc cluster. Select Dataproc from the navigation list on the left.
- Clicking “Create Cluster” takes you to the cluster configuration section. You can use the default configuration the first time.
If you get an error saying that you’ve exceeded your quota, reduce the number of worker nodes or choose a machine type (for master and workers) with fewer vCPUs.
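The same cluster can also be created from the command line, which makes it easy to lower the worker count or machine size when the quota error appears. This is a minimal sketch, assuming the gcloud CLI is installed and authenticated. The cluster name, region, worker count, and machine type are hypothetical placeholders, and `run` is a small helper introduced here so that the command is only printed unless you set `DRY_RUN=0`.

```shell
# Hypothetical values: replace them with your own settings.
CLUSTER_NAME="my-hadoop-cluster"
REGION="us-central1"
NUM_WORKERS=2                  # lower this if you hit a quota error
MACHINE_TYPE="n1-standard-2"   # 2 vCPUs per node

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

create_cluster() {
  run gcloud dataproc clusters create "$CLUSTER_NAME" \
    --region="$REGION" \
    --num-workers="$NUM_WORKERS" \
    --master-machine-type="$MACHINE_TYPE" \
    --worker-machine-type="$MACHINE_TYPE"
}

create_cluster
```

Review the printed command first, then rerun the script with `DRY_RUN=0` once the values match your project.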
- Now that the cluster is set up, we have to configure it a little before we can run jobs on it. Select the cluster you just created from the list of clusters under the Cloud Dataproc section of your console. Go to the VM Instances tab and click the SSH button next to the instance with the Master role. If you don’t see the SSH button, click the Refresh button at the top of the page.
- There is no home directory on HDFS for the current user, so we have to set one up before proceeding further.
hadoop fs -mkdir -p /user/
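HDFS home directories conventionally live under /user/<username>. The sketch below assumes the directory should be named after the user of the SSH session. That is an assumption, since the path in the command above is cut off. `hdfs_home_dir` is a hypothetical helper, and the HDFS commands only run where the `hadoop` client is available, such as on the master node.

```shell
# Assumption: the home directory follows the usual /user/<username> layout.
hdfs_home_dir() {
  # Use the first argument if given, otherwise the current login name.
  printf '/user/%s\n' "${1:-$(whoami)}"
}

# Only touch HDFS when the hadoop client is installed.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -mkdir -p "$(hdfs_home_dir)"
  hadoop fs -ls /user    # confirm that the directory now exists
else
  echo "hadoop not found; would create $(hdfs_home_dir)"
fi
```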
Upload Data to the Storage Bucket
- From the Clusters menu, click the Google Cloud Storage staging bucket to jump to the Storage section.
- Upload your data file or folder
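If you prefer the command line to the web console, the upload can be sketched with gsutil. This assumes gsutil is installed and authenticated. The bucket name and local path are hypothetical placeholders, and `run` is a helper introduced here that only prints the command unless `DRY_RUN=0` is set.

```shell
# Hypothetical names: replace them with your staging bucket and local data.
BUCKET="my-dataproc-staging-bucket"
LOCAL_DATA="./input-data"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

upload_data() {
  # -r copies a folder recursively; it also works for a single file.
  run gsutil cp -r "$LOCAL_DATA" "gs://$BUCKET/"
}

upload_data
```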
Submitting the Hadoop job to your cluster
- Go to the “Jobs” section in the left navigation bar of the Dataproc page and click on “Submit job”.
- Fill in the job parameters as follows:
○ Cluster: Select the cluster you created
○ Job Type: Hadoop
○ Jar File: Full path to the jar file you uploaded earlier to the Google storage bucket. Don’t forget the gs:// prefix.
○ Main Class or jar: The name of the Java class in which you wrote the mapper and reducer.
○ Arguments: This takes two arguments:
i. Input: Path to the input data you uploaded
ii. Output: Path to the storage bucket followed by a new folder name. The folder is created during execution. You will get an error if you give the name of an existing folder.
○ Leave the rest at their default settings
- Submit the job. It will take quite a while, so please be patient. You can see the progress in the job’s status section.
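The form fields above map onto a single gcloud command. This is a minimal sketch, assuming the gcloud CLI is installed and authenticated. Every name below (cluster, region, bucket, jar, class, and paths) is a hypothetical placeholder, and `require_gs_uri` and `run` are helpers introduced here. The command is only printed unless `DRY_RUN=0` is set.

```shell
# Hypothetical values: replace them with your own.
CLUSTER_NAME="my-hadoop-cluster"
REGION="us-central1"
BUCKET="my-dataproc-staging-bucket"
JAR_URI="gs://$BUCKET/wordcount.jar"   # jar uploaded to the bucket
MAIN_CLASS="WordCount"                 # class holding the mapper and reducer
INPUT_URI="gs://$BUCKET/input-data"
OUTPUT_URI="gs://$BUCKET/output"       # must not exist before the job runs

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

# Fail early when a path is missing the gs:// prefix.
require_gs_uri() {
  case "$1" in
    gs://*) return 0 ;;
    *) echo "not a gs:// URI: $1" >&2; return 1 ;;
  esac
}

submit_job() {
  require_gs_uri "$JAR_URI" && require_gs_uri "$INPUT_URI" \
    && require_gs_uri "$OUTPUT_URI" || return 1
  # Everything after the bare "--" is passed to the job as its arguments.
  run gcloud dataproc jobs submit hadoop \
    --cluster="$CLUSTER_NAME" \
    --region="$REGION" \
    --class="$MAIN_CLASS" \
    --jars="$JAR_URI" \
    -- "$INPUT_URI" "$OUTPUT_URI"
}

submit_job
```

Checking the prefix before submitting catches the most common mistake in the form, a jar or data path without gs://, before any cluster time is spent.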
Once the job has finished, copy all the log entries that were generated to a text file called
The output files will be stored in the output folder on the bucket.
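To inspect the results without opening the web console, you can list and print the output files with gsutil. This is a sketch under a few assumptions: the bucket and folder names are hypothetical, the job wrote its results as part-* files (the usual naming for Hadoop MapReduce output), and `run` is a helper introduced here that only prints the commands unless `DRY_RUN=0` is set.

```shell
# Hypothetical location: use the output path you passed to the job.
OUTPUT_URI="gs://my-dataproc-staging-bucket/output"

# Print the command by default; set DRY_RUN=0 to actually execute it.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then printf '%s\n' "$*"; else "$@"; fi; }

show_output() {
  # List everything the job wrote into the output folder.
  run gsutil ls "$OUTPUT_URI/"
  # Print the result files; the quoted wildcard is expanded by gsutil itself.
  run gsutil cat "$OUTPUT_URI/part-*"
}

show_output
```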