(Hadoop) Analysis of Hive Metastore Entities Using a Hook Function

Recently, I have been working on a project to save the metadata of Hive programs. Basically, I keep a SQL database in which the metadata of every executed HQL statement is saved and updated by an internal hook.

There are two important sets of Entity objects in the hook context:
Set<ReadEntity> inputs and Set<WriteEntity> outputs

In this article, I only introduce the data inside those entities; the exact structure of a Hive program will be covered in another article.

In the source code of org.apache.hadoop.hive.ql.plan.HiveOperation, you can find tens of different Hive operations. For our goal, a metadata store system, I only care about the operations related to metadata.

Notice:
The EXPLAIN AUTHORIZATION command can show INPUTS, OUTPUTS, CURRENT_USER and OPERATION.
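For context, the log lines in the list below are produced by such a post-execution hook. Here is a minimal sketch, assuming Hive's standard ExecuteWithHookContext interface; the class name and package are placeholders, and the exact accessor names can vary slightly between Hive versions.

package com.example.hive; // placeholder package

import java.util.Set;

import org.apache.hadoop.hive.ql.hooks.ExecuteWithHookContext;
import org.apache.hadoop.hive.ql.hooks.HookContext;
import org.apache.hadoop.hive.ql.hooks.ReadEntity;
import org.apache.hadoop.hive.ql.hooks.WriteEntity;

// Register it with: set hive.exec.post.hooks=com.example.hive.MetadataHook;
public class MetadataHook implements ExecuteWithHookContext {

  @Override
  public void run(HookContext hookContext) throws Exception {
    // Operation name, e.g. CREATETABLE, DROPTABLE, ALTERTABLE_RENAME ...
    String operation = hookContext.getQueryPlan().getOperationName();

    // The two entity sets discussed above.
    Set<ReadEntity> inputs = hookContext.getInputs();
    Set<WriteEntity> outputs = hookContext.getOutputs();

    // Entity.toString() yields the db@table / db@table@partition form
    // that appears in the log samples below.
    System.out.println("operation is " + operation
        + ",inputs :" + inputs + ",outputs:" + outputs);
  }
}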

  1. CREATETABLE

input: null, or the location if LOCATION is specified when creating the table.
output: new table, current database
log: operation is CREATETABLE,inputs :[],outputs:[db@tml_2, database:db]

  2. DROPTABLE

input: deleted table
output: deleted table
log: operation is DROPTABLE,inputs :[db@tml_1],outputs:[db@tml_1]

  3. ALTERTABLE_RENAME

input: old table
output: old table, new table
log: operation is ALTERTABLE_RENAME,inputs :[db@tml_2],outputs:[db@tml_2, db@tml_3]

  4. ALTERTABLE_RENAMECOL

input: null
output: new table
log: operation is ALTERTABLE_RENAMECOL,inputs :[],outputs:[db@tml_3]

  5. ALTERTABLE_REPLACECOLS

input: null
output: new table
log: operation is ALTERTABLE_REPLACECOLS,inputs :[],outputs:[db@tml_3]

  6. ALTERTABLE_RENAMEPART

input: table, old partition
output: old partition, new partition
log: operation is ALTERTABLE_RENAMEPART,inputs :[db@tml_part, ks_xs@tml_part@dt=2008-08-08/hour=14],outputs:[db@tml_part@dt=2008-08-08/hour=14, db@tml_part@dt=2008-08-08/hour=15]

  7. ALTERPARTITION_LOCATION

input: table, partition
output: location, partition
log: operation is ALTERPARTITION_LOCATION,inputs :[db@tml_part, db@tml_part@dt=2008-08-08/hour=15],outputs:[viewfs://hadoop-lt-cluster/home/dp/data/userprofile/db.db/tml_part/dt, db@tml_part@dt=2008-08-08/hour=15]

Conclusion

In org.apache.hadoop.hive.ql.hooks.Entity, you can find all the types of Entity.

  /**
   * The type of the entity.
   */
  public static enum Type {
    DATABASE, TABLE, PARTITION, DUMMYPARTITION, DFS_DIR, LOCAL_DIR, FUNCTION
  }

What is strange is that there is no COLUMN type among them. So when we try to catch add/rename/replace column operations, we have to get the data from the parent table entity.

Besides, we can easily get the metadata for a specific entity type, as the sketch below illustrates.
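Here is a hedged sketch of how that might look; the helper class and method names are made up for illustration, and only the accessors that correspond to each entity type are used.

import org.apache.hadoop.hive.ql.hooks.Entity;

public class EntityDescriber {

  // Hypothetical helper: pull the interesting metadata out of one entity
  // according to its type. Since there is no COLUMN type, column changes
  // are read from the parent TABLE entity instead.
  public static String describe(Entity entity) {
    switch (entity.getType()) {
      case DATABASE:
        return "database: " + entity.getDatabase().getName();
      case TABLE:
        return "table: " + entity.getTable().getDbName()
            + "." + entity.getTable().getTableName();
      case PARTITION:
      case DUMMYPARTITION:
        return "partition: " + entity.getPartition().getName();
      case DFS_DIR:
      case LOCAL_DIR:
      default:
        // For directory entities, toString() is simply the path.
        return entity.toString();
    }
  }
}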

Some Experience with Uploading and Downloading Files on HDFS Using the Java Spring Framework

  1. Using streams to transfer data

At first, I tried an easy way to transfer files: read them into memory as byte arrays and then upload them to HDFS. However, I ran into java.lang.OutOfMemoryError: Java heap space. The reason is that I used wasteful calls such as MultipartFile.getBytes(), which load the whole file into memory.

I then fixed the problem by using two streams: one from the client (browser) to the web server, and one from the web server to the HDFS cluster.

  • From Browser to Web Server

We can use the normal way, a MultipartFile parameter, to receive the uploaded file.

The file contents are either stored in memory or temporarily on disk. In either case, the user is responsible for copying file contents to a session-level or persistent store as and if desired. The temporary storage will be cleared at the end of request processing.

Developers can configure the threshold that decides whether the temporary file is kept in memory or written to disk by modifying the property file as follows.

spring.http.multipart.file-size-threshold=10MB

In this way we can support a high level of concurrency.

  • From Web Server to HDFS Server

After initializing the FileSystem from a configuration class, the process is easy.

FSDataOutputStream out = fs.create(path, true);
IOUtils.copyBytes(file.getInputStream(), out, 1024, true);

  • From HDFS to Web Server

When downloading a file, Hadoop offers a convenient API: given the file path, we can obtain an InputStream.

FSDataInputStream in = currentVfs.open(srcPath);

  • From Web Server to Browser

Normally, in the Spring framework, developers can use a Resource instance to transfer files. Conveniently, Spring offers InputStreamResource(InputStream), so we can stream files from HDFS to the browser directly; see the sketch after this list.
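The following is a minimal controller sketch that ties the pieces above together; the class name, request mappings, and the injected FileSystem bean are assumptions made for illustration, not part of the original project.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.springframework.core.io.InputStreamResource;
import org.springframework.http.MediaType;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
public class HdfsFileController {

  // Assumed to be built once from a Hadoop Configuration and exposed as a bean.
  private final FileSystem fs;

  public HdfsFileController(FileSystem fs) {
    this.fs = fs;
  }

  // Browser -> web server -> HDFS: stream the multipart body straight into
  // HDFS instead of loading it all into memory with getBytes().
  @PostMapping("/upload")
  public String upload(@RequestParam("file") MultipartFile file,
                       @RequestParam("dst") String dst) throws IOException {
    FSDataOutputStream out = fs.create(new Path(dst), true);
    // copyBytes with close=true closes both streams when the copy finishes.
    IOUtils.copyBytes(file.getInputStream(), out, 1024, true);
    return "ok";
  }

  // HDFS -> web server -> browser: wrap the HDFS stream in an
  // InputStreamResource so Spring streams it to the client.
  @GetMapping("/download")
  public ResponseEntity<InputStreamResource> download(@RequestParam("src") String src)
      throws IOException {
    FSDataInputStream in = fs.open(new Path(src));
    return ResponseEntity.ok()
        .contentType(MediaType.APPLICATION_OCTET_STREAM)
        .body(new InputStreamResource(in));
  }
}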

  2. File name encoding problem

This problem mainly affects Chinese developers: Chinese characters become garbled in the HTTP header. The solution is to re-encode the file name as ISO-8859-1:

String fileName = new String(fileName.getBytes("UTF-8"), "ISO-8859-1"); // make the source charset explicit
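Applied to a download response, the re-encoded name goes into the Content-Disposition header. A small sketch, with a made-up helper name and assuming a servlet-based response:

import java.io.UnsupportedEncodingException;
import javax.servlet.http.HttpServletResponse;

public class DownloadHeaderUtil {

  // Header values are interpreted as ISO-8859-1, so re-wrap the UTF-8 bytes
  // of the file name in that charset before setting the header.
  public static void setDownloadFileName(HttpServletResponse response, String utf8Name)
      throws UnsupportedEncodingException {
    String headerName = new String(utf8Name.getBytes("UTF-8"), "ISO-8859-1");
    response.setHeader("Content-Disposition", "attachment; filename=" + headerName);
  }
}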

Running a Hadoop Program on Google Cloud Platform

Creating a Hadoop Cluster on Google Cloud Platform

First, you need to sign up for Google Cloud Platform:
https://console.cloud.google.com/freetrial?pli=1&page=0
With this URL, you can get a free trial that includes a $300 credit.

Continue reading “Running a Hadoop Program on Google Cloud Platform”

(crawler4j) SLF4J: Failed to load class “org.slf4j.impl.StaticLoggerBinder”.

While using crawler4j to crawl some data, I ran into the following error:

Based on the official documentation, I found the solution:

https://www.slf4j.org/codes.html#StaticLoggerBinder

Then I downloaded Logback and imported logback-classic.jar.

Perfect!

Hive Lateral view Introduction

Lateral View syntax

Continue reading “Hive Lateral view Introduction”

Hadoop Error: java.lang.ClassNotFoundException:org.codehaus.jackson.map.JsonMappingException

Recently, I did some practice with Hadoop and ran into some errors. For example, I could run the Hadoop application on HDFS from the terminal, but I could not run it locally in Eclipse because of the following errors:
Continue reading “Hadoop Error: java.lang.ClassNotFoundException:org.codehaus.jackson.map.JsonMappingException”