Hive File Formats Simplified

sasirekha
3 min read · Nov 21, 2020

File formats determine how data is stored inside files. We know that a format is selected based on whether the data is structured or unstructured, but it doesn't stop there. We also need to consider input/output (I/O) latency, compression rate, space usage, and data encoding.

How it works:

A file format is made up of an input format, an output format, and a SerDe (serializer/deserializer). To make it easy, the file format is the global viewpoint: if you specify the file format on a table, you don't need to mention the input format, output format, or SerDe explicitly.

Internally, Hive assigns the input and output formats for you. This makes more sense when you describe a table you created with just a file format: the input and output formats magically appear in your table description.
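For example, here's a minimal sketch (the table and columns are just illustrative): create a table specifying only the file format, then describe it to see what Hive filled in.

```sql
-- We only say STORED AS ORC; Hive assigns the rest.
CREATE TABLE employees (
  id   INT,
  name STRING
)
STORED AS ORC;

-- The (abridged) description now shows the formats Hive assigned:
--   SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
--   InputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
DESCRIBE FORMATTED employees;
```

Here are the available file formats: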

TEXT FILE FORMAT:

This stores data as a plain text file and is the default file format. Below are the clauses used with the text format.

  1. DELIMITED or ESCAPED BY: the separator used in the data (comma, pipe, tab, any complex symbol, or even some characters). The available delimiters are the field, escape, collection-item, map-key, and line delimiters (see the sketch below).
  2. NULL format: change how null values appear in the data, e.g. NULL DEFINED AS 'NULL'. The default is '\N'.
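Here's a minimal sketch of how these clauses fit together (the table, columns, and delimiter choices are illustrative):

```sql
CREATE TABLE raw_events (
  event_id   INT,
  tags       ARRAY<STRING>,
  attributes MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- field delimiter
  COLLECTION ITEMS TERMINATED BY '|'  -- collection delimiter
  MAP KEYS TERMINATED BY ':'          -- map-key delimiter
  LINES TERMINATED BY '\n'            -- line delimiter
  NULL DEFINED AS 'NULL'              -- override the default '\N'
STORED AS TEXTFILE;
```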

The advantages: it is lightweight and good for data exchange. The disadvantages: it is a poor fit for complex data types, and it is slow to read and write.

Despite the disadvantages, text and CSV files are the popular formats for data processing and batching applications.

SEQUENCE FILE FORMAT:

Sequence files are flat files that store data as binary key-value pairs. Because they are binary, they are compact, and they are splittable. They are typically used when file sizes are smaller than the HDFS block size, a common small-files problem.

The advantages: many small files can be merged into one, and block-level compression is supported. The disadvantages: decompression adds complexity, and it is a traditional MapReduce-era binary format.
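A sketch of a sequence-file table with block-level compression turned on (the session properties are standard Hive/Hadoop settings; the tables are illustrative, and clicks_text is a hypothetical source table):

```sql
-- Enable compressed output and pick block-level compression
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE clicks_seq (
  user_id INT,
  url     STRING
)
STORED AS SEQUENCEFILE;

-- Repack many small files from a text table into the sequence-file table
INSERT OVERWRITE TABLE clicks_seq
SELECT user_id, url FROM clicks_text;
```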

RC FILE FORMAT:

RC stands for Record Columnar file format. It supports a high compression rate and, like sequence files, stores data as binary key-value pairs. It partitions the rows of a table into row groups and stores each group column by column, which makes it very useful when performing analytics.
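A minimal sketch (illustrative table), since Hive handles the row-group and column layout internally:

```sql
CREATE TABLE sales_rc (
  sale_id INT,
  region  STRING,
  amount  DOUBLE
)
STORED AS RCFILE;
```

Because each row group is laid out column by column, a query like SELECT region, SUM(amount) FROM sales_rc GROUP BY region only has to read the two referenced columns.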

ORC FILE FORMAT:

ORC stands for Optimized Row Columnar file format. It compresses the original data by up to 75% (e.g. 10 GB down to 2.5 GB). It produces a single file as output per task, which reduces the NameNode's load. Internally, ORC splits files at the stripe level, where a stripe is a grouping of row data. It also records min/max indexes, which is why we can get the minimum and maximum values in seconds, even across trillions of records.

The ORC file format is a highly efficient way to store data. It overcomes the limitations of the other Hive file formats.
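A sketch of an ORC table with a few optional tuning properties spelled out (the property names are standard ORC table properties; the values and the table are illustrative):

```sql
CREATE TABLE sales_orc (
  sale_id INT,
  region  STRING,
  amount  DOUBLE
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress"     = "ZLIB",      -- codec behind the heavy compression
  "orc.stripe.size"  = "67108864",  -- 64 MB stripes (groupings of row data)
  "orc.create.index" = "true"       -- per-stripe min/max indexes
);
```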

PARQUET FILE FORMAT:

Parquet is a column-oriented binary file format, especially good for large-scale, scan-heavy queries. Parquet supports Snappy and gzip compression. It is optimized for the WORM model (write once, read many), which makes it a good choice for heavy read loads.
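A minimal sketch of a Parquet table with Snappy compression (setting the codec via a session property is one common approach; the table is illustrative):

```sql
-- Pick the codec for Parquet writes (SNAPPY or GZIP)
SET parquet.compression=SNAPPY;

CREATE TABLE events_parquet (
  event_id INT,
  payload  STRING
)
STORED AS PARQUET;
```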

There’s a popular saying that parquet file format solves most of the big data problems.

AVRO FILE FORMAT:

Avro is a row-based format with a high degree of splittability. The schema is stored in JSON, while the data is stored in binary, which minimizes file size and maximizes efficiency. It supports data serialization and is among the most reliable formats for schema evolution.
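A minimal sketch showing Avro's JSON schema side (the table and schema are made up). Adding a new field with a default, like age below, is the usual backward-compatible way to evolve the schema:

```sql
CREATE TABLE users_avro
STORED AS AVRO
TBLPROPERTIES ('avro.schema.literal' = '{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",    "type": "int"},
    {"name": "email", "type": "string"},
    {"name": "age",   "type": ["null", "int"], "default": null}
  ]
}');
```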

Well then, that's the nitty-gritty.

That's it. I hope you now have a clearer picture of Hive file formats. Please do follow me for more blogs like this. Till then, happy learning :)
