Bigdata File Formats Support on DellEMC Isilon – Unstructured Data Quick Tips

This article describes the DellEMC Isilon’s support for Apache Hadoop file formats in terms of disk space utilization. To determine this, we will use Apache Hive service to create and store different file format tables and analyze the disk space utilization by each table on the Isilon storage.

Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query different data files created by other Hadoop components such as PIG, Spark, MapReduce, etc. In this article, we will check Apache Hive file formats such as TextFile, SequenceFIle, RCFile, AVRO, ORC and Parquet formats. Cloudera Impala also supports these file formats.

To begin with, let us understand a bit about these Bigdata File formats. Different file formats and compression codes work better for different data sets in Hadoop, the main objective of this article is to determine their supportability on DellEMC Isilon storage which is a scale-out NAS storage for Hadoop cluster.

Following are the Hadoop file formats

Test File: This is a default storage format. You can use the text format to interchange the data with another client application. The text file format is very common for most of the applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character(\n).

The test format is a simple plane file format. You can use the compression (BZIP2) on the text file to reduce the storage spaces.

Sequence File: These are Hadoop flat files that store values in binary key-value pairs. The sequence files are in binary format and these files can split. The main advantage of using the sequence file is to merge two or more files into one file.

RC File: This is a row columnar file format mainly used in Hive Datawarehouse, offers high row-level compression rates. If you have a requirement to perform multiple rows at a time, then you can use the RCFile format. The RCFile is very much like the sequence file format. This file format also stores the data as key-value pairs.

AVRO File: AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and a program written in any programming language. Avro is one of the popular file formats in Big Data Hadoop based applications.

ORC File: The ORC file stands for Optimized Row Columnar file format. The ORC file format provides a highly efficient way to store data in the Hive table. This file system was designed to overcome limitations of the other Hive file formats. The Use of ORC files improves performance when Hive is reading, writing, and processing data from large tables.

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Parquet File: Parquet is a column-oriented binary file format. The parquet is highly efficient for the types of large-scale queries. Parquet is especially good for queries scanning particular columns within a particular table. The Parquet table uses compression Snappy, gzip; currently Snappy by default.

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

Please note for below testing Hortonworks HDP 3.1 is installed on DellEMC Isilon OneFS 8.2.

Disk Space Utilization on DellEMC Isilon

What is the space on the disk that is used for these formats in Hadoop on DellEMC Isilon? Saving on disk space is always a good thing, but it can be hard to calculate exactly how much space you will be used with compression. Every file and data set is different, and the data inside will always be a determining factor for what type of compression you’ll get. The text will compress better than binary data. Repeating values and strings will compress better than pure random data, and so forth.

As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download measures at 108.5 Mb, and uncompressed at 657.5 Mb. We then uploaded the data to DellEMC Isilon through HDFS protocol, and created an external table on top of the uncompressed data set:

Copy the original dataset to Hadoop cluster

(base) [root@pipe-hdp4 ~]# ll
-rw-r--r--   1 root root 689413344 Dec  9  2014 2008.csv
-rwxrwxrwx   1 root root 113753229 Dec  9  2014 2008.csv.bz2


(base) [root@pipe-hdp4 ~]#hadoop fs -put 2008.csv.bz2 /
(base) [root@pipe-hdp4 ~]#hadoop fs -mkdir /flight_arrivals
(base) [root@pipe-hdp4 ~]#hadoop fs -put 2008.csv /flight_arrivals/

From Hadoop Compute Node, create a table

Create external table flight_arrivals (
year int,
month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier string,
FlightNum int,
TailNum string,
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin string,
Dest string,
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode int,
Diverted int,
CarrierDelay string,
WeatherDelay string,
NASDelay string,
SecurityDelay string,
LateAircraftDelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location '/flight_arrivals';

The total number of records in this primary table is

select count(*) from flight_arrivals;
+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+

Similarly, create different file format tables using the primary table

To create different file formats files by simply specifying ‘STORED AS FileFormatName’ option at the end of a CREATE TABLE Command.

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals;
Create external table flight_arrivals_external_textfile stored as textfile as select * from flight_arrivals;
Create external table flight_arrivals_external_sequencefile stored as sequencefile as select * from flight_arrivals;
Create external table flight_arrivals_external_rcfile stored as rcfile as select * from flight_arrivals;
Create external table flight_arrivals_external_avro stored as avro as select * from flight_arrivals;

Disk space utilization of the tables

Now, let us compare the disk usage on Isilon of all the files from Hadoop compute nodes.

(base) [root@pipe-hdp4 ~]# hadoop fs -du -h /warehouse/tablespace/external/hive/ | grep flight_arrivals
670.7 M  670.7 M /warehouse/tablespace/external/hive/flight_arrivals_external_textfile
403.1 M  403.1 M /warehouse/tablespace/external/hive/flight_arrivals_external_rcfile
751.1 M  751.1 M /warehouse/tablespace/external/hive/flight_arrivals_external_sequencefile
597.8 M  597.8 M /warehouse/tablespace/external/hive/flight_arrivals_external_avro
145.7 M  145.7 M  /warehouse/tablespace/external/hive/flight_arrivals_external_parquet
93.1 M   93.1 M  /warehouse/tablespace/external/hive/flight_arrivals_external_orc
(base) [root@pipe-hdp4 ~]#

Summary

From the below table we can conclude that DellEMC Isilon as HDFS storage supports all the Hadoop file formats and provides the same disk utilization as with the traditional HDFS storage.

Format	Size	Compressed%
BZ2	108.5 M	16.5%
CSV (Text)	657.5 M	–
ORC	93.1 M	14.25%
Parquet	145.7 M	22.1%
AVRO	597.8 M	90.9%
RC FIle	403.1 M	61.3%
Sequence	751.1 M	114.2%

Here the default settings and values wee used to create all different format tables, as well as no other optimizations, were used for any of the formats. Each file format ships with many options and optimizations to compress the data, only the defaults that ship HDP 3.1 were used.