Big Data File Format Support on Dell EMC ECS 3.6

This article describes Dell EMC ECS's support for Apache Hadoop file formats in terms of disk space utilization. To determine this, we will use the Apache Hive service to create and store tables in different file formats and analyze the disk space used by each table on ECS storage.

Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query data files created by other Hadoop components such as Pig, Spark, and MapReduce. In this article, we will check Apache Hive file formats such as TextFile, SequenceFile, RCFile, AVRO, ORC, and Parquet. Cloudera Impala also supports these file formats.

To begin with, let us understand a bit about these big data file formats. Different file formats and compression codecs work better for different data sets in Hadoop. The main objective of this article is to determine their supportability on Dell EMC ECS storage, which is an S3-compatible object store for Hadoop clusters.

The following are the Hadoop file formats:

Text File: This is the default storage format. You can use the text format to interchange data with other client applications. The text file format is very common for most applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).

The text format is a simple plain file format. You can use compression (for example, BZip2) on text files to reduce storage space.
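As a minimal sketch (not part of the testing below, and with a hypothetical source table name), BZip2 output compression can be enabled from Hive with the standard Hive/MapReduce properties before writing a text-format table:

-- Compress the files written by this session with BZip2 (property names may vary by distribution)
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- Write a TEXTFILE table whose underlying files are BZip2-compressed
CREATE TABLE events_txt_bz2
STORED AS TEXTFILE
AS SELECT * FROM raw_events;  -- raw_events is a hypothetical source table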

Sequence File: These are Hadoop flat files that store values as binary key-value pairs. Sequence files are in binary format, and these files are splittable. The main advantage of using sequence files is to merge two or more small files into one file.

RC File: The RCFile (Record Columnar File) is a columnar file format mainly used in Hive data warehouses that offers high compression rates. If you have a requirement to process multiple rows at a time, you can use the RCFile format. The RCFile is very much like the sequence file format; it also stores data as key-value pairs.

AVRO File: AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and programs written in any programming language. Avro is one of the popular file formats in Big Data Hadoop-based applications.

ORC File: ORC stands for Optimized Row Columnar. The ORC file format provides a highly efficient way to store data in Hive tables. This file format was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
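As an illustrative sketch (not taken from the testing below, and with a hypothetical source table), the ORC compression codec can be chosen per table through TBLPROPERTIES; ZLIB is the usual default, and Snappy can be requested explicitly:

-- Create an ORC table that uses Snappy compression instead of the default ZLIB
CREATE TABLE flights_orc_snappy
STORED AS ORC
TBLPROPERTIES ("orc.compress"="SNAPPY")
AS SELECT * FROM flights_source;  -- flights_source is a hypothetical source table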

Parquet File: Parquet is a column-oriented binary file format. Parquet is highly efficient for large-scale queries, and it is especially good for queries that scan particular columns within a table. Parquet tables use Snappy or gzip compression; currently Snappy is the default.

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/
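Similarly, as a sketch with a hypothetical source table, the Parquet codec can be selected per table; parquet.compression accepts values such as SNAPPY, GZIP, and UNCOMPRESSED:

-- Create a Parquet table that uses gzip compression instead of the default Snappy
CREATE TABLE flights_parquet_gzip
STORED AS PARQUET
TBLPROPERTIES ("parquet.compression"="GZIP")
AS SELECT * FROM flights_source;  -- flights_source is a hypothetical source table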

Please note that a Cloudera CDP Private Cloud Base 7.1.6 Hadoop cluster is used for the testing below.

Disk Space Utilization on Dell EMC ECS

How much disk space do these formats use in Hadoop on Dell EMC ECS? Saving disk space is always a good thing, but it can be hard to calculate exactly how much space you will use with compression. Every file and data set is different, and the data inside will always be a determining factor for the type of compression you'll get. Text will compress better than binary data, repeating values and strings will compress better than purely random data, and so forth.

As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download is 108.5 MB, and the uncompressed size is 657.5 MB. We then uploaded the data to Dell EMC ECS through the S3A protocol and created an external table on top of the data set:

Copy the original data set to the Hadoop cluster:
[root@hop-kiran-n65 ~]# ll
total 111128
-rwxr-xr-x 1 root root 113753229 May 28 02:25 2008.csv.bz2
-rw-------. 1 root root 1273 Oct 31 2020 anaconda-ks.cfg
-rw-r--r--. 1 root root 36392 Dec 15 07:48 docu99139
[root@hop-kiran-n65 ~]# hadoop fs -put ./2008.csv.bz2 s3a://hive.ecs.bucket/diff_file_format_db/bz2/
[root@hop-kiran-n65 ~]# hadoop fs -ls s3a://hive.ecs.bucket/diff_file_format_db/bz2/
Found 1 items
-rw-rw-rw- 1 root root 113753229 2021-05-28 02:00 s3a://hive.ecs.bucket/diff_file_format_db/bz2/2008.csv.bz2
[root@hop-kiran-n65 ~]#
From the Hadoop compute node, create a database with its data location on the ECS bucket, and create an external table for the flights data uploaded to the ECS bucket location:
DROP DATABASE IF EXISTS diff_file_format_db CASCADE;

CREATE database diff_file_format_db COMMENT 'Holds all the tables data on ECS bucket' LOCATION 's3a://hive.ecs.bucket/diff_file_format_db' ;
USE diff_file_format_db;

Create external table flight_arrivals_txt_bz2 (
year int,
month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier string,
FlightNum int,
TailNum string,
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin string,
Dest string,
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode int,
Diverted int,
CarrierDelay string,
WeatherDelay string,
NASDelay string,
SecurityDelay string,
LateAircraftDelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3a://hive.ecs.bucket/diff_file_format_db/bz2/';
The total number of records in this master table is:
select count(*) from flight_arrivals_txt_bz2 ;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
Similarly, create different file format tables using the master table.

Different file format tables can be created by simply specifying the 'STORED AS <FileFormatName>' option in the CREATE TABLE AS SELECT command.

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_textfile stored as textfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_sequencefile stored as sequencefile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_rcfile stored as rcfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_avro stored as avro as select * from flight_arrivals_txt_bz2;
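Optionally, the storage format and ECS location that each derived table picked up can be verified with standard Hive commands, for example:

-- Show the SerDe, input/output format, and s3a location recorded for a table
DESCRIBE FORMATTED flight_arrivals_external_orc;
SHOW CREATE TABLE flight_arrivals_external_parquet;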
Disk space utilization of the tables

Now, let us compare the disk usage on ECS of all these tables from the Hadoop compute node.

[root@hop-kiran-n65 ~]# hadoop fs -du -h s3a://hive.ecs.bucket/diff_file_format_db/ | grep flight_arrivals
597.7 M 597.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M 93.5 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M 146.6 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
403.1 M 403.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_rcfile
751.1 M 751.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_sequencefile
670.7 M 670.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_textfile
[root@hop-kiran-n65 ~]#

Summary

From the table below, we can conclude that Dell EMC ECS as S3 storage supports all the Hadoop file formats and provides the same disk utilization as traditional HDFS storage.

For the Compressed % column, lower is better; for the Compression Ratio column, higher is better. Compressed % is each format's size relative to the uncompressed CSV (670.7 MB); for example, ORC at 93.5 MB is 93.5 / 670.7 ≈ 13.94%, which gives a compression ratio of about 86.06%.

Format       Size       Compressed %   Compression Ratio
CSV (Text)   670.7 M    (baseline)
BZ2          108.5 M    16.18%         83.82%
ORC          93.5 M     13.94%         86.06%
Parquet      146.6 M    21.85%         78.15%
RC File      403.1 M    60.10%         39.90%
AVRO         597.7 M    89.12%         10.88%
Sequence     751.1 M    111.97%        -11.97%

Here, the default settings and values were used to create all the different file format tables; no other optimizations were done for this testing. Each file format ships with many options and optimizations to compress the data, but only the defaults that ship with CDP Private Cloud Base 7.1.6 were used.

Dell EMC ECS IAM and Hadoop S3A Implementation

This paper describes basic information on IAM features with Dell EMC ECS and a step-by-step process to configure ECS with AD FS to demonstrate SAML support features, which allow the Hadoop administrator to set up access policies that control access to S3A Hadoop data.

https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/h18420-dell-emc-ecs-iam-and-hadoop-s3a-implementation.pdf

ECS CIFS Gateway Demo


Accessing Data On ECS with CIFS Gateway

Elastic Cloud Storage (ECS) is an object-based platform supporting the S3, HDFS, and NFS protocols. However, what happens when you want to access data in a Windows environment through Server Message Block (SMB)? ECS now offers a CIFS Gateway that builds in SMB support for accessing data on ECS.

The ECS CIFS Gateway can easily be installed on Windows-based machines to allow for file shares. In a multiprotocol world, this allows data to be written via S3 and then shared out through SMB, or vice versa. Check out the video below for the ECS CIFS Gateway demo.

Transcript – ECS CIFS Gateway

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Isilon Quick Tips. In this one, we're going to show how to use ECS to set up CIFS shares. First thing, let's jump in and look at our users and our CIFS users. This is the specific user that's going to be used to set up and access our shares. Now, I've already downloaded the EXE file. You can see this CIFS ECS 1.2 version. Let's click on this and try to install this real quick. Accept the licensing agreement, and verify that this is where we want to put our directory and this program file.

Now, as this is installing, let it finish up. We're going to map that first ECS directory. We're going to call this our local ECS. For our CIFS host, we'll set all files and folders to lowercase. Let's go in here, our lab ECS. You can see here all the required fields. Let's put back in our CIFS user for our user ID. You can see we're going to use HTTPS, and we're going to set it up to HTTP, and 9020 is going to be our [Inaudible 00:01:40].

Add in our host name, which is ECS.demo.local. Add that over to our list. Verify that works. Use this one, and let's find our CIFS bucket. The CIFS bucket is CIFS data. Got that selected. Now, let's move along and verify everything. Everything looks fine. Let's finish this up. Now, we have that share mapped to our drive. Let's go ahead and select that E drive. Our local ECS, and let's put a file filter on it. What that's going to do is, we're going to say that we want to exclude MP3s. Say that you didn't want MP3s to come into this file share. To put some kind of policy on it, you have the ability here to lock that in. We can add that to this local ECS so it just maps to our environment. Now, we've stopped MP3s from being uploaded. Let's test this out by opening and creating a test document. Go ahead and test out our first document that we uploaded to our E share, here, on our local ECS. We've got this. Let's look at the properties here. Let's see. We have our CIFS ECS. It appears to be uploaded.

Now, let's double-check that by jumping into Cyberduck and using the S3 protocol to check out that CIFS data. You can see here that we have our test document. Congratulations, you just used the drive to upload a document.