Better Protection with Dell EMC ECS Object Lock

Dell EMC ECS has supported WORM (write-once-read-many) retention since ECS 2.x. To improve compatibility with more applications, ECS 3.6.2 adds an object lock feature that is compatible with the capabilities of Amazon S3 Object Lock.

Dell EMC ECS object lock protects object versions from accidental or malicious deletion such as a ransomware attack. It does this by allowing object versions to enter a Write Once Read Many (WORM) state where access is restricted based on attributes set on the object version.

Object lock is designed to meet compliance requirements such as SEC Rule 17a-4(f), FINRA Rule 4511(c), and CFTC Rule 17.

Object lock overview

Object lock prevents object version deletion during a user-defined retention period.  Immutable S3 objects are protected using object- or bucket-level configuration of WORM and retention attributes. The retention policy is defined using the S3 API or bucket-level defaults.  Objects are locked for the duration of the retention period, and legal hold scenarios are also supported.

There are two lock types for object lock:

  • Retention period: Specifies a fixed period of time during which an object version remains locked. During this period, the object version is WORM-protected and can’t be overwritten or deleted.
  • Legal hold: Provides the same protection as a retention period, but has no expiration date. Instead, a legal hold remains in place until you explicitly remove it. Legal holds are independent of retention periods.

There are two modes for the retention period:

  • Governance mode: Users can’t overwrite or delete an object version or alter its lock settings unless they have special permissions. With governance mode, you protect objects against being deleted by most users, but you can still grant some users permission to alter the retention settings or delete the object if necessary. You can also use governance mode to test retention-period settings before creating a compliance-mode retention period.
  • Compliance mode: A protected object version can’t be overwritten or deleted by any user, including the root user in your account. When an object is locked in compliance mode, its retention mode can’t be changed, and its retention period can’t be shortened. Compliance mode helps ensure that an object version can’t be overwritten or deleted for the duration of the retention period.
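
The interaction between the two lock types and the two retention modes can be sketched as a small decision function. This is only an illustrative model, not ECS code: the `VersionLock` class and the `can_delete_version` function are hypothetical names, and the rule that a legal hold blocks deletion even when the governance bypass permission is present is an assumption taken from S3 Object Lock semantics.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VersionLock:
    """Lock state of one object version (hypothetical model, not an ECS type)."""
    mode: Optional[str] = None              # "GOVERNANCE", "COMPLIANCE", or None
    retain_until: Optional[datetime] = None
    legal_hold: bool = False


def can_delete_version(lock: VersionLock, now: datetime,
                       bypass_governance: bool = False) -> bool:
    """Return True if permanently deleting this version would be allowed."""
    # A legal hold blocks deletion regardless of any retention period.
    if lock.legal_hold:
        return False
    # No retention set, or the retention period has already ended.
    if lock.mode is None or lock.retain_until is None or now >= lock.retain_until:
        return True
    # Inside the retention period, only governance mode can be bypassed.
    return lock.mode == "GOVERNANCE" and bypass_governance


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
until = datetime(2030, 1, 1, tzinfo=timezone.utc)
print(can_delete_version(VersionLock("GOVERNANCE", until), now))
print(can_delete_version(VersionLock("GOVERNANCE", until), now, bypass_governance=True))
print(can_delete_version(VersionLock("COMPLIANCE", until), now, bypass_governance=True))
```

The three calls show a governance lock without bypass, the same lock with bypass, and a compliance lock, which no permission can override.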

Object lock and lifecycle

Objects under lock are protected from lifecycle deletions.

Lifecycle logic is complicated by the different behavior of each lock type. From the lifecycle point of view, there are locks without a date, locks with a date that can be extended, and locks with a date that can be decreased.

  • For compliance mode, the retain-until date can’t be decreased, but can be increased.
  • For governance mode, the lock date can increase, decrease, or get removed.
  • For legal hold, the lock is indefinite.
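
The date rules above can be expressed as a short validation sketch. The function name `retention_change_allowed` is hypothetical, and the requirement that shortening or removing a governance lock needs the bypass permission is an assumption based on S3 Object Lock semantics rather than something stated in this list. Legal holds are not modeled because they carry no date.

```python
from datetime import datetime, timezone
from typing import Optional


def retention_change_allowed(mode: str,
                             current_until: datetime,
                             new_until: Optional[datetime],
                             bypass_governance: bool = False) -> bool:
    """Check a requested retain-until change; new_until=None means 'remove the lock'."""
    extends = new_until is not None and new_until >= current_until
    if mode == "COMPLIANCE":
        # Compliance: the date may only move later and the lock is never removed.
        return extends
    if mode == "GOVERNANCE":
        # Governance: extending is always allowed; shortening or removing
        # is assumed to require the bypass permission.
        return extends or bypass_governance
    raise ValueError(f"unknown retention mode: {mode}")


current = datetime(2030, 1, 1, tzinfo=timezone.utc)
earlier = datetime(2029, 1, 1, tzinfo=timezone.utc)
print(retention_change_allowed("COMPLIANCE", current, earlier))
print(retention_change_allowed("GOVERNANCE", current, earlier, bypass_governance=True))
```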

Some key points for the S3 object lock with ECS

  • Object lock requires FS (File System) access to be disabled on the bucket in ECS 3.6.2.
  • Object lock requires ADO (Access During Outage) to be disabled on the bucket in ECS 3.6.2.
  • Object lock is supported only through the S3 API, not through UI workflows, in ECS 3.6.2.
  • Object lock only works with IAM, not legacy accounts.
  • Object lock works only in versioned buckets.
  • Enabling locking on the bucket automatically makes it versioned.
  • Once bucket locking is enabled, it is not possible to disable object lock or suspend versioning for the bucket.
  • A bucket can have a default lock configuration that includes a retention mode (governance or compliance) and a retention period (in days or years).
  • Object locks apply to individual object versions only.
  • Different versions of a single object can have different retention modes and periods.
  • A lock prevents an object version from being deleted or overwritten. This does not mean that new versions can’t be created; a new version can be created with its own lock settings.
  • An object can still be deleted; the delete creates a delete marker, and the locked version still exists and remains locked.
  • Compliance mode is stricter: locks can’t be removed, decreased, or downgraded to governance mode.
  • Governance mode is less strict: locks can be removed, bypassed, or elevated to compliance mode.
  • Updating an object version’s metadata, as occurs when you place or alter an object lock, doesn’t overwrite the object version or reset its Last-Modified timestamp.
  • Retention period can be placed on an object explicitly, or implicitly through a bucket default setting.
  • Placing a default retention setting on a bucket doesn’t place any retention settings on objects that already exist in the bucket.
  • Changing a bucket’s default retention period doesn’t change the existing retention period for any objects in that bucket.
  • Object lock and traditional bucket/object ECS retention can coexist.
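
The points about explicit and default retention can be illustrated with a small sketch. The function `retention_for_new_version` is a hypothetical name, only a default expressed in days is modeled (a default in years is left out), and the rule that an explicit setting takes precedence over the bucket default is an assumption based on S3 Object Lock semantics.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

Retention = Tuple[str, datetime]  # (mode, retain-until date)


def retention_for_new_version(created: datetime,
                              explicit: Optional[Retention],
                              default_mode: Optional[str],
                              default_days: Optional[int]) -> Optional[Retention]:
    """Work out the retention a newly written object version receives."""
    # Explicit retention sent with the request is assumed to win over the default.
    if explicit is not None:
        return explicit
    # Otherwise the bucket default, if one is configured, is applied at creation time.
    if default_mode is not None and default_days is not None:
        return (default_mode, created + timedelta(days=default_days))
    # No explicit setting and no bucket default: the version is not locked.
    return None


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(retention_for_new_version(created, None, "GOVERNANCE", 1))
```

Because the default is evaluated only when a version is created, changing the bucket default later leaves the retain-until dates of existing versions untouched, which matches the behavior listed above.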

ECS object lock condition keys

Access control using IAM policies is an important part of the object lock functionality. The s3:BypassGovernanceRetention permission is important because it is required to delete a WORM-protected object in governance mode. The IAM policy condition keys listed below let you limit the retention period and legal hold settings that can be specified on objects.

Condition key                              Description
s3:object-lock-legal-hold                  Enables enforcement of the specified object legal hold status
s3:object-lock-mode                        Enables enforcement of the specified object retention mode
s3:object-lock-retain-until-date           Enables enforcement of a specific retain-until date
s3:object-lock-remaining-retention-days    Enables enforcement of an object relative to the remaining retention days
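
As an illustration of how one of these condition keys might be used, the sketch below builds an IAM policy document that denies setting a retention period longer than a chosen number of days. The function name `max_retention_policy`, the bucket name, the 30-day limit, and the `arn:aws:s3:::` resource format are all assumptions for the example; check the ECS IAM documentation for the exact policy syntax your release accepts.

```python
import json


def max_retention_policy(bucket: str, max_days: int) -> dict:
    """Build a policy that denies retention periods longer than max_days (sketch)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "LimitRetentionPeriod",
                "Effect": "Deny",
                "Action": "s3:PutObjectRetention",
                # Resource ARN format is an assumption carried over from S3.
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "NumericGreaterThan": {
                        "s3:object-lock-remaining-retention-days": str(max_days)
                    }
                },
            }
        ],
    }


print(json.dumps(max_retention_policy("my-bucket", 30), indent=2))
```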

ECS object lock API examples

This section lists s3curl examples of the object lock APIs. The Put and Get object lock APIs can be used with or without the versionId parameter. If no versionId parameter is used, the action applies to the latest version.

Operation: API request example

Create lock-enabled bucket:
s3curl.pl --id=ecsflex --createBucket -- http://${s3ip}/my-bucket -H "x-amz-bucket-object-lock-enabled: true"

Enable object lock on existing bucket:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket?enable-objectlock -X PUT

Get bucket default lock configuration:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket?object-lock

Put bucket default lock configuration:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket?object-lock -X PUT \
  -d "<ObjectLockConfiguration><ObjectLockEnabled>Enabled</ObjectLockEnabled><Rule><DefaultRetention><Mode>GOVERNANCE</Mode><Days>1</Days></DefaultRetention></Rule></ObjectLockConfiguration>"

Get legal hold:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket/obj?legal-hold

Put legal hold on create:
s3curl.pl --id=ecsflex --put=/root/100b.file -- http://${s3ip}/my-bucket/obj -H "x-amz-object-lock-legal-hold: ON"

Put legal hold on existing object (Status is ON to place the hold, OFF to release it):
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket/obj?legal-hold -X PUT -d "<LegalHold><Status>OFF</Status></LegalHold>"

Get retention:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket/obj?retention

Put retention on create:
s3curl.pl --id=ecsflex --put=/root/100b.file -- http://${s3ip}/my-bucket/obj -H "x-amz-object-lock-mode: GOVERNANCE" -H "x-amz-object-lock-retain-until-date: 2030-01-01T00:00:00.000Z"

Put retention on existing object:
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket/obj?retention -X PUT \
  -d "<Retention><Mode>GOVERNANCE</Mode><RetainUntilDate>2030-01-01T00:00:00.000Z</RetainUntilDate></Retention>"

Put retention on existing object (with bypass):
s3curl.pl --id=ecsflex -- http://${s3ip}/my-bucket/obj?retention -X PUT \
  -d "<Retention><Mode>GOVERNANCE</Mode><RetainUntilDate>2030-01-01T00:00:00.000Z</RetainUntilDate></Retention>" \
  -H "x-amz-bypass-governance-retention: true"
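
When these requests are scripted rather than typed, the URL and XML body can be generated instead of hand-written. The sketch below uses only the Python standard library; the helper names `lock_url`, `retention_body`, and `legal_hold_body` and the `http://ecs.example` endpoint are hypothetical, and the code only builds strings, it does not sign or send requests.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode
from xml.etree import ElementTree as ET


def lock_url(endpoint: str, bucket: str, key: str, subresource: str,
             version_id: Optional[str] = None) -> str:
    """Build the request URL; without version_id the latest version is targeted."""
    url = f"{endpoint}/{bucket}/{key}?{subresource}"
    if version_id is not None:
        url += "&" + urlencode({"versionId": version_id})
    return url


def retention_body(mode: str, retain_until: datetime) -> str:
    """Build the <Retention> XML body with a millisecond UTC timestamp."""
    utc = retain_until.astimezone(timezone.utc)
    stamp = utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{utc.microsecond // 1000:03d}Z"
    root = ET.Element("Retention")
    ET.SubElement(root, "Mode").text = mode
    ET.SubElement(root, "RetainUntilDate").text = stamp
    return ET.tostring(root, encoding="unicode")


def legal_hold_body(on: bool) -> str:
    """Build the <LegalHold> XML body."""
    root = ET.Element("LegalHold")
    ET.SubElement(root, "Status").text = "ON" if on else "OFF"
    return ET.tostring(root, encoding="unicode")


print(lock_url("http://ecs.example", "my-bucket", "obj", "retention", "v1"))
print(retention_body("GOVERNANCE", datetime(2030, 1, 1, tzinfo=timezone.utc)))
print(legal_hold_body(True))
```

The generated bodies match the `-d` payloads used in the s3curl examples, so they can be passed to whichever S3 client performs the signed request.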

 

Bigdata File Formats Support on Dell EMC ECS 3.6

This article examines Dell EMC ECS support for Apache Hadoop file formats in terms of disk space utilization. To do this, we use the Apache Hive service to create and store tables in different file formats, and then analyze the disk space each table uses on ECS storage.

Apache Hive supports several familiar file formats used in Apache Hadoop. Hive can load and query data files created by other Hadoop components such as Pig, Spark, and MapReduce. In this article, we look at the Apache Hive file formats TextFile, SequenceFile, RCFile, Avro, ORC, and Parquet. Cloudera Impala also supports these file formats.

To begin with, let us understand a bit about these big data file formats. Different file formats and compression codecs work better for different data sets in Hadoop. The main objective of this article is to determine their supportability on Dell EMC ECS storage, which is an S3-compatible object store for Hadoop clusters.

The Hadoop file formats are as follows:

Text File: This is the default storage format. You can use the text format to interchange data with other client applications. The text file format is very common for most applications. Data is stored in lines, with each line being a record. Each line is terminated by a newline character (\n).

The text format is a simple plain file format. You can use compression (for example, BZIP2) on the text file to reduce storage space.

Sequence File: These are Hadoop flat files that store values as binary key-value pairs. Sequence files are in binary format and are splittable. The main advantage of using a sequence file is the ability to merge two or more files into one file.

RC File: This is a record columnar file format, mainly used in Hive data warehouses, that offers high row-level compression rates. If you need to process multiple rows at a time, you can use the RCFile format. RCFile is very similar to the sequence file format and also stores data as key-value pairs.

AVRO File: AVRO is an open-source project that provides data serialization and data exchange services for Hadoop. You can exchange data between the Hadoop ecosystem and a program written in any programming language. Avro is one of the popular file formats in Big Data Hadoop based applications.

ORC File: ORC stands for Optimized Row Columnar. The ORC file format provides a highly efficient way to store data in Hive tables. This file format was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data from large tables.

More information on the ORC file format: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

Parquet File: Parquet is a column-oriented binary file format. It is highly efficient for large-scale queries, and is especially good for queries that scan particular columns within a table. Parquet tables support Snappy and gzip compression; Snappy is currently the default.

More information on the Parquet file format: https://parquet.apache.org/documentation/latest/

Note that a Cloudera CDP Private Cloud Base 7.1.6 Hadoop cluster was used for the testing below.

Disk Space Utilization on Dell EMC ECS

How much disk space do these formats use in Hadoop on Dell EMC ECS? Saving disk space is always a good thing, but it can be hard to calculate exactly how much space you will use with compression. Every file and data set is different, and the data inside will always be a determining factor for the type of compression you’ll get. Text will compress better than binary data. Repeating values and strings will compress better than purely random data, and so forth.
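
The point about repeating versus random data is easy to check with a small, self-contained experiment. The sketch below is not part of the ECS test; the sample record and the helper name `bz2_ratio` are made up for illustration, and it uses Python's standard `bz2` module rather than Hadoop codecs.

```python
import bz2
import random


def bz2_ratio(data: bytes) -> float:
    """Compressed size as a fraction of the original size."""
    return len(bz2.compress(data)) / len(data)


# Highly repetitive text: the same CSV-like record repeated many times.
repeating = b"2008,1,3,4,WN,IAD,TPA\n" * 5000

# Pseudo-random bytes of the same length (fixed seed for repeatability).
rng = random.Random(0)
random_bytes = bytes(rng.getrandbits(8) for _ in range(len(repeating)))

print(f"repeating text: {bz2_ratio(repeating):.3f}")
print(f"random bytes:   {bz2_ratio(random_bytes):.3f}")
```

The repetitive input shrinks to a small fraction of its size, while the random input stays at roughly its original size, which is why results for real data sets vary so much.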

As a simple test, we took the 2008 data set from http://stat-computing.org/dataexpo/2009/the-data.html. The compressed bz2 download measures 108.5 MB, and the uncompressed file 657.5 MB. We then uploaded the compressed file to Dell EMC ECS through the s3a protocol and created an external table on top of it:

Copy the original dataset to Hadoop cluster
[root@hop-kiran-n65 ~]# ll
total 111128
-rwxr-xr-x 1 root root 113753229 May 28 02:25 2008.csv.bz2
-rw-------. 1 root root 1273 Oct 31 2020 anaconda-ks.cfg
-rw-r--r--. 1 root root 36392 Dec 15 07:48 docu99139
[root@hop-kiran-n65 ~]# hadoop fs -put ./2008.csv.bz2 s3a://hive.ecs.bucket/diff_file_format_db/bz2/
[root@hop-kiran-n65 ~]# hadoop fs -ls s3a://hive.ecs.bucket/diff_file_format_db/bz2/
Found 1 items
-rw-rw-rw- 1 root root 113753229 2021-05-28 02:00 s3a://hive.ecs.bucket/diff_file_format_db/bz2/2008.csv.bz2
[root@hop-kiran-n65 ~]#
From a Hadoop compute node, create a database with its data location on the ECS bucket, and create an external table for the flights data uploaded to the ECS bucket location.
DROP DATABASE IF EXISTS diff_file_format_db CASCADE;

CREATE database diff_file_format_db COMMENT 'Holds all the tables data on ECS bucket' LOCATION 's3a://hive.ecs.bucket/diff_file_format_db' ;
USE diff_file_format_db;

Create external table flight_arrivals_txt_bz2 (
year int,
month int,
DayofMonth int,
DayOfWeek int,
DepTime int,
CRSDepTime int,
ArrTime int,
CRSArrTime int,
UniqueCarrier string,
FlightNum int,
TailNum string,
ActualElapsedTime int,
CRSElapsedTime int,
AirTime int,
ArrDelay int,
DepDelay int,
Origin string,
Dest string,
Distance int,
TaxiIn int,
TaxiOut int,
Cancelled int,
CancellationCode int,
Diverted int,
CarrierDelay string,
WeatherDelay string,
NASDelay string,
SecurityDelay string,
LateAircraftDelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
location 's3a://hive.ecs.bucket/diff_file_format_db/bz2/';
The total number of records in this master table is
select count(*) from flight_arrivals_txt_bz2 ;

+----------+
|   _c0    |
+----------+
| 7009728  |
+----------+
Similarly, create different file format tables using the master table

To create a table in a different file format, simply specify the ‘STORED AS FileFormatName’ option at the end of the CREATE TABLE command.

Create external table flight_arrivals_external_orc stored as ORC as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_parquet stored as Parquet as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_textfile stored as textfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_sequencefile stored as sequencefile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_rcfile stored as rcfile as select * from flight_arrivals_txt_bz2;
Create external table flight_arrivals_external_avro stored as avro as select * from flight_arrivals_txt_bz2;
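
The six statements above follow one pattern, so they can also be generated. The sketch below is a hypothetical helper, not part of the original test; `ctas_statement` and the `FORMATS` list are illustrative names, and the output would still need to be run through a Hive client such as beeline.

```python
FORMATS = ["ORC", "Parquet", "textfile", "sequencefile", "rcfile", "avro"]


def ctas_statement(fmt: str, source: str = "flight_arrivals_txt_bz2") -> str:
    """Build a CREATE TABLE ... AS SELECT statement for one file format."""
    table = f"flight_arrivals_external_{fmt.lower()}"
    return (f"CREATE EXTERNAL TABLE {table} "
            f"STORED AS {fmt.upper()} AS SELECT * FROM {source};")


for fmt in FORMATS:
    print(ctas_statement(fmt))
```
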
Disk space utilization of the tables

Now, let us compare the disk usage on ECS of all the files from Hadoop compute nodes.

[root@hop-kiran-n65 ~]# hadoop fs -du -h s3a://hive.ecs.bucket/diff_file_format_db/ | grep flight_arrivals
597.7 M 597.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M 93.5 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M 146.6 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
403.1 M 403.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_rcfile
751.1 M 751.1 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_sequencefile
670.7 M 670.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_textfile
[root@hop-kiran-n65 ~]#
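
For further analysis, the listing above can be parsed into numbers. This is a minimal sketch that assumes each line has the five-token shape shown here (size, unit, size, unit, path) and that the units are binary multiples; `parse_du` is a hypothetical helper, and lines in any other shape are skipped.

```python
UNIT = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

SAMPLE = """\
597.7 M 597.7 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_avro
93.5 M 93.5 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_orc
146.6 M 146.6 M s3a://hive.ecs.bucket/diff_file_format_db/flight_arrivals_external_parquet
"""


def parse_du(output: str) -> dict:
    """Map table name to size in bytes from 'hadoop fs -du -h' style lines."""
    sizes = {}
    for line in output.strip().splitlines():
        parts = line.split()
        # Skip anything that is not "<size> <unit> <size> <unit> <path>".
        if len(parts) != 5 or parts[1] not in UNIT:
            continue
        name = parts[4].rsplit("/", 1)[-1]
        sizes[name] = float(parts[0]) * UNIT[parts[1]]
    return sizes


sizes = parse_du(SAMPLE)
print(min(sizes, key=sizes.get))
```
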

Summary

From the table below, we can conclude that Dell EMC ECS as S3 storage supports all of these Hadoop file formats and provides the same disk utilization as traditional HDFS storage.

For compressed percentage, lower is better; for compression ratio, higher is better.

Format        Size       Compressed %    Compression ratio
CSV (Text)    670.7 M    (baseline)
BZ2           108.5 M    16.18%          83.82%
ORC           93.5 M     13.94%          86.06%
Parquet       146.6 M    21.85%          78.15%
RC File       403.1 M    60.10%          39.90%
AVRO          597.7 M    89.12%          10.88%
Sequence      751.1 M    111.97%         -11.97%
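
The two percentage columns can be reproduced from the sizes. The sketch below assumes that the compressed percentage is the table size divided by the CSV text size (670.7 M) and that the compression ratio is 100% minus that value; the function names are illustrative. Because it starts from the rounded sizes, the last digit can differ slightly from the published figures.

```python
BASELINE_MB = 670.7  # size of the plain CSV text table

SIZES_MB = {
    "BZ2": 108.5,
    "ORC": 93.5,
    "Parquet": 146.6,
    "RC File": 403.1,
    "AVRO": 597.7,
    "Sequence": 751.1,
}


def compressed_percent(size_mb: float, baseline_mb: float = BASELINE_MB) -> float:
    """Size relative to the uncompressed text baseline, in percent."""
    return size_mb / baseline_mb * 100


def space_saved_percent(size_mb: float, baseline_mb: float = BASELINE_MB) -> float:
    """Space saved relative to the baseline; negative means the format is larger."""
    return 100 - compressed_percent(size_mb, baseline_mb)


for name, size in SIZES_MB.items():
    print(f"{name:10s} {compressed_percent(size):7.2f}% {space_saved_percent(size):7.2f}%")
```

A negative value in the last column, as for the sequence file, means the format takes more space than the plain text baseline.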

Default settings and values were used to create all of the file format tables; no other optimizations were done for this testing. Each file format ships with many options and optimizations for compressing data, but only the defaults that ship with CDP Private Cloud Base 7.1.6 were used.

Dell EMC ECS IAM and Hadoop S3A Implementation

This paper provides basic information on the IAM features of Dell EMC ECS and a step-by-step process for configuring ECS with AD FS to demonstrate the SAML support features that allow a Hadoop administrator to set up access policies controlling access to S3A Hadoop data.

https://www.dellemc.com/resources/en-us/asset/white-papers/products/storage/h18420-dell-emc-ecs-iam-and-hadoop-s3a-implementation.pdf

ECS CIFS Gateway Demo

ECS CIFS Gateway

Accessing Data On ECS with CIFS Gateway

Elastic Cloud Storage (ECS) is an object-based platform supporting the S3, HDFS, and NFS protocols. However, what happens when you want to access data in a Windows environment through Server Message Block (SMB)? ECS now offers a CIFS Gateway that builds in SMB support for accessing data in ECS.

The ECS CIFS Gateway can easily be installed on Windows-based machines to provide file shares. In a multiprotocol world, this allows data to be written via S3 and then shared out through SMB, or vice versa. Check out the video below for the ECS CIFS Gateway demo.

Transcript – ECS CIFS Gateway

Hi folks! Thomas Henson here with thomashenson.com. Today is another episode of Isilon Quick Tips. In this one, we’re going to show how to use ECS to set up CIFS shares. First thing, let’s jump in, and let’s look at our users and our CIFS users. This is the specific user. It’s going to be used to set up and access our shares. Now, I’ve already downloaded the EXE file. You can see this CIFS ECS 1.2 version. Let’s click on this and try to install this real quick. Accept the licensing agreement, and verify that this is where we want to put our directory and this program file.

Now, as this is installing, have it finish up. We’re going to map that first ECS directory. We’re going to call this our local ECS. For our CIFS host, all files and folders to lowercase. Let’s go in here, our lab ECS. You can see here all the required fields. Let’s put back in our CIFS user for our user ID. You can see we’re going to use HTTPS and we’re going to set it up to HTTP, and 9020 is going to be our [Inaudible 00:01:40].

Add in our host name, which is ECS.demo.local. Add that over to our list. Verify that works. Use this one, and let’s find out CIFS bucket. CIFS bucket is CIFS data. Got that selected. Now, let’s move along, and verify everything. Everything looks fine. Let’s finish this up. Now, we have that share to our drive. Let’s go ahead and select that E drive. Our local ECS, and let’s put a file filter on it. What’s going to do is, we’re going to say that we want to exclude MP3s. Say that you didn’t want MP3s to come into this file share. Put some kind of policy on it, you have the ability here to lock that in. We can add that to this local ECS to do just map to our environment. Now, we’ve stopped MP3s from being uploaded. Let’s test this out by opening and creating out a test document. Go ahead and test out our first document that we uploaded to our E share, here, on our local ECS. We’ve got this. Let’s look at the properties here. Let’s see. We have our CIFS ECS. Appears to be uploaded.

Now, let’s double-check that by jumping into Cyberduck and using their S3 protocol to check out that CIFS data. You can see here that we have our test document. Congratulations, just use drive to upload a document.