OneFS Job Engine Monitoring & Reporting

The OneFS Job Engine provides detailed monitoring and statistics gathering, with insight into jobs and Job Engine. A variety of Job Engine specific metrics are available via the OneFS CLI, including per job disk usage, etc, with CLI command ‘isi job statistics list’.

For example:

# isi job statistics list

Job ID  Phase  CPU Avg.  Virt. Mem. Avg.  Phys. Mem. Avg.  I/O Ops  I/O Bytes

----------------------------------------------------------------------------

87      1      0.52%     77.44M           37.00M           1115337  8.51G

88      1      2.35%     75.84M           36.80M           16192    125.66M

89      1      0.48%     6.01M            2.67M            0        0.00

----------------------------------------------------------------------------

In verbose mode, worker statistics and job level resource usage can also be viewed:

# isi job statistics list -v | more

Job ID: 87

 Phase: 1
 Nodes
             Node: 1
              PID: 38164
              CPU: 0.00% (0.00% min, 4.30% max, 0.40% avg)

       Virtual Mem.: 76.92M (74.38M min, 76.92M max, 76.53M avg)
      Physical Mem.: 36.33M (35.93M min, 36.33M max, 36.13M avg)

           I/O Read: 16634 ops, 129.95M
          I/O Write: 38899 ops, 303.90M
          Workers: 1 (9.00 STW avg.)

             Node: 2
              PID: 66095
              CPU: 0.68% (0.34% min, 4.98% max, 1.22% avg)

       Virtual Mem.: 77.05M (74.38M min, 77.05M max, 76.92M avg)
      Physical Mem.: 42.07M (41.39M min, 42.07M max, 41.97M avg)

           I/O Read: 16602 ops, 129.70M
          I/O Write: 39173 ops, 306.04M
          Workers: 1 (0.00 STW avg.)
…

Additionally, the status of the Job Engine workers is available via the OneFS CLI using the ‘isi job statistics view’ command. For example:

# isi job statistics view --job-id 857

Job ID: 857

Phase: 1
Nodes
  Node: 1
     PID: 26224
     CPU: 7.96% (0.00% min, 28.96% max, 4.60% avg)

 Virtual: 187.23M (187.23M min, 187.23M max, 187.23M avg)
Physical: 19.01M (18.52M min, 19.33M max, 18.96M avg)

    Read: 931043 ops, 7.099G
   Write: 1610213 ops, 12.269G
 Workers: 1 (0.00 STW avg.)

Job events, including pause/resume, waiting, phase completion, job success, failure, etc, are reported under the ‘Job Events’ tab of the WebUI. Additional information for each event is available via the “View Details” button for the appropriate job events entry in the WebUI. These are accessed by navigating to Cluster Management > Job Operations > Job Events.

A comprehensive job report is also provided for each phase of a job. This report contains detailed information on runtime, CPU, drive and memory utilization, the number of data and metadata objects scanned, and other work details or errors specific to the job type.

After a job finishes, you can view a report about the job from the CLI. You need to specify the job ID to view the report for a completed job. The isi job reports list command displays a list of all recent jobs, including job IDs. Run the isi job reports view command with a specific job ID. The following command displays the report of a Collect job with an ID of 857:

# isi job reports view 857

Collect[857] phase 1 (2020-11-14T11:39:57)

------------------------------------------

LIN scan

Elapsed time:              6506 seconds

LINs traversed:            433423

Files seen:                396980

Directories seen:          36439

Errors:                    0

Total blocks:              27357443452 (13678721726 KB)

CPU usage:                 max 28% (dev 1), min 0% (dev 1), avg 4%

Virtual memory size:       max 193300K (dev 1), min 191728K (dev 1), avg 1925

Resident memory size:      max 21304K (dev 1), min 18884K (dev 2), avg 20294K

Read:                      11637860 ops, 95272875008 bytes (90859.3M)

Write:                     20717079 ops, 169663891968 bytes (161804.1M)

While a job is running, an Active Job Details report is also available. This provides contextual information, including elapsed time, current job phase, job progress status, etc.

For inode (LIN) based jobs, progress as an estimated percentage completion is also displayed, based on processed LIN counts.

Detailed and granular job performance information and statistics are now available in a job’s report. These new statistics include per job phase CPU and memory utilization (including minimum, minimum, and average), and total read and write IOPS and throughput.

OneFS performance resource management provides statistics for the resources used by jobs – both cluster-wide and per-node. This information is available via the isi statistics workload CLI command. Available in a ‘top’ format, this command displays the top jobs and processes, and periodically updates the information.

For example, the following syntax shows, and indefinitely refreshes, the top five processes on a cluster:

# isi statistics workload –-limit=5 -–format=top

last update:  2020-11-14T16:45:25 (s)ort: default

CPU   Reads Writes      L2    L3    Node  SystemName        JobType

1.4s  9.1k  0.0         3.5k  497.0 2     Job:  237         IntegrityScan[0]

1.2s  85.7  714.7       4.9k  0.0   1     Job:  238         Dedupe[0]

1.2s  9.5k  0.0         3.5k  48.5  1     Job:  237         IntegrityScan[0]

1.2s  7.4k  541.3       4.9k  0.0   3     Job:  238         Dedupe[0]

1.1s  7.9k  0.0         3.5k  41.6  2     Job:  237         IntegrityScan[0]

The resource statistics tracked per job, per job phase, and per node include CPU, reads, writes, and L2 & L3 cache hits. Unlike the output from the ‘top’ command, this makes it easier to diagnose individual job resource issues, etc.

Leave a Reply

Your email address will not be published. Required fields are marked *