Diagnosing ARX ‘WORKSLOW’ Problems

Tech Tip submitted by Dave Porter, F5 Data Solutions

Introduction

The effectiveness of CIFS virtualization in the ARX is impaired by slow response from external systems such as file servers and domain controllers, resulting in poor response to client requests, and so-called WORKSLOW messages in syslog. Version 6.01.000 of the ARX adds new aids to diagnosing the underlying cause.

Problem Statement

The problem we’re solving here is one of diagnosis: ARX users complain that ‘the ARX is slow’, and then a knowledgeable technician must try and determine the reason. Over time, we have learned that a good first step is to look for WORKSLOW messages in the system log. But then what? In the past, that has required dumping internal data structures, checking various counters, getting network traces, and so on.

In V6.01.000, we’ve attempted to provide a single command that can serve as the first point of diagnosis. Information that was previously hidden in private data structures is now exposed to view.

Problem Solution

The new command is ‘show statistics cifs work-queues’, and some example output is shown below. The output is a little on the technical side; there's not really any way to simplify it, since slowdowns will naturally be related to internal structure. In this document, the output is broken into sections for explanation.

The command used for this example was

show statistics cifs work-queues volume-group 2

The overall output structure is: firstly, information about how the internal queues are doing, followed by more detailed info that might help you determine where any slowdowns are occurring.

Terminology note: ‘dNAS’ is the name of the control plane volume manager component; ‘vcifs’ is the name of the CIFS subcomponent of dNAS. Both of these names are used in syslog as well as in the present document. The volume manager is involved in more-or-less all CIFS traffic except data reads and writes.

Warning: this example is a composite, synthesized from bits and pieces. The numbers might not be consistent.

Work Queues

First, the work queues (or ‘work lists’; the terms are interchangeable). Vcifs has some number of these, used for different purposes. The formal name and approximate use are given. Stats indicate the number of items currently queued, and the high-water mark on the number of items queued. The number of items completed is given—whether the CIFS operation ‘succeeded’ or ‘failed’ is immaterial here; a failed access is not a problem in our software, e.g. user might have typed the wrong filename. ‘Discarded’ happens in times of stress; discarded items never even got started.

CIFS work queue pool statistics
-------------------------------

  Work list "main": used for most client requests
    0 items active or waiting; highest was 4 at 06/01/2011 15:33:49 -0400
    8 items completed, 0 discarded
    Average 13 uSec queued, 1277 uSec processing
    Time waiting in queue:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        33         2         0         1         0         9         2
    Time processing on thread:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        34         1         0         1         0         9         2

  Work list "auth": used for client authentication
    0 items active or waiting; highest was 1 at 06/01/2011 15:19:33 -0400
    12 items completed, 0 discarded
    Average 16 uSec queued, 1332 uSec processing
    Time waiting in queue:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        33        11         2         0         0         0         0
    Time processing on thread:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        34        10         2         0         0         0         0

  Work list "setup": authentication support for data plane
    0 items active or waiting; highest was 1 at 06/01/2011 15:18:12 -0400
    2 items completed, 0 discarded
    Average 26 uSec queued, 866 uSec processing
    Time waiting in queue:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:         8         0         0         0         0         0         0
    Time processing on thread:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:         8         0         0         0         0         0         0

  Work list "special": internal notifications, including client disconnect
    0 items active or waiting; highest was 1 at 06/01/2011 15:33:50 -0400
    4 items completed, 0 discarded
    Average 27 uSec queued, 151 uSec processing
    Time waiting in queue:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        14         0         0         0         0         1         0
    Time processing on thread:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:        14         0         0         0         0         1         0

  Work list "no-op": processes keepalives from data plane
    0 items active or waiting; highest was 1 at 06/01/2011 15:33:51 -0400
    57 items completed, 0 discarded
    Average 22 uSec queued, 44 uSec processing
    Time waiting in queue:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:       997        10         0         0         0         0         0
    Time processing on thread:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:      1007         0         0         0         0         0         0

The average times are calculated over ‘completed’ items. Queue time is the time between being placed on the queue and being picked up by a worker thread; this will be non-zero even if worker threads were immediately available. Processing time is the time for which a worker thread was dedicated to this item of work. The vcifs implementation means that a work item ties up a work thread from the time it starts processing until the time it is completed.

In this example, there were a handful of client requests that could be problematic: 9 requests took 10 to 100 seconds, and 2 requests took more than 100 seconds.

There are histograms showing the distribution of the queue times and processing times for the completed work items; the definitions of queue time and processing time are as above. The histogram scale is logarithmic, ranging from 1 msec to 100 secs.

At the end of all the work-list info, there is a table of status messages.

Work queue status messages:
    Timestamp (UTC)      WkList  Event   Age(sec) Operation
    2011-04-08 13:43:20  main    DONE    224      CLIENT_REQ
                                                  TRANSACTION2/FIND_FIRST2
                                                  [in filer wait, 192.168.97.45, 223 sec]
    2011-04-08 14:26:26  main    DONE    288      CLIENT_REQ
                                                  TRANSACTION2/FIND_FIRST2
                                                  [in filer wait, 192.168.97.45, 225 sec]
    2011-04-08 14:42:07  main    SLOW    40       REQ_FILE_OPEN
                                                  NT_CREATE_ANDX
                                                  [in dnas wait, 40 sec]
    2011-04-08 14:59:32  main    SLOW    40       TREE_CONNECT
                                                  [in dnas wait, 39 sec]

There is a status entry corresponding to the WORKSLOW and WORKSTUCK log messages, with approximately the same information. The ‘event’ is SLOW, STUCK, or DONE; this last one is the case when a previously SLOW or STUCK message is finally completed.

Thus, a given work item may first show up as SLOW after ~30 secs on a work thread. It will be updated to STUCK at about the 5 minute mark. It may eventually be further updated to DONE, if we're lucky.

The timestamp shows the real time at which the entry was last updated (i.e., first became SLOW, or transitioned to STUCK or DONE) and the ‘age’ was the number of seconds it had been occupying a thread at the time of update.

The ‘operation’ contains an internal message type indicator on the first line, and where relevant, a CIFS message type on a second line.

For currently active (SLOW/STUCK, not DONE) operations, the current run state and time in that state are shown. The state is the same state that has always been part of the log messages; the time in state is a recent addition to log messages and this display: it removes ambiguity about whether we've really been stuck in one place for most of the operation. In the example, yes, we have been stuck in dNAS all that time. Filer-wait state now shows the IP address of the filer in question.

For DONE events, the displayed run state and time in that state are the worst (longest) experienced by the item in its lifetime. Note that it's the single worst: if a given work item accessed filers twice, taking 35 seconds for one access and 40 seconds for the other, then the DONE even will show a filer-wait time of 40 secs, not 75 secs.

The last 20 such events are retained; the table may be cleared on dNAS restart.

Processing Time Per Op Type

This gives you a more detailed breakdown of what's taking so long, as measured at the 'front end' of vcifs.

Average processing time per operation type
------------------------------------------

  CIFS SMB                               Count   Avg time (uSec)

  Transaction2/FindFirst2                    3         170709437
  NTCreateAndX                               1          32007162
  TreeConnectAndX                           16              6783
  SessionSetupAndX                          30              1644
  Negotiate                                 16               674
  TreeDisconnect                             1                46
  Transaction2/QueryFSInfo                   2                29

  All SMBs                                  69           7888467

After having determined that some operations are taking a long time, by looking at the work-queues info described above, this helps you figure out whether particular SMB types are taking longer than others. Here, directory enumeration (FindFirst2) is troublesome, as is file lookup and creation (NTCreateAndx).

The processing time is the time for which a given SMB was tying up a single vcifs thread; for each SMB type, the count of those SMBs and the average processing time are both shown. The output is ordered by average processing time (longest first) and entries with zero counts are not displayed.

File Server Round Trip Time Per Op Type

For operations that appear to be taking an unusually long time, this helps you determine whether the problem lies with the file server.

Average file server round-trip time per operation type
------------------------------------------------------

  CIFS SMB                               Count        RTT (uSec)

  Close                                      1              1479
  Transaction2/FindFirst2                    4              1445
  SessionSetupAndX                          30              1356
  Negotiate                                 14               860
  NTCreateAndX                               1               705
  TreeConnectAndX                           16               622
  TreeDisconnect                             4               352
  Transaction2/QueryFSInfo                 650               297

  All SMBs                                 720               368

  No response timeouts
  No network errors

The average round-trip time is given, per CIFS operation type, for all shares in all volumes serviced by this dNAS instance. The time is measured from when vcifs transmits the request to the time at which it has received the entire response.

Cases of network error or response timeout do not contribute to the average round-trip time, but are instead counted separately.

If there appear to be cases of slow file server response, the particular server or share can be isolated using the ‘show statistics filer’ command. The '’show statistics cifs work-queues’ command only presents the overall averages.

Time Spent In Authentication

The last major section of the output shows statistics concerning authentication, particularly the time taken for authentication requests outside the ARX.

Time spent in authentication 
----------------------------

  Kerberos service ticket requests:
    Total calls: 2
    Elapsed time in call:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:         0         1         1         0         0         0         0
    Average time: 0.016 secs
    DC replacement requested: 0

  Netlogon requests:
    Total calls: 0
    Elapsed time in call:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:         0         0         0         0         0         0         0
    Average time: none

  ARX Secure Agent requests:
    Total calls: 0
    Elapsed time in call:
      Secs:    < 0.001    < 0.01     < 0.1       < 1      < 10     < 100    Longer
      Count:         0         0         0         0         0         0         0
    Average time: none

We retain separate statistics for getting Kerberos service tickets, for authenticating clients with Netlogon, and for communication with the ARX Secure Agent for client and file server logons. The results are presented as a histogram. In the above example, there is nothing that is worrying.

In addition, there is a table showing details of the 20 longest round-trip times.

Table of longest recorded round-trip times:
    Timestamp (UTC)      Type   Time (uSec)  Principal Name
    2011-04-08 13:07:38  KRB          30596  Administrator@MOUTH.ORG
    2011-04-08 13:07:38  KRB           1455  Administrator@MOUTH.ORG

The ‘type’ is one of KRB, ASA, or NL (the last-named being for NetLogon). The principal name is the name of the client for whom authentication is being done.

Summary

This new command should help you with diagnosis of CIFS response-time problems, by providing the information you need to get started on figuring out what is wrong. You may still need to use other commands to further isolate the issue (for example, to determine which particular filer is slow), but we hope it presents the top-level data in a useful form.

Author Bio

Dave Porter is an architect with F5 Data Solutions; he’s been working on ARX file server virtualization since it was invented, and is the principal author of the ARX control-plane CIFS code.

Published Oct 07, 2011

Version 1.0

news

storage

tech tip

Diagnosing ARX &lsquo;WORKSLOW&rsquo; Problems

Diagnosing ARX ‘WORKSLOW’ Problems