Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-14548

sanityn test 31a hangs in client lock with 'Could not add any time (5/5), not sending early reply'

    XMLWordPrintable

Details

    • Bug
    • Resolution: Unresolved
    • Minor
    • None
    • Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0
    • 3
    • 9223372036854775807

    Description

      sanityn test_31a hangs on the client. This hang was first seen on 21 AUG 2020 while testing the patch https://review.whamcloud.com/39598 (LU-13692 ldlm: Ensure we reprocess the resource on ast error) on Lustre 2.13.55.46 with logs at https://testing.whamcloud.com/test_sets/6b73874c-2358-4c75-ac94-018ba68b604b. The first time it was seen for non-review testing was on 01 NOV 2020 for an interop test session between 2.12.5 clients and 2.13.56.63 servers; https://testing.whamcloud.com/test_sets/551af6c3-ac5d-4d68-9f70-9ed93187afda.

      Looking at a recent hang between 2.13.0 clients and 2.14.50.203 servers at https://testing.whamcloud.com/test_sets/12bfd95b-3a68-4228-82b9-0a83f065233e, we see the following trace on client1 (vm1):

      [ 8288.619007] Lustre: DEBUG MARKER: == sanityn test 31a: voluntary cancel / blocking ast race============================================= 00:52:53 (1616460773)
      [ 8288.941328] Lustre: *** cfs_fail_loc=314, val=0***
      [ 8329.079615] Lustre: ldlm_cb00_000: service thread pid 15494 was inactive for 40.125 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [ 8329.083243] Pid: 15494, comm: ldlm_cb00_000 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019
      [ 8329.084917] Call Trace:
      [ 8329.085424]  [<ffffffffc0bbd039>] ldlm_handle_cp_callback+0x109/0xb20 [ptlrpc]
      [ 8329.087092]  [<ffffffffc0bc0d4e>] ldlm_callback_handler.part.11+0x153e/0x1dd0 [ptlrpc]
      [ 8329.088550]  [<ffffffffc0bc1617>] ldlm_callback_handler+0x37/0xd0 [ptlrpc]
      [ 8329.089896]  [<ffffffffc0bee856>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
      [ 8329.091347]  [<ffffffffc0bf238c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
      [ 8329.092652]  [<ffffffffba8c50d1>] kthread+0xd1/0xe0
      [ 8329.093600]  [<ffffffffbaf8cd37>] ret_from_fork_nospec_end+0x0/0x39
      [ 8329.094818]  [<ffffffffffffffff>] 0xffffffffffffffff
      [ 8883.934431] Lustre: 21838:0:(service.c:1442:ptlrpc_at_send_early_reply()) @@@ Could not add any time (5/5), not sending early reply  req@ffff96ac93018000 x1694973578214336/t0(0) o105->LOV_OSC_UUID@10.9.5.214@tcp:334/0 lens 392/224 e 24 to 0 dl 1616461374 ref 2 fl Interpret:/0/0 rc 0/0 job:''
      [11946.899331] SysRq : Changing Loglevel
      

      We’ve seen sanityn test 31a hang in this way 124 times since August 2020 and many of the hangs are for interop testing. Logs for recent failures are at:
      2.10.8 clients/2.12.6.43 servers - https://testing.whamcloud.com/test_sets/82787b59-0ba3-40f5-a0f2-8ca4cf357994
      2.11.0 clients/2.12.6.43 servers - https://testing.whamcloud.com/test_sets/4a602563-ea81-4301-8b1a-05cb08b5c984
      2.13.0 clients/2.14.50.190 servers - https://testing.whamcloud.com/test_sets/7d866672-6aeb-4340-ba74-38be6fb836b5

      Attachments

        Issue Links

          Activity

            People

              wc-triage WC Triage
              jamesanunez James Nunez (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated: