
sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version: Lustre 2.12.8
    • Affects Versions: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.6, Lustre 2.12.7

    Description

      sanity-quota test_6 started failing on November 13, 2018, Lustre tag 2.11.56.140, with the error

      [ 1733.308968] LNet: Service thread pid 18400 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 
      

      In sanity-quota test_6, we scan the OST1 dmesg log to see whether the watchdog was triggered. The dmesg log from OST1 (vm3) in https://testing.whamcloud.com/test_sets/9f3095ea-fdc2-11e8-b837-52540065bddc contains the LNet error and the stack trace:

      [18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
      [18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [18795.137958] Pid: 14192, comm: ll_ost_io00_002 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:52:11 UTC 2018
      [18795.138944] Call Trace:
      [18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      [18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      [18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
      [18795.141528]  [<ffffffffc1123383>] qsd_acquire+0x8e3/0xcb0 [lquota]
      [18795.142183]  [<ffffffffc11238d4>] qsd_op_begin0+0x184/0x960 [lquota]
      [18795.142838]  [<ffffffffc1124312>] qsd_op_begin+0x262/0x4b0 [lquota]
      [18795.143571]  [<ffffffffc116eac7>] osd_declare_quota+0xd7/0x360 [osd_zfs]
      [18795.144322]  [<ffffffffc1177ff0>] osd_declare_write_commit+0x3d0/0x7f0 [osd_zfs]
      [18795.145083]  [<ffffffffc12958d9>] ofd_commitrw_write+0x939/0x1d40 [ofd]
      [18795.145833]  [<ffffffffc1299de2>] ofd_commitrw+0x4b2/0xa10 [ofd]
      [18795.146465]  [<ffffffffc0f98d6c>] obd_commitrw+0x9c/0x370 [ptlrpc]
      [18795.147178]  [<ffffffffc0f9b9dd>] tgt_brw_write+0x100d/0x1a90 [ptlrpc]
      [18795.147927]  [<ffffffffc0f9f29a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [18795.148649]  [<ffffffffc0f4391b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [18795.149488]  [<ffffffffc0f4724c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
      [18795.150788]  [<ffffffff871255f7>] ret_from_fork_nospec_end+0x0/0x39
      [18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
      [18795.152089] LustreError: dumping log to /tmp/lustre-log.1544552141.14192
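
      The check described above, scanning a dmesg log for the watchdog message, can be sketched in Python. This is only an illustration and not the code that sanity-quota test_6 actually runs; the names WATCHDOG_RE and find_watchdog_events are hypothetical, and the pattern is derived solely from the message text quoted in this ticket.

```python
import re

# Pattern derived from the LNet watchdog message quoted in this ticket.
# Assumption: the inactivity time is always printed as a decimal number
# followed by "s".
WATCHDOG_RE = re.compile(
    r"LNet: Service thread pid (?P<pid>\d+) was inactive for "
    r"(?P<secs>\d+(?:\.\d+)?)s"
)


def find_watchdog_events(dmesg_text):
    """Return a list of (pid, inactive_seconds) for each watchdog message."""
    events = []
    for line in dmesg_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append((int(match.group("pid")), float(match.group("secs"))))
    return events


# Abbreviated excerpt of the OST1 dmesg log shown above.
sample = """\
[18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
[18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18795.138944] Call Trace:
"""

if __name__ == "__main__":
    print(find_watchdog_events(sample))
```

      A non-empty result means the watchdog fired at least once in the scanned log, which is the condition the test treats as a failure.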
       

      There is no other indication of a problem in the console and dmesg logs. We see this issue for both zfs and ldiskfs environments.

      Some of these failures are attributed to LU-11644, but the stack traces do not look the same.
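
      One way to check whether two stack traces "look the same" is to reduce each trace to its sequence of function and module names, ignoring addresses and offsets. The sketch below is a hypothetical helper, not a tool used in this ticket; FRAME_RE, trace_signature, and same_signature are illustrative names, and the frame format is assumed to match the call trace quoted above.

```python
import re

# One call-trace frame, e.g.
#   [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
# The trailing "[module]" is absent for core kernel symbols.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def trace_signature(dmesg_text):
    """Return the list of (function, module) pairs found in a call trace."""
    frames = []
    for line in dmesg_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module") or "kernel"))
    return frames


def same_signature(trace_a, trace_b, depth=5):
    """Compare the top `depth` frames of two traces by function and module."""
    return trace_signature(trace_a)[:depth] == trace_signature(trace_b)[:depth]


# Abbreviated excerpt of the call trace from this ticket.
sample = """\
[18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
[18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
[18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
"""

if __name__ == "__main__":
    for func, module in trace_signature(sample):
        print(f"{func} [{module}]")
```

      Comparing only the top few frames is a design choice: the bottom of most service-thread traces (ptlrpc_main, kthread) is shared, so the distinguishing information sits near the top.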

      Logs for this failure are at
      https://testing.whamcloud.com/test_sets/e2bf61ea-e78f-11e8-b67f-52540065bddc
      https://testing.whamcloud.com/test_sets/bca63f5a-f60e-11e8-bfe1-52540065bddc
      https://testing.whamcloud.com/test_sets/613c72d6-f5d9-11e8-bfe1-52540065bddc

          Activity

            [LU-11768] sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’

            gerrit Gerrit Updater added a comment -

            Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41345
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ff23257c5a429153fab54ee3317863c6cfad04b5

            ofaaland Olaf Faaland added a comment -

            I see the following patch in master that was not backported to b2_12 and that looks like it may be a fix for this issue. Does that look right?

            • 550af84 LU-11768 test: make at_max to take effect

            hornc Chris Horn added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sessions/c402e1ee-1be5-4ef3-859d-8da51d2ce887
            adilger Andreas Dilger added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sessions/6d690370-68b6-4cd4-9ef4-9656d8b00c07

            sebastien Sebastien Buisson added a comment -

            This ticket might need to be reopened; it looks like I have a new occurrence on 2.13.50:
            https://testing.whamcloud.com/test_sets/45f983d2-1d52-11ea-adca-52540065bddc

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36365/
            Subject: LU-11768 test: limit at_max to timeout in time
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1d37c1a04efdcc64232019ba09a97ae1ff0a083e

            pjones Peter Jones added a comment -

            Second attempt landed for 2.13


            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36431/
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 550af84a91505c85824ffad2990d31c8e8ab4dd9

            gerrit Gerrit Updater added a comment -

            Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36431
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 136c71c7134b92258c77457e11b7472143a38f8d

            adilger Andreas Dilger added a comment -

            I compared the sanity-quota test_6 results from the past week with those from the week before the patch landed, and both weeks had about the same number of failures (20 vs. 22, respectively).

            jamesanunez James Nunez (Inactive) added a comment -

            It looks like we are still experiencing this issue on master. See https://testing.whamcloud.com/test_sets/31f6894c-e9fa-11e9-9874-52540065bddc for a recent failure and logs.

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: jamesanunez James Nunez (Inactive)