Lustre / LU-11768

sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’

Details

    • Bug
    • Resolution: Fixed
    • Minor
    • Lustre 2.12.8
    • Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.6, Lustre 2.12.7
    • None
    • 3

    Description

      sanity-quota test_6 started failing on November 13, 2018 (Lustre tag 2.11.56.140) with the error:

      [ 1733.308968] LNet: Service thread pid 18400 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 
      

      In sanity-quota test_6, we scan the OST1 dmesg log to see whether the watchdog was triggered. In the logs for https://testing.whamcloud.com/test_sets/9f3095ea-fdc2-11e8-b837-52540065bddc , the dmesg log from OST1 (vm3) contains the LNet error and the stack trace:

      [18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
      [18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [18795.137958] Pid: 14192, comm: ll_ost_io00_002 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:52:11 UTC 2018
      [18795.138944] Call Trace:
      [18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      [18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      [18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
      [18795.141528]  [<ffffffffc1123383>] qsd_acquire+0x8e3/0xcb0 [lquota]
      [18795.142183]  [<ffffffffc11238d4>] qsd_op_begin0+0x184/0x960 [lquota]
      [18795.142838]  [<ffffffffc1124312>] qsd_op_begin+0x262/0x4b0 [lquota]
      [18795.143571]  [<ffffffffc116eac7>] osd_declare_quota+0xd7/0x360 [osd_zfs]
      [18795.144322]  [<ffffffffc1177ff0>] osd_declare_write_commit+0x3d0/0x7f0 [osd_zfs]
      [18795.145083]  [<ffffffffc12958d9>] ofd_commitrw_write+0x939/0x1d40 [ofd]
      [18795.145833]  [<ffffffffc1299de2>] ofd_commitrw+0x4b2/0xa10 [ofd]
      [18795.146465]  [<ffffffffc0f98d6c>] obd_commitrw+0x9c/0x370 [ptlrpc]
      [18795.147178]  [<ffffffffc0f9b9dd>] tgt_brw_write+0x100d/0x1a90 [ptlrpc]
      [18795.147927]  [<ffffffffc0f9f29a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [18795.148649]  [<ffffffffc0f4391b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [18795.149488]  [<ffffffffc0f4724c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
      [18795.150788]  [<ffffffff871255f7>] ret_from_fork_nospec_end+0x0/0x39
      [18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
      [18795.152089] LustreError: dumping log to /tmp/lustre-log.1544552141.14192
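
The watchdog check described above can be illustrated with a minimal sketch. This is not the code used by sanity-quota test_6; it only assumes that the check amounts to searching dmesg output for the "Service thread pid ... was inactive for ..." message quoted in this ticket. The function name `find_watchdog_events` and the regular expression are hypothetical, and the sample lines are copied from the OST1 log above.

```python
import re

# Pattern for the LNet watchdog message quoted in this ticket (an assumption
# about the message format, based only on the log lines shown above).
WATCHDOG_RE = re.compile(
    r"LNet: Service thread pid (?P<pid>\d+) was inactive for "
    r"(?P<secs>\d+(?:\.\d+)?)s"
)


def find_watchdog_events(dmesg_text):
    """Return a list of (pid, inactive_seconds) for each watchdog message."""
    events = []
    for line in dmesg_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append((int(match.group("pid")), float(match.group("secs"))))
    return events


# Sample lines copied from the OST1 dmesg log in this ticket.
sample = """\
[18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
[18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18795.138944] Call Trace:
"""

events = find_watchdog_events(sample)
print(events)  # [(14192, 40.14)]
```

A check of this kind treats any match as a failure, which is why a thread that is merely slow, and not hung, is enough to fail the test.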
       

      There is no other indication of a problem in the console and dmesg logs. We see this issue for both zfs and ldiskfs environments.

      Some of these failures are attributed to LU-11644, but the stack traces do not look the same.
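
One way to compare stack traces between tickets is to reduce each trace to its ordered list of function names, ignoring addresses and offsets. The sketch below is only an illustration of that idea: the helper `trace_signature` and the frame pattern are hypothetical, and the sample frames are copied from the OST1 trace in this ticket. The LU-11644 trace is not reproduced here.

```python
import re

# Matches kernel stack frames of the form seen in this ticket, e.g.
#   [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
# The module name in brackets is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def trace_signature(trace_text):
    """Return [(function, module_or_None), ...] in call-trace order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module")))
    return frames


# A few frames copied from the OST1 trace in this ticket.
sample_trace = """\
[18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
[18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
[18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
"""

signature = trace_signature(sample_trace)
for func, module in signature:
    print(func, module)
```

Two traces can then be compared with a plain equality test on their signatures, for example `trace_signature(a) == trace_signature(b)`.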

      Logs for this failure are at
      https://testing.whamcloud.com/test_sets/e2bf61ea-e78f-11e8-b67f-52540065bddc
      https://testing.whamcloud.com/test_sets/bca63f5a-f60e-11e8-bfe1-52540065bddc
      https://testing.whamcloud.com/test_sets/613c72d6-f5d9-11e8-bfe1-52540065bddc

      Attachments

        Issue Links

          Activity

            adilger Andreas Dilger added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sessions/6d690370-68b6-4cd4-9ef4-9656d8b00c07

            sebastien Sebastien Buisson added a comment - This ticket might need to be reopened; it looks like I have a new occurrence on 2.13.50: https://testing.whamcloud.com/test_sets/45f983d2-1d52-11ea-adca-52540065bddc

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36365/
            Subject: LU-11768 test: limit at_max to timeout in time
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1d37c1a04efdcc64232019ba09a97ae1ff0a083e

            pjones Peter Jones added a comment - Second attempt landed for 2.13

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36431/
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: 550af84a91505c85824ffad2990d31c8e8ab4dd9

            gerrit Gerrit Updater added a comment - Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36431
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 136c71c7134b92258c77457e11b7472143a38f8d

            adilger Andreas Dilger added a comment - I compared the sanity-quota test_6 results from the past week with those from the week before the patch landed, and both weeks had about the same number of failures (20 vs. 22, respectively).

            jamesanunez James Nunez (Inactive) added a comment - It looks like we are still experiencing this issue on master. See https://testing.whamcloud.com/test_sets/31f6894c-e9fa-11e9-9874-52540065bddc for a recent failure and logs.

            gerrit Gerrit Updater added a comment - James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36365
            Subject: LU-11768 test: limit at_max to timeout in time
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: 51596be523efcb7d83fed4adc196461b09a28792

            pjones Peter Jones added a comment - Landed for 2.13

            gerrit Gerrit Updater added a comment - Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35651/
            Subject: LU-11768 test: limit at_max to timeout in time
            Project: fs/lustre-release
            Branch: master
            Current Patch Set:
            Commit: d8226b9353dbc1448af8d23c13cae5f21cbe3a86

            People

              hongchao.zhang Hongchao Zhang
              jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 11
