
sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor
    • Fix Version/s: Lustre 2.12.8
    • Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.6, Lustre 2.12.7
    • Labels: None
    • Severity: 3
    • Rank (Obsolete): 9223372036854775807

    Description

      sanity-quota test_6 started failing on November 13, 2018, Lustre tag 2.11.56.140, with the error

      [ 1733.308968] LNet: Service thread pid 18400 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 
      

      In sanity-quota test_6, we scan the OST1 dmesg log to see whether the watchdog was triggered. Looking at the logs for https://testing.whamcloud.com/test_sets/9f3095ea-fdc2-11e8-b837-52540065bddc , the dmesg log from OST1 (vm3) contains the LNet error and the stack trace

      [18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
      [18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
      [18795.137958] Pid: 14192, comm: ll_ost_io00_002 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:52:11 UTC 2018
      [18795.138944] Call Trace:
      [18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
      [18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
      [18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
      [18795.141528]  [<ffffffffc1123383>] qsd_acquire+0x8e3/0xcb0 [lquota]
      [18795.142183]  [<ffffffffc11238d4>] qsd_op_begin0+0x184/0x960 [lquota]
      [18795.142838]  [<ffffffffc1124312>] qsd_op_begin+0x262/0x4b0 [lquota]
      [18795.143571]  [<ffffffffc116eac7>] osd_declare_quota+0xd7/0x360 [osd_zfs]
      [18795.144322]  [<ffffffffc1177ff0>] osd_declare_write_commit+0x3d0/0x7f0 [osd_zfs]
      [18795.145083]  [<ffffffffc12958d9>] ofd_commitrw_write+0x939/0x1d40 [ofd]
      [18795.145833]  [<ffffffffc1299de2>] ofd_commitrw+0x4b2/0xa10 [ofd]
      [18795.146465]  [<ffffffffc0f98d6c>] obd_commitrw+0x9c/0x370 [ptlrpc]
      [18795.147178]  [<ffffffffc0f9b9dd>] tgt_brw_write+0x100d/0x1a90 [ptlrpc]
      [18795.147927]  [<ffffffffc0f9f29a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
      [18795.148649]  [<ffffffffc0f4391b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
      [18795.149488]  [<ffffffffc0f4724c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
      [18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
      [18795.150788]  [<ffffffff871255f7>] ret_from_fork_nospec_end+0x0/0x39
      [18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
      [18795.152089] LustreError: dumping log to /tmp/lustre-log.1544552141.14192
       

      There is no other indication of a problem in the console and dmesg logs. We see this issue for both zfs and ldiskfs environments.
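
      The dmesg check described above can be illustrated with a minimal Python sketch. This is not the code used by sanity-quota (the test suite is a shell script); the names WATCHDOG_RE and find_watchdog_events are hypothetical, and the pattern is derived only from the message text quoted in this ticket.

```python
import re

# Pattern built from the watchdog message quoted in this ticket.
# Assumption: the pid is an integer and the duration is printed as seconds
# followed by "s".
WATCHDOG_RE = re.compile(
    r"LNet: Service thread pid (?P<pid>\d+) was inactive for "
    r"(?P<secs>\d+(?:\.\d+)?)s"
)


def find_watchdog_events(dmesg_text):
    """Return a list of (pid, inactive_seconds) for each watchdog message."""
    events = []
    for line in dmesg_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append((int(match.group("pid")), float(match.group("secs"))))
    return events


# Two lines copied from the OST1 dmesg excerpt in the description.
sample = (
    "[18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1\n"
    "[18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. "
    "The thread might be hung, or it might only be slow and will resume later.\n"
)

if __name__ == "__main__":
    print(find_watchdog_events(sample))
```

      A non-empty result means the watchdog fired at least once in the scanned log, which is the condition the test treats as a failure.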

      Some of these failures are attributed to LU-11644, but the stack traces do not look the same.
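
      One way to compare stack traces between tickets is to reduce each trace to its sequence of (function, module) frames and compare those sequences. The sketch below assumes the frame format shown in the dmesg excerpt in this description; FRAME_RE and extract_frames are hypothetical helper names, and no trace from LU-11644 is reproduced here.

```python
import re

# Matches frames such as:
#   [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
# The trailing "[module]" is optional because core kernel symbols
# (for example kthread) are printed without one.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def extract_frames(trace_text):
    """Return [(function, module_or_None), ...] in call-trace order."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group("func"), match.group("module")))
    return frames


# Frames copied from the call trace in the description.
sample_trace = (
    "[18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]\n"
    "[18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]\n"
    "[18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0\n"
    "[18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff\n"
)

if __name__ == "__main__":
    for func, module in extract_frames(sample_trace):
        print(func, module)
```

      Two traces whose extracted frame lists differ near the top (here, the qsd_* frames in lquota) are likely to be different hangs even when the watchdog message itself is identical.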

      Logs for this failure are at
      https://testing.whamcloud.com/test_sets/e2bf61ea-e78f-11e8-b67f-52540065bddc
      https://testing.whamcloud.com/test_sets/bca63f5a-f60e-11e8-bfe1-52540065bddc
      https://testing.whamcloud.com/test_sets/613c72d6-f5d9-11e8-bfe1-52540065bddc

          Activity

            [LU-11768] sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’
            eaujames Etienne Aujames added a comment - - edited +1 on b2_12 (2.12.8 - ZFS): https://testing.whamcloud.com/test_sets/e6e54b3f-88e9-40d7-b9c5-363fb52037ce

            gerrit Gerrit Updater added a comment -

            "Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41345/
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1a88d6501268709f44d20e322d76c57266b2c112

            eaujames Etienne Aujames added a comment - +1 on b2_12 (2.12.7): https://testing.whamcloud.com/test_sets/2f7e2227-736b-47b1-bcf5-262d195aa78f
            ofaaland Olaf Faaland added a comment - - edited

            Hi All,

            The backport (https://review.whamcloud.com/41345) has passed testing and review.  It's just a test fix.  Is there a reason not to land it for 2.12.7?


            gerrit Gerrit Updater added a comment -

            Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41345
            Subject: LU-11768 test: make at_max to take effect
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set: 1
            Commit: ff23257c5a429153fab54ee3317863c6cfad04b5

            ofaaland Olaf Faaland added a comment -

            I see the following commit in master that was not backported to b2_12 and that looks like it may be a fix for this issue. Does that look right?

            • 550af84 LU-11768 test: make at_max to take effect
            hornc Chris Horn added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sessions/c402e1ee-1be5-4ef3-859d-8da51d2ce887
            adilger Andreas Dilger added a comment - +1 on b2_12: https://testing.whamcloud.com/test_sessions/6d690370-68b6-4cd4-9ef4-9656d8b00c07

            sebastien Sebastien Buisson added a comment -

            This ticket might need to be reopened; it looks like I have a new occurrence on 2.13.50:
            https://testing.whamcloud.com/test_sets/45f983d2-1d52-11ea-adca-52540065bddc

            gerrit Gerrit Updater added a comment -

            Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36365/
            Subject: LU-11768 test: limit at_max to timeout in time
            Project: fs/lustre-release
            Branch: b2_12
            Current Patch Set:
            Commit: 1d37c1a04efdcc64232019ba09a97ae1ff0a083e

            People

              hongchao.zhang Hongchao Zhang
              jamesanunez James Nunez (Inactive)
              Votes: 0
              Watchers: 11
