[LU-11768] sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’ Created: 12/Dec/18  Updated: 14/Feb/22  Resolved: 13/Sep/21

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.6, Lustre 2.12.7
Fix Version/s: Lustre 2.12.8

Type: Bug Priority: Minor
Reporter: James Nunez (Inactive) Assignee: Hongchao Zhang
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by LU-12749 sanity-quota test_6: FAIL: [22292.915... Resolved
Related
is related to LU-11644 LNet: Service thread inactive for 300... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

sanity-quota test_6 started failing on November 13, 2018, Lustre tag 2.11.56.140, with the error:

[ 1733.308968] LNet: Service thread pid 18400 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: 

In sanity-quota test_6, we scan the OST1 dmesg log to see if the watchdog was triggered (a sketch of this check follows the log excerpt below). Looking at the logs for https://testing.whamcloud.com/test_sets/9f3095ea-fdc2-11e8-b837-52540065bddc , the dmesg log from OST1 (vm3) contains the LNet error and the stack trace:

[18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
[18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18795.137958] Pid: 14192, comm: ll_ost_io00_002 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:52:11 UTC 2018
[18795.138944] Call Trace:
[18795.139235]  [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[18795.140051]  [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[18795.140837]  [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
[18795.141528]  [<ffffffffc1123383>] qsd_acquire+0x8e3/0xcb0 [lquota]
[18795.142183]  [<ffffffffc11238d4>] qsd_op_begin0+0x184/0x960 [lquota]
[18795.142838]  [<ffffffffc1124312>] qsd_op_begin+0x262/0x4b0 [lquota]
[18795.143571]  [<ffffffffc116eac7>] osd_declare_quota+0xd7/0x360 [osd_zfs]
[18795.144322]  [<ffffffffc1177ff0>] osd_declare_write_commit+0x3d0/0x7f0 [osd_zfs]
[18795.145083]  [<ffffffffc12958d9>] ofd_commitrw_write+0x939/0x1d40 [ofd]
[18795.145833]  [<ffffffffc1299de2>] ofd_commitrw+0x4b2/0xa10 [ofd]
[18795.146465]  [<ffffffffc0f98d6c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[18795.147178]  [<ffffffffc0f9b9dd>] tgt_brw_write+0x100d/0x1a90 [ptlrpc]
[18795.147927]  [<ffffffffc0f9f29a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[18795.148649]  [<ffffffffc0f4391b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[18795.149488]  [<ffffffffc0f4724c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[18795.150201]  [<ffffffff86abdf21>] kthread+0xd1/0xe0
[18795.150788]  [<ffffffff871255f7>] ret_from_fork_nospec_end+0x0/0x39
[18795.151437]  [<ffffffffffffffff>] 0xffffffffffffffff
[18795.152089] LustreError: dumping log to /tmp/lustre-log.1544552141.14192
 

There is no other indication of a problem in the console and dmesg logs. We see this issue for both zfs and ldiskfs environments.
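
The dmesg check itself is simple: the test fails if the saved OST console log contains the service-thread watchdog message. The real check lives in the shell test framework; the following is only an illustrative C sketch of that logic, with the file name dmesg.log as a hypothetical placeholder:

/* Illustration of the watchdog check sanity-quota test_6 performs on the
 * OST dmesg log; the actual test is a shell script, not this program. */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
        const char *needle = "Service thread pid";
        char line[1024];
        FILE *fp = fopen(argc > 1 ? argv[1] : "dmesg.log", "r");

        if (!fp)
                return 1;               /* no log to scan */
        while (fgets(line, sizeof(line), fp)) {
                if (strstr(line, needle)) {
                        fputs(line, stdout);    /* watchdog triggered */
                        fclose(fp);
                        return 2;       /* counts as a test failure */
                }
        }
        fclose(fp);
        return 0;                       /* clean log, check passes */
}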

Some of these failures are attributed to LU-11644, but the stack traces do not look the same.

Logs for this failure are at
https://testing.whamcloud.com/test_sets/e2bf61ea-e78f-11e8-b67f-52540065bddc
https://testing.whamcloud.com/test_sets/bca63f5a-f60e-11e8-bfe1-52540065bddc
https://testing.whamcloud.com/test_sets/613c72d6-f5d9-11e8-bfe1-52540065bddc



 Comments   
Comment by Peter Jones [ 12/Dec/18 ]

Hongchao

Can you please assess this issue?

Thanks

Peter

Comment by Hongchao Zhang [ 13/Dec/18 ]

This could be caused by a larger timeout returned from qsd_wait_timeout() (a non-zero qsd_timeout), which keeps the wait for the QUOTA_DQACQ request from timing out before the service watchdog fires:

static inline int qsd_wait_timeout(struct qsd_instance *qsd)
{
        if (qsd->qsd_timeout != 0)
                return qsd->qsd_timeout;
        return min_t(int, at_max / 2, obd_timeout / 2);
}
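
A minimal userspace sketch of this arithmetic (not the kernel code), assuming illustrative values of obd_timeout = 20, at_max = 600, and the roughly 40 s watchdog window seen in the logs, shows how a non-zero qsd_timeout can pin the DQACQ wait above the watchdog window:

/* Userspace sketch of the qsd_wait_timeout() logic quoted above.
 * Parameter values are illustrative assumptions, not Lustre defaults. */
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

static int qsd_wait_timeout(int qsd_timeout, int at_max, int obd_timeout)
{
        /* An explicit qsd_timeout wins; otherwise the wait is bounded
         * by half of at_max and half of obd_timeout. */
        if (qsd_timeout != 0)
                return qsd_timeout;
        return MIN(at_max / 2, obd_timeout / 2);
}

int main(void)
{
        const int watchdog = 40;        /* ~40 s window from the logs */

        /* qsd_timeout unset: wait = min(600/2, 20/2) = 10 s, harmless. */
        printf("unset: wait = %d s (watchdog %d s)\n",
               qsd_wait_timeout(0, 600, 20), watchdog);

        /* A large qsd_timeout left set (e.g. by an earlier test) keeps
         * the wait far beyond the watchdog window. */
        printf("stale: wait = %d s (watchdog %d s)\n",
               qsd_wait_timeout(300, 600, 20), watchdog);
        return 0;
}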
Comment by Patrick Farrell (Inactive) [ 14/Dec/18 ]

Hit in:
https://testing.whamcloud.com/test_sessions/3cd6e2ff-c3a4-416c-9441-faea50b1d1c7

Comment by Gerrit Updater [ 28/Dec/18 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33931
Subject: LU-11768 test: reset qsd_time before test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 9627ce42089beb5a9e8d3d5fd0fdf3ce4d264b53

Comment by Jian Yu [ 19/Jan/19 ]

+1 on master branch:
https://testing.whamcloud.com/test_sets/45eb11e0-1b5e-11e9-a2cc-52540065bddc

Comment by Gerrit Updater [ 27/Feb/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33931/
Subject: LU-11768 test: reset qsd_time before test
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6cb284c2ed92284454580e6e54e00ddc33530c6e

Comment by Peter Jones [ 27/Feb/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 12/Jun/19 ]

Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35212
Subject: LU-11768 test: reset qsd_time before test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 1a139ccf57ec8971b535f14aa872498177468e9e

Comment by Gerrit Updater [ 25/Jun/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35212/
Subject: LU-11768 test: reset qsd_time before test
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 8feb93bcb079f3d119c74f8fde0e589692098190

Comment by Sebastien Buisson [ 19/Jul/19 ]

This issue was hit again on b2_12 with patch "LU-11768 test: reset qsd_time before test" included:
https://testing.whamcloud.com/test_sets/76e57046-a9da-11e9-a0be-52540065bddc

Comment by Hongchao Zhang [ 30/Jul/19 ]

The root cause of this issue is that the timeout of the QUOTA_DQACQ request is even longer than obd_timeout * 2.

Lustre: 15662:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1563502007/real 1563502007]  req@ffff93c89d32c900 x1639433946429456/t0(0) o601->lustre-MDT0000-lwp-OST0000@10.9.5.137@tcp:23/10 lens 336/336 e 0 to 1 dl 1563502051 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

The interval before timing out is 44s (1563502051 - 1563502007), which causes the watchdog to be triggered.
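
The arithmetic can be read directly off the ptlrpc_expire_one_request line above: the interval is the deadline ("dl") minus the send time ("sent"). A trivial sketch, where the 40 s threshold is the watchdog window from the logs (an assumption for illustration):

/* Check the interval quoted above: "sent 1563502007" vs "dl 1563502051". */
#include <stdio.h>

int main(void)
{
        long sent = 1563502007;         /* "sent" epoch from the log */
        long deadline = 1563502051;     /* "dl" (deadline) epoch */
        int watchdog = 40;              /* assumed watchdog window (s) */

        long wait = deadline - sent;    /* 44 s */
        printf("DQACQ timeout interval: %ld s, watchdog: %d s -> %s\n",
               wait, watchdog,
               wait > watchdog ? "watchdog fires first" : "ok");
        return 0;
}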

Comment by Gerrit Updater [ 30/Jul/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35651
Subject: LU-11768 test: limit at_max to timeout in time
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: e4263953f3c1698596c9c301be6aa1ce8beecfad

Comment by Gerrit Updater [ 27/Aug/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35651/
Subject: LU-11768 test: limit at_max to timeout in time
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: d8226b9353dbc1448af8d23c13cae5f21cbe3a86

Comment by Peter Jones [ 27/Aug/19 ]

Landed for 2.13

Comment by Gerrit Updater [ 03/Oct/19 ]

James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36365
Subject: LU-11768 test: limit at_max to timeout in time
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: 51596be523efcb7d83fed4adc196461b09a28792

Comment by James Nunez (Inactive) [ 09/Oct/19 ]

It looks like we are still experiencing this issue on master. See https://testing.whamcloud.com/test_sets/31f6894c-e9fa-11e9-9874-52540065bddc for a recent failure and logs.

Comment by Andreas Dilger [ 09/Oct/19 ]

I compared the sanity-quota test_6 results from the past week against the week before the patch landed, and both weeks had about the same number of failures (20 vs. 22, respectively).

Comment by Gerrit Updater [ 11/Oct/19 ]

Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36431
Subject: LU-11768 test: make at_max to take effect
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 136c71c7134b92258c77457e11b7472143a38f8d

Comment by Gerrit Updater [ 22/Oct/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36431/
Subject: LU-11768 test: make at_max to take effect
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 550af84a91505c85824ffad2990d31c8e8ab4dd9

Comment by Peter Jones [ 23/Oct/19 ]

Second attempt landed for 2.13

Comment by Gerrit Updater [ 21/Nov/19 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36365/
Subject: LU-11768 test: limit at_max to timeout in time
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1d37c1a04efdcc64232019ba09a97ae1ff0a083e

Comment by Sebastien Buisson [ 13/Dec/19 ]

This ticket might need to be reopened; it looks like I have a new occurrence on 2.13.50:
https://testing.whamcloud.com/test_sets/45f983d2-1d52-11ea-adca-52540065bddc

Comment by Andreas Dilger [ 31/Jan/20 ]

+1 on b2_12: https://testing.whamcloud.com/test_sessions/6d690370-68b6-4cd4-9ef4-9656d8b00c07

Comment by Chris Horn [ 09/Jun/20 ]

+1 on b2_12: https://testing.whamcloud.com/test_sessions/c402e1ee-1be5-4ef3-859d-8da51d2ce887

Comment by Olaf Faaland [ 28/Jan/21 ]

I see the following patch in master that was not backported to b2_12 and that seems like it may be a fix for this issue. Does that look right?

  • 550af84 LU-11768 test: make at_max to take effect

Comment by Gerrit Updater [ 28/Jan/21 ]

Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41345
Subject: LU-11768 test: make at_max to take effect
Project: fs/lustre-release
Branch: b2_12
Current Patch Set: 1
Commit: ff23257c5a429153fab54ee3317863c6cfad04b5

Comment by Olaf Faaland [ 12/Jul/21 ]

Hi All,

The backport (https://review.whamcloud.com/41345) has passed testing and review.  It's just a test fix.  Is there a reason not to land it for 2.12.7?

Comment by Etienne Aujames [ 03/Aug/21 ]

+1 on b2_12 (2.12.7): https://testing.whamcloud.com/test_sets/2f7e2227-736b-47b1-bcf5-262d195aa78f

Comment by Gerrit Updater [ 13/Sep/21 ]

"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41345/
Subject: LU-11768 test: make at_max to take effect
Project: fs/lustre-release
Branch: b2_12
Current Patch Set:
Commit: 1a88d6501268709f44d20e322d76c57266b2c112

Comment by Etienne Aujames [ 14/Feb/22 ]

+1 on b2_12 (2.12.8 - ZFS): https://testing.whamcloud.com/test_sets/e6e54b3f-88e9-40d7-b9c5-363fb52037ce
