[LU-11768] sanity-quota test 6 fails with ‘LNet: Service thread pid <pid> was inactive for …’ Created: 12/Dec/18 Updated: 14/Feb/22 Resolved: 13/Sep/21 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.0, Lustre 2.13.0, Lustre 2.12.3, Lustre 2.12.6, Lustre 2.12.7 |
| Fix Version/s: | Lustre 2.12.8 |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | Hongchao Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
sanity-quota test_6 started failing on November 13, 2018, Lustre tag 2.11.56.140, with the error

[ 1733.308968] LNet: Service thread pid 18400 was inactive for 40.06s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

In sanity-quota test_6, we scan the OST1 dmesg log to see whether the watchdog was triggered. Looking at the logs for https://testing.whamcloud.com/test_sets/9f3095ea-fdc2-11e8-b837-52540065bddc , the dmesg log from OST1 (vm3) contains the LNet error and the stack trace

[18752.909319] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1
[18795.136287] LNet: Service thread pid 14192 was inactive for 40.14s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[18795.137958] Pid: 14192, comm: ll_ost_io00_002 3.10.0-862.14.4.el7_lustre.x86_64 #1 SMP Sat Dec 8 05:52:11 UTC 2018
[18795.138944] Call Trace:
[18795.139235] [<ffffffffc0f2a880>] ptlrpc_set_wait+0x500/0x8d0 [ptlrpc]
[18795.140051] [<ffffffffc0f2acd3>] ptlrpc_queue_wait+0x83/0x230 [ptlrpc]
[18795.140837] [<ffffffffc1115308>] qsd_send_dqacq+0x2e8/0x340 [lquota]
[18795.141528] [<ffffffffc1123383>] qsd_acquire+0x8e3/0xcb0 [lquota]
[18795.142183] [<ffffffffc11238d4>] qsd_op_begin0+0x184/0x960 [lquota]
[18795.142838] [<ffffffffc1124312>] qsd_op_begin+0x262/0x4b0 [lquota]
[18795.143571] [<ffffffffc116eac7>] osd_declare_quota+0xd7/0x360 [osd_zfs]
[18795.144322] [<ffffffffc1177ff0>] osd_declare_write_commit+0x3d0/0x7f0 [osd_zfs]
[18795.145083] [<ffffffffc12958d9>] ofd_commitrw_write+0x939/0x1d40 [ofd]
[18795.145833] [<ffffffffc1299de2>] ofd_commitrw+0x4b2/0xa10 [ofd]
[18795.146465] [<ffffffffc0f98d6c>] obd_commitrw+0x9c/0x370 [ptlrpc]
[18795.147178] [<ffffffffc0f9b9dd>] tgt_brw_write+0x100d/0x1a90 [ptlrpc]
[18795.147927] [<ffffffffc0f9f29a>] tgt_request_handle+0xaea/0x1580 [ptlrpc]
[18795.148649] [<ffffffffc0f4391b>] ptlrpc_server_handle_request+0x24b/0xab0 [ptlrpc]
[18795.149488] [<ffffffffc0f4724c>] ptlrpc_main+0xafc/0x1fb0 [ptlrpc]
[18795.150201] [<ffffffff86abdf21>] kthread+0xd1/0xe0
[18795.150788] [<ffffffff871255f7>] ret_from_fork_nospec_end+0x0/0x39
[18795.151437] [<ffffffffffffffff>] 0xffffffffffffffff
[18795.152089] LustreError: dumping log to /tmp/lustre-log.1544552141.14192

There is no other indication of a problem in the console or dmesg logs. We see this issue in both zfs and ldiskfs environments. Some of these failures are attributed to LU-11644, but the stack traces do not look the same. Logs for this failure are at |
| Comments |
| Comment by Peter Jones [ 12/Dec/18 ] |
|
Hongchao, can you please assess this issue? Thanks, Peter |
| Comment by Hongchao Zhang [ 13/Dec/18 ] |
|
This could be caused by a larger timeout returned from "qsd_wait_timeout()" (via "qsd_timeout"), which causes the wait for the "QUOTA_DQACQ" request not to time out:

static inline int qsd_wait_timeout(struct qsd_instance *qsd)
{
	if (qsd->qsd_timeout != 0)
		return qsd->qsd_timeout;
	return min_t(int, at_max / 2, obd_timeout / 2);
}
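
For illustration, here is a minimal userspace sketch of the same derivation, assuming the stock defaults of obd_timeout=100s and at_max=600s (assumed values for illustration, not numbers taken from the failing sessions): with qsd_timeout unset, the DQACQ wait works out to 50s, longer than the ~40s after which the service thread watchdog fired in the traces above.

/* Illustrative only -- not Lustre code. Mirrors the qsd_wait_timeout()
 * logic above with assumed default tunables. */
#include <stdio.h>

static int min_int(int a, int b) { return a < b ? a : b; }

int main(void)
{
	int qsd_timeout = 0;	/* 0 means "derive from the tunables" */
	int obd_timeout = 100;	/* assumed default, seconds */
	int at_max = 600;	/* assumed default, seconds */
	int wait;

	if (qsd_timeout != 0)
		wait = qsd_timeout;
	else
		wait = min_int(at_max / 2, obd_timeout / 2);

	printf("DQACQ wait timeout: %ds (watchdog fired after ~40s)\n", wait);
	return 0;
}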
|
| Comment by Patrick Farrell (Inactive) [ 14/Dec/18 ] |
|
Hit in: |
| Comment by Gerrit Updater [ 28/Dec/18 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/33931 |
| Comment by Jian Yu [ 19/Jan/19 ] |
|
+1 on master branch: |
| Comment by Gerrit Updater [ 27/Feb/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/33931/ |
| Comment by Peter Jones [ 27/Feb/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 12/Jun/19 ] |
|
Minh Diep (mdiep@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35212 |
| Comment by Gerrit Updater [ 25/Jun/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35212/ |
| Comment by Sebastien Buisson [ 19/Jul/19 ] |
|
This issue was hit again on b2_12 with patch " |
| Comment by Hongchao Zhang [ 30/Jul/19 ] |
|
The reason for this issue is that the timeout of the QUOTA_DQACQ request is even longer than the obd_timeout * 2.

Lustre: 15662:0:(client.c:2134:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1563502007/real 1563502007] req@ffff93c89d32c900 x1639433946429456/t0(0) o601->lustre-MDT0000-lwp-OST0000@10.9.5.137@tcp:23/10 lens 336/336 e 0 to 1 dl 1563502051 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1

The interval before the request times out is 44s (1563502051 - 1563502007), which causes the watchdog to be triggered.
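
As a quick check of that arithmetic, a throwaway sketch (the timestamps are copied from the ptlrpc_expire_one_request() line above; the 40s value is the interval reported by the watchdog message, not a tunable read from the system):

#include <stdio.h>

int main(void)
{
	long sent = 1563502007;		/* "sent" timestamp from the log line */
	long deadline = 1563502051;	/* "dl" (deadline) timestamp from the log line */
	int watchdog = 40;		/* seconds reported by the watchdog message */

	printf("DQACQ timeout window: %lds, watchdog fired after: ~%ds\n",
	       deadline - sent, watchdog);
	return 0;
}
|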
| Comment by Gerrit Updater [ 30/Jul/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/35651 |
| Comment by Gerrit Updater [ 27/Aug/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/35651/ |
| Comment by Peter Jones [ 27/Aug/19 ] |
|
Landed for 2.13 |
| Comment by Gerrit Updater [ 03/Oct/19 ] |
|
James Nunez (jnunez@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36365 |
| Comment by James Nunez (Inactive) [ 09/Oct/19 ] |
|
It looks like we are still experiencing this issue on master. See https://testing.whamcloud.com/test_sets/31f6894c-e9fa-11e9-9874-52540065bddc for a recent failure and logs. |
| Comment by Andreas Dilger [ 09/Oct/19 ] |
|
I compared the sanity-quota test_6 results from the past week with those from the week before the patch landed, and both weeks had about the same number of failures (20 vs. 22, respectively). |
| Comment by Gerrit Updater [ 11/Oct/19 ] |
|
Hongchao Zhang (hongchao@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36431 |
| Comment by Gerrit Updater [ 22/Oct/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36431/ |
| Comment by Peter Jones [ 23/Oct/19 ] |
|
Second attempt landed for 2.13 |
| Comment by Gerrit Updater [ 21/Nov/19 ] |
|
Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36365/ |
| Comment by Sebastien Buisson [ 13/Dec/19 ] |
|
This ticket might need to be reopened; it looks like I have a new occurrence on 2.13.50: |
| Comment by Andreas Dilger [ 31/Jan/20 ] |
|
+1 on b2_12: https://testing.whamcloud.com/test_sessions/6d690370-68b6-4cd4-9ef4-9656d8b00c07 |
| Comment by Chris Horn [ 09/Jun/20 ] |
|
+1 on b2_12: https://testing.whamcloud.com/test_sessions/c402e1ee-1be5-4ef3-859d-8da51d2ce887 |
| Comment by Olaf Faaland [ 28/Jan/21 ] |
|
I see the following on master, which was not backported to b2_12 and seems like it may be a fix for this issue. Does that look right?
|
| Comment by Gerrit Updater [ 28/Jan/21 ] |
|
Olaf Faaland-LLNL (faaland1@llnl.gov) uploaded a new patch: https://review.whamcloud.com/41345 |
| Comment by Olaf Faaland [ 12/Jul/21 ] |
|
Hi all, the backport (https://review.whamcloud.com/41345) has passed testing and review. It's just a test fix. Is there a reason not to land it for 2.12.7? |
| Comment by Etienne Aujames [ 03/Aug/21 ] |
|
+1 on b2_12 (2.12.7): https://testing.whamcloud.com/test_sets/2f7e2227-736b-47b1-bcf5-262d195aa78f |
| Comment by Gerrit Updater [ 13/Sep/21 ] |
|
"Oleg Drokin <green@whamcloud.com>" merged in patch https://review.whamcloud.com/41345/ |
| Comment by Etienne Aujames [ 14/Feb/22 ] |
|
+1 on b2_12 (2.12.8 - ZFS): https://testing.whamcloud.com/test_sets/e6e54b3f-88e9-40d7-b9c5-363fb52037ce |