[LU-12781] sanity test_272a crashes with SSK Created: 18/Sep/19  Updated: 03/Jan/20  Resolved: 03/Jan/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.13.0
Fix Version/s: Lustre 2.14.0

Type: Bug Priority: Minor
Reporter: Sebastien Buisson Assignee: Mikhail Pershin
Resolution: Fixed Votes: 0
Labels: gss

Severity: 3

 Description   

With the recent landing of patch "LU-12443 ptlrpc: fix reply buffers shrinking and growing" (https://review.whamcloud.com/35243), sanity test_272a crashes with SSK enabled.

I used the following patches to trigger tests:
https://review.whamcloud.com/36226
https://review.whamcloud.com/36227

Without SSK, test_272a does not crash. With SSK, test_272a crashes unless patch "LU-12443 ptlrpc: fix reply buffers shrinking and growing" is reverted.

The crash is caused by a failed assertion:

[  406.653680] Lustre: DEBUG MARKER: == sanity test 272a: DoM migration: new layout with the same DOM component =========================== 08:37:07 (1568795827)
[  406.726294] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
[  406.743661] format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline
[  406.792396] LustreError: 15793:0:(pack_generic.c:454:lustre_shrink_msg_v2()) ASSERTION( msg->lm_buflens[segment] >= newlen ) failed: 
[  406.793584] LustreError: 15793:0:(pack_generic.c:454:lustre_shrink_msg_v2()) LBUG
[  406.794352] Pid: 15793, comm: mdt00_002 3.10.0-957.27.2.el7_lustre.x86_64 #1 SMP Thu Sep 12 03:53:14 UTC 2019
[  406.795309] Call Trace:
[  406.795600]  [<ffffffffc09188ac>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[  406.796459]  [<ffffffffc091895c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[  406.797125]  [<ffffffffc0e32c54>] lustre_shrink_msg+0x164/0x200 [ptlrpc]
[  406.797912]  [<ffffffffc146e11e>] gss_svc_authorize+0x16e/0x5b0 [ptlrpc_gss]
[  406.798676]  [<ffffffffc0e647c5>] sptlrpc_svc_wrap_reply+0x55/0x1d0 [ptlrpc]
[  406.799455]  [<ffffffffc0e2eca8>] ptlrpc_send_reply+0x1e8/0x830 [ptlrpc]
[  406.800340]  [<ffffffffc0ded6be>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
[  406.801092]  [<ffffffffc0df7d4e>] target_send_reply+0x30e/0x730 [ptlrpc]
[  406.801847]  [<ffffffffc0e9d3d1>] tgt_request_handle+0x2f1/0x15c0 [ptlrpc]
[  406.802620]  [<ffffffffc0e42516>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[  406.803501]  [<ffffffffc0e4604c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[  406.804193]  [<ffffffff954c2e81>] kthread+0xd1/0xe0
[  406.804779]  [<ffffffff95b77c37>] ret_from_fork_nospec_end+0x0/0x39
[  406.805484]  [<ffffffffffffffff>] 0xffffffffffffffff
[  406.806058] Kernel panic - not syncing: LBUG
[  406.806628] CPU: 1 PID: 15793 Comm: mdt00_002 Kdump: loaded Tainted: G           OE  ------------   3.10.0-957.27.2.el7_lustre.x86_64 #1
[  406.807770] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[  406.808333] Call Trace:
[  406.808604]  [<ffffffff95b65147>] dump_stack+0x19/0x1b
[  406.809120]  [<ffffffff95b5e850>] panic+0xe8/0x21f
[  406.809595]  [<ffffffffc09189ab>] lbug_with_loc+0x9b/0xa0 [libcfs]
[  406.810219]  [<ffffffffc0e32c54>] lustre_shrink_msg+0x164/0x200 [ptlrpc]
[  406.810867]  [<ffffffffc146e11e>] gss_svc_authorize+0x16e/0x5b0 [ptlrpc_gss]
[  406.811570]  [<ffffffffc0e647c5>] sptlrpc_svc_wrap_reply+0x55/0x1d0 [ptlrpc]
[  406.812272]  [<ffffffffc0e2eca8>] ptlrpc_send_reply+0x1e8/0x830 [ptlrpc]
[  406.812946]  [<ffffffffc0ded6be>] target_send_reply_msg+0x8e/0x170 [ptlrpc]
[  406.813633]  [<ffffffffc0df7d4e>] target_send_reply+0x30e/0x730 [ptlrpc]
[  406.814305]  [<ffffffffc0e362d7>] ? lustre_msg_set_last_committed+0x27/0xa0 [ptlrpc]
[  406.815083]  [<ffffffffc0e9d3d1>] tgt_request_handle+0x2f1/0x15c0 [ptlrpc]
[  406.815752]  [<ffffffffc0a60f3e>] ? libcfs_nid2str_r+0xfe/0x130 [lnet]
[  406.816412]  [<ffffffffc0e42516>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[  406.817157]  [<ffffffff954cfeb4>] ? __wake_up+0x44/0x50
[  406.817689]  [<ffffffffc0e4604c>] ptlrpc_main+0xbac/0x1540 [ptlrpc]
[  406.818302]  [<ffffffff954d1ad0>] ? finish_task_switch+0x50/0x1c0
[  406.818914]  [<ffffffffc0e454a0>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[  406.819620]  [<ffffffff954c2e81>] kthread+0xd1/0xe0
[  406.820102]  [<ffffffff954c2db0>] ? insert_kthread_work+0x40/0x40
[  406.820685]  [<ffffffff95b77c37>] ret_from_fork_nospec_begin+0x21/0x21
[  406.821312]  [<ffffffff954c2db0>] ? insert_kthread_work+0x40/0x40
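
The assertion in the trace, msg->lm_buflens[segment] >= newlen in lustre_shrink_msg_v2(), says that a segment can only be shrunk to a length no larger than its current one. The following is a minimal user-space sketch of that invariant, not the Lustre implementation: struct model_msg, model_shrink_segment(), model_total_len() and MODEL_MAX_SEGMENTS are hypothetical names made up for illustration, and the sketch returns an error where the real code hits the assertion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_SEGMENTS 4	/* hypothetical limit for this sketch */

/* Hypothetical, simplified stand-in for a multi-segment message header. */
struct model_msg {
	uint32_t bufcount;			/* number of segments in use */
	uint32_t buflens[MODEL_MAX_SEGMENTS];	/* current length of each segment */
};

/*
 * Shrink one segment to newlen. Returns 0 on success, or -1 if the segment
 * index is out of range or if the request would break the invariant
 * buflens[segment] >= newlen (the caller is really asking to grow).
 */
static int model_shrink_segment(struct model_msg *msg, uint32_t segment,
				uint32_t newlen)
{
	if (segment >= msg->bufcount)
		return -1;
	if (msg->buflens[segment] < newlen)
		return -1;	/* the code in the trace asserts (LBUG) here */
	msg->buflens[segment] = newlen;
	return 0;
}

/* Sum of all segment lengths, to show the effect of a shrink. */
static size_t model_total_len(const struct model_msg *msg)
{
	size_t total = 0;

	for (uint32_t i = 0; i < msg->bufcount; i++)
		total += msg->buflens[i];
	return total;
}
```

In this model, shrinking a 128-byte segment to 32 bytes succeeds, while asking to "shrink" the same segment to 256 bytes is rejected. A caller that passes a length larger than the segment's current length is the kind of request that would trip the assertion shown in the log.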


 Comments   
Comment by Sebastien Buisson [ 18/Sep/19 ]

Mike, any advice on this?
Then please feel free to assign to me.

Comment by Oleg Drokin [ 21/Oct/19 ]

I tried running all of our tests with SSK enabled, and several more failed with this same crash: sanity-pfl, racer, and replay-single.

See the test session here: http://testing.linuxhacker.ru:3333/lustre-reports/3825/results-retry3.html

Comment by Andreas Dilger [ 01/Nov/19 ]

On a semi-related note, please fix the mdt_rw_hpreq_check() message format to include a newline as part of this patch.
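
For reference, the "format at mdt_io.c:215:mdt_rw_hpreq_check doesn't end in newline" warning in the log points at a debug message whose format string lacks a trailing "\n". A check of that kind can be sketched as below. This is an illustrative stand-alone helper, not the libcfs code, and format_ends_in_newline() is a hypothetical name.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the format string ends with a newline character.
 * An empty string is treated as not ending in a newline.
 */
static bool format_ends_in_newline(const char *fmt)
{
	size_t len = strlen(fmt);

	return len > 0 && fmt[len - 1] == '\n';
}
```

Under this sketch, the fix requested here amounts to making the message's format string satisfy the check by appending "\n" to it.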

Comment by Andreas Dilger [ 10/Nov/19 ]

This seems possibly related to the other lu_buf shrinking issue that you are both working on?

Comment by Gerrit Updater [ 11/Nov/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36727
Subject: LU-12781 debug: output reply buffer info at error
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 142b603ff33618ae497d0f3a84e86c04233c30fb

Comment by Gerrit Updater [ 11/Nov/19 ]

Mike Pershin (mpershin@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/36732
Subject: LU-12781 ptlrpc: use proper buffer in reply grow
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: d8782fa9c62a7a7f02cf1d6a14f2881b4b01f405

Comment by Mikhail Pershin [ 12/Nov/19 ]

With the latest patch, SSK looks better. Sebastien, can you please try this patch in your SSK testing?

Comment by Sebastien Buisson [ 12/Nov/19 ]

Hi Mike, it does look better. I commented on the patch.
Thanks.

Comment by Gerrit Updater [ 03/Jan/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/36732/
Subject: LU-12781 ptlrpc: fix inline reply buffer grow
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 02c23a2e851fdebc3e2bde45a51fb043559504ab

Comment by Peter Jones [ 03/Jan/20 ]

Landed for 2.14
