[LU-13469] MDS hung during mount Created: 21/Apr/20  Updated: 27/Sep/23  Resolved: 27/Sep/23

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.14.0
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Sarah Liu Assignee: Alex Zhuravlev
Resolution: Cannot Reproduce Votes: 0
Labels: soak
Environment:

lustre-master-ib #404


Attachments: File lustre-log.1588133843.6068-soak-8     File soak-11.log-051120     File soak-9.log-20200419.gz     HTML File trace-8     HTML File trace-s-11-051120     HTML File trace-soak8    
Issue Links:
Duplicate
Related
is related to LU-13402 sanity test_252: Invalid number of md... Resolved
is related to LU-13195 replay-single test_118: dt_declare_re... Resolved
is related to LU-15644 failed llog cancel should not generat... Open
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

One MDS hung during mount while the failover process was in progress.

soak-9 console

[ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
[ 3961.093940] Call Trace:
[ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
[ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
[ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
[ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
[ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
[ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
[ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
[ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
[ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
[ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
[ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
[ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
[ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
[ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
[ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
[ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
[ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
[ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
[ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
[ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
[ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
[ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
[ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
[ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
[ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
[ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect



 Comments   
Comment by Alex Zhuravlev [ 21/Apr/20 ]

it would be helpful to see all available traces
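
If more traces still need to be collected, a minimal way to dump the stack of every task on a hung node (assuming sysrq is enabled there) is:

echo t > /proc/sysrq-trigger        # dump all task stacks to the kernel log
dmesg > /tmp/all-task-traces.txt    # save the output for attaching here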

Comment by Sarah Liu [ 22/Apr/20 ]

I restarted the test and the system crashed overnight on 3 MDSs. There are crash dumps available on spirit for soak-8/9/10; the path is:
/scratch/dumps/soak-9.spirit.whamcloud.com/10.10.1.109-2020-04-22-01:59:08
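
For reference, a sketch of opening one of these dumps with crash(8); it assumes the matching kernel-debuginfo vmlinux for 3.10.0-1062.9.1.el7_lustre.x86_64 is installed and that the dump directory contains a vmcore:

crash /usr/lib/debug/lib/modules/3.10.0-1062.9.1.el7_lustre.x86_64/vmlinux \
    /scratch/dumps/soak-9.spirit.whamcloud.com/10.10.1.109-2020-04-22-01:59:08/vmcore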

Last login: Tue Apr 21 12:05:20 on ttys001
DDN11368M:~ wliu$ ssh w3liu@ssh-1.spirit.whamcloud.int
Last login: Tue Apr 21 18:50:40 2020 from host-64-47-132-181.masergy.com
[w3liu@ssh-1 ~]$ ssh root@soak-16
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Password: 
Last login: Tue Apr 21 17:06:58 2020 from ssh-1.spirit.whamcloud.com
[root@soak-16 ~]# ssh soak-9
Last login: Tue Mar  3 23:55:21 2020 from soak-16.spirit.whamcloud.com
[root@soak-9 ~]# ls /tmp/
krb5ccmachine_SPIRIT.WHAMCLOUD.COM                                                      systemd-private-ba1d46467241463c9bf777e964af7eaa-ntpd.service-yRleX7
openldap-tlsmc-certs--F2A4EB14DB359420C21982FE2480932C55B52B238FF1E9079F379137D27D13CC  systemd-private-e3a3d4c94d1741a68674eb88c50144a6-ntpd.service-5xNWmx
[root@soak-9 ~]# exit
logout
Connection to soak-9 closed.
[root@soak-16 ~]# ssh soak-8
Last login: Mon Apr  6 19:47:28 2020 from soak-16.spirit.whamcloud.com
[root@soak-8 ~]# ls /tmp/
krb5ccmachine_SPIRIT.WHAMCLOUD.COM                                                      systemd-private-c062f40ef20c422c98b9375d108e2d78-ntpd.service-bpcY2l
openldap-tlsmc-certs--283FB28B7CD85BA8C6BD6F2F1837458C8301AFC4036ADCD22B40A473F4F87DAE  systemd-private-d5bd51fa26ff4fc3aab039698e87f37f-ntpd.service-wXZA7q
systemd-private-1faf9075d15b445eab312529a1bc7e2e-ntpd.service-oOzCQD                    systemd-private-f83975b64fc54f1498329d9f871d565d-ntpd.service-tdxy0M
[root@soak-8 ~]# exit
logout
Connection to soak-8 closed.
[root@soak-16 ~]# screen -x
[detached from 12070.pts-0.soak-16]
[root@soak-16 ~]# exit
logout
Connection to soak-16 closed.
[w3liu@ssh-1 ~]$ ls
aeon          ddn-681             HP259  lu-13050.stack  savelog     soak-conf        soak-homedir.tgz  soak-toolbin  tmp          toolbin.tgz
backup-frank  download-build-bin  logs   repo            soak-5.tgz  soak-format.tgz  soak-merge        test-git      tmp-scripts
[w3liu@ssh-1 ~]$ mkdir lu-13469
[w3liu@ssh-1 ~]$ cd lu-13469/
[w3liu@ssh-1 lu-13469]$ scp soak-16://scratch/logs/console/soak-9.log-20200419.tgz .
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
scp: //scratch/logs/console/soak-9.log-20200419.tgz: No such file or directory
[w3liu@ssh-1 lu-13469]$ ssh soak-16
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Last login: Mon Apr 20 21:08:14 2020 from ssh-1.spirit.whamcloud.com
[w3liu@soak-16 ~]$ exit
logout
Connection to soak-16 closed.
[w3liu@ssh-1 lu-13469]$ ssh root@soak-16
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Password: 
Last login: Tue Apr 21 20:26:23 2020 from ssh-1.spirit.whamcloud.com
[root@soak-16 ~]# screen -x
[detached from 12070.pts-0.soak-16]
[root@soak-16 ~]# exit
logout
Connection to soak-16 closed.
[w3liu@ssh-1 lu-13469]$ scp root@soak-16:/scratch/logs/console/soak-9.log-20200419.tgz .
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Password: 
scp: /scratch/logs/console/soak-9.log-20200419.tgz: No such file or directory
[w3liu@ssh-1 lu-13469]$ scp root@soak-16:/scratch/logs/console/soak-9.log-20200419.gz .
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Password: 
soak-9.log-20200419.gz                                                                                                          100%  184KB 183.6KB/s   00:00    
[w3liu@ssh-1 lu-13469]$ ssh root@soak-16
The authenticity of host 'soak-16 (<no hostip for proxy command>)' can't be established.
RSA key fingerprint is 37:1f:14:66:49:de:a8:f9:fa:ab:c7:2d:28:0f:26:44.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'soak-16' (RSA) to the list of known hosts.
Password: 
Last login: Tue Apr 21 20:29:06 2020 from ssh-1.spirit.whamcloud.com
[root@soak-16 ~]# screen -x

[338152.399548] Lustre: Skipped 1 previous similar message
[338153.078244] Lustre: soaked-MDT0000: Client 333b7f48-68da-4 (at 192.168.1.120@o2ib) reconnecting
[338153.088076] Lustre: Skipped 6 previous similar messages
[338154.126802] Lustre: soaked-MDT0000: Client b8a5f1df-2f21-4 (at 192.168.1.119@o2ib) reconnecting
[338154.136639] Lustre: Skipped 8 previous similar messages
[338533.965788] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 1 seconds
[338533.977364] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Skipped 146 previous similar messages
[338610.961038] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) peer NI (192.168.1.110@o2ib) recovery failed with -110
[338610.974469] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) Skipped 122 previous similar messages
[338753.412960] Lustre: soaked-MDT0000: Client f59a1caf-b57f-4 (at 192.168.1.130@o2ib) reconnecting
[338753.422834] Lustre: Skipped 2 previous similar messages
[338753.428846] Lustre: soaked-MDT0000: Connection restored to 3a61c982-16a5-4 (at 192.168.1.130@o2ib)
[338753.438969] Lustre: Skipped 20 previous similar messages
[338760.544124] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[339000.160995] Lustre: 57949:0:(mdd_device.c:1825:mdd_changelog_clear()) soaked-MDD0000: No entry for user 1
[340067.766748] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340067.789495] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) Skipped 299 previous similar messages
[340131.819910] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340131.842686] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) Skipped 1855 previous similar messages
[340265.331826] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340265.354552] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) Skipped 2651 previous similar messages
[340419.033915] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340419.048319] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340424.618904] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340424.633301] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 22 previous similar messages
[340424.644781] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340424.659646] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 22 previous similar messages
[340432.347952] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
...
[340475.463999] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -116
[340475.479044] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 177 previous similar messages
[340516.956460] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340516.970860] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 294 previous similar messages
[340516.982425] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340516.997295] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 294 previous similar messages
[340581.591455] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340581.605835] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 172 previous similar messages
[340581.617394] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340581.632256] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 172 previous similar messages
[340681.741732] LustreError: 58187:0:(llog_osd.c:616:llog_osd_write_rec()) soaked-MDT0003-osp-MDT0000: index 36369 already set in log bitmap
[340681.755535] LustreError: 58187:0:(llog_osd.c:618:llog_osd_write_rec()) LBUG
[340681.763417] Pid: 58187, comm: mdt_rdpg00_004 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
[340681.775080] Call Trace:
[340681.777921]  [<ffffffffc1041e6c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[340681.785341]  [<ffffffffc1041f1c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[340681.792402]  [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
[340681.800519]  [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
[340681.808044]  [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
[340681.815763]  [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
[340681.822701]  [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
[340681.830372]  [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
[340681.837748]  [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
[340681.844795]  [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
[340681.851729]  [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
[340681.858558]  [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
[340681.865523]  [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
[340681.872958]  [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
[340681.879515]  [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
[340681.887378]  [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[340681.896074]  [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[340681.903228]  [<ffffffff842c61f1>] kthread+0xd1/0xe0
[340681.908808]  [<ffffffff8498dd37>] ret_from_fork_nospec_end+0x0/0x39
[340681.915929]  [<ffffffffffffffff>] 0xffffffffffffffff
[340681.921610] Kernel panic - not syncing: LBUG
[340681.926473] CPU: 7 PID: 58187 Comm: mdt_rdpg00_004 Kdump: loaded Tainted: P           OE  ------------   3.10.0-1062.9.1.el7_lustre.x86_64 #1
[340681.940752] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[340681.953368] Call Trace:
[340681.956203]  [<ffffffff8497ac23>] dump_stack+0x19/0x1b
[340681.962038]  [<ffffffff84974967>] panic+0xe8/0x21f
[340681.967489]  [<ffffffffc1041f6b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[340681.974504]  [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
[340681.982585]  [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
[340681.990085]  [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
[340681.997775]  [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
[340682.004696]  [<ffffffffc10edf19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[340682.012803]  [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
[340682.020422]  [<ffffffffc1475b43>] ? update_records_attr_set_pack+0xc3/0xf0 [ptlrpc]
[340682.029124]  [<ffffffffc14906fd>] ? prepare_writing_updates.isra.12+0x139/0x14a [ptlrpc]
[340682.038311]  [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
[340682.045606]  [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
[340682.052620]  [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
[340682.059528]  [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
[340682.066362]  [<ffffffffc13f5e97>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
[340682.074451]  [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
[340682.081377]  [<ffffffffc13f636f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[340682.089646]  [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
[340682.097039]  [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
[340682.103593]  [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
[340682.111377]  [<ffffffffc102902e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[340682.119391]  [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[340682.128070]  [<ffffffffc13fde2e>] ? ptlrpc_wait_event+0x12e/0x5b0 [ptlrpc]
[340682.135852]  [<ffffffff842d3903>] ? __wake_up+0x13/0x20
[340682.141814]  [<ffffffffc140254a>] ? ptlrpc_server_handle_req_in+0x92a/0x1100 [ptlrpc]
[340682.150657]  [<ffffffff842c72e0>] ? wake_up_atomic_t+0x30/0x30
[340682.157298]  [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[340682.164440]  [<ffffffffc1406870>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[340682.172793]  [<ffffffff842c61f1>] kthread+0xd1/0xe0
[340682.178335]  [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40
[340682.185231]  [<ffffffff8498dd37>] ret_from_fork_nospec_begin+0x21/0x21
[340682.192614]  [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40
Comment by Alex Zhuravlev [ 22/Apr/20 ]

I think this shares a root cause with LU-13195 - basically the llog got corrupted and then led to different symptoms.

Comment by Sarah Liu [ 22/Apr/20 ]

The LBUG seems to be easily reproducible on soak; I have come across this 2-3 times so far.

Comment by Alex Zhuravlev [ 23/Apr/20 ]

would it be possible to grab full Lustre logs with libcfs_panic_on_lbug=0?
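
A sketch of how the full debug log could be captured once the panic is disabled (the debug mask and buffer size are just examples):

lctl set_param debug=-1 debug_mb=1024        # enable all debug flags and enlarge the trace buffer
# ... reproduce the LBUG ...
lctl dk /tmp/lustre-debug-$(hostname).log    # dump the kernel debug buffer to a file

With panic_on_lbug=0 Lustre should also dump a log to /tmp/lustre-log.<time>.<pid> automatically when the LBUG fires.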

Comment by Sarah Liu [ 23/Apr/20 ]

Hi Alex, I will restart with the debug on

Comment by Sarah Liu [ 24/Apr/20 ]

Hi Alex,

I am having a weird issue when setting panic_on_lbug=0 permanently on soak-8 (MGS). Here is what I did (see the sketch below):
1. lctl set_param -P panic_on_lbug=0
2. umount, remount as ldiskfs, and check the config log; the value was set to 0
3. mount lustre and check panic_on_lbug; it was still 1, it did not change.
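
A sketch of the sequence above (device and mount-point names are hypothetical):

lctl set_param -P panic_on_lbug=0           # 1. record the setting in the MGS configuration log
umount /mnt/mgt
mount -t ldiskfs /dev/mapper/mgt /mnt/mgt   # 2. remount as ldiskfs; the config log shows the value as 0
umount /mnt/mgt
mount -t lustre /dev/mapper/mgt /mnt/mgt    # 3. remount as lustre
lctl get_param panic_on_lbug                #    still reports panic_on_lbug=1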

I am not sure if this is related to the llog issue here; can you please check? Do you need any logs for this? If it is unrelated, I will create a new ticket, and may need to delete the bad stuff and restart.

Thanks

Comment by Alex Zhuravlev [ 26/Apr/20 ]

sarah I don't think there is any relation here. I think you can either modify the source, or set panic_on_lbug=0 in the scripts or in the modules conf file (see the sketch below).
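
A sketch of the modules-conf-file route, using the parameter name from the earlier comment (the file name under /etc/modprobe.d/ is arbitrary):

# /etc/modprobe.d/libcfs.conf
options libcfs libcfs_panic_on_lbug=0

This takes effect the next time the libcfs module is loaded on the node.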

Comment by Sarah Liu [ 29/Apr/20 ]

I just uploaded the lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed.

Comment by Alex Zhuravlev [ 06/May/20 ]

thanks.. looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; regular failover shouldn't cause this.
can you please explain what the test is doing?

Comment by Sarah Liu [ 06/May/20 ]

there are 2 kinds of MDS fault injections; I think when the crash happened, it was in the middle of mds_failover (see the sketch after this list for the failover case):
1. mds1 failover
reboot mds1
mount the disks on the failover pair mds2
after mds1 is back up, fail the disks back to mds1

2. mds restart
this is similar to mds failover, except the disks are not mounted on the failover pair; instead we wait and mount them back once the server is up
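
A sketch of the mds1-failover sequence as described (host names, device, and mount point are placeholders):

ssh mds1 reboot -f                                     # 1. reboot mds1
ssh mds2 mount -t lustre /dev/mapper/mdt0 /mnt/mdt0    # 2. mount mds1's target on the failover pair mds2
# ... wait until mds1 is back up ...
ssh mds2 umount /mnt/mdt0                              # 3. fail the target back to mds1
ssh mds1 mount -t lustre /dev/mapper/mdt0 /mnt/mdt0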

Comment by Alex Zhuravlev [ 06/May/20 ]

sorry, it's not quite enough information..
it would be very helpful if you could start the test and then grab logs (let's start with messages and/or consoles) from all the nodes.
one interesting thing from the log attached:

[ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
[ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
[ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
[ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
[ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3) 
[ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
[ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
[ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
[ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)

I guess this shouldn't happen during this test?

Comment by Sarah Liu [ 07/May/20 ]

Ok, I will restart the tests and post logs.
The quoted log seems hardware related, not expected during the test.

Comment by Sarah Liu [ 11/May/20 ]

I restarted the test and am not seeing the LBUG, but MDS failover still failed. The secondary MDS didn't fail back the device; please check the 2 attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120

Comment by Alex Zhuravlev [ 13/May/20 ]

sarah I think you should try with recent master, which has the LU-13402 fix

Comment by Alex Zhuravlev [ 20/May/20 ]

hello, any updates on this issue?

Comment by Peter Jones [ 23/May/20 ]

bzzz I believe that the testing has been delayed due to hardware issues

Comment by Sarah Liu [ 28/May/20 ]

Here is the update: one MDS hit an LBUG again, but a different one, when running with a master build that includes the fix for LU-13402.

soak-8 console

[11360.312904] Lustre: Failing over soaked-MDT0001
[11360.312929] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -19
[11360.312937] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[11360.349350] LustreError: 4582:0:(import.c:705:ptlrpc_connect_import_locked()) can't connect to a closed import
[11360.853248] Lustre: server umount soaked-MDT0001 complete
[11361.443391] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.102@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[11361.463245] LustreError: Skipped 93 previous similar messages
[11363.753573] Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.109@o2ib, removing former export from 0@lo
[11363.765487] Lustre: Skipped 1 previous similar message
[11393.837251] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[11393.853109] LustreError: 4733:0:(llog_osd.c:263:llog_osd_read_header()) soaked-MDT0001-osp-MDT0000: bad log  [0x240000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
[11393.871120] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) ASSERTION( ctxt != ((void *)0) ) failed: 
[11393.882886] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) LBUG
[11393.891077] Pid: 4733, comm: lod0000_rec0001 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Thu Apr 23 02:14:20 UTC 2020
[11393.902641] Call Trace:
[11393.905394]  [<ffffffffc0ad10cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[11393.912739]  [<ffffffffc0ad117c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[11393.919677]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11393.926941]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11393.934759]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11393.940234]  [<ffffffffa518dd37>] ret_from_fork_nospec_end+0x0/0x39
[11393.947259]  [<ffffffffffffffff>] 0xffffffffffffffff
[11393.953057] LustreError: dumping log to /tmp/lustre-log.1590658382.4733
[11520.707452] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11520.715960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11520.724727] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
[11520.732665] Call Trace:
[11520.735438]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
[11520.741684]  [<ffffffffa5180a09>] schedule+0x29/0x70
[11520.747261]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[11520.754205]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11520.761429]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11520.769234]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
[11520.776734]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11520.782200]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11520.789020]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
[11520.796318]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11550.607865] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reached hard limit: 900, extend: 1
[11550.622272] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) Skipped 31 previous similar messages
[11640.796481] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11640.804564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11640.813320] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
[11640.821289] Call Trace:
[11640.824059]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
[11640.830319]  [<ffffffffa5180a09>] schedule+0x29/0x70
[11640.835909]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[11640.842839]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11640.850079]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11640.857882]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
[11640.865394]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11640.870885]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11640.877723]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
[11640.885019]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11760.885726] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11760.893864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11760.902634] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
Comment by Alex Zhuravlev [ 28/May/20 ]

how many successful failovers had it done?

Comment by Sarah Liu [ 29/May/20 ]

9 MDS failovers completed before the failure

Comment by Alex Zhuravlev [ 30/May/20 ]

9 MDS failovers completed before the failure

is it any better than before?

also, full logs would be very much appreciated.
there are two patches which may help cure the corrupted llog problem (see the sketch after the links below):
https://review.whamcloud.com/#/c/38385/
https://review.whamcloud.com/#/c/38387/
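
A sketch of pulling both changes from Gerrit onto a master checkout for testing (the project path and the /1 patchset suffixes are assumptions; check Gerrit for the current revisions):

git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/85/38385/1 && git cherry-pick FETCH_HEAD
git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/87/38387/1 && git cherry-pick FETCH_HEAD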

Comment by Sarah Liu [ 08/Jun/20 ]

Hi Alex, yes, it is better than before. Right now soak is running b2_12 testing; I will try the patches when that testing is done.
