[LU-13469] MDS hung during mount Created: 21/Apr/20 Updated: 27/Sep/23 Resolved: 27/Sep/23 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Sarah Liu | Assignee: | Alex Zhuravlev |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | soak |
| Environment: |
lustre-master-ib #404 |
| Attachments: |
|
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
One MDS hung during mount during the failover process. soak-9 console:

[ 3961.086008] mount.lustre D ffff8f5730291070 0 5206 5205 0x00000082
[ 3961.093940] Call Trace:
[ 3961.096752] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.105419] [<ffffffff99380a09>] schedule+0x29/0x70
[ 3961.110980] [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
[ 3961.117509] [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
[ 3961.124565] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.133226] [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
[ 3961.139955] [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
[ 3961.146222] [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
[ 3961.154226] [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
[ 3961.161271] [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 3961.169552] [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 3961.176961] [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
[ 3961.184004] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.192673] [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
[ 3961.200110] [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
[ 3961.207925] [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
[ 3961.216133] [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
[ 3961.224020] [<ffffffff98e23db6>] ? kfree+0x106/0x140
[ 3961.229698] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.238396] [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
[ 3961.246314] [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
[ 3961.254033] [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 3961.262511] [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
[ 3961.268390] [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
[ 3961.275401] [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
[ 3961.281064] [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
[ 3961.287299] [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
[ 3961.293070] [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
[ 3961.300073] [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 3961.307276] [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
[ 3961.312939] [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
[ 3961.319665] [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
[ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
[ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect |
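As a first step for a hang like this, the stack of the blocked mount thread and of any other D-state tasks can be pulled directly from the node (a minimal sketch; pid 5206 is taken from the trace above, and the output file name is only illustrative):

# stack of the hung mount.lustre process
cat /proc/5206/stack
# dump all uninterruptible (D-state) tasks into the kernel log, then save the ring buffer
echo w > /proc/sysrq-trigger
dmesg > /tmp/soak-9-blocked-tasks.log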
| Comments |
| Comment by Alex Zhuravlev [ 21/Apr/20 ] |
|
it would be helpful to see all available traces |
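A sketch of one way all task traces could be gathered on the affected MDS (standard sysrq interface, assuming sysrq is enabled; the file name is illustrative):

echo t > /proc/sysrq-trigger              # dump stack traces of every task to the kernel log
dmesg > /tmp/soak-9-all-task-traces.log   # save the ring buffer for attachment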
| Comment by Sarah Liu [ 22/Apr/20 ] |
|
I restarted the test and the system crashed overnight on 3 MDSs. There are crash dumps available on spirit for soak-8/9/10; the console log below was copied from soak-16:

[w3liu@ssh-1 lu-13469]$ scp root@soak-16:/scratch/logs/console/soak-9.log-20200419.gz .
soak-9.log-20200419.gz 100% 184KB 183.6KB/s 00:00

Console output seen via screen -x on soak-16:

[338152.399548] Lustre: Skipped 1 previous similar message
[338153.078244] Lustre: soaked-MDT0000: Client 333b7f48-68da-4 (at 192.168.1.120@o2ib) reconnecting
[338153.088076] Lustre: Skipped 6 previous similar messages
[338154.126802] Lustre: soaked-MDT0000: Client b8a5f1df-2f21-4 (at 192.168.1.119@o2ib) reconnecting
[338154.136639] Lustre: Skipped 8 previous similar messages
[338533.965788] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Timed out tx for 192.168.1.110@o2ib: 1 seconds
[338533.977364] LNet: 57040:0:(o2iblnd_cb.c:3394:kiblnd_check_conns()) Skipped 146 previous similar messages
[338610.961038] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) peer NI (192.168.1.110@o2ib) recovery failed with -110
[338610.974469] LNetError: 57040:0:(lib-move.c:3680:lnet_handle_recovery_reply()) Skipped 122 previous similar messages
[338753.412960] Lustre: soaked-MDT0000: Client f59a1caf-b57f-4 (at 192.168.1.130@o2ib) reconnecting
[338753.422834] Lustre: Skipped 2 previous similar messages
[338753.428846] Lustre: soaked-MDT0000: Connection restored to 3a61c982-16a5-4 (at 192.168.1.130@o2ib)
[338753.438969] Lustre: Skipped 20 previous similar messages
[338760.544124] Lustre: soaked-MDT0000: Received new LWP connection from 192.168.1.110@o2ib, removing former export from same NID
[339000.160995] Lustre: 57949:0:(mdd_device.c:1825:mdd_changelog_clear()) soaked-MDD0000: No entry for user 1
[340067.766748] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340067.789495] LustreError: 57125:0:(out_handler.c:918:out_tx_end()) Skipped 299 previous similar messages
[340131.819910] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340131.842686] LustreError: 58494:0:(out_handler.c:918:out_tx_end()) Skipped 1855 previous similar messages
[340265.331826] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) soaked-MDT0000-osd: undo for /tmp/rpmbuild-lustre-jenkins-UoS8uXZU/BUILD/lustre-2.13.53_18_g6706bfa/lustre/ptlrpc/../../lustre/target/out_handler.c:453: rc = -524
[340265.354552] LustreError: 57983:0:(out_handler.c:918:out_tx_end()) Skipped 2651 previous similar messages
[340419.033915] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340419.048319] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340424.618904] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340424.633301] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 22 previous similar messages
[340424.644781] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340424.659646] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 22 previous similar messages
[340432.347952] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
...
[340475.463999] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -116
[340475.479044] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 177 previous similar messages
[340516.956460] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340516.970860] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 294 previous similar messages
[340516.982425] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340516.997295] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 294 previous similar messages
[340581.591455] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 llog-records: rc = -2
[340581.605835] LustreError: 57214:0:(llog_cat.c:756:llog_cat_cancel_arr_rec()) Skipped 172 previous similar messages
[340581.617394] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) soaked-MDT0003-osp-MDT0000: fail to cancel 1 of 1 llog-records: rc = -2
[340581.632256] LustreError: 57214:0:(llog_cat.c:793:llog_cat_cancel_records()) Skipped 172 previous similar messages
[340681.741732] LustreError: 58187:0:(llog_osd.c:616:llog_osd_write_rec()) soaked-MDT0003-osp-MDT0000: index 36369 already set in log bitmap
[340681.755535] LustreError: 58187:0:(llog_osd.c:618:llog_osd_write_rec()) LBUG
[340681.763417] Pid: 58187, comm: mdt_rdpg00_004 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Tue Feb 11 19:11:56 UTC 2020
[340681.775080] Call Trace:
[340681.777921] [<ffffffffc1041e6c>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[340681.785341] [<ffffffffc1041f1c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[340681.792402] [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
[340681.800519] [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
[340681.808044] [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
[340681.815763] [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
[340681.822701] [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
[340681.830372] [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
[340681.837748] [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
[340681.844795] [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
[340681.851729] [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
[340681.858558] [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
[340681.865523] [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
[340681.872958] [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
[340681.879515] [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
[340681.887378] [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[340681.896074] [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[340681.903228] [<ffffffff842c61f1>] kthread+0xd1/0xe0
[340681.908808] [<ffffffff8498dd37>] ret_from_fork_nospec_end+0x0/0x39
[340681.915929] [<ffffffffffffffff>] 0xffffffffffffffff
[340681.921610] Kernel panic - not syncing: LBUG
[340681.926473] CPU: 7 PID: 58187 Comm: mdt_rdpg00_004 Kdump: loaded Tainted: P OE ------------ 3.10.0-1062.9.1.el7_lustre.x86_64 #1
[340681.940752] Hardware name: Intel Corporation S2600GZ ........../S2600GZ, BIOS SE5C600.86B.01.08.0003.022620131521 02/26/2013
[340681.953368] Call Trace:
[340681.956203] [<ffffffff8497ac23>] dump_stack+0x19/0x1b
[340681.962038] [<ffffffff84974967>] panic+0xe8/0x21f
[340681.967489] [<ffffffffc1041f6b>] lbug_with_loc+0x9b/0xa0 [libcfs]
[340681.974504] [<ffffffffc10d34e5>] llog_osd_write_rec+0x1395/0x14b0 [obdclass]
[340681.982585] [<ffffffffc10c22c3>] llog_write_rec+0x273/0x540 [obdclass]
[340681.990085] [<ffffffffc10c79f0>] llog_cat_add_rec+0x220/0x880 [obdclass]
[340681.997775] [<ffffffffc10bf422>] llog_add+0x162/0x1d0 [obdclass]
[340682.004696] [<ffffffffc10edf19>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[340682.012803] [<ffffffffc1490a01>] sub_updates_write+0x2e8/0xdd7 [ptlrpc]
[340682.020422] [<ffffffffc1475b43>] ? update_records_attr_set_pack+0xc3/0xf0 [ptlrpc]
[340682.029124] [<ffffffffc14906fd>] ? prepare_writing_updates.isra.12+0x139/0x14a [ptlrpc]
[340682.038311] [<ffffffffc1472027>] top_trans_stop+0x827/0xbc0 [ptlrpc]
[340682.045606] [<ffffffffc19ab98c>] lod_trans_stop+0x25c/0x340 [lod]
[340682.052620] [<ffffffffc1a6623e>] mdd_trans_stop+0x2e/0x174 [mdd]
[340682.059528] [<ffffffffc1a59a07>] mdd_attr_set+0x6a7/0xdd0 [mdd]
[340682.066362] [<ffffffffc13f5e97>] ? lustre_msg_add_version+0x27/0xa0 [ptlrpc]
[340682.074451] [<ffffffffc18e21b9>] mdt_mfd_close+0x6b9/0x860 [mdt]
[340682.081377] [<ffffffffc13f636f>] ? lustre_pack_reply_flags+0x6f/0x1e0 [ptlrpc]
[340682.089646] [<ffffffffc18e7a71>] mdt_close_internal+0x121/0x220 [mdt]
[340682.097039] [<ffffffffc18e7d91>] mdt_close+0x221/0x790 [mdt]
[340682.103593] [<ffffffffc145db9a>] tgt_request_handle+0x95a/0x1630 [ptlrpc]
[340682.111377] [<ffffffffc102902e>] ? ktime_get_real_seconds+0xe/0x10 [libcfs]
[340682.119391] [<ffffffffc1402f76>] ptlrpc_server_handle_request+0x256/0xb10 [ptlrpc]
[340682.128070] [<ffffffffc13fde2e>] ? ptlrpc_wait_event+0x12e/0x5b0 [ptlrpc]
[340682.135852] [<ffffffff842d3903>] ? __wake_up+0x13/0x20
[340682.141814] [<ffffffffc140254a>] ? ptlrpc_server_handle_req_in+0x92a/0x1100 [ptlrpc]
[340682.150657] [<ffffffff842c72e0>] ? wake_up_atomic_t+0x30/0x30
[340682.157298] [<ffffffffc1407424>] ptlrpc_main+0xbb4/0x1550 [ptlrpc]
[340682.164440] [<ffffffffc1406870>] ? ptlrpc_register_service+0xf90/0xf90 [ptlrpc]
[340682.172793] [<ffffffff842c61f1>] kthread+0xd1/0xe0
[340682.178335] [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40
[340682.185231] [<ffffffff8498dd37>] ret_from_fork_nospec_begin+0x21/0x21
[340682.192614] [<ffffffff842c6120>] ? insert_kthread_work+0x40/0x40 |
| Comment by Alex Zhuravlev [ 22/Apr/20 ] |
|
I think this shares a root cause with |
| Comment by Sarah Liu [ 22/Apr/20 ] |
|
The LBUG seems to be easy to reproduce on soak; I have come across it 2-3 times so far. |
| Comment by Alex Zhuravlev [ 23/Apr/20 ] |
|
would it be possible to grab full Lustre logs with libcfs_panic_on_lbug=0? |
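One possible way to set this up before the next reproduction attempt (panic_on_lbug, debug, debug_mb and lctl dk are the standard tunables/commands; the dump file name is only illustrative):

lctl set_param panic_on_lbug=0            # keep the node alive after an LBUG
lctl set_param debug=-1 debug_mb=1024     # enable all debug flags and enlarge the trace buffer
# ... reproduce the failure, then dump the Lustre debug buffer:
lctl dk > /tmp/lustre-debug-$(hostname).log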
| Comment by Sarah Liu [ 23/Apr/20 ] |
|
Hi Alex, I will restart with the debug on |
| Comment by Sarah Liu [ 24/Apr/20 ] |
|
Hi Alex, I am having a weird issue when setting panic_on_lbug=0 permanently on soak-8 (MGS). Here is what I did. I am not sure if this is related to the llog issue here, can you please check? Do you need any log for this? If it is unrelated, I will create a new ticket, and may need to delete the bad stuff and restart. Thanks |
| Comment by Alex Zhuravlev [ 26/Apr/20 ] |
|
sarah I don't think there is any relation here. You can either modify the source, set panic_on_lbug=0 in the scripts, or set it in the modules conf file. |
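A sketch of the modules-conf variant (the file name under /etc/modprobe.d/ is just a convention; the option takes effect the next time the libcfs module is loaded):

# /etc/modprobe.d/lustre.conf
options libcfs libcfs_panic_on_lbug=0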
| Comment by Sarah Liu [ 29/Apr/20 ] |
|
I just uploaded the lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed. |
| Comment by Alex Zhuravlev [ 06/May/20 ] |
|
thanks.. looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; a regular failover shouldn't cause this. |
| Comment by Sarah Liu [ 06/May/20 ] |
|
there are 2 kinds of MDS fault injections: 1. mds_failover, 2. mds restart. I think when the crash happened, it was in the middle of mds_failover; see the sketch below. |
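For reference, a rough sketch of what the two injections amount to on an HA pair (host names, device paths and mount points here are illustrative, not the actual soak configuration):

# mds_failover: crash the primary, take the MDT over on the partner node
ssh soak-8 'echo c > /proc/sysrq-trigger'                        # simulated crash/power-cycle
ssh soak-9 'mount -t lustre /dev/mapper/mdt1 /mnt/soaked-mdt1'   # failover mount
# once the primary is back: umount on the partner, remount on the primary (failback)

# mds restart: clean umount/remount on the same node
ssh soak-8 'umount /mnt/soaked-mdt1 && mount -t lustre /dev/mapper/mdt1 /mnt/soaked-mdt1'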
| Comment by Alex Zhuravlev [ 06/May/20 ] |
|
sorry, it's not quite enough information..

[ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
[ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
[ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
[ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
[ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3)
[ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
[ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
[ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
[ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)

I guess this shouldn't happen during this test? |
| Comment by Sarah Liu [ 07/May/20 ] |
|
Ok, I will restart the tests and post logs. |
| Comment by Sarah Liu [ 11/May/20 ] |
|
Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS didn't fail back the device. Please check the 2 attachments ending with 051120, e.g. soak-11.log-051120. |
| Comment by Alex Zhuravlev [ 13/May/20 ] |
|
sarah I think you should try with the recent master which has |
| Comment by Alex Zhuravlev [ 20/May/20 ] |
|
hello, any updates on this issue?
|
| Comment by Peter Jones [ 23/May/20 ] |
|
bzzz I believe that the testing has been delayed due to hardware issues |
| Comment by Sarah Liu [ 28/May/20 ] |
|
Here is the update. One MDS hit an LBUG again, but a different one, when running with a master build including the fix of

soak-8 console:

[11360.312904] Lustre: Failing over soaked-MDT0001
[11360.312929] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -19
[11360.312937] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
[11360.349350] LustreError: 4582:0:(import.c:705:ptlrpc_connect_import_locked()) can't connect to a closed import
[11360.853248] Lustre: server umount soaked-MDT0001 complete
[11361.443391] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.102@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
[11361.463245] LustreError: Skipped 93 previous similar messages
[11363.753573] Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.109@o2ib, removing former export from 0@lo
[11363.765487] Lustre: Skipped 1 previous similar message
[11393.837251] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
[11393.853109] LustreError: 4733:0:(llog_osd.c:263:llog_osd_read_header()) soaked-MDT0001-osp-MDT0000: bad log [0x240000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
[11393.871120] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) ASSERTION( ctxt != ((void *)0) ) failed:
[11393.882886] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) LBUG
[11393.891077] Pid: 4733, comm: lod0000_rec0001 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Thu Apr 23 02:14:20 UTC 2020
[11393.902641] Call Trace:
[11393.905394] [<ffffffffc0ad10cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
[11393.912739] [<ffffffffc0ad117c>] lbug_with_loc+0x4c/0xa0 [libcfs]
[11393.919677] [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11393.926941] [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11393.934759] [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11393.940234] [<ffffffffa518dd37>] ret_from_fork_nospec_end+0x0/0x39
[11393.947259] [<ffffffffffffffff>] 0xffffffffffffffff
[11393.953057] LustreError: dumping log to /tmp/lustre-log.1590658382.4733
[11520.707452] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11520.715960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11520.724727] lod0000_rec0001 D ffff9fe7563b5230 0 4733 2 0x00000080
[11520.732665] Call Trace:
[11520.735438] [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
[11520.741684] [<ffffffffa5180a09>] schedule+0x29/0x70
[11520.747261] [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[11520.754205] [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11520.761429] [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11520.769234] [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
[11520.776734] [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11520.782200] [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11520.789020] [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
[11520.796318] [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11550.607865] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reached hard limit: 900, extend: 1
[11550.622272] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) Skipped 31 previous similar messages
[11640.796481] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11640.804564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11640.813320] lod0000_rec0001 D ffff9fe7563b5230 0 4733 2 0x00000080
[11640.821289] Call Trace:
[11640.824059] [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
[11640.830319] [<ffffffffa5180a09>] schedule+0x29/0x70
[11640.835909] [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
[11640.842839] [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
[11640.850079] [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
[11640.857882] [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
[11640.865394] [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
[11640.870885] [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11640.877723] [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
[11640.885019] [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
[11760.885726] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
[11760.893864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11760.902634] lod0000_rec0001 D ffff9fe7563b5230 0 4733 2 0x00000080 |
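One way the update llog flagged above (header magic 0x0 instead of the expected 0x10645539) could be inspected offline, assuming an ldiskfs-backed MDT (the device, mount point and exact file under update_log_dir are assumptions; the right file would have to be matched to FID [0x240000401:0x1:0x0]):

mount -t ldiskfs -o ro /dev/mapper/soaked-mdt1 /mnt/mdt-ldiskfs
ls -l /mnt/mdt-ldiskfs/update_log_dir/
llog_reader /mnt/mdt-ldiskfs/update_log_dir/<file matching the FID>   # dump the llog header and records
umount /mnt/mdt-ldiskfs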
| Comment by Alex Zhuravlev [ 28/May/20 ] |
|
how many successful failovers had it done? |
| Comment by Sarah Liu [ 29/May/20 ] |
|
9 MDS failovers completed before the failure. |
| Comment by Alex Zhuravlev [ 30/May/20 ] |
is it any better than before? also, full logs are very appreciated. |
| Comment by Sarah Liu [ 08/Jun/20 ] |
|
Hi Alex, yes, it is better than before. Right now soak is running b2_12 testing; I will try the patches when the testing is done. |