Details
Bug
Resolution: Cannot Reproduce
Major
None
Lustre 2.14.0
lustre-master-ib #404
3
9223372036854775807
Description
One MDS hung in mount.lustre during the failover process.
soak-9 console
[ 3961.086008] mount.lustre D ffff8f5730291070 0 5206 5205 0x00000082
[ 3961.093940] Call Trace:
[ 3961.096752] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.105419] [<ffffffff99380a09>] schedule+0x29/0x70
[ 3961.110980] [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
[ 3961.117509] [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
[ 3961.124565] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.133226] [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
[ 3961.139955] [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
[ 3961.146222] [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
[ 3961.154226] [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
[ 3961.161271] [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
[ 3961.169552] [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
[ 3961.176961] [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
[ 3961.184004] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.192673] [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
[ 3961.200110] [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
[ 3961.207925] [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
[ 3961.216133] [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
[ 3961.224020] [<ffffffff98e23db6>] ? kfree+0x106/0x140
[ 3961.229698] [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
[ 3961.238396] [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
[ 3961.246314] [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
[ 3961.254033] [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
[ 3961.262511] [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
[ 3961.268390] [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
[ 3961.275401] [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
[ 3961.281064] [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
[ 3961.287299] [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
[ 3961.293070] [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
[ 3961.300073] [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
[ 3961.307276] [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
[ 3961.312939] [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
[ 3961.319665] [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
[ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
[ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
[ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
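To read the blocked call chain out of a console dump like the one above, a small parser can help. The following is a rough sketch, not an existing Lustre or kernel tool: the names `FRAME_RE`, `parse_frames`, `reliable_chain`, and `SAMPLE` are hypothetical, and the sample is a shortened excerpt of the soak-9 trace. It assumes the usual kernel convention that a frame prefixed with `?` is an address found on the stack that the unwinder could not verify.

```python
import re

# One frame looks like:
#   [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
# An optional "?" before the symbol marks an unverified frame, and the
# trailing "[module]" is absent for core-kernel symbols.
FRAME_RE = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<unsure>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[A-Za-z_]\w*)\])?"
)


def parse_frames(console_text):
    """Return one dict per stack frame found in the console text."""
    frames = []
    for m in FRAME_RE.finditer(console_text):
        frames.append({
            "func": m.group("func"),
            "module": m.group("module") or "kernel",
            "reliable": m.group("unsure") is None,
        })
    return frames


def reliable_chain(console_text):
    """Function names of the frames not marked with '?', innermost first."""
    return [f["func"] for f in parse_frames(console_text) if f["reliable"]]


# Shortened excerpt of the trace (hypothetical test input for this sketch).
SAMPLE = (
    "[ 3961.105419] [<ffffffff99380a09>] schedule+0x29/0x70 "
    "[ 3961.110980] [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0 "
    "[ 3961.117509] [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760 "
    "[ 3961.133226] [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140 "
    "[ 3961.146222] [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass] "
    "[ 3961.169552] [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]"
)

if __name__ == "__main__":
    for name in reliable_chain(SAMPLE):
        print(name)
```

Applied to the full trace, the verified frames show mount.lustre sleeping in wait_for_completion, called from llog_process_or_fork while the MGC processes the configuration log.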
There are two kinds of MDS fault injection. I think the crash happened in the middle of mds_failover.
1. mds1 failover
reboot mds1
mount the disks on the failover pair, mds2
after mds1 is back up, fail the disks back to mds1
2. mds restart
This is similar to MDS failover, except that the disks are not mounted on the failover pair; instead, wait until the server is back up and then mount the disks on it again.
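The two injection sequences can be written down as ordered step lists to make the difference explicit. This is only an illustrative sketch: the function names and the step labels (`reboot`, `mount_disks`, `wait_until_up`, `fail_back_disks`) are hypothetical and do not come from the soak test framework.

```python
def mds_failover_steps(primary="mds1", partner="mds2"):
    """Steps of the 'mds failover' injection, as described in the ticket."""
    return [
        ("reboot", primary),
        ("mount_disks", partner),      # disks move to the failover pair
        ("wait_until_up", primary),
        ("fail_back_disks", primary),  # disks return to the original server
    ]


def mds_restart_steps(primary="mds1"):
    """Steps of the 'mds restart' injection: no mount on the partner."""
    return [
        ("reboot", primary),
        ("wait_until_up", primary),
        ("mount_disks", primary),
    ]


def mount_hosts(steps):
    """Hosts on which the disks get mounted, in order of occurrence."""
    return [host for action, host in steps
            if action in ("mount_disks", "fail_back_disks")]


if __name__ == "__main__":
    print(mount_hosts(mds_failover_steps()))
    print(mount_hosts(mds_restart_steps()))
```

Under this model, failover involves two mounts (first on mds2, then back on mds1) while restart involves one, so a mount hang has more opportunities to show up during failover.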