Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

          Activity

            [LU-13469] MDS hung during mount

            bzzz Alex Zhuravlev added a comment -

            hello, any updates on this issue?

            bzzz Alex Zhuravlev added a comment -

            sarah I think you should try with the recent master, which has LU-13402.
            sarah Sarah Liu added a comment -

            Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS did not fail back the device. Please check the two attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120.
            sarah Sarah Liu added a comment -

            Ok, I will restart the tests and post logs.
            The quoted log seems hardware-related and is not something expected during the test.

            bzzz Alex Zhuravlev added a comment -

            Sorry, that is not quite enough information. It would be very helpful if you could start the test and then grab logs (let's start with messages and/or consoles) from all the nodes.
            One interesting thing from the attached log:

            [ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
            [ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
            [ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
            [ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            [ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3) 
            [ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
            [ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
            [ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
            [ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            

            I guess this shouldn't happen during this test?

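            A throwaway sketch of the kind of collection loop that would cover this request; node names and paths are placeholders for whichever nodes are in the soak run, not the harness's actual tooling:

              # pull syslog from every node after starting the test
              for node in soak-8 soak-9 soak-11; do
                  mkdir -p /tmp/soak-logs/$node
                  scp $node:/var/log/messages /tmp/soak-logs/$node/
              done
              # console output is whatever the admin node's console capture
              # (conserver / IPMI SOL) produces; copy those files alongside the syslogs
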
            sarah Sarah Liu added a comment -

            There are 2 kinds of MDS fault injections; I think when the crash happened, it was in the middle of mds_failover (see the sketch below).

            1. mds1 failover
               reboot mds1
               mount the disks on the failover pair mds2
               after mds1 is back up, fail the disks back to mds1

            2. mds restart
               this is similar to mds failover, except the disks are not mounted on the failover pair; instead, wait and mount the disks back once the server is up again
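            A rough shell sketch of that mds1 failover sequence, only to make the steps concrete; the hostnames, device path, and mount point are placeholders, not the soak framework's actual commands:

              # 1. power-cycle / reboot the primary MDS
              ssh soak-9 reboot

              # 2. while it is down, import its MDT on the failover partner
              ssh soak-11 "mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mds1"

              # 3. once the primary is back, fail the target back to it
              ssh soak-11 "umount /mnt/lustre-mds1"
              ssh soak-9  "mount -t lustre /dev/mapper/mdt0 /mnt/lustre-mds1"

            The hang in this ticket's description would correspond to one of those mount steps.
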

            bzzz Alex Zhuravlev added a comment -

            Thanks. Looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; a regular failover shouldn't cause this.
            Can you please explain what the test is doing?
            sarah Sarah Liu added a comment -

            I just uploaded the Lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed.

            bzzz Alex Zhuravlev added a comment -

            sarah I don't think there is any relation here. You can either modify the source or set panic_on_lbug=0 in the scripts, or in the modules conf file.
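            A minimal sketch of the modules-conf-file route mentioned above, assuming the underlying libcfs module parameter is named libcfs_panic_on_lbug (worth verifying against the installed build before relying on it):

              # persist the setting across reboots on each server (assumed parameter name)
              echo "options libcfs libcfs_panic_on_lbug=0" > /etc/modprobe.d/lustre-panic.conf

              # runtime equivalent, as used elsewhere in this ticket
              lctl set_param panic_on_lbug=0
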
            sarah Sarah Liu added a comment -

            Hi Alex,

            I am having a weird issue when setting panic_on_lbug=0 permanently on soak-8 (MGS). Here is what I did (see also the verification sketch below):
            1. lctl set_param -P panic_on_lbug=0
            2. unmounted and remounted as ldiskfs and checked the config log; the value was set to 0
            3. mounted Lustre and checked again; panic_on_lbug was still 1, it didn't change.

            I am not sure if this is related to the llog issue here; can you please check? Do you need any log for this? If it is unrelated, I will create a new ticket, and I may need to delete the bad records and restart.

            Thanks
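            A small verification sketch for the steps above, assuming the usual layout where "lctl set_param -P" records land in the params llog on the MGT; the device and mount point below are placeholders:

              # after remounting Lustre, check the live value
              lctl get_param panic_on_lbug

              # on the MGS, with the MGT mounted as ldiskfs (read-only is enough),
              # dump the params config log that "lctl set_param -P" writes to
              mount -t ldiskfs -o ro /dev/mapper/mgt /mnt/mgt-ldiskfs
              llog_reader /mnt/mgt-ldiskfs/CONFIGS/params
              umount /mnt/mgt-ldiskfs
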
            sarah Sarah Liu added a comment -

            Hi Alex, I will restart with debug enabled.

            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 7
