Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version: None
    • Affects Version: Lustre 2.14.0
    • Environment: lustre-master-ib #404
    • Severity: 3

    Description

      One MDS hung during mount during the failover process.

      soak-9 console

      [ 3961.086008] mount.lustre    D ffff8f5730291070     0  5206   5205 0x00000082
      [ 3961.093940] Call Trace:
      [ 3961.096752]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.105419]  [<ffffffff99380a09>] schedule+0x29/0x70
      [ 3961.110980]  [<ffffffff9937e511>] schedule_timeout+0x221/0x2d0
      [ 3961.117509]  [<ffffffff98ce10f6>] ? select_task_rq_fair+0x5a6/0x760
      [ 3961.124565]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.133226]  [<ffffffff99380dbd>] wait_for_completion+0xfd/0x140
      [ 3961.139955]  [<ffffffff98cdb4c0>] ? wake_up_state+0x20/0x20
      [ 3961.146222]  [<ffffffffc12f8b84>] llog_process_or_fork+0x254/0x520 [obdclass]
      [ 3961.154226]  [<ffffffffc12f8e64>] llog_process+0x14/0x20 [obdclass]
      [ 3961.161271]  [<ffffffffc132b055>] class_config_parse_llog+0x125/0x350 [obdclass]
      [ 3961.169552]  [<ffffffffc15beaf8>] mgc_process_cfg_log+0x788/0xc40 [mgc]
      [ 3961.176961]  [<ffffffffc15c223f>] mgc_process_log+0x3bf/0x920 [mgc]
      [ 3961.184004]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.192673]  [<ffffffffc15c3cc3>] mgc_process_config+0xc63/0x1870 [mgc]
      [ 3961.200110]  [<ffffffffc1336f27>] lustre_process_log+0x2d7/0xad0 [obdclass]
      [ 3961.207925]  [<ffffffffc136a064>] server_start_targets+0x12d4/0x2970 [obdclass]
      [ 3961.216133]  [<ffffffffc1339fe7>] ? lustre_start_mgc+0x257/0x2420 [obdclass]
      [ 3961.224020]  [<ffffffff98e23db6>] ? kfree+0x106/0x140
      [ 3961.229698]  [<ffffffffc1333360>] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
      [ 3961.238396]  [<ffffffffc136c7cc>] server_fill_super+0x10cc/0x1890 [obdclass]
      [ 3961.246314]  [<ffffffffc133cd88>] lustre_fill_super+0x498/0x990 [obdclass]
      [ 3961.254033]  [<ffffffffc133c8f0>] ? lustre_common_put_super+0x270/0x270 [obdclass]
      [ 3961.262511]  [<ffffffff98e4e7df>] mount_nodev+0x4f/0xb0
      [ 3961.268390]  [<ffffffffc1334d98>] lustre_mount+0x18/0x20 [obdclass]
      [ 3961.275401]  [<ffffffff98e4f35e>] mount_fs+0x3e/0x1b0
      [ 3961.281064]  [<ffffffff98e6d507>] vfs_kern_mount+0x67/0x110
      [ 3961.287299]  [<ffffffff98e6fc5f>] do_mount+0x1ef/0xce0
      [ 3961.293070]  [<ffffffff98e4737a>] ? __check_object_size+0x1ca/0x250
      [ 3961.300073]  [<ffffffff98e250ec>] ? kmem_cache_alloc_trace+0x3c/0x200
      [ 3961.307276]  [<ffffffff98e70a93>] SyS_mount+0x83/0xd0
      [ 3961.312939]  [<ffffffff9938dede>] system_call_fastpath+0x25/0x2a
      [ 3961.319665]  [<ffffffff9938de21>] ? system_call_after_swapgs+0xae/0x146
      [ 4024.321554] Lustre: soaked-MDT0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
      [ 4024.360505] Lustre: soaked-MDT0001: in recovery but waiting for the first client to connect
      [ 4025.087731] Lustre: soaked-MDT0001: Will be in recovery for at least 2:30, or until 27 clients reconnect
      
      

      Attachments

        1. lustre-log.1588133843.6068-soak-8
          124.12 MB
        2. soak-11.log-051120
          944 kB
        3. soak-9.log-20200419.gz
          184 kB
        4. trace-8
          1002 kB
        5. trace-s-11-051120
          997 kB
        6. trace-soak8
          976 kB

        Issue Links

          Activity

            [LU-13469] MDS hung during mount
            sarah Sarah Liu added a comment -

            Here is the update. One MDS hit an LBUG again, but a different one, when running with a master build that includes the fix from LU-13402.

            soak-8 console

            [11360.312904] Lustre: Failing over soaked-MDT0001
            [11360.312929] LustreError: 11-0: soaked-MDT0001-osp-MDT0000: operation out_update to node 0@lo failed: rc = -19
            [11360.312937] Lustre: soaked-MDT0001-osp-MDT0000: Connection to soaked-MDT0001 (at 0@lo) was lost; in progress operations using this service will wait for recovery to complete
            [11360.349350] LustreError: 4582:0:(import.c:705:ptlrpc_connect_import_locked()) can't connect to a closed import
            [11360.853248] Lustre: server umount soaked-MDT0001 complete
            [11361.443391] LustreError: 137-5: soaked-MDT0001_UUID: not available for connect from 192.168.1.102@o2ib (no target). If you are running an HA pair check that the target is mounted on the other server.
            [11361.463245] LustreError: Skipped 93 previous similar messages
            [11363.753573] Lustre: soaked-MDT0000: Received LWP connection from 192.168.1.109@o2ib, removing former export from 0@lo
            [11363.765487] Lustre: Skipped 1 previous similar message
            [11393.837251] LustreError: 167-0: soaked-MDT0001-osp-MDT0000: This client was evicted by soaked-MDT0001; in progress operations using this service will fail.
            [11393.853109] LustreError: 4733:0:(llog_osd.c:263:llog_osd_read_header()) soaked-MDT0001-osp-MDT0000: bad log  [0x240000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
            [11393.871120] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) ASSERTION( ctxt != ((void *)0) ) failed: 
            [11393.882886] LustreError: 4733:0:(lod_sub_object.c:938:lod_sub_prep_llog()) LBUG
            [11393.891077] Pid: 4733, comm: lod0000_rec0001 3.10.0-1062.9.1.el7_lustre.x86_64 #1 SMP Thu Apr 23 02:14:20 UTC 2020
            [11393.902641] Call Trace:
            [11393.905394]  [<ffffffffc0ad10cc>] libcfs_call_trace+0x8c/0xc0 [libcfs]
            [11393.912739]  [<ffffffffc0ad117c>] lbug_with_loc+0x4c/0xa0 [libcfs]
            [11393.919677]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11393.926941]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11393.934759]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11393.940234]  [<ffffffffa518dd37>] ret_from_fork_nospec_end+0x0/0x39
            [11393.947259]  [<ffffffffffffffff>] 0xffffffffffffffff
            [11393.953057] LustreError: dumping log to /tmp/lustre-log.1590658382.4733
            [11520.707452] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11520.715960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11520.724727] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11520.732665] Call Trace:
            [11520.735438]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11520.741684]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11520.747261]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11520.754205]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11520.761429]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11520.769234]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11520.776734]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11520.782200]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11520.789020]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11520.796318]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11550.607865] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) soaked-MDT0000: extended recovery timer reached hard limit: 900, extend: 1
            [11550.622272] Lustre: 4736:0:(ldlm_lib.c:1863:extend_recovery_timer()) Skipped 31 previous similar messages
            [11640.796481] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11640.804564] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11640.813320] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            [11640.821289] Call Trace:
            [11640.824059]  [<ffffffffa4adb4c0>] ? wake_up_state+0x20/0x20
            [11640.830319]  [<ffffffffa5180a09>] schedule+0x29/0x70
            [11640.835909]  [<ffffffffc0ad11ad>] lbug_with_loc+0x7d/0xa0 [libcfs]
            [11640.842839]  [<ffffffffc14bd168>] lod_sub_prep_llog+0x708/0x783 [lod]
            [11640.850079]  [<ffffffffc1479ac3>] lod_sub_recovery_thread+0x263/0xc10 [lod]
            [11640.857882]  [<ffffffffc1479860>] ? lod_connect_to_osd+0xc00/0xc00 [lod]
            [11640.865394]  [<ffffffffa4ac61f1>] kthread+0xd1/0xe0
            [11640.870885]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11640.877723]  [<ffffffffa518dd37>] ret_from_fork_nospec_begin+0x21/0x21
            [11640.885019]  [<ffffffffa4ac6120>] ? insert_kthread_work+0x40/0x40
            [11760.885726] INFO: task lod0000_rec0001:4733 blocked for more than 120 seconds.
            [11760.893864] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
            [11760.902634] lod0000_rec0001 D ffff9fe7563b5230     0  4733      2 0x00000080
            
            pjones Peter Jones added a comment -

            bzzz I believe that the testing has been delayed due to hardware issues


            bzzz Alex Zhuravlev added a comment -

            Hello, any updates on this issue?


            bzzz Alex Zhuravlev added a comment -

            sarah I think you should try with the recent master, which has LU-13402.

            sarah Sarah Liu added a comment -

            Restarted the test; not seeing the LBUG, but MDS failover still failed. The secondary MDS didn't fail back the device. Please check the two attachments ending with 051120: soak-11.log-051120 and trace-s-11-051120.

            sarah Sarah Liu added a comment -

            OK, I will restart the tests and post logs.
            The quoted log seems hardware-related, which is not expected during the test.


            bzzz Alex Zhuravlev added a comment -

            Sorry, it's not quite enough information.
            It would be very helpful if you could start the test and then grab logs (let's start with messages and/or consoles) from all the nodes.
            One interesting thing from the attached log:

            [ 1279.175117] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626abc0)
            [ 1279.182085] sd 0:0:1:1: attempting task abort! scmd(ffff99512626aa00)
            [ 1279.189301] sd 0:0:1:1: [sdi] tag#96 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 90 00 00 00 08 00 00
            [ 1279.199423] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            [ 1279.208168] scsi target0:0:1: enclosure logical id(0x500605b005d6e9a0), slot(3) 
            [ 1279.367751] sd 0:0:1:1: task abort: SUCCESS scmd(ffff99512626aa00)
            [ 1279.374697] sd 0:0:1:1: attempting task abort! scmd(ffff99512626a840)
            [ 1279.381918] sd 0:0:1:1: [sdi] tag#95 CDB: Write(16) 8a 00 00 00 00 00 02 a8 01 70 00 00 00 08 00 00
            [ 1279.392037] scsi target0:0:1: handle(0x0009), sas_address(0x50080e52ff4f0004), phy(0)
            

            I guess this shouldn't happen during this test?
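
            For reference, a minimal sketch of collecting the requested messages/console logs from the servers is below; the node list and file naming are illustrative assumptions, not the soak framework's actual tooling.

            #!/bin/bash
            # Hedged sketch: gather system logs from each server after a test run.
            # NODES and the naming convention are assumptions for illustration only.
            NODES="soak-8 soak-9 soak-10 soak-11"
            STAMP=$(date +%Y%m%d)

            for n in $NODES; do
                # system log that contains the Lustre console messages
                ssh "$n" cat /var/log/messages > "messages.${n}.${STAMP}"
                # current kernel ring buffer as well
                ssh "$n" dmesg > "dmesg.${n}.${STAMP}"
            done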

            sarah Sarah Liu added a comment -

            There are 2 kinds of MDS fault injections; I think when the crash happened, it was in the middle of mds_failover (sketched below).

            1. mds1 failover
               reboot mds1
               mount the disks on the failover pair, mds2
               after mds1 is up, fail the disks back to mds1

            2. mds restart
               similar to mds failover, except the disks are not mounted on the failover pair; instead, wait and mount the disks back once the server is up again
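
            For illustration, the mds1 failover sequence above might look roughly like the following sketch; the hostnames, device path, and mount point are assumptions, not the actual soak configuration.

            #!/bin/bash
            # Hedged sketch of the "mds1 failover" fault injection; soak-8/soak-9,
            # the device path and the mount point are illustrative assumptions only.
            PRIMARY=soak-8                  # mds1
            SECONDARY=soak-9                # its HA failover pair (mds2)
            DEV=/dev/mapper/soaked-mdt1     # shared MDT device (assumed)
            MNT=/mnt/soaked-mdt1

            ssh "$PRIMARY" reboot -f || true                   # take down mds1
            ssh "$SECONDARY" mount -t lustre "$DEV" "$MNT"     # mount the MDT on the failover pair
            until ssh -o ConnectTimeout=5 "$PRIMARY" true; do  # wait for mds1 to come back
                sleep 10
            done
            ssh "$SECONDARY" umount "$MNT"                     # fail the disk back to mds1
            ssh "$PRIMARY" mount -t lustre "$DEV" "$MNT"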


            bzzz Alex Zhuravlev added a comment -

            Thanks. Looking at the logs, there were lots of invalidations in OSP, which shouldn't be common; a regular failover shouldn't cause this.
            Can you please explain what the test is doing?

            sarah Sarah Liu added a comment -

            I just uploaded the lustre log and trace of soak-8, with panic_on_lbug=0. Please let me know if anything else is needed.


            bzzz Alex Zhuravlev added a comment -

            sarah I don't think there is any relation here. I think you can either modify the source or set panic_on_lbug=0, either in the scripts or in the modules conf file.
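
            For reference, two hedged ways to do that; the parameter name is assumed to be libcfs_panic_on_lbug in the libcfs module, so please verify against the installed build.

            # persistently, via the modules conf file (assumed parameter name):
            echo 'options libcfs libcfs_panic_on_lbug=0' >> /etc/modprobe.d/lustre.conf

            # or at runtime, on a node where libcfs is already loaded:
            echo 0 > /sys/module/libcfs/parameters/libcfs_panic_on_lbug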


            People

              Assignee: bzzz Alex Zhuravlev
              Reporter: sarah Sarah Liu
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved: