Details
- Type: Bug
- Resolution: Unresolved
- Priority: Minor
- Affects Version/s: Lustre 2.10.1, Lustre 2.10.2, Lustre 2.10.3, Lustre 2.10.5
- Environment: Older Version: b_ieel3_0, build 222; Newer Version: b2_10 build 26
Description
The MDS hung when trying to mount the MDT after downgrading it from 2.10.1 RC1 to b_ieel3_0 build 222.
This happens with both ldiskfs and ZFS backends.
ldiskfs:
[root@onyx-63 ~]# mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0
mount.lustre: increased /sys/block/sdb/queue/max_sectors_kb from 512 to 32767
mount.lustre: change scheduler of /sys/block/sdb
[ 2004.864821] libcfs: loading out-of-tree module taints kernel.
/queue/scheduler
[ 2004.871947] libcfs: module verification failed: signature and/or required key missing - tainting kernel
from cfq to deadline
[ 2004.888945] LNet: HW CPU cores: 32, npartitions: 4
[ 2004.898521] alg: No test for adler32 (adler32-zlib)
[ 2004.904129] alg: No test for crc32 (crc32-table)
[ 2009.942002] sha512_ssse3: Using AVX optimized SHA-512 implementation
[ 2012.937978] Lustre: Lustre: Build Version: 2.7.19.10--PRISTINE-3.10.0-514.10.2.el7_lustre.x86_64
[ 2012.975643] LNet: Added LNI 10.2.5.135@tcp [8/256/0/180]
[ 2012.981679] LNet: Accept secure, port 988
[ 2013.034692] LDISKFS-fs (sdb1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,user_xattr,errors=remount-ro,no_mbcache
[ 2065.369959] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2067.236139] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.18@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2067.255509] LustreError: Skipped 1 previous similar message
[ 2076.647185] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2090.368245] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2101.644288] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2101.663651] LustreError: Skipped 2 previous similar messages
[ 2115.367485] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2140.366559] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.51@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2140.385928] LustreError: Skipped 4 previous similar messages
[ 2160.784144] INFO: task ll_mgs_0002:21386 blocked for more than 120 seconds.
[ 2160.791943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2160.800724] ll_mgs_0002 D ffffffffa0f78660 0 21386 2 0x00000080
[ 2160.808691] ffff88040aa43a10 0000000000000046 ffff880429e04e70 ffff88040aa43fd8
[ 2160.818118] ffff88040aa43fd8 ffff88040aa43fd8 ffff880429e04e70 ffff88081df59d48
[ 2160.827408] ffff88081df59d50 7fffffffffffffff ffff880429e04e70 ffffffffa0f78660
[ 2160.836732] Call Trace:
[ 2160.840486] [<ffffffffa0f78660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
[ 2160.850251] [<ffffffff8168bd09>] schedule+0x29/0x70
[ 2160.856792] [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
[ 2160.864288] [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
[ 2160.871684] [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
[ 2160.879360] [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
[ 2160.886649] [<ffffffffa0f78660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
[ 2160.896367] [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
[ 2160.904146] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[ 2160.911392] [<ffffffffa095a047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
[ 2160.920354] [<ffffffffa095a414>] llog_process+0x14/0x20 [obdclass]
[ 2160.928335] [<ffffffffa0f80c9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
[ 2160.936899] [<ffffffffa0f80db2>] mgs_check_index+0x62/0x2f0 [mgs]
[ 2160.944786] [<ffffffffa0f689fe>] mgs_target_reg+0x38e/0x1320 [mgs]
[ 2160.952844] [<ffffffffa0c74adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[ 2160.961512] [<ffffffffa0c18b8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 2160.971041] [<ffffffffa084fce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[ 2160.979619] [<ffffffffa0c15c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[ 2160.988168] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
[ 2160.995664] [<ffffffffa0c1c4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[ 2161.003618] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
[ 2161.010710] [<ffffffffa0c1b8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[ 2161.020100] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 2161.026468] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2161.034663] [<ffffffff81696b98>] ret_from_fork+0x58/0x90
[ 2161.041573] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2176.641784] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.19@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2176.662876] LustreError: Skipped 5 previous similar messages
[ 2214.124574] LNet: Service thread pid 21386 was inactive for 200.36s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
[ 2214.145286] Pid: 21386, comm: ll_mgs_0002
[ 2214.150532]
[ 2214.150532] Call Trace:
[ 2214.156430] [<ffffffffa0f78660>] ? mgs_fsdb_handler+0x0/0x10a0 [mgs]
[ 2214.164352] [<ffffffff8168bd09>] schedule+0x29/0x70
[ 2214.170643] [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
[ 2214.177911] [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
[ 2214.185069] [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
[ 2214.192491] [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
[ 2214.199520] [<ffffffffa0f78660>] ? mgs_fsdb_handler+0x0/0x10a0 [mgs]
[ 2214.207414] [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
[ 2214.214917] [<ffffffff810c5080>] ? default_wake_function+0x0/0x20
[ 2214.222514] [<ffffffffa095a047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
[ 2214.231181] [<ffffffffa095a414>] llog_process+0x14/0x20 [obdclass]
[ 2214.238870] [<ffffffffa0f80c9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
[ 2214.247145] [<ffffffffa0f80db2>] mgs_check_index+0x62/0x2f0 [mgs]
[ 2214.254738] [<ffffffffa0f689fe>] mgs_target_reg+0x38e/0x1320 [mgs]
[ 2214.262432] [<ffffffffa0c74adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[ 2214.270794] [<ffffffffa0c18b8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 2214.280035] [<ffffffffa084fce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[ 2214.288315] [<ffffffffa0c15c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[ 2214.296584] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
[ 2214.303799] [<ffffffffa0c1c4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[ 2214.311475] [<ffffffff81029569>] ? __switch_to+0xd9/0x4c0
[ 2214.318290] [<ffffffffa0c1b8b0>] ? ptlrpc_main+0x0/0x1f60 [ptlrpc]
[ 2214.325964] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 2214.332082] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 2214.338288] [<ffffffff81696b98>] ret_from_fork+0x58/0x90
[ 2214.344981] [<ffffffff810b0630>] ? kthread+0x0/0xe0
[ 2214.351182]
ZFS:
[root@onyx-78 ~]# mount -t lustre -o acl,user_xattr lustre-mdt1/mdt1 /mnt/mds0
[ 2214.245461] LNet: HW CPU cores: 72, npartitions: 8
[ 2214.253142] alg: No test for adler32 (adler32-zlib)
[ 2214.258723] alg: No test for crc32 (crc32-table)
[ 2219.277933] sha512_ssse3: Using AVX2 optimized SHA-512 implementation
[ 2222.279787] Lustre: Lustre: Build Version: 2.7.19.10--PRISTINE-3.10.0-514.10.2.el7_lustre.x86_64
[ 2222.309748] LNet: Added LNI 10.2.2.50@tcp [8/512/0/180]
[ 2222.315678] LNet: Accept secure, port 988
[ 2273.432429] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2275.484816] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.16@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2275.504279] LustreError: Skipped 1 previous similar message
[ 2298.432064] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2298.451527] LustreError: Skipped 1 previous similar message
[ 2300.482820] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.4.16@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2300.502281] LustreError: Skipped 1 previous similar message
[ 2323.431994] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2323.451546] LustreError: Skipped 1 previous similar message
[ 2348.432062] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2348.455763] LustreError: Skipped 3 previous similar messages
[ 2373.432088] LustreError: 137-5: lustre-MDT0000_UUID: not available for connect from 10.2.2.52@tcp (no target). If you are running an HA pair check that the target is mounted on the other server.
[ 2373.455409] LustreError: Skipped 3 previous similar messages
[ 2401.707691] INFO: task ll_mgs_0001:32818 blocked for more than 120 seconds.
[ 2401.717283] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2401.727712] ll_mgs_0001 D ffff8810583fae30 0 32818 2 0x00000080
[ 2401.737119] ffff88103479bba0 0000000000000046 ffff881058a91f60 ffff88103479bfd8
[ 2401.746898] ffff88103479bfd8 ffff88103479bfd8 ffff881058a91f60 ffff8810583fae28
[ 2401.756643] ffff8810583fae2c ffff881058a91f60 00000000ffffffff ffff8810583fae30
[ 2401.766335] Call Trace:
[ 2401.770395] [<ffffffff8168cdf9>] schedule_preempt_disabled+0x29/0x70
[ 2401.778859] [<ffffffff8168aa55>] __mutex_lock_slowpath+0xc5/0x1c0
[ 2401.786989] [<ffffffff81689ebf>] mutex_lock+0x1f/0x2f
[ 2401.793954] [<ffffffffa11d829f>] mgs_fsc_attach+0x22f/0x600 [mgs]
[ 2401.802032] [<ffffffffa11b247a>] mgs_llog_open+0x1fa/0x430 [mgs]
[ 2401.810052] [<ffffffffa0fd6adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[ 2401.818939] [<ffffffffa0f7ab8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 2401.828681] [<ffffffffa0be7ce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[ 2401.837460] [<ffffffffa0f77c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[ 2401.846192] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
[ 2401.853882] [<ffffffffa0f7e4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[ 2401.862057] [<ffffffffa0f7d8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[ 2401.871654] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 2401.878227] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2401.886661] [<ffffffff81696b98>] ret_from_fork+0x58/0x90
[ 2401.893816] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2401.902243] INFO: task ll_mgs_0002:32819 blocked for more than 120 seconds.
[ 2401.911156] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2401.921038] ll_mgs_0002 D ffffffffa11c4660 0 32819 2 0x00000080
[ 2401.930084] ffff88103523ba10 0000000000000046 ffff881058a94e70 ffff88103523bfd8
[ 2401.939561] ffff88103523bfd8 ffff88103523bfd8 ffff881058a94e70 ffff88085a687748
[ 2401.949014] ffff88085a687750 7fffffffffffffff ffff881058a94e70 ffffffffa11c4660
[ 2401.958448] Call Trace:
[ 2401.962252] [<ffffffffa11c4660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
[ 2401.972072] [<ffffffff8168bd09>] schedule+0x29/0x70
[ 2401.978673] [<ffffffff81689759>] schedule_timeout+0x239/0x2d0
[ 2401.986230] [<ffffffff810c7f05>] ? sched_clock_cpu+0x85/0xc0
[ 2401.993663] [<ffffffff810c1b15>] ? check_preempt_curr+0x75/0xa0
[ 2402.001376] [<ffffffff810c1b59>] ? ttwu_do_wakeup+0x19/0xd0
[ 2402.008685] [<ffffffffa11c4660>] ? mgs_steal_client_llog_handler+0x1110/0x1110 [mgs]
[ 2402.018404] [<ffffffff8168c0e6>] wait_for_completion+0x116/0x170
[ 2402.026180] [<ffffffff810c5080>] ? wake_up_state+0x20/0x20
[ 2402.033374] [<ffffffffa0cbc047>] llog_process_or_fork+0x1d7/0x590 [obdclass]
[ 2402.042291] [<ffffffffa0cbc414>] llog_process+0x14/0x20 [obdclass]
[ 2402.050215] [<ffffffffa11ccc9f>] mgs_find_or_make_fsdb+0x72f/0x7e0 [mgs]
[ 2402.058710] [<ffffffffa11ccdb2>] mgs_check_index+0x62/0x2f0 [mgs]
[ 2402.066494] [<ffffffffa11b49fe>] mgs_target_reg+0x38e/0x1320 [mgs]
[ 2402.074414] [<ffffffffa0fd6adb>] tgt_request_handle+0x8fb/0x11f0 [ptlrpc]
[ 2402.082981] [<ffffffffa0f7ab8b>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
[ 2402.092385] [<ffffffffa0be7ce8>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
[ 2402.100831] [<ffffffffa0f77c58>] ? ptlrpc_wait_event+0x98/0x330 [ptlrpc]
[ 2402.109235] [<ffffffff810ba2e8>] ? __wake_up_common+0x58/0x90
[ 2402.116588] [<ffffffffa0f7e4b0>] ptlrpc_main+0xc00/0x1f60 [ptlrpc]
[ 2402.124421] [<ffffffffa0f7d8b0>] ? ptlrpc_register_service+0x1070/0x1070 [ptlrpc]
[ 2402.133680] [<ffffffff810b06ff>] kthread+0xcf/0xe0
[ 2402.139934] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2402.148035] [<ffffffff81696b98>] ret_from_fork+0x58/0x90
[ 2402.154864] [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[ 2402.162962] INFO: task ll_mgs_0003:32907 blocked for more than 120 seconds.
[ 2402.171544] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2402.181108] ll_mgs_0003 D ffff8810583fae30 0 32907 2 0x00000080
[ 2402.189851] ffff88102ae2bba0 0000000000000046 ffff881047b58000 ffff88102ae2bfd8
[ 2402.199037] ffff88102ae2bfd8 ffff88102ae2bfd8 ffff881047b58000 ffff8810583fae28
[ 2402.208210] ffff8810583fae2c ffff881047b58000 00000000ffffffff ffff8810583fae30
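The hung-task reports above come from the kernel's hung-task detector. If more state is needed while the mount is stuck, the blocked-task stacks and the Lustre debug buffer can be captured on the MDS; a minimal sketch, assuming root access on the MDS node, that sysrq is enabled, and arbitrary output paths:

# Dump stacks of all tasks in uninterruptible (D) state into the kernel log,
# then save the ring buffer.
echo w > /proc/sysrq-trigger
dmesg > /tmp/blocked-task-stacks.txt

# Dump the Lustre kernel debug buffer as well.
lctl dk /tmp/lustre-debug.log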
Steps followed for the rolling upgrade/downgrade:
1. Started with all clients and servers on b_ieel3_0 build 222 and built the Lustre file system.
2. Upgraded the OSS to 2.10.1 RC1, remounted, and ran sanity.sh.
3. Upgraded the MDS to 2.10.1 RC1, remounted, and ran sanity.sh.
4. Upgraded the clients to 2.10.1 RC1, remounted, and ran sanity.sh.
5. Downgraded the clients to b_ieel3_0 build 222, remounted, and ran sanity.sh.
6. Downgraded the MDS to b_ieel3_0 build 222.
For both ldiskfs and ZFS, the mount of the MDT then hangs as shown above (a rough sketch of this downgrade/remount step follows).
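For reference, the MDS downgrade/remount step looks roughly like the sketch below. The package globs and the yum invocation are illustrative only; the exact package set and repository for b_ieel3_0 build 222 depend on the test environment.

# On the MDS node: stop the target and unload the Lustre modules.
umount /mnt/mds0
lustre_rmmod

# Downgrade the Lustre server packages to b_ieel3_0 build 222
# (illustrative; the real package list and repo depend on the install).
yum downgrade -y 'lustre*' 'kmod-lustre*'

# Remount the MDT -- this is where the hang is seen on both back ends:
mount -t lustre -o acl,user_xattr /dev/sdb1 /mnt/mds0          # ldiskfs
mount -t lustre -o acl,user_xattr lustre-mdt1/mdt1 /mnt/mds0   # zfs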
Issue Links
- is related to:
  - LU-10039 ioctl error after downgrade (Open)
  - LU-7050 llog_skip_over skip the record by too little minimum record size. (Resolved)
  - LU-9764 recovery-double-scale_pairwise_fail test failed: mount.lustre: mount /dev/vdb at /mnt/mds3 failed: Bad file descriptor (Resolved)