[LU-5573] Test timeout conf-sanity test_41c Created: 02/Sep/14  Updated: 01/May/15  Resolved: 27/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0, Lustre 2.5.4

Type: Bug Priority: Critical
Reporter: Maloo Assignee: Bruno Faccini (Inactive)
Resolution: Fixed Votes: 0
Labels: MB

Issue Links:
Duplicate
is duplicated by LU-5736 (mgc_request.c:745:mgc_fs_setup()) AS... Resolved
Related
is related to LU-5299 osd_start() LBUG when doing parallel ... Resolved
is related to LU-6553 Recurrence of LU-5299: obd_mount_serv... Resolved
Severity: 3
Rank (Obsolete): 15545

 Description   

This issue was created by maloo for Nathaniel Clark <nathaniel.l.clark@intel.com>

This issue relates to the following test suite run:
https://testing.hpdd.intel.com/test_sets/4e98188c-1fe2-11e4-8610-5254006e85c2
https://testing.hpdd.intel.com/test_sets/6798b742-32d9-11e4-aefc-5254006e85c2
https://testing.hpdd.intel.com/test_sets/a8565b52-3286-11e4-aefc-5254006e85c2

The sub-test test_41c failed with the following error:

test failed to respond and timed out

Info required for matching: conf-sanity 41c
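For anyone trying to reproduce this locally, the sub-test can be run in isolation through the usual conf-sanity harness. The sketch below is an assumption about a typical Lustre test setup (the tests directory path and the ONLY variable are the conventional test-framework knobs), not something taken from this report:

# Run only conf-sanity test_41c against an already configured test cluster.
# Adjust the path to your lustre/tests directory and your local cfg/ settings.
cd /usr/lib64/lustre/tests
ONLY=41c sh conf-sanity.sh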



 Comments   
Comment by Nathaniel Clark [ 02/Sep/14 ]

This test was introduced by patch http://review.whamcloud.com/11139 and also failed during autotest of that patch:

https://testing.hpdd.intel.com/test_sets/4e98188c-1fe2-11e4-8610-5254006e85c2

Comment by Bruno Faccini (Inactive) [ 02/Sep/14 ]

Hello Nathaniel,
I believe that "https://testing.hpdd.intel.com/test_sets/4e98188c-1fe2-11e4-8610-5254006e85c2" was a different problem (and LBUG!) that only occurred while I was trying to introduce a delay to better ensure a race between multiple concurrent MDT/OST mounts.

BTW, it seems that the way I delay the 1st mount still has other side effects, causing these new failures in auto-tests.
I think I will need to remove it and simply run the new mount attempt as quickly as possible to trigger the race.

Will push a patch soon.
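For context, the scenario test_41c exercises can be sketched roughly as below. This is only an illustration reconstructed from the commands visible in the MDS log quoted later in this ticket (fail_loc 0x703, the device and mount-point names); it is not the actual conf-sanity code:

# Illustrative only: delay the first mount via a fail_loc so that a second,
# concurrent mount of the same target races against it; exactly one of the
# two mounts is expected to succeed.
lctl set_param fail_loc=0x703                       # make the first mount stall in the kernel
mkdir -p /mnt/mds1
mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1 &    # first mount, held up by the fail_loc
pid1=$!
lctl set_param fail_loc=0x0                         # clear the fail_loc
mount -t lustre /dev/lvm-Role_MDS/P1 /mnt/mds1 &    # concurrent second mount of the same target
pid2=$!
wait $pid1; wait $pid2
# The test then checks that only one mount made it into /proc/mounts.
[ "$(grep -c /mnt/mds1' ' /proc/mounts)" -eq 1 ] || echo "unexpected mount count"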

Comment by Bruno Faccini (Inactive) [ 04/Sep/14 ]

Even though extensive testing of my patch for LU-5299 did not show any problem, it seems that under some auto-test conditions/configurations the conf-sanity/test_41c sub-test, which was introduced by that patch, triggers quite frequent MDS/OSS LBUGs.

Comment by Bruno Faccini (Inactive) [ 04/Sep/14 ]

I checked conf-sanity/test_41c failures since my patch landed and found about 15 failures in about a month. The less frequent MDS LBUG during concurrent MDT mounts/starts in test_41c always looks like:

19:58:50:Lustre: DEBUG MARKER: == conf-sanity test 41c: concurent mounts of MDT/OST should all fail but one == 01:58:15 (1409623095)
19:58:50:Lustre: DEBUG MARKER: grep -c /mnt/mds1' ' /proc/mounts
19:58:50:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
19:58:50:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x703
19:58:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
19:58:50:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
19:58:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
19:58:50:LDISKFS-fs (dm-0): mounted filesystem with ordered data mode. quota=on. Opts: 
19:58:50:LustreError: 27017:0:(fail.c:133:__cfs_fail_timeout_set()) cfs_fail_timeout id 703 sleeping for 40000ms
19:58:50:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x0
19:58:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1
19:58:50:Lustre: DEBUG MARKER: test -b /dev/lvm-Role_MDS/P1
19:58:50:Lustre: DEBUG MARKER: mkdir -p /mnt/mds1; mount -t lustre   		                   /dev/lvm-Role_MDS/P1 /mnt/mds1
19:58:50:LustreError: 15d-9: The MGS service was already started from server
19:58:50:LustreError: 27234:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
19:58:50:LustreError: 27234:0:(obd_mount_server.c:1443:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
19:58:50:LustreError: 27234:0:(obd_mount_server.c:1473:server_put_super()) no obd lustre-MDT0000
19:58:50:LustreError: 27234:0:(obd_mount_server.c:135:server_deregister_mount()) lustre-MDT0000 not registered
19:58:50:Lustre: MGS: Not available for connect from 0@lo (stopping)
19:58:50:LustreError: 26988:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff88006abf3400 x1478095423936308/t0(0) o253->MGC10.1.6.34@tcp@0@lo:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
19:58:50:LustreError: 26988:0:(obd_mount_server.c:1140:server_register_target()) lustre-MDT0000: error registering with the MGS: rc = -5 (not fatal)
19:58:50:LustreError: 26988:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880069a1f400 x1478095423936316/t0(0) o101->MGC10.1.6.34@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
19:58:50:LustreError: 26988:0:(client.c:1052:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff880069a1f800 x1478095423936324/t0(0) o101->MGC10.1.6.34@tcp@0@lo:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
19:58:50:LustreError: 15c-8: MGC10.1.6.34@tcp: The configuration from log 'lustre-MDT0000' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
19:58:50:LustreError: 26988:0:(obd_mount_server.c:1274:server_start_targets()) failed to start server lustre-MDT0000: -5
19:58:50:LustreError: 26988:0:(obd_mount_server.c:1716:server_fill_super()) Unable to start targets: -5
19:58:50:LustreError: 26988:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lustre-MDT0000-lwp-MDT0000: Can't end config log lustre-client.
19:58:50:LustreError: 26988:0:(obd_mount_server.c:1443:server_put_super()) lustre-MDT0000: failed to disconnect lwp. (rc=-2)
19:59:12:LustreError: 26988:0:(obd_mount_server.c:1473:server_put_super()) no obd lustre-MDT0000
19:59:12:LustreError: 26988:0:(obd_config.c:626:class_cleanup()) OBD 1 already stopping
19:59:12:LustreError: 26988:0:(obd_config.c:585:class_detach()) OBD device 1 still set up
19:59:12:LustreError: 26988:0:(obd_mount.c:1289:lustre_fill_super()) Unable to mount /dev/mapper/lvm--Role_MDS-P1 (-5)
19:59:12:Lustre: DEBUG MARKER: lctl set_param -n mdt.lustre*.enable_remote_dir=1
19:59:12:LustreError: 27017:0:(fail.c:137:__cfs_fail_timeout_set()) cfs_fail_timeout id 703 awake
19:59:12:LustreError: 27017:0:(obd_class.h:1008:obd_connect()) Device 1 not setup
19:59:12:Lustre: 27017:0:(service.c:2031:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20:20s); client may timeout.  req@ffff880058849050 x1478095423936304/t0(0) o250-><?>@<?>:0/0 lens 400/264 e 0 to 0 dl 1409623123 ref 1 fl Complete:/0/0 rc -19/-19
19:59:12:LNet: Service thread pid 27017 stopped after 40.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
19:59:12:LustreError: 27234:0:(genops.c:1570:obd_exports_barrier()) ASSERTION( list_empty(&obd->obd_exports) ) failed: 
19:59:12:LustreError: 27234:0:(genops.c:1570:obd_exports_barrier()) LBUG
19:59:12:Pid: 27234, comm: mount.lustre
19:59:12:
19:59:12:Call Trace:
19:59:12: [<ffffffffa0b29895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
19:59:12: [<ffffffffa0b29e97>] lbug_with_loc+0x47/0xb0 [libcfs]
19:59:12: [<ffffffffa0c5794a>] obd_exports_barrier+0x16a/0x170 [obdclass]
19:59:12: [<ffffffffa0474a06>] mgs_device_fini+0xf6/0x5a0 [mgs]
19:59:12: [<ffffffff81510ee2>] ? _spin_lock+0x12/0x30
19:59:12: [<ffffffffa0c84c07>] class_cleanup+0x577/0xda0 [obdclass]
19:59:12: [<ffffffffa0c59b36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:59:12: [<ffffffffa0c864ec>] class_process_config+0x10bc/0x1c80 [obdclass]
19:59:12: [<ffffffffa0b34d98>] ? libcfs_log_return+0x28/0x40 [libcfs]
19:59:12: [<ffffffffa0c7fd51>] ? lustre_cfg_new+0x391/0x7e0 [obdclass]
19:59:12: [<ffffffffa0c87229>] class_manual_cleanup+0x179/0x6f0 [obdclass]
19:59:12: [<ffffffffa0c59b36>] ? class_name2dev+0x56/0xe0 [obdclass]
19:59:12: [<ffffffffa0cbc0cd>] server_put_super+0x46d/0xf00 [obdclass]
19:59:12: [<ffffffffa0cc0a88>] server_fill_super+0x668/0x1580 [obdclass]
19:59:12: [<ffffffffa0c91958>] lustre_fill_super+0x1d8/0x530 [obdclass]
19:59:12: [<ffffffffa0c91780>] ? lustre_fill_super+0x0/0x530 [obdclass]
19:59:12: [<ffffffff811845df>] get_sb_nodev+0x5f/0xa0
19:59:12: [<ffffffffa0c89135>] lustre_get_sb+0x25/0x30 [obdclass]
19:59:12: [<ffffffff81183c1b>] vfs_kern_mount+0x7b/0x1b0
19:59:12: [<ffffffff81183dc2>] do_kern_mount+0x52/0x130
19:59:12: [<ffffffff811a3f82>] do_mount+0x2d2/0x8d0
19:59:12: [<ffffffff811a4610>] sys_mount+0x90/0xe0
19:59:12: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

and the more frequent OSS LBUG during concurrent OST mounts/starts in test_41c always looks like:

04:19:32:Lustre: DEBUG MARKER: == conf-sanity test 41c: concurent mounts of MDT/OST should all fail but one == 10:19:01 (1409739541)
04:19:32:Lustre: DEBUG MARKER: grep -c /mnt/ost1' ' /proc/mounts
04:19:32:Lustre: DEBUG MARKER: lsmod | grep lnet > /dev/null && lctl dl | grep ' ST '
04:19:32:Lustre: DEBUG MARKER: ! zpool list -H lustre-ost1 >/dev/null 2>&1 ||
04:19:32:			grep -q ^lustre-ost1/ /proc/mounts ||
04:19:32:			zpool export  lustre-ost1
04:19:32:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x703
04:19:32:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1
04:19:32:Lustre: DEBUG MARKER: zpool list -H lustre-ost1 >/dev/null 2>&1 ||
04:19:32:			zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost1
04:19:32:Lustre: DEBUG MARKER: lctl set_param fail_loc=0x0
04:19:32:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1
04:19:32:Lustre: DEBUG MARKER: zpool list -H lustre-ost1 >/dev/null 2>&1 ||
04:19:32:			zpool import -f -o cachefile=none -d /dev/lvm-Role_OSS lustre-ost1
04:19:32:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre   		                   lustre-ost1/ost1 /mnt/ost1
04:19:32:Lustre: DEBUG MARKER: mkdir -p /mnt/ost1; mount -t lustre   		                   lustre-ost1/ost1 /mnt/ost1
04:19:32:LustreError: 31255:0:(obd_mount_server.c:1753:server_fill_super()) Unable to start osd on lustre-ost1/ost1: -114
04:19:32:LustreError: 31255:0:(obd_mount.c:1340:lustre_fill_super()) Unable to mount  (-114)
04:19:32:LustreError: 31163:0:(llog_osd.c:918:llog_osd_open()) ASSERTION( dt ) failed: 
04:19:32:LustreError: 31163:0:(llog_osd.c:918:llog_osd_open()) LBUG
04:19:32:Pid: 31163, comm: mount.lustre
04:19:32:
04:19:32:Call Trace:
04:19:32: [<ffffffffa05ef895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
04:19:32: [<ffffffffa05efe97>] lbug_with_loc+0x47/0xb0 [libcfs]
04:19:32: [<ffffffffa071c574>] llog_osd_open+0x844/0xb30 [obdclass]
04:19:32: [<ffffffffa0600181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
04:19:32: [<ffffffffa070a1da>] llog_open+0xba/0x2c0 [obdclass]
04:19:32: [<ffffffffa070dd01>] llog_backup+0x61/0x500 [obdclass]
04:19:32: [<ffffffff8128daa0>] ? sprintf+0x40/0x50
04:19:32: [<ffffffffa0ee581d>] mgc_process_log+0x12fd/0x1970 [mgc]
04:19:32: [<ffffffffa0edf260>] ? mgc_blocking_ast+0x0/0x810 [mgc]
04:19:32: [<ffffffffa096d710>] ? ldlm_completion_ast+0x0/0x930 [ptlrpc]
04:19:32: [<ffffffffa0ee6da8>] mgc_process_config+0x658/0x1210 [mgc]
04:19:32: [<ffffffffa0750a1f>] lustre_process_log+0x20f/0xad0 [obdclass]
04:19:32: [<ffffffffa0600181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
04:19:32: [<ffffffffa05fa3a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
04:19:32: [<ffffffffa07857e7>] server_start_targets+0x767/0x1af0 [obdclass]
04:19:32: [<ffffffffa05fa3a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
04:19:32: [<ffffffffa0754246>] ? lustre_start_mgc+0x4b6/0x1e00 [obdclass]
04:19:32: [<ffffffffa0600181>] ? libcfs_debug_msg+0x41/0x50 [libcfs]
04:19:32: [<ffffffffa074bfa0>] ? class_config_llog_handler+0x0/0x18c0 [obdclass]
04:19:32: [<ffffffffa078aa35>] server_fill_super+0xc95/0x1740 [obdclass]
04:19:32: [<ffffffffa05fa3a8>] ? libcfs_log_return+0x28/0x40 [libcfs]
04:19:32: [<ffffffffa0755d68>] lustre_fill_super+0x1d8/0x550 [obdclass]
04:19:32: [<ffffffffa0755b90>] ? lustre_fill_super+0x0/0x550 [obdclass]
04:19:32: [<ffffffff8118c5df>] get_sb_nodev+0x5f/0xa0
04:19:32: [<ffffffffa074d965>] lustre_get_sb+0x25/0x30 [obdclass]
04:19:32: [<ffffffff8118bc3b>] vfs_kern_mount+0x7b/0x1b0
04:19:32: [<ffffffff8118bde2>] do_kern_mount+0x52/0x130
04:19:32: [<ffffffff8119e9e2>] ? vfs_ioctl+0x22/0xa0
04:19:32: [<ffffffff811ad7bb>] do_mount+0x2fb/0x930
04:19:32: [<ffffffff811ade80>] sys_mount+0x90/0xe0
04:19:32: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
04:19:32:
Comment by Bruno Faccini (Inactive) [ 29/Sep/14 ]

The master patch, which strengthens the previous change for LU-5299 against the same issues/races during concurrent server device mounts/starts, is at http://review.whamcloud.com/12114.

Comment by Jian Yu [ 08/Oct/14 ]

One more instance on master branch:
https://testing.hpdd.intel.com/test_sets/d687f1f4-4e81-11e4-ae94-5254006e85c2

Comment by Jodi Levi (Inactive) [ 13/Oct/14 ]

Li Wei is verifying this fix.

Comment by Jian Yu [ 21/Oct/14 ]

Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12353

Comment by Peter Jones [ 21/Oct/14 ]

This fix has landed on master. Is that all that is needed to correct this issue for 2.7?

Comment by Bruno Faccini (Inactive) [ 27/Oct/14 ]

Yes Peter, I think this ticket can be closed.

Comment by Peter Jones [ 27/Oct/14 ]

ok thanks Bruno

Comment by nasf (Inactive) [ 11/Nov/14 ]

What is the plan for http://review.whamcloud.com/12353 (b2_5)?
We hit the failure on b2_5:
https://testing.hpdd.intel.com/test_sets/024235d2-683d-11e4-a449-5254006e85c2

Comment by Gerrit Updater [ 01/Dec/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12353/
Subject: LU-5573 obdclass: strengthen against concurrent server mounts
Project: fs/lustre-release
Branch: b2_5
Current Patch Set:
Commit: 684a7db576eb03ec2c74c89dabcef7991010ee11
