[LU-2200] Test failure on test suite conf-sanity, subtest test_32a Created: 16/Oct/12 Updated: 15/Aug/13 Resolved: 11/Jul/13 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.4.0, Lustre 2.4.1 |
| Fix Version/s: | Lustre 2.4.1, Lustre 2.5.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Nathaniel Clark |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 5244 |
| Description |
|
This issue was created by maloo for Oleg Drokin <green@whamcloud.com>. This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/1e253cd6-17c2-11e2-a41f-52540035b04c. The sub-test test_32a failed with the following error:
Info required for matching: conf-sanity 32a |
| Comments |
| Comment by Alex Zhuravlev [ 17/Oct/12 ] |
|
conf-sanity/32 does not work with IB. |
| Comment by Oleg Drokin [ 07/Nov/12 ] |
|
This bug seems to heavily affect our testing, so I am upgrading it to blocker status |
| Comment by Nathaniel Clark [ 16/Nov/12 ] |
|
Patch to skip test |
| Comment by Andreas Dilger [ 16/Nov/12 ] |
|
Once the workaround patch lands, this bug needs to be kept open until the actual fix is landed to allow testing o2iblnd. This might be fixed by |
| Comment by Li Wei (Inactive) [ 16/Nov/12 ] |
|
Is the "exclude=t32fs-OST0000" mount option functioning correctly? I thought that if it is, the test should work with arbitrary NIDs. |
| Comment by Peter Jones [ 26/Nov/12 ] |
|
Landed for 2.4 |
| Comment by Andreas Dilger [ 23/Apr/13 ] |
|
This test needs to be fixed so that it is able to run on IB. Unfortunately, skipping conf-sanity test_32[ab] allowed patch http://review.whamcloud.com/4819 to land, but this caused https://maloo.whamcloud.com/test_sets/b38bfebe-a862-11e2-9f50-52540035b04c |
| Comment by Nathaniel Clark [ 26/Apr/13 ] |
|
Log from MDS console trying to initially mount:
06:47:50:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
06:48:02:Lustre: DEBUG MARKER: mount -t lustre -o loop,exclude=t32fs-OST0000 /tmp/t32/mdt /tmp/t32/mnt/mdt
06:48:02:LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=off. Opts:
06:48:02:Lustre: MGC192.168.4.20@o2ib: Reactivating import
06:48:02:Lustre: Found index 0 for t32fs-MDT0000, updating log
06:48:02:Lustre: Modifying parameter t32fs-MDT0000-mdtlov.lov.stripesize in log t32fs-MDT0000
06:48:02:Lustre: Modifying parameter t32fs-clilov.lov.stripesize in log t32fs-client
06:48:02:Lustre: t32fs-MDT0000: used disk, loading
06:48:02:LustreError: 27989:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
06:48:02:LustreError: 27989:0:(ldlm_lib.c:418:client_obd_setup()) can't add initial connection
06:48:02:LustreError: 27989:0:(osp_dev.c:493:osp_init0()) t32fs-OST0000-osc-MDT0000: can't setup obd: -2
06:48:02:LustreError: 27989:0:(obd_config.c:572:class_setup()) setup t32fs-OST0000-osc-MDT0000 failed (-2)
06:48:02:LustreError: 27989:0:(obd_config.c:1546:class_config_llog_handler()) MGC192.168.4.20@o2ib: cfg command failed: rc = -2
06:48:02:Lustre: cmd=cf003 0:t32fs-OST0000-osc-MDT0000 1:t32fs-OST0000_UUID 2:10.10.4.12@tcp
06:48:02:LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 't32fs-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
06:48:02:LustreError: 27939:0:(obd_mount.c:1850:server_start_targets()) failed to start server t32fs-MDT0000: -2
06:48:02:LustreError: 27939:0:(obd_mount.c:2401:server_fill_super()) Unable to start targets: -2
06:48:02:LustreError: 27939:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log t32fs
06:48:02:LustreError: 27939:0:(obd_mount.c:2114:server_put_super()) t32fs-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
06:48:02:Lustre: Failing over t32fs-MDT0000
06:48:02:LustreError: 27939:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost t32fs-MDT0000-osp-MDT0000
06:48:02:LustreError: 27939:0:(obd_mount.c:2159:server_put_super()) t32fs-MDT0000: Fail to stop osp-on-ost!
06:48:02:LustreError: 27939:0:(ldlm_request.c:1181:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
06:48:02:LustreError: 27939:0:(ldlm_request.c:1811:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
06:48:02:Lustre: 27939:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1350395276/real 1350395276] req@ffff88030f5bf800 x1415992054906891/t0(0) o251->MGC192.168.4.20@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1350395282 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
06:48:02:Lustre: server umount t32fs-MDT0000 complete
06:48:02:LustreError: 27939:0:(obd_mount.c:2989:lustre_fill_super()) Unable to mount (-2)
06:48:02:Lustre: DEBUG MARKER: /usr/sbin/lctl mark conf-sanity test_32a: @@@@@@ FAIL: Mounting the MDT
|
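For anyone triaging a console log like the one above, a small script can pull out which function reported which return code. The following is a minimal sketch, not a Lustre tool: the regular expressions are guesses fitted to the entries quoted in this ticket, and the errno table is hard-coded with the Linux values for the two codes that appear here (-2 and -108).

```python
import re

# Linux errno names for the two return codes seen in the log above.
# Hard-coded on purpose so the sketch does not depend on the host platform.
ERRNO_NAMES = {2: "ENOENT", 108: "ESHUTDOWN"}

# Matches entries such as:
#   LustreError: 27989:0:(osp_dev.c:493:osp_init0()) message
ENTRY = re.compile(
    r"LustreError: (?P<pid>\d+):\d+:"
    r"\((?P<file>[\w.]+):(?P<line>\d+):(?P<func>\w+)\(\)\) (?P<msg>.*)"
)

# A negative number that is not part of a name such as t32fs-OST0000.
RC = re.compile(r"(?<!\w)-(\d+)\b")

SAMPLE = """\
06:48:02:LustreError: 27989:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
06:48:02:LustreError: 27989:0:(osp_dev.c:493:osp_init0()) t32fs-OST0000-osc-MDT0000: can't setup obd: -2
06:48:02:LustreError: 27939:0:(ldlm_request.c:1181:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
"""


def summarize(log_text):
    """Return (source file, function, errno name or None) per LustreError entry."""
    rows = []
    for raw in log_text.splitlines():
        entry = ENTRY.search(raw)
        if entry is None:
            continue  # not a LustreError entry in the expected shape
        rc = RC.search(entry.group("msg"))
        name = ERRNO_NAMES.get(int(rc.group(1)), "unknown") if rc else None
        rows.append((entry.group("file"), entry.group("func"), name))
    return rows


for row in summarize(SAMPLE):
    print(row)
```

Read this way, every -2 in the mount failure is ENOENT propagating up from client_obd_setup(), while the -108 entries come from the teardown that follows.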
| Comment by Nathaniel Clark [ 26/Apr/13 ] |
|
The exclude option to mount.lustre doesn't seem to be wired up. |
| Comment by Nathaniel Clark [ 26/Apr/13 ] |
|
Correction: exclude is wired up, but it is not working correctly. MDT debug_log:
00000020:00000001:0.0:1353824997.397533:0:15323:0:(obd_mount.c:2574:lmd_make_exclusion()) Process entered
00000020:00000010:0.0:1353824997.397535:0:15323:0:(obd_mount.c:2582:lmd_make_exclusion()) kmalloced 'exclude_list': 116 at ffff88032ec78cc0.
00000020:01000004:0.0:1353824997.397537:0:15323:0:(obd_mount.c:2597:lmd_make_exclusion()) ignoring exclude t32fs-O
00000020:00000010:0.0:1353824997.397538:0:15323:0:(obd_mount.c:2619:lmd_make_exclusion()) kfreed 'exclude_list': 116 at ffff88032ec78cc0.
|
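The "ignoring exclude t32fs-O" message suggests the target name reached the parser truncated, so it no longer looked like an OST name. The sketch below is not the lmd_make_exclusion() code; it is a hypothetical Python model of the check. It assumes that target names follow the pattern fsname-OSTxxxx with a four-digit hexadecimal index, that fsnames are at most 8 characters, and that multiple excludes are colon-separated.

```python
import re

# Assumed naming scheme: <fsname>-<OST|MDT><4 hex digits>, e.g. t32fs-OST0000.
TARGET = re.compile(
    r"^(?P<fsname>[A-Za-z0-9_]{1,8})-(?P<kind>OST|MDT)(?P<index>[0-9a-fA-F]{4})$"
)


def parse_exclude(option_value):
    """Split an exclude= value on ':' and return (accepted, ignored).

    accepted holds (fsname, kind, index) tuples; ignored holds the raw
    names that did not look like a target name.
    """
    accepted, ignored = [], []
    for name in option_value.split(":"):
        match = TARGET.match(name)
        if match:
            accepted.append(
                (match["fsname"], match["kind"], int(match["index"], 16))
            )
        else:
            ignored.append(name)
    return accepted, ignored


print(parse_exclude("t32fs-OST0000"))  # full name is accepted
print(parse_exclude("t32fs-O"))        # truncated name is ignored
```

Under these assumptions the full name parses to index 0, while the truncated string from the debug log is ignored, which matches the behaviour seen before the fix.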
| Comment by Nathaniel Clark [ 29/Apr/13 ] |
| Comment by Nathaniel Clark [ 03/May/13 ] |
|
With the most recent patch (6197,5), exclude now works, but I'm getting a different error (MDS console log):
12:40:27:Lustre: 5173:0:(obd_mount.c:837:lustre_check_exclusion()) Excluding t32fs-OST0000 (on exclusion list)
12:40:27:LustreError: 5173:0:(ldlm_lib.c:429:client_obd_setup()) can't add initial connection
12:40:27:LustreError: 5173:0:(osp_dev.c:686:osp_init0()) t32fs-OST0000-osc-MDT0000: can't setup obd: -2
12:40:27:LustreError: 5173:0:(obd_config.c:572:class_setup()) setup t32fs-OST0000-osc-MDT0000 failed (-2)
12:40:27:LustreError: 5173:0:(obd_config.c:1550:class_config_llog_handler()) MGC192.168.4.20@o2ib: cfg command failed: rc = -2
12:40:27:Lustre: cmd=cf003 0:t32fs-OST0000-osc-MDT0000 1:t32fs-OST0000_UUID 2:192.168.203.128@tcp
12:40:27:LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 't32fs-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1257:server_start_targets()) failed to start server t32fs-MDT0000: -2
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1699:server_fill_super()) Unable to start targets: -2
12:40:27:LustreError: 5123:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) t32fs-MDT0000-lwp-MDT0000: Can't end config log t32fs-client.
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1426:server_put_super()) t32fs-MDT0000: failed to disconnect lwp. (rc=-2)
|
| Comment by Nathaniel Clark [ 03/May/13 ] |
|
Adding an OST to the exclude list seems to affect only the LCFG_LOV_ADD_OBD command (turning it into LCFG_LOV_ADD_INA). From my rough reading, the problem seems to stem from the fact that the command in question is LCFG_ADD_UUID, and the NID in question is 192.168.203.129@tcp, which fails on IB. |
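To illustrate why a tcp NID recorded in the config log cannot be used on an IB-only node, here is a minimal sketch. It is not LNet code: split_nid() and usable() are hypothetical helpers, and the comparison is a plain string match on the network part with no normalization of network names.

```python
def split_nid(nid):
    """Split 'address@network' into (address, network)."""
    address, sep, network = nid.rpartition("@")
    if not sep or not address or not network:
        raise ValueError("not an address@network NID: %r" % nid)
    return address, network


def usable(nid, local_networks):
    """Return True if the NID's network is one this node is configured on."""
    return split_nid(nid)[1] in local_networks


# Hypothetical IB-only test node.
local = {"o2ib"}
print(usable("192.168.4.20@o2ib", local))    # MGS NID from the log
print(usable("192.168.203.129@tcp", local))  # OST NID baked into the config log
```

With only o2ib configured locally, the second call returns False, which corresponds to the peer lookup failing for the recorded OST NID.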
| Comment by Nathaniel Clark [ 08/May/13 ] |
|
This seems to be the order in which llog_process_thread sends LCFG commands through class_config_llog_handler():
1) LCFG_MARKER (10) - find if excluded
2) LCFG_ADD_UUID (5)
3) LCFG_ATTACH (1)
4) LCFG_SETUP (3) - ERROR
5) LCFG_LOV_ADD_OBD (d) - Only interaction with EXCLUDED flag: Change to LCFG_LOV_ADD_INA (13)
Step 5 would be where the OST is excluded by adding it as inactive, but during step 4 OBD tries to set up a connection, and that fails.
The call chain for LCFG_SETUP is:
class_setup
obd_setup
osp_device_alloc (as osp::ldto_device_alloc)
osp_init0
client_obd_setup
client_import_add_conn
import_set_conn
ptlrpc_uuid_to_connection
ptlrpc_uuid_to_peer -- Fails to find peer
|
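The ordering problem described in the comment above can be modelled in a few lines. This is a hypothetical Python simulation, not the class_config_llog_handler() logic: the command names are taken from the comment, while replay() and its peer_reachable flag are invented for illustration and stand in for the peer lookup that fails on IB.

```python
# Order of records for one OST, as listed in the comment above.
ORDER = [
    "LCFG_MARKER",
    "LCFG_ADD_UUID",
    "LCFG_ATTACH",
    "LCFG_SETUP",
    "LCFG_LOV_ADD_OBD",
]

ENOENT = 2


def replay(target, excluded, peer_reachable):
    """Replay the records for one OST and return (steps_done, rc)."""
    is_excluded = False
    done = []
    for cmd in ORDER:
        if cmd == "LCFG_MARKER":
            # Step 1: the exclusion list is consulted here.
            is_excluded = target in excluded
        elif cmd == "LCFG_SETUP" and not peer_reachable:
            # Step 4: setup tries to connect and fails, whether or not
            # the target was marked as excluded in step 1.
            done.append(cmd + " failed")
            return done, -ENOENT
        elif cmd == "LCFG_LOV_ADD_OBD" and is_excluded:
            # Step 5: the only place the exclusion flag changes behaviour.
            cmd = "LCFG_LOV_ADD_INA"
        done.append(cmd)
    return done, 0


print(replay("t32fs-OST0000", {"t32fs-OST0000"}, peer_reachable=False))
print(replay("t32fs-OST0000", {"t32fs-OST0000"}, peer_reachable=True))
```

In the first call the replay stops at LCFG_SETUP with -2 and never reaches the step that would have added the OST as inactive; in the second call the excluded OST is added via LCFG_LOV_ADD_INA as intended.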
| Comment by Keith Mannthey (Inactive) [ 15/May/13 ] |
|
I opened a possibly related LU. It is a timeout error for test_32a. |
| Comment by Nathaniel Clark [ 16/May/13 ] |
|
Because the TCP-based NIDs are in the config logs, a writeconf is needed before the existing filesystem tarballs can be mounted on IB nodes. I hope that any conf-sanity/32 test that does a writeconf will be able to work on IB, but the non-writeconf test (32a), I believe, has no chance of passing without major revisions to the Lustre stack. |
| Comment by Andreas Dilger [ 16/May/13 ] |
|
What about using "lctl replace_nids" to fix up the configuration for IB testing? |
| Comment by Nathaniel Clark [ 18/May/13 ] |
|
Testing lctl replace_nids now. Skipping the non-writeconf tests does allow 32b to run successfully on IB nodes (see patch set 10). |
| Comment by Nathaniel Clark [ 20/May/13 ] |
|
Because of |
| Comment by Nathaniel Clark [ 11/Jul/13 ] |
|
Patch merged to master (post 2.4.51) |
| Comment by Jinshan Xiong (Inactive) [ 05/Aug/13 ] |
|
Reproduced again: https://maloo.whamcloud.com/test_sets/2b3a0a00-fcc8-11e2-9fdb-52540035b04c
From the debug log of client 2: https://maloo.whamcloud.com/test_logs/b0e58a3a-fcc8-11e2-9fdb-52540035b04c/show_text
10000000:01000000:1.0:1375567124.192540:0:27905:0:(mgc_request.c:1820:mgc_process_log()) MGC192.168.4.20@o2ib: configuration from log 'lustre-sptlrpc' failed (-2).
Is it the same problem? |
| Comment by Jian Yu [ 15/Aug/13 ] |
|
Patch http://review.whamcloud.com/6197 was cherry-picked to Lustre b2_4 branch. |