[LU-2200] Test failure on test suite conf-sanity, subtest test_32a Created: 16/Oct/12  Updated: 15/Aug/13  Resolved: 11/Jul/13

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.4.0, Lustre 2.4.1
Fix Version/s: Lustre 2.4.1, Lustre 2.5.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Nathaniel Clark
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Blocker
is blocked by LU-3357 lctl replace_nids loses Target UUID f... Resolved
Related
is related to LU-1430 Changing network address without --wr... Resolved
is related to LU-1997 'exclude' mount option does not work ... Resolved
Severity: 3
Rank (Obsolete): 5244

 Description   

This issue was created by maloo for Oleg Drokin <green@whamcloud.com>

This issue relates to the following test suite run: https://maloo.whamcloud.com/test_sets/1e253cd6-17c2-11e2-a41f-52540035b04c.

The sub-test test_32a failed with the following error:

CMD: client-20-ib mount -t lustre -o loop,exclude=t32fs-OST0000 /tmp/t32/mdt /tmp/t32/mnt/mdt
client-20-ib: mount.lustre: mount /dev/loop0 at /tmp/t32/mnt/mdt failed: No such file or directory
client-20-ib: Is the MGS specification correct?
client-20-ib: Is the filesystem name correct?
client-20-ib: If upgrading, is the copied client log valid? (see upgrade docs)
conf-sanity test_32a: @@@@@@ FAIL: Mounting the MDT
test_32a failed with 1

Info required for matching: conf-sanity 32a



 Comments   
Comment by Alex Zhuravlev [ 17/Oct/12 ]

conf-sanity/32 does not work with IB.

Comment by Oleg Drokin [ 07/Nov/12 ]

This bug seems to heavily affect our testing, so I am upgrading it to blocker status

Comment by Nathaniel Clark [ 16/Nov/12 ]

Patch to skip test
http://review.whamcloud.com/4607

Comment by Andreas Dilger [ 16/Nov/12 ]

Once the workaround patch lands, this bug needs to be kept open until the actual fix is landed to allow testing o2iblnd. This might be fixed by LU-1430 "lctl replace_nids" command (http://review.whamcloud.com/2896) allowing it to replace any NIDs in the configuration. That would be a good test of both the "replace_nids" code, and would allow this upgrade test to be run again.

Comment by Li Wei (Inactive) [ 16/Nov/12 ]

Is the "exclude=t32fs-OST0000" mount option functioning correctly? I thought that if it were, the test should work with arbitrary NIDs.

Comment by Peter Jones [ 26/Nov/12 ]

Landed for 2.4

Comment by Andreas Dilger [ 23/Apr/13 ]

This test needs to be fixed so that it is able to run on IB. Unfortunately, skipping conf-sanity test_32[ab] allowed patch http://review.whamcloud.com/4819 to land, and that patch caused LU-3198, which fails every test_32[ab] run that is not on IB.

https://maloo.whamcloud.com/test_sets/b38bfebe-a862-11e2-9f50-52540035b04c

Comment by Nathaniel Clark [ 26/Apr/13 ]

Log from MDS console trying to initially mount:

06:47:50:Lustre: DEBUG MARKER: PATH=/usr/lib64/lustre/tests:/usr/lib/lustre/tests:/usr/lib64/lustre/tests:/opt/iozone/bin:/usr/lib64/lustre/tests//usr/lib64/lustre/tests:/usr/lib64/lustre/tests:/usr/lib64/lustre/tests/../utils:/opt/iozone/bin:/usr/lib64/lustre/tests/mpi:/usr/lib64/lust
06:48:02:Lustre: DEBUG MARKER: mount -t lustre -o loop,exclude=t32fs-OST0000 /tmp/t32/mdt /tmp/t32/mnt/mdt
06:48:02:LDISKFS-fs (loop0): mounted filesystem with ordered data mode. quota=off. Opts: 
06:48:02:Lustre: MGC192.168.4.20@o2ib: Reactivating import
06:48:02:Lustre: Found index 0 for t32fs-MDT0000, updating log
06:48:02:Lustre: Modifying parameter t32fs-MDT0000-mdtlov.lov.stripesize in log t32fs-MDT0000
06:48:02:Lustre: Modifying parameter t32fs-clilov.lov.stripesize in log t32fs-client
06:48:02:Lustre: t32fs-MDT0000: used disk, loading
06:48:02:LustreError: 27989:0:(sec_config.c:1024:sptlrpc_target_local_copy_conf()) missing llog context
06:48:02:LustreError: 27989:0:(ldlm_lib.c:418:client_obd_setup()) can't add initial connection
06:48:02:LustreError: 27989:0:(osp_dev.c:493:osp_init0()) t32fs-OST0000-osc-MDT0000: can't setup obd: -2
06:48:02:LustreError: 27989:0:(obd_config.c:572:class_setup()) setup t32fs-OST0000-osc-MDT0000 failed (-2)
06:48:02:LustreError: 27989:0:(obd_config.c:1546:class_config_llog_handler()) MGC192.168.4.20@o2ib: cfg command failed: rc = -2
06:48:02:Lustre:    cmd=cf003 0:t32fs-OST0000-osc-MDT0000  1:t32fs-OST0000_UUID  2:10.10.4.12@tcp  
06:48:02:LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 't32fs-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
06:48:02:LustreError: 27939:0:(obd_mount.c:1850:server_start_targets()) failed to start server t32fs-MDT0000: -2
06:48:02:LustreError: 27939:0:(obd_mount.c:2401:server_fill_super()) Unable to start targets: -2
06:48:02:LustreError: 27939:0:(obd_mount.c:1350:lustre_disconnect_osp()) Can't end config log t32fs
06:48:02:LustreError: 27939:0:(obd_mount.c:2114:server_put_super()) t32fs-MDT0000: failed to disconnect osp-on-ost (rc=-2)!
06:48:02:Lustre: Failing over t32fs-MDT0000
06:48:02:LustreError: 27939:0:(obd_mount.c:1418:lustre_stop_osp()) Can not find osp-on-ost t32fs-MDT0000-osp-MDT0000
06:48:02:LustreError: 27939:0:(obd_mount.c:2159:server_put_super()) t32fs-MDT0000: Fail to stop osp-on-ost!
06:48:02:LustreError: 27939:0:(ldlm_request.c:1181:ldlm_cli_cancel_req()) Got rc -108 from cancel RPC: canceling anyway
06:48:02:LustreError: 27939:0:(ldlm_request.c:1811:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -108
06:48:02:Lustre: 27939:0:(client.c:1909:ptlrpc_expire_one_request()) @@@ Request  sent has timed out for slow reply: [sent 1350395276/real 1350395276]  req@ffff88030f5bf800 x1415992054906891/t0(0) o251->MGC192.168.4.20@o2ib@0@lo:26/25 lens 224/224 e 0 to 1 dl 1350395282 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
06:48:02:Lustre: server umount t32fs-MDT0000 complete
06:48:02:LustreError: 27939:0:(obd_mount.c:2989:lustre_fill_super()) Unable to mount  (-2)
06:48:02:Lustre: DEBUG MARKER: /usr/sbin/lctl mark  conf-sanity test_32a: @@@@@@ FAIL: Mounting the MDT 

Comment by Nathaniel Clark [ 26/Apr/13 ]

The exclude option to mount.lustre doesn't seem to be wired up.

Comment by Nathaniel Clark [ 26/Apr/13 ]

Correction: exclude is wired up, but it is not working correctly.

MDT debug_log:

00000020:00000001:0.0:1353824997.397533:0:15323:0:(obd_mount.c:2574:lmd_make_exclusion()) Process entered
00000020:00000010:0.0:1353824997.397535:0:15323:0:(obd_mount.c:2582:lmd_make_exclusion()) kmalloced 'exclude_list': 116 at ffff88032ec78cc0.
00000020:01000004:0.0:1353824997.397537:0:15323:0:(obd_mount.c:2597:lmd_make_exclusion()) ignoring exclude t32fs-O
00000020:00000010:0.0:1353824997.397538:0:15323:0:(obd_mount.c:2619:lmd_make_exclusion()) kfreed 'exclude_list': 116 at ffff88032ec78cc0.
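
The debug log above shows the exclude name being ignored after it appears as "t32fs-O". As a point of comparison, here is a minimal Python sketch of how a colon-separated exclude value could be split into target names and indexes. It is an illustration only, not the lmd_make_exclusion() code: the helpers parse_target_name and parse_exclude_list are hypothetical, and the sketch assumes target names of the form fsname-OSTxxxx with a four-digit hex index, as in t32fs-OST0000.

```python
import re

# Hypothetical helpers for illustration; this is not Lustre source code.
# Assumption: names look like "<fsname>-OST<4 hex digits>" or
# "<fsname>-MDT<4 hex digits>".
TARGET_RE = re.compile(
    r"^(?P<fsname>.+)-(?P<type>OST|MDT)(?P<index>[0-9a-fA-F]{4})$"
)


def parse_target_name(name):
    """Return (fsname, type, index), or None if the name is malformed."""
    match = TARGET_RE.match(name)
    if match is None:
        return None
    return (
        match.group("fsname"),
        match.group("type"),
        int(match.group("index"), 16),
    )


def parse_exclude_list(option_value):
    """Split an exclude= value on ':' and keep only well-formed OST names.

    Returns (excluded, ignored): parsed OST tuples and the raw names that
    were rejected.
    """
    excluded, ignored = [], []
    for name in option_value.split(":"):
        parsed = parse_target_name(name)
        if parsed is not None and parsed[1] == "OST":
            excluded.append(parsed)
        else:
            ignored.append(name)
    return excluded, ignored


if __name__ == "__main__":
    print(parse_exclude_list("t32fs-OST0000"))
    # A truncated name such as "t32fs-O" ends up in the ignored list.
    print(parse_exclude_list("t32fs-O"))
```

In this sketch a full name is accepted and a truncated one is reported as ignored, which is the behavior the "ignoring exclude" message suggests the name parsing was hitting.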

Comment by Nathaniel Clark [ 29/Apr/13 ]

http://review.whamcloud.com/6197

Comment by Nathaniel Clark [ 03/May/13 ]

With the most recent patch (6197,5), exclude now works, but I'm getting a different error (MDS console log):
https://maloo.whamcloud.com/test_sets/15ee3d6c-b3ce-11e2-b208-52540035b04c

12:40:27:Lustre: 5173:0:(obd_mount.c:837:lustre_check_exclusion()) Excluding t32fs-OST0000 (on exclusion list)
12:40:27:LustreError: 5173:0:(ldlm_lib.c:429:client_obd_setup()) can't add initial connection
12:40:27:LustreError: 5173:0:(osp_dev.c:686:osp_init0()) t32fs-OST0000-osc-MDT0000: can't setup obd: -2
12:40:27:LustreError: 5173:0:(obd_config.c:572:class_setup()) setup t32fs-OST0000-osc-MDT0000 failed (-2)
12:40:27:LustreError: 5173:0:(obd_config.c:1550:class_config_llog_handler()) MGC192.168.4.20@o2ib: cfg command failed: rc = -2
12:40:27:Lustre:    cmd=cf003 0:t32fs-OST0000-osc-MDT0000  1:t32fs-OST0000_UUID  2:192.168.203.128@tcp  
12:40:27:LustreError: 15c-8: MGC192.168.4.20@o2ib: The configuration from log 't32fs-MDT0000' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1257:server_start_targets()) failed to start server t32fs-MDT0000: -2
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1699:server_fill_super()) Unable to start targets: -2
12:40:27:LustreError: 5123:0:(obd_mount_server.c:844:lustre_disconnect_lwp()) t32fs-MDT0000-lwp-MDT0000: Can't end config log t32fs-client.
12:40:27:LustreError: 5123:0:(obd_mount_server.c:1426:server_put_super()) t32fs-MDT0000: failed to disconnect lwp. (rc=-2)

Comment by Nathaniel Clark [ 03/May/13 ]

Adding an OST to the exclude list seems to affect only the LCFG_LOV_ADD_OBD command (turning it into LCFG_LOV_ADD_INA). From my rough reading, the problem seems to stem from the fact that the command in question is LCFG_ADD_UUID, and the NID in question is 192.168.203.129@tcp, which fails on IB.

Comment by Nathaniel Clark [ 08/May/13 ]

This seems to be the order that llog_process_thread sends LCFG commands through class_config_llog_handler().

1) LCFG_MARKER (10)	- find if excluded
2) LCFG_ADD_UUID (5)	
3) LCFG_ATTACH (1)		
4) LCFG_SETUP (3)	- ERROR
5) LCFG_LOV_ADD_OBD (d)	- Only interaction with EXCLUDED flag: Change to LCFG_LOV_ADD_INA (13)

Step 5 is where the OST would be excluded by adding it as inactive, but during step 4 the OBD tries to set up a connection, and that fails.
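
The ordering problem in the five steps above can be modeled with a short Python sketch. This is a simplified model and not the class_config_llog_handler() source: the command codes are the values listed in parentheses above, while process_config_log, RECORDS, and the reachable_nets parameter are hypothetical names introduced for illustration. The model assumes that setup fails with -ENOENT (-2) whenever the record's NID is on a network the node does not have.

```python
# Command codes as listed in the comment above (hex, low bits only).
LCFG_ATTACH = 0x01
LCFG_SETUP = 0x03
LCFG_ADD_UUID = 0x05
LCFG_LOV_ADD_OBD = 0x0D
LCFG_MARKER = 0x10
LCFG_LOV_ADD_INA = 0x13

ENOENT = 2

# Hypothetical record list: (command, target name, NID or None).
RECORDS = [
    (LCFG_MARKER, "t32fs-OST0000", None),
    (LCFG_ADD_UUID, "t32fs-OST0000", "192.168.203.128@tcp"),
    (LCFG_ATTACH, "t32fs-OST0000", None),
    (LCFG_SETUP, "t32fs-OST0000", "192.168.203.128@tcp"),
    (LCFG_LOV_ADD_OBD, "t32fs-OST0000", None),
]


def process_config_log(records, excluded_targets, reachable_nets):
    """Replay records in order and return (rc, applied).

    rc is 0 on success, or -ENOENT when setup cannot reach the peer.
    applied lists the (command, target) pairs handled before stopping.
    """
    applied = []
    excluded = False
    for cmd, target, nid in records:
        if cmd == LCFG_MARKER:
            # Step 1: the marker is where exclusion is looked up.
            excluded = target in excluded_targets
        elif cmd == LCFG_SETUP:
            # Step 4: setup needs a connection; the exclusion flag is
            # not consulted in this model, mirroring the reported failure.
            network = nid.split("@", 1)[1]
            if network not in reachable_nets:
                return -ENOENT, applied
        elif cmd == LCFG_LOV_ADD_OBD and excluded:
            # Step 5: the only step that honors the exclusion flag.
            cmd = LCFG_LOV_ADD_INA
        applied.append((cmd, target))
    return 0, applied


if __name__ == "__main__":
    # On an IB-only node the replay stops at LCFG_SETUP, before step 5.
    print(process_config_log(RECORDS, {"t32fs-OST0000"}, {"o2ib"}))
    # With tcp available, the excluded OST is added as inactive.
    print(process_config_log(RECORDS, {"t32fs-OST0000"}, {"tcp"}))
```

The point of the sketch is only the ordering: because the exclusion flag is acted on in step 5, a failure in step 4 aborts the log before the OST can be marked inactive.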

The call chain for LCFG_SETUP is:

class_setup
 obd_setup
  osp_device_alloc (as osp::ldto_device_alloc)
   osp_init0
    client_obd_setup
     client_import_add_conn
      import_set_conn
       ptlrpc_uuid_to_connection
        ptlrpc_uuid_to_peer -- Fails to find peer

Comment by Keith Mannthey (Inactive) [ 15/May/13 ]

I opened a possibly related LU: LU-3347 (local_storage.c:872:local_oid_storage_init()) ASSERTION( (*los)->los_last_oid >= first_oid ) failed: 0 < 1

It is a timeout error for test_32a.

Comment by Nathaniel Clark [ 16/May/13 ]

Because the TCP-based NIDs are in the config logs, a writeconf is needed before the existing filesystem tarballs can be mounted on IB nodes. I hope that any conf-sanity/32 test that does a writeconf will be able to work on IB, but the non-writeconf test (32a), I believe, has no chance of passing without major revisions to the Lustre stack.
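
To make the NID mismatch concrete, here is a small Python sketch that flags NIDs recorded in a config log whose network is not configured on the local node. It is only an illustration under stated assumptions: nid_network and unreachable_nids are hypothetical helpers, the NIDs are taken from the console logs earlier in this ticket, and network names are compared as plain strings (so "tcp" and "tcp0" would be treated as different here).

```python
def nid_network(nid):
    """Return the network part of a NID such as '10.10.4.12@tcp'."""
    address, sep, network = nid.partition("@")
    if not sep or not address or not network:
        raise ValueError("malformed NID: %r" % (nid,))
    return network


def unreachable_nids(config_nids, local_networks):
    """List config-log NIDs whose network is not configured locally."""
    return [
        nid for nid in config_nids
        if nid_network(nid) not in local_networks
    ]


if __name__ == "__main__":
    # NIDs seen in the logs above; the node itself only has o2ib.
    nids_in_log = ["10.10.4.12@tcp", "192.168.203.128@tcp"]
    print(unreachable_nids(nids_in_log, {"o2ib"}))
```

Any NID this check reports would need to be rewritten in the config logs before the target can connect from an IB-only node.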

Comment by Andreas Dilger [ 16/May/13 ]

What about using "lctl replace_nids" to fix up the configuration for IB testing?

Comment by Nathaniel Clark [ 18/May/13 ]

Testing lctl replace_nids now. Skipping the non-writeconf tests does allow 32b to run successfully on IB nodes (see patch set 10).

Comment by Nathaniel Clark [ 20/May/13 ]

Because of LU-3357, using replace_nids causes the first lctl conf_param ($fsname-OST0000.osc.max_dirty_mb=15) to fail.

Comment by Nathaniel Clark [ 11/Jul/13 ]

Patch merged to master (post 2.4.51)

Comment by Jinshan Xiong (Inactive) [ 05/Aug/13 ]

reproduced again: https://maloo.whamcloud.com/test_sets/2b3a0a00-fcc8-11e2-9fdb-52540035b04c

from the debug log of client 2: https://maloo.whamcloud.com/test_logs/b0e58a3a-fcc8-11e2-9fdb-52540035b04c/show_text

10000000:01000000:1.0:1375567124.192540:0:27905:0:(mgc_request.c:1820:mgc_process_log()) MGC192.168.4.20@o2ib: configuration from log 'lustre-sptlrpc' failed (-2).

is it the same problem?

Comment by Jian Yu [ 15/Aug/13 ]

Patch http://review.whamcloud.com/6197 was cherry-picked to Lustre b2_4 branch.
