[LU-5654] MDT 0 failed to set up some OST OSPs when all OSTs were started in parallel for the first time Created: 24/Sep/14  Updated: 04/Jun/15  Resolved: 09/Oct/14

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.7.0

Type: Bug Priority: Major
Reporter: Li Wei (Inactive) Assignee: Li Wei (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Related
Severity: 3
Rank (Obsolete): 15850

 Description   

A file system with 2 MDTs and 163 OSTs was formatted and started like this:

  1. Mount MDT 0/MGT.
  2. Mount MDT 1.
  3. Mount all 163 OSTs in parallel.

After all mount commands succeeded, MDT 0 failed to set up OSPs for some OSTs:

Sep 23 07:02:43 lola-8 kernel: LDISKFS-fs (dm-1): mounted filesystem with ordered data mode. quota=on. Opts:
Sep 23 07:02:44 lola-8 kernel: Lustre: ctl-soaked-MDT0000: No data found on store. Initialize space
Sep 23 07:02:44 lola-8 kernel: Lustre: soaked-MDT0000: new disk, initializing
Sep 23 07:02:44 lola-8 kernel: LustreError: 11-0: soaked-MDT0000-lwp-MDT0000: Communicating with 0@lo, operation mds_connect failed with -11.
Sep 23 07:02:57 lola-8 kernel: LustreError: 11-0: soaked-MDT0001-osp-MDT0000: Communicating with 192.168.1.109@o2ib10, operation mds_connect failed with -11.
Sep 23 07:03:09 lola-8 kernel: Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000200000400-0x0000000240000400):a2:ost
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(osd_io.c:1361:osd_ldiskfs_read()) soaked-MDT0000: can't read 32@512 on ino 222: rc = 0
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(llog_osd.c:1466:llog_osd_get_cat_list()) soaked-MDT0000-osd: error reading CATALOGS: rc = -14
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(osp_sync.c:1045:osp_sync_llog_init()) soaked-OST0010-osc-MDT0000: can't get id from catalogs: rc = -14
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(osp_sync.c:1147:osp_sync_init()) soaked-OST0010-osc-MDT0000: can't initialize llog: rc = -14
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(obd_config.c:561:class_setup()) setup soaked-OST0010-osc-MDT0000 failed (-14)
Sep 23 07:03:14 lola-8 kernel: LustreError: 6677:0:(obd_config.c:1609:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -14
Sep 23 07:03:14 lola-8 kernel: Lustre:    cmd=cf003 0:soaked-OST0010-osc-MDT0000  1:soaked-OST0010_UUID  2:192.168.1.102@o2ib10
Sep 23 07:03:14 lola-8 kernel:
Sep 23 07:03:14 lola-8 kernel: LustreError: 6584:0:(mgc_request.c:517:do_requeue()) failed processing log: -14
Sep 23 07:03:17 lola-8 kernel: Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000000600000400-0x0000000640000400):9c:ost
Sep 23 07:03:17 lola-8 kernel: Lustre: Skipped 15 previous similar messages
Sep 23 07:03:22 lola-8 kernel: LustreError: 6699:0:(obd_config.c:776:class_add_conn()) try to add conn on immature client dev
Sep 23 07:03:22 lola-8 kernel: LustreError: 6699:0:(obd_class.h:938:obd_connect()) Device 13 not setup
Sep 23 07:03:22 lola-8 kernel: LustreError: 6699:0:(lod_lov.c:268:lod_add_device()) soaked-OST0010-osc-MDT0000: cannot connect to next dev soaked-OST0010_UUID (-19)
Sep 23 07:03:22 lola-8 kernel: LustreError: 6699:0:(obd_config.c:1609:class_config_llog_handler()) MGC192.168.1.108@o2ib10: cfg command failed: rc = -19
Sep 23 07:03:22 lola-8 kernel: Lustre:    cmd=cf00d 0:soaked-MDT0000-mdtlov  1:soaked-OST0010_UUID  2:16  3:1
Sep 23 07:03:22 lola-8 kernel:
Sep 23 07:03:22 lola-8 kernel: LustreError: 6584:0:(mgc_request.c:517:do_requeue()) failed processing log: -19
Sep 23 07:03:27 lola-8 kernel: Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000002380000400-0x00000023c0000400):3b:ost
Sep 23 07:03:27 lola-8 kernel: Lustre: Skipped 117 previous similar messages
Sep 23 07:03:52 lola-8 kernel: Lustre: ctl-soaked-MDT0000: super-sequence allocation rc = 0 [0x0000002a40000400-0x0000002a80000400):7a:ost
Sep 23 07:03:52 lola-8 kernel: Lustre: Skipped 26 previous similar messages
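
The negative rc values in the log above follow the usual Linux kernel convention of returning a negated errno. As a reading aid only, the sketch below maps the three codes that appear in this log (-11, -14, -19) to their errno names, assuming Linux errno numbering. The helper name rc_name() is hypothetical and is not part of Lustre.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a Lustre function): map a negated errno, as
 * printed in "rc = -14", to its symbolic name. Only the three codes seen
 * in the log above are covered; anything else yields NULL.
 */
const char *rc_name(int rc)
{
    switch (rc) {
    case -EAGAIN: return "EAGAIN"; /* -11 on Linux: mds_connect, try again */
    case -EFAULT: return "EFAULT"; /* -14 on Linux: error reading CATALOGS */
    case -ENODEV: return "ENODEV"; /* -19 on Linux: device 13 not set up   */
    default:      return NULL;
    }
}
```

Read this way, the log shows the -14 from llog_osd_get_cat_list() propagating up through osp_sync_llog_init(), osp_sync_init() and class_setup(), and the later -19 from lod_add_device() following from the OSP device never having been set up.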

MDT 0/MGT was ldiskfs-based.



 Comments   
Comment by Li Wei (Inactive) [ 24/Sep/14 ]

The first problem is that osd_ldiskfs_read() does not correctly handle the case in which a block to be read has not been allocated yet, i.e., a hole.
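
For illustration, the conventional behaviour for a read that lands in a hole inside the file size is to return zeros rather than a short read or an error. The sketch below models that in user space; it is not the code from the patch. The struct sparse_file type, the sparse_read() function and the 4096-byte BLOCK_SIZE are all assumptions made up for this example.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 4096u /* illustrative value, not taken from the ticket */

/* Hypothetical in-memory model of a sparse file: a NULL entry is a hole. */
struct sparse_file {
    size_t size;                  /* logical file size in bytes */
    size_t nblocks;               /* number of entries in blocks[] */
    const unsigned char **blocks; /* blocks[i] == NULL: not allocated */
};

/*
 * Copy up to len bytes starting at offset off into buf. The return value
 * is the number of bytes produced; it is shorter than len only when the
 * request reaches end of file, never because a block is unallocated.
 */
size_t sparse_read(const struct sparse_file *f, void *buf,
                   size_t len, size_t off)
{
    unsigned char *out = buf;
    size_t done = 0;

    if (off >= f->size)
        return 0;            /* at or past EOF: nothing to read */
    if (len > f->size - off)
        len = f->size - off; /* clamp the request to EOF */

    while (done < len) {
        size_t pos = off + done;
        size_t idx = pos / BLOCK_SIZE;    /* which block */
        size_t boff = pos % BLOCK_SIZE;   /* offset inside that block */
        size_t chunk = BLOCK_SIZE - boff; /* bytes left in that block */

        if (chunk > len - done)
            chunk = len - done;

        if (idx < f->nblocks && f->blocks[idx] != NULL)
            memcpy(out + done, f->blocks[idx] + boff, chunk);
        else
            memset(out + done, 0, chunk); /* hole: zero-fill, not an error */

        done += chunk;
    }
    return done;
}
```

With a two-block file whose blocks are both holes, a call such as sparse_read(&f, buf, 32, 512) returns 32 and fills buf with zeros. The "can't read 32@512 ... rc = 0" line in the log appears to describe the opposite outcome: a read that came back with 0 bytes and was then reported by the caller as -14.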

Comment by Li Wei (Inactive) [ 24/Sep/14 ]

http://review.whamcloud.com/12035 addresses the osd_ldiskfs_read() problem.

Comment by Li Wei (Inactive) [ 24/Sep/14 ]

http://review.whamcloud.com/12037 fixes a leak on osp_init0()'s error path.

Comment by Oleg Drokin [ 01/Oct/14 ]

Patch 12035 was reverted from master because it caused LU-5684 (I think the test it added is buggy/racy).

Comment by Li Wei (Inactive) [ 01/Oct/14 ]

http://review.whamcloud.com/12145 is an updated version of 12035, addressing the racy conf-sanity test.

Comment by Jodi Levi (Inactive) [ 09/Oct/14 ]

Patch landed to Master.

Comment by Li Wei (Inactive) [ 17/Oct/14 ]

http://review.whamcloud.com/12319 (b2_5 port)
