  Lustre / LU-5148

OSTs won't mount following upgrade to 2.4.2

Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major

    Description

      A production Lustre cluster, "porter", was upgraded from 2.4.0-28chaos to lustre-2.4.2-11chaos today. The OSTs now will not start.

      # porter1 /root > /etc/init.d/lustre start
      Stopping snmpd:                                            [  OK  ]
      Shutting down cerebrod:                                    [  OK  ]
      Mounting porter1/lse-ost0 on /mnt/lustre/local/lse-OST0001
      mount.lustre: mount porter1/lse-ost0 at /mnt/lustre/local/lse-OST0001 failed: Input/output error
      Is the MGS running?
      # porter1 /root > 
      
      Lustre: Lustre: Build Version: 2.4.2-11chaos-11chaos--PRISTINE-2.6.32-431.17.2.1chaos.ch5.2.x86_64
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.115.67@o2ib10 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.38@o2ib7 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.101@o2ib7 (no target)
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003336/t0(0) o253->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 4768/4768 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 5426:0:(obd_mount_server.c:1140:server_register_target()) lse-OST0001: error registering with the MGS: rc = -5 (not fatal)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.116.205@o2ib5 (no target)
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.114.162@o2ib5 (no target)
      LustreError: Skipped 19 previous similar messages
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003340/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 137-5: lse-OST0002_UUID: not available for connect from 192.168.120.162@o2ib7 (no target)
      LustreError: Skipped 23 previous similar messages
      LustreError: 5426:0:(client.c:1053:ptlrpc_import_delay_req()) @@@ send limit expired   req@ffff881026873800 x1470103660003344/t0(0) o101->MGC172.19.1.165@o2ib100@172.19.1.165@o2ib100:26/25 lens 328/344 e 0 to 0 dl 0 ref 2 fl Rpc:W/0/ffffffff rc 0/-1
      LustreError: 15c-8: MGC172.19.1.165@o2ib100: The configuration from log 'lse-OST0001' failed (-5). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
      LustreError: 5426:0:(obd_mount_server.c:1273:server_start_targets()) failed to start server lse-OST0001: -5
      Lustre: lse-OST0001: Unable to start target: -5
      LustreError: 5426:0:(obd_mount_server.c:865:lustre_disconnect_lwp()) lse-MDT0000-lwp-OST0001: Can't end config log lse-client.
      LustreError: 5426:0:(obd_mount_server.c:1442:server_put_super()) lse-OST0001: failed to disconnect lwp. (rc=-2)
      LustreError: 5426:0:(obd_mount_server.c:1472:server_put_super()) no obd lse-OST0001
      Lustre: server umount lse-OST0001 complete
      LustreError: 5426:0:(obd_mount.c:1290:lustre_fill_super()) Unable to mount  (-5)
      
      # porter1 /root > lctl ping 172.19.1.165@o2ib100 # <-- MGS NID
      12345-0@lo
      12345-172.19.1.165@o2ib100
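
      lctl ping only exercises the LNet layer; the state of the MGC obd device itself can be checked on the OSS with something like the following (a sketch; the import parameter may not be exposed by every version):

      # porter1 /root > lctl dl | grep mgc
      # porter1 /root > lctl get_param mgc.*.import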
      


          Activity

            green Oleg Drokin added a comment -

            Looking at the stack traces in the logs, it seems everybody is either blocked on the transaction commit wait inside ZFS, or on a semaphore held by somebody who is waiting on the transaction commit.

            So it really looks like some sort of in-ZFS wait to me. There's no dump of all thread stacks in here, so I wonder if you have one where it is visible that there is a Lustre-induced deadlock of some sort above ZFS?
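
            For reference, a dump of all task stacks can be captured on the affected server with sysrq (assuming CONFIG_MAGIC_SYSRQ is enabled; the output goes to the console log / dmesg):

            echo 1 > /proc/sys/kernel/sysrq    # enable all sysrq functions if needed
            echo t > /proc/sysrq-trigger       # dump every task's stack to the kernel log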


            hongchao.zhang Hongchao Zhang added a comment -

            This debug patch changes the IR (Imperative Recovery) operations in the MGS to update asynchronously. If the issue does not occur again with the patch applied, we can isolate the problem as being related to the slow synchronization of ZFS, just as in the problem shown in LU-2887, and then we can create corresponding patches to fix it.

            Thanks.
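
            For illustration, a minimal sketch of what making the IR update asynchronous could look like (hypothetical names and structure; this is not the actual patch at http://review.whamcloud.com/#/c/10869/): instead of writing the nidtbl update synchronously inside the target registration handler, the update is queued to a worker so the handler no longer waits on the ZFS transaction commit.

            /*
             * Hypothetical sketch only -- the real change is the Gerrit patch
             * above and will differ.  Copy the registration info and defer the
             * nidtbl write to a worker, so the MGS request handler is not
             * blocked on dt_trans_stop()/txg sync.
             */
            struct mgs_ir_work {
                    struct work_struct      miw_work;
                    struct mgs_device      *miw_mgs;
                    struct mgs_target_info  miw_mti;    /* copy of *mti */
            };

            static void mgs_ir_update_workfn(struct work_struct *work)
            {
                    struct mgs_ir_work *miw = container_of(work, struct mgs_ir_work,
                                                           miw_work);
                    struct lu_env env;
                    int rc;

                    rc = lu_env_init(&env, LCT_MG_THREAD);
                    if (rc == 0) {
                            /* the potentially slow part now runs outside the RPC handler */
                            mgs_ir_update(&env, miw->miw_mgs, &miw->miw_mti);
                            lu_env_fini(&env);
                    }
                    OBD_FREE_PTR(miw);
            }

            /* called from mgs_handle_target_reg() in place of the synchronous
             * rc = mgs_ir_update(env, mgs, mti); */
            static int mgs_ir_update_async(struct mgs_device *mgs,
                                           struct mgs_target_info *mti)
            {
                    struct mgs_ir_work *miw;

                    OBD_ALLOC_PTR(miw);
                    if (miw == NULL)
                            return -ENOMEM;
                    miw->miw_mgs = mgs;
                    miw->miw_mti = *mti;
                    INIT_WORK(&miw->miw_work, mgs_ir_update_workfn);
                    schedule_work(&miw->miw_work);
                    return 0;
            }

            With a deferral like this, a slow transaction commit on the MGS backend would no longer stall target registration, at the cost of the nidtbl update landing slightly later.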
            pjones Peter Jones added a comment -

            Hongchao

            Could you please elaborate as to how this patch works?

            Thanks

            Peter


            hongchao.zhang Hongchao Zhang added a comment -

            Hi,

            Could you please try the debug patch at http://review.whamcloud.com/#/c/10869/ to check whether this issue occurs again?

            Thanks very much!

            nedbass Ned Bass (Inactive) added a comment -

            Here you go.

            (gdb) l *(mgs_handle_target_reg+0x40c)
            0x18ac is in mgs_handle_target_reg (/usr/src/debug/lustre-2.4.2/lustre/mgs/mgs_handler.c:322).
            317
            318             if (opc == LDD_F_OPC_READY) {
            319                     CDEBUG(D_MGS, "fs: %s index: %d is ready to reconnect.\n",
            320                            mti->mti_fsname, mti->mti_stripe_index);
            321                     rc = mgs_ir_update(env, mgs, mti);
            322                     if (rc) {
            323                             LASSERT(!(mti->mti_flags & LDD_F_IR_CAPABLE));
            324                             CERROR("Update IR return with %d(ignore and IR "
            325                                    "disabled)\n", rc);
            326                     }
            
            (gdb) l *( mgs_ir_update+0x244 )
            0x234e4 is in mgs_ir_update (/usr/src/debug/lustre-2.4.2/lustre/mgs/mgs_nids.c:270).
            265             rc = dt_record_write(env, fsdb, &buf, &off, th);
            266
            267     out:
            268             dt_trans_stop(env, mgs->mgs_bottom, th);
            269     out_put:
            270             lu_object_put(env, &fsdb->do_lu);
            271             RETURN(rc);
            272     }
            273
            274     #define MGS_NIDTBL_VERSION_INIT 2
            

            hongchao.zhang Hongchao Zhang added a comment -

            Hi,

            Could you please print the actual code lines at the following addresses?

            2014-06-05 11:25:55  [<ffffffffa0db44b4>] mgs_ir_update+0x244/0xb00 [mgs]
            2014-06-05 11:25:55  [<ffffffffa0d9287c>] mgs_handle_target_reg+0x40c/0xe30 [mgs]

            Then I can find the related code in these functions in https://github.com/chaos/lustre/blob/2.4.2-11chaos/lustre/mgs/

            Thanks very much!
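
            For reference, one way to resolve such module offsets to source lines, assuming the matching debuginfo is installed (the path to mgs.ko.debug below is illustrative and may differ):

            gdb /usr/lib/debug/lib/modules/2.6.32-431.17.2.1chaos.ch5.2.x86_64/extra/lustre/mgs.ko.debug
            (gdb) list *(mgs_ir_update+0x244)
            (gdb) list *(mgs_handle_target_reg+0x40c)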

            morrone Christopher Morrone (Inactive) added a comment -

            The version of ZFS installed at our site is quite a bit newer than 0.6.0.*. Our version of ZFS is very close to the tip of master, and to what will soon be tagged as 0.6.3.

            hongchao.zhang Hongchao Zhang added a comment -

            There is a similar ZFS issue at https://github.com/zfsonlinux/zfs/issues/542. What version of ZFS is installed at your site?

            nedbass Ned Bass (Inactive) added a comment -

            Hongchao Zhang, I've attached the MDS console log: porter-mds1.console.txt.
            hongchao.zhang Hongchao Zhang added a comment - edited

            There is no "MGS_CONNECT" request found on the MGS/MDS in the log "lustre.log.porter-mds1.1402001323.gz", and there are not even any "mgs_xxx" entries in the MGS/MDS log file (there should at least be some "ENTRY" and "RETURN" logs), so the MGS must be stuck in some way.

            Hi Ned, could you please attach the logs containing the stack traces of the MGS service mentioned above? Thanks!

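            For reference, a sketch of how a debug log including the trace-level "ENTRY"/"RETURN" messages could be captured on the MDS/MGS before reproducing the mount failure:

            lctl set_param debug="+trace +rpctrace"   # enable function entry/exit and RPC tracing
            lctl clear                                # start from an empty debug buffer
            # ... reproduce the failed OST mount ...
            lctl dk /tmp/lustre.log.porter-mds1.txt   # dump the kernel debug buffer to a file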

            nedbass Ned Bass (Inactive) added a comment -

            Also, I was finally able to get the OSTs to mount by unmounting and remounting the MDT, but leaving the MGT (which is a separate dataset) mounted.
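
            For anyone hitting the same state, the workaround amounts to something like the following (the MDT dataset and mount point names are illustrative, not porter's actual ones):

            # On the MDS: restart the MDT only, leaving the separate MGT dataset mounted
            umount /mnt/lustre/local/lse-MDT0000
            mount -t lustre porter-mds1/lse-mdt0 /mnt/lustre/local/lse-MDT0000
            # Then retry mounting the OSTs on the OSS nodes
            mount -t lustre porter1/lse-ost0 /mnt/lustre/local/lse-OST0001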

            People

              Assignee: hongchao.zhang Hongchao Zhang
              Reporter: nedbass Ned Bass (Inactive)
              Votes: 0
              Watchers: 8
