Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Labels: None
- Affects Version/s: Lustre 2.4.0, Lustre 2.1.5
- Severity: 3
- Rank: 9382
Description
During acceptance testing at KIT, OSS failover was tried while several applications were running, and the applications got I/O errors ("can't create file" and similar messages). This should not happen; I/O should simply stall until failover completes.
The clients were running 2.4 and the servers 2.1.5. We tried with 2.1.5 clients and did not see this issue. I have attached the client and server logs.
Attachments
- ll10987.out.gz (0.2 kB)
- LU-XXXX.tgz (463 kB)
- mds1.llog.gz (224 kB)
- mds2.llog.gz (0.2 kB)
Activity
Hi Hongchao,
Sorry for the delay in getting the output:
Oct 23 10:47:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -11
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
Oct 23 10:48:39 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
Oct 23 10:48:39 pfscn2 kernel: : Call Trace:
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Oct 23 10:48:39 pfscn2 kernel: : LustreError: 10250:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -5
Oct 23 10:49:21 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
Oct 23 10:49:21 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Oct 23 10:53:38 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
Oct 23 10:53:38 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
Oct 23 10:53:38 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Oct 23 10:54:09 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
Oct 23 10:54:09 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
Oct 23 10:54:16 pfscn2 kernel: : Lustre: 10169:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 6ede50b8-462e-c607-2c6d-f747541fabbd@172.26.17.4@o2ib t0 exp (null) cur 1382518456 last 0
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -11
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
Oct 23 10:54:28 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
Oct 23 10:54:28 pfscn2 kernel: : Call Trace:
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Oct 23 10:54:28 pfscn2 kernel: : LustreError: 24178:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -5
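For reference, the return codes in these messages decode to standard errno values: rc = -11 is -EAGAIN, rc = -5 is -EIO, and -19 is -ENODEV.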
Hi Kit,
Could you please test with the following debug patch?
diff --git a/lustre/lov/lov_request.c b/lustre/lov/lov_request.c
index 0fc635e..f0476cb 100644
--- a/lustre/lov/lov_request.c
+++ b/lustre/lov/lov_request.c
@@ -577,6 +577,14 @@ static int lov_update_create_set(struct lov_request_set *se
                        "on OST idx %d/%d: rc = %d\n",
                        set->set_oi->oi_oa->o_id, req->rq_idx,
                        lsm->lsm_stripe_count, rc);
+
+#ifdef __KERNEL__
+        if (rc == -EAGAIN) {
+                CDEBUG(D_ERROR, "shouldn't encounter -EAGAIN!\n");
+                dump_stack();
+        }
+#endif
+
         if (rc > 0) {
                 CERROR("obd_create returned invalid err %d\n", rc);
                 rc = -EIO;
diff --git a/lustre/osc/osc_create.c b/lustre/osc/osc_create.c
index ffc81d4..22265bd 100644
--- a/lustre/osc/osc_create.c
+++ b/lustre/osc/osc_create.c
@@ -412,7 +412,7 @@ static int handle_async_create(struct ptlrpc_request *req, i
         LASSERT_SPIN_LOCKED(&oscc->oscc_lock);

-        if(rc)
+        if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                 GOTO(out_wake, rc);

         /* Handle the critical type errors first.
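In short, the first hunk dumps a stack trace whenever -EAGAIN reaches lov_update_create_set, so we can see where it originates; the second hunk stops handle_async_create from immediately failing the fake create request when the error is a transient connection loss (-ENOTCONN or -ENODEV).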
The fix that worked on my test system didn't work on the customer system, so it might have been a fluke.
I haven't seen that message in the new logs. Here is the latest server dk (lctl debug_kernel) output:
00000100:00100000:14.0:1381243911.199943:0:4598:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff8808181fb400 x1448249214211642/t0(0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
00000100:00100000:13.0:1381243911.212619:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211637:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.212624:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.225308:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211638:172.26.17.3@o2ib:-1
00020000:00020000:13.0:1381243911.225312:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -11
00000100:00100000:13.0:1381243911.250579:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:1448249214211639:172.26.17.3@o2ib:-1
00020000:00020000:6.0:1381243911.250579:0:29944:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250586:0:4596:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:e76ce10d-1d27-cdc4-1091-649094c70331:4596:1448249214211641:172.26.17.1@o2ib:400
00020000:00020000:5.0:1381243911.250612:0:4755:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -5
00000100:00100000:13.0:1381243911.250638:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:0:1448249214211644:172.26.17.3@o2ib:13
00000100:00100000:5.0:1381243911.250741:0:4755:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:94b08aa2-54d7-b32e-019d-2561ed3286b5+6:58211:x1447047461344560:12345-172.26.4.3@o2ib:101 Request procesed in 49643595us (49643620us total) trans 0 rc 301/301
Hi Kit,
Sorry for the delayed response; we have been on holiday for the last several days.
If "OSCC_FLAG_DEGRADED" is set, then this OST won't be used preferentially when creating objects. Did you test the failover on only one OSS?
By the way, is there something like "Unknown rc XXX from async create: failing oscc" in the kernel logs?
Thanks
I tested adding the degraded flag to the oscc and it seemed to fix the problem on my test system:
         if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                 GOTO(out_wake, rc);

+        if (rc == -ENOTCONN || rc == -ENODEV)
+                oscc->oscc_flags |= OSCC_FLAG_DEGRADED;
+
         /* Handle the critical type errors first.
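With that change, a connection-loss error marks the OST degraded rather than failing the create outright, so (per the behavior Hongchao describes above) new objects should land on other OSTs until the connection is restored.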
I ran into one issue where the OST ran out of inodes and it caused the lov_create threads on the MDT to deadlock. I'm not sure whether that's new behavior or present in stock 2.1.6 as well. I'll investigate.
Ah, OK, that makes sense. Should handle_async_create add the RECOVERING flag to the oscc? It seems like osc_interpret_create would add it, but is it possible to get to oscc_internal_create without that flag?
Hi,
The fake requests are queued on osc_creator.oscc_wait_create_list in osc_create_async to wait for the object creation to complete, and they are woken up in osc_interpret_create with the value returned from the OST. In this case, -ENOTCONN or -ENODEV is returned and fails these fake requests; they should instead keep waiting for the OST's recovery to complete, at which point the objects will be ready.
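A rough sketch of that queue-and-wake pattern, with hypothetical names and fields standing in for the real osc_create.c machinery:

/* Hypothetical sketch of the fake-request lifecycle described above.
 * The structure and function are illustrative, not the real Lustre
 * code. */
#include <errno.h>

struct fake_req_sketch {
        int rq_rc;     /* result delivered on final wakeup */
        int rq_done;   /* set once the request really completes */
};

/* Called from the create-interpret path with the OST's return code. */
static void wake_fake_req(struct fake_req_sketch *req, int rc)
{
        if (rc == -ENOTCONN || rc == -ENODEV) {
                /* OST is recovering: leave the request queued so it
                 * completes once objects are available again, instead
                 * of failing it back to the caller. */
                return;
        }

        req->rq_rc = rc;
        req->rq_done = 1;
}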
I'll create a new debug patch based on http://review.whamcloud.com/#/c/7559/.
Thanks!
Hi Hongchao,
I was trying to follow the create logic and it has me a bit confused. It seems like the fake requests are handled by oscc_internal_create, but then, if the oscc is recovering (due to the EAGAIN), all it does is return 0. Am I reading that right? (A simplified sketch of the flow I mean follows this comment.)
Thanks,
Kit
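For reference, the control flow being asked about, sketched with hypothetical simplifications (this is not the actual osc_create.c code, and the flag value is assumed):

/* Hypothetical simplification of the reading described above: when
 * the creator is marked RECOVERING, the internal create path does
 * nothing and returns 0, leaving the queued fake requests waiting
 * for recovery completion to re-trigger precreation. */
#define OSCC_FLAG_RECOVERING 0x02

static int oscc_internal_create_sketch(int oscc_flags)
{
        if (oscc_flags & OSCC_FLAG_RECOVERING)
                return 0;  /* defer until recovery completes */

        /* ... otherwise build and send a real precreate RPC ... */
        return 0;
}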
Hi Hongchao,
The patch fixed the data corruption, but we are still seeing the same errors on failover. Should we reproduce the issue with the original debugging patch plus the fix? Here are the kernel messages from the fix (patch 7559):
Sep 25 15:16:49 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
Sep 25 15:16:49 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Sep 25 15:17:39 pfscn2 kernel: : LustreError: 9001:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -11
Sep 25 15:17:39 pfscn2 kernel: : LustreError: 7308:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -5
Sep 25 15:17:40 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
Sep 25 15:17:40 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
Should it actually be:
+        if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV && rc != -EAGAIN)
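In other words, treat -EAGAIN like the other transient connection errors, so the fake request keeps waiting instead of being failed back as an I/O error.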
I built a version with that just to test. I didn't get any IO errors, but I did get: