Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3645

Interop 2.1.5 <--> 2.4 Write operations during failover errors out instead of stalling

Details

    • Bug
    • Resolution: Fixed
    • Major
    • None
    • Lustre 2.4.0, Lustre 2.1.5
    • 3
    • 9382

    Description

      During acceptance testing for KIT, they tried OSS failover while running several applications. And applications got IO errors (can't create file and similar messages). This should not happen and IO should just stall till failover happens.
      The clients were running 2.4 and servers were 2.1.5. We tried with 2.1.5 clients and did not see this issue. I have attached the client and server logs.

      Attachments

        1. client_lctl_dk_20130911.tgz
          18 kB
        2. client_messages_20130911.tgz
          228 kB
        3. ll10987.out.gz
          0.2 kB
        4. LU-XXXX.tgz
          463 kB
        5. mds1.llog.gz
          224 kB
        6. mds2.llog.gz
          0.2 kB
        7. server_lctl_dk_20130911.tgz
          392 kB
        8. ucbn003.localdomain.llog.gz
          0.2 kB

        Activity

          [LU-3645] Interop 2.1.5 <--> 2.4 Write operations during failover errors out instead of stalling

          should it actually be:
          + if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV && rc != -EAGAIN)

          I built a version with that just to test. I didn't get any IO errors, but I did get:

          Oct 23 22:27:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -19
          Oct 23 22:27:56 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete
          Oct 23 22:27:56 pfscn2 kernel: : LustreError: 9255:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc -19 from async create: failing oscc  req@ffff88081f6c3800 x1449716075693976/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/400 e 0 to 0 dl 1382560083 ref 1 fl Interpret:RN/0/0 rc -19/-19
          Oct 23 22:28:53 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
          Oct 23 22:28:53 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
          Oct 23 22:28:53 pfscn2 kernel: : Lustre: Skipped 1 previous similar message
          Oct 23 22:31:16 pfscn2 kernel: : Lustre: Service thread pid 10987 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
          Oct 23 22:31:16 pfscn2 kernel: : Pid: 10987, comm: mdt_13
          Oct 23 22:31:16 pfscn2 kernel: :
          Oct 23 22:31:16 pfscn2 kernel: : Call Trace:
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de923>] ? cfs_alloc+0x63/0x90 [libcfs]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0808a45>] ? ptlrpc_next_xid+0x15/0x40 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de60e>] cfs_waitq_wait+0xe/0x10 [libcfs]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ac650a>] lov_create+0xbaa/0x1400 [lov]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d5c0d6>] ? mdd_get_md+0x96/0x2f0 [mdd]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7c916>] ? mdd_read_unlock+0x26/0x30 [mdd]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d6090e>] mdd_lov_create+0x9ee/0x1ba0 [mdd]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d72871>] mdd_create+0xf81/0x1a90 [mdd]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0e72414>] ? osd_object_init+0xe4/0x420 [osd_ldiskfs]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0eaa3f7>] cml_create+0x97/0x250 [cmm]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de55b1>] ? mdt_version_get_save+0x91/0xd0 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dfb156>] mdt_reint_open+0x1aa6/0x2940 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa081d754>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7556e>] ? md_ucred+0x1e/0x60 [mdd]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de3c51>] mdt_reint_rec+0x41/0xe0 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddaed4>] mdt_reint_internal+0x544/0x8e0 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddb53d>] mdt_intent_reint+0x1ed/0x500 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd9c09>] mdt_intent_policy+0x379/0x690 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07d9391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07ff1dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dda586>] mdt_enqueue+0x46/0x130 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dcf772>] mdt_handle_common+0x932/0x1750 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd0665>] mdt_regular_handle+0x15/0x20 [mdt]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082db9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc]
          Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          
          kitwestneat Kit Westneat (Inactive) added a comment - should it actually be: + if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV && rc != -EAGAIN ) I built a version with that just to test. I didn't get any IO errors, but I did get: Oct 23 22:27:56 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_create operation failed with -19 Oct 23 22:27:56 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete Oct 23 22:27:56 pfscn2 kernel: : LustreError: 9255:0:(osc_create.c:175:osc_interpret_create()) @@@ Unknown rc -19 from async create: failing oscc req@ffff88081f6c3800 x1449716075693976/t0(0) o5->pfscdat2-OST0000-osc-MDT0000@172.26.17.4@o2ib:7/4 lens 400/400 e 0 to 0 dl 1382560083 ref 1 fl Interpret:RN/0/0 rc -19/-19 Oct 23 22:28:53 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib) Oct 23 22:28:53 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans Oct 23 22:28:53 pfscn2 kernel: : Lustre: Skipped 1 previous similar message Oct 23 22:31:16 pfscn2 kernel: : Lustre: Service thread pid 10987 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Oct 23 22:31:16 pfscn2 kernel: : Pid: 10987, comm: mdt_13 Oct 23 22:31:16 pfscn2 kernel: : Oct 23 22:31:16 pfscn2 kernel: : Call Trace: Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de923>] ? cfs_alloc+0x63/0x90 [libcfs] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0808a45>] ? ptlrpc_next_xid+0x15/0x40 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa04de60e>] cfs_waitq_wait+0xe/0x10 [libcfs] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ac650a>] lov_create+0xbaa/0x1400 [lov] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d5c0d6>] ? mdd_get_md+0x96/0x2f0 [mdd] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20 Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7c916>] ? mdd_read_unlock+0x26/0x30 [mdd] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d6090e>] mdd_lov_create+0x9ee/0x1ba0 [mdd] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d72871>] mdd_create+0xf81/0x1a90 [mdd] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0e72414>] ? osd_object_init+0xe4/0x420 [osd_ldiskfs] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0eaa3f7>] cml_create+0x97/0x250 [cmm] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de55b1>] ? mdt_version_get_save+0x91/0xd0 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dfb156>] mdt_reint_open+0x1aa6/0x2940 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa081d754>] ? lustre_msg_add_version+0x74/0xd0 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0d7556e>] ? md_ucred+0x1e/0x60 [mdd] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0de3c51>] mdt_reint_rec+0x41/0xe0 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddaed4>] mdt_reint_internal+0x544/0x8e0 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0ddb53d>] mdt_intent_reint+0x1ed/0x500 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd9c09>] mdt_intent_policy+0x379/0x690 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07d9391>] ldlm_lock_enqueue+0x361/0x8f0 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa07ff1dd>] ldlm_handle_enqueue0+0x48d/0xf50 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dda586>] mdt_enqueue+0x46/0x130 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dcf772>] mdt_handle_common+0x932/0x1750 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa0dd0665>] mdt_regular_handle+0x15/0x20 [mdt] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082db9e>] ptlrpc_main+0xc4e/0x1a40 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0ca>] child_rip+0xa/0x20 Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffffa082cf50>] ? ptlrpc_main+0x0/0x1a40 [ptlrpc] Oct 23 22:31:16 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20

          Hi Hongchao,

          Sorry for the delay in getting the output:

          Oct 23 10:47:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this 
          service will wait for recovery to complete
          Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -11
          Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
          Oct 23 10:48:39 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
          Oct 23 10:48:39 pfscn2 kernel: : Call Trace:
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          Oct 23 10:48:39 pfscn2 kernel: : LustreError: 10250:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -5
          Oct 23 10:49:21 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib)
          Oct 23 10:49:21 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
          Oct 23 10:53:38 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
          Oct 23 10:53:38 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
          Oct 23 10:53:38 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
          Oct 23 10:54:09 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19
          Oct 23 10:54:09 pfscn2 kernel: : LustreError: Skipped 1 previous similar message
          Oct 23 10:54:16 pfscn2 kernel: : Lustre: 10169:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 6ede50b8-462e-c607-2c6d-f747541fabbd@172.26.17.4@o2ib t0 exp (null) cur 1382518456 last 0
          Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -11
          Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN!
          Oct 23 10:54:28 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1
          Oct 23 10:54:28 pfscn2 kernel: : Call Trace:
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc]
          Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
          Oct 23 10:54:28 pfscn2 kernel: : LustreError: 24178:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -5
          
          kitwestneat Kit Westneat (Inactive) added a comment - Hi Hongchao, Sorry for the delay in getting the output: Oct 23 10:47:48 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.4@o2ib) was lost; in progress operations using this service will wait for recovery to complete Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -11 Oct 23 10:48:39 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN! Oct 23 10:48:39 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1 Oct 23 10:48:39 pfscn2 kernel: : Call Trace: Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0 Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20 Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:48:39 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Oct 23 10:48:39 pfscn2 kernel: : LustreError: 10250:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10949 sub-object on OST idx 0/1: rc = -5 Oct 23 10:49:21 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.3@o2ib) Oct 23 10:49:21 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans Oct 23 10:53:38 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19 Oct 23 10:53:38 pfscn2 kernel: : LustreError: Skipped 1 previous similar message Oct 23 10:53:38 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete Oct 23 10:54:09 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.4@o2ib. The ost_connect operation failed with -19 Oct 23 10:54:09 pfscn2 kernel: : LustreError: Skipped 1 previous similar message Oct 23 10:54:16 pfscn2 kernel: : Lustre: 10169:0:(ldlm_lib.c:952:target_handle_connect()) MGS: connection from 6ede50b8-462e-c607-2c6d-f747541fabbd@172.26.17.4@o2ib t0 exp ( null ) cur 1382518456 last 0 Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -11 Oct 23 10:54:28 pfscn2 kernel: : LustreError: 9651:0:(lov_request.c:583:lov_update_create_set()) shouldn't encounter -EAGAIN! Oct 23 10:54:28 pfscn2 kernel: : Pid: 9651, comm: ptlrpcd Not tainted 2.6.32-358.11.1.el6_lustre.es124.x86_64 #1 Oct 23 10:54:28 pfscn2 kernel: : Call Trace: Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad3505>] ? lov_update_create_set+0x4f5/0x500 [lov] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0ad78cd>] ? cb_create_update+0x2d/0x100 [lov] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6f28b>] ? handle_async_create+0x7b/0x390 [osc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0a6fad9>] ? async_create_interpret+0x39/0x50 [osc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0811c9b>] ? ptlrpc_check_set+0x29b/0x1b00 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8150ecda>] ? schedule_timeout+0x19a/0x2e0 Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842f20>] ? ptlrpcd_check+0x1a0/0x230 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa08431be>] ? ptlrpcd+0x20e/0x370 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff81063310>] ? default_wake_function+0x0/0x20 Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0ca>] ? child_rip+0xa/0x20 Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffffa0842fb0>] ? ptlrpcd+0x0/0x370 [ptlrpc] Oct 23 10:54:28 pfscn2 kernel: : [<ffffffff8100c0c0>] ? child_rip+0x0/0x20 Oct 23 10:54:28 pfscn2 kernel: : LustreError: 24178:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0xd948 sub-object on OST idx 0/1: rc = -5

          Hi Kit,

          could you please test with the following debug patch

          diff --git a/lustre/lov/lov_request.c b/lustre/lov/lov_request.c
          index 0fc635e..f0476cb 100644
          --- a/lustre/lov/lov_request.c
          +++ b/lustre/lov/lov_request.c
          @@ -577,6 +577,14 @@ static int lov_update_create_set(struct lov_request_set *se
                                  "on OST idx %d/%d: rc = %d\n",
                                  set->set_oi->oi_oa->o_id, req->rq_idx,
                                  lsm->lsm_stripe_count, rc);
          +
          +#ifdef __KERNEL__
          +               if (rc == -EAGAIN) {
          +                       CDEBUG(D_ERROR, "shouldn't encounter -EAGAIN!\n");
          +                       dump_stack();
          +               }
          +#endif
          +
                           if (rc > 0) {
                                   CERROR("obd_create returned invalid err %d\n", rc);
                                   rc = -EIO;
          diff --git a/lustre/osc/osc_create.c b/lustre/osc/osc_create.c
          index ffc81d4..22265bd 100644
          --- a/lustre/osc/osc_create.c
          +++ b/lustre/osc/osc_create.c
          @@ -412,7 +412,7 @@ static int handle_async_create(struct ptlrpc_request *req, i
           
                   LASSERT_SPIN_LOCKED(&oscc->oscc_lock);
           
          -        if(rc)
          +        if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                           GOTO(out_wake, rc);
           
                   /* Handle the critical type errors first.
          
          hongchao.zhang Hongchao Zhang added a comment - Hi Kit, could you please test with the following debug patch diff --git a/lustre/lov/lov_request.c b/lustre/lov/lov_request.c index 0fc635e..f0476cb 100644 --- a/lustre/lov/lov_request.c +++ b/lustre/lov/lov_request.c @@ -577,6 +577,14 @@ static int lov_update_create_set(struct lov_request_set *se "on OST idx %d/%d: rc = %d\n" , set->set_oi->oi_oa->o_id, req->rq_idx, lsm->lsm_stripe_count, rc); + +#ifdef __KERNEL__ + if (rc == -EAGAIN) { + CDEBUG(D_ERROR, "shouldn't encounter -EAGAIN!\n" ); + dump_stack(); + } +#endif + if (rc > 0) { CERROR( "obd_create returned invalid err %d\n" , rc); rc = -EIO; diff --git a/lustre/osc/osc_create.c b/lustre/osc/osc_create.c index ffc81d4..22265bd 100644 --- a/lustre/osc/osc_create.c +++ b/lustre/osc/osc_create.c @@ -412,7 +412,7 @@ static int handle_async_create(struct ptlrpc_request *req, i LASSERT_SPIN_LOCKED(&oscc->oscc_lock); - if (rc) + if (rc != 0 && rc != -ENOTCONN && rc != -ENODEV) GOTO(out_wake, rc); /* Handle the critical type errors first.

          The fix that worked on my test system didn't work on the customer system, so it might have been a fluke.

          I haven't seen that message in the new logs. Here is the latest server dk's:

          00000100:00100000:14.0:1381243911.199943:0:4598:0:(import.c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5)  req@ffff8808181fb400 x1448249214211642/t0(
          0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1
          00000100:00100000:13.0:1381243911.212619:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:
          1448249214211637:172.26.17.3@o2ib:-1
          00020000:00020000:13.0:1381243911.212624:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -11
          00000100:00100000:13.0:1381243911.225308:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:
          1448249214211638:172.26.17.3@o2ib:-1
          00020000:00020000:13.0:1381243911.225312:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -11
          00000100:00100000:13.0:1381243911.250579:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1:
          1448249214211639:172.26.17.3@o2ib:-1
          00020000:00020000:6.0:1381243911.250579:0:29944:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -5
          00000100:00100000:13.0:1381243911.250586:0:4596:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:e76ce10d-1d27-cdc4-1091-649094c70331:4596:1448249214211641:172.26.17.1@o2ib:400
          00020000:00020000:5.0:1381243911.250612:0:4755:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -5
          00000100:00100000:13.0:1381243911.250638:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:0:1448249214211644:172.26.17.3@o2ib:13
          00000100:00100000:5.0:1381243911.250741:0:4755:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:94b08aa2-54d7-b32e-019d-2561ed3286b5+6:58211:x1447047461344560:12345-172.26.4.3@o2ib:101 Request procesed in 49643595us (49643620us total) trans 0 rc 301/301
          
          kitwestneat Kit Westneat (Inactive) added a comment - The fix that worked on my test system didn't work on the customer system, so it might have been a fluke. I haven't seen that message in the new logs. Here is the latest server dk's: 00000100:00100000:14.0:1381243911.199943:0:4598:0:( import .c:725:ptlrpc_connect_import()) @@@ (re)connect request (timeout 5) req@ffff8808181fb400 x1448249214211642/t0( 0) o8->pfscdat2-OST0000-osc-MDT0000@172.26.17.3@o2ib:28/4 lens 368/512 e 0 to 0 dl 0 ref 1 fl New:N/0/ffffffff rc 0/-1 00000100:00100000:13.0:1381243911.212619:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211637:172.26.17.3@o2ib:-1 00020000:00020000:13.0:1381243911.212624:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -11 00000100:00100000:13.0:1381243911.225308:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211638:172.26.17.3@o2ib:-1 00020000:00020000:13.0:1381243911.225312:0:4596:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -11 00000100:00100000:13.0:1381243911.250579:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:-1: 1448249214211639:172.26.17.3@o2ib:-1 00020000:00020000:6.0:1381243911.250579:0:29944:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x10f5a sub-object on OST idx 0/1: rc = -5 00000100:00100000:13.0:1381243911.250586:0:4596:0:(client.c:1434:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:e76ce10d-1d27-cdc4-1091-649094c70331:4596:1448249214211641:172.26.17.1@o2ib:400 00020000:00020000:5.0:1381243911.250612:0:4755:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x18c7 sub-object on OST idx 0/1: rc = -5 00000100:00100000:13.0:1381243911.250638:0:4596:0:(client.c:1773:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:pfscdat2-MDT0000-mdtlov_UUID:0:1448249214211644:172.26.17.3@o2ib:13 00000100:00100000:5.0:1381243911.250741:0:4755:0:(service.c:1771:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc mdt_02:94b08aa2-54d7-b32e-019d-2561ed3286b5+6:58211:x1447047461344560:12345-172.26.4.3@o2ib:101 Request procesed in 49643595us (49643620us total) trans 0 rc 301/301

          Hi Kit,

          Sorry for delayed response, we're in holiday in the last several days.

          if "OSCC_FLAG_DEGRADED" is set, then this OST won't be used preferentially when creating object. do you only test the failover on one OSS?

          BTW, is there something liken "Unknown rc XXX from async create: failing oscc" in the kernel logs?

          Thanks

          hongchao.zhang Hongchao Zhang added a comment - Hi Kit, Sorry for delayed response, we're in holiday in the last several days. if "OSCC_FLAG_DEGRADED" is set, then this OST won't be used preferentially when creating object. do you only test the failover on one OSS? BTW, is there something liken "Unknown rc XXX from async create: failing oscc" in the kernel logs? Thanks

          I tested adding the degraded flag to the oscc and it seemed to fix the problem on my test system:

                  if(rc != 0 && rc != -ENOTCONN && rc != -ENODEV)
                          GOTO(out_wake, rc);
                     
          +        if (rc == -ENOTCONN || rc ==  -ENODEV)
          +                oscc->oscc_flags |= OSCC_FLAG_DEGRADED;
          
                  /* Handle the critical type errors first.
          

          I ran into one issue where the OST ran out of inodes and it caused the lov_create threads on the MDT to deadlock. I'm not sure if that's a new behavior or in stock 2.1.6 as well. I'll investigate.

          kitwestneat Kit Westneat (Inactive) added a comment - I tested adding the degraded flag to the oscc and it seemed to fix the problem on my test system: if (rc != 0 && rc != -ENOTCONN && rc != -ENODEV) GOTO(out_wake, rc); + if (rc == -ENOTCONN || rc == -ENODEV) + oscc->oscc_flags |= OSCC_FLAG_DEGRADED; /* Handle the critical type errors first. I ran into one issue where the OST ran out of inodes and it caused the lov_create threads on the MDT to deadlock. I'm not sure if that's a new behavior or in stock 2.1.6 as well. I'll investigate.

          Hi Hongchao,

          Any updates?

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment - Hi Hongchao, Any updates? Thanks, Kit

          Ah ok, that makes sense. Should handle_async_create add the RECOVERING flag to the oscc? It seems like osc_interpret_create would add it, but is it possible that it could get to oscc_internal_create without that flag?

          kitwestneat Kit Westneat (Inactive) added a comment - Ah ok, that makes sense. Should handle_async_create add the RECOVERING flag to the oscc? It seems like osc_interpret_create would add it, but is it possible that it could get to oscc_internal_create without that flag?

          Hi
          the fake requests will be queued into osc_creater.oscc_wait_create_list in osc_create_async to wait the objects creation to complete, and it will be woken up
          in osc_interpret_create with returned value from OST. in this case, -ENOTCONN or -ENODEV is returned and failed these fake requests, which should continue
          waiting the recovery of OST to complete and the objects will be ready then.

          I'll create a new debug patch with the http://review.whamcloud.com/#/c/7559/

          Thanks!

          hongchao.zhang Hongchao Zhang added a comment - Hi the fake requests will be queued into osc_creater.oscc_wait_create_list in osc_create_async to wait the objects creation to complete, and it will be woken up in osc_interpret_create with returned value from OST. in this case, -ENOTCONN or -ENODEV is returned and failed these fake requests, which should continue waiting the recovery of OST to complete and the objects will be ready then. I'll create a new debug patch with the http://review.whamcloud.com/#/c/7559/ Thanks!

          Hi Hongchao,

          I was trying to follow the create logic some and it's confused me some. It seems like the fake requests are handled by oscc_internal_create, but then if the oscc is recovering (due to the EAGAIN) all it does is return 0. Am I reading that right?

          Thanks,
          Kit

          kitwestneat Kit Westneat (Inactive) added a comment - Hi Hongchao, I was trying to follow the create logic some and it's confused me some. It seems like the fake requests are handled by oscc_internal_create, but then if the oscc is recovering (due to the EAGAIN) all it does is return 0. Am I reading that right? Thanks, Kit

          Hi Hongchao,

          The patch fixed the data corruption, but we are still seeing the same errors on failover. Should we reproduce the issue with the original debugging patch + fix? Here are the kernel messages from the fix (patch 7559):

          Sep 25 15:16:49 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19
          Sep 25 15:16:49 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete
          Sep 25 15:17:39 pfscn2 kernel: : LustreError: 9001:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -11
          Sep 25 15:17:39 pfscn2 kernel: : LustreError: 7308:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -5
          Sep 25 15:17:40 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib)
          Sep 25 15:17:40 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans
          
          kitwestneat Kit Westneat (Inactive) added a comment - Hi Hongchao, The patch fixed the data corruption, but we are still seeing the same errors on failover. Should we reproduce the issue with the original debugging patch + fix? Here are the kernel messages from the fix (patch 7559): Sep 25 15:16:49 pfscn2 kernel: : LustreError: 11-0: an error occurred while communicating with 172.26.17.3@o2ib. The ost_setattr operation failed with -19 Sep 25 15:16:49 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection to pfscdat2-OST0000 (at 172.26.17.3@o2ib) was lost; in progress operations using this service will wait for recovery to complete Sep 25 15:17:39 pfscn2 kernel: : LustreError: 9001:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -11 Sep 25 15:17:39 pfscn2 kernel: : LustreError: 7308:0:(lov_request.c:579:lov_update_create_set()) error creating fid 0x3288 sub-object on OST idx 0/1: rc = -5 Sep 25 15:17:40 pfscn2 kernel: : Lustre: pfscdat2-OST0000-osc-MDT0000: Connection restored to pfscdat2-OST0000 (at 172.26.17.4@o2ib) Sep 25 15:17:40 pfscn2 kernel: : Lustre: MDS mdd_obd-pfscdat2-MDT0000: pfscdat2-OST0000_UUID now active, resetting orphans

          People

            hongchao.zhang Hongchao Zhang
            gshilamkar Girish Shilamkar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: