LU-3274

osc_cache.c:1774:osc_dec_unstable_pages()) ASSERTION( atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0 ) failed

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Fix Version/s: Lustre 2.6.0
    • Affects Version/s: Lustre 2.4.0
    • None
    • 3
    • 8111

    Description

      Hit this running replay-single in a loop

      [123247.989106] Lustre: DEBUG MARKER: == replay-single test 87b: write replay with changed data (checksum resend) == 00:41:34 (1367728894)
      [123248.537768] Turning device loop1 (0x700001) read-only
      [123248.562834] Lustre: DEBUG MARKER: ost1 REPLAY BARRIER on lustre-OST0000
      [123248.580423] Lustre: DEBUG MARKER: local REPLAY BARRIER on lustre-OST0000
      [123250.719913] Lustre: DEBUG MARKER: cancel_lru_locks osc start
      [123250.880677] Lustre: DEBUG MARKER: cancel_lru_locks osc stop
      [123251.389028] Removing read-only on unknown block (0x700001)
      [123263.314546] LDISKFS-fs (loop1): recovery complete
      [123263.357806] LDISKFS-fs (loop1): mounted filesystem with ordered data mode. quota=on. Opts: 
      [123268.399242] LustreError: 168-f: BAD WRITE CHECKSUM: lustre-OST0000 from 12345-0@lo inode [0x2000061c0:0x5:0x0] object 0x0:7458 extent [0-1048575]: client csum 7945acf6, server csum b9e6e441
      [123268.477228] LustreError: 17404:0:(osc_cache.c:1774:osc_dec_unstable_pages()) ASSERTION( atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0 ) failed: 
      [123268.477757] LustreError: 17404:0:(osc_cache.c:1774:osc_dec_unstable_pages()) LBUG
      [123268.478168] Pid: 17404, comm: ptlrpcd_rcv
      [123268.478381] 
      [123268.478381] Call Trace:
      [123268.478752]  [<ffffffffa0e018a5>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
      [123268.479019]  [<ffffffffa0e01ea7>] lbug_with_loc+0x47/0xb0 [libcfs]
      [123268.479981]  [<ffffffffa0458bcc>] osc_dec_unstable_pages+0x12c/0x190 [osc]
      [123268.480270]  [<ffffffffa114d76b>] ptlrpc_free_committed+0x14b/0x620 [ptlrpc]
      [123268.480593]  [<ffffffffa114f4e3>] after_reply+0x7a3/0xd90 [ptlrpc]
      [123268.480871]  [<ffffffffa1154493>] ptlrpc_check_set+0x1093/0x1da0 [ptlrpc]
      [123268.481150]  [<ffffffffa1180e2b>] ptlrpcd_check+0x55b/0x590 [ptlrpc]
      [123268.481420]  [<ffffffffa1181373>] ptlrpcd+0x233/0x390 [ptlrpc]
      [123268.481665]  [<ffffffff8105ad10>] ? default_wake_function+0x0/0x20
      [123268.481963]  [<ffffffffa1181140>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [123268.482220]  [<ffffffff8100c10a>] child_rip+0xa/0x20
      [123268.482465]  [<ffffffffa1181140>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [123268.482733]  [<ffffffffa1181140>] ? ptlrpcd+0x0/0x390 [ptlrpc]
      [123268.482978]  [<ffffffff8100c100>] ? child_rip+0x0/0x20
      [123268.483209] 
      [123268.543751] Kernel panic - not syncing: LBUG
      

      This is roughly current master plus 3 more LU-2139 patches applied on top.
      I have a crashdump in /exports/crashdumps/192.168.10.221-2013-05-05-00\:41\:57/
      code tag: master-20130505
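
      For context on the failing LASSERT: it protects a per-filesystem counter of "unstable" pages, i.e. pages handed to a bulk write RPC whose server transaction has not yet committed. The counter is incremented when the pages are sent and decremented when the commit callback for that request runs, so it can only go negative if a decrement arrives without a matching increment, or the same decrement runs twice. The sketch below is a userspace illustration of that invariant (C11 atomics instead of the kernel's atomic_t, with illustrative helper names); it is not Lustre code.

      #include <assert.h>
      #include <stdatomic.h>

      static atomic_long unstable_nr;              /* stands in for cl_cache->ccc_unstable_nr */

      static void inc_unstable_pages(long pages)   /* pages handed to a bulk write RPC */
      {
              atomic_fetch_add(&unstable_nr, pages);
      }

      static void dec_unstable_pages(long pages)   /* server reports the transaction committed */
      {
              atomic_fetch_sub(&unstable_nr, pages);
              /* the invariant behind the LASSERT that fired in this ticket */
              assert(atomic_load(&unstable_nr) >= 0);
      }

      int main(void)
      {
              inc_unstable_pages(256);
              dec_unstable_pages(256);             /* balanced: counter returns to 0 */
              dec_unstable_pages(256);             /* an unmatched decrement trips the assertion */
              return 0;
      }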

      Attachments

        Issue Links

          Activity

            yujian Jian Yu added a comment -

            Here is the back-ported patch for Lustre b2_5 branch: http://review.whamcloud.com/12613
            The patch depends on the patches for LU-2139 on Lustre b2_5 branch.


            jlevi Jodi Levi (Inactive) added a comment -

            Patch landed to Master. Please reopen ticket if more work is needed.

            prakash Prakash Surya (Inactive) added a comment -

            Jinshan, please see http://review.whamcloud.com/8219 . I tried to make a more understandable version of your fix from 8215. What do you think?
            jay Jinshan Xiong (Inactive) added a comment - patch is at: http://review.whamcloud.com/8215
            jay Jinshan Xiong (Inactive) added a comment - - edited

            It occurred in current master at: https://maloo.whamcloud.com/sub_tests/6f5c1b80-47cb-11e3-a71f-52540035b04c in the console messages of client 2:

            23:26:13:LustreError: 25093:0:(osc_cache.c:1807:osc_dec_unstable_pages()) ASSERTION( atomic_read(&cli->cl_cache->ccc_unstable_nr) >= 0 ) failed: 
            23:26:13:LustreError: 25093:0:(osc_cache.c:1807:osc_dec_unstable_pages()) LBUG
            23:26:13:Pid: 25093, comm: ptlrpcd_rcv
            23:26:13:
            23:26:13:Call Trace:
            23:26:13: [<ffffffffa0b58895>] libcfs_debug_dumpstack+0x55/0x80 [libcfs]
            23:26:14: [<ffffffffa0b58e97>] lbug_with_loc+0x47/0xb0 [libcfs]
            23:26:14: [<ffffffffa0f64f33>] osc_dec_unstable_pages+0x133/0x1a0 [osc]
            23:26:14: [<ffffffffa0d601c9>] ptlrpc_free_committed+0x149/0x620 [ptlrpc]
            23:26:14: [<ffffffffa0d61a74>] after_reply+0x7a4/0xd90 [ptlrpc]
            23:26:14: [<ffffffffa0d66ab1>] ptlrpc_check_set+0xf71/0x1b40 [ptlrpc]
            23:26:15: [<ffffffffa0d9120b>] ptlrpcd_check+0x53b/0x560 [ptlrpc]
            23:26:15: [<ffffffffa0d9185b>] ptlrpcd+0x33b/0x3f0 [ptlrpc]
            23:26:15: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
            23:26:15: [<ffffffffa0d91520>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
            23:26:15: [<ffffffff81096a36>] kthread+0x96/0xa0
            23:26:15: [<ffffffff8100c0ca>] child_rip+0xa/0x20
            23:26:15: [<ffffffff810969a0>] ? kthread+0x0/0xa0
            23:26:16: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            23:26:16:
            23:26:16:Kernel panic - not syncing: LBUG
            23:26:16:Pid: 25093, comm: ptlrpcd_rcv Not tainted 2.6.32-358.23.2.el6.x86_64 #1
            23:26:16:Call Trace:
            23:26:16: [<ffffffff8150daac>] ? panic+0xa7/0x16f
            23:26:17: [<ffffffffa0b58eeb>] ? lbug_with_loc+0x9b/0xb0 [libcfs]
            23:26:17: [<ffffffffa0f64f33>] ? osc_dec_unstable_pages+0x133/0x1a0 [osc]
            23:26:17: [<ffffffffa0d601c9>] ? ptlrpc_free_committed+0x149/0x620 [ptlrpc]
            23:26:17: [<ffffffffa0d61a74>] ? after_reply+0x7a4/0xd90 [ptlrpc]
            23:26:17: [<ffffffffa0d66ab1>] ? ptlrpc_check_set+0xf71/0x1b40 [ptlrpc]
            23:26:17: [<ffffffffa0d9120b>] ? ptlrpcd_check+0x53b/0x560 [ptlrpc]
            23:26:17: [<ffffffffa0d9185b>] ? ptlrpcd+0x33b/0x3f0 [ptlrpc]
            23:26:18: [<ffffffff81063990>] ? default_wake_function+0x0/0x20
            23:26:18: [<ffffffffa0d91520>] ? ptlrpcd+0x0/0x3f0 [ptlrpc]
            23:26:18: [<ffffffff81096a36>] ? kthread+0x96/0xa0
            23:26:18: [<ffffffff8100c0ca>] ? child_rip+0xa/0x20
            23:26:19: [<ffffffff810969a0>] ? kthread+0x0/0xa0
            23:26:19: [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
            

            prakash Prakash Surya (Inactive) added a comment -

            I looked over the code some more and I'm still unsure how this would occur. I don't think it is a result of rq_commit_cb getting called twice for the same request (via after_reply and ptlrpc_free_committed). AFAICT it will get run directly in after_reply, or get added to the imp_replay_list and run later via ptlrpc_free_committed, but not both. I'm running replay-single in a loop in a single node VM setup, but haven't hit the bug yet. Oleg, if you have some spare cycles it might be interesting to add this patch:

            diff --git i/lustre/osc/osc_request.c w/lustre/osc/osc_request.c
            index a1c740c..ac2eaf5 100644
            --- i/lustre/osc/osc_request.c
            +++ w/lustre/osc/osc_request.c
            @@ -2165,6 +2165,8 @@ int osc_build_rpc(const struct lu_env *env, struct client_obd *cli,
                            GOTO(out, rc);
                    }
             
            +       LASSERT(req->rq_committed == 0);
            +       LASSERT(req->rq_unstable == 0);
                    req->rq_commit_cb = brw_commit;
                    req->rq_interpret_reply = brw_interpret;
             
            

            I assumed these values would be already initialized to 0, but perhaps I'm wrong. If we allocate a request and it happens to have rq_unstable set to 1, and then brw_commit is called via rq_commit_cb without osc_inc_unstable_pages first being called, I could see this failure happening. But I have no evidence of this yet.
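
            To make the hypothesis above concrete, here is a toy userspace model (the struct field and function names are illustrative only, not the actual Lustre code paths): if a request reaches the commit callback with a stale rq_unstable-style flag already set, the decrement runs without any matching increment and the counter goes negative, which is exactly the condition the LASSERT catches.

            #include <assert.h>

            struct toy_request {
                    unsigned int rq_unstable:1;   /* should start out 0 */
                    long         rq_nr_pages;
            };

            static long unstable_nr;              /* per-filesystem unstable page count */

            static void toy_inc_unstable(struct toy_request *req)
            {
                    unstable_nr += req->rq_nr_pages;
                    req->rq_unstable = 1;
            }

            static void toy_commit_cb(struct toy_request *req)
            {
                    if (!req->rq_unstable)
                            return;
                    req->rq_unstable = 0;
                    unstable_nr -= req->rq_nr_pages;
                    assert(unstable_nr >= 0);     /* fires when the flag was stale */
            }

            int main(void)
            {
                    struct toy_request ok = { 0 }, stale = { 0 };

                    ok.rq_nr_pages = 128;
                    toy_inc_unstable(&ok);        /* normal path: increment ... */
                    toy_commit_cb(&ok);           /* ... matched by one decrement */

                    stale.rq_nr_pages = 256;
                    stale.rq_unstable = 1;        /* hypothesised stale flag, never incremented */
                    toy_commit_cb(&stale);        /* assertion fires: counter would go negative */
                    return 0;
            }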


            prakash Prakash Surya (Inactive) added a comment -

            OK. Let me see if I can get a possible explanation out of reading the code first, and I might want to look at those later to back up my theory.
            green Oleg Drokin added a comment -

            no, it was not triggered on maloo, it's on my private burn-in cluster that promotes race conditions. But if you are interested, i can make the vmcore, modules and the code snapshot available somehow.


            prakash Prakash Surya (Inactive) added a comment -

            Hmm, I never saw this failure during my testing. The assertion was added because the unstable count for a given FS should never become negative, which looks to be happening here. I'm curious if the commit callback is getting called twice for certain requests, once directly in after_reply, and then again in ptlrpc_free_committed (via after_reply). I'll read through the patch again and see if I can piece together what might be happening here. Was this failure triggered through Maloo?
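
            As a purely hypothetical sketch of the double-invocation concern (this is not the patch that eventually landed, and the names are made up for illustration): if the same commit callback could be reached from both after_reply() and ptlrpc_free_committed(), letting only the caller that atomically clears an "accounted" flag perform the decrement keeps a second call from driving the counter negative.

            #include <stdatomic.h>
            #include <stdbool.h>

            struct toy_req {
                    atomic_bool rq_unstable;      /* set when the pages were counted */
                    long        rq_nr_pages;
            };

            static atomic_long unstable_nr;

            static void toy_commit_cb(struct toy_req *req)
            {
                    bool was_unstable = true;

                    /* atomically claim the 1 -> 0 transition; later callers just return */
                    if (!atomic_compare_exchange_strong(&req->rq_unstable,
                                                        &was_unstable, false))
                            return;
                    atomic_fetch_sub(&unstable_nr, req->rq_nr_pages);
            }

            int main(void)
            {
                    struct toy_req req = { .rq_nr_pages = 64 };

                    atomic_store(&req.rq_unstable, true);
                    atomic_fetch_add(&unstable_nr, req.rq_nr_pages);

                    toy_commit_cb(&req);          /* first call performs the decrement */
                    toy_commit_cb(&req);          /* second call is a no-op, counter stays >= 0 */
                    return 0;
            }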
            green Oleg Drokin added a comment -

            Got another hit, replay-single again /exports/crashdumps/192.168.10.216-2013-05-06-02\:27\:44/vmcore
            This time in replay-single test 87, apparently there's some mismatch on replay? Double accounting by any chance?


            People

              Assignee: jay Jinshan Xiong (Inactive)
              Reporter: green Oleg Drokin
              Votes: 0
              Watchers: 8
