Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • Lustre 2.4.0
    • Lustre 2.4.0
    • 3
    • Orion
    • 3037

    Description

      Observed after rebooting the grove-mds2 on to our latest kernel and lustre-orion tag. The server was rebooted while under a fairly heavy test load. No detailed investigation has been done yet, the MDS was able to successfully restart after rebooting the node again and claims to have successfully completed recovery this time.

      — First boot —

      Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000
      Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64
      Lustre: MGS: Mounted grove-mds2/mgs
      Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
      LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2
      LustreError: 4503:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: 4508:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.48@o2ib500
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.2.199@o2ib500
      LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.51@o2ib500
      LNet: 26244:0:(o2iblnd_cb.c:2340:kiblnd_passive_connect()) Conn race 172.20.4.44@o2ib500
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.53@o2ib500
      Lustre: Skipped 186 previous similar messages
      LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
      LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
      LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
      LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can't cancel record: -2
      LustreError: 4567:0:(llog_cat.c:186:llog_cat_id2handle()) error opening log id 0x5a5a5a5a5a5a5a5a:5a5a5a5a: rc -2
      LustreError: 4567:0:(llog_cat.c:503:llog_cat_cancel_records()) Cannot find log 0x5a5a5a5a5a5a5a5a
      LustreError: 4567:0:(llog_cat.c:533:llog_cat_cancel_records()) Cancel 0 of 1 llog-records failed: -2
      LustreError: 4567:0:(osp_sync.c:705:osp_sync_process_committed()) lstest-OST0281-osc-MDT0000: can't cancel record: -2
      general protection fault: 0000 1 SMP
      last sysfs file: /sys/module/sg/initstate
      CPU 7

      Pid: 4567, comm: osp-syn-641
      Tainted: P W ---------------- 2.6.32-220.17.1.3chaos.ch5.x86_64 #1 Supermicro X8DTH-i/6/iF/6F/X8DTH
      RIP: 0010:[<ffffffffa0694252>] [<ffffffffa0694252>] __llog_process_thread+0x2a2/0xc80 [obdclass]
      RSP: 0018:ffff882f66e9db60 EFLAGS: 00010206
      RAX: 5a5a5a5a5a5a5a5a RBX: 0000000000008701 RCX: 0000000000000000
      RDX: 000000000021e000 RSI: 0000000000000001 RDI: ffff882f63cfe000
      RBP: ffff882f66e9dc10 R08: ffff88179d8f1900 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: ffff88179caf4058
      R13: 000000000000fcff R14: ffff882f63cfc000 R15: ffff882f63cfe000
      FS: 00007ffff7fdc700(0000) GS:ffff881894820000(0000) knlGS:0000000000000000
      CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007ffff7ff9000 CR3: 0000000001a85000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process osp-syn-641
      (pid: 4567, threadinfo ffff882f66e9c000, task ffff882ff9988ae0)
      Stack:
      0000000000002000 0000000000000050 0000000000000010 0000000000000000
      <0> ffff882f63cfc001 000086fe00000000 0000000000000000 ffff882f640124c0
      <0> ffff882f66e9de80 0000000000000000 000000000021e000 0000fd009aee4b80
      Call Trace:
      [<ffffffffa0dd4570>] ? osp_sync_process_queues+0x0/0xf60 [osp]
      [<ffffffffa0694d33>] __llog_process+0x103/0x4d0 [obdclass]
      [<ffffffffa06961bb>] llog_cat_process_cb+0x21b/0x290 [obdclass]
      [<ffffffffa06946fe>] __llog_process_thread+0x74e/0xc80 [obdclass]
      [<ffffffff810618d4>] ? enqueue_task_fair+0x64/0x100
      [<ffffffffa0695fa0>] ? llog_cat_process_cb+0x0/0x290 [obdclass]
      [<ffffffffa0694d33>] __llog_process+0x103/0x4d0 [obdclass]
      [<ffffffffa06953d8>] __llog_cat_process+0x98/0x260 [obdclass]
      [<ffffffffa0dd4570>] ? osp_sync_process_queues+0x0/0xf60 [osp]
      [<ffffffff81051ba3>] ? __wake_up+0x53/0x70
      [<ffffffffa0dd64f2>] osp_sync_thread+0x1c2/0x620 [osp]
      [<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
      [<ffffffffa0dd6330>] ? osp_sync_thread+0x0/0x620 [osp]
      [<ffffffff8100c140>] ? child_rip+0x0/0x20

      — Second boot —

      Mounting grove-mds2/mgs on /mnt/lustre/local/lstest-MGS0000
      Lustre: Lustre: Build Version: orion-2_2_49_57_3-49chaos-49chaos--PRISTINE-2.6.32-220.17.1.3chaos.ch5.x86_64
      Lustre: MGS: Mounted grove-mds2/mgs
      Mounting grove-mds2/mdt0 on /mnt/lustre/local/lstest-MDT0000
      LustreError: 11-0: MGC172.20.5.2@o2ib500: Communicating with 0@lo, operation llog_origin_handle_create failed with -2
      LustreError: 4525:0:(mgc_request.c:250:do_config_log_add()) failed processing sptlrpc log: -2
      Lustre: 4530:0:(fld_index.c:356:fld_index_init()) srv-lstest-MDT0000: File "fld" doesn't support range lookup, using stub. DNE and FIDs on OST will not work with this backend
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.183@o2ib500
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.22@o2ib500
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.24@o2ib500
      Lustre: Skipped 3 previous similar messages
      Lustre: lstest-MDT0000: Temporarily refusing client connection from 172.20.3.36@o2ib500
      Lustre: Skipped 2 previous similar messages
      Lustre: lstest-MDT0000: Mounted grove-mds2/mdt0
      Lustre: lstest-MDT0000: Will be in recovery for at least 5:00, or until 255 clients reconnect.
      LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16
      Lustre: lstest-MDT0000: Recovery over after 1:10, of 255 clients 255 recovered and 0 were evicted.
      LustreError: 11-0: lstest-OST0282-osc-MDT0000: Communicating with 172.20.4.42@o2ib500, operation ost_connect failed with -16

      Attachments

        Issue Links

          Activity

            [LU-2109] __llog_process_thread() GPF

            Mike, I can't recall hitting this since pulling in the patch. Thanks!

            prakash Prakash Surya (Inactive) added a comment - Mike, I can't recall hitting this since pulling in the patch. Thanks!

            closing for now, reopen if it will appear again

            tappro Mikhail Pershin added a comment - closing for now, reopen if it will appear again
            tappro Mikhail Pershin added a comment - - edited

            patch was landed. Prakash, are you OK to close this ticket?

            tappro Mikhail Pershin added a comment - - edited patch was landed. Prakash, are you OK to close this ticket?

            Mike, I pulled in that patch. I've also done some initial testing and it looks good so far. I've rebooted the MDS a few times while under light load, and haven't hit the issue yet (with the fix). Without the fix, I was able to hit it on the first try under the same load.

            prakash Prakash Surya (Inactive) added a comment - Mike, I pulled in that patch. I've also done some initial testing and it looks good so far. I've rebooted the MDS a few times while under light load, and haven't hit the issue yet (with the fix). Without the fix, I was able to hit it on the first try under the same load.

            http://review.whamcloud.com/#change,4745

            Attempt to introduce refcounter for llog handler. Please verify it is correct and works.

            tappro Mikhail Pershin added a comment - http://review.whamcloud.com/#change,4745 Attempt to introduce refcounter for llog handler. Please verify it is correct and works.

            Li Wei, this is exactly what OBD_FAIL_CHECK() is for - the ability to be able to trigger specific errors in the code at runtime instead of having to change it at compile time.

            adilger Andreas Dilger added a comment - Li Wei, this is exactly what OBD_FAIL_CHECK() is for - the ability to be able to trigger specific errors in the code at runtime instead of having to change it at compile time.

            http://review.whamcloud.com/4696

            A fix. For some reason, I wasn't able to get my new regression test working correctly today. Instead, I tested this patch with this hack:

            --- a/lustre/obdclass/llog.c
            +++ b/lustre/obdclass/llog.c
            @@ -137,7 +137,8 @@ int llog_cancel_rec(const struct lu_env *env, struct llog_ha
             
                     if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) &&
                         (llh->llh_count == 1) &&
            -            (loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1)) {
            +            (loghandle->lgh_last_idx >= 1)) {
            +               loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1;
                            cfs_spin_unlock(&loghandle->lgh_hdr_lock);
                            rc = llog_destroy(env, loghandle);
                            if (rc < 0) {
            
            liwei Li Wei (Inactive) added a comment - http://review.whamcloud.com/4696 A fix. For some reason, I wasn't able to get my new regression test working correctly today. Instead, I tested this patch with this hack: --- a/lustre/obdclass/llog.c +++ b/lustre/obdclass/llog.c @@ -137,7 +137,8 @@ int llog_cancel_rec( const struct lu_env *env, struct llog_ha if ((llh->llh_flags & LLOG_F_ZAP_WHEN_EMPTY) && (llh->llh_count == 1) && - (loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1)) { + (loghandle->lgh_last_idx >= 1)) { + loghandle->lgh_last_idx == (LLOG_BITMAP_BYTES * 8) - 1; cfs_spin_unlock(&loghandle->lgh_hdr_lock); rc = llog_destroy(env, loghandle); if (rc < 0) {
            prakash Prakash Surya (Inactive) added a comment - - edited

            I opened LU-2394 for the cat_cancel_cb and a link to the patch to address the issue.

            prakash Prakash Surya (Inactive) added a comment - - edited I opened LU-2394 for the cat_cancel_cb and a link to the patch to address the issue.

            Prakash, that is different issue, could you add new ticket with data you have?

            tappro Mikhail Pershin added a comment - Prakash, that is different issue, could you add new ticket with data you have?

            as for cat_cancel_cb() I found that llog_cat_init_and_process() forks, so then it's running with main osp's llog_process() in parallel. I think llog_cat_init_and_process() should not fork.

            bzzz Alex Zhuravlev added a comment - as for cat_cancel_cb() I found that llog_cat_init_and_process() forks, so then it's running with main osp's llog_process() in parallel. I think llog_cat_init_and_process() should not fork.

            People

              tappro Mikhail Pershin
              behlendorf Brian Behlendorf
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: