Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • None
    • Lustre 2.4.3
    • None
    • Fedora 19 x86_64 on Washington Pass nodes, 1GbE & FDR IB

    Description

      Lustre Version: 2.4.52
      Kernel: patchless_client
      Build: v2_4_52 0-gfdd4844-CHANGED-3.9.9-302.fc19.x86_64

      2nd Instance:
      -----------------
      From: Cledat, Romain E
      Sent: Monday, September 08, 2014 4:34 PM
      To: Bernel, BrianX D
      Subject: Error

      Message from syslogd@bar1 at Sep 8 16:08:51 ...
      kernel:[1235195.162972] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 r
      Message from syslogd@bar1 at Sep 8 16:08:51 ...
      kernel:[1235195.193211] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

      1st Instance:
      -----------------
      From: Nickerson, Brian R
      Sent: Thursday, August 14, 2014 3:44 PM
      To: Bernel, BrianX D; Cledat, Romain E
      Subject: Kernel crash details

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.270451] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.271008] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.271830] Kernel panic - not syncing: LBUG

      Attachments

        Issue Links

          Activity

            [LU-5599] Lustre Error: Impossible state: 4
            pjones Peter Jones added a comment -

            Out of date ticket

            pjones Peter Jones added a comment - Out of date ticket

            Is there set of instructions somewhere on how to apply these patches?

            bdbernex Brian Bernel (Inactive) added a comment - Is there set of instructions somewhere on how to apply these patches?

            Peter,

            Will do.

            Jay

            jaylan Jay Lan (Inactive) added a comment - Peter, Will do. Jay
            pjones Peter Jones added a comment -

            Jay

            I think that it makes sense to open a separate ticket for the NASA issue because the environments are really quite different and the root cause could be different

            Peter

            pjones Peter Jones added a comment - Jay I think that it makes sense to open a separate ticket for the NASA issue because the environments are really quite different and the root cause could be different Peter
            bobijam Zhenyu Xu added a comment - Can you try apply these patches? http://review.whamcloud.com/#/c/8530/ http://review.whamcloud.com/#/c/9783/ http://review.whamcloud.com/#/c/9194/ http://review.whamcloud.com/#/c/9876/

            We had both of them in our 2.4.3-7nasC.

            jaylan Jay Lan (Inactive) added a comment - We had both of them in our 2.4.3-7nasC.
            bobijam Zhenyu Xu added a comment - - edited

            It looks like LU-3433, and patch for 2.4 is here http://review.whamcloud.com/#/c/9194/, we also better has LU-4558 patch land for 2.4 http://review.whamcloud.com/#/c/9876/

            2.5 already has them.

            bobijam Zhenyu Xu added a comment - - edited It looks like LU-3433 , and patch for 2.4 is here http://review.whamcloud.com/#/c/9194/ , we also better has LU-4558 patch land for 2.4 http://review.whamcloud.com/#/c/9876/ 2.5 already has them.

            NASA Ames hit this bug this morning on its 2.4.3 -7nasC client running sles11sp3 kernel.

            jaylan Jay Lan (Inactive) added a comment - NASA Ames hit this bug this morning on its 2.4.3 -7nasC client running sles11sp3 kernel.

            Related correspondence and screen errors:

            Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump .

            Romain

            From: Romain Cledat <romain.e.cledat@intel.com>
            Date: Saturday, November 8, 2014 at 12:19 PM
            To: "Bernel, BrianX D" <brianx.d.bernel@intel.com>
            Subject: Your first kdump

            Hello,

            It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah)

            Nov 8 04:51:12 bar1 systemd-logind[748]: New session 17884 of user vincentc.
            Nov 7 22:01:56 bar1 rsyslogd: [origin software="rsyslogd" swVersion="7.2.6" x-pid="737" x-info="http://www.rsyslog.com"] start

            I think the Nov 7 date is because the clock get reset to a bad value when rebooting. At the end of the reboot, you have:

            Nov 7 22:03:11 bar1 systemd[1]: Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s.
            Nov 7 22:03:11 bar1 abrt-server[2784]: No actions are found for event 'notify'
            Nov 8 05:03:23 bar1 chronyd[717]: Selected source 149.20.68.17
            Nov 8 05:03:23 bar1 chronyd[717]: System clock wrong by 25198.383400 seconds, adjustment started
            Nov 8 05:03:23 bar1 chronyd[717]: System clock was stepped by 25198.383 seconds
            Nov 8 05:03:23 bar1 systemd[1]: Time has been changed

            So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines:
            [ 2.431054] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
            [ 2.441581] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
            [ 2.452016] Pid: 29800, comm: git
            [ 2.452017] \x0aCall Trace:

            What do you know, it happens to be Lustre again .

            Romain

            PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though…

            bdbernex Brian Bernel (Inactive) added a comment - Related correspondence and screen errors: Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump . Romain From: Romain Cledat <romain.e.cledat@intel.com> Date: Saturday, November 8, 2014 at 12:19 PM To: "Bernel, BrianX D" <brianx.d.bernel@intel.com> Subject: Your first kdump Hello, It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah) Nov 8 04:51:12 bar1 systemd-logind [748] : New session 17884 of user vincentc. Nov 7 22:01:56 bar1 rsyslogd: [origin software="rsyslogd" swVersion="7.2.6" x-pid="737" x-info="http://www.rsyslog.com"] start I think the Nov 7 date is because the clock get reset to a bad value when rebooting. At the end of the reboot, you have: Nov 7 22:03:11 bar1 systemd [1] : Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s. Nov 7 22:03:11 bar1 abrt-server [2784] : No actions are found for event 'notify' Nov 8 05:03:23 bar1 chronyd [717] : Selected source 149.20.68.17 Nov 8 05:03:23 bar1 chronyd [717] : System clock wrong by 25198.383400 seconds, adjustment started Nov 8 05:03:23 bar1 chronyd [717] : System clock was stepped by 25198.383 seconds Nov 8 05:03:23 bar1 systemd [1] : Time has been changed So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines: [ 2.431054] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4 [ 2.441581] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG [ 2.452016] Pid: 29800, comm: git [ 2.452017] \x0aCall Trace: What do you know, it happens to be Lustre again . Romain PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though…

            Recent Kernel Dump information from bar1 on X-Stack:

            [ 3.911516] hpgmg-fv[96927]: segfault at d ip 00007fd22c315d3e sp 00007fd20796bd80 error 6 in libocr.so[7fd22c300000+2e000]
            [ 3.252141] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 3.828093] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 4.189668] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 4.197074] LustreError: Skipped 1 previous similar message
            [ 0.136898] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 3.395164] hpgmg-fv[5886]: segfault at 7f91ecb51d19 ip 00007f91ecb48d3e sp 00007f91450c0d10 error 7 in libocr.so[7f91ecb33000+2e000]
            [ 1.541619] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 1.549642] LustreError: Skipped 1 previous similar message
            [ 0.870999] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 1.648428] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 0.486506] hpgmg-fv[7547]: segfault at 7f5a27153d19 ip 00007f5a2714ad3e sp 00007f5978fb3d10 error 7 in libocr.so[7f5a27135000+2e000]
            [ 0.824354] hpgmg-fv[11206]: segfault at d ip 00007f4f43557d3e sp 00007f4e9bacfd10 error 6 in libocr.so[7f4f43542000+2e000]
            [ 3.258284] hpgmg-fv[11745]: segfault at d ip 00007fb77412ad3e sp 00007fb6cd6a4d10 error 6 in libocr.so[7fb774115000+2e000]
            [ 2.546239] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118057617399716 mdc close failed: rc = -13
            [ 2.559492] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144117957591658911 mdc close failed: rc = -13
            [ 1.467034] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052861374 mdc close failed: rc = -13
            [ 1.476398] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 40 previous similar messages
            [ 2.468531] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052858204 mdc close failed: rc = -13
            [ 2.478511] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 246 previous similar messages
            [ 0.176572] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052850460 mdc close failed: rc = -13
            [ 0.186522] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 473 previous similar messages
            [ 0.030234] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118024767594400 mdc close failed: rc = -13
            [ 0.039727] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 39 previous similar messages
            [ 2.904305] hpgmg-fv invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
            [ 2.904309] hpgmg-fv cpuset=/ mems_allowed=0-1
            [ 2.904311] Pid: 11573, comm: hpgmg-fv Tainted: GF O 3.9.9-302.fc19.x86_64 #1

            bdbernex Brian Bernel (Inactive) added a comment - Recent Kernel Dump information from bar1 on X-Stack: [ 3.911516] hpgmg-fv [96927] : segfault at d ip 00007fd22c315d3e sp 00007fd20796bd80 error 6 in libocr.so [7fd22c300000+2e000] [ 3.252141] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 3.828093] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 4.189668] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 4.197074] LustreError: Skipped 1 previous similar message [ 0.136898] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 3.395164] hpgmg-fv [5886] : segfault at 7f91ecb51d19 ip 00007f91ecb48d3e sp 00007f91450c0d10 error 7 in libocr.so [7f91ecb33000+2e000] [ 1.541619] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 1.549642] LustreError: Skipped 1 previous similar message [ 0.870999] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 1.648428] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11. [ 0.486506] hpgmg-fv [7547] : segfault at 7f5a27153d19 ip 00007f5a2714ad3e sp 00007f5978fb3d10 error 7 in libocr.so [7f5a27135000+2e000] [ 0.824354] hpgmg-fv [11206] : segfault at d ip 00007f4f43557d3e sp 00007f4e9bacfd10 error 6 in libocr.so [7f4f43542000+2e000] [ 3.258284] hpgmg-fv [11745] : segfault at d ip 00007fb77412ad3e sp 00007fb6cd6a4d10 error 6 in libocr.so [7fb774115000+2e000] [ 2.546239] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118057617399716 mdc close failed: rc = -13 [ 2.559492] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144117957591658911 mdc close failed: rc = -13 [ 1.467034] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052861374 mdc close failed: rc = -13 [ 1.476398] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 40 previous similar messages [ 2.468531] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052858204 mdc close failed: rc = -13 [ 2.478511] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 246 previous similar messages [ 0.176572] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052850460 mdc close failed: rc = -13 [ 0.186522] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 473 previous similar messages [ 0.030234] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118024767594400 mdc close failed: rc = -13 [ 0.039727] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 39 previous similar messages [ 2.904305] hpgmg-fv invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [ 2.904309] hpgmg-fv cpuset=/ mems_allowed=0-1 [ 2.904311] Pid: 11573, comm: hpgmg-fv Tainted: GF O 3.9.9-302.fc19.x86_64 #1

            People

              bogl Bob Glossman (Inactive)
              bdbernex Brian Bernel (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: