Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: Lustre 2.4.3
    • Labels: None
    • Environment: Fedora 19 x86_64 on Washington Pass nodes, 1GbE & FDR IB

    Description

      Lustre Version: 2.4.52
      Kernel: patchless_client
      Build: v2_4_52 0-gfdd4844-CHANGED-3.9.9-302.fc19.x86_64

      2nd Instance:
      -----------------
      From: Cledat, Romain E
      Sent: Monday, September 08, 2014 4:34 PM
      To: Bernel, BrianX D
      Subject: Error

      Message from syslogd@bar1 at Sep 8 16:08:51 ...
      kernel:[1235195.162972] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
      Message from syslogd@bar1 at Sep 8 16:08:51 ...
      kernel:[1235195.193211] LustreError: 85031:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

      1st Instance:
      -----------------
      From: Nickerson, Brian R
      Sent: Thursday, August 14, 2014 3:44 PM
      To: Bernel, BrianX D; Cledat, Romain E
      Subject: Kernel crash details

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.270451] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.271008] LustreError: 42598:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG

      Message from syslogd@bar4 at Aug 14 15:34:57 ...
      kernel:[1216856.271830] Kernel panic - not syncing: LBUG
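
      For context, the failing check is the assertion at the top of osc_lock_enqueue() in lustre/osc/osc_lock.c, which requires a lock to still be in its initial state when it is enqueued. The sketch below paraphrases the 2.4-era CLIO definitions from memory, not verbatim from the tree in use, so the exact enum ordering should be verified against the build above; if the ordering holds, "Impossible state: 4" means the lock was already granted.

      /* Sketch, paraphrased from 2.4-era lustre/osc/osc_cl_internal.h and
       * lustre/osc/osc_lock.c -- not verbatim; verify against the exact tag. */
      enum osc_lock_state {
              OLS_NEW,              /* 0: just created, not enqueued yet */
              OLS_ENQUEUE_WAIT,     /* 1 */
              OLS_UPCALL_RECEIVED,  /* 2 */
              OLS_ENQUEUED,         /* 3 */
              OLS_GRANTED,          /* 4: with this ordering, the logged
                                     *    "Impossible state: 4" means the
                                     *    lock was already granted */
              OLS_RELEASED,
              OLS_BLOCKED,
              OLS_CANCELLED
      };

      /* osc_lock_enqueue() asserts the lock has never been enqueued before: */
      LASSERTF(ols->ols_state == OLS_NEW,
               "Impossible state: %d\n", ols->ols_state);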

    Attachments

    Issue Links

    Activity

    [LU-5599] Lustre Error: Impossible state: 4
            pjones Peter Jones added a comment -

            Jay

            I think that it makes sense to open a separate ticket for the NASA issue because the environments are really quite different and the root cause could be different.

            Peter

            bobijam Zhenyu Xu added a comment -

            Can you try applying these patches?
            http://review.whamcloud.com/#/c/8530/
            http://review.whamcloud.com/#/c/9783/
            http://review.whamcloud.com/#/c/9194/
            http://review.whamcloud.com/#/c/9876/

            jaylan Jay Lan (Inactive) added a comment -

            We had both of them in our 2.4.3-7nasC.
            bobijam Zhenyu Xu added a comment - edited

            It looks like LU-3433, and the patch for 2.4 is here: http://review.whamcloud.com/#/c/9194/. We should also land the LU-4558 patch for 2.4: http://review.whamcloud.com/#/c/9876/

            2.5 already has them.

            jaylan Jay Lan (Inactive) added a comment -

            NASA Ames hit this bug this morning on its 2.4.3-7nasC client running a SLES11 SP3 kernel.

            bdbernex Brian Bernel (Inactive) added a comment -

            Related correspondence and screen errors:

            Yep, Vincent confirms that he was doing a checkout of the repo at the time… Definitely GIT + Lustre. And now they have a full kernel dump.

            Romain

            From: Romain Cledat <romain.e.cledat@intel.com>
            Date: Saturday, November 8, 2014 at 12:19 PM
            To: "Bernel, BrianX D" <brianx.d.bernel@intel.com>
            Subject: Your first kdump

            Hello,

            It seems bar1 rebooted by itself some 7h ago. After some investigation, I think it crashed and was rebooted due to kdump (yeah)

            Nov 8 04:51:12 bar1 systemd-logind[748]: New session 17884 of user vincentc.
            Nov 7 22:01:56 bar1 rsyslogd: [origin software="rsyslogd" swVersion="7.2.6" x-pid="737" x-info="http://www.rsyslog.com"] start

            I think the Nov 7 date is because the clock gets reset to a bad value when rebooting. At the end of the reboot, you have:

            Nov 7 22:03:11 bar1 systemd[1]: Startup finished in 2.269s (kernel) + 3.678s (initrd) + 1min 19.515s (userspace) = 1min 25.463s.
            Nov 7 22:03:11 bar1 abrt-server[2784]: No actions are found for event 'notify'
            Nov 8 05:03:23 bar1 chronyd[717]: Selected source 149.20.68.17
            Nov 8 05:03:23 bar1 chronyd[717]: System clock wrong by 25198.383400 seconds, adjustment started
            Nov 8 05:03:23 bar1 chronyd[717]: System clock was stepped by 25198.383 seconds
            Nov 8 05:03:23 bar1 systemd[1]: Time has been changed

            So the machine seems to have been down between 4h51 and 5h03. And sure enough there is something in /var/crash. Relevant lines:
            [ 2.431054] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) ASSERTION( ols->ols_state == OLS_NEW ) failed: Impossible state: 4
            [ 2.441581] LustreError: 29800:0:(osc_lock.c:1129:osc_lock_enqueue()) LBUG
            [ 2.452016] Pid: 29800, comm: git
            [ 2.452017] \x0aCall Trace:

            What do you know, it happens to be Lustre again.

            Romain

            PS: I am asking Vincent to confirm it was him and it was git. 5 in the morning though…
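
            As a side note on why the node panicked and kdump fired: when a Lustre assertion fails, libcfs escalates the LBUG into a kernel panic if panic-on-LBUG is enabled, and the panic is what lets kdump capture the vmcore in /var/crash. A rough sketch of that path, paraphrased from memory of the 2.x libcfs code (function names and the tunable's default should be checked against the running build):

            /* Rough sketch of libcfs LBUG handling (2.x-era
             * libcfs/libcfs/linux/linux-debug.c); paraphrased from memory,
             * not verbatim -- check the running tree for exact behavior. */
            void lbug_with_loc(struct libcfs_debug_msg_data *msgdata)
            {
                    libcfs_catastrophe = 1;               /* mark the crash */
                    libcfs_debug_msg(msgdata, "LBUG\n");  /* the "LBUG" console line */
                    libcfs_debug_dumpstack(NULL);         /* stack trace to dmesg */
                    libcfs_debug_dumplog();               /* flush the Lustre debug log */
                    if (libcfs_panic_on_lbug)
                            panic("LBUG"); /* "Kernel panic - not syncing: LBUG";
                                            * kdump then writes to /var/crash */
                    /* otherwise park the offending thread forever */
                    set_current_state(TASK_UNINTERRUPTIBLE);
                    while (1)
                            schedule();
            }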


            bdbernex Brian Bernel (Inactive) added a comment -

            Recent Kernel Dump information from bar1 on X-Stack:

            [ 3.911516] hpgmg-fv[96927]: segfault at d ip 00007fd22c315d3e sp 00007fd20796bd80 error 6 in libocr.so[7fd22c300000+2e000]
            [ 3.252141] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 3.828093] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 4.189668] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 4.197074] LustreError: Skipped 1 previous similar message
            [ 0.136898] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 3.395164] hpgmg-fv[5886]: segfault at 7f91ecb51d19 ip 00007f91ecb48d3e sp 00007f91450c0d10 error 7 in libocr.so[7f91ecb33000+2e000]
            [ 1.541619] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 1.549642] LustreError: Skipped 1 previous similar message
            [ 0.870999] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 1.648428] LustreError: 11-0: lustrefs-MDT0000-mdc-ffff881fdadfe000: Communicating with 192.168.2.201@o2ib, operation ldlm_enqueue failed with -11.
            [ 0.486506] hpgmg-fv[7547]: segfault at 7f5a27153d19 ip 00007f5a2714ad3e sp 00007f5978fb3d10 error 7 in libocr.so[7f5a27135000+2e000]
            [ 0.824354] hpgmg-fv[11206]: segfault at d ip 00007f4f43557d3e sp 00007f4e9bacfd10 error 6 in libocr.so[7f4f43542000+2e000]
            [ 3.258284] hpgmg-fv[11745]: segfault at d ip 00007fb77412ad3e sp 00007fb6cd6a4d10 error 6 in libocr.so[7fb774115000+2e000]
            [ 2.546239] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118057617399716 mdc close failed: rc = -13
            [ 2.559492] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144117957591658911 mdc close failed: rc = -13
            [ 1.467034] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052861374 mdc close failed: rc = -13
            [ 1.476398] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 40 previous similar messages
            [ 2.468531] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052858204 mdc close failed: rc = -13
            [ 2.478511] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 246 previous similar messages
            [ 0.176572] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118025052850460 mdc close failed: rc = -13
            [ 0.186522] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 473 previous similar messages
            [ 0.030234] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) inode 144118024767594400 mdc close failed: rc = -13
            [ 0.039727] LustreError: 1599:0:(file.c:163:ll_close_inode_openhandle()) Skipped 39 previous similar messages
            [ 2.904305] hpgmg-fv invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
            [ 2.904309] hpgmg-fv cpuset=/ mems_allowed=0-1
            [ 2.904311] Pid: 11573, comm: hpgmg-fv Tainted: GF O 3.9.9-302.fc19.x86_64 #1


            keith Keith Mannthey (Inactive) added a comment -

            We do not need a kernel dump to get this information.

            Can you upload /var/log/messages or "dmesg" right after this issue is hit?


            bogl Bob Glossman (Inactive) added a comment -

            Brian,
            The stack dumps Cliff asked about a few comments ago are still needed. Any chance of getting those?


            bdbernex Brian Bernel (Inactive) added a comment -

            Hi Bob,

            Thanks so much for your help on this. The Lustre kernel for X-Stack was put together by Gabrielle Paciucci. …As for what might have been going on at the time of the problem, the only thing I have to go on at this point is that git was involved. No kdump available, I’m afraid. A cursory look at the logs doesn’t show anything glaringly amiss, but it does corroborate that garret/git was in play in the minute leading up to the Lustre error.

            (Thanks Josh, for directing me to reply by comment vs email)

            Regards,

            Brian


            bogl Bob Glossman (Inactive) added a comment -

            This may not be known, but I wonder whether any particular applications or load were running near the time of the 2 reported panic instances.

            People

              Assignee: bogl Bob Glossman (Inactive)
              Reporter: bdbernex Brian Bernel (Inactive)
              Votes: 0
              Watchers: 7
