Uploaded image for project: 'Lustre'
  1. Lustre
  2. LU-3067

ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE))

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • None
    • Lustre 1.8.9
    • 3
    • 7466

    Description

      One of our OSSs had problems writing to disk (due to a raid card problem).

      Several clients have an LBUG and haven't recovered after OSS reboot.
      The error is:

      Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
      Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG

      I attach the associated log file, and reproduce some lines of context in /var/log/messages

      Mar 29 05:57:03 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.12
      0@tcp.
      Mar 29 05:57:03 cn492 kernel: Lustre: Skipped 1 previous similar message
      Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) @@@ Request x1430341259304767 sent from lustre_0-OST002
      7-osc-ffff81021c041800 to NID 10.1.4.120@tcp 756s ago has timed out (756s prior to deadline).
      Mar 29 06:09:39 cn492 kernel: req@ffff8101145e6800 x1430341259304767/t0 o3->lustre_0-OST0027_UUID@10.1.4.120@tcp:6/4 lens 448/592 e 1 to 1 dl
      1364537379 ref 2 fl Rpc:/2/0 rc 0/0
      Mar 29 06:09:39 cn492 kernel: Lustre: 3004:0:(client.c:1529:ptlrpc_expire_one_request()) Skipped 1 previous similar message
      Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection to service lustre_0-OST0027 via nid 10.1.4.120@tcp was l
      ost; in progress operations using this service will wait for recovery to complete.
      Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
      Mar 29 06:09:39 cn492 kernel: Lustre: lustre_0-OST0027-osc-ffff81021c041800: Connection restored to service lustre_0-OST0027 using nid 10.1.4.12
      0@tcp.
      Mar 29 06:09:39 cn492 kernel: Lustre: Skipped 1 previous similar message
      Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE)) failed
      Mar 29 06:20:10 cn492 kernel: LustreError: 3004:0:(osc_request.c:2357:brw_interpret()) LBUG
      Mar 29 06:20:10 cn492 kernel: Pid: 3004, comm: ptlrpcd
      Mar 29 06:20:10 cn492 kernel:
      Mar 29 06:20:10 cn492 kernel: Call Trace:
      Mar 29 06:20:10 cn492 kernel: [<ffffffff885786a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff88578bda>] lbug_with_loc+0x7a/0xd0 [libcfs]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff88580fc0>] tracefile_init+0x0/0x110 [libcfs]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8879c7e8>] brw_interpret+0x8e8/0xdb0 [osc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff886d36ac>] after_reply+0xcac/0xe30 [ptlrpc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff886d4b0b>] ptlrpc_check_set+0x12db/0x15a0 [ptlrpc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8004b396>] try_to_del_timer_sync+0x7f/0x88
      Mar 29 06:20:10 cn492 kernel: [<ffffffff887095ad>] ptlrpcd_check+0xdd/0x1f0 [ptlrpc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8009a98c>] process_timeout+0x0/0x5
      Mar 29 06:20:10 cn492 kernel: [<ffffffff88709ef1>] ptlrpcd+0x1b1/0x259 [ptlrpc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8008f3ad>] default_wake_function+0x0/0xe
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfc1>] child_rip+0xa/0x11
      Mar 29 06:20:10 cn492 kernel: [<ffffffff88709d40>] ptlrpcd+0x0/0x259 [ptlrpc]
      Mar 29 06:20:10 cn492 kernel: [<ffffffff8005dfb7>] child_rip+0x0/0x11
      Mar 29 06:20:10 cn492 kernel:
      Mar 29 06:20:10 cn492 kernel: LustreError: dumping log to /tmp/lustre-log.1364538010.3004

      Attachments

        Issue Links

          Activity

            [LU-3067] ASSERTION(!(aa->aa_oa->o_valid & OBD_MD_FLHANDLE))
            pjones Peter Jones added a comment -

            No plans for further 1.8.x releases

            pjones Peter Jones added a comment - No plans for further 1.8.x releases

            We are using it in production and haven't noticed problems.

            walker Christopher J. Walker (Inactive) added a comment - We are using it in production and haven't noticed problems.

            Looks like we ran into this bug as well. As far as I can tell, we hit this on a (1.8) client after it had some network issues of unknown type... Is the patch ok to use as it is?

            ferner Frederik Ferner (Inactive) added a comment - Looks like we ran into this bug as well. As far as I can tell, we hit this on a (1.8) client after it had some network issues of unknown type... Is the patch ok to use as it is?
            efocht Erich Focht added a comment -

            We see this issue also at a customer where we deployed servers with Lustre 2.5.0 and his most stable setup seems to be with 1.8.9 clients. Up to this bug!

            efocht Erich Focht added a comment - We see this issue also at a customer where we deployed servers with Lustre 2.5.0 and his most stable setup seems to be with 1.8.9 clients. Up to this bug!

            We occasionally run into this issue. I see the patch set has been rebased at http://review.whamcloud.com/#change,5971 - is it only waiting approval at this point?

            prescott@hpc.ufl.edu Craig Prescott (Inactive) added a comment - We occasionally run into this issue. I see the patch set has been rebased at http://review.whamcloud.com/#change,5971 - is it only waiting approval at this point?

            We ran into the same issue on 1.8.9 at FNAL on client. Trace dump is the same. It happened after client communication error with OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1) ? Thanks, Alex.

            alex.ku Alex Kulyavtsev added a comment - We ran into the same issue on 1.8.9 at FNAL on client. Trace dump is the same. It happened after client communication error with OSS (the router had issues). Is it going to be fixed in 1.8.10 (or 1.8.9.1) ? Thanks, Alex.

            Hi I was wondering what the status of this issue was. We are running into this bug regularly at NOAA as well. Thanks.

            kitwestneat Kit Westneat (Inactive) added a comment - Hi I was wondering what the status of this issue was. We are running into this bug regularly at NOAA as well. Thanks.

            Hi Girish,
            Thanks for your investigation! the OBD_MD_FLHANDLE could be used in client side (OSC)!

            the patch is tracked at http://review.whamcloud.com/#change,5971

            hongchao.zhang Hongchao Zhang added a comment - Hi Girish, Thanks for your investigation! the OBD_MD_FLHANDLE could be used in client side (OSC)! the patch is tracked at http://review.whamcloud.com/#change,5971
            pjones Peter Jones added a comment -

            Hongchao

            Could you please comment?

            Thanks

            Peter

            pjones Peter Jones added a comment - Hongchao Could you please comment? Thanks Peter

            The problem also occurs when the the network connection to an OSS fails (eg today when a colleague managed to take out both of the resilient core switches).

            walker Christopher J. Walker (Inactive) added a comment - The problem also occurs when the the network connection to an OSS fails (eg today when a colleague managed to take out both of the resilient core switches).

            People

              hongchao.zhang Hongchao Zhang
              walker Christopher J. Walker (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: