Lustre / LU-8370

ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size )


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      We just had several MDS nodes crash in our 2.8.0 DNE testbed with the following assertion:

      2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed:
      2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
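
      For reference, the check that fires is a simple size invariant: the length stored in the update record's llog header must not exceed the llog context's chunk size, i.e. the aggregated update record has to fit in a single llog chunk. Below is a minimal, self-contained C sketch of that invariant; the struct layouts and the 32 KiB chunk size are illustrative stand-ins, not the real Lustre definitions, and write_sub_update() is a hypothetical helper, not the actual sub_updates_write().

      /*
       * Minimal sketch of the invariant behind the LBUG, using stand-in
       * struct definitions -- NOT the real Lustre llog_update_record /
       * llog_ctxt layouts.  The field names echo the assertion text; the
       * 32 KiB chunk size below is a hypothetical value for illustration.
       */
      #include <assert.h>
      #include <stdio.h>

      struct rec_hdr {
          unsigned int lrh_len;          /* total length of the llog record, in bytes */
      };

      struct update_record {
          struct rec_hdr lur_hdr;        /* record header (length lives here) */
          /* ... packed update ops would follow ... */
      };

      struct llog_context {
          unsigned int loc_chunk_size;   /* largest record one llog chunk can hold */
      };

      /* Mirrors the failed check: the whole update record must fit
       * inside a single llog chunk before it can be written. */
      static void write_sub_update(const struct update_record *lur,
                                   const struct llog_context *ctxt)
      {
          assert(lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size);
          printf("record of %u bytes fits in %u-byte chunk\n",
                 lur->lur_hdr.lrh_len, ctxt->loc_chunk_size);
      }

      int main(void)
      {
          struct llog_context ctxt = { .loc_chunk_size = 32768 };

          struct update_record small = { .lur_hdr = { .lrh_len = 4096 } };
          write_sub_update(&small, &ctxt);   /* fine: 4096 <= 32768 */

          struct update_record huge = { .lur_hdr = { .lrh_len = 65536 } };
          write_sub_update(&huge, &ctxt);    /* aborts: same shape as the LBUG above */

          return 0;
      }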
      

      And here is a bit more of the console log showing some events leading up to the assertion:

      2016-07-05 17:00:01 [421013.769026] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:00:14 [421027.108636] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763108/real 1467
      2016-07-05 17:00:14 [421027.143423] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      2016-07-05 17:00:14 [421027.155156] Lustre: lquake-MDT0004-osp-MDT0000: Connection to lquake-MDT0004 (at 172.19.1.115@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:01:01 [421073.879154] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:01:41 [421113.695183] Lustre: lquake-MDT0000: haven't heard from client lquake-MDT0000-lwp-MDT0004_UUID (at 172.19.1.115@o2ib100) in 227 seconds. I think i
      2016-07-05 17:01:41 [421113.721418] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:04:05 [421258.363863] Lustre: lquake-MDT000b-osp-MDT0000: Connection to lquake-MDT000b (at 172.19.1.122@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:04:13 [421266.343370] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-07-05 17:04:13 [421266.343370]   req@ffff887a33193600 x1538602753601316/t0(0) o36->9131ef49-c8bd-b5a0-9096-bf612b95d242@192.168.128.136@o2ib18:423/0 lens 736/440 e
      2016-07-05 17:04:13 [421266.379876] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) Skipped 6 previous similar messages
      2016-07-05 17:05:03 [421316.429903] Lustre: lquake-MDT0000: Client 7f8c5dbb-5c65-9fd0-386e-4acb17a569c1 (at 192.168.128.173@o2ib18) reconnecting
      2016-07-05 17:05:03 [421316.430031] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.111@o2ib18)
      2016-07-05 17:05:03 [421316.430034] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.132@o2ib18)
      2016-07-05 17:05:03 [421316.463306] Lustre: Skipped 6 previous similar messages
      2016-07-05 17:05:21 [421334.445759] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763414/real 1467
      2016-07-05 17:05:21 [421334.481039] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
      2016-07-05 17:05:29 [421341.717463] Lustre: lquake-MDT0000: haven't heard from client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) in 227 seconds. I think it's d
      2016-07-05 17:05:29 [421341.743640] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:05:56 [421368.821761] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
      2016-07-05 17:05:56 [421368.834311] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.115@o2ib100 (38): c: 0, oc: 0, rc: 8
      2016-07-05 17:07:31 [421464.551678] Lustre: lquake-MDT0009-osp-MDT0000: Connection to lquake-MDT0009 (at 172.19.1.120@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:08:28 [421520.945572] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
      2016-07-05 17:08:28 [421520.957950] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.120@o2ib100 (62): c: 0, oc: 0, rc: 8
      2016-07-05 17:09:19 [421571.987133] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 7 seconds
      2016-07-05 17:09:19 [421571.999514] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.122@o2ib100 (38): c: 0, oc: 0, rc: 8
      2016-07-05 17:09:23 [421576.250103] Lustre: MGS: haven't heard from client 859165fd-8f11-0394-d9d3-83cdf15c6ed5 (at 172.19.1.120@o2ib100) in 227 seconds. I think it's de
      2016-07-05 17:09:23 [421576.276383] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:10:01 [421614.394252] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:11:22 [421695.760858] Lustre: lquake-MDT000f-osp-MDT0000: Connection to lquake-MDT000f (at 172.19.1.126@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:12:58 [421790.944230] Lustre: MGS: haven't heard from client 4d1129cf-ad52-0cb8-c58a-0602f03cb346 (at 172.19.1.126@o2ib100) in 229 seconds. I think it's de
      2016-07-05 17:12:58 [421790.970449] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed:
      2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
      2016-07-05 17:13:18 [421811.073795] Pid: 86030, comm: mdt02_055
      

      The above log was from NID 172.19.1.111@o2ib100.

      Note that NIDs 172.19.1.111 through 172.19.1.126 are MDS nodes (172.19.1.111 hosts the MGS as well). NIDs 172.19.1.127 through 172.19.1.130 are OSS nodes. (All @o2ib100.)

      It looks to me like the MDS nodes are all hitting this assertion one by one. So far 10 out of 16 have hit it and are down.

      Here are another node's console messages (172.19.1.112@o2ib100):

      2016-07-05 17:24:35 [366721.789837] LustreError: Skipped 589 previous similar messages
      2016-07-05 17:25:48 [366795.199563] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1467764693/real 0]  
      req@ffff883e59bc2700 x1538678743470044/t0(0) o38->lquake-MDT000b-osp-MDT0001@172.19.1.122@o2ib100:24/4 lens 520/544 e 0 to 1 dl 1467764748 ref 2 fl Rpc:XN/0/ffffffff rc 
      0/-1
      2016-07-05 17:25:48 [366795.234284] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 60 previous similar messages
      2016-07-05 17:26:33 [366840.473716] Lustre: lquake-MDT0001: Client f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18) reconnecting
      2016-07-05 17:26:34 [366840.487104] Lustre: lquake-MDT0001: Connection restored to f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18)
      2016-07-05 17:26:39 [366846.241092] Lustre: lquake-MDT0008-osp-MDT0001: Connection to lquake-MDT0008 (at 172.19.1.119@o2ib100) was lost; in progress operations using thi
      s service will wait for recovery to complete
      2016-07-05 17:28:17 [366943.590377] Lustre: lquake-MDT0001: haven't heard from client lquake-MDT0008-mdtlov_UUID (at 172.19.1.119@o2ib100) in 227 seconds. I think it's d
      ead, and I am evicting it. exp ffff887e999b2000, cur 1467764897 expire 1467764747 last 1467764670
      2016-07-05 17:28:26 [366952.624959] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed: 
      2016-07-05 17:28:26 [366952.641374] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) LBUG
      2016-07-05 17:28:26 [366952.650631] Pid: 14934, comm: mdt01_001
      

            People

              Assignee: Lai Siyao (laisiyao)
              Reporter: Christopher Morrone (Inactive) (morrone)
              Votes: 0
              Watchers: 8
