Lustre / LU-8370

ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size )

Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Fix Version/s: Lustre 2.9.0
    • Affects Version/s: Lustre 2.8.0
    • Severity: 3

    Description

      We just had several MDS nodes crash in our 2.8.0 DNE testbed with the following assertion:

      2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed:
      2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
      

      And here is a bit more of the console log showing some events leading up to the assertion:

      2016-07-05 17:00:01 [421013.769026] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:00:14 [421027.108636] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763108/real 1467
      2016-07-05 17:00:14 [421027.143423] Lustre: 13742:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
      2016-07-05 17:00:14 [421027.155156] Lustre: lquake-MDT0004-osp-MDT0000: Connection to lquake-MDT0004 (at 172.19.1.115@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:01:01 [421073.879154] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:01:41 [421113.695183] Lustre: lquake-MDT0000: haven't heard from client lquake-MDT0000-lwp-MDT0004_UUID (at 172.19.1.115@o2ib100) in 227 seconds. I think i
      2016-07-05 17:01:41 [421113.721418] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:04:05 [421258.363863] Lustre: lquake-MDT000b-osp-MDT0000: Connection to lquake-MDT000b (at 172.19.1.122@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:04:13 [421266.343370] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) @@@ Couldn't add any time (5/5), not sending early reply
      2016-07-05 17:04:13 [421266.343370]   req@ffff887a33193600 x1538602753601316/t0(0) o36->9131ef49-c8bd-b5a0-9096-bf612b95d242@192.168.128.136@o2ib18:423/0 lens 736/440 e
      2016-07-05 17:04:13 [421266.379876] Lustre: 121411:0:(service.c:1336:ptlrpc_at_send_early_reply()) Skipped 6 previous similar messages
      2016-07-05 17:05:03 [421316.429903] Lustre: lquake-MDT0000: Client 7f8c5dbb-5c65-9fd0-386e-4acb17a569c1 (at 192.168.128.173@o2ib18) reconnecting
      2016-07-05 17:05:03 [421316.430031] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.111@o2ib18)
      2016-07-05 17:05:03 [421316.430034] Lustre: lquake-MDT0000: Connection restored to  (at 192.168.128.132@o2ib18)
      2016-07-05 17:05:03 [421316.463306] Lustre: Skipped 6 previous similar messages
      2016-07-05 17:05:21 [421334.445759] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1467763414/real 1467
      2016-07-05 17:05:21 [421334.481039] Lustre: 13738:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
      2016-07-05 17:05:29 [421341.717463] Lustre: lquake-MDT0000: haven't heard from client lquake-MDT000b-mdtlov_UUID (at 172.19.1.122@o2ib100) in 227 seconds. I think it's d
      2016-07-05 17:05:29 [421341.743640] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:05:56 [421368.821761] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 10 seconds
      2016-07-05 17:05:56 [421368.834311] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.115@o2ib100 (38): c: 0, oc: 0, rc: 8
      2016-07-05 17:07:31 [421464.551678] Lustre: lquake-MDT0009-osp-MDT0000: Connection to lquake-MDT0009 (at 172.19.1.120@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:08:28 [421520.945572] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 6 seconds
      2016-07-05 17:08:28 [421520.957950] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.120@o2ib100 (62): c: 0, oc: 0, rc: 8
      2016-07-05 17:09:19 [421571.987133] LNetError: 13719:0:(o2iblnd_cb.c:3127:kiblnd_check_txs_locked()) Timed out tx: tx_queue, 7 seconds
      2016-07-05 17:09:19 [421571.999514] LNetError: 13719:0:(o2iblnd_cb.c:3190:kiblnd_check_conns()) Timed out RDMA with 172.19.1.122@o2ib100 (38): c: 0, oc: 0, rc: 8
      2016-07-05 17:09:23 [421576.250103] Lustre: MGS: haven't heard from client 859165fd-8f11-0394-d9d3-83cdf15c6ed5 (at 172.19.1.120@o2ib100) in 227 seconds. I think it's de
      2016-07-05 17:09:23 [421576.276383] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:10:01 [421614.394252] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
      2016-07-05 17:11:22 [421695.760858] Lustre: lquake-MDT000f-osp-MDT0000: Connection to lquake-MDT000f (at 172.19.1.126@o2ib100) was lost; in progress operations using thi
      2016-07-05 17:12:58 [421790.944230] Lustre: MGS: haven't heard from client 4d1129cf-ad52-0cb8-c58a-0602f03cb346 (at 172.19.1.126@o2ib100) in 229 seconds. I think it's de
      2016-07-05 17:12:58 [421790.970449] Lustre: Skipped 2 previous similar messages
      2016-07-05 17:13:18 [421811.048147] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed:
      2016-07-05 17:13:18 [421811.064551] LustreError: 86030:0:(update_trans.c:275:sub_updates_write()) LBUG
      2016-07-05 17:13:18 [421811.073795] Pid: 86030, comm: mdt02_055
      

      The above log was from NID 172.19.1.111@o2ib100.

      Note that NIDs 172.19.1.111 through 172.19.1.126 are MDS nodes (172.19.1.111 hosts the MGS as well). NIDs 172.19.1.127 through 172.19.1.130 are OSS nodes. (All @o2ib100.)

      It looks to me like the MDS nodes are all hitting this assertion one by one. So far 10 out of 16 have hit it and are down.

      Here are another node's console messages (172.19.1.112@o2ib100):

      2016-07-05 17:24:35 [366721.789837] LustreError: Skipped 589 previous similar messages
      2016-07-05 17:25:48 [366795.199563] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1467764693/real 0]  
      req@ffff883e59bc2700 x1538678743470044/t0(0) o38->lquake-MDT000b-osp-MDT0001@172.19.1.122@o2ib100:24/4 lens 520/544 e 0 to 1 dl 1467764748 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
      2016-07-05 17:25:48 [366795.234284] Lustre: 14874:0:(client.c:2063:ptlrpc_expire_one_request()) Skipped 60 previous similar messages
      2016-07-05 17:26:33 [366840.473716] Lustre: lquake-MDT0001: Client f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18) reconnecting
      2016-07-05 17:26:34 [366840.487104] Lustre: lquake-MDT0001: Connection restored to f3232a59-daa1-33eb-94f5-ae42e9a1f202 (at 192.168.128.143@o2ib18)
      2016-07-05 17:26:39 [366846.241092] Lustre: lquake-MDT0008-osp-MDT0001: Connection to lquake-MDT0008 (at 172.19.1.119@o2ib100) was lost; in progress operations using this service will wait for recovery to complete
      2016-07-05 17:28:17 [366943.590377] Lustre: lquake-MDT0001: haven't heard from client lquake-MDT0008-mdtlov_UUID (at 172.19.1.119@o2ib100) in 227 seconds. I think it's dead, and I am evicting it. exp ffff887e999b2000, cur 1467764897 expire 1467764747 last 1467764670
      2016-07-05 17:28:26 [366952.624959] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed: 
      2016-07-05 17:28:26 [366952.641374] LustreError: 14934:0:(update_trans.c:275:sub_updates_write()) LBUG
      2016-07-05 17:28:26 [366952.650631] Pid: 14934, comm: mdt01_001
      

    Attachments

    Issue Links

    Activity

    [LU-8370] ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size )

            morrone Christopher Morrone (Inactive) added a comment -

            I was out sick yesterday, but I'm back today. I'll give the 21334 patch a try today. Thanks!
            di.wang Di Wang added a comment -

            Yes, it is better. Thanks Lai.

            laisiyao Lai Siyao added a comment -

            Hi Di, I made a patch that uses the same function to estimate and set the update llog size; do you think that's better?
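
            (A minimal user-space sketch of that idea, assuming simplified stand-in structures: one sizing helper shared by the space estimate and the record setup, so the two can never disagree by the few bytes of rounding. This is illustrative only, not the actual change in http://review.whamcloud.com/21334.)

            /* Simplified user-space sketch -- the real llog_rec_hdr/llog_rec_tail live
             * in the Lustre headers; these stand-ins only have plausible sizes. */
            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            struct llog_rec_hdr  { uint32_t lrh_len, lrh_index, lrh_type, lrh_id; };
            struct llog_rec_tail { uint32_t lrt_len, lrt_index; };

            static size_t size_round8(size_t len)   /* 8-byte rounding, as cfs_size_round() */
            {
                    return (len + 7) & ~(size_t)7;
            }

            /* one sizing helper used by BOTH the space estimate and the record setup */
            static size_t update_rec_size(size_t payload)
            {
                    return size_round8(sizeof(struct llog_rec_hdr) + payload +
                                       sizeof(struct llog_rec_tail));
            }

            int main(void)
            {
                    size_t payload = 32740;                  /* hypothetical update payload */
                    size_t need = update_rec_size(payload);  /* reserve exactly this much */
                    struct llog_rec_hdr hdr = { .lrh_len = (uint32_t)need };

                    printf("reserve %zu bytes, lrh_len %u\n", need, hdr.lrh_len);
                    return 0;
            }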


            gerrit Gerrit Updater added a comment -

            Lai Siyao (lai.siyao@intel.com) uploaded a new patch: http://review.whamcloud.com/21334
            Subject: LU-8370 dne: error in spliting update records
            Project: fs/lustre-release
            Branch: master
            Current Patch Set: 1
            Commit: 707164929f73698e9ff25eccebc67425e55efe07

            di.wang Di Wang added a comment -

            Chris: it looks like when splitting the update records (by the ctxt chunk_size), it should use cfs_size_round(). I just pushed a patch (http://review.whamcloud.com/21307) and also added some debugging messages there.

            Could you please try it and tell me what those console messages are if it hits the LBUG again? Thanks.
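
            (To make the rounding point concrete, here is a small self-contained sketch, under assumed record-overhead numbers, of splitting a stream of update bytes into llog records against a 32 KiB chunk: the per-record limit is computed from the rounded overhead, so the resulting lrh_len can never exceed the chunk size. The names and constants are illustrative, not the actual sub_updates_write() code or the patch above.)

            /* Illustrative user-space sketch only -- not Lustre code.
             * CHUNK_SIZE matches loc_chunk_size from the logs; REC_OVERHEAD is a
             * made-up stand-in for the llog record header + tail bytes. */
            #include <stddef.h>
            #include <stdio.h>

            #define CHUNK_SIZE   32768u
            #define REC_OVERHEAD 24u

            static size_t size_round8(size_t len)   /* what cfs_size_round() does */
            {
                    return (len + 7) & ~(size_t)7;
            }

            /* payload that fits in one chunk once overhead and rounding are counted */
            static size_t max_payload_per_chunk(void)
            {
                    return CHUNK_SIZE - size_round8(REC_OVERHEAD);
            }

            int main(void)
            {
                    size_t total = 100000;          /* hypothetical total update bytes */
                    size_t limit = max_payload_per_chunk();

                    for (size_t off = 0; off < total; off += limit) {
                            size_t payload = (total - off < limit) ? total - off : limit;
                            size_t lrh_len = size_round8(REC_OVERHEAD + payload);

                            /* the invariant the LBUG asserts */
                            if (lrh_len > CHUNK_SIZE)
                                    return 1;
                            printf("record at %zu: lrh_len %zu <= %u\n",
                                   off, lrh_len, CHUNK_SIZE);
                    }
                    return 0;
            }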


            gerrit Gerrit Updater added a comment -

            wangdi (di.wang@intel.com) uploaded a new patch: http://review.whamcloud.com/21307
            Subject: LU-8370 updates: size round for update record.
            Project: fs/lustre-release
            Branch: b2_8
            Current Patch Set: 1
            Commit: b7ebae2b3b799aec76be207f7370c2129010fa26


            morrone Christopher Morrone (Inactive) added a comment -

            The attached jet13.log file is from a node that is about to hit this assertion. I put a check for the same condition and a forced log dump in the code right before the assertion to get this.
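
            (Roughly the kind of change described above, shown as a fragment rather than a standalone program, and under the assumption that it sits just ahead of the LASSERT() in sub_updates_write(): re-test the same condition and force a debug log dump while the node is still healthy. CERROR() and libcfs_debug_dumplog() are existing libcfs interfaces; the exact message and placement here are illustrative.)

            /* illustrative fragment only -- placed immediately before the existing
             * LASSERT() in update_trans.c:sub_updates_write(); not the exact change used */
            if (unlikely(lur->lur_hdr.lrh_len > ctxt->loc_chunk_size)) {
                    CERROR("update record too big: lrh_len %u > loc_chunk_size %u\n",
                           lur->lur_hdr.lrh_len, ctxt->loc_chunk_size);
                    libcfs_debug_dumplog();  /* dump the debug buffer before the LBUG fires */
            }
            LASSERT(lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size);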

            laisiyao Lai Siyao added a comment -

            No, I'm reviewing the code.


            morrone Christopher Morrone (Inactive) added a comment -

            And now the backtrace, in case that is helpful:

            2016-07-08 14:15:20 [  337.504715] Pid: 26705, comm: mdt00_002
            2016-07-08 14:15:20 [  337.509679] 
            2016-07-08 14:15:20 [  337.509679] Call Trace:
            2016-07-08 14:15:20 [  337.515418]  [<ffffffffa08857e3>] libcfs_debug_dumpstack+0x53/0x80 [libcfs]
            2016-07-08 14:15:20 [  337.523920]  [<ffffffffa108c4cf>] sub_updates_write+0xb99/0xf2e [ptlrpc]
            2016-07-08 14:15:20 [  337.532065]  [<ffffffff81179632>] ? __free_memcg_kmem_pages+0x22/0x50
            2016-07-08 14:15:20 [  337.539953]  [<ffffffffa107ab4f>] top_trans_stop+0x62f/0x970 [ptlrpc]
            2016-07-08 14:15:20 [  337.547819]  [<ffffffffa1307399>] lod_trans_stop+0x259/0x340 [lod]
            2016-07-08 14:15:20 [  337.555394]  [<ffffffffa0e1f320>] ? linkea_add_buf+0x80/0x170 [obdclass]
            2016-07-08 14:15:20 [  337.563543]  [<ffffffffa13958fa>] mdd_trans_stop+0x1a/0x1c [mdd]
            2016-07-08 14:15:20 [  337.570904]  [<ffffffffa1380b08>] mdd_link+0x2e8/0x930 [mdd]
            2016-07-08 14:15:20 [  337.577908]  [<ffffffffa127555a>] ? mdt_lookup_version_check+0xca/0x2f0 [mdt]
            2016-07-08 14:15:20 [  337.586560]  [<ffffffffa127c08e>] mdt_reint_link+0xade/0xc30 [mdt]
            2016-07-08 14:15:20 [  337.594113]  [<ffffffff816500a5>] ? mutex_lock+0x25/0x42
            2016-07-08 14:15:20 [  337.600723]  [<ffffffffa1057ae6>] ? tgt_lookup_reply_by_xid+0x46/0x60 [ptlrpc]
            2016-07-08 14:15:20 [  337.609471]  [<ffffffffa127e470>] mdt_reint_rec+0x80/0x210 [mdt]
            2016-07-08 14:15:20 [  337.616845]  [<ffffffffa1261971>] mdt_reint_internal+0x5e1/0x990 [mdt]
            2016-07-08 14:15:20 [  337.624811]  [<ffffffffa126b0d7>] mdt_reint+0x67/0x140 [mdt]
            2016-07-08 14:15:20 [  337.631807]  [<ffffffffa10665d5>] tgt_request_handle+0x915/0x1320 [ptlrpc]
            2016-07-08 14:15:20 [  337.640164]  [<ffffffffa10130cb>] ptlrpc_server_handle_request+0x21b/0xa90 [ptlrpc]
            2016-07-08 14:15:20 [  337.649379]  [<ffffffffa0892758>] ? lc_watchdog_touch+0x68/0x180 [libcfs]
            2016-07-08 14:15:20 [  337.657650]  [<ffffffffa1010c9b>] ? ptlrpc_wait_event+0xab/0x350 [ptlrpc]
            2016-07-08 14:15:20 [  337.665878]  [<ffffffff810bd4c2>] ? default_wake_function+0x12/0x20
            2016-07-08 14:15:20 [  337.673534]  [<ffffffff810b33f8>] ? __wake_up_common+0x58/0x90
            2016-07-08 14:15:20 [  337.680721]  [<ffffffffa1017170>] ptlrpc_main+0xa90/0x1db0 [ptlrpc]
            2016-07-08 14:15:20 [  337.688368]  [<ffffffff81015588>] ? __switch_to+0xf8/0x4d0
            2016-07-08 14:15:20 [  337.695159]  [<ffffffffa10166e0>] ? ptlrpc_main+0x0/0x1db0 [ptlrpc]
            2016-07-08 14:15:20 [  337.702804]  [<ffffffff810a997f>] kthread+0xcf/0xe0
            2016-07-08 14:15:20 [  337.708875]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
            2016-07-08 14:15:20 [  337.715068]  [<ffffffff8165d658>] ret_from_fork+0x58/0x90
            2016-07-08 14:15:20 [  337.721710]  [<ffffffff810a98b0>] ? kthread+0x0/0xe0
            2016-07-08 14:15:20 [  337.727861] 
            2016-07-08 14:15:20 [  337.730107] LustreError: 26705:0:(update_trans.c:279:sub_updates_write()) ASSERTION( lur->lur_hdr.lrh_len <= ctxt->loc_chunk_size ) failed: lrh_len 32776 loc_chunk_size 32768
            

            morrone Christopher Morrone (Inactive) added a comment -

            See your gerrit server, repository named "lustre-release-fe-llnl". Look for the highest numbered 2.8-llnl-preview* branch.

            There is not a whole lot that is different code-wise. Most of the change is in packaging and scripts.

            But don't keep me in suspense, which commit in the latest master branch do you suspect fixes this?

            laisiyao Lai Siyao added a comment -

            Do you have a place where I can check out your test code (2.8.0), or the corresponding commit in the master branch? That would help verify whether this is already fixed in the latest master branch.


    People

      Assignee: laisiyao Lai Siyao
      Reporter: morrone Christopher Morrone (Inactive)
      Votes: 0
      Watchers: 8