[LU-9469] conf-sanity test_61: test failed to respond and timed out Created: 08/May/17  Updated: 24/Oct/17  Resolved: 16/Oct/17

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.10.0
Fix Version/s: Lustre 2.11.0, Lustre 2.10.2

Type: Bug Priority: Minor
Reporter: James Casper Assignee: Bob Glossman (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

onyx-37, full
servers: SLES12.2, master branch, v2.9.56.11, b3565, ldiskfs
clients: SLES12.2, master branch, v2.9.56.11, b3565


Issue Links:
Blocker
is blocked by LU-9908 conf-sanity test_41b: test failed to ... Resolved
Related
is related to LU-9146 Backport patches from upstream to res... Resolved
Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

https://testing.hpdd.intel.com/test_sessions/b7669e10-d1f4-42ce-a573-31dd57859317

From MDS console:

BUG: unable to handle kernel NULL pointer dereference at           (null)
Oops: 0000 [#1] SMP

Followed by a trace that is very similar to the one seen in LU-6821:

[<ffffffffa0764214>] ldiskfs_map_blocks+0x554/0x600 [ldiskfs]
21:19:41:[23773.866330]  [<ffffffffa074aa81>] ldiskfs_xattr_set_entry+0x4a1/0x790 [ldiskfs]
21:19:41:[23773.866330]  [<ffffffffa074b970>] ldiskfs_xattr_ibody_set+0x30/0x80 [ldiskfs]
21:19:41:[23773.866330]  [<ffffffffa074bd55>] ldiskfs_xattr_set_handle+0x1c5/0x530 [ldiskfs]
21:19:41:[23773.866330]  [<ffffffffa074c1fd>] ldiskfs_xattr_set+0x13d/0x1a0 [ldiskfs]
21:19:41:[23773.866330]  [<ffffffff8121e4bb>] generic_setxattr+0x5b/0x70
21:19:41:[23773.866330]  [<ffffffffa0e6bfd4>] osd_xattr_set+0x294/0xaf0 [osd_ldiskfs]
21:19:41:[23773.866330]  [<ffffffffa109bf2f>] lod_sub_object_xattr_set+0x1df/0x430 [lod]
21:19:41:[23773.866330]  [<ffffffffa108372f>] lod_xattr_set_internal+0xaf/0x2a0 [lod]
21:19:41:[23773.866330]  [<ffffffffa1091574>] lod_xattr_set+0xe4/0x530 [lod]
21:19:41:[23773.866330]  [<ffffffffa110277d>] mdd_xattr_set+0x6cd/0xe50 [mdd]
21:19:41:[23773.866330]  [<ffffffffa0fbac30>] mdt_reint_setxattr+0x410/0xf60 [mdt]
21:19:41:[23773.866330]  [<ffffffffa0fb967e>] mdt_reint_rec+0x6e/0x1e0 [mdt]
21:19:41:[23773.866330]  [<ffffffffa0f9d087>] mdt_reint_internal+0x557/0x950 [mdt]
21:19:41:[23773.866330]  [<ffffffffa0fa8051>] mdt_reint+0x61/0x120 [mdt]
21:19:41:[23773.866330]  [<ffffffffa0bd1cb8>] tgt_request_handle+0x8b8/0x1290 [ptlrpc]
21:19:41:[23773.866330]  [<ffffffffa0b7e573>] ptlrpc_server_handle_request+0x203/0xa00 [ptlrpc]
21:19:41:[23773.866330]  [<ffffffffa0b82398>] ptlrpc_main+0xa38/0x1d80 [ptlrpc]
21:19:41:[23773.866330]  [<ffffffff810999dd>] kthread+0xbd/0xe0
21:19:41:[23773.866330]  [<ffffffff815e6f3f>] ret_from_fork+0x3f/0x70
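The fault address "(null)" points at the struct jbd2_inode argument itself being NULL when ldiskfs_map_blocks() calls into jbd2 (the faulting IP, jbd2_journal_file_inode, shows up in the fuller MDS log quoted in the comments below). For orientation, the 3.10-era upstream entry point begins roughly as follows; this is an abridged paraphrase of fs/jbd2/transaction.c, not the exact SLES source:

int jbd2_journal_file_inode(handle_t *handle, struct jbd2_inode *jinode)
{
        transaction_t *transaction = handle->h_transaction;
        journal_t *journal = transaction->t_journal;

        if (is_handle_aborted(handle))
                return -EIO;

        /* With debug compiled out, this is the first dereference of
         * jinode: if no jbd2_inode was ever attached to the VFS inode,
         * jinode is NULL and reading jinode->i_transaction faults at
         * address 0, matching the "(null)" in the Oops above. */
        if (jinode->i_transaction == transaction ||
            jinode->i_next_transaction == transaction)
                return 0;

        /* ... list manipulation under j_list_lock elided ... */
        return 0;
}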


 Comments   
Comment by Peter Jones [ 10/May/17 ]

Bob

Can you please look into this one?

Thanks

Peter

Comment by Andreas Dilger [ 11/May/17 ]

It is very likely that this is related to patch https://review.whamcloud.com/25595 "LU-9146 ldiskfs: backport a few patches to resolve deadlock", which was landed for RHEL7, but not SLES12.

At the same time, it makes sense to port patch https://review.whamcloud.com/27045
"LU-9384 ldiskfs: port upstream patches for project quota" to SLES 12 as well.

Comment by Bob Glossman (Inactive) [ 11/May/17 ]

I have asked Yang Sheng to take a look. He was the author of https://review.whamcloud.com/25595

Comment by Yang Sheng [ 12/May/17 ]

From MDS log:

21:19:41:[23773.807317] Lustre: DEBUG MARKER: /usr/sbin/lctl mark save large xattr trusted.big on \/mnt\/lustre\/f61.conf-sanity
21:19:41:[23773.846525] Lustre: DEBUG MARKER: save large xattr trusted.big on /mnt/lustre/f61.conf-sanity
21:19:41:[23773.862509] BUG: unable to handle kernel NULL pointer dereference at           (null)
21:19:41:[23773.864655] IP: [<ffffffffa0288a10>] jbd2_journal_file_inode+0x30/0xf0 [jbd2]
21:19:41:[23773.866330] PGD 0 
21:19:41:[23773.866330] Oops: 0000 [#1] SMP 
21:19:41:[23773.866330] Modules linked in: osp(OEN) mdd(OEN) lod(OEN) mdt(OEN) lfsck(OEN) mgs(OEN) mgc(OEN) osd_ldiskfs(OEN) lquota(OEN) fid(OEN) fld(OEN) ksocklnd(OEN) ptlrpc(OEN) obdclass(OEN) lnet(OEN) ldiskfs(OEN) libcfs(OEN) loop(E) rpcsec_gss_krb5(E) auth_rpcgss(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) af_packet(E) iscsi_boot_sysfs(E) ib_ipoib(E) rdma_ucm(E) ib_ucm(E) ib_uverbs(E) ib_umad(E) rdma_cm(E) 

It is a jinode issue. We lost track of this path because some of the code has changed. As for https://review.whamcloud.com/25595, it looks like sles12sp2 already includes it. As for https://review.whamcloud.com/27045, it can be ported if we decide to port project quota to sles12sp2.

Thanks,
YangSheng
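
For context on the jinode issue: data-journalled writes are tracked via jbd2_journal_file_inode(), which assumes a struct jbd2_inode was attached to the inode beforehand. The upstream ext4 helper that attaches one lazily looks roughly like the sketch below (paraphrased from ext4_inode_attach_jinode() in fs/ext4/inode.c; the ldiskfs analogue would carry ldiskfs_/LDISKFS_ prefixes). The crash indicates a path that reaches ldiskfs_map_blocks() without this attach having happened:

int ext4_inode_attach_jinode(struct inode *inode)
{
        struct ext4_inode_info *ei = EXT4_I(inode);
        struct jbd2_inode *jinode;

        if (ei->jinode || !EXT4_SB(inode->i_sb)->s_journal)
                return 0;               /* already attached, or no journal */

        jinode = jbd2_alloc_inode(GFP_KERNEL);
        spin_lock(&inode->i_lock);
        if (!ei->jinode) {
                if (!jinode) {
                        spin_unlock(&inode->i_lock);
                        return -ENOMEM;
                }
                ei->jinode = jinode;
                jbd2_journal_init_jbd_inode(ei->jinode, inode);
                jinode = NULL;          /* ownership moved to the inode */
        }
        spin_unlock(&inode->i_lock);
        if (unlikely(jinode != NULL))   /* lost the race; free our copy */
                jbd2_free_inode(jinode);
        return 0;
}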

Comment by Peter Jones [ 15/May/17 ]

Yes. Let's bring the SLES12 SP2 patches up to date with the RHEL7.3 patches

Comment by Peter Jones [ 23/Aug/17 ]

Any progress on this?

Comment by Bob Glossman (Inactive) [ 23/Aug/17 ]

Attempting a patch suggested by Yang Sheng that addressed a similar panic seen on el7.4.

Comment by Yang Sheng [ 23/Aug/17 ]

It looks the same as https://testing.hpdd.intel.com/sub_tests/3157a7c8-7814-11e7-a12c-5254006e85c2.

03:59:12:[20259.344620] BUG: unable to handle kernel NULL pointer dereference at           (null)
03:59:12:[20259.345439] IP: [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2]
03:59:12:[20259.345439] PGD 0 
03:59:12:[20259.345439] Oops: 0000 [#1] SMP 
03:59:12:[20259.345439] Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) ldiskfs(OE) libcfs(OE) loop dm_mod rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr joydev virtio_balloon i2c_piix4 parport_pc parport nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm virtio_blk drm ata_piix 8139too libata crct10dif_pclmul crct10dif_common crc32c_intel serio_raw 8139cp virtio_pci virtio_ring mii virtio i2c_core floppy [last unloaded: libcfs]
03:59:12:[20259.345439] CPU: 1 PID: 31105 Comm: mdt00_002 Tainted: G        W  OE  ------------   3.10.0-693.el7_lustre.x86_64 #1
03:59:12:[20259.345439] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
03:59:12:[20259.345439] task: ffff880064ab2f70 ti: ffff880064b58000 task.ti: ffff880064b58000
03:59:12:[20259.345439] RIP: 0010:[<ffffffffc0195c84>]  [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2]
03:59:12:[20259.345439] RSP: 0018:ffff880064b5b6b0  EFLAGS: 00010246
03:59:12:[20259.345439] RAX: 0000000000000800 RBX: ffff880054ba9800 RCX: 0000080000200000
03:59:12:[20259.345439] RDX: 000000008002c823 RSI: 0000000000000000 RDI: ffff88007bd3d2a0
03:59:12:[20259.345439] RBP: ffff880064b5b6d8 R08: 000000000000c350 R09: ffffffffc07b75c3
03:59:12:[20259.345439] R10: 0000000000006b03 R11: 0000000000000000 R12: 0000000000000000
03:59:12:[20259.345439] R13: 0000000000000001 R14: 0000000000000000 R15: ffff8800796f2000
03:59:12:[20259.345439] FS:  0000000000000000(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
03:59:12:[20259.345439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
03:59:12:[20259.345439] CR2: 0000000000000000 CR3: 0000000079809000 CR4: 00000000000406e0
03:59:12:[20259.345439] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
03:59:12:[20259.345439] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
03:59:12:[20259.345439] Stack:
03:59:12:[20259.345439]  ffff880000c28958 0000000000000000 0000000000000001 ffff880064b5b7c8
03:59:12:[20259.345439]  0000000000000001 ffff880064b5b760 ffffffffc07c3b4c 0000000000000000
03:59:12:[20259.345439]  0000000000c28958 000000007bd3d2a0 ffff88007bd3d2a0 0000000000000003
03:59:12:[20259.345439] Call Trace:
03:59:12:[20259.345439]  [<ffffffffc07c3b4c>] ldiskfs_map_blocks+0x64c/0x6f0 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc077824c>] ldiskfs_xattr_set_entry+0x2cc/0x7c0 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc07791c5>] ldiskfs_xattr_ibody_set+0x35/0x80 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc07795a4>] ldiskfs_xattr_set_handle+0x1a4/0x500 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc0779a46>] ldiskfs_xattr_set+0x146/0x1c0 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc07b89cd>] ldiskfs_xattr_trusted_set+0x2d/0x30 [ldiskfs]
03:59:12:[20259.345439]  [<ffffffff812273d2>] generic_setxattr+0x62/0x80
03:59:12:[20259.345439]  [<ffffffffc0d2f43c>] osd_xattr_set+0x2cc/0xbf0 [osd_ldiskfs]
03:59:12:[20259.345439]  [<ffffffffc06933e4>] ? libcfs_log_return+0x24/0x30 [libcfs]
03:59:12:[20259.345439]  [<ffffffffc0faec00>] lod_sub_xattr_set+0x1f0/0x480 [lod]
03:59:12:[20259.345439]  [<ffffffffc0f94ce9>] lod_xattr_set_internal+0xb9/0x2d0 [lod]
03:59:12:[20259.345439]  [<ffffffffc0fa3ce1>] lod_xattr_set+0xf1/0x560 [lod]
03:59:12:[20259.345439]  [<ffffffffc1010b05>] mdd_xattr_set+0x705/0xe80 [mdd]
03:59:12:[20259.345439]  [<ffffffffc0ec2607>] ? mdt_version_save+0x67/0x120 [mdt]
03:59:12:[20259.345439]  [<ffffffffc0ecdde0>] mdt_reint_setxattr+0x410/0xf80 [mdt]
03:59:12:[20259.345439]  [<ffffffffc0696bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
03:59:12:[20259.345439]  [<ffffffffc0ecc790>] mdt_reint_rec+0x80/0x210 [mdt]
03:59:12:[20259.345439]  [<ffffffffc0eae31b>] mdt_reint_internal+0x5fb/0x9c0 [mdt]
03:59:12:[20259.345439]  [<ffffffffc0eb9da7>] mdt_reint+0x67/0x140 [mdt]
03:59:12:[20259.345439]  [<ffffffffc0b5d195>] tgt_request_handle+0x925/0x1370 [ptlrpc]
03:59:12:[20259.345439]  [<ffffffffc0b06036>] ptlrpc_server_handle_request+0x236/0xa90 [ptlrpc]
03:59:12:[20259.345439]  [<ffffffffc0696bc7>] ? libcfs_debug_msg+0x57/0x80 [libcfs]
03:59:12:[20259.345439]  [<ffffffffc0b097d2>] ptlrpc_main+0xa92/0x1e40 [ptlrpc]
03:59:12:[20259.345439]  [<ffffffff81029557>] ? __switch_to+0xd7/0x510
03:59:12:[20259.345439]  [<ffffffff816a8f00>] ? __schedule+0x310/0x8b0
03:59:12:[20259.345439]  [<ffffffffc0b08d40>] ? ptlrpc_register_service+0xe80/0xe80 [ptlrpc]
03:59:12:[20259.345439]  [<ffffffff810b098f>] kthread+0xcf/0xe0
03:59:12:[20259.345439]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
03:59:12:[20259.345439]  [<ffffffff816b4f18>] ret_from_fork+0x58/0x90
03:59:12:[20259.345439]  [<ffffffff810b08c0>] ? insert_kthread_work+0x40/0x40
03:59:12:[20259.345439] Code: 41 57 41 56 41 55 41 54 53 f6 47 1c 08 48 8b 1f 0f 85 91 00 00 00 48 85 db 0f 84 88 00 00 00 4c 8b 3b 45 8b 37 41 83 e6 02 75 7c <48> 39 1e 49 89 f4 74 7a 48 39 5e 08 74 74 4d 8d af a0 03 00 00 
03:59:12:[20259.345439] RIP  [<ffffffffc0195c84>] jbd2_journal_file_inode+0x34/0xf0 [jbd2]
03:59:12:[20259.345439]  RSP <ffff880064b5b6b0>
03:59:12:[20259.345439] CR2: 0000000000000000

I have proposed a fix for that. Bob will port it to sles12.

Thanks,
YangSheng

Comment by Gerrit Updater [ 23/Aug/17 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/28665
Subject: LU-9469 ldiskfs: add additional attach_jinode call
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: f8c12ff7c6b57e16e741be7e4acaf381b5b293d3
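
Going by the subject, the fix presumably ensures a jbd2_inode is attached before the setxattr path can reach ldiskfs_map_blocks(). A minimal sketch of such a call site, assuming the landed change mirrors the upstream ext4 pattern (the real hunk should be read from the Gerrit review; ldiskfs_inode_attach_jinode is the assumed ldiskfs-prefixed name of the helper shown earlier):

        /* Sketch only, not the landed hunk: attach the jbd2_inode up
         * front so jbd2_journal_file_inode() never sees a NULL one. */
        error = ldiskfs_inode_attach_jinode(inode);
        if (error)
                return error;

        error = ldiskfs_map_blocks(handle, inode, &map, flags);
        if (error < 0)                  /* returns blocks mapped or -errno */
                return error;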

Comment by Gerrit Updater [ 23/Aug/17 ]

Yang Sheng (yang.sheng@intel.com) uploaded a new patch: https://review.whamcloud.com/28666
Subject: LU-9469 ldiskfs: just test
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 354bcd64b3119666f84718aa9d56c47458f321c4

Comment by Yang Sheng [ 25/Aug/17 ]

So it looks like this is very easy to reproduce on sles12sp2: https://testing.hpdd.intel.com/test_sessions/cc1c2380-7d1d-4672-a3ba-0c0f4275f105

We need to run the full set of tests whenever the kernel is updated; otherwise we may lose the chance to catch issues like this in a timely fashion.

Thanks,
YangSheng

Comment by Bob Glossman (Inactive) [ 25/Aug/17 ]

YangSheng,
I see that your test patch reproduced the problem.
I also see that the proposed patch https://review.whamcloud.com/28665 did not show it. Tests on that patch are running into other problems, though: possibly LU-9908, and maybe other things too.

You could help in a couple of ways:

  1. rebase your test on top of https://review.whamcloud.com/28665 and see if the problem still happens with repeated runs of conf-sanity.
  2. help with finding root cause(s) of LU-9908

Thanks,
Bob

Comment by Yang Sheng [ 25/Aug/17 ]

Hi, Bob,

Do you have a way to run all of the test groups for sles12sp2 instead of conf-sanity only? I am afraid the testing of sles12 is not enough; more issues may be revealed when the full tests are run. I think just changing the Test-Parameters of https://review.whamcloud.com/28665 is enough. My patch is just for testing; it changes nothing.

Thanks,
YangSheng

Comment by Bob Glossman (Inactive) [ 25/Aug/17 ]

I know your patch is just for testing. I want to test with #28665 first, to see whether the original problem occurs with conf-sanity alone when repeated enough times, before trying anything else.

The only way I can think of to run all the normal test groups in #28665 is to have 5 separate Test-Parameters lines, one for each of the 5 default test runs (see the example below). I can do that, but I want more confidence about conf-sanity first. Doing that seems likely to run into other, unrelated, pre-existing failures on sles12.
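
For reference, per-patch test selection in lustre-release is driven by Test-Parameters: lines in the commit message. A hedged example of forcing repeated conf-sanity runs (the testlist= key follows common usage in lustre-release commits; treat the exact syntax as an assumption, not documentation):

Test-Parameters: testlist=conf-sanity,conf-sanity,conf-sanity

Stacking several such lines, one per test group, would cover the default runs described above.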

Comment by Gerrit Updater [ 13/Sep/17 ]

Bob Glossman (bob.glossman@intel.com) uploaded a new patch: https://review.whamcloud.com/28968
Subject: LU-9469 ldiskfs: add additional attach_jinode call
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 415773036850bdd5562caa7f0cf9ccc18c1e7212

Comment by Gerrit Updater [ 16/Oct/17 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/28665/
Subject: LU-9469 ldiskfs: add additional attach_jinode call
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 84f690eee28f3b9123b89249c568dd6255a4a348

Comment by Peter Jones [ 16/Oct/17 ]

Landed for 2.11

Comment by Gerrit Updater [ 24/Oct/17 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/28968/
Subject: LU-9469 ldiskfs: add additional attach_jinode call
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: 6ece1f512e9e0ef8dfc8293738716808c70a5a42
