[LU-10421] mds-survey test 1: Timeout occurred after 426 mins, last suite running was mds-survey, restarting cluster to continue tests
Created: 20/Dec/17  Updated: 12/Apr/18  Resolved: 06/Mar/18

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.11.0, Lustre 2.10.3
Fix Version/s: Lustre 2.11.0, Lustre 2.10.4

Type: Bug
Priority: Minor
Reporter: James Casper
Assignee: John Hammond
Resolution: Fixed
Votes: 0
Labels: dne, zfs
Environment:

onyx, full DNE
servers: el7.4, zfs, branch master, v2.10.56, b3678
clients: el7.4, branch master, v2.10.56, b3678


Issue Links:
Duplicate
duplicates LU-6249 mds-survey test_1: test failed to res... Resolved
is duplicated by LU-10636 mds-survey test_1: timeout Resolved
is duplicated by LU-10758 mds-survey: timeout Closed
Severity: 3

Description

session: https://testing.hpdd.intel.com/test_sessions/9e3f4edc-daff-4e9c-bb2c-5e501afcb7bf
test set: https://testing.hpdd.intel.com/test_sets/ba56cb40-e0c8-11e7-9840-52540065bddc

From MDS console:

[22053.144258] LustreError: 13506:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0001-tests: rc = -2
[22053.145264] LustreError: 13506:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0001-tests: rc = -2
[22053.781142] LustreError: 13611:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0001-tests3: rc = -2
[22053.782164] LustreError: 13611:0:(echo_client.c:1795:echo_md_lookup()) Skipped 2 previous similar messages
[22053.783133] LustreError: 13611:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0001-tests3: rc = -2
[22053.784222] LustreError: 13611:0:(echo_client.c:2027:echo_md_destroy_internal()) Skipped 2 previous similar messages
[22055.866749] LustreError: 13891:0:(echo_client.c:1795:echo_md_lookup()) lookup MDT0003-tests: rc = -2
[22055.867931] LustreError: 13891:0:(echo_client.c:2027:echo_md_destroy_internal()) Can't find child MDT0003-tests: rc = -2
[22177.268865] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[22177.270372] IP: [<ffffffffc0bdb913>] lu_object_alloc+0x73/0x310 [obdclass]
[22177.271432] PGD 48733067 PUD 3cfc0067 PMD 0 
[22177.272157] Oops: 0002 [#1] SMP 
[22177.272692] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mod iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd nfsd pcspkr i2c_piix4 joydev virtio_balloon parport_pc i2c_core parport nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ata_generic pata_acpi ext4 mbcache jbd2 ata_piix libata virtio_blk 8139too crct10dif_pclmul crct10dif_common floppy crc32c_intel virtio_pci virtio_ring serio_raw virtio 8139cp mii
[22177.287656] CPU: 1 PID: 19364 Comm: lctl Tainted: P           OE  ------------   3.10.0-693.5.2.el7_lustre.x86_64 #1
[22177.289215] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[22177.290055] task: ffff88003ef69fa0 ti: ffff880049e5c000 task.ti: ffff880049e5c000
[22177.291188] RIP: 0010:[<ffffffffc0bdb913>]  [<ffffffffc0bdb913>] lu_object_alloc+0x73/0x310 [obdclass]
[22177.292617] RSP: 0018:ffff880049e5fb20  EFLAGS: 00010246
[22177.293373] RAX: 00000002400090a0 RBX: ffff8800528d0e40 RCX: 0000000000000000
[22177.294437] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88004885a000
[22177.295492] RBP: ffff880049e5fb68 R08: 0000000000000000 R09: ffff88004885a000
[22177.296546] R10: 000000000000000d R11: 0000000000000fff R12: ffff88004885a000
[22177.297624] R13: ffff880049e5fc08 R14: ffff88005a97a1f8 R15: 0000000000000000
[22177.298634] FS:  00007f0dfe043740(0000) GS:ffff88007fd00000(0000) knlGS:0000000000000000
[22177.299832] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[22177.300688] CR2: 0000000000000008 CR3: 000000001c64d000 CR4: 00000000000406e0
[22177.301787] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[22177.302862] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[22177.303911] Stack:
[22177.304231]  ffff880053233000 ffffffffc0bd91f3 0000000000000000 ffff880058793e18
[22177.305487]  ffff8800528d0e40 0000000000000000 ffff880049e5fc08 ffff88005a97a1f8
[22177.306635]  ffff880053233000 ffff880049e5fbd0 ffffffffc0bdbd7c ffff880058793e18
[22177.307918] Call Trace:
[22177.308341]  [<ffffffffc0bd91f3>] ? htable_lookup+0x153/0x170 [obdclass]
[22177.309359]  [<ffffffffc0bdbd7c>] lu_object_find_at+0x16c/0x290 [obdclass]
[22177.310377]  [<ffffffffc11bfa9e>] echo_md_dir_stripe_choose.isra.43+0x26e/0x680 [obdecho]
[22177.311601]  [<ffffffffc05d77eb>] ? cfs_hash_spin_unlock+0xb/0x10 [libcfs]
[22177.312625]  [<ffffffffc11c0d6c>] echo_md_handler.isra.45+0xebc/0x2c20 [obdecho]
[22177.313708]  [<ffffffffc11c6891>] echo_client_iocontrol+0x1091/0x1ba0 [obdecho]
[22177.314799]  [<ffffffffc0bbc459>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[22177.315936]  [<ffffffffc0ba714d>] class_handle_ioctl+0x18cd/0x1dd0 [obdclass]
[22177.316937]  [<ffffffff811b1e81>] ? handle_mm_fault+0x691/0xfa0
[22177.317792]  [<ffffffff812b1a98>] ? security_capable+0x18/0x20
[22177.318674]  [<ffffffffc0b8c602>] obd_class_ioctl+0xd2/0x170 [obdclass]
[22177.319675]  [<ffffffff812151bd>] do_vfs_ioctl+0x33d/0x540
[22177.320472]  [<ffffffff816b0456>] ? trace_do_page_fault+0x56/0x150
[22177.321376]  [<ffffffff81215461>] SyS_ioctl+0xa1/0xc0
[22177.322137]  [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
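
Note on the LustreError lines above: the `rc = -2` values follow the usual kernel convention of returning a negated errno, so -2 corresponds to ENOENT, which is consistent with the "Can't find child" messages. A minimal Python sketch for decoding such return codes while reading console logs; the helper name describe_rc is hypothetical and not part of Lustre:

```python
import errno
import os


def describe_rc(rc):
    """Map a kernel-style return code (0 or negative errno) to a readable string."""
    if rc == 0:
        return "success"
    if rc > 0:
        return "positive value %d (not an errno)" % rc
    code = -rc
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return "%s (%s)" % (name, os.strerror(code))


# rc value taken from the echo_md_lookup() messages in this ticket.
print(describe_rc(-2))
```

The message text comes from the local C library, so the exact wording can differ between platforms; the symbolic name for 2 is ENOENT on Linux.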


Comments
Comment by Mikhail Pershin [ 12/Jan/18 ]

+1 on master
https://testing.hpdd.intel.com/test_sets/d6abf5e4-f258-11e7-8c43-52540065bddc

Comment by Jian Yu [ 19/Jan/18 ]

This failure occurred at least 20 times in the last two weeks.

Comment by Saurabh Tandan (Inactive) [ 31/Jan/18 ]

Also seen for 2.10.57 "SLES 12 SP3 Server/DNE/ldiskfs SLES 12 SP3 Client": https://testing.hpdd.intel.com/test_sets/ace94120-fd4e-11e7-a7cd-52540065bddc

Comment by Minh Diep [ 06/Feb/18 ]

+1 master dne-zfs

https://testing.hpdd.intel.com/test_sets/1eb3e98c-0b54-11e8-a7cd-52540065bddc

Comment by Jian Yu [ 08/Feb/18 ]

The failure occurred more than 50 times in one week, which is affecting patch testing on the master branch:
https://testing.hpdd.intel.com/test_sets/88933722-0ce4-11e8-a6ad-52540065bddc
https://testing.hpdd.intel.com/test_sets/5db396ee-0cdc-11e8-a7cd-52540065bddc
https://testing.hpdd.intel.com/test_sets/e833fbc4-0cdc-11e8-a10a-52540065bddc

Comment by nasf (Inactive) [ 09/Feb/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/b80d2ea0-0d83-11e8-a6ad-52540065bddc

Comment by Mikhail Pershin [ 12/Feb/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_logs/14427d04-0f39-11e8-a7cd-52540065bddc/show_text

[15920.683325] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[15920.684035] IP: [<ffffffffc0c50d23>] lu_object_alloc+0x73/0x310 [obdclass]
[15920.684035] PGD 800000003a856067 PUD 5fb92067 PMD 0 
[15920.684035] Oops: 0002 [#1] SMP 
[15920.684035] Modules linked in: obdecho(OE) osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgc(OE) osd_zfs(OE) lquota(OE) fid(OE) fld(OE) ksocklnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) zfs(POE) zunicode(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) libcfs(OE) rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod crc_t10dif crct10dif_generic crct10dif_common ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_core dm_mod ppdev nfsd pcspkr parport_pc joydev virtio_balloon parport i2c_piix4 nfs_acl lockd auth_rpcgss grace sunrpc ip_tables ext4 mbcache jbd2 ata_generic pata_acpi cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm virtio_blk 8139too ata_piix libata virtio_pci virtio_ring serio_raw virtio 8139cp mii i2c_core floppy
[15920.684035] CPU: 0 PID: 22706 Comm: lctl Tainted: P           OE  ------------   3.10.0-693.17.1.el7_lustre.x86_64 #1
[15920.684035] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[15920.684035] task: ffff8800174e4f10 ti: ffff88003e728000 task.ti: ffff88003e728000
[15920.684035] RIP: 0010:[<ffffffffc0c50d23>]  [<ffffffffc0c50d23>] lu_object_alloc+0x73/0x310 [obdclass]
[15920.684035] RSP: 0018:ffff88003e72baf0  EFLAGS: 00010246
[15920.684035] RAX: 0000000240000bd3 RBX: ffff880055523180 RCX: 0000000000000000
[15920.684035] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880050313dc0
[15920.684035] RBP: ffff88003e72bb38 R08: 0000000000000000 R09: 0000000000000000
[15920.684035] R10: ffff880050313dc0 R11: 0000000000000fff R12: ffff880050313dc0
[15920.684035] R13: ffff88003e72bbd8 R14: ffff8800528dc228 R15: 0000000000000000
[15920.684035] FS:  00007f2d43084740(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
[15920.684035] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[15920.684035] CR2: 0000000000000008 CR3: 00000000401e0000 CR4: 00000000000006f0
[15920.684035] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15920.684035] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[15920.684035] Call Trace:
[15920.684035]  [<ffffffffc0c4e603>] ? htable_lookup+0x153/0x170 [obdclass]
[15920.684035]  [<ffffffffc0c5118c>] lu_object_find_at+0x16c/0x290 [obdclass]
[15920.684035]  [<ffffffffc12617de>] echo_md_dir_stripe_choose.isra.43+0x26e/0x680 [obdecho]
[15920.684035]  [<ffffffffc126268e>] echo_md_handler.isra.45+0xa9e/0x2c20 [obdecho]
[15920.684035]  [<ffffffffc12658a1>] echo_client_iocontrol+0x1091/0x1ba0 [obdecho]
[15920.684035]  [<ffffffffc0c31829>] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
[15920.684035]  [<ffffffffc0c1c63d>] class_handle_ioctl+0x18ed/0x1df0 [obdclass]
[15920.684035]  [<ffffffff811af746>] ? do_read_fault.isra.44+0xe6/0x130
[15920.684035]  [<ffffffff812b3ea8>] ? security_capable+0x18/0x20
[15920.684035]  [<ffffffffc0c01602>] obd_class_ioctl+0xd2/0x170 [obdclass]
[15920.684035]  [<ffffffff8121730d>] do_vfs_ioctl+0x33d/0x540
[15920.684035]  [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
[15920.684035]  [<ffffffff810ec7ba>] ? __getnstimeofday64+0x3a/0xd0
[15920.684035]  [<ffffffff812175b1>] SyS_ioctl+0xa1/0xc0
[15920.684035]  [<ffffffff816b8930>] ? system_call_after_swapgs+0x15d/0x214
[15920.684035]  [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[15920.684035]  [<ffffffff816b889d>] ? system_call_after_swapgs+0xca/0x214
[15920.684035] Code: 48 8b 42 10 ff 10 48 85 c0 49 89 c4 0f 84 3c 02 00 00 48 3d 00 f0 ff ff 0f 87 6f 02 00 00 48 8b 08 49 8b 57 08 49 8b 07 45 31 ff <48> 89 51 08 48 89 01 49 8b 04 24 4c 8d 70 40 48 89 44 24 08 48 
[15920.684035] RIP  [<ffffffffc0c50d23>] lu_object_alloc+0x73/0x310 [obdclass]
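
Note on reading the oops above: on x86 the hex value after "Oops:" is the page-fault error code, and CR2 holds the faulting address. The sketch below decodes the low bits of that code; for 0002 it reports a kernel-mode write to a not-present page, which fits CR2 = 0000000000000008. The bytes marked `<48> 89 51 08` on the Code: line decode to a 64-bit store to 0x8(%rcx), and RCX is zero in the register dump. The function name decode_oops_code is hypothetical; the bit meanings are the standard x86 ones and are not Lustre-specific.

```python
import re

# x86 page-fault error-code bits: (mask, meaning if set, meaning if clear).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(line):
    """Extract the 4-digit hex code after 'Oops:' and describe its low bits."""
    match = re.search(r"Oops:\s+([0-9a-fA-F]{4})", line)
    if match is None:
        raise ValueError("no Oops error code found in line")
    code = int(match.group(1), 16)
    described = []
    for mask, if_set, if_clear in PF_BITS:
        text = if_set if code & mask else if_clear
        if text is not None:
            described.append(text)
    return described


# Line copied from the console log in this ticket.
print(decode_oops_code("[15920.684035] Oops: 0002 [#1] SMP"))
# prints ['page not present', 'write access', 'kernel mode']
```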

Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

+1 on master:
https://testing.hpdd.intel.com/test_sets/ef6a9f1a-1240-11e8-a10a-52540065bddc

Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

One more:
https://testing.hpdd.intel.com/test_sessions/5f0985a8-746d-46ad-bfa6-dc20f921b807

Comment by Patrick Farrell (Inactive) [ 15/Feb/18 ]

https://testing.hpdd.intel.com/test_sessions/0e8a10c7-fd99-4d2a-8443-8d42a144e1b7
Lots of these...

Comment by Gerrit Updater [ 16/Feb/18 ]

John L. Hammond (john.hammond@intel.com) uploaded a new patch: https://review.whamcloud.com/31338
Subject: LU-10421 echo: use echo layer when finding stripe object
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 13adbc8c6f74f8b1138dd8ec30c35b4b982c8c00

Comment by Minh Diep [ 23/Feb/18 ]

+1 on b2_10

https://testing.hpdd.intel.com/test_sets/1c1f2c2c-11f3-11e8-bd00-52540065bddc

Comment by Mikhail Pershin [ 06/Mar/18 ]

+1 on master
https://testing.hpdd.intel.com/test_sessions/42ea7cc1-90c6-493a-847c-fdff29b133e1
same trace in trevis-39vm1.log
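
Checking a console log for "the same trace" can be scripted. The sketch below is one possible triage helper, not an official tool: it treats a log as matching when a NULL pointer dereference is followed by an IP line in lu_object_alloc and a call-trace frame starting with echo_md_dir_stripe_choose. That signature is an assumption drawn from the two traces quoted in this ticket, and the function name has_lu10421_signature and the 60-line window are hypothetical choices.

```python
import re

# Matches e.g. "IP: [<ffffffffc0bdb913>] lu_object_alloc+0x73/0x310 [obdclass]".
# The \b keeps "RIP:" lines from matching.
IP_RE = re.compile(r"\bIP: \[<[0-9a-f]+>\] ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")
# Matches stack frames, with or without the "? " marker for unreliable frames.
FRAME_RE = re.compile(r"\[<[0-9a-f]+>\] (?:\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def has_lu10421_signature(log_text, window=60):
    """Return True if log_text contains an oops resembling the one in this ticket."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if "NULL pointer dereference" not in line:
            continue
        # Only look at the lines that immediately follow the BUG line.
        chunk = lines[i + 1:i + 1 + window]
        ip_func = None
        frames = set()
        for ln in chunk:
            ip_match = IP_RE.search(ln)
            if ip_match and ip_func is None:
                ip_func = ip_match.group(1)
            frame_match = FRAME_RE.search(ln)
            if frame_match:
                frames.add(frame_match.group(1))
        in_stripe_choose = any(
            name.startswith("echo_md_dir_stripe_choose") for name in frames
        )
        if ip_func == "lu_object_alloc" and in_stripe_choose:
            return True
    return False


# Abbreviated sample built from lines quoted in this ticket.
sample = "\n".join([
    "[22177.268865] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008",
    "[22177.270372] IP: [<ffffffffc0bdb913>] lu_object_alloc+0x73/0x310 [obdclass]",
    "[22177.307918] Call Trace:",
    "[22177.309359]  [<ffffffffc0bdbd7c>] lu_object_find_at+0x16c/0x290 [obdclass]",
    "[22177.310377]  [<ffffffffc11bfa9e>] echo_md_dir_stripe_choose.isra.43+0x26e/0x680 [obdecho]",
])
print(has_lu10421_signature(sample))
```

To use it on a saved log, read the file contents into a string and pass that string to the function.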

Comment by Gerrit Updater [ 06/Mar/18 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch https://review.whamcloud.com/31338/
Subject: LU-10421 echo: use echo layer when finding stripe object
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 6f60a28206b2755a9aa158d82713b73efa09e81b

Comment by Gerrit Updater [ 06/Mar/18 ]

Minh Diep (minh.diep@intel.com) uploaded a new patch: https://review.whamcloud.com/31556
Subject: LU-10421 echo: use echo layer when finding stripe object
Project: fs/lustre-release
Branch: b2_10
Current Patch Set: 1
Commit: 0a8a4f0cf53d853daa7f421516d9733afb399248

Comment by Peter Jones [ 06/Mar/18 ]

Landed for 2.11

Comment by Saurabh Tandan (Inactive) [ 11/Apr/18 ]

+1 on 2.10.3

https://testing.hpdd.intel.com/test_sets/6e7cb3b0-3d84-11e8-960d-52540065bddc

Comment by Gerrit Updater [ 12/Apr/18 ]

John L. Hammond (john.hammond@intel.com) merged in patch https://review.whamcloud.com/31556/
Subject: LU-10421 echo: use echo layer when finding stripe object
Project: fs/lustre-release
Branch: b2_10
Current Patch Set:
Commit: b21c6045c6dfffaea932b9632723a7569ebd5ce5
