[LU-5855] sanity-lfsck does not work under ZFS-based DNE mode Created: 04/Nov/14  Updated: 13/Oct/21  Resolved: 09/Dec/20

Status: Resolved
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.7.0
Fix Version/s: Lustre 2.14.0

Type: Bug
Priority: Critical
Reporter: Maloo
Assignee: Andreas Dilger
Resolution: Fixed
Votes: 0
Labels: HB

Issue Links:
Related
is related to LU-7585 Implement OI Scrub for ZFS Resolved
Severity: 3
Rank (Obsolete): 16391

 Description   

This issue was created by maloo for nasf <fan.yong@intel.com>

This issue relates to the following test suite run: https://testing.hpdd.intel.com/test_sets/d2e05b7e-632e-11e4-8a7a-5254006e85c2.

MDS crashed:

23:01:10:Lustre: DEBUG MARKER: == sanity-lfsck test 2e: namespace LFSCK can verify remote object linkEA == 06:00:52 (1414994452)
23:01:10:Lustre: ctl-lustre-MDT0000: super-sequence allocation rc = 0 [0x0000000340000400-0x0000000380000400):1:mdt
23:01:10:Lustre: Skipped 4 previous similar messages
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0x1603
23:01:10:Lustre: *** cfs_fail_loc=1603, val=0***
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl set_param fail_loc=0
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl lfsck_start -M lustre-MDT0000 -t namespace -r -A
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
23:01:10: awk '/^status/ { print $2 }'
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
23:01:10: awk '/^status/ { print $2 }'
23:01:10:BUG: unable to handle kernel NULL pointer dereference at (null)
23:01:10:IP: [<ffffffffa077aff9>] linkea_entry_unpack+0x9/0x60 [obdclass]
23:01:10:PGD 0
23:01:10:Oops: 0000 [#1] SMP
23:01:10:last sysfs file: /sys/devices/system/cpu/online
23:01:10:CPU 0
23:01:10:Modules linked in: osp(U) mdd(U) lod(U) mdt(U) lfsck(U) mgs(U) mgc(U) osd_zfs(U) lquota(U) lustre(U) lov(U) mdc(U) fid(U)
23:01:10:Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace |
23:01:10: awk '/^status/ { print $2 }'
23:01:10: lmv(U) fld(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfsd exportfs nfs lockd fscache auth_rpcgss nfs_acl sunrpc ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core zfs(P)(U) zcommon(P)(U) znvpair(P)(U) zavl(P)(U) zunicode(P)(U) spl(U) zlib_deflate microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
23:01:10:
23:01:10:Pid: 11262, comm: lfsck_namespace Tainted: P --------------- 2.6.32-431.29.2.el6_lustre.g8979f2c.x86_64 #1 Red Hat KVM
23:01:10:RIP: 0010:[<ffffffffa077aff9>] [<ffffffffa077aff9>] linkea_entry_unpack+0x9/0x60 [obdclass]
23:01:10:RSP: 0018:ffff88006d02d940 EFLAGS: 00010286
23:01:10:RAX: ffff88006ebcc890 RBX: ffff88006ebcc810 RCX: ffff88006ebcc890
23:01:10:RDX: ffff88006ebcc810 RSI: ffff88006d02db48 RDI: 0000000000000000
23:01:10:RBP: ffff88006d02d940 R08: ffff88006d02db30 R09: ffff88006d02db50
23:01:10:R10: 0000010000000004 R11: 0040030000001400 R12: ffff88006ebcca10
23:01:10:R13: ffff8800707e8540 R14: ffff88007079c000 R15: ffff8800707dc8b0
23:01:10:FS: 0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
23:01:10:CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
23:01:10:CR2: 0000000000000000 CR3: 0000000001a85000 CR4: 00000000000006f0
23:01:10:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
23:01:10:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
23:01:10:Process lfsck_namespace (pid: 11262, threadinfo ffff88006d02c000, task ffff8800705ccae0)
23:01:10:Stack:
23:01:10: ffff88006d02d960 ffffffffa0f8300f ffff88006ebcc800 ffff880070196e80
23:01:10:<d> ffff88006d02d9f0 ffffffffa0f99ec7 ffff88006d02d980 ffffffffa01b9646
23:01:10:<d> ffff88006d02d9d0 ffff880070180450 ffff88007019a800 ffff88006ebcc880
23:01:10:Call Trace:
23:01:10: [<ffffffffa0f8300f>] lfsck_namespace_unpack_linkea_entry+0x2f/0x60 [lfsck]
23:01:10: [<ffffffffa0f99ec7>] lfsck_namespace_dsd_single+0xf7/0xcb0 [lfsck]
23:01:10: [<ffffffffa01b9646>] ? nvlist_lookup_byte_array+0x16/0x20 [znvpair]
23:01:10: [<ffffffffa0f9b3db>] lfsck_namespace_dsd_multiple+0x2fb/0xe10 [lfsck]
23:01:10: [<ffffffffa0eb8cad>] ? osd_xattr_get+0x20d/0x310 [osd_zfs]
23:01:10: [<ffffffffa0f9c31c>] lfsck_namespace_double_scan_dir+0x42c/0xdb0 [lfsck]
23:01:10: [<ffffffffa0f9d024>] lfsck_namespace_double_scan_one+0x384/0x12c0 [lfsck]
23:01:10: [<ffffffffa0257faa>] ? zap_lookup_uint64+0xaa/0xc0 [zfs]
23:01:10: [<ffffffffa075f103>] ? lu_object_find_at+0xb3/0x100 [obdclass]
23:01:10: [<ffffffffa0f9eddf>] lfsck_namespace_assistant_handler_p2+0xe7f/0x1110 [lfsck]
23:01:10: [<ffffffffa0f8100d>] lfsck_assistant_engine+0x130d/0x1c50 [lfsck]
23:01:10: [<ffffffff81061d00>] ? default_wake_function+0x0/0x20
23:01:10: [<ffffffffa0f7fd00>] ? lfsck_assistant_engine+0x0/0x1c50 [lfsck]
23:01:10: [<ffffffff8109abf6>] kthread+0x96/0xa0
23:01:10: [<ffffffff8100c20a>] child_rip+0xa/0x20
23:01:10: [<ffffffff8109ab60>] ? kthread+0x0/0xa0
23:01:10: [<ffffffff8100c200>] ? child_rip+0x0/0x20
23:01:10:Code: 7a a0 31 c0 c7 05 5c 02 05 00 00 00 04 00 e8 7f 51 e8 ff 48 c7 c7 20 b2 7c a0 e8 63 4e e7 ff 0f 1f 00 55 48 89 e5 0f 1f 44 00 00 <0f> b6 07 44 0f b6 47 01 c1 e0 08 44 09 c0 89 06 4c 8b 47 02 4c
23:01:10:RIP [<ffffffffa077aff9>] linkea_entry_unpack+0x9/0x60 [obdclass]
23:01:10: RSP <ffff88006d02d940>
23:01:10:CR2: 0000000000000000
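
The DEBUG MARKER lines in the log above show the test polling the LFSCK state by piping `lctl get_param` output through `awk '/^status/ { print $2 }'`. A minimal sketch of that extraction step follows. The `lfsck_status` function name and the sample text are assumptions made for illustration only; on a live MDS the input would come from `lctl get_param -n mdd.lustre-MDT0000.lfsck_namespace`.

```shell
#!/bin/sh
# Hypothetical wrapper around the awk filter seen in the log:
# print the second field of any stdin line that starts with "status".
lfsck_status() {
	awk '/^status/ { print $2 }'
}

# Assumed sample input, NOT captured from a real system.
sample_output='name: lfsck_namespace
status: scanning-phase2'

# Prints the status word taken from the sample above.
printf '%s\n' "$sample_output" | lfsck_status
```

In the failing run the MDS crashed while this polling was still in progress, so the poll never got the chance to report a finished scan.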



 Comments   
Comment by nasf (Inactive) [ 04/Nov/14 ]

Here is the patch:
http://review.whamcloud.com/#/c/12552/

Comment by Gerrit Updater [ 19/Nov/14 ]

Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/12552/
Subject: LU-5855 lfsck: misc fixes for zfs-based backend
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: 26995a3588e72d65f609b12772d24a879c9deb7f

Comment by Andreas Dilger [ 05/Aug/20 ]

The sanity-lfsck.sh test_31 is still in the ALWAYS_EXCEPT list. Since ZFS now supports DNE striped directories, this exception should be removed to see if the test problem is fixed.
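
For readers unfamiliar with the mechanism, here is a rough sketch of how an exception list of this kind gates a test. It assumes `ALWAYS_EXCEPT` is a space-separated list of test names. The `is_excepted` helper and the list contents are hypothetical and are not copied from sanity-lfsck.sh or test-framework.sh.

```shell
#!/bin/sh
# Hypothetical helper: succeed if test name $1 appears as a whole word
# in the space-separated exception list $2.
is_excepted() {
	case " $2 " in
	*" $1 "*) return 0 ;;
	*) return 1 ;;
	esac
}

# Assumed list contents, for illustration only.
ALWAYS_EXCEPT="31"

for t in 2e 31; do
	if is_excepted "$t" "$ALWAYS_EXCEPT"; then
		echo "test_$t: skipped (in ALWAYS_EXCEPT)"
	else
		echo "test_$t: would run"
	fi
done
```

Padding both the list and the name with spaces keeps a name such as `3` from matching inside `31`. Removing an entry from the list is all it takes for the corresponding test to run again.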

Comment by Gerrit Updater [ 25/Nov/20 ]

Andreas Dilger (adilger@whamcloud.com) uploaded a new patch: https://review.whamcloud.com/40761
Subject: LU-5855 tests: enable skipped sanity-lfsck tests on ZFS
Project: fs/lustre-release
Branch: master
Current Patch Set: 1
Commit: 693ea9b803894c3307ae52661a2a1a15c51e156b

Comment by Gerrit Updater [ 09/Dec/20 ]

Oleg Drokin (green@whamcloud.com) merged in patch https://review.whamcloud.com/40761/
Subject: LU-5855 tests: enable skipped sanity-lfsck DNE ZFS tests
Project: fs/lustre-release
Branch: master
Current Patch Set:
Commit: fc61ddc252156cef5a21cefd82362bcdd77f3a51

Comment by Peter Jones [ 09/Dec/20 ]

Tests are now back on.

Generated at Sat Feb 10 07:04:20 UTC 2024 using Jira 9.4.14#940014-sha1:734e6822bbf0d45eff9af51f82432957f73aa32c.