[LU-4712] racer test_1: oops at __d_lookup+0x8c | Created: 05/Mar/14 | Updated: 17/Apr/19 | Resolved: 28/Aug/15 |
|
| Status: | Resolved |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.6.0 |
| Fix Version/s: | Lustre 2.7.0 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Maloo | Assignee: | Di Wang |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | dne2 |
| Environment: | client and server: lustre-master build # 1911 RHEL6 ldiskfs DNE |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 12954 |
| Description |
|
This issue was created by maloo for sarah <sarah@whamcloud.com>. This issue relates to the following test suite run: http://maloo.whamcloud.com/test_sets/57b54ffa-a02d-11e3-947c-52540035b04c. The sub-test test_1 failed with the following error:
client console
00:22:43:Lustre: DEBUG MARKER: == racer test 1: racer on clients: client-32vm5,client-32vm6.lab.whamcloud.com DURATION=900 == 00:20:32 (1393489232)
00:22:44:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=2 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer1
00:22:44:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=2 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer1
00:22:44:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=2 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre2/racer
00:22:44:Lustre: DEBUG MARKER: DURATION=900 MDSCOUNT=2 /usr/lib64/lustre/tests/racer/racer.sh /mnt/lustre/racer
00:22:44:LustreError: 14649:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff880037cd8400: nlink 0 < 2 corrupt stripe 0 [0x3c0000402:0xf4:0x0]:[0x3c0000402:0xf4:0x0]
00:22:44:LustreError: 14651:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff880037cd8400: nlink 0 < 2 corrupt stripe 0 [0x3c0000402:0xf4:0x0]:[0x3c0000402:0xf4:0x0]
00:22:44:LustreError: 17543:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x3c0000401:0x2b6:0x0]:[0x3c0000401:0x2b6:0x0]
00:22:44:LustreError: 19056:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:22:44:LustreError: 17285:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:22:44:LustreError: 22091:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -2
00:22:44:LustreError: 22091:0:(dir.c:467:ll_dir_setstripe()) Skipped 3 previous similar messages
00:22:44:LustreError: 24110:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:22:44:LustreError: 4266:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x3c0000403:0x61b:0x0]:[0x3c0000403:0x61b:0x0]
00:22:44:LustreError: 6322:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x400000402:0x8d1:0x0]:[0x400000402:0x8d1:0x0]
00:22:44:LustreError: 11-0: lustre-OST0006-osc-ffff880037cd8400: Communicating with 10.10.4.199@tcp, operation ldlm_enqueue failed with -107.
00:22:45:Lustre: lustre-OST0006-osc-ffff880037cd8400: Connection to lustre-OST0006 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:22:45:LustreError: 167-0: lustre-OST0006-osc-ffff880037cd8400: This client was evicted by lustre-OST0006; in progress operations using this service will fail.
00:22:45:LustreError: 11-0: lustre-OST0006-osc-ffff880037cd8400: Communicating with 10.10.4.199@tcp, operation ldlm_enqueue failed with -107.
00:22:45:Lustre: 2344:0:(llite_lib.c:2697:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.10.4.198@tcp:/lustre/fid: [0x3c0000401:0x949:0x0]/ may get corrupted (rc -108)
00:23:07:Lustre: 2344:0:(llite_lib.c:2697:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.10.4.198@tcp:/lustre/fid: [0x400000401:0xcd7:0x0]/ may get corrupted (rc -108)
00:23:07:LustreError: 7998:0:(osc_lock.c:830:osc_ldlm_completion_ast()) lock@ffff880071ad8738[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x380000400:0x45:0x0] {
00:23:07:LustreError: 7998:0:(osc_lock.c:830:osc_ldlm_completion_ast()) lovsub@ffff88006e14a8a0: [0 ffff88006f61fe30 W(2):[0, 18446744073709551615]@[0x400000401:0x2b2:0x0]]
00:23:07:LustreError: 7998:0:(osc_lock.c:830:osc_ldlm_completion_ast()) osc@ffff88006f6227b8: ffff880072f2c100 0x20080020002 0xb20e2ac1892f0c38 3 ffff88007d6978c8 size: 0 mtime: 1393489253 atime: 0 ctime: 1393489253 blocks: 0
00:23:07:LustreError: 7998:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff880071ad8738
00:23:08:LustreError: 7998:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) lock@ffff88006e240660[3 1 0 0 0 00000005] W(2):[0, 18446744073709551615]@[0x400000401:0x2b2:0x0] {
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) vvp@ffff88006f61ef60:
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) lov@ffff88006f61fe30: 4
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) 0 0: ---
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) 1 0: lock@ffff88006c625ed0[2 5 0 0 0 00000001] W(2):[0, 18446744073709551615]@[0x200000400:0x43:0x0] {
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) lovsub@ffff88006e14ac60: [1 ffff88006f61fe30 W(2):[0, 18446744073709551615]@[0x400000401:0x2b2:0x0]]
00:23:08:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) osc@ffff88006d1facc0: ffff88006d321c80 0x20080020002 0xb20e2ac1892f0dff 5 (null) size: 0 mtime: 1393489253 atime: 1393489251 ctime: 1393489253 blocks: 0
00:23:09:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) } lock@ffff88006c625ed0
00:23:09:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) 2 0: ---
00:23:09:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) 3 0: ---
00:23:09:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel())
00:23:10:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) } lock@ffff88006e240660
00:23:10:LustreError: 10363:0:(lov_lock.c:798:lov_lock_cancel()) lov_lock_cancel fails with -5.
00:23:10:Lustre: lustre-OST0006-osc-ffff880037cd8400: Connection restored to lustre-OST0006 (at 10.10.4.199@tcp)
00:23:10:LustreError: 11-0: lustre-OST0004-osc-ffff88007aba6800: Communicating with 10.10.4.199@tcp, operation ldlm_enqueue failed with -107.
00:23:10:Lustre: lustre-OST0004-osc-ffff88007aba6800: Connection to lustre-OST0004 (at 10.10.4.199@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:23:11:LustreError: 167-0: lustre-OST0004-osc-ffff88007aba6800: This client was evicted by lustre-OST0004; in progress operations using this service will fail.
00:23:11:Lustre: 2343:0:(llite_lib.c:2697:ll_dirty_page_discard_warn()) lustre: dirty page discard: 10.10.4.198@tcp:/lustre/fid: [0x400000401:0xd55:0x0]/ may get corrupted (rc -108)
00:23:11:Lustre: 2343:0:(llite_lib.c:2697:ll_dirty_page_discard_warn()) Skipped 5 previous similar messages
00:23:11:LustreError: 15161:0:(osc_lock.c:830:osc_ldlm_completion_ast()) lock@ffff880070acaa98[2 3 0 1 1 00000000] W(2):[0, 18446744073709551615]@[0x100040000:0x64:0x0] {
00:23:11:LustreError: 15161:0:(osc_lock.c:830:osc_ldlm_completion_ast()) lovsub@ffff880063ef3120: [0 ffff88006d391610 W(2):[0, 18446744073709551615]@[0x3c0000402:0x365:0x0]]
00:23:11:LustreError: 15161:0:(osc_lock.c:830:osc_ldlm_completion_ast()) osc@ffff8800674e5420: ffff88007081f740 0x20080020002 0xb20e2ac18931b4ea 3 ffff8800708b6058 size: 5 mtime: 1393489283 atime: 0 ctime: 1393489283 blocks: 8
00:23:12:LustreError: 15161:0:(osc_lock.c:830:osc_ldlm_completion_ast()) } lock@ffff880070acaa98
00:23:12:LustreError: 15161:0:(osc_lock.c:830:osc_ldlm_completion_ast()) dlmlock returned -5
00:23:12:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) lock@ffff88006db3d228[2 1 0 0 0 00000005] W(2):[0, 18446744073709551615]@[0x3c0000402:0x365:0x0] {
00:23:13:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) vvp@ffff880070b9e9c0:
00:23:13:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) lov@ffff88006d391610: 3
00:28:36:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) 0 0: ---
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) 1 0: lock@ffff880070aca588[0 5 0 0 0 00000000] W(2):[0, 18446744073709551615]@[0x100050000:0x64:0x0] {
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) lovsub@ffff88006c4c3460: [1 ffff88006d391610 W(2):[0, 18446744073709551615]@[0x3c0000402:0x365:0x0]]
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) osc@ffff88006ed489e0: ffff88006a958140 0x20080020002 0xb20e2ac18931bcdf 4 (null) size: 0 mtime: 1393489283 atime: 0 ctime: 1393489283 blocks: 0
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) } lock@ffff880070aca588
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) 2 0: ---
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel())
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) } lock@ffff88006db3d228
00:28:37:LustreError: 10650:0:(lov_lock.c:798:lov_lock_cancel()) lov_lock_cancel fails with -5.
00:28:37:Lustre: lustre-OST0004-osc-ffff88007aba6800: Connection restored to lustre-OST0004 (at 10.10.4.199@tcp)
00:28:37:LustreError: 15216:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:28:37:LustreError: 15216:0:(dir.c:467:ll_dir_setstripe()) Skipped 1 previous similar message
00:28:37:LustreError: 16899:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x400000404:0xe1c:0x0]:[0x400000404:0xe1c:0x0]
00:28:38:LustreError: 17215:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:28:38:LustreError: 17215:0:(dir.c:467:ll_dir_setstripe()) Skipped 1 previous similar message
00:28:38:LustreError: 20240:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:28:38:LustreError: 20240:0:(dir.c:467:ll_dir_setstripe()) Skipped 3 previous similar messages
00:28:38:LustreError: 25893:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x400000402:0x1339:0x0]:[0x400000402:0x1339:0x0]
00:28:38:LustreError: 25893:0:(lmv_intent.c:251:lmv_revalidate_slaves()) Skipped 4 previous similar messages
00:28:38:LustreError: 31191:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff880037cd8400: nlink 0 < 2 corrupt stripe 0 [0x400000402:0x13b0:0x0]:[0x400000402:0x13b0:0x0]
00:28:38:LustreError: 31191:0:(lmv_intent.c:251:lmv_revalidate_slaves()) Skipped 1 previous similar message
00:28:38:LustreError: 4926:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:28:38:LustreError: 4926:0:(dir.c:467:ll_dir_setstripe()) Skipped 3 previous similar messages
00:28:38:LustreError: 11-0: lustre-MDT0001-mdc-ffff88007aba6800: Communicating with 10.10.4.202@tcp, operation ldlm_enqueue failed with -107.
00:28:38:Lustre: lustre-MDT0001-mdc-ffff88007aba6800: Connection to lustre-MDT0001 (at 10.10.4.202@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:28:38:LustreError: 167-0: lustre-MDT0001-mdc-ffff88007aba6800: This client was evicted by lustre-MDT0001; in progress operations using this service will fail.
00:28:38:LustreError: 31718:0:(mdc_locks.c:920:mdc_enqueue()) ldlm_cli_enqueue: -5
00:28:38:LustreError: 31718:0:(mdc_locks.c:920:mdc_enqueue()) Skipped 4 previous similar messages
00:28:38:LustreError: 31718:0:(file.c:3088:ll_inode_revalidate_fini()) lustre: revalidate FID [0x400000401:0x1ad8:0x0] error: rc = -5
00:28:38:LustreError: 7822:0:(mdc_locks.c:920:mdc_enqueue()) ldlm_cli_enqueue: -108
00:28:38:LustreError: 7822:0:(mdc_locks.c:920:mdc_enqueue()) Skipped 2 previous similar messages
00:28:38:LustreError: 7822:0:(mdc_request.c:1537:mdc_read_page()) lustre-MDT0001-mdc-ffff88007aba6800: [0x400000401:0x1ad8:0x0] lock enqueue fails: rc = -108
00:28:38:LustreError: 7822:0:(file.c:174:ll_close_inode_openhandle()) lustre-clilmv-ffff88007aba6800: inode [0x400000401:0x1ad8:0x0] mdc close failed: rc = -108
00:28:39:LustreError: 31718:0:(file.c:3088:ll_inode_revalidate_fini()) Skipped 1 previous similar message
00:28:39:LustreError: 6258:0:(file.c:3088:ll_inode_revalidate_fini()) lustre: revalidate FID [0x400000401:0x1a30:0x0] error: rc = -108
00:28:39:LustreError: 31735:0:(file.c:174:ll_close_inode_openhandle()) lustre-clilmv-ffff88007aba6800: inode [0x400000400:0x1:0x0] mdc close failed: rc = -108
00:31:21:LustreError: 31735:0:(file.c:174:ll_close_inode_openhandle()) Skipped 2 previous similar messages
00:31:21:LustreError: 31742:0:(lmv_obd.c:1424:lmv_fid_alloc()) Can't alloc new fid, rc -19
00:31:21:LustreError: 7645:0:(vvp_io.c:1215:vvp_io_init()) lustre: refresh file layout [0x400000400:0x1c3a:0x0] error -108.
00:31:21:LustreError: 7645:0:(vvp_io.c:1215:vvp_io_init()) lustre: refresh file layout [0x400000400:0x1c3a:0x0] error -108.
00:31:21:LustreError: 31750:0:(lmv_obd.c:1424:lmv_fid_alloc()) Can't alloc new fid, rc -19
00:31:21:Lustre: lustre-MDT0001-mdc-ffff88007aba6800: Connection restored to lustre-MDT0001 (at 10.10.4.202@tcp)
00:31:21:LustreError: 341:0:(dir.c:467:ll_dir_setstripe()) mdc_setattr fails: rc = -22
00:31:21:LustreError: 341:0:(dir.c:467:ll_dir_setstripe()) Skipped 5 previous similar messages
00:31:21:LustreError: 32200:0:(lmv_intent.c:251:lmv_revalidate_slaves()) lustre-clilmv-ffff88007aba6800: nlink 0 < 2 corrupt stripe 0 [0x400000401:0x1ad8:0x0]:[0x400000401:0x1ad8:0x0]
00:31:21:LustreError: 11-0: lustre-MDT0001-mdc-ffff880037cd8400: Communicating with 10.10.4.202@tcp, operation ldlm_enqueue failed with -107.
00:31:21:LustreError: Skipped 1 previous similar message
00:31:21:Lustre: lustre-MDT0001-mdc-ffff880037cd8400: Connection to lustre-MDT0001 (at 10.10.4.202@tcp) was lost; in progress operations using this service will wait for recovery to complete
00:31:21:LustreError: 167-0: lustre-MDT0001-mdc-ffff880037cd8400: This client was evicted by lustre-MDT0001; in progress operations using this service will fail.
00:31:21:LustreError: 13398:0:(mdc_locks.c:920:mdc_enqueue()) ldlm_cli_enqueue: -5
00:31:21:LustreError: 13398:0:(mdc_locks.c:920:mdc_enqueue()) Skipped 224 previous similar messages
00:31:21:LustreError: 15879:0:(file.c:174:ll_close_inode_openhandle()) lustre-clilmv-ffff880037cd8400: inode [0x400000401:0x239f:0x0] mdc close failed: rc = -108
00:31:21:LustreError: 15879:0:(file.c:174:ll_close_inode_openhandle()) Skipped 53 previous similar messages
00:31:21:LustreError: 15671:0:(mdc_request.c:1537:mdc_read_page()) lustre-MDT0001-mdc-ffff880037cd8400: [0x400000400:0x1:0x0] lock enqueue fails: rc = -108
00:31:21:LustreError: 15671:0:(mdc_request.c:1537:mdc_read_page()) Skipped 1 previous similar message
00:31:21:LustreError: 15730:0:(lmv_obd.c:1424:lmv_fid_alloc()) Can't alloc new fid, rc -19
00:31:21:LustreError: 15730:0:(lmv_obd.c:1424:lmv_fid_alloc()) Skipped 22 previous similar messages
00:31:21:LustreError: 4415:0:(vvp_io.c:1215:vvp_io_init()) lustre: refresh file layout [0x400000400:0x1ad0:0x0] error -108.
00:31:21:LustreError: 4415:0:(vvp_io.c:1215:vvp_io_init()) Skipped 4 previous similar messages
00:31:21:Lustre: lustre-MDT0001-mdc-ffff880037cd8400: Connection restored to lustre-MDT0001 (at 10.10.4.202@tcp)
00:31:21:BUG: unable to handle kernel paging request at fffffffd00000018
00:31:21:IP: [<ffffffff811a374c>] __d_lookup+0x8c/0x150
00:31:21:PGD 1a87067 PUD 0
00:31:21:Oops: 0000 [#1] SMP
00:31:21:last sysfs file: /sys/devices/system/cpu/online
00:31:22:CPU 1
00:31:22:Modules linked in: lustre(U) obdecho(U) mgc(U) lov(U) osc(U) mdc(U) lmv(U) fid(U) fld(U) ptlrpc_gss(U) ptlrpc(U) obdclass(U) ksocklnd(U) lnet(U) sha512_generic sha256_generic libcfs(U) nfs fscache nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs autofs4 ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ib_mad ib_core microcode virtio_balloon 8139too 8139cp mii i2c_piix4 i2c_core ext3 jbd mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
00:31:22:
00:31:22:Pid: 10651, comm: file_rm.sh Not tainted 2.6.32-431.5.1.el6.x86_64 #1 Red Hat KVM
00:31:22:RIP: 0010:[<ffffffff811a374c>] [<ffffffff811a374c>] __d_lookup+0x8c/0x150
00:31:22:RSP: 0018:ffff880071a31c88 EFLAGS: 00010286
00:31:22:RAX: 0000000000000005 RBX: fffffffd00000000 RCX: 0000000000000012
00:31:22:RDX: 018721e00667721f RSI: ffff880071a31d68 RDI: ffff88007e801980
00:31:22:RBP: ffff880071a31cd8 R08: ffff880071a31d7d R09: 00000000fffffffa
00:31:22:R10: 0000000000000004 R11: 0000000000000000 R12: fffffffcffffffe8
00:31:22:R13: ffff88007e801980 R14: 00000000086181b9 R15: 0000000000003f40
00:31:22:FS: 00007fbad989b700(0000) GS:ffff880002300000(0000) knlGS:0000000000000000
00:31:22:CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
00:31:22:CR2: fffffffd00000018 CR3: 000000006e172000 CR4: 00000000000006e0
00:31:22:DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
00:31:22:DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
00:31:22:Process file_rm.sh (pid: 10651, threadinfo ffff880071a30000, task ffff880072d3e080)
00:31:22:Stack:
00:31:22: ffff880071a31d78 0000000500000001 0000000000000005 ffff880071a31d68
00:31:22:<d> 0000000000000000 0000000000010870 ffff880071a31d68 ffff88007e801980
00:31:22:<d> ffff880071a31d68 0000000000003f40 ffff880071a31d08 ffffffff811a3fc5
00:31:22:Call Trace:
00:31:22: [<ffffffff811a3fc5>] d_lookup+0x35/0x60
00:31:22: [<ffffffff811a4073>] d_hash_and_lookup+0x83/0xb0
00:31:22: [<ffffffff811f8930>] proc_flush_task+0xa0/0x290
00:31:22: [<ffffffff810751b8>] release_task+0x48/0x4b0
00:31:22: [<ffffffff81075fb6>] wait_consider_task+0x7e6/0xb20
00:31:22: [<ffffffff810763e6>] do_wait+0xf6/0x240
00:31:22: [<ffffffff810765d3>] sys_wait4+0xa3/0x100
00:31:22: [<ffffffff81074b70>] ? child_wait_callback+0x0/0x70
00:31:22: [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
00:31:22: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
00:31:22:Code: 48 03 05 48 6d a6 00 48 8b 18 8b 45 bc 48 85 db 48 89 45 c0 75 11 eb 74 0f 1f 80 00 00 00 00 48 8b 1b 48 85 db 74 65 4c 8d 63 e8 <45> 39 74 24 30 75 ed 4d 39 6c 24 28 75 e6 4d 8d 7c 24 08 4c 89
00:31:22:RIP [<ffffffff811a374c>] __d_lookup+0x8c/0x150
00:31:22: RSP <ffff880071a31c88>
00:31:22:CR2: fffffffd00000018
|
| Comments |
| Comment by Andreas Dilger [ 13/Mar/14 ] |
|
Looks like a bug with racer and striped directories. There are quite a number of scary-looking errors that are not fatal, but imply some problems with DNE2 striped directories:

lmv_revalidate_slaves()) lustre-clilmv-ffff880037cd8400: nlink 0 < 2 corrupt stripe 0 [0x3c0000402:0xf4:0x0]:[0x3c0000402:0xf4:0x0]

It also looks like the client was evicted from the MDT, which shouldn't happen even during racer. That would imply some kind of deadlock or bug in the code.

Some error messages could just be turned off, I think, since the errors are returned to the application anyway and printing them on the console just hides more important messages:

ll_dir_setstripe()) mdc_setattr fails: rc = -22 |
| Comment by Di Wang [ 13/Mar/14 ] |
|
I suspect the panic has already been fixed by a later landing; in any case, I did not see the panic in my local run. But these console error messages need to be turned off. I will provide a patch.
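As a rough illustration of the kind of quieting described above (a sketch only; the exact CERROR() call and its context at dir.c:467 are assumed here, not taken from the tree), the console message could be downgraded to a debug-level message, since the error code is already returned to the application:

--- a/lustre/llite/dir.c
+++ b/lustre/llite/dir.c
@@ ... @@ ll_dir_setstripe()
-	CERROR("mdc_setattr fails: rc = %d\n", rc);
+	/* rc is returned to the caller; keep this message off the console */
+	CDEBUG(D_INFO, "mdc_setattr fails: rc = %d\n", rc);
|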
| Comment by Di Wang [ 18/Mar/14 ] |
| Comment by John Hammond [ 15/Apr/14 ] |
|
I see this and similar oopses running racer with MDSCOUNT=4, 5% RPC drop, and migration disabled (commented out from racer/racer.sh).

diff --git a/lustre/tests/racer/file_create.sh b/lustre/tests/racer/file_create.sh
index 828e69c..8f4830f 100755
--- a/lustre/tests/racer/file_create.sh
+++ b/lustre/tests/racer/file_create.sh
@@ -8,8 +8,8 @@ OSTCOUNT=${OSTCOUNT:-$(lfs df $DIR 2> /dev/null | grep -c OST)}
while /bin/true ; do
file=$((RANDOM % MAX))
- SIZE=$((RANDOM * MAX_MB / 32))
- echo "file_create: FILE=$DIR/$file SIZE=$SIZE"
+ SIZE=$((RANDOM % 4))
+
[ $OSTCOUNT -gt 0 ] &&
lfs setstripe -c $((RANDOM % OSTCOUNT)) $DIR/$file 2> /dev/null
dd if=/dev/zero of=$DIR/$file bs=1k count=$SIZE 2> /dev/null
diff --git a/lustre/tests/racer/racer.sh b/lustre/tests/racer/racer.sh
index 6ba8b7c..65528cb 100755
--- a/lustre/tests/racer/racer.sh
+++ b/lustre/tests/racer/racer.sh
@@ -16,7 +16,7 @@ RACER_PROGS="file_create dir_create file_rm file_rename file_link file_symlink \
file_list file_concat file_exec"
if [ $MDSCOUNT -gt 1 ]; then
- RACER_PROGS="${RACER_PROGS} dir_remote dir_migrate"
+ RACER_PROGS="${RACER_PROGS} dir_remote" # dir_migrate
fi
racer_cleanup()
--
# export MDSCOUNT=4
# export MOUNT_2=y
# llmount.sh
...
# lctl set_param fail_loc=0x08000505
# lctl set_param fail_val=20
# sh lustre/tests/racer.sh
|
| Comment by Andreas Dilger [ 16/Jan/15 ] |
|
Note in advance that the landing of http://review.whamcloud.com/9689 is not necessarily expected to fix this bug, so the bug should not be closed when it lands. |
| Comment by Gerrit Updater [ 04/Feb/15 ] |
|
Oleg Drokin (oleg.drokin@intel.com) merged in patch http://review.whamcloud.com/9689/ |
| Comment by Peter Jones [ 04/Feb/15 ] |
|
Landed for 2.7 |
| Comment by Lai Siyao [ 09/Feb/15 ] |
|
Peter, http://review.whamcloud.com/9689/ doesn't fully fix this issue, as Andreas noted in the comment above, so this shouldn't be marked resolved. |
| Comment by Peter Jones [ 09/Feb/15 ] |
|
ok |
| Comment by Peter Jones [ 28/Aug/15 ] |
|
As per Di this is no longer happening on current master |