[LU-15763] large-scale test_3a: LustreError: (file.c:243:ll_close_inode_openhandle()) mdc close failed: rc = -108 Created: 19/Apr/22 Updated: 03/Nov/23 |
|
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.12.9, Lustre 2.15.0 |
| Fix Version/s: | Lustre 2.16.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Maloo | Assignee: | Alex Deiter |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | failing_tests | ||
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
This issue was created by maloo for Cliff White <cwhite@whamcloud.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/cde24d83-d48d-4472-a697-ac0b030ae772

The test run timed out after the nfsv4 parallel-scale run; there are multiple stack dumps on various nodes, for example:

[11156.489772] Lustre: DEBUG MARKER: ./compilebench -D /mnt/lustre/d0.parallel-scale-nfs/d0.compilebench.513626 -i 2 -r 2 --makej
[16428.917978] compilebench (518595): drop_caches: 3
[16733.594234] compilebench (518595): drop_caches: 3
[18407.728842] sysrq: SysRq : Changing Loglevel
[18407.737126] sysrq: Loglevel set to 8
[18407.740113] sysrq: SysRq : Show State
[18407.740908] task:systemd state:S stack: 0 pid: 1 ppid: 0 flags:0x00000000
[18407.742495] Call Trace:
[18407.743081] __schedule+0x2c4/0x700
[18407.743825] schedule+0x37/0xa0
[18407.744493] schedule_hrtimeout_range_clock+0xf7/0x110
[18407.745550] ? seq_release+0x24/0x30
[18407.746279] ep_poll+0x403/0x460
[18407.747000] ? wake_up_q+0x80/0x80
[18407.747713] do_epoll_wait+0xb0/0xd0
[18407.748453] __x64_sys_epoll_wait+0x1a/0x20
[18407.749299] do_syscall_64+0x5b/0x1a0
[18407.750070] entry_SYSCALL_64_after_hwframe+0x65/0xca
[18407.751089] RIP: 0033:0x7f87afd9c0f7
[18407.751837] Code: Unable to access opcode bytes at RIP 0x7f87afd9c0cd.
[18407.753097] RSP: 002b:00007ffe859313f0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[18407.754551] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f87afd9c0f7
[18407.755919] RDX: 000000000000006a RSI: 00007ffe85931430 RDI: 0000000000000004
[18407.757299] RBP: 00007ffe85931430 R08: 0000000000000000 R09: 00007ffe85931a80
[18407.758669] R10: 00000000ffffffff R11: 0000000000000293 R12: 000000000000006a
[18407.760038] R13: 00000000ffffffff R14: 00007ffe85931430 R15: 0000000000000001
[18407.761415] task:kthreadd state:S stack: 0 pid: 2 ppid: 0 flags:0x80004000
[18407.762995] Call Trace: |
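For context on the error in the summary: rc = -108 is -ESHUTDOWN ("Cannot send after transport endpoint shutdown"), which ll_close_inode_openhandle() typically reports when the MDC import has already been disconnected or evicted by the time the close RPC is sent. A minimal sketch (not taken from this test run) of how one might inspect the client-side import state when the message appears; the grep patterns are only illustrative:

    # Sketch only: inspect the MDC import on the client logging
    # "mdc close failed: rc = -108" (-ESHUTDOWN, errno 108).
    lctl get_param mdc.*.import | egrep 'name:|target:|state:'
    # and confirm which close calls failed on the console:
    dmesg | grep ll_close_inode_openhandle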
| Comments |
| Comment by Sarah Liu [ 14/Jun/22 ] |
|
Similar error in 2.12.9 with an el8.6 client, in large-scale test_3a and performance-sanity test_6; client 2 crash:

[14214.730002] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 03:44:34 (1655178274)
[14215.236907] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm4.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 200
[14215.600050] Lustre: DEBUG MARKER: onyx-46vm4.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 200
[14414.697205] Lustre: DEBUG MARKER: /usr/sbin/lctl mark 3 : Starting failover on mds1
[14415.090008] Lustre: DEBUG MARKER: 3 : Starting failover on mds1
[14416.480451] LustreError: 11-0: lustre-MDT0000-mdc-ffff8bb04541b000: operation ldlm_enqueue to node 10.240.24.11@tcp failed: rc = -19
[14416.483207] LustreError: Skipped 2 previous similar messages
[14416.483304] Lustre: lustre-MDT0000-mdc-ffff8bb04541b000: Connection to lustre-MDT0000 (at 10.240.24.11@tcp) was lost; in progress operations using this service will wait for recovery to complete
[14416.487495] Lustre: Skipped 2 previous similar messages
[14447.015526] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 03:48:27 \(1655178507\)
[14447.375691] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 03:48:27 (1655178507)
[14447.951613] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[14448.364838] Lustre: DEBUG MARKER: onyx-46vm4.onyx.whamcloud.com: executing set_default_debug -1 all 4
[14455.474927] Lustre: DEBUG MARKER: /usr/sbin/lctl mark == rpc test complete, duration -o sec ================================================================ 03:48:35 \(1655178515\)
[14456.067224] Lustre: DEBUG MARKER: == rpc test complete, duration -o sec ================================================================ 03:48:35 (1655178515)
[14456.877997] Lustre: DEBUG MARKER: /usr/sbin/lctl mark onyx-46vm4.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 200
[14457.333427] Lustre: DEBUG MARKER: onyx-46vm4.onyx.whamcloud.com: executing _wait_recovery_complete *.lustre-MDT0000.recovery_status 200
[20405.164682] sysrq: SysRq : Changing Loglevel
[20405.167754] sysrq: Loglevel set to 8
[20405.171169] sysrq: SysRq : Show State
[20405.171956] task:systemd state:S stack: 0 pid: 1 ppid: 0 flags:0x00000000
[20405.173513] Call Trace:
[20405.174071] __schedule+0x2d1/0x830
[20405.174837] schedule+0x35/0xa0
[20405.175479] schedule_hrtimeout_range_clock+0xf1/0x100
[20405.176494] ? ep_scan_ready_list.constprop.23+0x20c/0x230
[20405.177576] ? __handle_mm_fault+0x7c4/0x7e0
[20405.178444] ep_poll+0x3f3/0x450
[20405.179134] ? wake_up_q+0x70/0x70
[20405.179845] do_epoll_wait+0xb0/0xd0
[20405.180576] __x64_sys_epoll_wait+0x1a/0x20
[20405.181414] do_syscall_64+0x5b/0x1a0
[20405.182197] entry_SYSCALL_64_after_hwframe+0x65/0xca
[20405.183190] RIP: 0033:0x7f63a7011a27
[20405.183935] Code: Unable to access opcode bytes at RIP 0x7f63a70119fd.
[20405.185171] RSP: 002b:00007ffdb172ee70 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[20405.186592] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f63a7011a27
[20405.187932] RDX: 0000000000000058 RSI: 00007ffdb172eeb0 RDI: 0000000000000004
[20405.189273] RBP: 00007ffdb172eeb0 R08: 0000000000000000 R09: 00007ffdb172eeb0
[20405.190616] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000058
[20405.191965] R13: 00000000ffffffff R14: ffffffffffffffff R15: 00007ffdb172f430
[20405.193340] task:kthreadd state:S stack: 0 pid: 2 ppid: 0 flags:0x80004000
[20405.194898] Call Trace:
[20405.195414] __schedule+0x2d1/0x830
[20405.196131] ? __schedule+0x2f3/0x830
[20405.196884] schedule+0x35/0xa0
[20405.197522] kthreadd+0x2d7/0x2f0
[20405.198249] ? kthread_is_per_cpu+0x20/0x20
[20405.199092] ret_from_fork+0x35/0x40
[20405.199825] task:rcu_gp state:I stack: 0 pid: 3 ppid: 2 flags:0x80004000
[20405.201384] Call Trace:
[20405.201922] __schedule+0x2d1/0x830
[20405.202620] schedule+0x35/0xa0
[20405.203294] rescuer_thread+0x289/0x330
[20405.204102] ? process_one_work+0x360/0x360
[20405.204956] kthread+0x10a/0x120
[20405.205641] ? set_kthread_struct+0x40/0x40
[20405.206472] ret_from_fork+0x35/0x40
[20405.207206] task:rcu_par_gp state:I stack: 0 pid: 4 ppid: 2 flags:0x80004000
[20405.208767] Call Trace:
[20405.209288] __schedule+0x2d1/0x830
[20405.210007] schedule+0x35/0xa0
[20405.210659] rescuer_thread+0x289/0x330
[20405.211442] ? process_one_work+0x360/0x360
[20405.212274] kthread+0x10a/0x120
[20405.212951] ? set_kthread_struct+0x40/0x40
[20405.213782] ret_from_fork+0x35/0x40
[20405.214501] task:kworker/0:0H state:I stack: 0 pid: 6 ppid: 2 flags:0x80004000
[20405.216058] Workqueue: 0x0 (events_highpri)
[20405.216909] Call Trace:
[20405.217428] __schedule+0x2d1/0x830
[20405.218138] schedule+0x35/0xa0
[20405.218787] worker_thread+0xb7/0x390
[20405.219519] ? create_worker+0x1a0/0x1a0
[20405.220295] kthread+0x10a/0x120
[20405.220969] ? set_kthread_struct+0x40/0x40
[20405.221804] ret_from_fork+0x35/0x40
[20405.222528] task:mm_percpu_wq state:I stack: 0 pid: 8 ppid: 2 flags:0x80004000
[20405.224077] Call Trace:
[20405.224590] __schedule+0x2d1/0x830
[20405.225300] schedule+0x35/0xa0
[20405.225961] rescuer_thread+0x289/0x330
[20405.226758] ? process_one_work+0x360/0x360
[20405.227575] kthread+0x10a/0x120
[20405.228238] ? set_kthread_struct+0x40/0x40
[20405.229078] ret_from_fork+0x35/0x40
[20405.229799] task:rcu_tasks_rude_ state:S stack: 0 pid: 9 ppid: 2 flags:0x80004000
[20405.231352] Call Trace:
[20405.231888] __schedule+0x2d1/0x830
[20405.232579] ? __switch_to_asm+0x41/0x70
[20405.233358] ? rcu_tasks_kthread+0x102/0x1a0
[20405.234248] schedule+0x35/0xa0
[20405.234909] rcu_tasks_kthread+0x170/0x1a0
[20405.235734] ? finish_wait+0x80/0x80
[20405.236457] ? rcu_tasks_wait_gp+0x1f0/0x1f0
[20405.237300] kthread+0x10a/0x120
[20405.237976] ? set_kthread_struct+0x40/0x40
[20405.238821] ret_from_fork+0x35/0x40
[20405.239539] task:rcu_tasks_trace state:S stack: 0 pid: 10 ppid: 2 flags:0x80004000
[20405.241087] Call Trace: |
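For reference, the "SysRq : Changing Loglevel" / "Show State" blocks in these console logs appear to be the all-task stack dumps requested by the test harness once the run times out; the same output can be produced by hand on a hung client through the standard sysrq interface (a sketch, assuming sysrq is enabled on the node):

    # Sketch: generate the same console dump manually on a stuck node.
    echo 8 > /proc/sysrq-trigger   # "Changing Loglevel" / "Loglevel set to 8"
    echo t > /proc/sysrq-trigger   # "Show State": dump every task and its stack
    dmesg                          # the "task:... Call Trace:" blocks land in the console log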
| Comment by Andreas Dilger [ 16/Sep/22 ] |
|
It looks like large-scale test_3a has never passed with RHEL8.6 (55/55 timeouts in the past 4 weeks) or el9 (9/9 timeouts in the past 4 weeks), while it passed 100% (309/309) on CentOS8.5, 37/43 runs on CentOS8.3, and 100% (156/156) on el7.9. |
| Comment by Minh Diep [ 04/Feb/23 ] |
|
This large-scale test_3a failure really needs to be looked at carefully: it only failed on rhel8.6 and rhel9.0, but passed on rhel8.7, sles15sp3, ubuntu20.04 and ubuntu22.04. |
| Comment by Andreas Dilger [ 24/Mar/23 ] |
|
It looks like mdsrate is failing due to the UCX ERROR seen in other uses of mdsrate. I would prefer that all of these tests be moved over to use mdtest to avoid what seems like a bug in mdsrate...

https://testing.whamcloud.com/sub_tests/2883fff3-fd86-4e82-bd7a-c6fb77995b88

trevis-67vm4: *.lustre-MDT0000.recovery_status status: COMPLETE
CMD: trevis-67vm4 lctl get_param -n *.lustre-MDT0000.recovery_status
RECOVERY TIME: NFILES=50000 number of clients: 2 recovery_duration: 11
Rate: 526.79 eff 526.82 aggr 87.80 avg client creates/sec (total: 6 threads 50000 creates 1 dirs 6 threads/dir 94.91 secs)
0: trevis-67vm1.trevis.whamcloud.com finished at Wed Mar 22 02:50:30 2023
+ chmod 0777 /mnt/lustre
drwxrwxrwx 4 root root 65536 Mar 22 02:48 /mnt/lustre
+ su mpiuser sh -c "/usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 -mca boot ssh --oversubscribe --oversubscribe -machinefile /tmp/auster.machines -np 3 /usr/lib64/openmpi/bin/mdsrate --unlink --nfiles 50000 --dir /mnt/lustre/d0.test_3a --filefmt f%%d --ignore "
0: trevis-67vm1.trevis.whamcloud.com starting at Wed Mar 22 02:50:31 2023
[1679453431.525929] [trevis-67vm2:342581:0] select.c:514 UCX ERROR no active messages transport to : posix/memory - Destination is unreachable, sysv/memory - Destination is unreachable, self/memory0 - Destination is unreachable, tcp/lo - Destination is unreachable, tcp/eth0 - Destination is unreachable |
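As a rough, untested sketch of the mdtest suggestion above (not an existing patch): the mdsrate create/unlink phases could be expressed with mdtest's split create and remove runs. The mpirun options mirror the log above; the mdtest flags (-d directory, -n items per rank, -F files only, -C create only, -r remove only) are standard mdtest options, but the directory, machine file, and rank count are placeholders, not the real test setup:

    # Hypothetical mdtest equivalent of the mdsrate create/unlink steps.
    NP=3
    DIR=/mnt/lustre/d0.test_3a
    # create phase: ~50000 files in total across $NP ranks
    /usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
        --oversubscribe -machinefile /tmp/auster.machines -np $NP \
        mdtest -F -C -n $((50000 / NP)) -d $DIR
    # unlink phase: remove the same files again
    /usr/lib64/openmpi/bin/mpirun --mca btl tcp,self --mca btl_tcp_if_include eth0 \
        --oversubscribe -machinefile /tmp/auster.machines -np $NP \
        mdtest -F -r -n $((50000 / NP)) -d $DIR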
| Comment by Andreas Dilger [ 29/Aug/23 ] |
|
There was patch https://review.whamcloud.com/51414 " |