[LU-14302] obdfilter-survey test 2b hangs in case=netdisk Created: 06/Jan/21 Updated: 21/Dec/23 |
|
| Status: | Reopened |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | Lustre 2.14.0, Lustre 2.12.7, Lustre 2.15.0, Lustre 2.15.3, Lustre 2.15.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | James Nunez (Inactive) | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
|
obdfilter-survey test_2b hangs starting on 02 Oct 2020, with logs at https://testing.whamcloud.com/test_sets/258c8761-9d71-4d82-8a5e-1c473e5a089c. So far, this looks like it only hangs for ZFS.

Looking at the suite_log for the failure at https://testing.whamcloud.com/test_sessions/296ca0ce-3daa-42aa-a92d-cdbdb7af75da, the last test run is the netdisk case:

Unable to detect ip address for host: ''
+ NETTYPE=tcp thrlo=4 nobjhi=1 thrhi=4 size=1024 case=netdisk rslt_loc=/tmp targets="10.9.7.117:lustre-OST0000 10.9.7.117:lustre-OST0001 10.9.7.117:lustre-OST0002 10.9.7.117:lustre-OST0003 10.9.7.117:lustre-OST0004 10.9.7.117:lustre-OST0005 10.9.7.117:lustre-OST0006" /usr/bin/obdfilter-survey
Fri Dec 25 12:00:08 UTC 2020 Obdfilter-survey for case=netdisk from trevis-201vm1.trevis.whamcloud.com
ost 7 sz 7340032K rsz 1024K obj 7 thr 28 write 5.28 [ 0.00, 8.15] rewrite 4.88 [ 0.00, 4.00] read

We see some errors in the console logs. For client1 (vm1), we see an error message from osc_init_grant():

[22622.025189] Lustre: DEBUG MARKER: == obdfilter-survey test 2b: Stripe F/S over the Network, async journal ============================== 12:00:01 (1608897601)
[25456.252361] LustreError: 200837:0:(osc_request.c:1043:osc_init_grant()) lustre-OST0003_osc: granted 368640 but already consumed 1048576
[26276.964898] sysrq: SysRq : Changing Loglevel
[26276.965724] sysrq: Loglevel set to 8
[26276.967609] sysrq: SysRq : Show State
[26276.968269] task PC stack pid father
[26276.969210] systemd S 0 1 0 0x00000000
[26276.970088] Call Trace:
[26276.970548] __schedule+0x2a6/0x700
[26276.971139] schedule+0x38/0xa0
[26276.971677] schedule_hrtimeout_range_clock+0xf7/0x110
[26276.972514] ? ep_scan_ready_list.constprop.22+0x20c/0x230
[26276.973406] ep_poll+0x3c8/0x410
[26276.973953] ? wake_up_q+0x80/0x80
[26276.974534] do_epoll_wait+0xb0/0xd0
[26276.975131] __x64_sys_epoll_wait+0x1a/0x20
[26276.975821] do_syscall_64+0x5b/0x1a0
[26276.976447] entry_SYSCALL_64_after_hwframe+0x65/0xca
[26276.977339] RIP: 0033:0x7f023c889257
[26276.977937] Code: Bad RIP value.
[26276.978495] RSP: 002b:00007ffc57bbe450 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[26276.979678] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f023c889257
[26276.980796] RDX: 0000000000000056 RSI: 00007ffc57bbe490 RDI: 0000000000000004
[26276.981913] RBP: 00007ffc57bbe490 R08: 0000000000000000 R09: 7465677261742e79
[26276.983033] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000056
[26276.984152] R13: 00000000ffffffff R14: 00007ffc57bbe490 R15: 0000000000000001
[26276.985270] kthreadd S 0 2 0 0x80004000
[26276.986143] Call Trace:
[26276.986580] __schedule+0x2a6/0x700
[26276.987166] schedule+0x38/0xa0
[26276.987703] kthreadd+0x2e8/0x300
[26276.988275] ? kthread_create_on_cpu+0xa0/0xa0
[26276.989001] ret_from_fork+0x35/0x40
[26276.989601] rcu_gp I 0 3 2 0x80004000
[26276.990486] Call Trace:
[26276.990916] __schedule+0x2a6/0x700
[26276.991509] schedule+0x38/0xa0
[26276.992048] rescuer_thread+0x29e/0x340
[26276.992686] ? process_one_work+0x360/0x360

In the OSS (vm3) console log, we see some grant errors:

[22642.753377] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param obdfilter.lustre-*.sync_journal=0
[22643.253588] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) lustre-OST0000: client lustre-OST0000_osc_UUID/00000000089e081d requesting > max (2147483647), 34359738368
[22643.256053] LustreError: 366155:0:(tgt_grant.c:919:tgt_grant_alloc()) Skipped 6 previous similar messages
[26311.031755] sysrq: SysRq : Changing Loglevel
[26311.032550] sysrq: Loglevel set to 8
[26311.034388] sysrq: SysRq : Show State
[26311.034963] task PC stack pid father
[26311.035888] systemd S 0 1 0 0x00000000
[26311.036726] Call Trace:
[26311.037144] __schedule+0x2a6/0x700
[26311.037701] schedule+0x38/0xa0
[26311.038200] schedule_hrtimeout_range_clock+0xf7/0x110
[26311.038999] ? ep_scan_ready_list.constprop.22+0x20c/0x230
[26311.039846] ep_poll+0x3c8/0x410
[26311.040381] ? wake_up_q+0x80/0x80
[26311.040903] do_epoll_wait+0xb0/0xd0
[26311.041468] __x64_sys_epoll_wait+0x1a/0x20
[26311.042113] do_syscall_64+0x5b/0x1a0
[26311.042695] entry_SYSCALL_64_after_hwframe+0x65/0xca
[26311.043491] RIP: 0033:0x7ffbb0799257
[26311.044053] Code: Bad RIP value.
[26311.044562] RSP: 002b:00007fffda2f4c30 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[26311.045720] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007ffbb0799257
[26311.046806] RDX: 0000000000000057 RSI: 00007fffda2f4c70 RDI: 0000000000000004
[26311.047893] RBP: 00007fffda2f4c70 R08: 0000000000000000 R09: 7465677261742e79
[26311.048978] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000057
[26311.050066] R13: 00000000ffffffff R14: 00007fffda2f4c70 R15: 0000000000000001
[26311.051152] kthreadd S 0 2 0 0x80004000
[26311.052001] Call Trace:

We've seen this hang in the past at |
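For anyone trying to reproduce this outside of autotest, the parameters above come straight from the suite_log. A minimal sketch of the same invocation, assuming the target list and NETTYPE shown in that log (they would need to be adapted to the local cluster):

# Sketch only: values copied from the failing run's suite_log; adjust the
# host:OST target pairs and NETTYPE for the cluster being tested.
NETTYPE=tcp thrlo=4 nobjhi=1 thrhi=4 size=1024 case=netdisk rslt_loc=/tmp \
    targets="10.9.7.117:lustre-OST0000 10.9.7.117:lustre-OST0001" \
    /usr/bin/obdfilter-survey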
| Comments |
| Comment by James Nunez (Inactive) [ 20/Jan/21 ] |
|
sanity-benchmark test_fsx has a similar issue with the following call trace on each node:

[17061.275242] sysrq: SysRq : Changing Loglevel
[17061.276192] sysrq: Loglevel set to 8
[17061.278000] sysrq: SysRq : Show State
[17061.278766] task PC stack pid father
[17061.279880] systemd S 0 1 0 0x00000000
[17061.280938] Call Trace:
[17061.281501] __schedule+0x2a6/0x700
[17061.282203] schedule+0x38/0xa0
[17061.282834] schedule_hrtimeout_range_clock+0xf7/0x110
[17061.283834] ? ep_scan_ready_list.constprop.22+0x20c/0x230
[17061.284870] ep_poll+0x3c8/0x410
[17061.285565] ? wake_up_q+0x80/0x80
[17061.286245] do_epoll_wait+0xb0/0xd0
[17061.286954] __x64_sys_epoll_wait+0x1a/0x20
[17061.287781] do_syscall_64+0x5b/0x1a0
[17061.288537] entry_SYSCALL_64_after_hwframe+0x65/0xca
[17061.289513] RIP: 0033:0x7f1833964257
[17061.290221] Code: Bad RIP value.
[17061.290865] RSP: 002b:00007fffdf992860 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[17061.292251] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f1833964257
[17061.293825] RDX: 0000000000000055 RSI: 00007fffdf9928a0 RDI: 0000000000000004
[17061.295585] RBP: 00007fffdf9928a0 R08: 0000000000000000 R09: 7465677261742e79
[17061.297085] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000055
[17061.298425] R13: 00000000ffffffff R14: 00007fffdf9928a0 R15: 0000000000000001
[17061.299748] kthreadd S 0 2 0 0x80004000

https://testing.whamcloud.com/test_sets/55bc0e09-caa5-4d08-a4b5-7b53173af78a |
| Comment by James Nunez (Inactive) [ 15/Feb/21 ] |
|
Here is a small fraction of the test hangs attributed to this ticket. I think the true stack trace/issue for each hang below is being masked by the entry_SYSCALL_64 call trace. |
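Since every node dumps the same idle systemd/kthreadd epoll_wait traces first, one way to cut through that masking is to filter the Show State output down to tasks that are not parked in those common paths. A rough sketch, not part of the test framework, assuming the older "[ts] name S 0 pid father flags" header format shown in the logs above (the newer "task:... state:..." format would need a different header pattern) and with my own guess at which frame names count as idle:

# Sketch: print only task blocks from a SysRq "Show State" console dump whose
# traces do NOT contain the common idle frames, so a genuinely stuck thread
# stands out. console.log is a placeholder for the captured console log.
awk '
    /SysRq : Show State/ { dumping = 1 }
    !dumping { next }
    # A new task block starts at a header like:
    # "[26276.985270] kthreadd S 0 2 0 0x80004000"
    /] [[:alnum:]_.:-]+ +[RSDIZT] +[0-9]+ +[0-9]+ +[0-9]+ +0x/ {
        if (block != "" && !idle) printf "%s", block
        block = ""; idle = 0
    }
    {
        block = block $0 ORS
        if ($0 ~ /ep_poll|epoll_wait|kthreadd\+|rescuer_thread|worker_thread/) idle = 1
    }
    END { if (block != "" && !idle) printf "%s", block }
' console.log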
| Comment by James Nunez (Inactive) [ 09/Jul/21 ] |
|
Another test hang for sanity-sec test 25 for 2.12.7 RC1 at https://testing.whamcloud.com/test_sets/3abb9645-7bc7-4630-bc41-557da7b04375

On clients 1 and 2, we see the mount point is 'still busy':

[17855.295093] Lustre: DEBUG MARKER: == sanity-sec test 25: test save and reload nodemap config =========================================== 16:59:52 (1625072392)
[17855.928251] Lustre: DEBUG MARKER: running=$(grep -c /mnt/lustre' ' /proc/mounts);
[17855.928251] if [ $running -ne 0 ] ; then
[17855.928251] echo Stopping client $(hostname) /mnt/lustre opts:;
[17855.928251] lsof /mnt/lustre || need_kill=no;
[17855.928251] if [ x != x -a x$need_kill != xno ]; then
[17855.928251] pids=$(lsof -t /mnt/lustre | sort -u);
[17855.928251] if
[21488.693672] sysrq: SysRq : Changing Loglevel
[21488.694779] sysrq: Loglevel set to 8
[21488.696709] sysrq: SysRq : Show State
[21488.697557] task PC stack pid father
[21488.698776] systemd S 0 1 0 0x00000000
[21488.699937] Call Trace:
[21488.700608] __schedule+0x292/0x880
[21488.700631] schedule+0x32/0x80
[21488.702137] schedule_hrtimeout_range_clock+0x106/0x110
[21488.702169] ? ep_scan_ready_list.isra.12+0x1af/0x1d0
[21488.704339] ep_poll+0x2b6/0x360
[21488.704368] ? wake_up_q+0x70/0x70
[21488.704371] SyS_epoll_wait+0xb2/0xe0
[21488.706704] do_syscall_64+0x74/0x160
[21488.706722] entry_SYSCALL_64_after_hwframe+0x59/0xbe
[21488.708632] RIP: 0033:0x7f5917ae78a3
[21488.708633] RSP: 002b:00007ffebbb08010 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[21488.708634] RAX: ffffffffffffffda RBX: 00005607a2f63de0 RCX: 00007f5917ae78a3
[21488.708635] RDX: 0000000000000031 RSI: 00007ffebbb08020 RDI: 0000000000000004
[21488.708636] RBP: 00007ffebbb08370 R08: 5566e87192e6acdb R09: 00005607a2d099c0
[21488.708636] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000001
[21488.708638] R13: 00007ffebbb08020 R14: ffffffffffffffff R15: 0000000000000000
[21488.718338] kthreadd S 0 2 0 0x00000000 |
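When the umount reports the mount point as still busy, it may help to record what is pinning it before the node is rebooted or the dump is triggered. A small sketch using the same kind of tools the test script above already calls (lsof appears in the DEBUG MARKER; fuser and the health_check parameter are assumptions on my part about what is available on the test nodes):

# Sketch: identify what keeps /mnt/lustre busy on a hung client.
lsof /mnt/lustre                 # open files under the mount point
fuser -vm /mnt/lustre            # processes using the filesystem, with access type
lctl get_param health_check      # quick check of the client's Lustre health state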
| Comment by Sarah Liu [ 22/Feb/22 ] |
|
Hit a similar error on master ZFS: obdfilter-survey and replay-dual hung with the following trace:

[17639.357284] Lustre: DEBUG MARKER: == obdfilter-survey test 2a: Stripe F/S over the Network ========================================================== 03:48:52 (1644378532)
[24905.814376] sysrq: SysRq : Changing Loglevel
[24905.815333] sysrq: Loglevel set to 8
[24905.819043] sysrq: SysRq : Show State
[24905.819814] task:systemd state:S stack: 0 pid: 1 ppid: 0 flags:0x00000000
[24905.821371] Call Trace:
[24905.821936] __schedule+0x2c4/0x700
[24905.822647] schedule+0x37/0xa0
[24905.823246] schedule_hrtimeout_range_clock+0xf7/0x110
[24905.824250] ? ep_scan_ready_list.constprop.23+0x20c/0x230
[24905.825291] ep_poll+0x403/0x460
[24905.825968] ? wake_up_q+0x80/0x80
[24905.826636] do_epoll_wait+0xb0/0xd0
[24905.827316] __x64_sys_epoll_wait+0x1a/0x20
[24905.828125] do_syscall_64+0x5b/0x1a0
[24905.828857] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24905.829846] RIP: 0033:0x7f0e2cb5a0f7
[24905.830547] Code: Unable to access opcode bytes at RIP 0x7f0e2cb5a0cd.
[24905.831770] RSP: 002b:00007fffa13c46a0 EFLAGS: 00000293 ORIG_RAX: 00000000000000e8
[24905.833189] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f0e2cb5a0f7
[24905.834530] RDX: 0000000000000059 RSI: 00007fffa13c46e0 RDI: 0000000000000004
[24905.835860] RBP: 00007fffa13c46e0 R08: 0000000000000000 R09: 00007fffa13c4c70
[24905.837189] R10: 00000000ffffffff R11: 0000000000000293 R12: 0000000000000059
[24905.838516] R13: 00000000ffffffff R14: 00007fffa13c46e0 R15: 0000000000000001
[24905.839849] task:kthreadd state:S stack: 0 pid: 2 ppid: 0 flags:0x80004000
[24905.841412] Call Trace:
[24905.841893] __schedule+0x2c4/0x700
[24905.842565] schedule+0x37/0xa0
[24905.843179] kthreadd+0x2e8/0x300
[24905.843843] ? kthread_create_on_cpu+0xa0/0xa0
[24905.844704] ret_from_fork+0x35/0x40
[24905.845398] task:rcu_gp state:I stack: 0 pid: 3 ppid: 2 flags:0x80004000
[24905.846963] Call Trace:
[24905.847465] __schedule+0x2c4/0x700
[24905.848134] schedule+0x37/0xa0
[24905.848752] rescuer_thread+0x29e/0x340
[24905.849501] ? process_one_work+0x360/0x360
[24905.850295] kthread+0x116/0x130
[24905.850933] ? kthread_flush_work_fn+0x10/0x10
[24905.851787] ret_from_fork+0x35/0x40
[24905.852489] task:rcu_par_gp state:I stack: 0 pid: 4 ppid: 2 flags:0x80004000
[24905.854047] Call Trace:
...

Another one on an ldiskfs DNE config, parallel-scale |
| Comment by Andreas Dilger [ 30/Nov/22 ] |
|
Has not failed in 6 months. |
| Comment by Jian Yu [ 14/Dec/22 ] |
|
sanity test 24v on b2_15 branch: |
| Comment by Sarah Liu [ 20/Dec/22 ] |
|
Hit this again in 2.15.2-rc1: https://testing.whamcloud.com/test_sets/5ed91b10-10ab-4dd9-b960-6a72d84f11df

On the client:

[Sun Dec 11 23:16:42 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670800585/real 1670800585] req@00000000532f8e4e x1751941312768064/t0(0) o4->lustre-OST0004_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670800604 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_01_00.0'
[Sun Dec 11 23:16:42 2022] Lustre: lustre-OST0004_osc: Connection to lustre-OST0004 (at 10.240.39.106@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Sun Dec 11 23:16:42 2022] Lustre: lustre-OST0004_osc: Connection restored to host_6_UUID (at 10.240.39.106@tcp)
[Sun Dec 11 23:24:14 2022] LustreError: 19947:0:(osc_request.c:1037:osc_init_grant()) lustre-OST0005_osc: granted 372736 but already consumed 1048576
[Sun Dec 11 23:24:27 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670801059/real 1670801059] req@00000000c6ac8d21 x1751941313020928/t0(0) o4->lustre-OST0003_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670801069 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_01_00.0'
[Sun Dec 11 23:24:27 2022] Lustre: 19948:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1670801059/real 1670801059] req@000000009b3144c2 x1751941313020800/t0(0) o4->lustre-OST0003_osc@10.240.39.106@tcp:6/4 lens 488/448 e 0 to 1 dl 1670801069 ref 2 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:24:27 2022] Lustre: 19950:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Sun Dec 11 23:24:27 2022] Lustre: 19948:0:(client.c:2295:ptlrpc_expire_one_request()) Skipped 2 previous similar messages
[Sun Dec 11 23:24:27 2022] Lustre: lustre-OST0003_osc: Connection to lustre-OST0003 (at 10.240.39.106@tcp) was lost; in progress operations using this service will wait for recovery to complete
[Sun Dec 11 23:24:27 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:24:27 2022] Lustre: lustre-OST0003_osc: Connection restored to host_6_UUID (at 10.240.39.106@tcp)
[Sun Dec 11 23:24:27 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:55:08 2022] LustreError: 19947:0:(osc_request.c:1037:osc_init_grant()) lustre-OST0003_osc: granted 372736 but already consumed 1048576

On the OSS:

[Sun Dec 11 23:07:10 2022] Lustre: lustre-OST0000: Client lustre-OST0000_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:08:25 2022] Lustre: lustre-OST0005: Client lustre-OST0005_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:08:30 2022] LustreError: 353209:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@00000000b67af265 x1751941312481472/t0(0) o4->lustre-OST0005_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800121 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_00_01.0'
[Sun Dec 11 23:08:30 2022] LustreError: 353209:0:(ldlm_lib.c:3490:target_bulk_io()) Skipped 6 previous similar messages
[Sun Dec 11 23:08:30 2022] Lustre: lustre-OST0005: Bulk IO write error with lustre-OST0005_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:08:30 2022] Lustre: Skipped 1 previous similar message
[Sun Dec 11 23:09:46 2022] Lustre: 353215:0:(service.c:2329:ptlrpc_server_handle_request()) @@@ Request took longer than estimated (20/1s); client may timeout req@0000000005918235 x1751941312522240/t8589941733(0) o4->lustre-OST0004_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800187 ref 1 fl Complete:/0/0 rc 0/0 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:14:23 2022] Lustre: lustre-OST0004: Client lustre-OST0004_osc_UUID (at 10.240.39.104@tcp) reconnecting
[Sun Dec 11 23:14:25 2022] LustreError: 353232:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@00000000e6dde848 x1751941312695872/t0(0) o4->lustre-OST0003_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800483 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_01_01.0'
[Sun Dec 11 23:14:25 2022] Lustre: lustre-OST0003: Bulk IO write error with lustre-OST0003_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:14:28 2022] LustreError: 353216:0:(ldlm_lib.c:3490:target_bulk_io()) @@@ bulk WRITE failed: rc = -107 req@0000000004060a7c x1751941312694464/t0(0) o4->lustre-OST0004_osc_UUID@10.240.39.104@tcp:0/0 lens 488/448 e 0 to 0 dl 1670800480 ref 1 fl Interpret:/0/0 rc 0/0 job:'ptlrpcd_00_00.0'
[Sun Dec 11 23:14:28 2022] Lustre: lustre-OST0004: Bulk IO write error with lustre-OST0004_osc_UUID (at 10.240.39.104@tcp), client will retry: rc = -107
[Sun Dec 11 23:14:52 2022] Lustre: lustre-OST0006: Client lustre-OST0006_osc_UUID (at 10.240.39.104@tcp) reconnecting |
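Given the repeated osc_init_grant() and tgt_grant_alloc() messages, it may be worth capturing the grant accounting on both sides the next time this reproduces. A hedged sketch using standard lctl parameters (names are from memory for 2.x clients/servers and may differ slightly between versions):

# On the client: per-OSC grant state at the time of the hang
lctl get_param osc.*.cur_grant_bytes osc.*.cur_dirty_bytes osc.*.cur_lost_grant_bytes

# On the OSS: per-target grant totals as tracked by tgt_grant.c
lctl get_param obdfilter.*.tot_granted obdfilter.*.tot_pending obdfilter.*.tot_dirty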