[LU-15735] sanity-sec test_19: OSS not responding after "force_sync=1" Created: 12/Apr/22 Updated: 12/Apr/22
| Status: | Open |
| Project: | Lustre |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Maloo | Assignee: | WC Triage |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None |
| Severity: | 3 |
| Rank (Obsolete): | 9223372036854775807 |
| Description |
This issue was created by maloo for eaujames <eaujames@ddn.com>

This issue relates to the following test suite run: https://testing.whamcloud.com/test_sets/bdc21484-cfd2-46d9-a0a9-a46d1a5682b2

test_19 failed with the following error:

test_19 returned 1

This failure appears in a review-ldiskfs-arm session, but the OSS was running on x86_64.

Client test:

CMD: onyx-124vm1 lctl set_param -n osd*.*OS*.force_sync=1
...
CMD: onyx-81vm3 /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
...
Delete is not completed in 37 seconds
CMD: onyx-81vm3 /usr/sbin/lctl get_param osc.*MDT*.sync_*
osc.lustre-OST0000-osc-MDT0000.sync_changes=0
osc.lustre-OST0000-osc-MDT0000.sync_in_flight=0
osc.lustre-OST0000-osc-MDT0000.sync_in_progress=1
...
CMD: onyx-91vm5.onyx.whamcloud.com runas -u0 -g0 -G0 lfs quota -q /mnt/lustre
running as uid/gid/euid/egid 0/0/0/0, groups: 0 [lfs] [quota] [-q] [/mnt/lustre]
/usr/lib64/lustre/tests/sanity-sec.sh: line 1293: [21448]: syntax error: operand expected (error token is "[21448]")
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c0
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_del c1
CMD: onyx-81vm3 /usr/sbin/lctl nodemap_modify --name default --property admin --value 0
CMD: onyx-81vm3 /usr/sbin/lctl get_param -n nodemap.active
CMD: onyx-81vm3 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
On MGS 10.240.26.198, default.admin_nodemap = nodemap.default.admin_nodemap=0
CMD: onyx-124vm1 /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
onyx-124vm1: ssh: connect to host onyx-124vm1 port 22: No route to host
pdsh@onyx-91vm5: onyx-124vm1: ssh exited with exit code 255

MDS dmesg:

[18117.644399] Lustre: DEBUG MARKER: /usr/sbin/lctl set_param -n os[cd]*.*MD*.force_sync 1
[18118.434329] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
[18120.214180] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param -n osc.*MDT*.sync_*
...
[18133.582033] Lustre: 11600:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731] req@000000009702997f x1729549387801536/t0(0) o13->lustre-OST0006-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-6-0.0'
[18133.582035] Lustre: 11599:0:(client.c:2295:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1649444731/real 1649444731] req@00000000f3577720 x1729549387801600/t0(0) o13->lustre-OST0002-osc-MDT0000@10.240.30.32@tcp:7/4 lens 224/368 e 0 to 1 dl 1649444738 ref 1 fl Rpc:XQr/0/ffffffff rc 0/-1 job:'osp-pre-2-0.0'
[18133.582063] Lustre: lustre-OST0002-osc-MDT0000: Connection to lustre-OST0002 (at 10.240.30.32@tcp) was lost; in progress operations using this service will wait for recovery to complete

OST dmesg:

[18108.151196] Lustre: DEBUG MARKER: /usr/sbin/lctl list_nids | grep -w tcp | cut -f 1 -d @
[18108.941076] Lustre: DEBUG MARKER: /usr/sbin/lctl get_param nodemap.c0.trusted_nodemap
[18115.611844] Lustre: DEBUG MARKER: lctl set_param -n osd*.*OS*.force_sync=1

No messages are logged on the OST after "force_sync=1", yet no crash has been reported. The OSS seems to have disappeared.
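For context, the check that times out ("Delete is not completed in 37 seconds") is a wait loop: after force_sync is set on the OSD and MDT-side targets, the test polls the MDT's OSC sync counters until they all drop to 0, and here sync_in_progress never does. Below is a minimal sketch of that style of wait loop, not the actual test-framework.sh code; the node name and timeout are assumptions taken from the log.

    #!/bin/bash
    # Sketch: poll the MDT-side OSC sync counters until they all reach 0,
    # the same condition test_19 appears to be waiting on.
    MDS=onyx-81vm3     # assumed MDS node, as seen in the log
    MAX_WAIT=40        # assumed timeout in seconds

    for ((i = 0; i < MAX_WAIT; i++)); do
            # sync_changes, sync_in_flight and sync_in_progress, one value per line
            pending=$(ssh "$MDS" "/usr/sbin/lctl get_param -n osc.*MDT*.sync_*" |
                    awk '{sum += $1} END {print sum + 0}')
            [ "$pending" -eq 0 ] && { echo "delete completed after ${i}s"; exit 0; }
            sleep 1
    done

    echo "Delete is not completed in $MAX_WAIT seconds"
    ssh "$MDS" "/usr/sbin/lctl get_param osc.*MDT*.sync_*"   # show which counter is stuck
    exit 1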
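Separately, the sanity-sec.sh failure at line 1293 ("syntax error: operand expected") is the usual bash arithmetic error produced when a variable holds bracketed text rather than a bare integer; judging by the surrounding commands, the value appears to have come from parsing the lfs quota output. A minimal, hypothetical reproduction of that error class (the variable name and value are made up for illustration):

    # Bash evaluates the variable's value as an arithmetic expression,
    # and "[21448]" is not a valid operand.
    used="[21448]"     # assumed: a numeric field scraped with brackets attached
    echo $(( used ))   # -> [21448]: syntax error: operand expected
                       #    (error token is "[21448]")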
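Finally, since the OST stops logging after "force_sync=1" and a later ssh to onyx-124vm1 reports "No route to host", a first triage step would be to check whether the node is reachable at all. A sketch, reusing the OST NID seen in the MDS dmesg (run from the MDS or a client):

    lctl ping 10.240.30.32@tcp     # LNet-level reachability of the OSS
    ping -c 3 10.240.30.32         # plain IP reachability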