[LU-15774] Rolling upgrade 2.12.8 -> 2.15 fails, sanity : @@@@@@ FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500
Created: 21/Apr/22  Updated: 22/Apr/22

Status: Open
Project: Lustre
Component/s: None
Affects Version/s: Lustre 2.15.0
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Cliff White (Inactive)
Assignee: WC Triage
Resolution: Unresolved
Votes: 0
Labels: None
Environment:

Trevis test cluster.


Severity: 3
Rank (Obsolete): 9223372036854775807

 Description   

When upgrading from 2.12.8 build 150 to 2.15 build 4283, the OSS upgrade works fine.

After the MDS upgrade, the sanity.sh run fails with this error:

[94020.104051] Lustre: DEBUG MARKER: trevis-86vm1.trevis.whamcloud.com: executing check_config_client /mnt/lustre
[94022.084620] Lustre: DEBUG MARKER: Using TIMEOUT=100
[94023.414760] Lustre: DEBUG MARKER: sanity : @@@@@@ FAIL: unable to write to /mnt/lustre/d0_runas_test as UID 500.

This appears to be a test script issue and looks very much like LU-8725. In this case the sanityusr account (UID 500) was not created on trevis-86vm2 by the loadjenkinsbuild upgrade:

 

[root@trevis-86vm1 205355]# pdsh -w trevis-86vm[1-3] "grep 500 /etc/passwd"
trevis-86vm1: sanityusr:x:500:500::/mnt/lustre:/bin/bash
trevis-86vm3: sanityusr:x:500:500::/mnt/lustre:/bin/bash 
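
Note that `grep 500 /etc/passwd` matches any passwd line containing "500", not only entries whose UID is 500. A minimal sketch of a stricter check follows, assuming pdsh's usual "host: text" output format. The function name hosts_missing_uid and the expected host list are illustrative, not part of the Lustre test framework.

```python
def hosts_missing_uid(pdsh_output, expected_hosts, uid):
    """Return the expected hosts that reported no passwd entry with this UID."""
    found = set()
    for line in pdsh_output.splitlines():
        host, sep, entry = line.partition(": ")
        if not sep:
            continue  # not a "host: text" line
        # passwd format: name:password:UID:GID:gecos:home:shell
        fields = entry.strip().split(":")
        if len(fields) >= 3 and fields[2] == str(uid):
            found.add(host.strip())
    return sorted(set(expected_hosts) - found)


# Output copied from the pdsh run above.
output = """\
trevis-86vm1: sanityusr:x:500:500::/mnt/lustre:/bin/bash
trevis-86vm3: sanityusr:x:500:500::/mnt/lustre:/bin/bash
"""
hosts = ["trevis-86vm1", "trevis-86vm2", "trevis-86vm3"]
missing = hosts_missing_uid(output, hosts, 500)
print(missing)  # ['trevis-86vm2']
```

Comparing against the expected host list matters here because a node with no match prints nothing at all, so its absence is easy to overlook in the raw output.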


 Comments   
Comment by Andreas Dilger [ 22/Apr/22 ]

This looks more like a test environment (ljb) issue than a Lustre or test issue? Should probably be moved to ATM?

Comment by Charlie Olmstead [ 22/Apr/22 ]

I looked at trevis-86vm2: chef failed to complete, so the node was not ready for testing. This is an issue with how loadjenkinsbuild works: it kicks off the installation and then exits. It is then up to the user to determine whether the node is ready, which takes more than checking that the node is ssh-able. This is a reason to switch over to ljb, which takes over that responsibility. It waits for the OS installation to complete, verifies chef has completed, and installs Lustre packages, kernel, etc. Once ljb exits with status 0, the node is ready.

  * directory[/mnt/lustre] action create
    ================================================================================
    Error executing action `create` on resource 'directory[/mnt/lustre]'
    ================================================================================

    Errno::EROFS
    ------------
    Read-only file system @ apply2files - /mnt/lustre

    Resource Declaration:
    ---------------------
    # In /var/tmp/chef-client/roles/lib/config_helper.rb

     50:       directory d do
     51:         mode  mode
     52:         owner own
     53:         group group || own
     54:       end
     55:     }

    Compiled Resource:
    ------------------
    # Declared in /var/tmp/chef-client/roles/lib/config_helper.rb:50:in `block in mkdir'

    directory("/mnt/lustre") do
      action [:create]
      default_guard_interpreter :default
      declared_type :directory
      cookbook_name "test_node"
      recipe_name "default"
      mode "0777"
      owner "root"
      group "root"
      path "/mnt/lustre"
    end

    System Info:
    ------------
    chef_version=16.17.18
    platform=centos
    platform_version=7.9.2009
    ruby=ruby 2.7.5p203 (2021-11-24 revision f69aeb8314) [x86_64-linux]
    program_name=/usr/bin/chef-client
    executable=/opt/chef/bin/chef-client 
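
The point that readiness is more than being ssh-able can be sketched as an ordered list of checks, where the first failure names the reason the node is not ready. This is only an illustration, not how ljb is implemented: the check names and the lambda stand-ins are hypothetical, and real checks would query the node remotely. The is_read_only helper uses os.statvfs (Unix only) and shows one way to detect the read-only condition behind the Errno::EROFS above.

```python
import os


def is_read_only(path):
    """True if the filesystem holding `path` is mounted read-only (Unix only)."""
    return bool(os.statvfs(path).f_flag & os.ST_RDONLY)


def node_ready(checks):
    """Run named checks in order; return (ready, name_of_first_failed_check)."""
    for name, check in checks:
        if not check():
            return False, name
    return True, None


# Hypothetical stand-ins for real checks run against the node.
checks = [
    ("ssh reachable", lambda: True),
    ("chef run completed", lambda: False),  # the state seen on trevis-86vm2
    ("lustre packages installed", lambda: True),
]
result = node_ready(checks)
print(result)  # (False, 'chef run completed')
```

With checks like these, a node that answers ssh but has an incomplete chef run is reported as not ready, rather than being handed to sanity.sh.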

 
