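Background
I recently took over maintenance of a Hadoop cluster. The servers are old and crash frequently; machines that cannot be repaired have to be removed from the cluster and replaced with new ones. After a new machine joins the cluster, the configuration has to be redeployed to it, and until now I had never checked how many servers actually received the configuration successfully. This time, while deploying the configuration after adding machines, I found that the update had failed on one of them.
Troubleshooting the client configuration deployment failure
Switch the log view of the failed command from stdout to stderr: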
Can't open /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/yarn-conf/hive-env.sh: No such file or directory.
++ dirname /etc/hadoop/conf.cloudera.yarn
+ ROOT_DIR_NAME=/etc/hadoop
+ '[' '!' -e /etc/hadoop ']'
+ for SPECIAL_FILE in '$DEST_PATH/{taskcontroller.cfg,container-executor.cfg}'
+ '[' -e /etc/hadoop/conf.cloudera.yarn/taskcontroller.cfg ']'
+ for SPECIAL_FILE in '$DEST_PATH/{taskcontroller.cfg,container-executor.cfg}'
+ '[' -e /etc/hadoop/conf.cloudera.yarn/container-executor.cfg ']'
++ basename /etc/hadoop/conf
+ LINK_BASENAME=conf
+ [[ -d conf ]]
+ '[' -n '' ']'
+ DEPLOYED_FILE_USER=root
+ rm -rf /etc/hadoop/conf.cloudera.yarn
+ cp -a /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/yarn-conf /etc/hadoop/conf.cloudera.yarn
+ chown root /etc/hadoop/conf.cloudera.yarn
+ chmod -R ugo+r /etc/hadoop/conf.cloudera.yarn
+ '[' -e /etc/hadoop/conf.cloudera.yarn/topology.py ']'
+ chmod +x /etc/hadoop/conf.cloudera.yarn/topology.py
+ /usr/sbin/update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 92
/var/lib/alternatives/hadoop-conf empty!
The stderr log shows that the YARN configuration files are copied from /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/yarn-conf to /etc/hadoop/conf.cloudera.yarn, where the roles pick them up. Go into the process directory:
[root@prefix.company-inc.com ~]$ ls -al /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/
total 4
drwxr-xr-x 4 root root 100 Jun 24 15:53 .
drwxr-x--x 10 root root 200 Jun 24 15:53 ..
-rw-r--r-- 1 root root 20 Jun 24 15:53 __cloudera_metadata__
drwxr-x--x 2 root root 120 Jun 24 15:52 logs
drwxr-x--x 2 root root 260 Jun 24 15:52 yarn-conf
[root@prefix.company-inc.com ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322]$ ls -alht logs/
total 56K
drwxr-xr-x 4 root root 100 Jun 24 15:55 ..
-rw-r--r-- 1 root root 21K Jun 24 15:55 stderr.log
-rw-r--r-- 1 root root 550 Jun 24 15:55 stdout.log
drwxr-x--x 2 root root 120 Jun 24 15:55 .
-rw-r----- 1 root root 21K Jun 24 15:55 stderr.log.bak
-rw-r----- 1 root root 550 Jun 24 15:55 stdout.log.bak
Open stderr.log. Near the end, the deployment output contains the message "/var/lib/alternatives/hadoop-conf empty!":
+ for SPECIAL_FILE in '$DEST_PATH/{taskcontroller.cfg,container-executor.cfg}'
+ '[' -e /etc/hadoop/conf.cloudera.yarn/container-executor.cfg ']'
++ basename /etc/hadoop/conf
+ LINK_BASENAME=conf
+ [[ -d conf ]]
+ '[' -n '' ']'
+ DEPLOYED_FILE_USER=root
+ rm -rf /etc/hadoop/conf.cloudera.yarn
+ cp -a /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/yarn-conf /etc/hadoop/conf.cloudera.yarn
+ chown root /etc/hadoop/conf.cloudera.yarn
+ chmod -R ugo+r /etc/hadoop/conf.cloudera.yarn
+ '[' -e /etc/hadoop/conf.cloudera.yarn/topology.py ']'
+ chmod +x /etc/hadoop/conf.cloudera.yarn/topology.py
+ /usr/sbin/update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 92
/var/lib/alternatives/hadoop-conf empty!
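Because stderr.log is about 21K, scrolling to the end by hand is easy to get wrong. A minimal sketch for pulling out the relevant line follows. The function name find_alternatives_error is hypothetical, and the pattern assumes the complaint always ends in "empty!", as it does in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print, with line numbers, every line of a deployment
# stderr.log that ends in "empty!" (the message seen in the failed run above).
find_alternatives_error() {
    local log_file="$1"
    # -n prefixes each match with its line number in the log.
    grep -n 'empty!$' "$log_file"
}

# Example (run from inside the ccdeploy_* process directory):
#   find_alternatives_error logs/stderr.log
```

grep exits with a non-zero status when nothing matches, so the function can also be used in a condition to flag hosts whose deployment log contains the message.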
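Comparing this with the log below, taken from a server where the client configuration deployed successfully, confirms the cause: the deployment fails because the file /var/lib/alternatives/hadoop-conf is empty. On the healthy server the --install step completes and is followed by update-alternatives --auto hadoop-conf.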
+ for SPECIAL_FILE in '$DEST_PATH/{taskcontroller.cfg,container-executor.cfg}'
+ '[' -e /etc/hadoop/conf.cloudera.yarn/taskcontroller.cfg ']'
+ for SPECIAL_FILE in '$DEST_PATH/{taskcontroller.cfg,container-executor.cfg}'
+ '[' -e /etc/hadoop/conf.cloudera.yarn/container-executor.cfg ']'
++ basename /etc/hadoop/conf
+ LINK_BASENAME=conf
+ [[ -d conf ]]
+ '[' -n '' ']'
+ DEPLOYED_FILE_USER=root
+ rm -rf /etc/hadoop/conf.cloudera.yarn
+ cp -a /run/cloudera-scm-agent/process/ccdeploy_hadoop-conf_etchadoopconf.cloudera.yarn_-7727416719664341322/yarn-conf /etc/hadoop/conf.cloudera.yarn
+ chown root /etc/hadoop/conf.cloudera.yarn
+ chmod -R ugo+r /etc/hadoop/conf.cloudera.yarn
+ '[' -e /etc/hadoop/conf.cloudera.yarn/topology.py ']'
+ chmod +x /etc/hadoop/conf.cloudera.yarn/topology.py
+ /usr/sbin/update-alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.cloudera.yarn 92
+ /usr/sbin/update-alternatives --auto hadoop-conf
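Since the diagnosis comes down to the state of a single file, it can be checked directly on a host before or after a deployment. The following is a rough sketch rather than anything Cloudera Manager provides; the function name check_alternatives_file and its return codes are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether the alternatives state file is usable.
# Returns 0 if the file exists and is non-empty, 1 if it is empty,
# and 2 if it does not exist.
check_alternatives_file() {
    local file="${1:-/var/lib/alternatives/hadoop-conf}"
    if [ ! -e "$file" ]; then
        echo "missing: $file"
        return 2
    elif [ ! -s "$file" ]; then
        # -s is true only for files larger than zero bytes.
        echo "empty: $file"
        return 1
    else
        echo "ok: $file"
        return 0
    fi
}

# Example:
#   check_alternatives_file                      # checks the default path
#   check_alternatives_file /some/other/file     # checks an explicit path
```

The missing case is kept separate from the empty case because only the empty one matches the failure seen in this log.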
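Fixing the client configuration deployment failure
On a server where the client configuration deployed successfully, look at the file /var/lib/alternatives/hadoop-conf and copy its contents to the same path on the server where the deployment failed: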
[root@prefix.company-inc.com ~]$ more /var/lib/alternatives/hadoop-conf
auto
/etc/hadoop/conf
/opt/cloudera/parcels/CDH-5.16.2-1.cdh5.16.2.p0.8/etc/hadoop/conf.empty
10
/etc/hadoop/conf.cloudera.hdfs
90
/etc/hadoop/conf.cloudera.yarn
92
Once that is done, deploy the client configuration again and it should succeed.
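The copy step can be sketched as a small script that backs up the broken file before overwriting it. This is a sketch under assumptions, not the method used in the original fix: the function name restore_alternatives_file, the host name healthy-host, and the temporary path /tmp/hadoop-conf.good are all hypothetical. It also assumes both servers run the same CDH parcel version, because the file contains a parcel path.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replace a broken alternatives state file with a copy
# taken from a healthy server, keeping the old file as <dest>.bak.
restore_alternatives_file() {
    local src="$1"
    local dest="${2:-/var/lib/alternatives/hadoop-conf}"
    # Never overwrite the target with another empty or missing file.
    if [ ! -s "$src" ]; then
        echo "refusing to copy: $src is missing or empty" >&2
        return 1
    fi
    # Keep the broken file for reference before overwriting it.
    if [ -e "$dest" ]; then
        cp -p "$dest" "$dest.bak" || return 1
    fi
    cp "$src" "$dest"
}

# Example, run as root on the failed server (host name is a placeholder):
#   scp root@healthy-host:/var/lib/alternatives/hadoop-conf /tmp/hadoop-conf.good
#   restore_alternatives_file /tmp/hadoop-conf.good
#   update-alternatives --display hadoop-conf
```

Running update-alternatives --display hadoop-conf afterwards is a quick way to check that the restored file parses before redeploying the client configuration.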