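This article analyzes the ks-installer module of the KubeSphere project and proposes some ideas for improving it. ks-installer is the program that deploys KubeSphere into a k8s cluster.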

shell-operator

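The ks-installer Dockerfile uses a base image called shell-operator: https://github.com/flant/shell-operator

shell-operator is triggered by k8s events and calls the corresponding hook script. Hook scripts go in the /hooks directory, although a different directory can be specified: https://github.com/flant/shell-operator/blob/main/HOOKS.md

On startup, shell-operator scans the directory and runs every hook with the --config argument. Each hook must respond by printing the configuration it runs under.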

ks-installer's hooks

ks-installer keeps its hooks in the controller directory and copies them to /hooks/kubesphere/ in the image at build time. There are two hooks:

installRunner.py: watches clusterconfiguration
schedule.sh: a scheduled task

installRunner.py

installRunner.py parses the clusterconfiguration and then calls the corresponding Ansible playbooks. The main flow:

python
    if len(sys.argv) > 1 and sys.argv[1] == "--config":
        print(ks_hook)
    else:
        config.load_incluster_config()
        api = client.CustomObjectsApi()
        generate_new_cluster_configuration(api)
        generateConfig(api)
        # execute preInstall tasks
        preInstallTasks()
        resultState = getResultInfo()
        resultInfo(resultState, api)

config

The configuration printed for --config is:

json
{
	"onKubernetesEvent": [{
		"name": "Monitor clusterconfiguration",
		"kind": "ClusterConfiguration",
		"event": [ "add", "update" ],
		"objectName": "ks-installer",
		"namespaceSelector": {
			"matchNames": ["kubesphere-system"]
		},
		"jqFilter": ".spec",
		"allowFailure": false
	}]
}

In other words, this script is called whenever the clusterconfiguration named ks-installer in the kubesphere-system namespace receives an add or update event.
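To make the filter semantics concrete, here is a minimal Python sketch that checks an event against the binding above. `binding_matches` is a hypothetical helper written for illustration, not shell-operator's implementation, and it ignores `jqFilter` and `allowFailure`.

```python
import json

# The hook configuration printed for --config, copied from above.
KS_HOOK = json.loads("""
{
  "onKubernetesEvent": [{
    "name": "Monitor clusterconfiguration",
    "kind": "ClusterConfiguration",
    "event": ["add", "update"],
    "objectName": "ks-installer",
    "namespaceSelector": {"matchNames": ["kubesphere-system"]},
    "jqFilter": ".spec",
    "allowFailure": false
  }]
}
""")


def binding_matches(binding, event, kind, name, namespace):
    """Return True if an event satisfies every filter in one binding.

    Hypothetical helper: it mirrors the meaning of the fields only.
    """
    return (
        event in binding["event"]
        and kind == binding["kind"]
        and name == binding["objectName"]
        and namespace in binding["namespaceSelector"]["matchNames"]
    )


binding = KS_HOOK["onKubernetesEvent"][0]
# An update to ks-installer in kubesphere-system triggers the hook.
print(binding_matches(binding, "update", "ClusterConfiguration",
                      "ks-installer", "kubesphere-system"))
# A delete event is not in the "event" list, so it does not.
print(binding_matches(binding, "delete", "ClusterConfiguration",
                      "ks-installer", "kubesphere-system"))
```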

main

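The first few steps all update and generate configuration.
generate_new_cluster_configuration reads the cluster's clusterconfiguration, preprocesses its fields, and writes the result back to the clusterconfiguration.
generateConfig reads the clusterconfiguration again and saves it for getComponentLists to use later.
preInstallTasks runs the steps that must execute on every run (the files are in the playbooks directory):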


preinstall.yaml
metrics_server.yaml
common.yaml
ks-core.yaml
traefik.yaml

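getResultInfo is designed as a pluggable, modular structure that runs only the enabled modules on each run. It calls getComponentLists to get the enabled and disabled modules. Three modules are hard-coded as enabled by default (four in the latest version):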


multicluster
openpitrix
network

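None are disabled by default. The remaining modules are added according to which ones are enabled in the clusterconfiguration.
generateTaskLists then turns these modules into a list of tasks and calls Ansible to run the corresponding playbook for each. When a task finishes, its result is recorded in a file.
The last function, resultInfo, first runs the ks-config and result-info playbooks and then writes the results back to the status of the clusterconfiguration.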
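The selection logic described above can be sketched roughly as follows. This is not the code from installRunner.py: `get_component_lists` and `generate_task_list` are simplified stand-ins, and both the `spec[module]["enabled"]` layout and the one-playbook-per-module file naming are assumptions.

```python
# Modules the article says are hard-coded as enabled by default.
DEFAULT_ENABLED = ["multicluster", "openpitrix", "network"]


def get_component_lists(spec):
    """Split modules into (enabled, disabled) lists from a spec dict.

    Assumption: a module is pluggable when its value is a dict with an
    "enabled" key; everything else in the spec is skipped.
    """
    enabled = list(DEFAULT_ENABLED)
    disabled = []
    for name, value in spec.items():
        if not isinstance(value, dict) or "enabled" not in value:
            continue  # not a pluggable module entry
        if value["enabled"] is True:
            if name not in enabled:
                enabled.append(name)
        elif value["enabled"] is False:
            disabled.append(name)
    return enabled, disabled


def generate_task_list(enabled, playbook_dir="playbooks"):
    """Map each enabled module to the playbook assumed to install it."""
    return [
        {"module": name, "playbook": f"{playbook_dir}/{name}.yaml"}
        for name in enabled
    ]


# Illustrative spec; the field names are examples only.
spec = {
    "persistence": {"storageClass": ""},
    "devops": {"enabled": True},
    "logging": {"enabled": False},
}
enabled, disabled = get_component_lists(spec)
print(enabled)
print(disabled)
print(generate_task_list(enabled)[-1])
```

Note that nothing in this flow acts on the disabled list, which is the gap the "Problems" section below comes back to.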

Running the playbooks

ks-installer keeps its playbooks in the playbooks directory. Taking monitoring.yaml as an example, the content is:

yaml
---

- hosts: localhost
  gather_facts: false
  roles:
    - kubesphere-defaults
    - ks-monitor

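The playbook references two roles, each of which has a matching directory under roles. A role directory may contain many subdirectories, and the main.yaml in a subdirectory is parsed and executed as its entry point. ks-monitor, for example, has two subdirectories that contain a main.yaml: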


ks-monitor/
~ defaults/
    main.yaml
~ tasks/
    main.yaml

defaults mainly sets default variables, and tasks holds the tasks that actually perform the installation.

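Problems

Components can be plugged in but not unplugged: setting one to disabled triggers no cleanup at all

Every configuration change reruns all of the playbooks, which takes a long time

After etcd monitoring is enabled, prometheus does not mount the etcd secret

With monitoring set to true in the clusterconfiguration, changing etcd's monitoring field from false to true installs the etcd servicemonitor correctly, but the prometheus container does not mount the etcd secret.

prometheus-stack.yaml shows that prometheus.yaml is imported only when monitoring in the clusterconfiguration status is undefined or its status is not 'enabled'. In other words, once monitoring has been enabled, the prometheus YAML is never updated again.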

yaml
- import_tasks: prometheus.yaml
  when:
    - "status.monitoring is not defined or status.monitoring.status is not defined or status.monitoring.status != 'enabled'"
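To see why the import is skipped, the condition above can be restated in Python. This is only an illustration: `should_import_prometheus` is a hypothetical function, and `status` is a plain dict standing in for the clusterconfiguration status.

```python
def should_import_prometheus(status):
    """Mirror the `when` condition that guards prometheus.yaml."""
    monitoring = status.get("monitoring")
    if monitoring is None:
        return True   # status.monitoring is not defined
    if "status" not in monitoring:
        return True   # status.monitoring.status is not defined
    return monitoring["status"] != "enabled"


# First install: no monitoring status yet, so prometheus.yaml runs.
print(should_import_prometheus({}))
# Monitoring already enabled: prometheus.yaml is skipped, even if
# etcd.monitoring has just been switched on.
print(should_import_prometheus({"monitoring": {"status": "enabled"}}))
```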

etcd.yaml shows that when etcd.monitoring is true, only the YAML in the prometheus/etcd directory is applied, so the prometheus deployment itself is not updated.

yaml
---
- name: Monitoring | Installing etcd monitoring
  shell: "{{ bin_dir }}/kubectl apply -f {{ kubesphere_dir }}/prometheus/{{ item }}"
  loop:
    - "etcd"
  register: import
  failed_when: "import.stderr and 'Warning' not in import.stderr and 'spec.clusterIP' not in import.stderr"
  until: import is succeeded
  retries: 5
  delay: 3
  when:
    - etcd.monitoring is defined
    - etcd.monitoring == true

Fix: in etcd.yaml, add prometheus to the loop.

Open question: why does the loop in prometheus.yaml contain prometheus twice?

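Improvement plan

Reduce the run time after each configuration change by running only the modules whose configuration changed

Reduce the number of tasks preInstallTasks runs

Remove metrics-server, common and ks-core from preInstallTasks, keeping only preinstall

ks-core has no entry in the configuration file, so whether to run it is decided by checking whether a configuration from a previous run exists

Improve the getComponentLists logic

Remove the three default modules from readyToEnableLists

After each run generates a new configuration, save it to a last-applied-configs.yaml file. On the next run, compare the new configuration with the previous one and run only the modules whose configuration changed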
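A minimal sketch of this comparison, under some assumptions: the configuration is a dict keyed by module name, the helper names are made up for illustration, and the file is written as JSON (which is also valid YAML) so that only the standard library is needed.

```python
import json
import os

LAST_APPLIED = "last-applied-configs.yaml"  # file name from the plan above


def load_last_applied(path=LAST_APPLIED):
    """Return the config saved by the previous run, or None on first run."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def save_last_applied(config, path=LAST_APPLIED):
    """Persist the config that was just applied."""
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)


def changed_modules(new, old):
    """Names of top-level modules whose config differs from the last run.

    On the first run (old is None) every module counts as changed.
    """
    if old is None:
        return sorted(new)
    return sorted(
        name for name in new
        if name not in old or new[name] != old[name]
    )


def should_run_ks_core(old):
    """Run ks-core only when there is no previous configuration."""
    return old is None


old = {"devops": {"enabled": False}, "logging": {"enabled": True}}
new = {"devops": {"enabled": True}, "logging": {"enabled": True}}
print(changed_modules(new, old))   # only devops changed
print(changed_modules(new, None))  # first run: everything
print(should_run_ks_core(None))
```

Comparing whole per-module dicts keeps the check simple: any edited field in a module reruns that module, and untouched modules are skipped.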

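Add an uninstall feature

getComponentLists

If a module's enabled field changes from true to false, add it to readyToDisableList and return it

Give each module a corresponding uninstall module, called from generateTaskLists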
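The two uninstall steps could look roughly like this. It is a sketch, not existing ks-installer code: `ready_to_disable` and `generate_tasks` are hypothetical names, and the `uninstall-<module>.yaml` playbook naming is invented for illustration.

```python
def ready_to_disable(new, old):
    """Modules whose `enabled` flag went from True to False."""
    if not old:
        return []  # no previous run, so nothing was installed by us
    result = []
    for name, value in new.items():
        previous = old.get(name)
        if not isinstance(value, dict) or not isinstance(previous, dict):
            continue
        if previous.get("enabled") is True and value.get("enabled") is False:
            result.append(name)
    return result


def generate_tasks(to_enable, to_disable, playbook_dir="playbooks"):
    """Build install tasks first, then uninstall tasks.

    The uninstall playbook naming here is hypothetical.
    """
    tasks = [
        {"module": m, "action": "install",
         "playbook": f"{playbook_dir}/{m}.yaml"}
        for m in to_enable
    ]
    tasks += [
        {"module": m, "action": "uninstall",
         "playbook": f"{playbook_dir}/uninstall-{m}.yaml"}
        for m in to_disable
    ]
    return tasks


old = {"devops": {"enabled": True}, "logging": {"enabled": False}}
new = {"devops": {"enabled": False}, "logging": {"enabled": True}}
to_disable = ready_to_disable(new, old)
print(to_disable)
for task in generate_tasks(["logging"], to_disable):
    print(task)
```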