Basic Tips

A K8s cluster created by GKE has only one master, which is managed by the GKE service and cannot be accessed directly.

Use http://kubernetes_master_address/api/v1/namespaces/namespace_name/services/service_name[:port_name]/proxy to access services through kubectl proxy without exposing them to the internet. E.g. for Grafana: http://localhost:8001/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy/d/-eQmYbgiz/kubernetes-cluster-resource-monitor?orgId=1, where d/-eQmYbgiz/kubernetes-cluster-resource-monitor is the link of a particular dashboard copied from Grafana's main page [http://localhost:8001/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy/].
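
A minimal session, assuming the default proxy port 8001 and the monitoring-grafana service in kube-system from the example above:

kubectl proxy --port=8001 &
curl http://localhost:8001/api/v1/namespaces/kube-system/services/monitoring-grafana/proxy/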

The easiest way to write a YAML manifest for K8s from scratch is to use create with --dry-run, e.g.: kubectl create deployment app --image nginx -o yaml --dry-run. (kubectl run can also generate manifests this way.)
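
For example, to save the generated skeleton to a file and apply it after editing (the file name is arbitrary):

kubectl create deployment app --image nginx -o yaml --dry-run > app-deployment.yaml
kubectl apply -f app-deployment.yaml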

The rollout feature works like configuration checkpoints for a manifest: when an object is created with the --record flag, it records what caused each configuration change, e.g.: kubectl create deployment app --image nginx --record. Rolling back to a previous checkpoint is easy: kubectl rollout undo deployment app --to-revision=2, and kubectl rollout history deployment app --revision=2 shows revision 2's manifest.
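
A typical checkpoint/rollback workflow might look like this (the image tag and revision numbers are only illustrative):

kubectl create deployment app --image nginx --record        # revision 1
kubectl set image deployment app nginx=nginx:1.15 --record  # revision 2
kubectl rollout history deployment app                      # list all recorded revisions
kubectl rollout undo deployment app --to-revision=1         # roll back to revision 1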

To check rough resource usage of a cluster:

kubectl top nodes 
NAME           CPU(cores)   CPU%      MEMORY(bytes)   MEMORY%   
k8s-master-0   143m         3%        2448Mi          72%       
k8s-master-1   107m         2%        2735Mi          81%       
k8s-master-2   146m         3%        2630Mi          78%       
k8s-slave-0    89m          1%        2956Mi          18%       
k8s-slave-1    125m         1%        4736Mi          30%       
k8s-slave-2    82m          1%        1051Mi          6% 

To only show slave nodes:

kubectl top nodes | grep slave 
k8s-slave-0    94m          1%        2949Mi          18%       
k8s-slave-1    127m         1%        4738Mi          30%       
k8s-slave-2    81m          1%        1051Mi          6%  

To show only the slaves' average RAM usage (in %):

kubectl top nodes | grep k8s-slave |  awk '{ SUM += $5} END { print SUM/3 }'
17.6667

A shell script to monitor resource usage:

#!/bin/bash
# Average CPU% and MEMORY% across the slave nodes
lines=$(kubectl top nodes | grep k8s-slave -c)
cpu=$(kubectl top nodes | grep k8s-slave | awk '{ SUM += $3 } END { print SUM }')
ram=$(kubectl top nodes | grep k8s-slave | awk '{ SUM += $5 } END { print SUM }')
echo "cpu usage: $((cpu / lines))"
echo "ram usage: $((ram / lines))"

output:

$ ./resource.sh
cpu usage: 1
ram usage: 24

CNI

Issue: local Docker containers cannot reach the outside

  1. When using a K8s CNI, it changes the host iptables in a way that prevents local, non-K8s Docker containers from reaching the outside. To bring this back, add these iptables rules:

iptables -t nat -N DOCKER
iptables -t nat -A DOCKER -i docker0 -j RETURN
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
iptables -t nat -A OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
iptables -A FORWARD -i docker0 -o ens3 -j ACCEPT
iptables -A FORWARD -i ens3 -o docker0 -j ACCEPT

  2. Another option is to re-enable iptables handling in the local Docker daemon. Kubespray hardcodes disabling iptables for non-K8s containers by removing the docker0-related iptables rules. To enable it, set --iptables=true in roles/docker/templates/docker-options.conf.j2 when deploying with kubespray; the setting eventually ends up in /etc/systemd/system/docker.service.d/docker-options.conf. This is the better fix, as the first one is lost once the system reboots.

Differences between the port/service types (refer to Song's blog):

  1. HostPort: a 1-to-1 map/NAT between a pod port and a host port.
  2. NodePort: uses the port range 30000-32767 to NAT pod internal ports to ports exposed on the hosts' public interfaces. There is no need to pick a specific port number; K8s assigns one automatically once the service type is changed to NodePort.
  3. ClusterIP: does not expose ports on the hosts' public-facing interfaces; instead it creates an internal LB that is only visible inside the K8s cluster and maps ports onto the Docker bridged interfaces.
  4. LoadBalancer: similar to ClusterIP, but also exposes ports on the hosts' public interfaces and asks the cloud LB to assign an IP/LB instance.

In the following YAML, port: 22 means the ClusterIP (10.233.37.145) listens on 22, the real pod IP listens on the gitlab-shell port (also defined as targetPort in the pod YAML), and the host exposes port 30444 to the public.

  clusterIP: 10.233.37.145
  ports:
  - name: gitlab-shell
    nodePort: 30444
    port: 22
    protocol: TCP
    targetPort: gitlab-shell

Kubectl Commands

Label Usage

kubectl get pods --show-labels shows pods with their labels.
kubectl get pods -l 'env in (prod, dev)' shows pods that are either prod or dev.

How Kubernetes works

Concept of Services

As of now, kube-proxy has three modes: userspace, iptables, and IPVS. Userspace is a round-robin-based LB. For each Service it opens a randomly chosen port on the local node; any connection to this "proxy port" is proxied to one of the Service's backend Pods (as reported in Endpoints). It also installs iptables rules that capture traffic to the Service's clusterIP (which is virtual) and port and redirect it to that proxy port, which then proxies to the backend Pod. (diagram: svc-userspace)

Iptables mode is a random-based LB. For each Service, it installs iptables rules that capture traffic to the Service's clusterIP (which is virtual) and port and redirect it to one of the Service's backend sets. For each Endpoints object, it installs iptables rules that select a backend Pod. (diagram: svc-iptables)

IPVS is a newer mode introduced after 1.9 and is supposed to have better performance. It requires the IPVS kernel module to be installed on the nodes.

Normal usage should be iptables mode. In this case, keeping the mapping between the VIP and the endpoints/pods is kube-proxy's job; it runs on every node and injects the iptables rules it learns from the API server.
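
On a node, those rules can be inspected directly; in iptables mode the Service clusterIPs are matched in the KUBE-SERVICES chain of the nat table:

iptables -t nat -L KUBE-SERVICES -n | head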

TODO: a section on the Ingress service is still needed.

Job Usage

A Job is for a one-time, batch/scripted pod; once the job is done, the pod is destroyed.
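
A minimal sketch of a Job (the name, image, and command are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: one-time-task
spec:
  backoffLimit: 4            # retry the pod at most 4 times on failure
  template:
    spec:
      containers:
      - name: task
        image: busybox
        command: ["sh", "-c", "echo running batch task && sleep 5"]
      restartPolicy: Never   # do not restart once the script exits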

CronJob Usage

A CronJob is for periodically repeated jobs, e.g. a busybox pod containing a backup script that backs up storage every night.
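
A sketch of such a nightly backup CronJob (the image, schedule, and backup command are illustrative; on clusters newer than 1.21 the apiVersion is batch/v1):

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"               # every night at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: busybox
            command: ["sh", "-c", "tar czf /backup/data-$(date +%F).tar.gz /data"]
          restartPolicy: OnFailure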

CPU/RAM Usage

CPU

If a container requests more CPU at startup than any node has available (or than the resource limit/quota allows), the pod fails to schedule with an Insufficient cpu error.

RAM

If a container attempts to allocate more memory than its resource limit allows, it is OOMKilled because it runs out of memory.
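
Both are controlled with per-container requests/limits; a sketch with illustrative values:

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 250m          # used by the scheduler; too large -> Insufficient cpu
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi      # exceeding this gets the container OOMKilled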

Storage

The Volume API includes the PVC (PersistentVolumeClaim) feature. A PVC isolates disk space: it carves space out of a PV, which is then dedicated to a pod. A PV is associated with a StorageClass, which defines which storage backend is used.

When a pod consumes a volume through a PVC, it goes through this chain: Pod -> PVC -> PV -> StorageClass. The PVC declares what it wants to use and then digs into the PVs to find whatever matches. If a StorageClass is used, the user only needs to define the PVC; K8s can generate and bind a PV automatically based on the PVC/StorageClass combination.
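
A sketch of a PVC against a StorageClass plus a pod that mounts it (the StorageClass name "standard" and the sizes are assumptions):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: standard   # assumed StorageClass name
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: data-user
spec:
  containers:
  - name: app
    image: nginx
    volumeMounts:
    - mountPath: /data
      name: data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-claim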

Service Account

K8s has an important feature called RBAC (Role-Based Access Control). A user can bind a pod to a ServiceAccount and control what the pod can access by limiting the ServiceAccount's visible scope.

Create a new ClusterRoleBinding for the ServiceAccount default:exposecontroller:

kubectl create clusterrolebinding expose-rule --serviceaccount=default:exposecontroller --clusterrole=cluster-admin

To give a user admin rights only within a specific namespace, create a RoleBinding that binds the namespace's ServiceAccount to the edit ClusterRole.
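
For example, a sketch of such a binding (the namespace dev and its default ServiceAccount are placeholders; substitute the real namespace and account):

kubectl create rolebinding dev-edit --clusterrole=edit --serviceaccount=dev:default --namespace=dev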

Remove a Pod from a service

Changing a label on a running pod removes that pod from its service group. E.g. if we have a deployment with 3 replicated pods, each labeled run=nginx, then kubectl label pod nginx-fdsa-fds run=notworking --overwrite removes pod nginx-fdsa-fds from the service but keeps it alive and running, while a new pod labeled run=nginx is added to the service group.
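
A quick way to verify (the pod name and the run=nginx label are illustrative):

kubectl get pods -l run=nginx -o name          # the pod is listed while it carries the label
kubectl label pod nginx-fdsa-fds run=notworking --overwrite
kubectl get pods -l run=nginx -o name          # nginx-fdsa-fds is gone, a replacement appears
kubectl get pod nginx-fdsa-fds                 # still running, just no longer behind the service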

Get summary state of k8s cluster

kubectl cluster-info dump --all-namespaces --output-directory=$PWD/cluster-state
tree cluster-state

This dumps the full state of the cluster into the directory, and tree lists its structure.


The following notes are copied from Jimmy's wiki.

Get the Pod IP inside a container

This is done via an environment variable that directly references a status field of the resource, as shown below:

apiVersion: v1
kind: ReplicationController
metadata:
  name: world-v2
spec:
  replicas: 3
  selector:
    app: world-v2
  template:
    metadata:
      labels:
        app: world-v2
    spec:
      containers:
      - name: service
        image: test
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        ports:
        - name: service
          containerPort: 777

Inside the container, the POD_IP environment variable can then be used directly to get the Pod's IP.

Specify a container's startup command

In a Pod we can use command to specify a container's startup arguments:

command: ["/bin/bash","-c","bootstrap.sh"]

It looks simple: the command is defined as an array, exactly like the CMD configuration in a Dockerfile. One difference, however, is that bootstrap.sh must have the executable permission, otherwise the container fails to start.
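
One way to guarantee that is to set the bit when the image is built; a sketch of the relevant Dockerfile lines (the path is illustrative):

COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh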

Let a Pod use the host's Docker

Imagine a scenario where a Pod needs to call the host's Docker: simply mount the host's docker binary and the docker.sock file into the Pod, as follows:

apiVersion: v1
kind: Pod
metadata:
  name: busybox-cloudbomb
spec:
  containers:
  - image: busybox
    command:
    - /bin/sh
    - "-c"
    - "while true; \
       do \
       docker run -d --name BOOM_$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 6) nginx ; \
       done"
    name: cloudbomb
    volumeMounts:
    - mountPath: /var/run/docker.sock
      name: docker-socket
    - mountPath: /bin/docker
      name: docker-binary
  volumes:
  - name: docker-socket
    hostPath:
      path: /var/run/docker.sock
  - name: docker-binary
    hostPath:
      path: /bin/docker

Reference: Architecture Patterns for Microservices in Kubernetes

Use init containers to initialize application configuration

Init containers let a Pod run a batch of initialization containers, in order, before the application's containers start; the Pod only counts as successfully started after all init containers have finished successfully. See the example below (source: kubernetes: mounting volume from within init container - Stack Overflow):

apiVersion: v1
kind: Pod
metadata:
  name: init
  labels:
    app: init
  annotations:
    pod.beta.kubernetes.io/init-containers: '[
        {
            "name": "download",
            "image": "axeclbr/git",
            "command": [
                "git",
                "clone",
                "https://github.com/mdn/beginner-html-site-scripted",
                "/var/lib/data"
            ],
            "volumeMounts": [
                {
                    "mountPath": "/var/lib/data",
                    "name": "git"
                }
            ]
        }
    ]'
spec:
  containers:
  - name: run
    image: docker.io/centos/httpd
    ports:
      - containerPort: 80
    volumeMounts:
    - mountPath: /var/www/html
      name: git
  volumes:
  - emptyDir: {}
    name: git

This example pulls code from GitHub and stores it in the shared directory before the application starts.

For a more detailed description of init containers, see the init containers documentation.

Sync the container's time with the host

Many container images we download use GMT as the time zone, 8 hours off from Beijing time. This makes the log and file-creation times inside the container disagree with the actual time zone. There are two ways to solve this:

  • Modify the time zone configuration file inside the image
  • Mount the host's time zone configuration file /etc/localtime into the container as a volume

The second way is simpler: no image rebuild is needed, just add the following configuration to the application's YAML file:

volumeMounts:
- name: host-time
  mountPath: /etc/localtime
  readOnly: true
volumes:
- name: host-time
  hostPath:
    path: /etc/localtime

Get the host's hostname, namespace, etc. in a Pod

This tip supplements the first one about getting the podIP; the method is the same, it just lists more fields that can be referenced.

Refer to the Pod definition below: every pod has a {.spec.nodeName} field, and through fieldRef and an environment variable the Pod can obtain the host's hostname (by reading the MY_NODE_NAME environment variable).

apiVersion: v1
kind: Pod
metadata:
  name: dapi-test-pod
spec:
  containers:
    - name: test-container
      image: busybox
      command: [ "/bin/sh", "-c", "env" ]
      env:
        - name: MY_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: MY_POD_SERVICE_ACCOUNT
          valueFrom:
            fieldRef:
              fieldPath: spec.serviceAccountName
  restartPolicy: Never
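
Since the container only runs env and then exits, the injected values can be checked from its log (assuming the manifest above is saved as dapi-test-pod.yaml):

kubectl create -f dapi-test-pod.yaml
kubectl logs dapi-test-pod | grep MY_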

Configure Pods to use an external DNS

Modify the ConfigMap used by kube-dns:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"k8s.com": ["192.168.10.10"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]

upstreamNameservers is the external DNS to use; reference: Configuring Private DNS Zones and Upstream Nameservers in Kubernetes.
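
After updating the ConfigMap, resolution can be verified from any pod; here the busybox pod and the host name are illustrative:

kubectl exec -ti busybox -- nslookup host1.k8s.com      # resolved via the 192.168.10.10 stub domain
kubectl exec -ti busybox -- nslookup www.google.com     # resolved via the upstream 8.8.8.8 / 8.8.4.4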