20 commits
6df299f
Cherry picking hotfix changes to ci_dev (#605)
rashmichandrashekar Jul 14, 2021
3b38337
release changes (#607)
rashmichandrashekar Jul 15, 2021
bcea7fc
Gangams/aad stage3 msi auth (#585)
ganga1980 Jul 19, 2021
13eb3a6
Gangams/remove chart version dependency (#589)
ganga1980 Jul 20, 2021
63f22d9
Gangams/july 2021 release tasks 3 (#613)
ganga1980 Jul 23, 2021
902c939
remove un-used output plugin (#614)
vishiy Jul 23, 2021
a76905a
fix telegraf telemetry and improve fluentd liveness (#611)
ganga1980 Jul 23, 2021
52612b5
Gangams/july 2021 release tasks 2 (#612)
ganga1980 Jul 23, 2021
5b5d048
Fix out_oms.go dependency vulnerabilities (#623)
gracewehner Aug 13, 2021
2a0f4ec
revert libsystemd0 update (#616)
ganga1980 Aug 13, 2021
45f35ae
updates for ci-prod release instructions (#619)
ganga1980 Aug 13, 2021
10b2ea6
cherry pick changes from ci_prod (#622)
ganga1980 Aug 13, 2021
ad31c55
Support az login for passwords starting with dash ('-') (#626)
vladimir-babichev Aug 14, 2021
57beb59
Gangams/add telemetry fbit settings (#628)
ganga1980 Aug 17, 2021
cf4775a
check onboarding status (#629)
ganga1980 Aug 19, 2021
da55fe5
Gangams/arc k8s conformance test updates (#617)
ganga1980 Aug 19, 2021
e39b83b
upgrade golang version for windows in pipeline build and locally (#630)
gracewehner Aug 20, 2021
d4e2209
informer code - should be working
khchau7 Aug 30, 2021
1eeb7bd
optimizing for better cpu and memory usage
khchau7 Sep 1, 2021
9ba67bc
reverting to current prod in kube pod inv to test cpu spikes
khchau7 Sep 2, 2021
2 changes: 1 addition & 1 deletion .github/workflows/pr-checker.yml
@@ -56,7 +56,7 @@ jobs:
format: 'table'
severity: 'CRITICAL,HIGH'
vuln-type: 'os,library'
- skip-dirs: 'opt/telegraf'
+ skip-dirs: 'opt/telegraf,usr/sbin/telegraf'
exit-code: '1'
timeout: '5m0s'
WINDOWS-build:
@@ -40,5 +40,6 @@ package:
# to be named differently. Defaults to Dockerfile.
# In effect, the -f option value passed to docker build will be repository_checkout_folder/src/DockerFinal/Foo.dockerfile.
repository_name: 'cdpxlinux' # only supported ones are cdpx acr repos
- tag: 'ciprod' # OPTIONAL: Defaults to latest. The tag for the built image. Final tag will be 1.0.0alpha, 1.0.0-timestamp-commitID.
+ tag: 'ciprod' # OPTIONAL: Defaults to latest. The tag for the built image. Final tag will be 1.0.0alpha, 1.0.0-timestamp-commitID.
latest: false # OPTIONAL: Defaults to false. If tag is not set to latest and this flag is set, then tag as latest as well and push latest as well.
+ export_to_artifact_path: 'agentimage.tar.gz' # path for exported image and use this instead of fixed tag
1 change: 1 addition & 0 deletions .pipelines/pipeline.user.linux.yml
@@ -47,3 +47,4 @@ package:
repository_name: 'cdpxlinux' # only supported ones are cdpx acr repos
tag: 'cidev' # OPTIONAL: Defaults to latest. The tag for the built image. Final tag will be 1.0.0alpha, 1.0.0-timestamp-commitID.
latest: false # OPTIONAL: Defaults to false. If tag is not set to latest and this flag is set, then tag as latest as well and push latest as well.
+ export_to_artifact_path: 'agentimage.tar.gz' # path for exported image and use this instead of fixed tag
@@ -5,7 +5,7 @@ environment:
version: '2019'
runtime:
provider: 'appcontainer'
- image: 'cdpxwin1809.azurecr.io/user/azure-monitor/container-insights:6.0'
+ image: 'cdpxwin1809.azurecr.io/user/azure-monitor/container-insights:latest'
source_mode: 'map'

version:
@@ -53,3 +53,4 @@ package:
repository_name: 'cdpxwin1809' # only supported ones are cdpx acr repos
tag: 'win-ciprod' # OPTIONAL: Defaults to latest. The tag for the built image. Final tag will be 1.0.0alpha, 1.0.0-timestamp-commitID.
latest: false # OPTIONAL: Defaults to false. If tag is not set to latest and this flag is set, then tag as latest as well and push latest as well.
+ export_to_artifact_path: 'agentimage.tar.zip' # path for exported image and use this instead of fixed tag
3 changes: 2 additions & 1 deletion .pipelines/pipeline.user.windows.yml
@@ -5,7 +5,7 @@ environment:
version: '2019'
runtime:
provider: 'appcontainer'
- image: 'cdpxwin1809.azurecr.io/user/azure-monitor/container-insights:6.0'
+ image: 'cdpxwin1809.azurecr.io/user/azure-monitor/container-insights:latest'
source_mode: 'map'

version:
@@ -53,3 +53,4 @@ package:
repository_name: 'cdpxwin1809' # only supported ones are cdpx acr repos
tag: 'win-cidev' # OPTIONAL: Defaults to latest. The tag for the built image. Final tag will be 1.0.0alpha, 1.0.0-timestamp-commitID.
latest: false # OPTIONAL: Defaults to false. If tag is not set to latest and this flag is set, then tag as latest as well and push latest as well.
+ export_to_artifact_path: 'agentimage.tar.zip' # path for exported image and use this instead of fixed tag
74 changes: 74 additions & 0 deletions .pipelines/release-agent.sh
@@ -0,0 +1,74 @@
#!/bin/bash

# Note - This script is used in the pipeline as an inline script

# These are plain pipeline variables which can be modified by anyone on the team
# AGENT_RELEASE=cidev
# AGENT_IMAGE_TAG_SUFFIX=07222021

#Name of the ACR for ciprod & cidev images
ACR_NAME=containerinsightsprod.azurecr.io
AGENT_IMAGE_FULL_PATH=${ACR_NAME}/public/azuremonitor/containerinsights/${AGENT_RELEASE}:${AGENT_RELEASE}${AGENT_IMAGE_TAG_SUFFIX}
AGENT_IMAGE_TAR_FILE_NAME=agentimage.tar.gz

if [ -z "$AGENT_IMAGE_TAG_SUFFIX" ]; then
  echo "-e error value of AGENT_IMAGE_TAG_SUFFIX variable shouldnt be empty"
  exit 1
fi

if [ -z "$AGENT_RELEASE" ]; then
  echo "-e error AGENT_RELEASE shouldnt be empty"
  exit 1
fi

echo "ACR NAME - ${ACR_NAME}"
echo "AGENT RELEASE - ${AGENT_RELEASE}"
echo "AGENT IMAGE TAG SUFFIX - ${AGENT_IMAGE_TAG_SUFFIX}"
echo "AGENT IMAGE FULL PATH - ${AGENT_IMAGE_FULL_PATH}"
echo "AGENT IMAGE TAR FILE PATH - ${AGENT_IMAGE_TAR_FILE_NAME}"

echo "loading linuxagent image tarball"
IMAGE_NAME=$(docker load -i "${AGENT_IMAGE_TAR_FILE_NAME}")
# check the exit status of docker load right away; an intervening echo would reset $? to 0
if [ $? -ne 0 ]; then
  echo "-e error, on loading linux agent tarball from ${AGENT_IMAGE_TAR_FILE_NAME}"
  echo "** Please check if this was caused by a build error **"
  exit 1
else
  echo "successfully loaded linux agent image tarball"
fi
echo "IMAGE_NAME: $IMAGE_NAME"
# IMAGE_ID=$(docker images $IMAGE_NAME | awk '{print $3 }' | tail -1)
# echo "Image Id is : ${IMAGE_ID}"
prefix="Loadedimage:"
IMAGE_NAME=$(echo $IMAGE_NAME | tr -d '"' | tr -d "[:space:]")
IMAGE_NAME=${IMAGE_NAME/#$prefix}
echo "*** trimmed image name-:${IMAGE_NAME}"
echo "tagging the image $IMAGE_NAME as ${AGENT_IMAGE_FULL_PATH}"
# docker tag $IMAGE_NAME ${AGENT_IMAGE_FULL_PATH}
docker tag $IMAGE_NAME $AGENT_IMAGE_FULL_PATH

if [ $? -ne 0 ]; then
echo "-e error tagging the image $IMAGE_NAME as ${AGENT_IMAGE_FULL_PATH}"
exit 1
else
echo "successfully tagged the image $IMAGE_NAME as ${AGENT_IMAGE_FULL_PATH}"
fi

# used pipeline identity to push the image to ciprod acr
echo "logging to acr: ${ACR_NAME}"
az acr login --name ${ACR_NAME}
if [ $? -ne 0 ]; then
echo "-e error log into acr failed: ${ACR_NAME}"
exit 1
else
echo "successfully logged into acr:${ACR_NAME}"
fi

echo "pushing ${AGENT_IMAGE_FULL_PATH}"
docker push ${AGENT_IMAGE_FULL_PATH}
if [ $? -ne 0 ]; then
echo "-e error on pushing the image ${AGENT_IMAGE_FULL_PATH}"
exit 1
else
echo "Successfully pushed the image ${AGENT_IMAGE_FULL_PATH}"
fi
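
The image-name trimming in release-agent.sh depends on the text that `docker load` prints. Below is a minimal sketch of that step as a standalone function, assuming the output ends with a line of the form `Loaded image: <name>:<tag>`. The function name `extract_loaded_image` and the sample image reference are hypothetical and not part of this PR.

```shell
#!/bin/bash

# Hypothetical helper (not in the PR): pull the image reference out of
# `docker load` output. Assumes the last output line looks like
# "Loaded image: <name>:<tag>".
extract_loaded_image() {
  local load_output="$1"
  local prefix="Loaded image: "
  local last_line
  # Use only the last line in case other lines precede it
  last_line=$(printf '%s\n' "$load_output" | tail -n 1)
  case "$last_line" in
    "$prefix"*) printf '%s\n' "${last_line#"$prefix"}" ;;
    *) return 1 ;;  # unexpected format, let the caller decide what to do
  esac
}

# Demo with a canned string instead of a real `docker load` call
sample_output="Loaded image: example.azurecr.io/agent:ciprod01012021"
extract_loaded_image "$sample_output"
```

Unlike the whitespace-stripping approach in the script, this sketch fails explicitly when the line does not start with the expected prefix, for example when `docker load` reports `Loaded image ID: sha256:...` for an untagged image.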
2 changes: 1 addition & 1 deletion README.md
@@ -326,7 +326,7 @@ For DEV and PROD branches, automatically deployed latest yaml with latest agent
docker build -f ./core/Dockerfile -t <repo>/<imagename>:<imagetag> .
docker push <repo>/<imagename>:<imagetag>
```
- 3. update existing agentest image tag in e2e-tests.yaml with newly built image tag with MCR repo
+ 3. update existing agentest image tag in e2e-tests.yaml & conformance.yaml with newly built image tag with MCR repo

# Scenario Tests
Clusters used in the release pipeline already have the yamls under test\scenario deployed. Make sure to validate these scenarios.
18 changes: 18 additions & 0 deletions ReleaseNotes.md
@@ -11,6 +11,24 @@ additional questions or comments.

Note : The agent version(s) below has dates (ciprod<mmddyyyy>), which indicate the agent build dates (not release dates)

### 08/05/2021 -
##### Version microsoft/oms:ciprod08052021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod08052021 (linux)
##### Code change log
- Linux Agent
  - Fix for CPU spike which occurs at around 6:30am UTC every day because of unattended package upgrades
  - Update MDSD build which has fixes for the following issues
    - Nondeterministic core dump issue because of non-200 status codes and runtime exception stack unwinding
    - Reduce the verbosity of the error logs for OMS & ODS code paths.
    - Increase timeout for OMS Homing service API calls from 30s to 60s
  - Fix for https://github.com/Azure/AKS/issues/2457
  - In replicaset, tailing of the mdsd.err log file to agent telemetry


### 07/13/2021 -
##### Version microsoft/oms:win-ciprod06112021-2 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod06112021-2 (windows)
##### Code change log
- Hotfix for fixing NODE_IP environment variable not set issue for non sidecar mode

### 07/02/2021 -
##### Version microsoft/oms:ciprod06112021-1 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod06112021-1 (linux)
##### Version microsoft/oms:win-ciprod06112021 Version mcr.microsoft.com/azuremonitor/containerinsights/ciprod:win-ciprod06112021 (windows)
39 changes: 19 additions & 20 deletions ReleaseProcess.md
@@ -13,14 +13,12 @@ Here are the high-level instructions to get the CIPROD`<MM><DD><YYYY>` image for
2. Make a PR to the ci_dev branch and, once the PR is approved, merge the changes to ci_dev
3. The latest bits of ci_dev are automatically deployed to the CIDEV cluster in the build subscription, so just validate E2E to make sure everything works
4. If everything is validated in DEV, make a merge PR from ci_dev to ci_prod and merge once it has been reviewed by the dev team
- 6. Update following pipeline variables under ReleaseCandiate with version of chart and image tag
- - CIHELMCHARTVERSION <VersionValue> # For example, 2.7.4
- - CIImageTagSuffix <ImageTag> # ciprod08072020 or ciprod08072020-1 etc.
- 7. Merge ci_dev and ci_prod branch which will trigger automatic deployment of latest bits to CIPROD cluster with CIPROD`<MM><DD><YYYY>` image to test and scale clusters, AKS, AKS-Engine
- > Note: production image automatically pushed to CIPROD Public cloud ACR which will in turn be replicated to Public cloud MCR.
+ 5. Once the PR to ci_prod is approved, go ahead and merge, and wait for the ci_prod build to complete successfully
+ 6. Once the merged PR build has completed successfully, update the value of the AGENT_IMAGE_TAG_SUFFIX pipeline variable by editing the Release [ci-prod-release](https://github-private.visualstudio.com/microsoft/_release?_a=releases&view=mine&definitionId=38)
+ > Note - the value of the AGENT_IMAGE_TAG_SUFFIX pipeline variable should be in `<MM><DD><YYYY>` format for our releases
+ 7. Create a release by selecting the targeted build version of the _docker-provider_Official-ci_prod release
+ 8. Validate all the scenarios against clusters in build subscription and scale clusters
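
The note on AGENT_IMAGE_TAG_SUFFIX asks for a `<MM><DD><YYYY>` value. A minimal sketch of a format check that could run before creating the release is shown here; the function name `is_valid_tag_suffix` is hypothetical and nothing like it exists in this PR. It only checks the digit ranges, not whether the date is a real calendar date, and it would reject hotfix-style values such as those ending in `-1`.

```shell
#!/bin/bash

# Hypothetical check (not in the PR): accept only MMDDYYYY, with month
# 01-12, day 01-31 and a four-digit year. No calendar validation is done,
# so a value such as 02312021 would still pass.
is_valid_tag_suffix() {
  local suffix="$1"
  local pattern='^(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])[0-9]{4}$'
  [[ "$suffix" =~ $pattern ]]
}

# Demo with a few candidate values
for candidate in 07222021 7222021 13012021; do
  if is_valid_tag_suffix "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: rejected"
  fi
done
```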


# 2. Perf and scale testing

Deploy the latest omsagent yaml with the release candidate agent image onto supported k8s versions and validate all the critical scenarios. In particular, thoroughly validate the updates going out as part of this release and make sure there are no regressions. If this passes, deploy onto the scale cluster and validate perf and scale aspects. The scale cluster is in AME cloud, so coordinate with the agent team, who have access to this cluster, to deploy the release candidate onto it.
@@ -39,48 +37,49 @@ Image automatically synched to MCR CN from Public cloud MCR.

Make PR against [AKS-Engine](https://github.com/Azure/aks-engine). Refer PR https://github.com/Azure/aks-engine/pull/2318

- ## Arc for Kubernetes
+ ## Arc for Kubernetes

- The Ev2 pipeline is used to deploy the chart of the Arc K8s Container Insights Extension as per the Safe Deployment Process.
+ The Ev2 pipeline is used to deploy the chart of the Arc K8s Container Insights Extension as per the Safe Deployment Process.
Here is the high level process
```
1. Specify chart version of the release candidate and trigger [container-insights-arc-k8s-extension-ci_prod-release](https://github-private.visualstudio.com/microsoft/_release?_a=releases&view=all)
2. Get approval from one of the team members for the release
- 3. Once approved, the release should be triggered automatically
+ 3. Once approved, the release should be triggered automatically
4. Use `cimon-arck8s-eastus2euap` for validating the latest release in the canary region
5. TBD - Notify vendor team for the validation on all Arc K8s supported platforms
```

## Microsoft Charts Repo release for On-prem K8s
> Note: This chart repo is being used in the ARO v4 onboarding script as well.

- Since the HELM charts repo is being deprecated, the Microsoft charts repo is being used for the HELM chart release of on-prem K8s clusters.
- To make a chart release PR, fork [Microsoft-charts-repo](https://github.com/microsoft/charts/tree/gh-pages) and make the PR against the `gh-pages` branch of the upstream repo.
+ Since the HELM charts repo is being deprecated, the Microsoft charts repo is being used for the HELM chart release of on-prem K8s clusters.
+ To make a chart release PR, fork [Microsoft-charts-repo](https://github.com/microsoft/charts/tree/gh-pages) and make the PR against the `gh-pages` branch of the upstream repo.

Refer PR - https://github.com/microsoft/charts/pull/23 for example.
Once the PR is merged, the latest version of the HELM chart should be available within a couple of minutes at https://microsoft.github.io/charts/repo and https://artifacthub.io/.

Instructions to create PR
```
- # 1. create helm package for the release candidate
+ # 1. create helm package for the release candidate
git clone git@github.com:microsoft/Docker-Provider.git
cd ~/Docker-Provider # this path is based on where you have cloned the repo
git checkout ci_prod
cd charts/azuremonitor-containers
- helm package .
+ helm package .

- # 2. clone your fork repo and checkout gh-pages branch # gh-pages branch used as release branch
- cd ~
+ # 2. clone your fork repo and checkout gh-pages branch # gh-pages branch used as release branch
+ cd ~
git clone <your-forked-repo-of-microsoft-charts-repo>
cd ~/charts # assumed the root dir of the clone is charts
git checkout gh-pages

- # 3. copy release candidate helm package
- cd ~/charts/repo/azuremonitor-containers
+ # 3. copy release candidate helm package
+ cd ~/charts/repo/azuremonitor-containers
# update chart version value with the version of chart being released
- cp ~/Docker-Provider/charts/azuremonitor-containers/azuremonitor-containers-<chart-version>.tgz .
+ cp ~/Docker-Provider/charts/azuremonitor-containers/azuremonitor-containers-<chart-version>.tgz .
cd ~/charts/repo
- # update repo index file
+ # update repo index file
helm repo index .

# 4. Review the changes and make PR. Please note, you may need to revert unrelated changes automatically added by `helm repo index .` command

```
13 changes: 13 additions & 0 deletions build/linux/installer/conf/td-agent-bit-rs.conf
@@ -23,6 +23,19 @@
Skip_Long_Lines On
Ignore_Older 2m

[INPUT]
Name tail
Tag oms.container.log.flbplugin.mdsd.*
Path /var/opt/microsoft/linuxmonagent/log/mdsd.err
Read_from_Head true
DB /var/opt/microsoft/docker-cimprov/state/mdsd-ai.db
DB.Sync Off
Parser docker
Mem_Buf_Limit 1m
Path_Key filepath
Skip_Long_Lines On
Ignore_Older 2m

[INPUT]
Name tcp
Tag oms.container.perf.telegraf.*
20 changes: 0 additions & 20 deletions build/linux/installer/conf/telegraf-rs.conf
@@ -124,26 +124,6 @@
namedrop = ["agent_telemetry", "file"]
#tagdrop = ["AgentVersion","AKS_RESOURCE_ID", "ACS_RESOURCE_NAME", "Region","ClusterName","ClusterType", "Computer", "ControllerType"]

[[outputs.application_insights]]
## Instrumentation key of the Application Insights resource.
instrumentation_key = "$TELEMETRY_APPLICATIONINSIGHTS_KEY"

## Timeout for closing (default: 5s).
# timeout = "5s"

## Enable additional diagnostic logging.
# enable_diagnostic_logging = false

## Context Tag Sources add Application Insights context tags to a tag value.
##
## For list of allowed context tag keys see:
## https://github.com/Microsoft/ApplicationInsights-Go/blob/master/appinsights/contracts/contexttagkeys.go
# [outputs.application_insights.context_tag_sources]
# "ai.cloud.role" = "kubernetes_container_name"
# "ai.cloud.roleInstance" = "kubernetes_pod_name"
namepass = ["agent_telemetry"]
#tagdrop = ["nodeName"]

###############################################################################
# PROCESSOR PLUGINS #
###############################################################################
20 changes: 0 additions & 20 deletions build/linux/installer/conf/telegraf.conf
@@ -158,26 +158,6 @@
namepass = ["container.azm.ms/disk"]
#fieldpass = ["used_percent"]

[[outputs.application_insights]]
## Instrumentation key of the Application Insights resource.
instrumentation_key = "$TELEMETRY_APPLICATIONINSIGHTS_KEY"

## Timeout for closing (default: 5s).
# timeout = "5s"

## Enable additional diagnostic logging.
# enable_diagnostic_logging = false

## Context Tag Sources add Application Insights context tags to a tag value.
##
## For list of allowed context tag keys see:
## https://github.com/Microsoft/ApplicationInsights-Go/blob/master/appinsights/contracts/contexttagkeys.go
# [outputs.application_insights.context_tag_sources]
# "ai.cloud.role" = "kubernetes_container_name"
# "ai.cloud.roleInstance" = "kubernetes_pod_name"
namepass = ["agent_telemetry"]
#tagdrop = ["nodeName"]

###############################################################################
# PROCESSOR PLUGINS #
###############################################################################
3 changes: 3 additions & 0 deletions build/linux/installer/datafiles/base_container.data
@@ -150,6 +150,9 @@ MAINTAINER: 'Microsoft Corporation'

/etc/fluent/plugin/omslog.rb; source/plugins/utils/omslog.rb; 644; root; root
/etc/fluent/plugin/oms_common.rb; source/plugins/utils/oms_common.rb; 644; root; root
/etc/fluent/plugin/extension.rb; source/plugins/utils/extension.rb; 644; root; root
/etc/fluent/plugin/extension_utils.rb; source/plugins/utils/extension_utils.rb; 644; root; root


/etc/fluent/kube.conf; build/linux/installer/conf/kube.conf; 644; root; root
/etc/fluent/container.conf; build/linux/installer/conf/container.conf; 644; root; root
18 changes: 17 additions & 1 deletion build/linux/installer/scripts/livenessprobe.sh
@@ -11,13 +11,29 @@ fi

#optionally test to exit non zero value if fluentd is not running
#fluentd not used in sidecar container
- if [ "${CONTAINER_TYPE}" != "PrometheusSidecar" ]; then
+ if [ "${CONTAINER_TYPE}" != "PrometheusSidecar" ]; then
(ps -ef | grep "fluentd" | grep -v "grep")
if [ $? -ne 0 ]
then
echo "fluentd is not running" > /dev/termination-log
exit 1
fi
# fluentd launches by default supervisor and worker process
# so adding the liveness checks individually to handle scenario if any of the process dies
# supervisor process
(ps -ef | grep "fluentd" | grep "supervisor" | grep -v "grep")
if [ $? -ne 0 ]
then
echo "fluentd supervisor is not running" > /dev/termination-log
exit 1
fi
# worker process
(ps -ef | grep "fluentd" | grep -v "supervisor" | grep -v "grep" )
if [ $? -ne 0 ]
then
echo "fluentd worker is not running" > /dev/termination-log
exit 1
fi
fi

#test to exit non zero value if fluentbit is not running
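
The supervisor and worker checks added to livenessprobe.sh repeat the same grep pipeline. A minimal sketch of how the two checks could share one function is given below, assuming the process listing is piped in on stdin. The function name `fluentd_role_running` and the sample listing lines are hypothetical; actual fluentd process titles depend on the fluentd version and configuration.

```shell
#!/bin/bash

# Hypothetical helper (not in the PR). Reads a process listing on stdin and
# succeeds if a fluentd process for the given role ("supervisor" or
# "worker") is present, mirroring the grep filters used in the probe.
fluentd_role_running() {
  local role="$1"
  local fluentd_lines
  # Keep fluentd lines, dropping the grep process itself
  fluentd_lines=$(grep "fluentd" | grep -v "grep")
  case "$role" in
    supervisor) printf '%s\n' "$fluentd_lines" | grep -q "supervisor" ;;
    worker)     printf '%s\n' "$fluentd_lines" | grep -v "supervisor" | grep -q "fluentd" ;;
    *)          return 2 ;;  # unknown role
  esac
}

# Demo with a canned listing; these process titles are illustrative only
sample_listing='root 10 1 supervisor:fluentd
root 25 10 worker:fluentd'

for role in supervisor worker; do
  if printf '%s\n' "$sample_listing" | fluentd_role_running "$role"; then
    echo "fluentd $role is running"
  else
    echo "fluentd $role is not running"
  fi
done
```

In the probe itself the listing would come from the real process table, for example `ps -ef | fluentd_role_running supervisor`.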
@@ -36,14 +36,18 @@ def start

def enumerate
begin
- puts "Calling certificate renewal code..."
- maintenance = OMS::OnboardingHelper.new(
-   ENV["WSID"],
-   ENV["DOMAIN"],
-   ENV["CI_AGENT_GUID"]
- )
- ret_code = maintenance.register_certs()
- puts "Return code from register certs : #{ret_code}"
+ if !ENV["AAD_MSI_AUTH_MODE"].nil? && !ENV["AAD_MSI_AUTH_MODE"].empty? && ENV["AAD_MSI_AUTH_MODE"].downcase == "true"
+   puts "skipping certificate renewal code since AAD MSI auth configured"
+ else
+   puts "Calling certificate renewal code..."
+   maintenance = OMS::OnboardingHelper.new(
+     ENV["WSID"],
+     ENV["DOMAIN"],
+     ENV["CI_AGENT_GUID"]
+   )
+   ret_code = maintenance.register_certs()
+   puts "Return code from register certs : #{ret_code}"
+ end
rescue => errorStr
puts "in_heartbeat_request::enumerate:Failed in enumerate: #{errorStr}"
# STDOUT telemetry should already be going to Traces in AI.