
dcos/dcos 2307

DC/OS - The Datacenter Operating System

tillt/docker-kdc 45

Docker container generator for a Kerberos KDC.

mesosphere-backup/open-docs 20

[DEPRECATED] Documentation for Mesosphere supported open source projects.

dcos/dcos-mesos-modules 16

Mesos Modules used in DC/OS

lloesche/dcos-ovh-cloud 16

DC/OS on OVH Cloud Installer

tillt/jenkins-automation 16

Jenkins iOS Test Automation Integration

mesosphere/dcos-mesos-modules 1

Mesos Modules used in DC/OS

tillt/homy 1

location based configuration management

mesosphere/troubleshoot 0

Preflight Checks and Support Bundles Framework for Kubernetes Applications

tillt/3rdparty 0

Collection of the 3rdparty dependencies bundled into Mesos.

started mesosphere/konvoy-image-builder

started time in a day

PullRequestReviewEvent

pull request comment mesosphere/konvoy-image-builder

build(deps): bump hashicorp/packer from 1.7.4 to 1.7.5

@joejulian after a couple of retries this branch finally got green on the e2e kib test. I've gone ahead and added/increased retries so that we should not see such frequent failures anymore. See https://github.com/mesosphere/konvoy-image-builder/pull/97

dependabot[bot]

comment created time in a day

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha ea1a08959da1c8b1a48d2abdf293912663c0006c

chore: sprinkled some additional retries into SLES installation

push time in a day

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha a975bc1693e34f5c9925f5b7271e7cac05aa19c7

chore: sprinkled some additional retries and increased timeouts into SLES installation

push time in a day

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha 80bb75a1748a81d36434b73b9a970a2f7c0d80f0

chore: unified SLES ansible output casing

Till Toenshoff

commit sha 4476a5c66cfdb07b3222d412b4d0f02f08d872d8

chore: sprinkled some additional retries and increased timeouts into SLES installation

push time in a day

PR opened mesosphere/konvoy-image-builder

Increased timeouts and retries for SLES bug

In two control runs we saw two failures that might be retryable:

TASK [gpu : Install Nvidia drivers]
[...]
Failed to provide Package libX11-data-1.6.5-3.21.1.noarch (SLE-Module-Basesystem15-SP3-Updates). Do you want to retry retrieval?
TASK [packages : install common packages]
[...]
sles-15: fatal: [default]: FAILED! => {"attempts": 5, "changed": false, "cmd": ["/usr/bin/zypper", "--quiet", "--non-interactive", "--xmlout", "install", "--type", "package", "--auto-agree-with-licenses", "--no-recommends", "--", "+audit", "+conntrack-tools", "+open-vm-tools", "+python3-pip", "+python3-netifaces", "+socat", "+sysstat", "+nfs-utils"], "msg": "Zypper run command failed with return code 1.", "rc": 1, "stderr": "", "stderr_lines": [], "stdout": "<?xml version='1.0'?>\n<stream>\n<message type=\"error\">Unexpected 

The Nvidia driver installation had no retries at all, so this PR adds some. The common package installation already retried up to 5 times, but that may not be enough, as @fatz also hinted.

+17 -7

0 comments

2 changed files

pr created time in a day

create branch mesosphere/konvoy-image-builder

branch : till/sles-timeouts-inc

created branch time in a day

PullRequestReviewEvent

Pull request review comment mesosphere/troubleshoot

feat: add privilegedhostexec collector

+package collect
+
+import (
+	"bytes"
+	"context"
+	"path/filepath"
+	"time"
+
+	"github.com/pkg/errors"
+	"github.com/segmentio/ksuid"
+	appsv1 "k8s.io/api/apps/v1"
+	corev1 "k8s.io/api/core/v1"
+	kuberneteserrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/labels"
+	"k8s.io/client-go/kubernetes"
+	restclient "k8s.io/client-go/rest"
+	"k8s.io/utils/pointer"
+
+	troubleshootv1beta2 "github.com/replicatedhq/troubleshoot/pkg/apis/troubleshoot/v1beta2"
+	"github.com/replicatedhq/troubleshoot/pkg/logger"
+)
+
+// Privileged host exec collector is a collector that executes a container with
+// elevated permissions on nodes in the cluster based on the provided selector
+// and copies the results produced by the container to the support bundle.
+
+// The k8s has no primitive for running a dameonset like job that would run once
+// on all nodes in the cluster. To overcome this issue this collector runs
+// container that collects data as an `initContainer` and it expects that the
+// container runs one-off process with an expectation of a clean exit. Once the
+// initContainer is completed an additional `pause` container is launched that
+// is a no-op container. This container only waits and allows the main
+// support-bundle process to collect produced data (copy from the container).
+// The `initContainer` and `pause` containers share a volume to which the
+// container producing the support bundle data writes. Data written to other
+// paths by the `initContainer` will not be copied back to the support bundle.
+
+// This collector is a combination of `CopyFromHost` and `Exec` containers.
+
+const (
+	// privilegedHostExecSharedVolumePath is a path to a directory that is
+	// shared between the init container and pause container. The container that
+	// is producing data that should be copied to the diagnostics bundle should
+	// write data to this path.
+	privilegedHostExecSharedVolumePath = "/host/output"
+
+	// privilegedHostExecHostVolumePath is a path that is mounted to the container
+	// that is collecting data from the host node.
+	privilegedHostExecHostVolumePath = "/host"
+
+	// defaultPauseImage is the name of the image that will be launched to transfer
+	// data collected by the collector container.
+	defaultPauseImage = "mesosphere/pause-alpine:3.2"
+)
+
+// PrivilegedHostExec is a function that executes arbitrary container on all
// PrivilegedHostExec is a function that executes an arbitrary container on all
mhrabovcin

comment created time in a day

Pull request review comment mesosphere/troubleshoot

feat: add privilegedhostexec collector

+package collect
+
+import (
+	"bytes"
+	"context"
+	"path/filepath"
+	"time"
+
+	"github.com/pkg/errors"
+	"github.com/segmentio/ksuid"
+	appsv1 "k8s.io/api/apps/v1"
+	corev1 "k8s.io/api/core/v1"
+	kuberneteserrors "k8s.io/apimachinery/pkg/api/errors"
+	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+	"k8s.io/apimachinery/pkg/labels"
+	"k8s.io/client-go/kubernetes"
+	restclient "k8s.io/client-go/rest"
+	"k8s.io/utils/pointer"
+
+	troubleshootv1beta2 "github.com/replicatedhq/troubleshoot/pkg/apis/troubleshoot/v1beta2"
+	"github.com/replicatedhq/troubleshoot/pkg/logger"
+)
+
+// Privileged host exec collector is a collector that executes a container with
+// elevated permissions on nodes in the cluster based on the provided selector
+// and copies the results produced by the container to the support bundle.
+
+// The k8s has no primitive for running a dameonset like job that would run once
+// on all nodes in the cluster. To overcome this issue this collector runs
+// container that collects data as an `initContainer` and it expects that the
+// container runs one-off process with an expectation of a clean exit. Once the
+// initContainer is completed an additional `pause` container is launched that
+// is a no-op container. This container only waits and allows the main
+// support-bundle process to collect produced data (copy from the container).
+// The `initContainer` and `pause` containers share a volume to which the
+// container producing the support bundle data writes. Data written to other
+// paths by the `initContainer` will not be copied back to the support bundle.
+
+// This collector is a combination of `CopyFromHost` and `Exec` containers.

Great comment - very helpful.

mhrabovcin

comment created time in a day

PullRequestReviewEvent

push event mesosphere/charts

Till Toenshoff

commit sha 9fd90f90f1b2862b1850e4b64ac5ce208be9d1f2

chore: bumps DCGM exporter to 2.2.9 (#1227)

* chore: bumps DCGM exporter to 2.2.9
* chore: update requirements

push time in 3 days

delete branch mesosphere/charts

delete branch : till/bump-dcgm-to-229

delete time in 3 days

PR merged mesosphere/charts

chore: bumps DCGM exporter to 2.2.9 (labels: ready, chore)

What type of PR is this? Chore

What this PR does / why we need it:

Which issue(s) this PR fixes: https://jira.d2iq.com/browse/D2IQ-79233

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

updates Nvidia DCGM exporter to 2.2.9

Checklist

  • [ ] If a chart is changed, the chart version is correctly incremented.
  • [ ] The commit message explains the changes and why they are needed.
  • [ ] The code builds and passes lint/style checks locally.
  • [ ] The relevant subset of integration tests pass locally.
  • [ ] The core changes are covered by tests.
  • [ ] The documentation is updated where needed.
+4 -4

0 comments

4 changed files

tillt

pr closed time in 3 days

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha eb7bfe04c6527dfd1dc4e15f2ce71b60cea5400e

feat: Upgrade NVIDIA GPU drivers to 470.x (#96)

* chore: moved gpu related defines
* chore: removed unused variable definitions
* feat: upgrade GPU driver to 470.x

push time in 3 days

PR merged mesosphere/konvoy-image-builder

feat: Upgrade NVIDIA GPU drivers to 470.x (label: enhancement)

https://jira.d2iq.com/browse/D2IQ-79232

+4 -22

4 comments

2 changed files

tillt

pr closed time in 3 days

push event mesosphere/charts

Till Toenshoff

commit sha 267467edb50770d15de1e57088eb290ef6bf2806

chore: update requirements

push time in 4 days

PR opened mesosphere/charts

chore: bumps DCGM exporter to 2.2.9

What type of PR is this? Chore

What this PR does / why we need it:

Which issue(s) this PR fixes: https://jira.d2iq.com/browse/D2IQ-79233

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

updates Nvidia DCGM exporter to 2.2.9

Checklist

  • [ ] If a chart is changed, the chart version is correctly incremented.
  • [ ] The commit message explains the changes and why they are needed.
  • [ ] The code builds and passes lint/style checks locally.
  • [ ] The relevant subset of integration tests pass locally.
  • [ ] The core changes are covered by tests.
  • [ ] The documentation is updated where needed.
+3 -3

0 comments

3 changed files

pr created time in 4 days

create branch mesosphere/charts

branch : till/bump-dcgm-to-229

created branch time in 4 days

pull request comment mesosphere/konvoy-image-builder

wip: feat: Upgrade NVIDIA GPU drivers to 470.x

Ran the molecule GPU suite against this PR while disabling flatcar:

============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1
rootdir: /home/till
plugins: testinfra-6.4.0
collected 25 items

molecule/ec2_gpu/tests/test_default.py .........................         [100%]

5 tests per distro × 5 distros tested = 25 tests, all of which succeeded:

  • centos 7.8
  • centos 7.9
  • rhel 7.9
  • rhel 8.4
  • suse 15

That means we have the 470.x drivers available for all of those and they install and validate properly.

tillt

comment created time in 4 days

pull request comment mesosphere/konvoy-image-builder

wip: feat: Upgrade NVIDIA GPU drivers to 470.x

running nvidia-smi on the host:

$ nvidia-smi
Mon Sep 20 20:57:41 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   64C    P0    61W / 149W |      0MiB / 11441MiB |     68%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tillt

comment created time in 4 days

pull request comment mesosphere/konvoy-image-builder

wip: feat: Upgrade NVIDIA GPU drivers to 470.x

on centos7.9 this installs the following:

[...]
Sep 20 20:37:35 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-libs-470.57.02-1.el7.x86_64
Sep 20 20:37:35 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-NVML-470.57.02-1.el7.x86_64
Sep 20 20:37:35 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-devel-470.57.02-1.el7.x86_64
Sep 20 20:37:35 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-NvFBCOpenGL-470.57.02-1.el7.x86_64
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-cuda-libs-470.57.02-1.el7.x86_64
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-modprobe-branch-470-470.57.02-1.el7.x86_64
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-xconfig-branch-470-470.57.02-1.el7.x86_64
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal groupadd[20795]: group added to /etc/group: name=nvidia-persistenced, GID=994
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal groupadd[20795]: group added to /etc/gshadow: name=nvidia-persistenced
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal groupadd[20795]: new group: name=nvidia-persistenced, GID=994
Sep 20 20:37:40 ip-172-31-80-203.ec2.internal useradd[20800]: new user: name=nvidia-persistenced, UID=997, GID=994, home=/var/run/nvidia-persistenced, shell=/sbin/nologin
Sep 20 20:37:41 ip-172-31-80-203.ec2.internal systemd[1]: Reloading.
Sep 20 20:37:41 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-persistenced-branch-470-470.57.02-1.el7.x86_64
Sep 20 20:37:41 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-cuda-470.57.02-1.el7.x86_64

and then loads the installed modules successfully

[...]
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 243
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.57.02  Tue Jul 13 16:14:05 UTC 2021
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: nvidia-uvm: Loaded the UVM driver, major device number 241.
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.57.02  Tue Jul 13 16:06:24 UTC 2021
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: [drm] [nvidia-drm] [GPU ID 0x0000001e] Loading driver
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:1e.0 on minor 1
Sep 20 20:41:04 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:kmod-nvidia-latest-dkms-470.57.02-1.el7.x86_64
Sep 20 20:41:06 ip-172-31-80-203.ec2.internal yum[20642]: Installed: 3:nvidia-driver-branch-470-470.57.02-1.el7.x86_64
tillt

comment created time in 4 days

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha c2c82a1bef80da0beb3466ecff8fee4b590bcb8c

feat: upgrade GPU driver to 470.x

push time in 4 days

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha c8172c046e5dbb2cfd2266f996e4d9f1252c6001

chore: removed unused variable definitions

push time in 4 days

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha d87ba2b4d2a1ed36780a4853139d04f370b65a9c

wip: deduping

push time in 4 days

create branch mesosphere/konvoy-image-builder

branch : till/470-bump

created branch time in 4 days

PR opened mesosphere/konvoy-image-builder

fix: flatcar oci hook and containerd config
  • fixes missing OCI hook by adding the nvidia-container-toolkit
    • based on the approach demonstrated in https://github.com/mesosphere/konvoy-image-builder/pull/77
  • adds nvidia-persistenced and service description for it
  • fixes configuration for flatcar containerd service
  • adds comments and fixes some
  • minor cleanup in pulling some versions out into a var file
+160 -37

0 comments

9 changed files

pr created time in 11 days

push event mesosphere/konvoy-image-builder

Till Toenshoff

commit sha 4d7f90d44060fb19b2b2c0ab9780972667d8bd12

chore: comment cleanup

push time in 11 days