Travis tools: Restart failed jobs and clean caches

While writing a travis.yml file for the nextcloudpi project (read more about this here), I created several tools that I believe are worth sharing.

Restart failed jobs

One recurring issue I encountered with Travis was the unstable behavior of the armhf image, whose job sometimes failed for no apparent reason. Usually I just had to restart the job and it passed. Travis boxes seem to misbehave like this on and off, so restarting jobs is inevitable. Since it was something I had to do often, I decided to automate it.

Travis offers a command line client, travis-cli, which lets you perform with commands the actions you would otherwise do through the Travis web page.

However, the travis-cli package seems to have some issues on the distro I’m using (by the way… I’m using Arch 😎), so I decided to use a travis-cli docker image in my automation scripts. Docker is portable, so the same container runs on any host that has Docker installed.

The automated script I wrote to restart jobs is the following:

#!/usr/bin/env python3

"""

Automatic restarting failed jobs on Travis

This script constitutes an agent on host who
uses travis-cli to monitor the status of the
most recent build and restart any failed jobs.
Before running this script, generate a token on
Github page (https://github.com/settings/tokens)
and export it on host machine as an env var
named GITHUB_TOKEN
(export GITHUB_TOKEN=<github token>)
The script will return when the build passes.

    python restart_failed_jobs.py

"""

import os, signal, subprocess, sys

# Signal handler for termination signals
def termination_signal_handler(signalNumber, frame):
    print ('\nReceived signal no ', signalNumber, '\nKilling travis-cli container...')
    docker_kill()
    raise SystemExit(1)

signal.signal(signal.SIGTERM, termination_signal_handler)
signal.signal(signal.SIGINT, termination_signal_handler)

# Killing the running container of travis-cli
def docker_kill():
    subprocess.run("docker kill travis-cli", shell=True)
    return

def main():

    # Travis cli configuration

    # Build the travis cli docker image
    subprocess.run("cd .travis/travis-cli && docker build . -t travis-cli && cd ../..", shell=True)

    # Restarts need to be made interactively so that travis login is verified
    subprocess.run("docker run --name travis-cli --rm -t -d -v $(pwd):/project --entrypoint=/bin/sh travis-cli", shell=True)

    # Get github token env var
    gh_token = os.environ['GITHUB_TOKEN']

    # Enter the running container with docker exec and login to travis
    command_docker = "docker exec travis-cli travis login --pro --org --github-token "
    command_docker += gh_token
    subprocess.run(command_docker, shell=True)

    # Travis Build

    build_state = ''
    restart_attempt = 0
    # A daemon will run this block of code until the build is successful
    while (build_state != 'passed'):

        flag_attempt = 1

        # Run travis show to get info about the build state and its jobs
        travis_show = subprocess.run("docker exec travis-cli travis show", shell=True, encoding='utf-8', stdout=subprocess.PIPE)
        travis_show = travis_show.stdout.split('\n')
        
        subprocess.run("sleep 5", shell=True)

        # Extract status and number of current build
        build_state = travis_show[1].split()[1]
        build_num = travis_show[0].split()[1].lstrip('#').rstrip(':')

        # Extract info about jobs
        jobs = []
        for line in travis_show:
            if line.startswith('#'):
                jobs.append(line)

        for job in jobs:
            if any(status in job for status in ['failed:', 'errored:', 'canceled:']):
                num = job.split()[0].split('.')[1]
                restart_job = 'docker exec travis-cli travis restart '
                restart_job += build_num + '.' + num
                if flag_attempt:
                    restart_attempt+=1
                    print ('\n===Restart attempt no ' + str(restart_attempt) + '===')
                print ('Job ' + build_num + '.' + num + ' has failed. Restarting...')
                subprocess.run(restart_job, shell=True)
                flag_attempt = 0
    
    # Kill travis-cli docker container
    docker_kill()

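if __name__ == '__main__':
    try:
        main()
    except SystemExit:
        # Raised by the signal handler after it has killed the container
        print('Exiting gracefully...')
    except Exception as e:
        # Any other error: clean up the container before reporting it
        print('Caught error. Killing travis-cli container...')
        docker_kill()
        print('Error:', e)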

Usage: As soon as you push the commit that triggers a Travis build, run the script. If any job fails, the script detects it, restarts it through the travis-cli container, and keeps doing so until the build passes. Remember to export your GitHub token before you run the script.

The script parses the output of travis show in a loop; whenever a job has the status failed, errored or canceled, it restarts that job with travis restart.
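To make the parsing step easier to follow, here is a minimal sketch of it as a standalone function. The sample text is a hypothetical illustration of the layout the script expects (build number on the first line, state on the second, job lines starting with '#'), not captured output, and the name parse_travis_show is mine. Unlike the script, it checks the status column of each job line rather than searching the whole line.

```python
# Hypothetical sample, shaped like the output the script above expects.
SAMPLE_OUTPUT = """\
Build #42:  Add armhf image
State:         failed
Type:          push

#42.1 passed:    x86
#42.2 failed:    armhf
#42.3 errored:   arm64
"""

# Job states that should trigger a restart (same list as in the script).
RESTARTABLE = ("failed:", "errored:", "canceled:")


def parse_travis_show(output):
    """Return (build_number, build_state, job_ids_to_restart)."""
    lines = output.split("\n")
    # First line looks like "Build #42: ...", second like "State: failed".
    build_num = lines[0].split()[1].lstrip("#").rstrip(":")
    build_state = lines[1].split()[1]

    to_restart = []
    for line in lines:
        if not line.startswith("#"):
            continue  # not a job line
        fields = line.split()
        if len(fields) > 1 and fields[1] in RESTARTABLE:
            to_restart.append(fields[0].lstrip("#"))
    return build_num, build_state, to_restart


print(parse_travis_show(SAMPLE_OUTPUT))
```

Each returned job id can then be appended to travis restart, exactly as the script does.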

Clean caches

When I first started using caching in Travis, I noticed that caches are not cleaned after a build. So when I started a new build and a cache with the same name already existed, Travis reused it. If the previous build was successful, this speeds up the new build. But if the previous build failed, the leftover cache might cause problems in the new build as well. Thus, I prefer to clean the cache before each build.

I created the following script to do this automatically:

#!/usr/bin/env python3

"""

Automatic clearing travis cache

Before each build in Travis the cache
should be clean, for having a new clean
environment.
Before running this script, generate a
token on Github page 
(https://github.com/settings/tokens)
and export it on host machine as an
env var named GITHUB_TOKEN 
(export GITHUB_TOKEN=<github token>)

    python clear_travis_cache.py

"""

import os, subprocess, sys

# Killing the running container of travis-cli
def docker_kill():
    subprocess.run("docker kill travis-cli", shell=True)
    return

def main():

    # Travis cli configuration

    # Build the travis cli docker image
    subprocess.run("cd .travis/travis-cli && docker build . -t travis-cli && cd ../..", shell=True)

    # Clearing cache needs to be made interactively so that travis login is verified
    subprocess.run("docker run --name travis-cli --rm -t -d -v $(pwd):/project --entrypoint=/bin/sh travis-cli", shell=True)

    # Get github token env var
    gh_token = os.environ['GITHUB_TOKEN']

    # Enter the running container with docker exec and login to travis
    command_docker = "docker exec travis-cli travis login --pro --org --github-token "
    command_docker += gh_token
    subprocess.run(command_docker, shell=True)

    # Run travis cache to delete all caches of the repo
    subprocess.run("docker exec travis-cli travis cache -f --delete", shell=True)

    # Kill travis-cli docker container
    docker_kill()

if __name__ == '__main__':
    try:
        main()
    except:
        print ('Caught error. Killing travis-cli container...')
        docker_kill()
        e = sys.exc_info()[0]
        print('Error:', e)

Usage: Execute this script before you push your commit, so that your new build starts with clean caches. Also, remember to export your GitHub token before you run the script.

GitHub token

As the usage comment inside the scripts says, you need to export your GitHub token before you execute either of them. The reason is that the travis login command normally runs interactively in order to verify your credentials.

The automation scripts start the travis-cli container in the background and then use docker exec to run each travis-cli command inside it.

To log in without any interaction, the travis login command has to receive your GitHub token as a flag.

To be able to publish these scripts (without handing my GitHub token to everyone 😆), I read the token from an environment variable named GITHUB_TOKEN. So if you intend to use the scripts, just execute the following command first:

$ export GITHUB_TOKEN=<github_token>
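As a small sketch of how the token could be consumed more defensively, the function below fails early with a readable message when GITHUB_TOKEN is missing and builds the login command as an argument list, so the token is never interpolated into a shell string. The function name travis_login_command is hypothetical; the flags are copied unchanged from the scripts above.

```python
import os


def travis_login_command(environ=os.environ):
    """Build the docker exec command that logs travis-cli in with a token."""
    token = environ.get("GITHUB_TOKEN")
    if not token:
        # Stop early instead of failing later with a KeyError
        raise SystemExit(
            "GITHUB_TOKEN is not set; run: export GITHUB_TOKEN=<github_token>"
        )
    # Same flags as in the scripts above, passed as a list (no shell needed)
    return [
        "docker", "exec", "travis-cli",
        "travis", "login", "--pro", "--org",
        "--github-token", token,
    ]
```

The result can be passed directly to subprocess.run(travis_login_command(), check=True) once the travis-cli container is running.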

If you don’t have a GitHub token already, you can create one here.

Dockerfile

The Dockerfile I used for the travis-cli docker image is this one.

If you’re using the automation scripts, you have to copy this Dockerfile into the directory .travis/travis-cli inside your repository.

The scripts build the image every time they run, to ensure that it exists locally.

CI/CD for NextCloudPi: travis.yml explained

It’s time for NextCloudPi to adopt a modern software development practice: Continuous Integration / Continuous Delivery. CI/CD reduces the risk of each build, automates production and testing, and clears the way to get valuable features out to users faster. Thus, one of my GSoC tasks was to create this missing feature.

If you’re trying to create a travis.yml file for a project like NCP, this article may help you as well, since I will go through the whole process up to the final travis.yml.

The first thing I had to figure out was which CI/CD system to use. The NextCloudPi project has its git repository on GitHub, which limited the choice of CI systems. I finally decided to use Travis CI.

Travis CI is a hosted continuous integration service used to build and test software projects hosted on GitHub. It is recommended for open source projects, is cloud-based, and supports running Docker in its builds, which is important in NCP’s case, since docker images exist as an alternative to the arm images.

As of now, the .travis.yml that I created produces only NextCloudPi’s docker images (and its component docker images: debian-ncp, lamp and nextcloud) and not the native images, due to an issue I’m experiencing with the locales.

One may wonder, why not use Docker automated builds then? Docker Hub is indeed capable of automatically building images from GitHub repos. However, it currently uses the stable version of Docker, so no experimental features are supported, and it doesn’t let you choose a specific Docker daemon. NextCloudPi’s Dockerfiles use experimental features, and therefore using Travis CI to automate the builds is the best choice.

The .travis.yml I created is explained in the following image:

Let’s break it down and explain each step.

Briefly, what Travis does is clone your git repository and run the commands of travis.yml on a machine. You can specify some details about the machine that will be used, but the choices are quite limited.

The first thing to do in a travis.yml is to specify some details about the machine and your work: Do you need sudo? Which languages are you using? Which OS do you need? Which branches of your repo should be built?

In our case, we definitely need sudo, and we picked the generic language pack, which contains Docker and python; we install anything else we need ourselves. As for the OS, Ubuntu xenial was the best option at the time of writing the travis.yml file.

It’s also important to specify when the Travis CI build should take place. If you don’t specify this, it will build on every single git commit you push. If you want it to build only when you tag a git commit (the so-called build tags), then add the following line:

if: tag IS present

Now, let’s talk about the Travis ecosystem terms.

In Travis, a block of code can constitute a Job. A job contains commands that run sequentially inside the same VM. There is one big limitation on a job’s execution time, though: it must not exceed 50 minutes. After 50 minutes the job fails, and with it the whole build (unless the job is explicitly allowed to fail).

A job’s lifecycle is split into the following phases, which run in this order (install and script are the most often used):

  • before_install
  • install: install any dependencies required
  • before_script
  • script: run the build script
  • after_success/after_failure
  • deploy
  • after_script

Any of these phases can be omitted. In our case, we use install for any required packages, script for the main work and sometimes before_script to configure things.
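To illustrate that phases run in a fixed order no matter how they are written in the file, here is a minimal sketch. The job dictionary and its commands are hypothetical, and phases_in_order is only an illustration of the ordering, not how Travis is implemented.

```python
# Phases in the order Travis runs them within a single job.
LIFECYCLE = [
    "before_install",
    "install",
    "before_script",
    "script",
    "after_success",
    "after_failure",
    "deploy",
    "after_script",
]


def phases_in_order(job):
    """Return the phases a job defines, in execution order."""
    return [phase for phase in LIFECYCLE if phase in job]


# Hypothetical job definition, written in arbitrary order as in a yml file
job = {
    "script": ["./build.sh"],
    "install": ["sudo apt-get install -y some-package"],
    "before_script": ["export DOCKER_CLI_EXPERIMENTAL=enabled"],
}

print(phases_in_order(job))
```

Omitted phases simply do not appear in the result, which mirrors the fact that any of them can be left out.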

Different jobs can run concurrently within an entity called a Stage. Stages (or Build Stages, as they are often called) can contain multiple parallel jobs and have no time limit of their own, apart from the limit of each job. Stages themselves can only run sequentially. What’s most important to mention about stages is that each one starts on a fresh, independent VM.

What happens when you need to break your work down into stages (maybe due to the job time limit) but need some data from the previous stage to go on? Here comes the Cache. Travis offers a caching strategy which you can use to transfer your data from stage to stage. The cache is defined as a directory: we store data inside it, and the following stages can access it directly.

Combining these concepts, I decided to split the travis.yml into 3 stages: Building docker images, Testing and Pushing to DockerHub. Each architecture is independent from the others, thus its docker image can have its own job inside every stage.

Every job/architecture is using its own cache so that the images built at the first stage can be transferred to the next stages (example of cache for x86):

env:
  CACHE_NAME=X86

Make sure that parallel jobs in Travis have distinct names for their caches; otherwise they will share the same cache, and all the processes accessing it at once will cause errors in your build!
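A quick way to catch this mistake before pushing is to check the jobs of one stage for repeated cache names. The sketch below assumes a hypothetical in-memory representation of the jobs (a list of dictionaries with an env entry); reading the real yml file is left out.

```python
from collections import Counter


def duplicate_cache_names(stage_jobs):
    """Return the CACHE_NAME values used by more than one job of a stage."""
    counts = Counter(job["env"]["CACHE_NAME"] for job in stage_jobs)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical jobs of a single stage; the third one reuses X86 by mistake
stage_jobs = [
    {"name": "x86", "env": {"CACHE_NAME": "X86"}},
    {"name": "armhf", "env": {"CACHE_NAME": "ARMHF"}},
    {"name": "arm64", "env": {"CACHE_NAME": "X86"}},
]

print(duplicate_cache_names(stage_jobs))
```

The check is per stage on purpose: the same architecture reuses its cache name across stages, which is exactly how the images travel from one stage to the next.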

Also, in every stage we install Docker with the convenience script, because we need a recent release of Docker (at least 18.09) that supports the experimental features, and Travis does not yet provide this release by default. A bash script like the following will do the trick:

#!/bin/bash

set -o errexit

echo "INFO:
Updating docker configuration
"
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
echo $'{
  "experimental": true,
  "storage-driver": "overlay2"
}' | sudo tee /etc/docker/daemon.json
sudo service docker restart

After that, don’t forget to export the respective variable to enable the experimental features:

export DOCKER_CLI_EXPERIMENTAL=enabled

Building docker images

At this stage, each architecture job builds the docker images needed for NextCloudPi: debian-ncp, lamp, nextcloud and nextcloudpi. Then it saves the docker images in a tar file inside its cache.

Through this stage, I learned the hard way that Travis has many more limitations than the 50 minutes per job.

A job fails when it produces no output for more than 10 minutes. I solved this by adding the following line before any other command in the script section:

while sleep 9m; do echo "=====[ $SECONDS seconds, build-docker still building... ]====="; done &

There is also a limit on the log output. If you encounter an error that says “The log length has exceeded the limit of 4 Megabytes”, then guess what? You should produce less than 4 MB of output. :’) In my yml, I just redirect the long command’s output to a file. If you want to monitor it, follow the output file using this command:

tail -f output &
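The two workarounds above (a periodic heartbeat and output redirected to a file) can also be combined in one helper. This is only a sketch of the idea in Python, not what my yml does; the function name run_quietly and its parameters are hypothetical, and the default of 540 seconds mirrors the 9 minutes used in the shell loop.

```python
import subprocess
import time


def run_quietly(command, log_path, heartbeat_seconds=540):
    """Run command with its output in log_path, printing a heartbeat meanwhile."""
    start = time.monotonic()
    next_beat = start + heartbeat_seconds
    with open(log_path, "w") as log:
        # Send stdout and stderr to the file so the CI log stays small
        process = subprocess.Popen(command, stdout=log, stderr=subprocess.STDOUT)
        while process.poll() is None:
            time.sleep(0.1)
            if time.monotonic() >= next_beat:
                elapsed = int(time.monotonic() - start)
                # One short line is enough to reset the no-output timer
                print(f"=====[ {elapsed} seconds, still building... ]=====", flush=True)
                next_beat += heartbeat_seconds
    return process.returncode
```

A call such as run_quietly(["docker", "build", ".", "-t", "some-image"], "output") would then return the exit code of the build, with the full output available in the file for later inspection.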

The armhf images were struggling to stay within the time limit, so I split the Building stage into two stages: part 1 and part 2. The arm images build debian-ncp and lamp in part 1 (x86 is pretty fast, so there is no need to split its build), and in part 2 they load their lamp docker image and build nextcloud and nextcloudpi.

Lastly, I noticed that mysql had trouble starting properly, as it was requesting more open files than the ulimit allowed. This shouldn’t happen, as mysql usually works out the open files limit on its own from the ulimit inside the docker container, but in the Travis VM it didn’t (running the same docker images on my host worked just fine). To solve this, I set the limit manually in the mysql config file, using a sed command in lamp.sh.
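For readers who prefer to see the idea spelled out, here is a sketch of the same kind of edit in Python. It is not the sed command from lamp.sh: it assumes the setting in question is open_files_limit under the [mysqld] section, and both the value 4096 and the sample config text are hypothetical.

```python
import re


def set_open_files_limit(config_text, limit):
    """Set open_files_limit in a mysql config, adding it under [mysqld] if absent."""
    setting = f"open_files_limit = {limit}"
    existing = re.compile(r"^[ \t]*open_files_limit[ \t]*=.*$", re.MULTILINE)
    if existing.search(config_text):
        # Overwrite the value that is already configured
        return existing.sub(setting, config_text)
    # Otherwise insert the setting right after the [mysqld] section header
    return re.sub(
        r"^\[mysqld\][ \t]*$",
        "[mysqld]\n" + setting,
        config_text,
        count=1,
        flags=re.MULTILINE,
    )


# Hypothetical config fragment
config = "[mysqld]\nport = 3306\n"
print(set_open_files_limit(config, 4096))
```

Pinning the value this way means mysql no longer depends on the limit it detects inside the Travis VM.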

Testing

Once the jobs that build the docker images have passed, it’s time to test the images using the headless tests provided by the nextcloudpi repository.

The requirements that had to be installed on the Travis box for this stage were selenium and geckodriver, and of course Docker again, since we have a fresh VM.

The only image we need is nextcloudpi, so we load it and run the tests.

As soon as every job passes, this stage is complete.

Pushing to DockerHub

The last part of travis.yml focuses on pushing the docker images to DockerHub. They’re built and tested and ready to go public.

Each architecture will push its images and then we’ll create a manifest containing all the architectures.

Docker manifest is an experimental tool for working with image manifests. A manifest contains information about an image, such as layers, size and digest, and users can add extra information like OS and architecture. A manifest list groups the images of several architectures under one name, which is why manifest lists are often called “multi-arch images”.

The manifest list needs the nextcloudpi images of all architectures, so I decided to create two separate stages: one for pushing the 4 components of nextcloudpi and one for pushing the manifest list of all architectures.

The first of these stages loads the 4 images of each architecture (one job per architecture), tags them and pushes them to DockerHub. The DockerHub credentials should be entered through the Travis page; in the yml, use the variables $DOCKER_USERNAME and $DOCKER_PASSWORD to log in to your DockerHub account.

The last stage needs only one job, since all nextcloudpi images go into the manifest list together. Wondering which of the three caches will be used? None. The previous stage has already pushed the images to DockerHub, so why bother waiting for any cache? Just pull the images, log in to Docker again and use the docker commands manifest create and manifest annotate.
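As a rough sketch of what this last stage runs, the function below only assembles the docker manifest commands as argument lists; it does not execute them. The image names are hypothetical placeholders rather than the ones used in my yml, and the final docker manifest push step is my assumption about how the list gets published.

```python
def manifest_commands(manifest_list, images_by_arch):
    """Build the docker manifest commands for a multi-arch image."""
    images = list(images_by_arch.values())
    # Create the list from the per-architecture images already on DockerHub
    commands = [["docker", "manifest", "create", manifest_list, *images]]
    for arch, image in images_by_arch.items():
        # Record OS and architecture for each image in the list
        commands.append([
            "docker", "manifest", "annotate", manifest_list, image,
            "--os", "linux", "--arch", arch,
        ])
    commands.append(["docker", "manifest", "push", manifest_list])
    return commands


# Hypothetical image names, one per architecture
images_by_arch = {
    "amd64": "example/nextcloudpi-x86:latest",
    "arm": "example/nextcloudpi-armhf:latest",
    "arm64": "example/nextcloudpi-arm64:latest",
}

for command in manifest_commands("example/nextcloudpi:latest", images_by_arch):
    print(" ".join(command))
```

Each list can be handed to subprocess.run once DOCKER_CLI_EXPERIMENTAL is enabled and docker login has succeeded.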

That’s it! The travis.yml is ready to be used. All you have to do is create an account on Travis, associate it with GitHub, enable builds through the Travis page, pass your DockerHub credentials and drop the yml we just created inside your repo as .travis.yml.

You can find the travis.yml I created for nextcloudpi here: https://github.com/eellak/gsoc2019-NextCloudPi/blob/gsoc2019-travis/.travis.yml

Single architecture travis.yml

In order to provide an option for testing a single architecture, or even pushing only one architecture to DockerHub, I created a script that generates mini travis.yml files for a single architecture.

This script prompts the user to choose the architecture they want and generates the respective file.

You can find it here: https://github.com/eellak/gsoc2019-NextCloudPi/blob/gsoc2019-travis/.travis/travis_instances/single_arch_travis.sh

Extra tools

There are also some extra tools I created in order to automate the process of restarting failed jobs and cleaning the caches. You can read more about these tools here: Travis tools: Restart failed jobs and clean caches