How we supercharge Continuous Integration (CI) tools at Rakuten Viki!

Weiyuan · Published in Viki Blog · Dec 21, 2018

Some time ago, my colleague and manager of our platform team at Rakuten Viki, Omkiran, started a company-wide conversation on overhauling most of our infrastructure. At the time, several of our services were deployed on Virtual Machines (VMs), essentially consuming Infrastructure as a Service (IaaS) offerings. The premise of the conversation was as follows:

If an organisation's limited resources, such as engineers, are tied down by low-level tasks like managing the health and operability of various processes, how can we, as a company, free up those resources as the number of services scales up? How can we automate our entire infrastructure, so that we can move on and innovate with newer technologies, such as Machine Learning?

The proposed solution was to make use of cloud providers, or, more specifically for our implementation, Google Cloud Platform (GCP).

First, let us discuss why cloud providers are essential to meeting the premise highlighted above:

Fig 1. Comparing cloud service management requirements with physical solutions (original article)
  • Localized server racks require continuous support for physical assets; both hardware and software changes are challenges faced at this level.
  • VMs provided by IaaS providers remove the hardware burden shown in Fig 1. However, we still face software challenges such as managing OS upgrades, patches, and scaling to demand.

In the above, we see that both hardware and software pose challenges to continuing operations and to the ever-expanding business requirements of a growing organization. These challenges have, by and large, already been solved by the industry, as seen in Fig 1.

This leads us to turn to cloud providers, which offer managed services such as Platform as a Service (PaaS), so that we can free engineering resources and focus on solving challenges specific to growing our organization.

But how does this change tie in with CI tooling?

Fig 2. Why? Why? Why?

Firstly, what does CI mean to Viki?
Services at Viki use CI in the development part of the pipeline, often to automate the execution of new and regression tests that validate changes under multi-branch development. There are other usages specific to particular teams, such as pre-compiling assets for front-end services and creating APK files for Android applications. Sometimes it goes beyond CI, with a limited form of Continuous Deployment (CD) to staging and canary environments as part of a service's delivery mindset.

How was CI done in Viki?
Previously, we used three different CI tools to carry out CI within the company: two different versions of Drone, plus Jenkins. These CI tools were deployed as internal services on top of provisioned VMs, in the same way as our external-facing services.

Why should CI change to be a managed tool, if it is an internal service and not tied to client (user) utilization?
Client-facing services have an explicit requirement to auto-scale as user numbers increase or decrease, which is already provided for in most PaaS offerings. In the case of CI, it is the "service" whose "users" are our own services. As the number of services developed by the organization increases, the same issues faced by client-facing services surface, namely increased latency and more frequent failures. We could vertically or horizontally scale the machines that power our CI, but reserving bigger or more instances is not cost-efficient when the engineering core sits in similar time zones (usage is low at night) and CI jobs differ in the processing power they require, resulting in unpredictable fluctuations in usage. Horizontal auto-scaling is the solution here, but it would either have to be supported by the CI tool already, or we would have to build custom logic to spin VMs up and down whenever required.

Another issue was supporting multiple versions of the same CI tool, or different CI tools altogether. Unfortunately, this was the result of incomplete migrations and incompatible configuration across versions, which leads to another familiar conundrum from Fig 1: maintaining what is effectively a set of different "OSes".

How does GCP solve these problems?

We looked at GCP's offering, Cloud Build, to address the problems we faced with the existing iterations of our CI.

Fig 3. Cloud Build!

Firstly, I have to declare that Cloud Build by itself is not a complete CI tool. Instead, I would categorize it as a base on which anyone can build kick-ass CI tools.

Cloud Build works off two pieces of configuration: a trigger configuration and a build configuration. The trigger configuration is as follows:

Fig 4. Cloud Build trigger configuration

By linking to GitHub, Bitbucket, or Google Cloud Source Repositories, builds can be configured to trigger on watched files, or on everything except ignored files, for selected branches (with regex as the selection criteria, though mind you, negative look-ahead is not supported at this point). The build configuration itself is either a Dockerfile or a YAML configuration file written specifically for Cloud Build, and it determines whether a triggered build passes or fails.
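As a rough illustration, a trigger along these lines can also be expressed declaratively and imported with the gcloud CLI (something like gcloud beta builds triggers import --source=trigger.yaml). The repository name, branch regex and file filters below are placeholders, not our actual setup:

# trigger.yaml (illustrative sketch; names and patterns are assumptions)
description: CI trigger for an example service
triggerTemplate:
  projectId: example-project          # assumed GCP project
  repoName: github_example_service    # assumed mirrored repository name
  branchName: '^(master|release-.*)$' # regex; no negative look-ahead
filename: cloudbuild.yaml             # build configuration used for this trigger
ignoredFiles:
- 'docs/**'                           # changes to only these files do not trigger builds

This captures roughly the same information as the console form in Fig 4.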

For Viki, we opted to use Cloud Build's configuration file as the common standard, as opposed to a Dockerfile, because with the configuration file the "base" you choose for a build is essentially a machine type (of differing processing power and number of cores) rather than an image. This solves our problem of latency and failures due to scaling issues, and even allows us to speed up individual builds for critical services as and when we need to.

The following configuration covers a Node application with test and linter checks, along with building an image for deployment:

steps:
# Restore cached node_modules from a previous build (in-house helper image)
- id: retrieveNodeModules
  waitFor: ['-']
  name: 'gcr.io/${PROJECT_ID}/retrieve:1.0.0'
  args: ['$REPO_NAME', 'nodeModules']

# Install dependencies, preferring the restored cache
- id: installTestNpm
  waitFor: ['retrieveNodeModules']
  name: 'gcr.io/cloud-builders/npm'
  args: ['--loglevel=error', 'install', '--prefer-offline']

# Tests and linter checks run in parallel: both wait only on the install step
- id: runJasmineTests
  waitFor: ['installTestNpm']
  name: 'gcr.io/cloud-builders/npm'
  args: ['run', 'jasmine']

- id: runLinterChecks
  waitFor: ['installTestNpm']
  name: 'gcr.io/cloud-builders/npm'
  args: ['run', 'linter']

# Save node_modules back into the cache (in-house helper image)
- id: cacheNodeModules
  waitFor: ['installTestNpm']
  name: 'gcr.io/${PROJECT_ID}/cache:1.0.0'
  args: ['$REPO_NAME', 'nodeModules']

# Build and push the deployment image once tests and linting pass
- id: dockerBuild
  waitFor: ['runJasmineTests', 'runLinterChecks']
  name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-f', 'Dockerfile', '-t', 'gcr.io/$PROJECT_ID/project:$SHORT_SHA', '.']

- id: dockerPush
  waitFor: ['dockerBuild']
  name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/project:$SHORT_SHA']

timeout: 1800s

options:
  machineType: 'N1_HIGHCPU_8'

From the above example, we see a few configuration properties. The first to touch on are the steps' properties for the build. "name" and "args" mirror the common usage of Docker images, identifying the image to use and the arguments to pass when spinning it up as a container. "id" and "waitFor" create the build dependencies between steps, allowing steps to execute in parallel as discussed before.

Another property here is "machineType", under "options", which selects the machine the build runs on. This stacks with the ability to run steps in parallel: selecting a multi-core machine helps to shorten build time even further.

In some CI tools, a single build runs its steps serially in one bespoke environment, a consequence of each service having its own requirements. With Cloud Build's steps as seen above, each step is its own container image, which means we can create helper steps shared across different services, such as caching utilities to speed up build times, or helm-based deployment steps to staging or canary Kubernetes clusters (see the sketch below).
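As a rough sketch of what reusing such a shared step might look like in a service's cloudbuild.yaml, assuming an in-house helm helper image and cluster settings that are purely illustrative:

# Hypothetical reuse of a shared helper image for a staging deploy;
# the image name, chart path and cluster settings below are assumptions.
- id: deployStaging
  waitFor: ['dockerPush']
  name: 'gcr.io/$PROJECT_ID/helm:1.0.0'
  args: ['upgrade', '--install', 'my-service', './chart',
         '--set', 'image.tag=$SHORT_SHA']
  env:
  - 'CLOUDSDK_COMPUTE_ZONE=asia-southeast1-a'
  - 'CLOUDSDK_CONTAINER_CLUSTER=staging'

Because the helper is just another image in the registry, any service can adopt it by adding a step like this, without installing anything on the CI itself.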

Usage of Cloud Build expands even further with the rest of GCP. In the configuration YAML above, the Docker images for the different steps come from both public and private image repositories hosted on GCP. Through service accounts, Cloud Build can access private images without us having to configure and manage complex access mechanisms, such as creating and uploading public-private SSH key-pairs.

Better still, only administrators can assign access management rights, while regular developers hold read-only privileges. This safeguards our CI from failures caused by accidental changes.
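A rough sketch of what such a split can look like as IAM policy bindings, using GCP's predefined Cloud Build editor and viewer roles; the group names are placeholders, not our actual policy:

# Illustrative IAM bindings (placeholder groups, not our actual policy):
# admins can create and manage builds, developers can only view them.
bindings:
- role: roles/cloudbuild.builds.editor
  members:
  - group:infra-admins@example.com
- role: roles/cloudbuild.builds.viewer
  members:
  - group:developers@example.com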

With the above, we have created an automated build pipeline for a Node service, where every triggered build runs the essential tests and builds the Docker image for deployment. Migrating from our legacy CI tools, we are able to execute steps in parallel and utilize machines on demand, not only saving usage costs but also adding processing power when required.

So, how can we “supercharge” this experience even further?

One of the first things we identified about Cloud Build was that it is not a complete CI tool; rather, it is something we could build our own tooling on top of. Compared with a CI service like Drone, where plugins have to be added to the service itself, Cloud Build runs Docker images as individual steps in the build process. This makes developing helper steps easy, since Dockerfiles are already an adopted standard for deployment, whereas plugin development happens on a separate platform. Another point to note is that anyone can contribute to the ecosystem of helper steps by uploading them to a single repository and pooling their maintenance there. Using these helper steps is also easier than installing plugins, which requires service updates and access rights to the VMs.

Within a month of development, we came up with a multitude of helper images covering common requirements across different service builds.

Fig 5. A common repository for all helper images

We've discussed the caching and retrieving steps before. One secret behind them is the use of a Cloud Storage bucket to store the cached assets. By providing documentation for these images, colleagues could quickly adopt the caching tools without any knowledge of the underlying bucket access management or how buckets are allocated across repositories.
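For a sense of what such a caching step boils down to, here is a minimal sketch using the public gsutil builder directly; the bucket name and path layout are assumptions, not the actual helper's internals:

# Illustrative equivalent of the cache step: sync node_modules into a
# Cloud Storage bucket keyed by repository (bucket name is an assumption).
- id: cacheNodeModules
  waitFor: ['installTestNpm']
  name: 'gcr.io/cloud-builders/gsutil'
  args: ['-m', 'rsync', '-r', './node_modules',
         'gs://example-ci-cache/$REPO_NAME/nodeModules']

Wrapping this logic in a dedicated helper image keeps the bucket details out of each service's configuration.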

Another "supercharged" experience was making it easy to access private assets that we own. Using GCP Key Management Service (KMS), we created "vgit" and "vdocker" as specialised "git" and "docker" images, which can access our private repositories on GitHub (for example, to initialize private sub-modules) and push or pull Docker images on Docker Hub. The latter supported the migration of CI without breaking deployment dependencies that could not rely on the GCP image registry alone. The usage pattern is similar to the public variants, making each of them easy to "plug and play" into a build.
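For flavour, decrypting a credential with KMS inside a build step looks roughly like the following; the keyring, key and file names are illustrative, and the actual vgit/vdocker images wrap this kind of logic rather than exposing it in every build:

# Illustrative step that decrypts an encrypted deploy key with Cloud KMS
# so later steps can use it; keyring, key and file names are assumptions.
- id: decryptDeployKey
  name: 'gcr.io/cloud-builders/gcloud'
  args: ['kms', 'decrypt',
         '--ciphertext-file=deploy_key.enc',
         '--plaintext-file=/root/.ssh/id_rsa',
         '--location=global', '--keyring=ci-keyring', '--key=github-key']
  volumes:
  - name: 'ssh'
    path: /root/.ssh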

Fig 6. Leveraging Cloud Functions to build various notifiers

Build notifications are also not available natively in Cloud Build. What is provided, however, is the ability to subscribe via GCP Pub/Sub to a "Cloud Build" topic of build events. From this, we built a NodeJS-based Cloud Function that publishes to a list of notifier class objects, as seen in Fig 6. Currently, we support a SlackNotifier class, which is initialized with a Slack channel "webhook", a "repoName" regex that decides which repositories it applies to, and a "moduleName" referencing a separate list of modules, contributable by any developer, that decide the display format of the pass/fail Slack message.

Fig 7a. Cloud Build badges
Fig 7b. Cloud Build Github status as viewed in pull requests

Other innovations here include status notifications and build badges, shown in Fig 7a/b above. The status notification reports the elapsed time for each step and indicates which step caused a failure if one occurs, allowing for a quick understanding of build issues. Build badges give an at-a-glance account of a repository's status across recent builds.

The notifications do not end here: other additions, such as email alerts or even paging alerts, can be built and reused across different teams easily with the notification boilerplate we have created.

Lastly, another colleague of mine, Donovan, came up with the idea of parsing Cloud Build configuration files into graphs, to visualize the parallel execution of the steps. This is especially important for understanding projects with extensive and complex build steps. Building on his idea, a gem was created to make the graph generation process easier:

Fig 8. Graph visualization generated from sample Cloud Build configuration

All in all, we migrated from various CI tools to Cloud Build, solving our initial problems of latency and failures by increasing machine availability: each build now runs on its own VM under a managed CI. By parallelizing build steps and controlling VM types, we shortened build times compared with our previous CI tools. And by combining community assets with in-house helper images, we improved build times, feedback, and the overall developer experience of builds even further.
