Planet TripleO

May 18, 2018

Carlos Camacho

Testing Undercloud backup and restore using Ansible

Testing the Undercloud backup and restore

It is possible to test how the Undercloud backup and restore should be performed using Ansible.

The following Ansible playbooks show how Ansible can be used to exercise the backup and restore procedure in a test environment.

Creating the Ansible playbooks to run the tasks

Create a yaml file called uc-backup.yaml with the following content:

---
- hosts: localhost
  tasks:
  - name: Remove any previously created UC backups
    shell: |
      source ~/stackrc
      openstack container delete undercloud-backups --recursive
    ignore_errors: True
  - name: Create UC backup
    shell: |
      source ~/stackrc
      openstack undercloud backup --add-path /etc/ --add-path /root/
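
Once the backup playbook has run, the tarball should be sitting in the undercloud-backups Swift container. A quick sanity check before moving on (object names include a timestamp, so yours will differ):

  source ~/stackrc
  # List the objects created by the backup command
  openstack object list undercloud-backups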

Create a yaml file called uc-backup-download.yaml with the following content:

---
- hosts: localhost
  tasks:
  - name: Print destroy warning.
    vars:
      msg: |
        We are about to destroy the UC. As we are not
        moving the backup tarball outside the UC, we will
        download it and unzip it in a temporary folder to
        recover the UC using those files.
    debug:
      msg: "{{ msg.split('\n') }}"
  - name: Make sure the temp folder used for the restore does not exist
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: absent
  - name: Create temp folder to unzip the backup
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: directory
      owner: "stack"
      group: "stack"
      mode: "0775"
      recurse: "yes"
  - name: Download the UC backup to a temporary folder (After breaking the UC we won't be able to get it back)
    shell: |
      source ~/stackrc
      cd /var/tmp/test_bk_down
      openstack container save undercloud-backups
  - name: Unzip the backup
    become: true
    shell: |
      cd /var/tmp/test_bk_down
      tar -xvf UC-backup-*.tar
      gunzip *.gz
      tar -xvf filesystem-*.tar
  - name: Make sure stack user can get the backup files
    become: true
    file:
      path: "/var/tmp/test_bk_down"
      state: directory
      owner: "stack"
      group: "stack"
      mode: "0775"
      recurse: "yes"
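
Before destroying anything, it is worth confirming the download and extraction left the files the restore playbook expects (a small check; exact file names depend on the backup timestamp):

  ls -l /var/tmp/test_bk_down/
  # The restore playbook relies on these in particular
  ls /var/tmp/test_bk_down/all-databases-*.sql /var/tmp/test_bk_down/root/.my.cnf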

Create a yaml file called uc-destroy.yaml with the following content:

---
- hosts: localhost
  tasks:
  - name: Remove mariadb
    become: true
    yum: pkg="{{ item }}"
         state=absent
    with_items:
      - mariadb
      - mariadb-server
  - name: Remove files
    become: true
    file:
      path: "{{ item }}"
      state: absent
    with_items:
      - /root/.my.cnf
      - /var/lib/mysql
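
After running the destroy playbook you can verify that the Undercloud is really broken before attempting the restore (a quick check; the exact error output will vary):

  # The database server should be gone...
  sudo systemctl status mariadb
  # ...and OpenStack commands that need the database should now fail
  source ~/stackrc
  openstack stack list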

Create a yaml file called uc-restore.yaml with the following content:

---
- hosts: localhost
  tasks:
    - name: Install mariadb
      become: true
      yum: pkg="{{ item }}"
           state=present
      with_items:
        - mariadb
        - mariadb-server
    - name: Restart MariaDB
      become: true
      service: name=mariadb state=restarted
    - name: Restore the backup DB
      shell: cat /var/tmp/test_bk_down/all-databases-*.sql | sudo mysql
    - name: Restart MariaDB to refresh permissions
      become: true
      service: name=mariadb state=restarted
    - name: Register root password
      become: true
      shell: cat /var/tmp/test_bk_down/root/.my.cnf | grep -m1 password | cut -d'=' -f2 | tr -d "'"
      register: oldpass
    - name: Clean root password from MariaDB to reinstall the UC
      shell: |
        mysqladmin -u root -p"{{ oldpass.stdout }}" password ''
    - name: Clean users
      become: true
      mysql_user: name="{{ item }}" host_all="yes" state="absent"
      with_items:
        - ceilometer
        - glance
        - heat
        - ironic
        - keystone
        - neutron
        - nova
        - mistral
        - zaqar
    - name: Reinstall the undercloud
      shell: |
        openstack undercloud install

Running the Undercloud backup and restore tasks

To test the UC backup and restore procedure, after creating the Ansible playbooks run the following from the UC:

  # This playbook will create the UC backup
  ansible-playbook uc-backup.yaml
  # This playbook will download the UC backup to be used in the restore
  ansible-playbook uc-backup-download.yaml
  # This playbook will destroy the UC (remove DB server, remove DB files, remove config files)
  ansible-playbook uc-destroy.yaml
  # This playbook will reinstall the DB server, restore the DB backup, fix permissions and reinstall the UC
  ansible-playbook uc-restore.yaml

Checking the Undercloud state

After the Undercloud restore playbook finishes, the user should be able to execute any CLI command again, for example:

  source ~/stackrc
  openstack stack list
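
A couple of extra checks can be added if desired (a hedged sketch, using the same service and commands referenced by the playbooks above):

  # The restored database server should be running
  sudo systemctl status mariadb
  # The Keystone catalog should be readable again
  openstack catalog list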

Source code available in GitHub

by Carlos Camacho at May 18, 2018 12:00 AM

May 07, 2018

Juan Antonio Osorio

Kubecon Copenhagen 2018

Kubecon/CloudNativeCon Europe CPH from an OpenStack/TripleO point of view

I recently had the opportunity to attend Kubecon/CloudNativeCon Europe in Copenhagen. Although the event was very Kubernetes oriented, I chose to focus on the general security bits of the conference, as well as the service-mesh related topics. This was with the following reasoning:

  • Given that we’re aiming to containerize as much as we can from OpenStack, we really need to catch up and take more container security practices into use.

  • A lot of problems that we’re tackling on OpenStack are also being tackled in the kubernetes/cloud native community. We should try to converge whenever possible instead of trying to brew our own solutions.

  • The service mesh use-cases resonate very well with a lot of the security user-stories that we’ve been working on lately.

With this in mind, here is what I gathered from the different projects.

Service mesh

Lately I’ve been quite interested in the service-mesh topic since it’s brought up by folks tackling the same issues we’ve been facing lately in OpenStack & TripleO.

Background

A “service mesh” is about offloading all the network interaction from the service itself to a layer that sits somewhere on the host. This layer usually takes the form of a proxy. It can be in the form of a side-car container that runs with the application (so you’ll have a proxy per-service), or it can run as a singleton on the host and catch all traffic that goes towards the applications. The proxy will then be aware of all the applications that are part of the “mesh”, and handle load-balancing, service discovery, monitoring, policy, circuit breaking, and it can even be your TLS proxy for the service(s).

One important concept to understand is the separation between the control plane and the data plane. The control plane is how you configure the mesh itself (probably in the form of a service that exposes an API or a set of APIs for this purpose), while the data plane is handled by the proxy and is where all your application traffic flows. With this in mind, some mesh solutions already come with a control plane implemented, whereas for others you have to brew your own.

For more info on the service mesh, I recommend these blog posts:

Tying it up with OpenStack/TripleO

One of the features we did recently was enabling TLS everywhere for TripleO, and if I may say so… it was a big pain. First off, we had the issue of every service doing TLS in its own way, and having to configure (or even enable the configuration of) TLS for each technology, each with its own knobs and handles. Some services were even hard-coding ‘http’ in their endpoints, or were limited to just using IPs (not FQDNs). These are details and nits, but still stuff that you have to do and that takes up time.

The service mesh addresses this issue by allowing you to offload that to a proxy, which is where you configure TLS. So there is ONE way to set things up. Yes, it has its own knobs and handles, but at least there is only one set of knobs and handles to worry about.

There’s also the issue of getting an acceptable PKI with all the necessary features, as opposed to copying in a bunch of certificates and forgetting the rest. For this, in TripleO, we used FreeIPA (which I still think was a good choice).

The way this is addressed by service mesh solutions depends on the implementation. Some solutions, such as Istio and Conduit, provide their own PKI solution, so you’ll get TLS by default. In other implementations you have to provide your own. Given that we already have a chosen PKI, it shouldn’t be too hard to take it into use for this purpose, although Istio’s PKI (the one that I checked out at the conference) is not pluggable yet.

The proxy will also take care of metrics for you, so we could replace the OpenStack-specific OSProfiler and take that into use instead. This would give us more visibility on the overall OpenStack service performance, and help us identify bottlenecks.

Finally, a potential benefit would be to take the service-discovery and loadbalancing capabilities into use:

  • We could reduce the amount of hops through the network, since service-to-service communication would no longer need to go through HAProxy in all cases (Which is what happens when a service needs to talk to keystone, for instance).

  • We could potentially deploy several versions of the same service at the same time, and do a rolling-upgrade with relative ease thanks to the proxy (Although this would only be possible for services using Oslo.Versioned objects and with zero-downtime upgrades figured out).

While this is not something we have in the timeline at the moment, I would like to investigate this approach further, as it seems to provide quite a lot of benefits.

Istio

The first project that I had in mind tracking was Istio, which is a service mesh implementation that’s been getting a lot of momentum lately. Istio uses Envoy as the default underlying proxy that does the actual heavy lifting, and provides several services for configuring, monitoring and securing your service mesh. They have their own CA component (now called Citadel) which uses the SPIFFE specification to identify workloads and effectively give them certificates (more on SPIFFE later).

There seems to be a lot of on-going work to make Istio’s configuration easier; these enhancements include an API to configure Istio itself (instead of using Custom Resource Definitions). This API will hopefully also address the update issues they’ve been seeing, and also enable configuration rollbacks.

Istio also handles policy for the mesh traffic. For this, there are two built-in policy adaptors: One based on Kubernetes’ RBAC, and the other one based on Open Policy Agent (more on OPA later).

Unfortunately, it seems that all the efforts are right now being placed on Kubernetes. So, if we would like to take Istio into use, there are several things we would need to work on first in order to have it address our needs:

  • Make the CA component extendible

    • Currently the CA component is not extendible, so either we ditch our FreeIPA solution, or we code the pluggability into “Citadel”
  • Make the CA auth customizable

    • The built-in CA auth only takes Kubernetes into account, so not only would we need to get customizable auth policies into Istio, we would also need to come up with relevant policies to ensure that we’re giving out certs to the correct workloads/components.

    • Right now we have some assurance that only specific hosts can request certificates, and they can only do so for specific services, all thanks to FreeIPA’s Kerberos instance. We would need to add similar support to Istio.

  • Make Istio run in environments that are not Kubernetes

    • The current configuration and deployment scripts are only meant for kubernetes. Making it run in other environments would require a significant effort, as well as re-packaging of all the components.

    • There are options to make Istio add VMs to the service mesh, but they still require Kubernetes to be running somewhere, and are limited to Google Cloud VMs. So this doesn’t necessarily help us.

Even though we could run Istio without some components to have an easier setup, it seems to me that adopting it would take a significant amount of effort.

If we want to take the benefits of the service mesh into use in TripleO, it might be better to take a slower approach, and just take fewer components into use, instead of going with the full-blown solution. This led me to take a deeper look into just using Envoy. This conclusion was also inspired by the recommendations in other talks at the conference that mentioned how other companies started adopting the service mesh: going for a cautious and thoughtful approach instead of a full-on migration.

Envoy

Envoy is a magical proxy that folks have been using to enable service mesh architectures in their deployments. While it’s not the only proxy out there, it seems to be getting a lot of momentum in the community.

It supports proxying several protocols, including HTTP, HTTP/2, gRPC, MongoDB and Redis, among others. It’ll also provide L3/L4 filtering, health checking and TLS termination. It also provides statistics (with statsd integration) and distributed tracing via plugins.

For a long time it was configured with static files, but it now has support for dynamic configuration via gRPC. This is called the xDS API, and it is what control planes need to implement in order to take Envoy into use (this is what Istio does, for instance). So, in order for us to take it into use in OpenStack, we would need to either extend and run their reference xDS API implementation or implement our own (which is what other folks seem to be doing for some reason).

A nice feature of Envoy is its support for hot restarts, which is Envoy’s ability to reload and even do a binary update without dropping connections. I can see this feature being very useful for our updates/upgrades in TripleO.

Currently, it seems to me that if we want to experiment with proxies to try to bring service mesh benefits into TripleO, starting with Envoy and some services would be a good start.

SPIFFE

There were a lot of talks about SPIFFE/SPIRE in Kubecon. Given that it’s a security component and it seems to be taken into use by several other projects, I decided to take a look at it.

SPIFFE’s main goal is to establish trust between workloads and have the workload answer the question “who am I?”. So basically it aims to establish the workload’s identity. SPIFFE is an open specification which defines the overall mechanisms and what the API should look like. It introduces a couple of concepts:

  • SPIFFE ID

    • It’s how we identify the given workload

    • Takes the form of a URL such as the following: spiffe://acme.com/billing/payments

    • acme.com would be the trust domain (could be a kerberos realm?)

    • billing/payments would identify the workload.

  • SVID (SPIFFE Verifiable Identity Document)

    • The current proposal is an x509 certificate that contains your SPIFFE ID as part of the certificate’s SAN.

    • It’s meant to be a specification of the minimum requirements for a verifiable document, x509 is one implementation, but there could be more.

    • The advantage of the SVID being an x509 certificate is that it could effectively be used for TLS, which is what Istio is doing.
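
As a concrete illustration, since the proposed SVID is just an x509 certificate carrying the SPIFFE ID in its SAN, you can inspect one with plain openssl (a sketch; svid.pem is a placeholder file name):

  # Look for the URI SAN that carries the SPIFFE ID,
  # e.g. URI:spiffe://acme.com/billing/payments
  openssl x509 -in svid.pem -noout -text | grep -A1 'Subject Alternative Name'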

SPIFFE then dictates ways that the workload could get its SVID, which is via customizable and deployer-defined attestation policies. The policies are executed by an agent that runs on every node, and there should be a central server that ultimately verifies the claims and provides the agents with the actual SVIDs.

The team provided a reference implementation called SPIRE. This implements the SPIFFE specification and contains code for the agents and the server. There is already Vault support, and there is on-going work to hook up the server to HSMs.

It communicates with the workloads via a unix domain socket, so there’s no explicit authentication to communicate with the API. It is then the responsibility of the node-agent to check whether the workload’s request is actually valid, which is where the attestation policies kick in.

SPIRE already has customizable attestation policies which we could extend, and it seems that some folks from VMWare already implemented a Kerberos-based attestation plugin, which led to some very interesting conversations. I can definitely see that attestor being used with the Kerberos instance provided by FreeIPA.

Currently SPIRE uses short-lived certificates, but support for actual revocation is coming in later this year. There is still no decision as to what mechanism will be used (could be CRLs or OCSP).

There is also another limitation, which is HA support. Currently SPIRE is implemented with SQLite and assumes one server node. How to get SPIRE to use a distributed database and make it scale is still an active problem the team is tackling.

There will also be a public security audit of the code in the near future.

Unfortunately, even though Istio uses the SPIFFE specification for its CA (Citadel), it doesn’t use SPIRE, and instead contains its own implementation in the code-base. Asking around, it seems that the reason is to avoid locking the two projects’ development speeds together; this allows Istio to move at its own pace.

Open Policy Agent

Open Policy Agent aims to unify policy configurations. It takes the form of an agent that your service communicates with in order to enforce the policy decision. It has its own format for configuring policies, which the agent interprets.

It seems to be getting some adoption, with integration with Istio and Kubernetes.

The main reason I took a look into it was because I thought it would be a nice way to get the centralized policy story going for OpenStack. This would require us to implement an OPA driver in oslo in order to do policy, and have an agent running with each OpenStack service. Unfortunately it seems we would need to implement a control plane for OPA (centralized configuration), since currently only the agent is open source. So we would have an OpenStack-specific control plane that can read the already-existing policy.json files and transform them into OPA documents; these would then be taken into use by the agents. The separate agent approach would give us the advantage that we wouldn’t need to reload OpenStack services in order to apply new policies (although this is not a very common operation). Another benefit would be having the same policy format for Kubernetes and OpenStack.
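
To give an idea of the model, the agent exposes a REST data API that a service (or an oslo driver) would query for each decision. The sketch below assumes OPA's default port and a made-up "openstack/compute" policy package, so treat the paths and input document as illustrative only:

  # Ask the local OPA agent whether a request should be allowed
  curl -s -X POST http://localhost:8181/v1/data/openstack/compute/allow \
    -d '{"input": {"user": "demo", "action": "start_instance"}}'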

Conclusion

The conference was great and gave me a lot of ideas that we could take into use in OpenStack. Although we might not use all of them (or perhaps any of them; time will tell), I still learned a lot, and it got me very excited about the future of the cloud industry. I wish to see more interaction between the Cloud Native and the OpenStack communities, as I think there are a lot of convergence points where we could use the wisdom of both communities. I really hope I can attend again.

May 07, 2018 12:58 PM

May 04, 2018

Ben Nemec

OpenStack Virtual Baremetal on a Public Cloud

Background

At long last, OVB has reached the goal we set for it way back when it was more idea than reality. The plan was to come up with a system where we could test baremetal deployment against OpenStack instances in an arbitrary cloud. That would allow something like TripleO CI to scale, for all practical purposes, infinitely. For years this wasn't possible because OVB required some configuration and/or hacks that were not appropriate for production clouds, which limited it to use in specific CI-oriented clouds. Theoretically it has been possible to use OVB on normal clouds for a couple of years now, but in practice public clouds were either running OpenStack versions that were too old or didn't expose the necessary features. Fortunately, this is no longer true.

Enter Vexxhost (I should note that I have no affiliation with Vexxhost). Generally speaking, the OVB requirements are:

  • OpenStack Mitaka or newer
  • Heat
  • Access to the Neutron port-security extension
  • The ability to upload images

That may not be an exhaustive list, but it covers the things that frequently blocked me from running OVB on a cloud. Vexxhost ticks all the boxes.

Details

First, the OVB on Public Cloud demo video. As I mentioned in the video, there are some details that I glossed over in the interest of time (the video is nearly 20 minutes as it is). Here's a more complete explanation of those:

  • Quota. The default quota when you create an account on Vexxhost is somewhat restrictive for OVB use. You can fit an OVB environment into it, but you have to use the absolute minimum sizes for each instance. In some cases this may not work well. As I recall, the main restriction was CPU cores. To address this, just open a support ticket and explain what you're doing. I got my cpu core limit raised to 50, which enabled any nonha configuration I wanted. If you're doing an ha deployment you may need to have some other quota items raised, but you'll have to do the math on that.
  • Performance. Performance was a mixed bag, which is one of the downsides of running in a general purpose cloud. Some of my test runs were as fast as my dedicated local cloud, others were much slower.
  • Flavor disk sizes. As noted in the video, some of the Vexxhost flavors have a disk size of 0. Initially I thought this meant they could only be used with boot from volume, but fortunately it turns out you can still boot instances from ephemeral storage. This is important because on a public cloud you need to be able to nova rebuild the baremetal instances between deployments so they can PXE boot again. Nova can't currently rebuild instances booted from volume, so make sure to avoid doing that with OVB.

    Because of the 0 disk size, the virtual size of the images you boot will determine how much storage your instances get. By default, the CentOS 7 image only has 8 GB. That is not enough for a TripleO undercloud, although it's fine for the BMC. This can be worked around either by using qemu-img to resize a stock CentOS image and uploading it to Glance (a sketch of this is shown after this list), or by snapshotting an instance booted from a flavor that has a larger disk size. I did the latter because it allowed me to make some customizations (injecting ssh keys, setting a root password, etc.) to my base undercloud image. Either should work though.

    The original OVB ipxe-boot image has a similar problem. There is now an ipxe-boot-41 image with a virtual size of 41 GB that should provide enough storage for most deployment scenarios. If it doesn't, this image can also be resized with qemu-img.

    Another impact of this configuration is that the nodes.json created by build-nodes-json may have an incorrect disk size. It currently can't determine the size of a disk in an instance booted from a flavor with a 0 disk size. There are two options to deal with this. First, nodes.json can be hand-edited before importing to Ironic. Second, introspection can be run on node import, which is what I did in the video. It's a bit cleaner and demonstrates that introspection functionality works as expected. Introspection will look at the instance details from the inside so it can determine the correct disk size.

  • Cost. Obviously public cloud resources are not free. If you created the environment I used in my demo and left it running for a full month, at the current Vexxhost prices it would cost around $270 US. That's just for the instances, there will also be some small cost for image storage and bandwidth, although in my experience those would be negligible compared to the per-instance costs. That's not pocket change, but compared to the cost for purchasing and administering your own cloud it may be worth it.
  • [Edit 2018-5-7] Baremetal rebuilds. This is mentioned in the video, but it's important so I wanted to reiterate it here too. You must rebuild the baremetal instances between deployments in a public cloud environment. There is a rebuild-baremetal script provided that can automate this for you. In the case of the demo environment, it would have been called like this: rebuild-baremetal 2 baremetal demo. Note that this is run using the host cloud credentials, not the undercloud.
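
For reference, here is a rough sketch of the qemu-img approach mentioned in the flavor disk size note above (the image file name, size and Glance image name are just examples):

# Grow the virtual size of a stock CentOS 7 cloud image so the undercloud has enough disk
qemu-img resize CentOS-7-x86_64-GenericCloud.qcow2 50G
# Upload the resized image to Glance on the host cloud
openstack image create --disk-format qcow2 --container-format bare \
    --file CentOS-7-x86_64-GenericCloud.qcow2 centos7-50g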

by bnemec at May 04, 2018 06:04 PM

March 19, 2018

Giulio Fidente

Ceph integration topics at OpenStack PTG

I wanted to share a short summary of the discussions that happened around the Ceph integration (in TripleO) at the OpenStack PTG.

ceph-{container,ansible} branching

Together with John Fulton and Guillaume Abrioux (and after PTG, Sebastien Han) we put some thought into how to make the Ceph container images and ceph-ansible releases fit the OpenStack model better; the container images and ceph-ansible are in fact loosely coupled (not all versions of the container images work with all versions of ceph-ansible) and we wanted to move from a "rolling release" to a "point release" approach, mainly to permit regular maintenance of the previous versions known to work with the previous OpenStack versions. The plan goes more or less as follows:

  • ceph-{container,ansible} should be released together with the regular ceph updates
  • ceph-container will start using tags and stable branches like ceph-ansible does

The changes for the ceph/daemon docker images are visible already: https://hub.docker.com/r/ceph/daemon/tags/

Multiple Ceph clusters

In an attempt to better support the "edge computing" use case, we discussed adding support for the deployment of multiple Ceph clusters in the overcloud.

Together with John Fulton and Steven Hardy (and after PTG, Gregory Charot) we realized this could be done using multiple stacks and, by doing so, hopefully simplify management of the "cells" and avoid potential issues due to the orchestration of large clusters.

Much of this will build on Shardy's blueprint to split the control plane, see spec at: https://review.openstack.org/#/c/523459/

The multiple Ceph clusters specifics will be tracked via another blueprint: https://blueprints.launchpad.net/tripleo/+spec/deploy-multiple-ceph-clusters

ceph-ansible testing with TripleO

We had a very good chat with John Fulton, Guillaume Abrioux, Wesley Hayutin and Javier Pena on how to get new ceph-ansible pull requests tested with TripleO; basically, trigger an existing TripleO scenario on changes proposed to ceph-ansible.

Given that ceph-ansible is hosted on GitHub, Wesley and Javier suggested this should be possible with Zuul v3 and volunteered to help; some of the complications are around building an RPM from uncommitted changes for testing.

Move ceph-ansible triggering from workflow_tasks to external_deploy_tasks

This is a requirement for the Rocky release; we want to migrate away from using workflow_tasks and use external_deploy_tasks instead, to integrate into the "config-download" mechanism.

This work is tracked via a blueprint and we have a WIP submission on review: https://blueprints.launchpad.net/tripleo/+spec/ceph-ansible-external-deploy-tasks

We're also working with Sofer Athlan-Guyot on the enablement of Ceph in the upgrade CI jobs and with Tom Barron on scenario004 to deploy Manila with Ganesha (and CephFS) instead of the CephFS native backend.

Hopefully I didn't forget much; to stay updated on the progress join #tripleo on freenode or check our integration squad status at: https://etherpad.openstack.org/p/tripleo-integration-squad-status

by Giulio Fidente at March 19, 2018 02:32 AM

March 06, 2018

Marios Andreou

My summary of the OpenStack Rocky PTG in Dublin


My summary of the OpenStack Rocky PTG in Dublin

I was fortunate to be part of the OpenStack PTG in Dublin this February. Here is a summary of the sessions I was able to attend. In the end, the second day of the TripleO meetup on Thursday was disrupted as we had to leave the PTG venue. However, we still managed to cover a wide range of topics, some of which are summarized here.

In short and in the order attended:

  • FFU
  • Release cycles
  • TripleO


FFU

  • session etherpad
  • There are at least 5 different ways of doing FFU! Deployment projects update (tripleo, openstack-ansible, kolla, charms)
  • Some folks trying to do it manually (via operator feedback)
  • We will form a SIG (freenode #openstack-upgrades? ) –> first order of business is documenting something! Agreeing on best practices when FFU. –> meetings every 2 weeks?

Release Cycles

  • session etherpad
  • Release cadence to stay at 6 months for now. Wide discussion about the potential impacts of a longer release cycle including maintenance of stable branches, deployment project/integration testing and d/stream product release cycles, marketing, documentation and others. In the end the merits of a frequent upstream release cycle won, or at least, there was no consensus about getting a longer cycle.
  • On the other hand, operators still think upgrades suck and don’t want to do them every six months. FFU is being relied on as the least painful way to do upgrades at a longer cadence than the upstream 6 month development cycle, which for now will stay as is.
  • There will be an extended maintenance tag or policy introduced for projects that will provide long-term support (LTS) for stable branches

TripleO

  • main tracking etherpad

  • retro session (emilienm) session etherpad some main points here are ‘do more and better ci’, communicate more and review at least a bit outside your squad, improve bug triage, bring back deep dives.

  • ci session (weshay) session etherpad some main points here are ‘we need more attention on promotion’, upcoming features like new jobs (containerized undercloud, upgrades jobs), more communication with squads (upgrades ongoing for ex and continue to integrate the tripleo-upgrade role), python3 testing.

  • config download (slagle) session etherpad some main points are Rocky will bring config download and an ansible-playbook workflow for deployment of the environment, not just upgrades.

  • all in one (dprince) session etherpad some main points: using the containerized undercloud, have an ‘all-in-one’ role with only those services you need for your development at the given time. Some discussion around the potential CLI and pointers to more info https://review.openstack.org/#/c/547038/

  • tripleo for generic provisioning (shadower) session etherpad some main points are re-using the config download with external_deploy_tasks (idea is kubernetes or openshift deployed in a tripleo overcloud), some work still needed on the interfaces and discussion around ironic nodes and ansible.

  • upgrades (marios o/, chem, jistr, lbezdick) at session etherpad , some main points are improvements in the ci - tech debt (moving to using the tripleo-upgrade role now), containerized undercloud upgrade is coming in Rocky (emilien investigating), Rocky will be a stabilization cycle with focus on improvements to the operator experience including validations, backup/restore, documentation and cli/ui. Integration with UI might be considered during Rocky, to be revisited with the UI squad.

  • containerized undercloud (dprince, emilienm) session etherpad dprince gave a demonstration of a running containerized undercloud environment and reviewed the current work from the trello board. It is running well today and we can consider switching to default containerized undercloud in Rocky.

  • multiple ceph clusters (gfidente, johfulto), linked bug , discussion around possible approaches including having multiple heat stacks. gfidente or jfulton are better sources of info if you are interested in this feature.

  • workflows api (thrash) session etherpad , some main points are fixing inconsistencies in workflows (should all have an output value, and not trying to get that from a zaqar message) and fixing usability, make a v2 tripleo mistral workflows api (tripleo-common) and re-organise the directories moving existing things under v1, look into optimizing the calls to swift to avoid a large number of individual object GET as currently happens.

  • UI (jtomasek) session etherpad some main points here are adding UI support for the new composable networks configuration, integration with the coming config-download deployment, continue to increase UI/CLI feature parity, allow deployment of multiple plans, prototype workflows to derive parameters for the operator based on input for specific scenarios (like HCI), investigate root device hints support and setting physical_network on particular nodes. Florian led a side session in the hotel on Thursday morning, after we were kicked out of Croke Park stadium because nodublin, where we discussed allowing operators to upload custom validations and prototyping the use of swift for storing validations.

  • You might note that there are errors in the html validator for this post, but it’s late here and I’m in no mood to fight that right now. Yes, I know. cool story bro

March 06, 2018 03:00 PM

March 01, 2018

Carlos Camacho

My 2nd birthday as a Red Hatter

This post is about my experience working on TripleO as a Red Hatter for the last 2 years. On my 2nd birthday as a Red Hatter, I have learned about many technologies, really a lot… But the most intriguing thing is that here you never stop learning. Not just because you don’t want to stop learning new things, but because of the nature of the project, this project… TripleO…

TripleO (OpenStack On OpenStack) is software aimed at deploying OpenStack services using the OpenStack ecosystem itself; this means that we deploy a minimal OpenStack instance (the Undercloud) and, from there, deploy our production environment (the Overcloud)… Yikes! What a mouthful, huh? Put simply, TripleO is an installer which should make integrators’/operators’/developers’ lives easier, but the reality is sometimes far from the expectation.

TripleO is capable of doing wonderful things; with a little patience, love, and dedication, your hands can be the right hands to deploy complex environments with ease.

One of the cool things about being one of the programmers who write TripleO, from now on TripleOers, is that many of us also use the software regularly. We are writing code not just because we are told to do it, but because we want to improve it for our own purposes.

Part of the programmers’ motivation has to do with TripleO’s open‐source nature, so if you code in TripleO you are part of a community.

Congratulations! As a TripleO user or a TripleOer, you are a part of our community and it means that you’re joining a diverse group that spans all age ranges, ethnicities, professional backgrounds, and parts of the globe. We are a passionate bunch of crazy people, proud of this “little” monster and more than willing to help others enjoy using it as much as we do.

Getting to know the interface (the templates, Mistral, Heat, Ansible, Docker, Puppet, Jinja, …) and how all the components are tied together is probably one of the most daunting aspects of TripleO for newcomers (and not-so-newcomers). This for sure will raise the blood pressure of some of you who tried using TripleO in the past, but failed miserably and gave up in frustration when it did not behave as expected. Yeah.. sometimes that “$h1t” happens…

Although learning TripleO isn’t that easy, the architecture updates, the decoupling of the role services (“composable roles”), the backup and restore strategies, and the integration of Ansible, among many others, have made great strides toward alleviating that frustration, and the improvements continue through to today.

So this is the question…

Is TripleO meant to be “fast to use” or “fast to learn”?

This is a significant way of describing software products, but we need to know what our software will be used for… TripleO is designed to work at scale; it might be easier to deploy a few controllers and computes manually, but what about deploying 100 computes, 3 controllers and 50 cinder nodes, all of them configured to be integrated and work as one single “cloud”? Buum! So there we find the TripleO benefits: if we want to make it scale, we need to make it fast to use…

This means that we will find several customizations, hacks and workarounds to make it work as we need it.

The upside to this approach is that TripleO evolved to be super-ultra-giga customizable, so operators are able to produce great environments blazingly fast.

The downside? Haha, yes… there is a downside (or several). As with most things that are customized, TripleO became somewhat difficult for new people to understand. Also, it’s incredibly hard to test all the possible deployments, and when a user does non-standard or unsupported customizations, the upgrades are not as intuitive as they need to be…

This trade‐off is what I mean when I say “fast to use versus fast to learn.” You can be extremely productive with TripleO after you understand how it thinks (“yes, it thinks”).

However, your first few deployments and patches might be arduous. Of course, alleviating that potential pain is what our work is about. IMHO the pros outweigh the cons, and once you find a niche to improve, it will be a really nice experience.

Also, we have the TripleO YouTube channel, a place to push video tutorials and deep dive sessions driven by the community, for the community.

For the Spanish community we have a 100% translated TripleO UI, go to https://translate.openstack.org and help us to reach as many languages as possible!!!

www.anstack.com was born on July 5th of 2016 (first GitHub commit); yeah, it is my way of expressing my gratitude to the community by writing some Ctrl+C / Ctrl+V recipes to avoid the frustration of working with TripleO and not having something deployed and easy to use ASAP.

Anstack does not have much traffic, but it reached superuser.openstack.org, and the TripleO cheatsheets were at devconf.cz and FOSDEM, so in general it is really nice when people reference your writings anywhere. Maybe in the future it can evolve to be more related to ANsible and openSTACK ;) as TripleO is adding more and more support for Ansible.

What about Red Hat? Yeahp, I have spent a long time speaking about the project but haven’t spoken about the company making it all real. Red Hat is the world’s leading provider of open source solutions, using a community-powered approach to provide reliable and high-performing cloud, virtualization, storage, Linux, and middleware technologies.

There is a strong feeling of belonging at Red Hat; you are part of a team and a culture, and you are able to find a perfect balance between your work and life. Also, having people from all over the globe makes it a perfect place for sharing ideas and collaborating. Not all of it is easy, though; for example, working mostly remotely in upstream communities can be really hard to manage if you are not 100% sure about the tasks that need to be done.

Keep rocking and become part of the TripleO community!

by Carlos Camacho at March 01, 2018 12:00 AM

February 09, 2018

Steven Hardy

Debugging TripleO revisited - Heat, Ansible & Puppet

Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail.

In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments.  I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.

In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.

We'll start by looking at the deploy workflow as a whole and some heat interfaces for diagnosing the nature of the failure, then we'll look at how to debug directly via Ansible and Puppet.  In a future post I'll also cover the basics of debugging containerized deployments.

The TripleO deploy workflow, overview

A typical TripleO deployment consists of several discrete phases, which are run in order:

Provisioning of the nodes


  1. A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud)
  2. Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
  3. Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
  4. Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
  5. The nodes are provisioned by Ironic

This first phase is the provisioning workflow; after it is complete, the nodes are reported ACTIVE by nova (i.e. the nodes are provisioned with an OS and running).
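
While this phase runs, you can follow progress from the undercloud with a couple of standard commands (a quick sketch, using the undercloud credentials):

source ~/stackrc
# Ironic's view of the baremetal nodes being deployed
openstack baremetal node list
# Nova's view - the overcloud nodes should eventually go ACTIVE
openstack server list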

Host preparation

The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity):

  1. The node networking is configured, via the os-net-config tool
  2. We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
  3. We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)

Service deployment, step-by-step configuration

The final step is to deploy the services, either on the baremetal host or in containers, this consists of several tasks run in a specific order:

  1. We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
  2. We run "docker-puppet.py" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
  3. We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
  4. We run docker-puppet.py again (with a different configuration, only on one node, the "bootstrap host"); this does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.

Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "docker-puppet.py" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).

Below is a diagram which illustrates this step-by-step deployment workflow:
[Diagram: TripleO Service configuration workflow]

The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.

 

Debugging first steps - what failed?

Heat Stack create failed.
 

Ok something failed during your TripleO deployment, it happens to all of us sometimes!  The next step is to understand the root-cause.

My starting point after this is always to run:

openstack stack failures list --long <stackname>

(undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0:
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106
status: CREATE_FAILED
status_reason: |
Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |

PLAY [localhost] ***************************************************************

...

TASK [Run puppet host configuration for step 1] ********************************
ok: [localhost]

TASK [debug] *******************************************************************
fatal: [localhost]: FAILED! => {
"changed": false,
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",
"Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
]
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=18 changed=12 unreachable=0 failed=1
 

We can tell several things from the output (which has been edited above for brevity), firstly the name of the failing resource:

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0
  • The error was on one of the Controllers (ControllerDeployment)
  • The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
  • The failure was during the first step (Step1.0)
Then we see more clues in the deploy_stdout: ansible failed running the task which runs puppet on the host, and it looks like a problem with the puppet code.

With a little more digging we can see which node exactly this failure relates to, e.g we copy the SoftwareDeployment ID from the output above, then run:

(undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
29b3c254-5270-42ae-8150-9fc3f67d3d89
(undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
| 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | oooq_control |
 

Ok so puppet failed while running via ansible on overcloud-controller-0.

 

Debugging via Ansible directly

Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.

Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.

(undercloud) [stack@undercloud ~]$ openstack overcloud config download
The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
common_deploy_steps_tasks.yaml external_post_deploy_steps_tasks.yaml templates
Compute global_vars.yaml update_steps_playbook.yaml
Controller group_vars update_steps_tasks.yaml
deploy_steps_playbook.yaml post_upgrade_steps_playbook.yaml upgrade_steps_playbook.yaml
external_deploy_steps_tasks.yaml post_upgrade_steps_tasks.yaml upgrade_steps_tasks.yaml
 

Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps.  This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).

We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
...
TASK [Run puppet host configuration for step 1] ********************************************************************
ok: [192.168.24.6]

TASK [debug] *******************************************************************************************************
fatal: [192.168.24.6]: FAILED! => {
"changed": false,
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
"exception: connect failed",
"Warning: Undefined variable '::deploy_config_name'; ",
" (file & line not available)",
"Warning: Undefined variable 'deploy_config_name'; ",
"Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile
/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"

]
}

NO MORE HOSTS LEFT *************************************************************************************************
to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry

PLAY RECAP *********************************************************************************************************
192.168.24.6 : ok=56 changed=2 unreachable=0 failed=1
 

Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node.  We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).

If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.

 

Debugging via Puppet directly

Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet):

Firstly we log on to the node, and look at the files in the /var/lib/tripleo-config directory.

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@192.168.24.6
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Last login: Fri Feb 9 14:30:02 2018 from gateway
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
[heat-admin@overcloud-controller-0 tripleo-config]$ ls
docker-container-startup-config-step_1.json docker-container-startup-config-step_4.json puppet_step_config.pp
docker-container-startup-config-step_2.json docker-container-startup-config-step_5.json
docker-container-startup-config-step_3.json docker-container-startup-config-step_6.json
 

The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host.

We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value; this will be at the same value as the failing step, but it can also be useful sometimes to manually modify this for development testing of different steps for a particular service.

[root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
1
[root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json
{"step": 1}[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
...
Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain
 

Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181. I look at the file, fix the problem (ugeas should be augeas), then re-run puppet apply to confirm the fix.

Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.

That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.

by Steve Hardy (noreply@blogger.com) at February 09, 2018 05:04 PM

January 10, 2018

Ben Nemec

Zuul Status Page Production-Ready Configuration

About a year ago I got fed up with all the dynamic Zuul status pages (like this one). They're extremely heavyweight pages that make my browser grind almost constantly and there's no way to disable the auto-refresh while you're looking through the list. At least not that I found, and when I asked on openstack-dev the responses agreed.

So I wrote my own.

It was originally designed to keep track of the check-tripleo queue for OVB jobs, but since then I've added better support for the other queues as well. I'm very happy with how it works (shocking!), with one major exception: Periodically the webapp behind it will just stop responding. I suspect it has to do with connectivity issues to the Zuul status JSON endpoint and I've tried to set some timeouts to avoid that, but nothing I've tried has worked. Once a request gets stuck the entire app stops responding.

I suspect a big part of the problem is that I was using the WSGI server built in to Python instead of a production-ready one like uWSGI. It does work, but it seems to be a single-threaded implementation with no ability to handle parallel requests. When a request gets stuck this obviously is a problem. To address this, I've switched to running the app in uWSGI fronted by nginx. This seems to be a popular configuration and it's being used by some OpenStack services as well so it seemed like a useful thing to learn about.

You can see my uWSGI configuration file in the Git repo. I'll include my nginx config file at the end of the post. The command I use to start uWSGI is sudo uwsgi --ini zuul-status.ini

There are a few things I suspect are not ideal (socket in /tmp and running as root), but this is just running in a throwaway vm so if someone does manage to hose it up through some sort of exploit it won't be the end of the world. Be aware that my configuration is not necessarily following best practices though.

I figured I would write this up since I had some trouble finding complete, working configurations for the Python/Nginx/uWSGI combination. Found lots of people with non-working configs though. :-)

A few other environmental notes (consolidated into a quick setup sketch after this list):

  • I installed nginx on CentOS 7 using the repo provided by nginx. It sounds like it might also be available in EPEL.
  • I also enabled EPEL to get uWSGI.
  • The uwsgi-plugin-python package is also needed to run Python apps.
  • I create a virtualenv for the webapp in the venv directory that contains all of the dependencies for it.
  • I also shut off SELinux. Yeah, yeah, see above about throwaway VM.
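
Roughly, the notes above boil down to something like this on CentOS 7 (a sketch; the requirements file path is an assumption about the webapp's repo):

# nginx from the nginx repo (or EPEL), uWSGI and its Python plugin from EPEL
sudo yum install -y epel-release
sudo yum install -y nginx uwsgi uwsgi-plugin-python
# Virtualenv holding the webapp's dependencies
virtualenv venv
venv/bin/pip install -r requirements.txt
# I also shut off SELinux on this throwaway VM - don't do this in production
sudo setenforce 0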

nginx conf.d/default.conf:

server {
    listen       80;
    server_name  localhost;

    location / { try_files $uri @zuul-status; }
    location @zuul-status {
        uwsgi_pass unix:///tmp/uwsgi.sock;
        include uwsgi_params;
    }

    error_page   500 502 503 504  /50x.html;
    location = /50x.html {
        root   /usr/share/nginx/html;
    }
}

And that's pretty much it. The app has only been running in uWSGI for a few hours so I can't say whether it will actually help with stability, but hopefully this will be helpful to someone anyway. I'll try to remember to update this post after a while to report how it worked.

by bnemec at January 10, 2018 08:06 PM

December 13, 2017

James Slagle

TripleO and Ansible (Part 2)

In my last post, I covered some of the details about using Ansible to deploy
with TripleO. If you haven’t read that yet, I suggest starting there:
http://blog-slagle.rhcloud.com/?p=355

I’ll now cover interacting with Ansible more directly.

When using --config-download as a deployment argument, a Mistral workflow will be enabled that runs ansible-playbook to apply the deployment and configuration data to each node. When the deployment is complete, you can interact with the files that were created by the workflow.

Let’s take a look at how to do that.

You need to have a shell on the Undercloud. Since the files used by the workflow potentially contain sensitive data, they are only readable by the mistral user or group. So either become the root user, or add your interactive shell user account (typically “stack”) to the mistral group:

sudo usermod -a -G mistral stack
# Activate the new group
newgrp mistral

Once the permissions are sorted, change to the mistral working directory for
the config-download workflows:

cd /var/lib/mistral

Within that directory, there will be directories named according to the Mistral
execution uuid. An easy way to find the most recent execution of
config-download is to just cd into the most recently created directory and list
the files in that directory:

cd 2747b55e-a7b7-4036-82f7-62f09c63d671
ls

The following files (or a similar set, as things could change) will exist:

ansible.cfg
ansible.log
ansible-playbook-command.sh
common_deploy_steps_tasks.yaml
Controller
deploy_steps_playbook.yaml
deploy_steps_tasks.yaml
external_deploy_steps_tasks.yaml
external_post_deploy_steps_tasks.yaml
group_vars
ssh_private_key
templates
tripleo-ansible-inventory
update_steps_playbook.yaml
update_steps_tasks.yaml
upgrade_steps_playbook.yaml
upgrade_steps_tasks.yaml

All the files that are needed to re-run ansible-playbook are present. The exact ansible-playbook command is saved in ansible-playbook-command.sh. Let’s take a look at that file:

$ cat ansible-playbook-command.sh
#!/bin/bash

OS_AUTH_TOKEN="gAAAAABaMD3b3UQziKRzm2jjutrxBbYgqfWSTZWAMXyU5DcTA83Nn28eBVUvr0darSl0LcF3kb-I7OYnMxAp3dBs39ejrINYmsuBmT7ZE3SjYjWqtgivQyYWOHJmgKscl2VuBnWF8Jq-kd3wOHpHQVpJ0ILls35uFPUQvf91ckpr2QsEg67i9Ys"
OS_AUTH_URL="http://192.168.24.1:5000/v3"
OS_PROJECT_NAME="admin"
OS_USERNAME="admin"
ANSIBLE_CONFIG="/var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/ansible.cfg"
HOME="/var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671"

ansible-playbook -v /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/deploy_steps_playbook.yaml --user tripleo-admin --become --ssh-extra-args "-o StrictHostKeyChecking=no" --timeout 240 --inventory-file /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/tripleo-ansible-inventory --private-key /var/lib/mistral/2747b55e-a7b7-4036-82f7-62f09c63d671/ssh_private_key $@

You can see how the call to ansible-playbook is reproduced in this script. Also notice that $@ is used to pass any additional arguments directly to ansible-playbook when calling this script, such as --check, --limit, --tags, --start-at-task, etc.
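
For example (the host and task names here are purely illustrative):

# Dry-run the whole deployment against a single node
./ansible-playbook-command.sh --check --limit overcloud-controller-0

# Resume a failed run from a specific task
./ansible-playbook-command.sh --start-at-task "Run puppet host configuration for step 1"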

Some of the other files present are:

  • tripleo-ansible-inventory
    • Ansible inventory file containing hosts and vars for all the Overcloud nodes (it can also be used for ad-hoc Ansible commands; see the sketch after this list).
  • ansible.log
    • Log file from the last run of ansible-playbook.
  • ansible.cfg
    • Config file used when running ansible-playbook.
  • ansible-playbook-command.sh
    • Executable script that can be used to rerun ansible-playbook.
  • ssh_private_key
    • Private ssh key used to ssh to the Overcloud nodes.
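
As a quick sanity check, the inventory and private key can be used together for an ad-hoc Ansible command. A rough sketch (run from within the working directory so the local ansible.cfg is picked up):

ansible Controller -m ping \
  --inventory-file tripleo-ansible-inventory \
  --private-key ssh_private_key \
  --user tripleo-admin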

Within the group_vars directory, there is a corresponding file per role. In my
example, I have a Controller role. If we take a look at group_vars/Controller we see it contains:

$ cat group_vars/Controller
Controller_pre_deployments:
- HostsEntryDeployment
- DeployedServerBootstrapDeployment
- UpgradeInitDeployment
- InstanceIdDeployment
- NetworkDeployment
- ControllerUpgradeInitDeployment
- UpdateDeployment
- ControllerDeployment
- SshHostPubKeyDeployment
- ControllerSshKnownHostsDeployment
- ControllerHostsDeployment
- ControllerAllNodesDeployment
- ControllerAllNodesValidationDeployment
- ControllerArtifactsDeploy
- ControllerHostPrepDeployment

Controller_post_deployments: []

The <RoleName>_pre_deployments and <RoleName>_post_deployments variables contain the list of Heat deployment names to run for that role. Suppose we wanted to just rerun a single deployment. That command would be:

$ ./ansible-playbook-command.sh --tags pre_deploy_steps -e Controller_pre_deployments=ControllerArtifactsDeploy -e force=true

That would run just the ControllerArtifactsDeploy deployment. Passing -e force=true is necessary to force the deployment to rerun. Also notice we restrict what tags get run with --tags pre_deploy_steps.

For documentation on what tags are available see:
https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html#tags

Finally, suppose we wanted to just run the 5 deployment steps that are the same for all nodes of a given role. We can use --limit <RoleName>, as the role names are defined as groups in the inventory file. That command would be:

$ ./ansible-playbook-command.sh --tags deploy_steps --limit Controller

I hope this info is helpful. Let me know what you want to see next.


P.S.

Cross posted at: https://blogslagle.wordpress.com/2017/12/13/tripleo-and-ansible-part-2/


by slagle at December 13, 2017 01:12 PM

December 11, 2017

James Slagle

TripleO and Ansible deployment (Part 1)

In the Queens release of TripleO, you’ll be able to use Ansible to apply the
software deployment and configuration of an Overcloud.

Before jumping into some of the technical details, I wanted to cover some
background about how the Ansible integration works alongside some of the
existing tools in TripleO.

The Ansible integration goes as far as offering an alternative to the
communication between the existing Heat agent (os-collect-config) and the Heat
API. This alternative is opt-in for Queens, but we are exploring making it the
default behavior for future releases.

The default behavior for Queens (and all prior releases) will still use the
model where each Overcloud node has a daemon agent called os-collect-config
that periodically polls the Heat API for deployment and configuration data.
When Heat provides updated data, the agent applies the deployments, making
changes to the local node such as configuration, service management,
pulling/starting containers, etc.

The Ansible alternative instead uses a “control” node (the Undercloud) running
ansible-playbook with a local inventory file and pushes out all the changes to
each Overcloud node via ssh in the typical Ansible fashion.

Heat is still the primary API, while the parameter and environment files that
get passed to Heat to create an Overcloud stack remain the same regardless of
which method is used.

Heat is also still fully responsible for creating and orchestrating all
OpenStack resources via the services running on the Undercloud (Nova servers,
Neutron networks, etc).

This sequence diagram will hopefully provide a clear picture:
https://slagle.fedorapeople.org/tripleo-ansible-arch.png

Replacing the application and transport layer of the deployment with Ansible
allows us to take advantage of features in Ansible that will hopefully make
deploying and troubleshooting TripleO easier:

  • Running only specific deployments
  • Including/excluding specific nodes or roles from an update
  • More real time sequential output of the deployment
  • More robust error reporting
  • Faster iteration and reproduction of deployments

Using Ansible instead of the Heat agent is easy. Just include 2 extra cli args
in the deployment command:

-e /path/to/templates/environments/config-download-environment.yaml \
--config-download
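
Assuming the templates are in their usual packaged location, the full deployment command ends up looking something like this (any other environment files you normally pass are omitted here):

openstack overcloud deploy \
  --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/config-download-environment.yaml \
  --config-download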

Once Heat is done creating the stack (which will be much faster than usual), a
separate Mistral workflow will be triggered that runs ansible-playbook to
finish the deployment. The output from ansible-playbook will be streamed to
stdout so you can follow along with the progress.

Here’s a demo showing what a stack update looks like:

(I suggest making the demo full screen, or watch it here: https://slagle.fedorapeople.org/tripleo-ansible-deployment-1.mp4)

Note that we don’t get color output from ansible-playbook since we are
consuming the stdout from a Zaqar queue. However, in my next post I will go
into how to execute ansible-playbook manually, and detail all of the related
files (inventory, playbooks, etc) that are available to interact with manually.

If you want to read ahead, have a look at the official documentation:
https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html


P.S.

The infrastructure that hosts this blog may go away soon. In which case I’m
also cross posting to: https://blogslagle.wordpress.com



by slagle at December 11, 2017 03:14 PM

November 30, 2017

Juan Antonio Osorio

TripleO IPSEC setup

For some time in June we looked into adding encryption on all networks in TripleO. At the time, TLS everywhere wasn't available, so we looked into IPSEC as an alternative. Recently, we've been looking into formalizing that work and integrating it better with the way TripleO currently does things.

Why IPSEC?

In the network, TLS goes above the transport (TCP or UDP) layer and is transparent to the application. This means that the application deployer has to explicitly enable TLS in order to take advantage of the benefits it provides. One then has to configure the application to use the 'https' protocol when accessing a certain URL of a server, which also has to match what's configured in the certificate. Besides this, one has to make sure that the certificate is trusted, so configuring the trusted CA (the one that issued the server's certificate) is also part of the application's configuration. One can also configure several other aspects of the communication, such as the cipher to be used, and in case mutual authentication is required, one also has to configure the 'client' certificate and key to be used.

IPSEC offers an alternative that's easier from the application configuration point of view. It sits below the application layer, so, when enabled, the applications have no notion that they're using secure communications. This means one can keep the same old configurations for the applications and still have a secure setup. One still has to manage secrets (either Pre-Shared Keys or PKI setups) and configure encryption setups, but this is now an IPSEC configuration problem, instead of being something one has to do for each and every application in the cloud. This made IPSEC a great candidate to provide out-of-band security for the TripleO setup.

With TLS we have an application that’s securely serving content on a specific endpoint. With IPSEC we have a secure tunnel between two interfaces in the network. Everything passing through this tunnel is encrypted.

Note that if one uses certificates and private keys for IPSEC’s authentication mechanism, one still has to maintain a PKI, similarly to the TLS case. So we still need a CA, and we still need to provision certificates and keys for each node.

TripleO considerations

The IPSEC configuration is tightly integrated with the way the TripleO network is set up. For a regular deployment, one expects there to be network-isolation. This means that we will have several networks on which the nodes are connected. The different networks handle different types of traffic, and not all nodes are connected to all networks.

Using the default configuration, the controllers (which belong to the Controller role) are connected to the following networks:

  • External
  • InternalApi
  • Storage
  • StorageMgmt
  • Tenant
  • Ctlplane

The computes, on the other hand, are connected to these instead:

  • InternalApi
  • Tenant
  • Storage
  • Ctlplane

Each role might have different networks available and these roles are also configurable, so we could create a custom role with a very specific set of networks, or even modify the networks that a certain role uses.

A network can be enabled or disabled entirely. There may or may not be a Virtual IP address (VIP) served on a given network, which also needs to be secured.

Finally, we could also create custom networks and use them in the roles.

This means that we can’t make assumptions about the network setup of a TripleO-deployed overcloud. And thus, everything has to be dynamic.

We also have to take into account that the VIPs are handled by pacemaker, which can at any given moment move a VIP to another node depending on the node's load or other heuristics, so failover handling is needed as well.

IPSEC setup

For our IPSEC configuration, we use libreswan, which supports several types of schemes or configurations.

From these, we mainly use host-to-host and the “server for remote clients” (or roadwarrior) configurations.

host-to-host

For communications between regular IPs in the networks, we use the host-to-host configuration. It's a very straightforward configuration where we tell libreswan to establish an IPSEC tunnel for communication between one IP and another.

For instance, let's say we have two nodes connected via the InternalApi network. The controller-0 and controller-1 nodes have the IPs 172.16.2.13 and 172.16.2.14 respectively.

In this case, we would tell libreswan to explicitly set up a tunnel between the 172.16.2.13 and 172.16.2.14 IPs. And we could do so with the following configuration:

conn my-cool-tunnel
        type=tunnel
        authby=secret
        leftid=172.16.2.13
        left=172.16.2.13
        rightid=172.16.2.14
        right=172.16.2.14
        failureshunt=drop
        ikev2=insist
        auto=start
        keyingtries=1
        retransmit-timeout=2s
        phase2alg=aes_gcm128-null

Here we create a connection with the identifier my-cool-tunnel (it can be arbitrary but must be unique; in the real setup we try to give it a meaningful value to ease debugging). It authenticates using a pre-shared secret (authby=secret), and if libreswan fails to establish a tunnel, it will drop all packets (failureshunt=drop).

Each of the hosts can have this exact configuration applied to them, or have the left and right values interchanged (libreswan will take care of figuring out which is which).

The caveat here is that we need to set up a tunnel for every IP address in the network that the node can communicate with. On top of that, we also need to establish similar tunnels for every other network. Thus, in a fairly big and realistic environment, the number of tunnels can grow quite large.
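
For completeness: the pre-shared key referenced by authby=secret lives in a libreswan secrets file. A rough sketch of what that might look like (the file path and the key itself are just illustrative):

# libreswan typically includes any *.secrets file under /etc/ipsec.d/
sudo tee /etc/ipsec.d/overcloud.secrets <<'EOF'
172.16.2.13 172.16.2.14 : PSK "a very secure pre-shared key"
EOF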

Virtual IPs

For VIPs we would have a very similar problem, since we would need to specify beforehand every node that's going to connect to the VIP, which can be very tedious. Using the roadwarrior setup avoids this issue.

Assuming the same example as above, with a VIP in the network at the 172.16.2.4 address, the configuration on the node that holds the VIP looks as follows:

conn my-cool-roadwarrior-tunnel
    left=172.16.2.4
    leftid=@internal_apivip
    right=%any
    rightid=@overcloudinternal_apicluster
    authby=secret
    auto=add
    dpdaction=hold
    dpddelay=5
    dpdtimeout=15
    phase2alg=aes_gcm128-null
    failureshunt=drop

As we noted in the host-to-host configuration, the connection’s name can be arbitrary but unique. Here we tell libreswan to establish tunnels from the VIP (172.16.2.4) to any host that can authenticate correctly (as noted by the %any value in the configuration key ‘right’). The ‘leftid’ and ‘rightid’ configuration values are used, in this case, to identify the specific PSK to use for this tunnel. We also establish Dead Peer Detection parameters (dpd), which are useful for when there’s a failover for the VIP, and libreswan has to re-establish the tunnel.

Failover handling

Besides the aforementioned Dead Peer Detection that libreswan does, we have a pacemaker resource agent that takes ownership of the VIP's IPSEC tunnel and brings it up or down depending on the location of the VIP itself (which is also managed by pacemaker). We do this by using a pacemaker colocation constraint with the VIP.

Setting it all up (Enter Ansible)

The setup is done by an Ansible role that sets up the appropriate IPSEC tunnels.

Initially the networks, VIPs and even the roles this would apply to were hardcoded. But, thanks to additions to TripleO's dynamic inventory, we can now get all the information we need to have a dynamic setup that takes into account custom roles, custom networks and the VIPs they might contain.

HOWTO

To use it, you need to generate a playbook that looks like the following:

- hosts: overcloud
  become: true
  vars:
    ipsec_psk: "<a very secure pre-shared key>"
  roles:
  - tripleo-ipsec

Once you have that, you can call ansible as follows:

ansible-playbook -i /usr/bin/tripleo-ansible-inventory /path/to/playbook

And that’s it! It will run and set up tunnels for your overcloud.

Note that the overcloud needs to be set up already.
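
To check that the tunnels actually came up, libreswan's own tooling can be used on any of the nodes, for example:

# List the loaded connections and their state
sudo ipsec status

# Show established tunnels with traffic counters
sudo ipsec whack --trafficstatus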

Where is this magical thing?

All the work is currently in github, but we’re looking into making it officially part of TripleO, under the OpenStack umbrella.

UPDATE: It’s now officially part of OpenStack

Future work

TripleO Integration

Currently this is a role that one runs after a TripleO deployment. Hopefully in the near future this will become yet another TripleO service that one can enable using a heat environment file.

Pluggable authentication

Currently the role sets up tunnels using a Pre-Shared Key for authentication. While this is not ideal, it's better than nothing. Upcoming work is to add certificates into the mix, and even use certificates provided by FreeIPA.

November 30, 2017 10:14 AM

September 27, 2017

Steven Hardy

OpenStack Days UK

Yesterday I attended the OpenStack Days UK event, held in London. It was a very good day with a number of interesting talks, and it provided a great opportunity to chat with folks about OpenStack.

I gave a talk, titled "Deploying OpenStack at scale, with TripleO, Ansible and Containers", where I gave an update of the recent rework in the TripleO project to make more use of Ansible and enable containerized deployments.

I'm planning some future blog posts with more detail on this topic, but for now here's a copy of the slide deck I used, also available on github.



by Steve Hardy (noreply@blogger.com) at September 27, 2017 11:17 AM

July 19, 2017

Giulio Fidente

Understanding ceph-ansible in TripleO

One of the goals for the TripleO Pike release was to introduce ceph-ansible as an alternative to puppet-ceph for the deployment of Ceph.

More specifically, to put operators in control of the playbook execution as if they were launching ceph-ansible from the commandline, except it would be Heat starting ceph-ansible at the right time during the overcloud deployment.

This demanded some changes in different tools used by TripleO and went through a pretty long review process, eventually putting in place some useful bits for the future integration of Kubernetes and the migration to an Ansible-driven deployment of the overcloud configuration steps in TripleO.

The idea was to add a generic functionality allowing the triggering of a given Mistral workflow during the deployment of a service. Mistral could then execute any action, including, for example, an ansible playbook, provided it was given all the necessary input data for the playbook to run and the roles list to build the hosts inventory.

This is how we did it.

Run ansible-playbook from Mistral (1)
An initial submission added support for the execution of ansible playbooks as workflow tasks in Mistral https://github.com/openstack/tripleo-common/commit/e6c8a46f00436edfa5de92e97c3a390d90c3ce54

A generic action for Mistral which workflows can use to run an ansible playbook. +2 to Dougal and Ryan.

Deploy external resources from Heat (2)
We also needed a new resource in Heat to be able to drive Mistral workflow executions https://github.com/openstack/heat/commit/725b404468bdd2c1bdbaf16e594515475da7bace so that we could orchestrate the executions like any other Heat resource. This is described much in detail in a Heat spec.

With these two, we could run an ansible playbook from a Heat resource, via Mistral. +2 to Zane and Thomas for the help! Enough to start messing in TripleO and glue things together.

Describe what/when to run in TripleO (3)
We added a mechanism in the TripleO templates to make it possible to describe, from within a service, a list of tasks or workflows to be executed at any given deployment step https://github.com/openstack/tripleo-heat-templates/commit/71f13388161cbab12fe284f7b251ca8d36f7635c

There aren't restrictions on what the tasks or workflows in the new section should do. These might deploy the service or prepare the environment for it or execute code (eg. build Swift rings). The commit message explains how to use it:

service_workflow_tasks:
  step2:
    - name: my_action_name
      action: std.echo
      input:
        output: 'hello world'

The above snippet would make TripleO run the Mistral std.echo action during the overcloud deployment, precisely at step 2, assuming you create a new service with the code above and enable it on a role.

For Ceph we wanted to run the new Mistral action (1) and needed to provide it with the config settings for the service, normally described within the config_settings structure of the service template.

Provide config_settings to the workflows (4)
The decision was to make available all config settings into the Mistral execution environment so that ansible actions could, for example, use them as extra_vars https://github.com/openstack/tripleo-heat-templates/commit/8b81b363fd48b0080b963fd2b1ab6bfe97b0c204

Now all config settings normally consumed by puppet were available to the Mistral action and playbook settings could be added too, +2 Steven.

Build the data for the hosts inventory (5)
Together with the above, another small change provided into the execution environment a dictionary mapping every enabled service to the list of IP address of the nodes where the service is deployed https://github.com/openstack/tripleo-heat-templates/commit/9c1940e461867f2ce986a81fa313d7995592f0c5

This was necessary to be able to build the ansible hosts inventory.

Create a workflow for ceph-ansible (6)
Having all pieces available to trigger the workflow and pass to it the service config settings, we needed the workflow which would run ceph-ansible https://github.com/openstack/tripleo-common/commit/fa0b9f52080580b7408dc6f5f2da6fc1dc07d500 plus some new, generic Mistral actions, to run smoothly multiple times (eg. stack updates) https://github.com/openstack/tripleo-common/commit/f81372d85a0a92de455eeaa93162faf09be670cf

This is the glue which runs a ceph-ansible playbook with the given set of parameters. +2 John.

Deploy Ceph via ceph-ansible (7)
Finally, the new services definition for TripleO https://review.openstack.org/#/c/465066/ to deploy Ceph in containers via ceph-ansible, including a couple of params operators can use to push arbitrary extra_vars for ceph-ansible into the Mistral environment.

The deployment with ceph-ansible is activated with the ceph-ansible.yaml environment file.
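
In practice that means passing the environment file to the deploy command, something along these lines (assuming the packaged templates location):

openstack overcloud deploy \
  --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml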

Interestingly, the templates to deploy Ceph using puppet-ceph are unchanged and continue to work as they used to, so for new deployments it is possible to choose between the new implementation with ceph-ansible and the pre-existing implementation using puppet-ceph. Only ceph-ansible allows for the deployment of Ceph in containers.

Big +2 also to Jiri (who doesn't even need a blog or twitter) and all the people who helped during the development process with feedback, commits and reviews.

Soon another article with some usage examples and debugging instructions!

by Giulio Fidente at July 19, 2017 09:00 AM

July 07, 2017

Julie Pichon

TripleO Deep Dive: Internationalisation in the UI

Yesterday, as part of the TripleO Deep Dives series I gave a short introduction to internationalisation in TripleO UI: the technical aspects of it, as well as a quick overview of how we work with the I18n team.

You can catch the recording on BlueJeans or YouTube, and below's a transcript.

~

Life and Journey of a String

Internationalisation was added to the UI during Ocata - just a release ago. Florian implemented most of it and did the lion's share of the work, as can be seen on the blueprint if you're curious about the nitty-gritty details.

Addition to the codebase

Here's an example patch from during the transition. On the left you can see how things were hard-coded, and on the right you can see the new defineMessages() interface we now use. Obviously, new patches should directly use the style on the right-hand side nowadays.

The defineMessages() dictionary requires a unique id and default English string for every message. Optionally, you can also provide a description if you think there could be confusion or to clarify the meaning. The description will be shown in Zanata to the translators - remember they see no other context, only the string itself.

For example, a string might sound active like if it were related to an action/button but actually be a descriptive help string. Or some expressions are known to be confusing in English - "provide a node" has been the source of multiple discussions on list and live so might as well pre-empt questions and offer additional context to help the translators decide on an appropriate translation.

Extraction & conversion

Now we know how to add an internationalised string to the codebase - how do these get extracted into a file that will be uploaded to Zanata?

All of the following steps are described in the translation documentation in the tripleo-ui repository. Assuming you've already run the installation steps (basically, npm install):

$ npm run build

This does a lot more than just extracting strings - it prepares the code for being deployed in production. Once this ends you'll be able to find your newly extracted messages under the i18n directory:

$ ls i18n/extracted-messages/src/js/components

You can see the directory structure is kept the same as the source code. And if you peek into one of the files, you'll note the content is basically the same as what we had in our defineMessages() dictionary:

$ cat i18n/extracted-messages/src/js/components/Login.json
[
  {
    "id": "UserAuthenticator.authenticating",
    "defaultMessage": "Authenticating..."
  },
  {
    "id": "Login.username",
    "defaultMessage": "Username"
  },
  {
    "id": "Login.usernameRequired",
    "defaultMessage": "Username is required."
  },
[...]

However, JSON is not a format that Zanata understands by default. I think the latest version we upgraded to, or the next one might have some support for it, but since there's no i18n JSON standard it's somewhat limited. In open-source software projects, po/pot files are generally the standard to go with.

$ npm run json2pot

> tripleo-ui@7.1.0 json2pot /home/jpichon/devel/tripleo-ui
> rip json2pot ./i18n/extracted-messages/**/*.json -o ./i18n/messages.pot

> [react-intl-po] write file -> ./i18n/messages.pot ✔️

$ cat i18n/messages.pot
msgid ""
msgstr ""
"POT-Creation-Date: 2017-07-07T09:14:10.098Z\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"MIME-Version: 1.0\n"
"X-Generator: react-intl-po\n"


#: ./i18n/extracted-messages/src/js/components/nodes/RegisterNodesDialog.json
#. [RegisterNodesDialog.noNodesToRegister] - undefined
msgid ""No Nodes To Register""
msgstr ""

#: ./i18n/extracted-messages/src/js/components/nodes/NodesToolbar/NodesToolbar.json
#. [Toolbar.activeFilters] - undefined
#: ./i18n/extracted-messages/src/js/components/validations/ValidationsToolbar.json
#. [Toolbar.activeFilters] - undefined
msgid "Active Filters:"
msgstr ""

#: ./i18n/extracted-messages/src/js/components/nodes/RegisterNodesDialog.json
#. [RegisterNodesDialog.addNew] - Small button, to add a new Node
msgid "Add New"
msgstr ""

#: ./i18n/extracted-messages/src/js/components/plan/PlanFormTabs.json
#. [PlanFormTabs.addPlanName] - Tooltip for "Plan Name" form field
msgid "Add a Plan Name"
msgstr ""
[...]

This messages.pot file is what will be automatically uploaded to Zanata.

Infra: from the git repo, to Zanata

The following steps are done by the infrastructure scripts. There's infra documentation on how to enable translations for your project, in our case as the first internationalised JavaScript project we had to update the scripts a little as well. This is useful to know if an issue happens with the infra jobs; debugging will probably bring you here.

The scripts live in the project-config infra repo and there are three files of interest for us: upstream_translation_update.sh, common_translations_update.sh, and propose_translation_update.sh.

In this case, upstream_translation_update.sh is the file of interest to us: it simply sets up the project on line 76, then sends the pot file up to Zanata on line 115.

What does "setting up the project" entail? It's a function in common_translations_update.sh that pretty much runs the steps we talked about in the previous section, and also creates a config file to talk to Zanata.

Monitoring the post jobs

Post jobs run after a patch has already merged - usually to upload tarballs where they should be, update the documentation pages, etc, and also upload messages catalogues onto Zanata. Being a 'post' job however means that if something goes wrong, there is no notification on the original review so it's easy to miss.

Here's the OpenStack Health page to monitor 'post' jobs related to tripleo-ui. Scroll to the bottom - hopefully tripleo-ui-upstream-translation-update is still green! It's good to keep an eye on it although it's easy to forget. Thankfully, AJaeger from #openstack-infra has been great at filing bugs and letting us know when something does go wrong.

Debugging when things go wrong: an example

We had a couple of issues whereby a linebreak gets introduced into one of the strings, which works fine in JSON but breaks our pot file. If you look at the content from the bug (the full logs are no longer accessible):

2017-03-16 12:55:13.468428 | + zanata-cli -B -e push --copy-trans False
[...]
2017-03-16 12:55:15.391220 | [INFO] Found source documents:
2017-03-16 12:55:15.391405 | [INFO]            i18n/messages
2017-03-16 12:55:15.531164 | [ERROR] Operation failed: missing end-quote

You'll notice the first line is the last function we call in the upstream_translation_update.sh script; for debugging that gives you an idea of the steps to follow to reproduce. The upstream Zanata instance also lets you create toy projects, if you want to test uploads yourself (this can't be done directly on the OpenStack Zanata instance.)

This particular newline issue has popped up a couple of times already. We're treating it with band-aids at the moment, ideally we'd get a proper test on the gate to prevent it from happening again: this is why this bug is still open. I'm not very familiar with JavaScript testing and haven't had a chance to look into it yet; if you'd like to give it a shot that'd be a useful contribution :)

Zanata, and contributing translations

The OpenStack Zanata instance lives at https://translate.openstack.org. This is where the translators do their work. Here's the page for tripleo-ui, you can see there is one project per branch (stable/ocata and master, for now). Sort by "Percent Translated" to see the languages currently translated. Here's an example of the translator's view, for Spanish: you can see the English string on the left, and the translator fills in the right side. No context! Just strings.

At this stage of the release cycle, the focus would be on 'master,' although it is still early to do translations; there is a lot of churn still.

If you'd like to contribute translations, the I18n team has good documentation about how to go about it. The short version: sign up on Zanata, request to join your language team, and once you're approved - you're good to go!

Return of the string

Now that we have our strings available in multiple languages, it's time for another infra job to kick in and bring them into our repository. This is where propose_translation_update.sh comes in. We pull the po files from Zanata, convert them to JSON, then do a git commit that will be proposed to Gerrit.
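
In rough terms the script boils down to something like this (heavily simplified):

# Pull the translated po files from Zanata
zanata-cli -B pull

# Convert them back to the JSON format the UI consumes
npm run po2json

# Commit the result so it can be proposed to Gerrit
git commit -a -m "Imported Translations from Zanata"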

The cleanup step does more than it might seem. It checks if files are translated over a certain ratio (~75% for code), which avoids adding new languages when there might only be one or two words translated (e.g. someone just testing Zanata to see how it works). Switching to your language and yet having the vast majority of the UI still appear in English is not a great user experience.

In theory, files that were added but are now below 40% should get automatically removed, however this doesn't quite work for JavaScript at the moment - another opportunity to help! Manual cleanups can be done in the meantime, but it's a rare event so not a major issue.

Monitoring the periodic jobs

Zanata is checked once a day, every morning; there is an OpenStack Health page for this as well. You can see there are two jobs at the moment (hopefully green!), one per branch: tripleo-ui-propose-translation-update and tripleo-ui-propose-translation-update-ocata. The job should run every day even if there are no updates - it simply means there might not be a git review proposed at the end.

We haven't had issues with the periodic job so far, though the debugging process would be the same: figure out based on the failure if it is happening at the infra script stage or in one of our commands (e.g. npm run po2json), try to reproduce and fix. I'm sure super-helpful AJaeger would also let us know if he were to notice an issue here.

Automated patches

You may have seen the automated translations updates pop up on Gerrit. The commit message has some tips on how to review these: basically don't agonise over the translation contents as problems there should be handled in Zanata anyway, just make sure the format looks good and is unlikely to break the code. A JSON validation tool runs during the infra prep step in order to "prettify" the JSON blob and limit the size of the diffs, therefore once the patch  makes it out to Gerrit we know the JSON is well-formed at least.

Try to review these patches quickly to respect the translators' work. It's not very nice to spend a lot of time translating a project and yet not have your work included because no one could be bothered to merge it :)

A note about new languages...

If the automated patch adds a new language, there'll be an additional step required after merging the translations in order to enable it: adding a string with the language name to a constants file. Until recently, this took 3 or 4 steps - thanks to Honza for making it much simpler!

This concludes the technical journey of a string. If you'd like to help with i18n tasks, we have a few related bugs open. They go from very simple low-hanging-fruits you could use to make your first contribution to the UI, to weird buttons that have translations available yet show in English but only in certain modals, to the kind of CI resiliency tasks I linked to earlier. Something for everyone! ;)

Working with the I18n team

It's really all about communication. Starting with...

Release schedule and string freezes

String freezes are noted on the main schedule but tend to fit the regular cycle-with-milestones work. This is a problem for a cycle-trailing project like tripleo-ui as we could be implementing features up to 2 weeks after the other projects, so we can't freeze strings that early.

There were discussions at the Atlanta PTG around whether the I18n team should care at all about projects that don't respect the freeze deadlines. That would have made it impossible for projects like ours to ever make it onto the I18n official radar. The compromise was that cycle-trailing projects should have an I18n cross-project liaison who communicates with the I18n PTL and team to inform them of deadlines, and also that they should ignore Soft Freeze and only do a Hard Freeze.

This will all be documented under an i18n governance tag; while waiting for it the notes from the sessions are available for the curious!

What's a String Freeze again?

The two are defined on the schedule: soft freeze means not allowing changes to strings, as it invalidates the translator's work and forces them to retranslate; hard freeze means no additions, changes or anything else in order to give translators a chance to catch up.

When we looked at Zanata earlier, there were translation percentages beside each language: the goal is always the satisfaction of reaching 100%. If we keep adding new strings then the goalpost keeps moving, which is discouraging and unfair.

Of course there's also an "exception process" when needed, to ask for permission to merge a string change with an explanation or at least a heads-up, by sending an email to the openstack-i18n mailing list. Not to be abused :)

Role of the I18n liaison

...Liaise?! Haha. The role is defined briefly on the Cross-Projects Liaison wiki page. It's much more important toward the end of the cycle, when the codebase starts to stabilise, there are fewer changes and translators look at starting their work to be included in the release.

In general it's good to hang out on the #openstack-i18n IRC channel (very low traffic), attend the weekly meeting (it alternates times), be available to answer questions, and keep the PTL informed of the I18n status of the project. In the case of cycle-trailing projects (quite a new release model still), it's also important to be around to explain the deadlines.

A couple of examples having an active liaison helps with:

  • Toward the end or after the release, once translations into the stable branch have settled, the stable translations get copied into the master branch on Zanata. The strings should still be fairly similar at that point and it avoids translators having to re-do the work. It's a manual process, so you need to let the I18n PTL know when there are no longer changes to stable/*.
  • Last cycle, because the cycle-trailing status of tripleo-ui was not correctly documented, a Zanata upgrade was planned right after the main release - which for us ended up being right when the codebase had stabilised enough and several translators had planned to be most active. Would have been solved with better, earlier communication :)

Post-release

After the Ocata release, I sent a few screenshots of tripleo-ui to the i18n list so translators could see the result of their work. I don't know if anybody cared :-) But unlike Horizon, which has an informal test system available for translators to check their strings during the RC period, most of the people who volunteered translations had no idea what the UI looked like. It'd be cool if we could offer a test system with regular string updates next release - maybe just an undercloud on the new RDO cloud? Deployment success/failures strings wouldn't be verifiable but the rest would, while the system would be easier to maintain than a full dev TripleO environment - better than nothing. Perhaps an idea for the Queens cycle!

The I18n team has a priority board on the Zanata main page (only visible when logged in I think). I'm grateful to see TripleO UI in there! :) Realistically we'll never move past Low or perhaps Medium priority which is fair, as TripleO doesn't have the same kind of reach or visibility that Horizon or the installation guides do. I'm happy that we're included! The OpenStack I18n team is probably the most volunteer-driven team in OpenStack. Let's be kind, respect string freezes and translators' time! \o/

</braindump>

by jpichon at July 07, 2017 11:45 AM

April 14, 2017

Emilien Macchi

My Journey As An OpenStack PTL

This story explains why I stopped working as an anarchistic, multi-tasking, schedule-driven individual and learnt how to become a good team leader.

How it started

March 2015, Puppet OpenStack project just moved under the Big Tent. What a success for our group!

One of the first step was to elect a Project Team Lead. Our group was pretty small (~10 active contributors) so we thought that the PTL would be just a facilitator for the group, and the liaison with other projects that interact with us.
I mean, easy, right?

At that time, I was clearly an unconsciously incompetent PTL. I thought I knew what I was doing to drive the project to success.

But the situation evolved. I started to deal with things that I didn't expect to deal with, like making sure our team worked together in a way that was efficient and consistent. I also realized nobody knew what
a PTL was really supposed to do (at least in our group), so I took care of more tasks, like release management, organizing Summit design sessions, promoting core reviewers, and welcoming newcomers.
That was the time when I realized I had become a consciously incompetent PTL. I was doing things that nobody had taught me before.

In fact, there is no book telling you how to lead an OpenStack project, so I decided to jump into this black hole, hopefully make mistakes, and learn something from them.


Set your own expectations

I made the mistake of engaging myself in a role where expectations were not clear to the team. The PTL guide is not enough to clarify what your team will expect from you. This is something you have to figure out with the folks you're working with. You would be surprised by the diversity of expectations that project contributors have for their PTL.
Talk with your team and ask them what they want you to be and how they see you as a team lead.
I don’t think there is a single rule that works for all projects, because of the different cultures in OpenStack community.


Embrace changes

… and accept failures.
There is no project in OpenStack that hasn't had outstanding issues (technical and human).
The first step as a PTL is to acknowledge the problem and share it with your team. Most conflicts are self-resolved when everyone agrees that yes, there is a problem. It can be a code design issue or any other technical disagreement, but also a human complaint, like the difficulty of starting to contribute or the lack of reward for very active contributors who aren't core yet.
Once a problem is resolved: discuss with your team about how we can avoid the same situation in the future.
Make a retrospective if needed but talk and document the output.

I continuously encourage welcoming all kinds of changes in TripleO so we can adopt new technologies that will make our project better.

Keep in mind it has a cost. Some people will disagree but that’s fine: you might have to pick a rate of acceptance to consider that your team is ready to make this change.


Delegate

We are humans and have limits. We can’t be everywhere and do everything.
We have to accept that PTLs are not supposed to be online 24/7. They don’t always have the best ideas and don’t always take the right decisions.
This is fine. Your project will survive.

I learnt that when I started to be PTL of TripleO in 2016.
The TripleO team has become so big that I didn’t realize how many interruptions I would have every day.
So I decided to learn how to delegate.
We worked together and created TripleO Squads where each squad focus on a specific area of TripleO.
Each squad would be autonomous enough to propose their own core reviewers or do their own meetings when needed.
I wanted small teams working together, failing fast and making quick iterations so we could scale the project, accept and share the work load and increase the trust inside the TripleO team.

This is where I started to be a Consciously Competent PTL.


Where am I now

I have reached a point where I think that projects wouldn't need a PTL to run fine if they really wanted to.
Instead, I have started to believe in some essential things that would actually help to get rid of this role:

  • As a team, define the vision of the project and document it. It will really help to know where we want to
    go and clear all expectations about the project.
  • Establish trust to each individual by default and welcome newcomers.
  • Encourage collective and distributed leadership.
  • Try, do, fail, learn, teach, and start again. Don't stagnate.

This long journey helped me to learn many things in both technical and human areas. It has been awesome to work with such groups so far.
I would like to spend more time on technical work (aka coding) but also in teaching and mentoring new contributors in OpenStack.
Therefore, I won’t be PTL during the next cycle and my hope is to see new leaders in TripleO, who would come up with fresh ideas and help us to keep TripleO rocking.


Thanks for reading so far, and also thanks for your trust.

by Emilien at April 14, 2017 08:56 PM

March 02, 2017

Julie Pichon

OpenStack Pike PTG: TripleO, TripleO UI | Some highlights

For the second part of the PTG (vertical projects), I mainly stayed in the TripleO room, moving around a couple of times to attend cross-project sessions related to i18n.

Although I always wish I understood more/everything, in the end my areas of interest (and current understanding!) in TripleO are around the UI, installing and configuring it, the TripleO CLI, and the tripleo-common Mistral workflows. Therefore the couple of thoughts in this post are mainly relevant to these - if you're looking for a more exhaustive summary of the TripleO discussions and decisions made during the PTG, I recommend reading the PTL's excellent thread about this on the dev list, and the associated etherpads.

Random points of interest

  • Containers is the big topic and had multiple sessions dedicated to it, both single and cross-projects. Many other sessions ended up revisiting the subject as well, sometimes with "oh that'll be solved with containers" and sometimes with "hm good but that won't work with containers."
  • A couple of API-breaking changes may need to happen in Tripleo Heat Templates (e.g. for NFV, passing a role mapping vs a role name around). The recommendation is to get this in as early as possible (by the first milestone) and communicate it well for out of tree services.
  • When needing to test something new on the CI, look at the existing scenarios and prioritise adding/changing something there to test for what you need, as opposed to trying to create a brand new job.
  • Running Mistral workflows as part of or after the deployment came up several times and was even a topic during a cross-project Heat / Mistral / TripleO sessions. Things can get messy, switching between Heat, Mistral and Puppet. Where should these workflows live (THT, tripleo-common)? Service-specific workflows (pre/post-deploy) are definitely something people want and there's a need to standardise how to do that. Ceph's likely to be the first to try their hand at this.
  • One lively cross-project session with OpenStack Ansible and Kolla was about parameters in configuration files. Currently whenever a new feature is added to Nova or whatever service, Puppet and so on need to be updated manually. The proposal is to make a small change to oslo.config to enable it to give an output in machine-readable YAML which can then be consumed (currently the config generated is only human readable). This will help with validations, and it may help to only have to maintain a structure as opposed to a template.
  • Heat folks had a feedback session with us about the TripleO needs. They've been super helpful with e.g. helping to improve our memory usage over the last couple of cycles. My takeaway from this session was "beware/avoid using YAQL, especially in nested stacks." YAQL is badly documented and everyone ends up reading the source code and tests to figure out how to do things. Bringing Jinja2 into Heat or some kind of way to have repeated patterns from resources (e.g. based on a file) also came up and was cautiously acknowledged.
  • Predictable IP assignment on the control plane is a big enough issue that some people are suggesting dropping Neutron in the undercloud over it. We'd lose so many other benefits though, that it seems unlikely to happen.
  • Cool work incoming allowing built-in network examples to Just Work, based on a sample configuration. Getting the networking stuff right is a huge pain point and I'm excited to hear this should be achievable within Pike.

Python 3

Python 3 is an OpenStack community goal for Pike.

Tripleo-common and python-tripleoclient both have voting unit tests jobs for Python 3.5, though I trust them only moderately for a number of reasons. For example many of the tests tend to focus on the happy path and I've seen and fixed Python 3 incompatible code in exceptions several times (no 'message' attribute seems easy to get caught into), despite the unit testing jobs being all green. Apparently there are coverage jobs we could enable for the client, to ensure the coverage ratio doesn't drop.

Python 3 for functional tests was also brought up. We don't have functional tests in any of our projects and it's not clear the value we would get out of it (mocking servers) compared to the unit testing and all the integration testing we already do. Increasing unit test coverage was deemed a more valuable goal to pursue for now.

There are other issues around functional/integration testing with Python 3 which will need to be resolved (though likely not during Pike). For example our integration jobs run on CentOS and use packages, which won't be Python 3 compatible yet (cue SCL and the need to respin dependencies). If we do add functional tests, perhaps it would be easier to have them run on a Fedora gate (although if I recall correctly gating on Fedora was investigated once upon a time at the OpenStack level, but caused too many issues due to churn and the lack of long-term releases).

Another issue with Python 3 support and functional testing is that the TripleO client depends on Mistral server (due to the Series Of Unfortunate Dependencies I also mentioned in the last post). That means Mistral itself would need to fully support Python 3 as well.

Python 2 isn't going anywhere just yet so we still have time to figure things out. The conclusions, as described in Emilien's email seem to be:

  • Improve the unit test coverage
  • Enable the coverage job in CI
  • Investigate functional testing for python-tripleoclient to start with, see if it makes sense and is feasible

Sample environment generator

Currently environment files in THT are written by hand and quite inconsistent. This is also important for the UI, which needs to display this information. For example currently the environment general description is in a comment at the top of the file (if it exists at all), which can't be accessed programmatically. Dependencies between environment files are not described either.

To make up for this, currently all that information lives in the capabilities map but it's external to the template themselves, needs to be updated manually and gets out of sync easily.

The sample environment generator to fix this has been out there for a year, and currently has two blockers. First, it needs a way to determine which parameters are private (that is, parameters that are expected to be passed in by another template and shouldn't be set by the user).

One way could be to use a naming convention, perhaps an underscore prefix similar to Python. Parameter groups cannot be used because of a historical limitation, there can only be one group (so you couldn't be both Private and Deprecated). Changing Heat with a new feature like Nested Groups or generic Parameter Tags could be an option. The advantage of the naming convention is that it doesn't require any change to Heat.

From the UI perspective, validating if an environment or template is redefining parameters already defined elsewhere also matters. Because if it has the same name, then it needs to be set up with the same value everywhere or it's uncertain what the final value will end up as.

I think the second issue was that the general environment description can only be a comment at the moment, there is no Heat parameter to do this. The Heat experts in the room seemed confident this is non-controversial as a feature and should be easy to get in.

Once the existing templates are updated to match the new format, the validation should be added to CI to make sure that any new patch with environments does include these parameters. Having "description" show up as an empty string when generating a new environment will make it more obvious that something can/should be there, while it's easy to forget about it with the current situation.

The agreement was:

  • Use underscores as a naming convention to start with
  • Start with a comment for the general description

Once we get the new Heat description attribute we can move things around. If parameter tags get accepted, likewise we can automate moving things around. Tags would also be useful to the UI, to determine what subset of relevant parameters to display to the user in smaller forms (easier to understand that one form with several dozens of fields showing up all at once). Tags, rather than parameter groups are required because of the aforementioned issue: it's already used for deprecation and a parameter can only have one group.

Trusts and federation

This was a cross-project session together with Heat, Keystone and Mistral. A "Trust" lets you delegate or impersonate a user with a subset of their rights. From my experience in TripleO, this is particularly useful with long-running Heat stacks, as an authentication token expires after a few hours, which means you lose the ability to do anything in the middle of an operation.

Trusts have been working very well for Heat since 2013. Before that they had to encrypt the user password in order to ask for a new token when needed, which all agreed was pretty horrible and not anything people want to go back to. Unfortunately with the federation model and using external Identity Providers, this is no longer possible. Trusts break, but some kind of delegation is still needed for Heat.

There were a lot of tangents about security issues (obviously!), revocation, expiration, role syncing. From what I understand Keystone currently validates Trusts to make sure the user still has the requested permissions (that the relevant role hasn't been removed in the meantime). There's a desire to have access to the entire role list, because the APIs currently don't let us check which role is necessary to perform a specific action. Additionally, when Mistral workflows launched from Heat get in, Mistral will create its own Trusts and Heat can't know what that will do. In the end you always kinda end up needing to do everything. Karbor is running into this as well.

No solution was discovered during the session, but I think all sides were happy enough that the problem and use cases have been clearly laid out and are now understood.

TripleO UI

Some of the issues relevant to the UI were brought up in other sessions, like standardising the environment files. Other issues brought up were around plan management, for example why do we use the Mistral environment in addition to Swift? Historically it looks like it's because it was a nice drop-in replacement for the defunct TripleO API and offered a similar API. Although it won't have an API by default, the suggestion is to move to using a file to store the environment during Pike and have a consistent set of templates: this way all the information related to a deployment plan will live in the same place. This will help with exporting/importing plans, which itself will help with CLI/UI interoperability (for instance there are still some differences in how and when the Mistral environment is generated, depending on whether you deployed with the CLI or the UI).

A number of other issues were brought up around networking, custom roles, tracking deployment progress, and a great many other topics but I think the larger problems around plan management was the only expected to turn into a spec, now proposed for review.

I18n and release models

After the UI session I left the TripleO room to attend a cross-project session about i18n, horizon and release models. The release model point is particularly relevant because the TripleO UI is a newly internationalised project as of Ocata and the first to be cycle-trailing (TripleO releases a couple of weeks after the main OpenStack release).

I'm glad I was able to attend this session. For one, it was really nice to collaborate directly with the i18n and release management teams, and catch up with a couple of Horizon people. For another, it turns out tripleo-ui was not properly documented as cycle-trailing (fixed now!) and that led to other issues.

Having different release models is a source of headaches for the i18n community, already stretched thin. It means string freezes happen at different times, stable branches are cut at different points, which creates a lot of tracking work for the i18n PTL to figure which project is ready and do the required manual work to update Zanata upon branching. One part of the solution is likely to figure out if we can script the manual parts of this workflow so that when the release patch (which creates the stable branch) is merged, the change is automatically reflected in Zanata.

For the non-technical aspects of the work (mainly keeping track of deadlines and string freezes) the decision was that if you want to be translated, then you need to respect the same deadlines as the cycle-with-milestones projects do on the main schedule, and if a project doesn't want to - if it doesn't respect the freezes or cut branches when expected, then it will be dropped from the i18n priority dashboard in Zanata. This was particularly relevant for Horizon plugins, as there are about a dozen of them now with various degrees of diligence when it comes to doing releases.

These expectations will be documented in a new governance tag, something like i18n:translated.

Obviously this would mean that cycle-trailing projects would likely never be able to get the tag. The work we depend on lands late and so we continue making changes up to two weeks after each of the documented deadlines. ianychoi, the i18n PTL, seemed happy to take these projects under the i18n wing and do the manual work required, as long as there is an active i18n liaison for the project communicating with the i18n team to keep them informed about freezes and new branches. This seemed to work ok for us during Ocata so I'm hopeful we can keep that model. It's not entirely clear to me if this will also be documented/included in the governance tag so I look forward to reading the spec once it is proposed! :)

In the case of tripleo-ui we're not a priority project for translations, nor looking to be, but we still rely on the i18n PTL to create Zanata branches and merge translations for us, and would like to continue being able to translate stable branches as needed.

CI Q&A

The CI Q&A session on Friday morning was amazingly useful, and it was unanimously agreed that the answers should be moved into the documentation (not done yet). If you've ever scratched your head about something related to TripleO CI, have a look at the etherpad!

by jpichon at March 02, 2017 09:55 AM

January 31, 2017

Dougal Matthews

Interactive Mistral Workflows over Zaqar

It is possible to do some really nice automation with the Mistral Workflow engine. However, sometimes user input is required or desirable. I set out to write an interactive Mistral Workflow, one that could communicate with a user over Zaqar.

If you are not familiar with Mistral Workflows you may want to start here, here or here.

The Workflow

Okay, this is what I came up with.

---
version: '2.0'

interactive-workflow:

  input:
    - input_queue: "workflow-input"
    - output_queue: "workflow-output"

  tasks:

    request_user_input:
      action: zaqar.queue_post
      retry: count=5 delay=1
      input:
        queue_name: <% $.output_queue %>
        messages:
          body: "Send some input to '<% $.input_queue %>'"
      on-success: read_user_input

    read_user_input:
      pause-before: true
      action: zaqar.queue_pop
      input:
        queue_name: <% $.input_queue %>
      publish:
        user_input: <% task(read_user_input).result[0].body %>
      on-success: done

    done:
      action: zaqar.queue_post
      retry: count=5 delay=1
      input:
        queue_name: <% $.output_queue %>
        messages:
          body: "You sent: '<% $.user_input %>'"

Breaking it down...

  1. The Workflow uses two queues, one for input and one for output - it would be possible to use the same queue for both, but this seemed simpler.

  2. On the first task, request_user_input, the Workflow sends a Zaqar message to the user requesting a message be sent to the input_queue.

  3. The read_user_input task pauses before it starts (see the pause-before: true). This means we can unpause the Workflow after we send a message. It would be possible to create a loop here that polls for messages; see below for more on this.

  4. After the input is provided, the Workflow must be un-paused manually. It then reads from the queue and sends the message back to the user via the output Zaqar queue.

See it in Action

We can demonstrate the Workflow with just the Mistral client. You first need to save it to a file and use the mistral workflow-create command to upload it.

First we trigger the Workflow execution.

$ mistral execution-create interactive-workflow
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c |
| Workflow ID       | bdd1253e-68f8-4cf3-9af0-0957e4a31631 |
| Workflow name     | interactive-workflow                 |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-01-31 08:22:17                  |
| Updated at        | 2017-01-31 08:22:17                  |
+-------------------+--------------------------------------+

The Workflow will complete the first task and then move to the PAUSED state before read_user_input. This can be seen with the mistral execution-list command.

In this Workflow we know there will now be a message in Zaqar. The Mistral action zaqar.queue_pop can be used to receive it...

$ mistral run-action zaqar.queue_pop '{"queue_name": "workflow-output"}'
{"result": [{"body": "Send some input to 'workflow-input'", "age": 4, "queue": {"_metadata": null, "client": null, "_name": "workflow-output"}, "href": null, "ttl": 3600, "_id": "589049397dcad341ecfb72cf"}]}

The JSON is a bit hard to read, but you can see the message body Send some input to 'workflow-input'.

Great. We can do that with another Mistral action...

$ mistral run-action zaqar.queue_post '{"queue_name": "workflow-input", "messages":{"body": {"testing": 123}}}'
{"result": {"resources": ["/v2/queues/workflow-input/messages/589049447dcad341ecfb72d0"]}}

After sending the message to the Workflow, as requested, we can unpause it. This can be done like this...

$ mistral execution-update -s RUNNING e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c |
| Workflow ID       | bdd1253e-68f8-4cf3-9af0-0957e4a31631 |
| Workflow name     | interactive-workflow                 |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-01-31 08:22:17                  |
| Updated at        | 2017-01-31 08:22:38                  |
+-------------------+--------------------------------------+

Finally we can confirm it worked by getting a message back from the Workflow...

$ mistral run-action zaqar.queue_pop '{"queue_name": "workflow-output"}'
{"result": [{"body": "You sent: '{u'testing': 123}'", "age": 6, "queue": {"_metadata": null, "client": null, "_name": "workflow-output"}, "href": null, "ttl": 3600, "_id": "5890494f7dcad341ecfb72d1"}]}

You can see a new message is returned which shows the input we sent.

Caveats

As mentioned above, the main limitation here is that you need to manually unpause the Workflow. It would be nice if there was a way for the Zaqar message to automatically do this.

Polling for messages in the Workflow would be quite easy, with a retry loop and Mistral's continue-on. However, that would be quite resource intensive. If you wanted to do this, a Workflow task like this would probably do the trick.

  wait_for_message:
    action: zaqar.queue_pop
    input:
      queue_name: <% $.input_queue %>
    timeout: 14400
    retry:
      delay: 15
      count: <% $.timeout / 15 %> # assumes a 'timeout' value is available in the workflow context, e.g. as an input
      continue-on: <% len(task(wait_for_message).result) > 0 %>

The other limitation is that this Workflow now requires a specific interaction pattern that isn't obvious and documenting it might be a little tricky. However, I think the flexible execution it provides might be worthwhile in some cases.

by Dougal Matthews at January 31, 2017 07:40 AM

January 25, 2017

Dan Prince

Docker Puppet

Today TripleO leverages Puppet to help configure and manage the deployment of OpenStack services. As we move towards using Docker, one of the big questions people have is how we will generate config files for those containers. We'd like to continue to make use of our mature configuration interfaces (Heat parameters, Hieradata overrides, Puppet modules) to allow our operators to seamlessly take the step towards a fully containerized deployment.

With the recently added composable services we've got everything we need. This is how we do it...

Install puppet into our base container image

Turns out the first thing you need if you want to generate config files with Puppet is, well... puppet. TripleO uses containers from the Kolla project and by default they do not install Puppet. In the past TripleO used an 'agent container' to manage the puppet installation requirements. This worked okay for the compute role (a very minimal set of services) but doesn't work as nicely for the broader set of OpenStack services, because packages need to be pre-installed into the 'agent' container in order for config file generation to work correctly (puppet overlays the default config files in many cases). Installing packages for all of OpenStack and its requirements into the agent container isn't ideal.

Enter TripleO composable services (thanks Newton!). TripleO now supports composability and Kolla typically has individual containers for each service, so it turns out the best way to generate config files for a specific service is to use the container for the service itself. We do this in two separate runs of a container: one to create the config files, and a second to launch the service (bind-mounting/copying in the configs). It works really well.
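
As a rough illustration, the second run (launching the service) can be described with something along these lines; this is only a sketch in the docker-cmd style of container definition, not the exact tripleo-heat-templates schema, and the image name and paths are examples. The config-generation run is covered below.

# Sketch only: start the service container from the same Kolla image,
# with the config tree produced by the config-generation run bind-mounted in.
docker_config:
  step_4:
    keystone:
      image: tripleoupstream/centos-binary-keystone:latest
      net: host
      restart: always
      volumes:
        - /var/lib/config-data/keystone/etc/keystone:/etc/keystone:ro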

But we still have the issue of how to get puppet into all of our Kolla containers. We were happy to discover that Kolla supports a template-override mechanism (a Jinja template) that allows you to customize how containers are built. This is how you can use that mechanism to add puppet to the CentOS base image used for all the OpenStack docker containers generated by the Kolla build scripts.

$ cat template-overrides.j2
{% extends parent_template %}
{% set base_centos_binary_packages_append = ['puppet'] %}

$ kolla-build --base centos --template-override template-overrides.j2

Control the Puppet catalog

A puppet manifest in TripleO can do a lot of things like installing packages, configuring files, starting a service, etc. For containers we only want to generate the config files. Furthermore we'd like to do this without having to change our puppet modules.

One mechanism we use is the --tags option for 'puppet apply'. This option allows you to specify which resources within a given puppet manifest (or catalog) should be executed. It works really nicely to allow you to select what you want out of a puppet catalog.

An example of this is listed below where we have a manifest to create a '/tmp/foo' file. When we run the manifest with the 'package' tag (telling it to only install packages) it does nothing at all.

$ cat test.pp 
file { '/tmp/foo':
  content => 'bar',
}
$ puppet apply --tags package test.pp
Notice: Compiled catalog for undercloud.localhost in environment production in 0.10 seconds
Notice: Applied catalog in 0.02 seconds
$ cat /tmp/foo
cat: /tmp/foo: No such file or directory

When --tags doesn't work

The --tags option of 'puppet apply' doesn't always give us the behavior we are after, which is to generate only config files. Some puppet modules have custom resources with providers that can execute commands anyway. This might be a mysql query or an openstackclient command to create a keystone endpoint. Remember here that we are trying to re-use puppet modules from our baremetal configuration and these resources are expected to be in our manifests... we just don't want them to run at the time we are generating config files. So we need an alternative mechanism to suppress (noop out) these offending resources.

To do this we've started using a custom built noop_resource function that exists in puppet-tripleo. This function dynamically configures a default provider for the named resource. For mysql this ends up looking like this:

['Mysql_datadir', 'Mysql_user', 'Mysql_database', 'Mysql_grant', 'Mysql_plugin'].each |String $val| { noop_resource($val) }

Running a puppet manifest with this at the top will noop out any of the named resource types and they won't execute. Thus allowing puppet apply to complete and finish generating the config files within the specified manifest.

The good news is most of our services don't require the noop_resource in order to generate config files cleanly. But for those that do the interface allows us to effectively disable the resources we don't want to execute.

Putting it all together: docker-puppet.py

We brought everything together in tripleo-heat-templates to create one container configuration interface that allows us to configurably generate per-service config files. It looks like this:

  • manifest: the puppet manifest to use to generate config files (Thanks to composable services this is now per service!)
  • puppet_tags: the puppet tags to execute within this manifest
  • config_image: the docker image to use to generate config files. Generally we use the same image as the service itself.
  • config_volume: where to output the resulting config tree (includes /etc/ and some other directories).

We then created a custom tool called docker-puppet.py to drive this per-service configuration. The tool takes the information above in a JSON file and uses it to drive the generation of the config files in a single action.
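
For illustration, one entry could carry something like the following information (shown as YAML for readability; the field names follow the list above, the exact schema used by docker-puppet.py may differ, and the image name is only an example):

# Hypothetical docker-puppet.py entry; field names follow the list above.
keystone:
  manifest: include ::tripleo::profile::base::keystone
  puppet_tags: keystone_config
  config_image: tripleoupstream/centos-binary-keystone:latest
  config_volume: keystone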

It ends up working like this:

Video demo: Docker Puppet

And that's it. Our config interfaces are intact. We generate the config files we need. And we get to carry on with our efforts to deploy with containers.

Links:

by Dan Prince at January 25, 2017 02:00 PM

January 12, 2017

Dougal Matthews

Calling Ansible from Mistral Workflows

I have spoken with a few people that were interested in calling Ansible from Mistral Workflows.

I finally got around to trying to make this happen. All that was needed was a very small and simple custom action that I put together, uploaded to github and also published to pypi.

Here is an example of a simple Mistral Workflow that makes use of these new actions.

---
version: 2.0

run_ansible_playbook:
  type: direct
  tasks:
    run_playbook:
      action: ansible-playbook
      input:
        playbook: path/to/playbook.yaml

Installing and getting started with this action is fairly simple. This is how I did it in my TripleO undercloud.

sudo pip install mistral-ansible-actions;
sudo mistral-db-manage populate;
sudo systemctl restart openstack-mistral*;

There is one gotcha that might be confusing. The Mistral Workflow runs as the mistral user, which means that user needs permission to access the Ansible playbook files.

After you have installed the custom actions, you can test them with the Mistral CLI. The first command below should work without any extra setup; the second requires you to create a playbook somewhere and provide access to it.

mistral run-action ansible '{"hosts": "localhost", "module": "setup"}'
mistral run-action ansible-playbook '{"playbook": "path/to/playbook.yaml"}'
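
The same module call can of course be wrapped in a Workflow task. A minimal sketch, assuming the ansible action accepts the same hosts and module inputs as the CLI call above:

---
version: 2.0

run_ansible_module:
  type: direct
  tasks:
    gather_facts:
      action: ansible
      input:
        hosts: localhost
        module: setup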

The action supports a few other input parameters; they are all listed for now in the README in the git repo. This is a very young project, but I am curious to know if people find it useful and what other features it would need.

If you want to write custom actions, check out the Mistral documentation.

by Dougal Matthews at January 12, 2017 02:20 PM

September 08, 2016

Emilien Macchi

Scaling-up TripleO CI coverage with scenarios

TripleO CI up to eleven!

When the OpenStack project started, it was “just” a set of services with the goal of spawning a VM. I remember when you could run everything on your laptop and test things really quickly.
The project has now grown: thousands of features have been implemented, more backends / drivers are supported and new projects have joined the party.
This makes testing very challenging, because not everything can be tested in a CI environment.

TripleO aims to be an OpenStack installer that takes care of service deployment. Our CI was only testing a set of services and a few plugins/drivers.
We had to find a way to test more services, more plugins and more drivers, in an efficient way and without wasting CI resources.

So we thought that we could create some scenarios with a limited set of services, configured with a specific backend / plugin, and one CI job would deploy and test one scenario.
Example: scenario001 would be the Telemetry scenario, testing required services like Keystone, Nova, Glance, Neutron, but also Aodh, Ceilometer and Gnocchi.
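
As a sketch, the services deployed by such a Telemetry scenario could be expressed as a list of composable services in the scenario's environment file, along these lines (the list is illustrative of the idea, not the exact scenario001 contents):

# Illustrative only: a scenario limits the deployment to the services it tests.
parameter_defaults:
  ControllerServices:
    - OS::TripleO::Services::Keystone
    - OS::TripleO::Services::GlanceApi
    - OS::TripleO::Services::NovaApi
    - OS::TripleO::Services::NeutronApi
    - OS::TripleO::Services::AodhApi
    - OS::TripleO::Services::CeilometerApi
    - OS::TripleO::Services::GnocchiApi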

Puppet OpenStack CI has been using this model for a while and it works pretty well. We're going to reproduce it in TripleO CI for consistency.

How are scenarios run when patching TripleO?

We are using a feature in Zuul that allows us to select which scenario we want to test, depending on the files touched by a commit.
For example, if I submit a patch to TripleO Heat Templates that modifies “puppet/services/ceilometer-api.yaml”, which is the composable service for Ceilometer-API, Zuul will trigger scenario001. See the Zuul layout:

- name: ^gate-tripleo-ci-centos-7-scenario001-multinode.*$
  files:
    - ^puppet/services/aodh.*$
    - ^manifests/profile/base/aodh.*$
    - ^puppet/services/ceilometer.*$
    - ^manifests/profile/base/ceilometer.*$
    - ^puppet/services/gnocchi.*$
    - ^manifests/profile/base/gnocchi.*$

How can I bring my own service into a scenario?

The first step is to look at the Puppet CI matrix and see if we already test the service in a scenario. If yes, please keep the scenario number consistent with the TripleO CI matrix. If not, you'll need to pick a scenario, usually the least loaded one, to avoid performance issues.
Now you need to patch openstack-infra/project-config and specify the files that deploy your service.
For example, if your service is “Zaqar”, you’ll add something like:

- name: ^gate-tripleo-ci-centos-7-scenario002-multinode.*$
  files:
    ...
    - ^puppet/services/zaqar.*$
    - ^manifests/profile/base/zaqar.*$

Everytime you’ll send a patch to TripleO Heat Templates in puppet/services/zaqar* files or in puppet-tripleo manifests/profile/base/zaqar*, scenario002 will be triggered.

Finally, you need to send a patch to openstack-infra/tripleo-ci:

  • Modify README.md to add the new service in the matrix.
  • Modify templates/scenario00X-multinode-pingtest.yaml and add a resource to test the service (for Zaqar, it could be a Zaqar queue); a rough sketch follows this list.
  • Modify test-environments/scenario00X-multinode.yaml and add the TripleO composable services and parameters needed to deploy the service.
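
For the pingtest part, the addition could look roughly like this (a sketch only; the resource name and properties are illustrative), while the environment change amounts to adding the Zaqar composable services to the scenario's service list, in the same style as the example shown earlier:

# Illustrative pingtest resource for templates/scenario002-multinode-pingtest.yaml.
resources:
  zaqar_queue:
    type: OS::Zaqar::Queue
    properties:
      name: pingtest-queue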

Once you send the tripleo-ci patch, you can block it with a -1 workflow vote to avoid an accidental merge. Now go to openstack/tripleo-heat-templates and modify the Zaqar composable service, by adding a comment or something you actually want to test. In the commit message, add “Depends-On: XXX” where XXX is the Change-Id of the tripleo-ci patch. When you send the patch, you'll see that Zuul triggers the appropriate scenario and your service gets tested.

What’s next?

  • Allow testing to extend beyond the pingtest. Some services, for example Ironic, can't be tested with the pingtest. Maybe running Tempest for a set of services would be something to investigate.
  • Zuul v3 is the big thing we're all waiting for to extend the granularity of our matrix. A limitation of the current Zuul version (2.5) is that we can't run scenarios in the Puppet OpenStack modules CI, because there is no way to combine the file rules we saw before AND run the jobs for a specific project without file restrictions (e.g. puppet-zaqar for scenario002). In other words, our CI will be better with Zuul v3 and we'll improve our testing coverage by running the right scenarios on the right projects.
  • Extend the number of nodes. We currently use multinode jobs which deploy an undercloud and a subnode for the overcloud (all-in-one). Some use cases might require a third node (for example with Ironic).

Any feedback on this blog post is highly welcome; please let me know if you want me to cover anything in more detail.

by Emilien at September 08, 2016 10:52 PM


Last updated: May 23, 2018 08:06 PM
