Planet TripleO

Subscriptions

May 05, 2019

Marios Andreou

My summary of the OpenStack Stein Infrastructure Summit and Train PTG aka Denver III


My summary of the OpenStack Stein Infrastructure Summit and Train PTG aka Denver III

This was the first re-combined event, with both the summit and the project teams gathering happening in the same week, and the third consecutive year that OpenStack has descended on Denver. This was also the first Open Infrastructure Summit - the foundation is expanding to allow other, non-OpenStack projects to be housed under the Open Infrastructure foundation.

This is a brief summary with pointers of the sessions or rooms I attended in the order they happened. The full summit schedule is here and the PTG schedule is here.

There is a list of some of the etherpads used in various summit sessions in this wiki page thanks to T. Carrez who let me take a photo of his screen for the URL :).


Photos


Summit Day One

My general impression was of slightly reduced attendance - though I should note the last summit I attended was Austin (unless I’m mistaken), having attended PTGs but not summits since. There were about ~2000 summit attendees according to one of the keynote speakers. Having said that, J. Bryce gave some interesting numbers in his keynote, highlighting that Stein is the 19th on-time release for OpenStack, that OpenStack is still the 3rd largest open source project in the world with 105,000 members across 180 countries, and that 65,000 changes were merged in the last year.

It was interesting to hear from Deutsche Telekom - especially that they are using and contributing to Zuul upstream and that they rely on CI for their ever-growing deployments. One of the numbers given is that they are adding capacity at 400 servers per week.

Some other interesting points from the keynotes were:

  • the increasing use of Ironic as a standalone service outside of OpenStack deployments, for managing baremetal infrastructure (further highlighting the OpenInfra vs OpenStack-only theme),
  • the increasing adoption of Zuul for CI, and the fact that it is becoming a foundation project,
  • Ericsson brought a 5G network to the summit - apparently the first 5G network (?) in the United States - which was available at their booth and which uses OpenStack for its infrastructure. There was also a demonstration, involving VR headsets, of the latency differences between 3/4/5G networks.

Besides the keynotes I attended the OpenStack Ansible project update - there was a shout out for the TripleO team by Mohammed Naser, who highlighted the excellent cross-team collaboration story between the TripleO tempest team and the Ansible project. Finally I attended a talk called “multicloud ci/cd with openstack and kubernetes” where the presenter set up a simple ‘hello world’ application across a number of different geographic locations and showed how CI/CD meant he could make a simple change to the app and have it tested and then deployed across the different clouds running that application.


Summit Day Two

I attended the Zuul project BOF (‘birds of a feather’) where it was interesting to hear about various folks that are running Zuul internally - some on older versions and wanting to upgrade.

I also caught the “Deployment Tools: defined common capabilities” session, where folks who work on or are knowledgeable about the various OpenStack deployment tools, including TripleO, got together and used this etherpad to try and compile a list of ‘tags’ which the various tools can claim to implement. Examples include containerized (i.e. support for containerized deployments), version support, day 2 operations etc. The first step will be to further distill and then socialize these ‘capabilities’ via the openstack-discuss mailing list.

The Airship project update was the next session I went to and it was quite well attended. In general it was interesting to hear about the similarities in the concepts and approach taken in Airship compared to TripleO - especially the concept of an ‘undercloud’ and the fact that deployment is driven by yaml files which define the deployment and service configuration values. In Airship these yaml files are known as charts. The equivalent in TripleO is the tripleo heat templates repo, which holds the deployment and service configuration for TripleO deployments.

Finally there was an interesting session on running Zuul on top of Kubernetes using Helm charts. The presenters said the charts used in their deployment would be made available upstream “soon”. This then spawned a side conversation with weshay and sshnaidm about using Kubernetes for the TripleO CI squad’s zuul-based reproducer. Prompted by weshay, we held a micro-hackfest exploring the use of k3s - “5 less than k8s”. Taking the docker-compose file we tried to convert it using the kompose tool. We got far enough running the k3s service but stumbled on the lack of support for dependencies in kompose. We could investigate writing some Helm charts to do this, but it is still TBD whether k3s is a direction we will adopt for the reproducer this cycle or whether we will keep podman, which replaced docker (sshnaidm++ was working on this).
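
If you want to poke at something similar yourself, the rough shape of that experiment was along the following lines (a sketch only - file and directory names here are illustrative):

# Install k3s (single-binary Kubernetes) and check the node is up
curl -sfL https://get.k3s.io | sh -
sudo k3s kubectl get nodes

# Convert the reproducer docker-compose file into Kubernetes manifests
kompose convert -f docker-compose.yml -o k3s-manifests/

# Apply the generated manifests; kompose does not translate
# depends_on ordering, which is roughly where we got stuck
sudo k3s kubectl apply -f k3s-manifests/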


Summit Day Three

On Wednesday the first session I attended was a comparison of TripleO, Kolla and Airship as deployment tools. The common requirement was support for container based deployments. You can see the event details here - apparently there should be a recording, though this isn’t available at the time of writing. Again it was interesting to hear about the similarities between the Airship and TripleO approaches to config management, including the ‘undercloud’ management node.

I then went to the very well attended and well led (by slagle and emilienm) TripleO project update. Again there should be a recording available at some point via that link, but it isn’t there at present. Besides a general Stein update, slagle introduced the concept of scale (thousands of nodes, not hundreds), with edge as one of the main use cases for these ‘thousand node deployments’. These concepts were then further discussed in subsequent TripleO sessions noted in the following paragraphs.

The first of these TripleO sessions was the forum devoted to scale, led by slagle - the etherpad is here. There is a good list of the identified and discussed “bottleneck services” on the undercloud - including Heat, Ironic, Mistral & Zaqar, Neutron, Keystone and Ansible - and the technical challenges around possibly removing these. This was further explored during the PTG.

Finally I was at the Open Infrastructure project update given by C. Boylan, which highlighted the move to opendev.org, and then the Zuul project update by J. Blair.


Project Teams Gathering Day 1

I spent the PTG in the TripleO room: Room etherpad and picture

The etherpad contains notes from the various discussions but I highlight some of the main themes here. As usual there was a brief retrospective on the Stein cycle and some of that was captured in this etherpad. This was followed by an operator feedback session - one of the main issues raised was ‘needs more scale’.

Slagle led the discussion on Edge, which introduced and discussed the requirements for the Distributed Compute Node architecture, where we will have a central deployment for our controllers and compute nodes spread across a number of edge locations. There was participation here from both the Edge working group as well as the Ironic project.

Then fultonj and gfidente led the storage squad update (notes on the main tripleo room etherpad). Among other things, there was discussion around ceph deployments ‘at the edge’ and the related challenges, as well as the triggering of TripleO jobs on ceph-ansible pull requests.

Finally emilien led the Deployment squad topics (notes on the tripleo room etherpad). In particular there was further discussion around making the undercloud ‘lighter’ by considering which services we might remove. For this cycle it is likely that we keep Mistral, albeit changing the way we use it so that it only executes ansible, and keep Neutron and os-net-config as is, but make the network configuration be applied more directly by ansible. There was also discussion around the use of Nova and whether we can just use Ironic directly. There will be exploration around the use of metalsmith to provide the information about the nodes in our deployment that we would lose by removing Nova.


Project Teams Gathering Day 2

Room etherpad and day two picture

Slagle led the first session, which revisited the “thousand node scale” topic introduced in the tripleo operator forum and captured in the tripleo-forum-scale etherpad.

The HA session was introduced by bandini and dciabrin (see the main room etherpad for notes). Some of the topics raised here were the need for a new workflow for minor deployment configuration changes such as changing a service password, how we can improve the issue posed by a partial/temporary disconnection of one of the cluster/controlplane nodes, and whether pacemaker should be the default in upstream deployments (a topic revisited most summits…). There was no strong push back on the latter, however it is still to be proposed as a gerrit change so remains TBD.

The upgrades squad was represented by chem, jfrancoa and ccamacho. There are notes in this upgrades session etherpad. Amongst other topics there was discussion around ‘FFWD II’, which is Queens to Train (and which includes the upgrade from Centos7 to Centos8), as well as a discussion around a completely fresh approach to the upgrades workflow that uses a separate set of nodes for the controlplane. The idea is to replicate the existing controlplane onto 3 new nodes, but deploying the target upgrade version. This could mean more than 3 nodes if you have distributed the controlplane services across a number of dedicated nodes, like Networker for example. Once the ‘new’ controlplane is ready you would migrate the data from your old controlplane, and at that point there would be a controlplane outage. However, since the target controlplane is ready to go, the hope is that the switch over from old to new controlplane will be a relatively painless process once the details are worked out this cycle. For the rest of the nodes (Compute etc.) the existing workflow would be used, with the tripleoclient running the relevant ansible playbooks to deliver upgrades on a per-node basis.

The TripleO CI squad was represented by weshay, quiquell, sshnaidm and myself. The session was introduced by weshay and we had a good discussion lasting well over an hour about numerous topics (captured in the main tripleo room etherpad), including the performance gains from moving to standalone jobs, plans around the standalone-upgrade job (in particular that for stable/stein this should be green and voting now - taiga story in progress), the work around rhel7/8 on baremetal and the software factory jobs, and using browbeat to monitor changes to the deployment time and possibly alert or even block if a change is significant.

Finally weshay showed off the shiny new zuul-based reproducer (kudos quiquell and sshnaidm). In short, you can find the reproducer-quickstart in any TripleO ci job and follow the related reproducer README to have your own zuul and gerrit running the given job using either libvirt or ovb (i.e. on rdocloud). This is the first time the new reproducer was introduced to the wider team, and whilst we (the TripleO CI squad) would probably still call this a beta, we think it’s ready enough for any early adopters who might find it interesting and useful to try out - the CI squad would certainly appreciate any feedback.

May 05, 2019 03:00 PM

April 26, 2019

Cedric Jeanneret

In-flight Validations II

Here’s a quick demo for the in-flight validations, with some (edited) cast!

As previously stated, being able to call validations during the deploy/update itself provides a quick way to get early failures, avoiding head scratching and time loss.

This quick demo shows how it can be done easily, with a real validation. It uses the (hopefully) soon-to-be-merged new “image-serve” validation and calls it just after the service is configured.

Doing so allows us to ensure the configuration is actually working fine. In this demo, the httpd service is stopped before calling the validation, in order to show the early failure occurring even before we actually need that service.

Preparation

You need to build a tripleo-validations package with the new validation. You can do so using the tripleo-lab.

Once you have built and installed the package, you need to edit tripleo-heat-templates content, in our case:

sudo vim /usr/share/openstack-tripleo-heat-templates/deployment/image-serve/image-serve-baremetal-ansible.yaml

Go to the host_prep_tasks section, and, at the end of the Install, Configure and Run Apache to serve container images block, insert this:

          - name: DEMO - stop httpd
            service:
              name: httpd
              state: stopped
          - include_role:
              name: image-serve

Of course, the DEMO - stop httpd task should not be added in production, since it will make the validation fail ;). This entry is only for the demo effect.

Save the edited file, and… Well. That’s it. You have just added a simple validation that will ensure the container image registry is working as expected!

And, after so many words, here’s the promised cast! asciicast

Do you validate this feature/content? ;)

April 26, 2019 10:00 AM

April 25, 2019

Cedric Jeanneret

In-flight Validations

We’ve seen in the previous post how the Validation Framework will help make the whole TripleO deploy more stable. I’ve shown how running the validations before and after a deploy is easy - but that’s not all we can do.

Lately, I’ve also worked on the so-called “in-flight validations” - a way to run validations (whether from the Framework or not) during the deploy itself.

This provides multiple advantages:

  • early failure
  • ensuring things are in place before going forward
  • provides clear output when something is missing or has crashed

This quick example shows how we can use the already existing health checks directly inside the deploy - doing so ensures we have a working service.

Is Horizon working?

Take the Horizon service. It’s an easy one, with only one template, one container, and a simple deploy path.

Opening deployment/horizon/horizon-container-puppet.yaml, you need to add a new entry in the output:

      deploy_steps_tasks:
        - name: ensure horizon is running
          when: step|int == 4
          shell: |
            podman exec -u root horizon /usr/share/openstack-tripleo-common/healthcheck/horizon

You can add it wherever you want, for instance right before the # BEGIN DOCKER SETTINGS comment.

Some explanations:

The deploy_steps_tasks is a “new” (not THAT new though) task list running on the host directly. Using the when condition, you can ensure it’s launched at the right step - for instance, since the Horizon container is deployed at step 3, we want to ensure it’s running OK at step 4.

We can, of course, inject some other kind of validations - for instance, we can call the roles provided by the tripleo-validations package, the very same providing all the existing validations for the Validation Framework.

Also, instead of hard-coding the “podman” call, we should use the ContainerCli value used in the tripleo-heat-templates. Of course, keeping the code clean is as important as being able to test the deploy ;).
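
A hedged sketch of that idea (assuming the container_cli ansible variable is available to deploy_steps_tasks, as it is for other tasks in tripleo-heat-templates):

      deploy_steps_tasks:
        - name: ensure horizon is running
          when: step|int == 4
          shell: |
            {{ container_cli }} exec -u root horizon /usr/share/openstack-tripleo-common/healthcheck/horizon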

Make it crash!

The above example should succeed on every deploy. If you want to see how adding an in-flight validation makes it crash early, you can edit the command and set it to:

podman exec -u root horizon /usr/share/openstack-tripleo-common/healthcheck/glance-api

Doing so will make the whole deploy crash at step 4, with the following message:

TASK [ensure horizon is running] ***************************************************************
fatal: [undercloud]: FAILED! => {
  "changed": true,
  "cmd": "podman exec -u root horizon /usr/share/openstack-tripleo-common/healthcheck/glance-api",
  "delta": "0:00:00.324740",
  "end": "2019-04-25 09:56:40.100641",
  "msg": "non-zero return cod$",
  "rc": 1,
  "start": "2019-04-25 09:56:39.775901",
  "stderr": "curl: (7) Failed connect to 127.0.0.1:9292; Connection refused\nError: exit status 1",
  "stderr_lines": [
    "curl: (7) Failed connect to 127.0.0.1:9292; Connection refused",
    "Error$ exit status 1"
  ],
  "stdout": "\n000 127.0.0.1:9292 0.001 seconds",
  "stdout_lines": ["", "000 127.0.0.1:9292 0.001 seconds"]
}

Which is perfect: since Horizon isn’t working, we don’t need to wait until the end of the deploy in order to detect it. And we even get a nice error message :).

Using “real” validations from the Framework

In order to call a role from the Framework, you’ll need to use the include_role ansible module, and provide the mandatory variables, if any.

You have to include it in the deploy_steps_tasks entry, and… Well. That’s pretty much all there is to it :).
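
As a sketch (the role name and variable below are placeholders - use whichever tripleo-validations role and mandatory variables you actually need):

      deploy_steps_tasks:
        - name: run a validation from the Framework in-flight
          when: step|int == 4
          include_role:
            name: my-validation-role
          vars:
            my_mandatory_variable: some_value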

Final words

Deploying is a long process. Sometimes it fails, and it might be hard to find out the root cause of the failure. Messages aren’t always helpful, and we might have to search among a lot of different log files, with a lot of “acceptable failures” being ignored.

Using in-flight validations, whether simple health check calls or deeper checks/validations, can help operators as well as developers find and understand an issue. It can also prevent a huge time loss, especially for services that aren’t used during the deploy itself - otherwise we would only see them as crashed at the end of the 5 steps + post-deploy tasks. Meaning “a fair amount of time”.

Make life easier, make validations!

April 25, 2019 10:00 AM

April 24, 2019

Emilien Macchi

Day 2 operations in OpenStack TripleO (episode 1: scale-down)

Scale-up and scale-down are probably the most common operations done after the initial deployment. Let’s see how they are getting improved. This first episode is about scale-down precisely.

How it works now

Right now, when an operator runs the “openstack overcloud node delete” command, it updates the Heat stack to remove the resources associated with the node(s) being deleted. This can be problematic for some services like Nova, Neutron and the Subscription Manager, which need to be torn down before the server is deleted.

Proposal

The idea is to create an interface where we can run Ansible tasks which will be executed during the scale-down, before the nodes get deleted by Heat. The Ansible tasks will live alongside the deployment / upgrade / … tasks that are in TripleO Heat Templates. Here is an example with Red Hat Subscription Management:
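
The actual patches are linked in the original post; purely as an illustration of the idea (the scale_tasks section name and its exact hook semantics are my assumption here), the RHSM case could look roughly like this in a service template:

      scale_tasks:
        - name: unregister the node from Red Hat Subscription Management
          redhat_subscription:
            state: absent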

It involves 3 changes:

What’s next?

  • Getting reviews & feedback on the 3 patches
  • Implement scale down tasks for Neutron, Nova and Ceph, waiting for this feature
  • Looking at scale-up tasks

Demo 1

This demo shows a node being unsubscribed when the overcloud is scaled down.

Demo 2

This demo shows a compute node being removed from the Overcloud.

by Emilien at April 24, 2019 05:37 PM

April 11, 2019

Emilien Macchi

OpenStack Containerization with Podman – Part 2 (SystemD)

In the first post, we demonstrated that we can now use Podman to deploy a containerized OpenStack TripleO Undercloud. Let’s see how we can operate the containers with SystemD.

Podman, by design, doesn’t have any daemon running to manage the containers’ lifecycle, while Docker runs dockerd-current and docker-containerd-current, which take care of a bunch of things, such as restarting containers when they fail (and are configured to do so, with restart policies).

In OpenStack TripleO, we still want our containers to restart when they are configured to, so we thought about managing the containers with SystemD. I recently wrote a blog post about how Podman can be controlled by SystemD, and we finally implemented it in TripleO.

The way it works, as of today, is that any container managed by Podman with a restart policy in the Paunch container configuration will be managed by SystemD.

Let’s take the example of Glance API. This snippet is the configuration of the container at step 4:
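
The snippet itself is embedded in the original post; in rough terms, the Paunch container definition in tripleo-heat-templates looks like the sketch below (image name and volumes are abbreviated and illustrative):

      docker_config:
        step_4:
          glance_api:
            image: 192.168.24.1:8787/tripleomaster/centos-binary-glance-api:current-tripleo
            net: host
            privileged: false
            restart: always
            volumes:
              - /var/lib/config-data/puppet-generated/glance_api:/var/lib/kolla/config_files/src:ro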

As you can see, the Glance API container was configured to always try to restart (so Docker would do so). With Podman, we re-use this flag and we create (+ enable) a SystemD unit file:
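
The generated unit file is shown in the original post; it looks roughly like this sketch (the unit name, paths and timeouts are approximations, not the exact file Paunch writes):

    [Unit]
    Description=glance_api container

    [Service]
    Restart=always
    Type=forking
    PIDFile=/var/run/glance_api.pid
    ExecStart=/usr/bin/podman start glance_api
    ExecStop=/usr/bin/podman stop -t 10 glance_api

    [Install]
    WantedBy=multi-user.target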

How it works underneath:

  • Paunch will run podman run --conmon-pidfile=/var/run/glance_api.pid (…) to start the container, during the deployment steps.
  • If there is a restart policy, Paunch will create a SystemD unit file.
  • The SystemD service is named after the container name, so if you were used to the old service names from before containerization, you’ll have to refresh your memory. By choice, we decided to go with the container name to avoid confusion with the podman ps output.
  • Once the containers are deployed, they need to be stopped / started / restarted by SystemD. If you run Podman CLI to do it, SystemD will take over (see in the demo).

Note about PIDs:

If you configure the service to start the container with “podman start -a”, then systemd will monitor that process for the service. The problem is that this leaves podman start processes around, which have a bunch of threads and are attached to STDOUT/STDIN. Rather than leaving this start process around, we use a forking type in systemd and specify a conmon pidfile for monitoring the container. This removes 500+ threads from the system at the scale of TripleO containers. (Credits to Alex Schultz for the finding.)

Stay in touch for the next post in the series of deploying TripleO and Podman!

by Emilien at April 11, 2019 02:46 AM

February 05, 2019

Carlos Camacho

TripleO - Deployment configurations

This post is a summary of the deployments I usually test for deploying TripleO using quickstart.

The following steps need to run in the Hypervisor node in order to deploy both the Undercloud and the Overcloud.

You need to execute them one after the other; the idea of this recipe is to have something you can just copy and paste.

Once the last step ends you can/should be able to connect to the Undercloud VM to start operating your Overcloud deployment.

The usual steps are:

01 - Prepare the hypervisor node.

Now, let’s install some dependencies. Same Hypervisor node, same root user.

# In this dev. env. /var is only 50GB, so I will create
# a sym link to another location with more capacity.
# It will easily take more than 50GB deploying a 3+1 overcloud
sudo mkdir -p /home/libvirt/
sudo ln -sf /home/libvirt/ /var/lib/libvirt

# Disable IPv6 lookups
# sudo bash -c "cat >> /etc/sysctl.conf" << EOL
# net.ipv6.conf.all.disable_ipv6 = 1
# net.ipv6.conf.default.disable_ipv6 = 1
# EOL
# sudo sysctl -p

# Enable IPv6 in kernel cmdline
# sed -i s/ipv6.disable=1/ipv6.disable=0/ /etc/default/grub
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot

sudo yum groupinstall "Virtualization Host" -y
sudo yum install git lvm2 lvm2-devel -y
sudo yum install libvirt-python python-lxml libvirt -y

02 - Create the toor user (from the Hypervisor node, as root).

sudo useradd toor
echo "toor:toor" | sudo chpasswd
echo "toor ALL=(root) NOPASSWD:ALL" \
  | sudo tee /etc/sudoers.d/toor
sudo chmod 0440 /etc/sudoers.d/toor
sudo su - toor

cd
mkdir .ssh
ssh-keygen -t rsa -N "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
cat .ssh/id_rsa.pub | sudo tee -a /root/.ssh/authorized_keys
echo '127.0.0.1 127.0.0.2' | sudo tee -a /etc/hosts

export VIRTHOST=127.0.0.2
ssh root@$VIRTHOST uname -a

Now, continue as the toor user and prepare the Hypervisor node for the deployment.

03 - Clone repos and install deps.

git clone \
  https://github.com/openstack/tripleo-quickstart
chmod u+x ./tripleo-quickstart/quickstart.sh
bash ./tripleo-quickstart/quickstart.sh \
  --install-deps
sudo setenforce 0

Export some variables used in the deployment command.

04 - Export common variables.

export CONFIG=~/deploy-config.yaml
export VIRTHOST=127.0.0.2

Now we will create the configuration file used for the deployment, depending on the file you choose you will deploy different environments.

05 - Choose one of the following environment recipes.

OpenStack [Containerized & HA] - 1 Controller, 1 Compute

cat > $CONFIG << EOF
overcloud_nodes:
  - name: control_0
    flavor: control
    virtualbmc_port: 6230
  - name: compute_0
    flavor: compute
    virtualbmc_port: 6231
node_count: 2
containerized_overcloud: true
delete_docker_cache: true
enable_pacemaker: true
run_tempest: false
extra_args: >-
  --libvirt-type qemu
  --ntp-server pool.ntp.org
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
EOF
OpenStack [Containerized & HA] - 3 Controllers, 1 Compute

cat > $CONFIG << EOF
overcloud_nodes:
  - name: control_0
    flavor: control
    virtualbmc_port: 6230
  - name: control_1
    flavor: control
    virtualbmc_port: 6231
  - name: control_2
    flavor: control
    virtualbmc_port: 6232
  - name: compute_1
    flavor: compute
    virtualbmc_port: 6233
node_count: 4
containerized_overcloud: true
delete_docker_cache: true
enable_pacemaker: true
run_tempest: false
extra_args: >-
  --libvirt-type qemu
  --ntp-server pool.ntp.org
  --control-scale 3
  --compute-scale 1
  -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml
EOF

OpenShift [Containerized] - 1 Controller, 1 Compute

cat > $CONFIG << EOF
# Original from https://github.com/openstack/tripleo-quickstart/blob/master/config/general_config/featureset033.yml
composable_scenario: scenario009-multinode.yaml
deployed_server: true

network_isolation: false
enable_pacemaker: false
overcloud_ipv6: false
containerized_undercloud: true
containerized_overcloud: true

# This enables TLS for the undercloud which will also make haproxy bind to the
# configured public-vip and admin-vip.
undercloud_generate_service_certificate: false
undercloud_enable_validations: false

# This enables the deployment of the overcloud with SSL.
ssl_overcloud: false

# Centos Virt-SIG repo for atomic package
add_repos:
  # NOTE(trown) The atomic package from centos-extras does not work for
  # us but its version is higher than the one from the virt-sig. Hence,
  # using priorities to ensure we get the virt-sig package.
  - type: package
    pkg_name: yum-plugin-priorities
  - type: generic
    reponame: quickstart-centos-paas
    filename: quickstart-centos-paas.repo
    baseurl: https://cbs.centos.org/repos/paas7-openshift-origin311-candidate/x86_64/os/
  - type: generic
    reponame: quickstart-centos-virt-container
    filename: quickstart-centos-virt-container.repo
    baseurl: https://cbs.centos.org/repos/virt7-container-common-candidate/x86_64/os/
    includepkgs:
      - atomic
    priority: 1

extra_args: ''

container_args: >-
  # If Pike or Queens
  #-e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml
  # If Ocata, Pike, Queens or Rocky
  #-e /home/stack/containers-default-parameters.yaml
  # If >= Stein
  -e /home/stack/containers-prepare-parameter.yaml

  -e /usr/share/openstack-tripleo-heat-templates/openshift.yaml
# NOTE(mandre) use container images mirrored on the dockerhub to take advantage
# of the proxy setup by openstack infra
docker_openshift_etcd_namespace: docker.io/
docker_openshift_cluster_monitoring_namespace: docker.io/tripleomaster
docker_openshift_cluster_monitoring_image: coreos-cluster-monitoring-operator
docker_openshift_configmap_reload_namespace: docker.io/tripleomaster
docker_openshift_configmap_reload_image: coreos-configmap-reload
docker_openshift_prometheus_operator_namespace: docker.io/tripleomaster
docker_openshift_prometheus_operator_image: coreos-prometheus-operator
docker_openshift_prometheus_config_reload_namespace: docker.io/tripleomaster
docker_openshift_prometheus_config_reload_image: coreos-prometheus-config-reloader
docker_openshift_kube_rbac_proxy_namespace: docker.io/tripleomaster
docker_openshift_kube_rbac_proxy_image: coreos-kube-rbac-proxy
docker_openshift_kube_state_metrics_namespace: docker.io/tripleomaster
docker_openshift_kube_state_metrics_image: coreos-kube-state-metrics

deploy_steps_ansible_workflow: true
config_download_args: >-
  -e /home/stack/config-download.yaml
  --disable-validations
  --verbose
composable_roles: true

overcloud_roles:
  - name: Controller
    CountDefault: 1
    tags:
      - primary
      - controller
    networks:
      - External
      - InternalApi
      - Storage
      - StorageMgmt
      - Tenant
  - name: Compute
    CountDefault: 0
    tags:
      - compute
    networks:
      - External
      - InternalApi
      - Storage
      - StorageMgmt
      - Tenant

tempest_config: false
test_ping: false
run_tempest: false
EOF


From the Hypervisor, as the toor user run the deployment command to deploy both your Undercloud and Overcloud.

06 - Deploy TripleO.

bash ./tripleo-quickstart/quickstart.sh \
      --clean          \
      --release master \
      --teardown all   \
      --tags all       \
      -e @$CONFIG      \
      $VIRTHOST

Updated 2019/02/05: Initial version.

Updated 2019/02/05: TODO: Test the OpenShift deployment.

Updated 2019/02/06: Added some clarifications about where the commands should run.

by Carlos Camacho at February 05, 2019 12:00 AM

January 16, 2019

Ben Nemec

OpenStack Virtual Baremetal Imported to OpenStack Infra

As foretold in a previous post, OVB has been imported to OpenStack Infra. The repo can now be found at https://git.openstack.org/cgit/openstack/openstack-virtual-baremetal. All future development will happen there so you should update any existing references you may have. In addition, changes will now be proposed via Gerrit instead of Github pull requests. \o/

For the moment, the core reviewer list is largely limited to the same people who had commit access to the Github repo. The TripleO PTL and one other have been added, but that will likely continue to change over time. The full list can be found here.

Because of the still-limited core list, not much about the approval process will change as a result of this import. I will continue to review and single-approve patches just like I did on Github. However, there are plans in the works to add CI gating to the repo (another benefit of the import) and once that has happened we will most likely open up the core reviewer list to a wider group.

Questions and comments via the usual channels.

by bnemec at January 16, 2019 06:18 PM

January 10, 2019

Ben Nemec

OpenStack Virtual Baremetal Master is Now 2.0-dev

As promised in my previous update on OVB, the 2.0-dev branch has been merged to master. If this breaks you, switch to the stable/1.0 branch, which is the same as master was prior to the 2.0-dev merge. Note that this does not mean OVB is officially 2.0 yet. I've found a couple more deprecated things that need to be removed before we declare 2.0. That will likely happen soon though.

by bnemec at January 10, 2019 08:30 PM

January 01, 2019

Giulio Fidente

Pushing Openstack and Ceph to the Edge at OpenStack summit Berlin 2018

Together with Sebastien Han and Sean Cohen, we presented a session at the latest OpenStack (Open Infrastructure?) summit, held in Berlin after the Rocky release, discussing how TripleO will support "Edge" deployments with storage at the edge, colocating Ceph with the OpenStack services in the edge zone yet keeping a small hardware footprint.

Thanks to the OpenStack Foundation for giving us the chance.

by Giulio Fidente at January 01, 2019 11:00 PM

September 26, 2018

Juan Antonio Osorio

Oslo Policy Deep Dive (part 2)

In the previous blog post I covered all you need to know to write your own policies and understand where they come from.

Here, we’ll go through some examples of how you would change the policy for a service, and how to take that new policy into use.

For this, I’ve created a repository to try things out and hopefully get you practicing this kind of thing. Of course, things will be slightly different in your environment, depending on how you’re running OpenStack. But you should get the basic idea.

We’ll use Barbican as a test service to do basic policy changes. The configuration that I’m providing is not meant for production, but it makes it easier to make changes and test things out. It’s a very minimal and simple barbican configuration that has the “unauthenticated” context enabled. This means that it doesn’t rely on keystone, and it will use whatever roles and project you provide in the REST API.

The default policy & how to change it

As mentioned in the previous blog post, nowadays the default policy is “shipped” as part of the codebase. For some services, folks might still package the policy.json file. However, for our test service (Barbican), this is not the case.

You can easily overwrite the default policy by providing a policy.json file yourself. By default, oslo.policy will look in the project’s base directory and try to read the policy.json file from there. For Barbican, this will be /etc/barbican/policy.json. For Keystone, /etc/keystone/policy.json.

It is worth noting that this file is configurable by setting the policy_file setting in your service’s configuration, which is under the oslo_policy group of the configuration file.
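
For example, in Barbican’s case you could point the service at a differently named or located file from barbican.conf (the path below is just an example):

[oslo_policy]
policy_file = /etc/barbican/my-policy.json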

If you have a running service, and you add or modify the policy.json file, the changes will immediately take effect. No need to restart nor reload your service.

The way this works is that oslo.policy will read the file’s modification time (using os.path.getmtime(filename)) and cache it. If on a subsequent read the modification time has changed, it’ll re-read the policy file and load the new rules.

It is also worth noting that when using policy.json, you don’t need to provide the whole policy, only the rules and aliases you’re planning to change.

If you need to get the policy of a specific service, it’s fairly straightforward given the tools that oslo.policy provides. All you need to do is the following:

oslopolicy-policy-generator --namespace $SERVICE_NAME

It is important to note that this will get you the effective policy that’s being executed. So, any changes that you make to the policy will be reflected in the output of this command.

If you want to get a sample file for the default policy with all the documentation for each option, you’ll do the following:

oslopolicy-sample-generator --namespace $SERVICE_NAME

So, in order to output Barbican’s effective policy, we’ll do the following:

oslopolicy-policy-generator --namespace barbican

Note that this outputs the policy in YAML format, and oslo.policy reads policy.json by default, so you’ll have to transform the file into JSON to take it into use.
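
One way to do that conversion (assuming PyYAML is available wherever you run this) is a small one-liner:

oslopolicy-policy-generator --namespace barbican \
    | python -c 'import sys, json, yaml; print(json.dumps(yaml.safe_load(sys.stdin), indent=2))' \
    > /etc/barbican/policy.json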

Setting up the testing environment

NOTE: If you just plan to read through this and not actually do the exercises, you may skip this section.

Lets clone the repository first:

git clone https://github.com/JAORMX/barbican-policy-tests.git
cd barbican-policy-tests

Now that we’re in the repo, you’ll notice several scripts there. To provide you with a consistent environment, I decided to rely on containeeeers!!! So, in order to continue, you’ll need to have Docker installed on your system.

(Maybe in the future I’ll update this to run with Podman and Buildah)

To build the minimal barbican container, execute the following:

./build-barbican-container-image.sh

You can verify that you have the barbican-minimal image with the latest tag by running docker images.

To test that the image was built correctly and you can run barbican, execute the following:

./0-run-barbican-simple.sh

You will notice barbican is running, and you can see the name of its container with docker ps. You’ll notice it’s listening on port 9311 on localhost.

Exercises

Preface

In the following exercises, we’ll do some changes to the Barbican policy. To do this, it’s worth understanding some things about the service and the policy itself.

Barbican is Secret Storage as a service. To simplify things, we’ll focus on the secret storage side of things.

These are the operations you can do on a secret:

  • secrets:get: List all secrets for the specific project.

  • secrets:post: Create a new secret.

  • secret:decrypt: Decrypt the specified secret.

  • secret:get: Get the metadata for the specified secret.

  • secret:put: Modify the specified secret.

  • secret:delete: Delete the specified secret.

Barbican also assumes 5 keystone roles, and bases its policy on the usage of these roles:

  • admin: Can do all operations on secrets (List, create, read, update, delete and decrypt)

  • creator: Can do all operations on secrets. This role is limited on other resources (such as secret containers), but we’ll ignore other resources in these exercises.

  • observer: In the context of secrets, observers can only list secrets and view a specific secret’s metadata.

  • audit: In the context of secrets, auditors can only view a specific secret’s metadata (but cannot do anything else).

  • service_admin: can’t do anything related to secrets. This role is meant for admin operations that change the Barbican service itself (such as quotas).

The Barbican default policy also comes with some useful aliases as defaults:

{
"admin": "role:admin",
"observer": "role:observer",
"creator": "role:creator",
"audit": "role:audit",
"service_admin": "role:key-manager:service-admin",
...
}

So this makes overwriting specific roles fairly straightforward.

Scenario #1

The Keystone default roles proposal suggests the usage of three roles, which should work across all OpenStack services. These roles are: reader, member and admin.

Let’s take this into use in Barbican, and replace our already existing observer role with reader.

In this case, we can take the alias into use; by making very minimal changes, we can replace the usage of observer entirely.

I have already defined this role in the aforementioned repo, let’s take a look:

{
"observer": "role:reader"
}

And that’s it!

Now in the barbican policy, every instance of the “rule:observer” assertion will actually reference the “reader” role.

Testing scenario #1

There is already a script that runs barbican and takes this policy into use. Let’s run it and verify that we can effectively use the reader role instead of the observer role:

# Run the container
./1-run-barbican-with-reader-role.sh

# Create a sample secret
./create-secret.sh

# Attempt to list the available secrets with the reader role. This
# operation should succeed.
./list-secrets.sh reader

# Attempt to list the available secrets with the observer role. This
# operation should fail.
./list-secrets.sh observer

# Once you're done, you can stop the container

Scenario #2

Barbican’s audit role is meant to only read a very minimal set of things from Barbican’s entities. For some, this role might not be very useful, and it also doesn’t fit with Keystone’s set of default roles, so let’s delete it!

As before, I have already defined a policy for this purpose:

{
"audit": "!"
}

As you can see, this replaces the audit alias, and any attempt to use it will be rejected by the policy, effectively disallowing use of the audit role.

Testing scenario #2

There is already a script that runs barbican and takes this policy into use. Let’s run it and verify that we can effectively no longer use the audit role:

# run the container
./2-run-barbican-without-audit-policy.sh

# create a secret
./create-secret.sh

# Attempt to view the secret metadata with the creator role. This
# operation should succeed.
curl -H 'X-Project-Id: 1234' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<some ID> | python -m json.tool

# Attempt to view the secret metadata with the audit role. This
# operation should fail.
curl -H 'X-Project-Id: 1234' -H 'X-Roles: audit' \
    http://localhost:9311/v1/secrets/<some ID> | python -m json.tool

# Once you're done, you can stop the container

Scenario #3

Now that we have tried a couple of things and they have gone fine, let’s put it all together and replicate the Keystone default role recommendation.

Here’s what we’ll do: As before, we’ll replace the observer role with reader. We’ll also replace the creator role with member, and finally, we’ll remove the audit role.

Here’s the policy file:

{
"observer": "role:reader",
"creator": "role:member",
"audit": "!"
}

This time, we’ll change the policy file in-place, as this is something you might need to do or automate in your own deployment.

Testing scenario #3

Here, we’ll run a minimal container that doesn’t take any specific policy into use. We’ll log into it, modify the policy.json file, and test out the results.

# Run the container
./0-run-barbican-simple.sh

# Open a bash session in the container
docker exec -ti $(docker ps | grep barbican-minimal | awk '{print $1}') bash

# (In the container) Create the new policy file
cat <<EOF > /etc/barbican/policy.json
{
"observer": "role:reader",
"creator": "role:member",
"audit": "!"
}
EOF

# (In the container) Exit the container
exit

# Attempt to create a sample secret with the creator role. This operation
# should fail
./create-secret.sh creator

# Attempt to create a sample secret with the member role. This operation
# should succeed
./create-secret.sh member

# Attempt to list the available secrets with the observer role. This
# operation should fail.
./list-secrets.sh observer

# Attempt to list the available secrets with the reader role. This
# operation should succeed.
./list-secrets.sh reader

# Attempt to view the secret metadata with the audit role. This
# operation should fail.
curl -H 'X-Project-Id: 1234' -H 'X-Roles: audit' \
    http://localhost:9311/v1/secrets/<some ID> | python -m json.tool

# Attempt to view the secret metadata with the creator role. This
# operation should fail.
curl -H 'X-Project-Id: 1234' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<some ID> | python -m json.tool

# Attempt to view the secret metadata with the member role. This
# operation should succeed.
curl -H 'X-Project-Id: 1234' -H 'X-Roles: member' \
    http://localhost:9311/v1/secrets/<some ID> | python -m json.tool

# Once you're done, you can stop the container

Scenario #4

For our last case, let’s assume that for some reason you need a “super-admin” role that is able to read everybody’s secret metadata. There is no equivalent of this role in Barbican, so we’ll have to modify more things in order to get this to work.

To simplify things, we’ll only modify the GET operation for secret metadata.

Please note that this is only done for learning purposes, do not try this in production.

The first thing we’ll need is to retrieve the policy line that actually gets executed for secret metadata. In Barbican, it’s the secret:get policy.

From within the container, or wherever you have the Barbican package installed, you can do the following in order to get this exact policy:

oslopolicy-policy-generator --namespace barbican | grep "secret:get"

This will get us the following line:

"secret:get": "rule:secret_non_private_read or rule:secret_project_creator or rule:secret_project_admin or rule:secret_acl_read"

Note that in the Barbican policy we explicitly check, for most rules, that the user is in the same project as the secret. In this case, we’ll omit that check in order to enable the “super-admin” to retrieve any secret’s metadata.

Here is the final policy.json file we’ll use:

{
"super_admin": "role:super-admin",
"secret:get": "rule:secret_non_private_read or rule:secret_project_creator or rule:secret_project_admin or rule:secret_acl_read or rule:super_admin"
}

Testing scenario #4

Here, we’ll run a minimal container that doesn’t take any specific policy into use. We’ll log into it, modify the policy.json file, and test out the results.

# Run the container
./0-run-barbican-simple.sh

# Open a bash session in the container
docker exec -ti $(docker ps | grep barbican-minimal | awk '{print $1}') bash

# (In the container) Lets verify what the current policy is for "secret:get".
# This should output the default rule.
oslopolicy-policy-generator --namespace barbican | grep "secret:get"

# (In the container) Create the new policy file
cat <<EOF > /etc/barbican/policy.json
{
"super_admin": "role:super-admin",
"secret:get": "rule:secret_non_private_read or rule:secret_project_creator or rule:secret_project_admin or rule:secret_acl_read or rule:super_admin"
}
EOF

# (In the container) Lets verify what the current policy is for "secret:get".
# This should output the updated policy.
oslopolicy-policy-generator --namespace barbican | grep "secret:get"

# (In the container) Exit the container
exit

# Lets now create a couple of secrets with the creator role in the default
# project (1234).

# This will be secret #1
./create-secret.sh creator
# This will be secret #2
./create-secret.sh creator

# Lets now create a couple of secrets with the creator role in another project
# (1111).

# This will be secret #3
./create-secret.sh creator 1111

Using the creator role and project ‘1234’, you should only be able to retrieve secrets #1 and #2, but should get an error with secret #3.

# So... this should work
curl -H 'X-Project-Id: 1234' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #1> | python -m json.tool

# this should work
curl -H 'X-Project-Id: 1234' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #2> | python -m json.tool

# ...And this should fail
curl -H 'X-Project-Id: 1234' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #3> | python -m json.tool

Using the creator role and project ‘1111’, you should only be able to retrieve secret #3, but should get an error with secrets #1 and #2

# So... this should fail
curl -H 'X-Project-Id: 1111' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #1> | python -m json.tool

# this should fail
curl -H 'X-Project-Id: 1111' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #2> | python -m json.tool

# ...And this should work
curl -H 'X-Project-Id: 1111' -H 'X-Roles: creator' \
    http://localhost:9311/v1/secrets/<secret #3> | python -m json.tool

Finally, let’s try our new super-admin role. As you will notice, you don’t even need to be part of the projects to get the metadata:

# So... this should work
curl -H 'X-Project-Id: POLICY' -H 'X-Roles: super-admin' \
    http://localhost:9311/v1/secrets/<secret #1> | python -m json.tool

# this should work
curl -H 'X-Project-Id: IS' -H 'X-Roles: super-admin' \
    http://localhost:9311/v1/secrets/<secret #2> | python -m json.tool

# ...And this should work too
curl -H 'X-Project-Id: COOL' -H 'X-Roles: super-admin' \
    http://localhost:9311/v1/secrets/<secret #3> | python -m json.tool

Conclusion

You have now learned how to do simple modifications to your service’s policy!

With great power comes great responsibility… And all those things… But seriously, be careful! You might end up with unintended results.

In the next blog post, we’ll cover implied roles and how you can use them in your policies!

September 26, 2018 02:01 PM

September 19, 2018

Juan Antonio Osorio

Adding custom databases and database users in TripleO

For folks integrating with TripleO, it has been quite painful to always need to modify puppet in order to integrate with the engine. This has typically been the case for things like adding a HAProxy endpoint, or adding a database and a database user (and grants). As mentioned in a previous post, this is no longer the case for HAProxy endpoints, and that ability has been in TripleO for a couple of releases now.

With the same logic in mind, I added this same functionality for mysql databases and database users. This recently landed in Stein. So, all you need to do is add something like this to your service template:

    service_config_settings:
      mysql:
        ...
        tripleo::my_service_name::mysql_user:
          password: 'myPassword'
          dbname: 'mydatabase'
          user: 'myuser'
          host: {get_param: [EndpointMap, MysqlInternal, host_nobrackets]}
          allowed_hosts:
            - '%'
            - "%{hiera('mysql_bind_host')}"

This will create:

  • A database called mydatabase
  • A user that can access that database, called myuser
  • The user myuser will have the password myPassword
  • And grants will be created so that the user can connect from the hosts specified in the host and allowed_hosts parameters.

Now you don’t need to modify puppet to add a new service to TripleO!
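
As a follow-on usage sketch (the service name, password parameter and exact layout here are hypothetical), the same service template would then typically build its database connection string from those values in its config_settings:

    config_settings:
      my_service::database_connection:
        make_url:
          scheme: {get_param: [EndpointMap, MysqlInternal, protocol]}
          username: myuser
          password: {get_param: MyServicePassword}
          host: {get_param: [EndpointMap, MysqlInternal, host]}
          path: /mydatabase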

September 19, 2018 04:50 AM

September 17, 2018

Marios Andreou

My summary of the OpenStack Stein PTG in Denver


My summary of the OpenStack Stein PTG in Denver

After only 3 take-offs/landings I was very happy to participate in the Stein PTG in Denver. This is a brief summary with pointers of the sessions or rooms I attended, in the order they happened (Stein PTG Schedule).


Upgrades CI with the stand-alone deployment

We had a productive impromptu round table (weshay++) in one of the empty rooms with the tripleo ci folks present (weshay, panda, sshnaidm, arxcruz, marios), the tripleo upgrades folks present (chem and holser), as well as emeritus PTL mwahaha, around the stand-alone deployment and how we can use it for upgrades ci. We introduced the proposed spec and one of the main topics discussed was: ultimately, is it worth it to solve all of these subproblems to only end up with some approximation of the upgrade?

The consensus was yes, since we can have 2 types of upgrade jobs: use the stand-alone to ci the actual tasks, i.e. upgrade_tasks and deployment_tasks for each service in the tripleo-heat-templates, and another job (the current job, which will be adapted) to ci the upgrades workflow - tripleoclient/mistral workflows etc. There was general consensus on this approach between the upgrades and ci representatives, so that we could try and sell it to the wider team in the tripleo room on Wednesday together.


Upgrades Special Interest Group

Room etherpad.

Monday afternoon was spent in the upgrades SIG room. There was first discussion of the placement api extraction and how this would have to be dealt with during the upgrade, with a solution sketched out around the db migrations required.

This led into discussion around pre-upgrade checks that could deal with things like db migrations (or just check if something is missing and fail accordingly before the upgrade). As I was reminded during the lunchtime presentations, pre-upgrade checks are one of the Stein community goals (together with python-3). The idea is that each service would own a set of checks that should be performed before an upgrade is run and that they would be invoked via the openstack client (something along the lines of ‘openstack pre-upgrade-check nova’). I believe there is already some implementation (from the nova team) but I don’t readily have details.

There was then a productive discussion about the purpose and direction of the upgrades SIG. One of the points raised was that the SIG should not be just about the fast forward upgrade even though that has been a main focus until now. The pre-upgrade checks are a good example of that and the SIG will try and continue to promote these with adoption by all the OpenStack services. On that note I proposed that whilst the services themselves will own the service specific pre-upgrade checks, it’s the deployment projects which will own the pre-upgrade infrastructure checks, such as healthy cluster/database or responding service endpoints.

There was of course discussion around the fast forward upgrade, with status updates from the deployment projects present (kolla-ansible, TripleO, charms, OSA). TripleO is the only project with an implemented workflow at present. Finally there was a discussion about whether we’re doing better in terms of operator experience for upgrades in general and how we can continue to improve (e.g. rolling upgrades was one of the points discussed here).


Edge room

Room etherpad Room etherpad2 Use cases Edge primer

I was only in attendance for the first part of this session, which was about understanding the requirements (and hopefully continuing to find common ground). The room started with a review of the various use cases proposed in Dublin and of any work done since then. One of the main points raised by shardy is that whilst in TripleO we have a number of exploratory efforts ongoing (like split controlplane, for example), it would be good to have a specific architecture to aim for, and that is currently missing. It was agreed that the existing use cases will be extended to include the proposed architecture and that these can serve as a starting point for anyone looking to deploy with edge locations.

There are pointers to the rest of the edge sessions in the etherpad above.


TripleO room

Room etherpad Team picture

The order of sessions was slightly revised from that listed in the etherpad above because the East coast storms forced folks to change travel plans. The following order is to the best of my recollection ;)

TripleO and Edge cloud deployments

Session etherpad

There was first a summary from the Edge room from shardy and then tripleo specific discussion around the current work (split controlplane). There was some discussion around possibly using/repurposing “the multinode job” for multiple stacks to simulate the Edge locations in ci. There was also discussion around the networking aspects (though this will depend on the architecture, which we don’t yet have fully targeted) with respect to the tripleo deployment networks (controlplane/internalapi etc) in an edge deployment. Finally there was consideration of the work needed in tripleo-common and the mistral workflows needed for the split controlplane deployment.

OS / Platform

(tracked on main tripleo etherpad linked above)

The main items discussed here were Python 3 support, removing instack-undercloud and “that upgrade” to Centos8 on Stein.

For Python3 the discussion included the fact that in TripleO we are bound by whatever python the deployed services support (as well as what the upstream distribution will be i.e. Centos 7/8 and which python ships where).

For the Centos8/Stein upgrade the upgrades folks chem and holser led the discussion, outlining how we will need a completely new workflow, which may be dictated in large part by how Centos8 is delivered. One of the approaches discussed here was to use a completely external/distinct upgrade workflow for the OS, versus the TripleO driven OpenStack upgrade itself. We got into more details about this during the Baremetal session (see below).

TripleO CI

Session etherpad

One of the first items raised was the stand-alone deployment and its use in ci. The general proposal is that we should use a lot more of it! In particular to replace existing jobs (like scenarios 1/2) with a standalone deployment.

There was also discussion around the stand-alone for the upgrades ci as we agreed with the upgrades folks on Monday (spec). The idea of service vs workflow upgrades was presented/solidified here and I have just updated v8 of the spec accordingly to emphasise this point.

Other points discussed in the CI session were testing ovb in infra and how we could make jobs voting. The first move will be towards removing te-broker.

There was also some consideration of the involvement of the ci team with other squads and vice versa. There is a new column in our trello board called “requests from other DFG”.

A further point raised was the reproducer scripts and future directions, including running these in ci and not only generating them there. As a related side note, it sounds like folks are using the reproducer and having some success.

Ansible / Framework

(tracked on main tripleo etherpad linked above)

In this session an overview of the work towards splitting out the ansible tasks from the tripleo-heat-templates into re-usable roles was given by jillr and slagle. More info and pointers are in the main tripleo etherpad above.

Security

Session etherpad

There was discussion around the workflow to change overcloud/service passwords (this is currently borked!) - in particular, problems around trying to CI this, since the deploy takes too long to fit a deploy + stack update for the passwords and validation within the timeout. Possibly this could be a 3rd party (but then non-voting) job for now. There was also an overview of work towards using Castellan with TripleO, as well as discussion around selinux and locking down ssh.

UX / UI

Session etherpad

CLI/UI feature parity is a main goal for this cycle (and probably beyond - it seems there is a lot to do), along with the plan management operations around it. There was also good discussion around validations, with Tengu joining remotely via BlueJeans to champion the effort of providing a nice way to run these via the tripleoclient.

Baremetal

Session etherpad

This session started with discussion around metalsmith vs nova on the undercloud and the required upgrade path to make this so. Also considered were overcloud image customization and network automation (ansible with the python-networking-ansible ml2 driver).

However, unexpectedly, the most interesting part of this session for me personally was an impromptu design session started by ipilcher (prompted by a question from phuongh, who I believe was new to the room). The session was about the upgrade to Centos8 and three main approaches were explored: the “big bang” (everything off, upgrade, everything back on), “some kind of rolling upgrade”, and finally supporting either Centos8/Rocky or Centos7/Stein. The first and third were deemed unworkable but there was a very lively and well engaged group design session trying to navigate to a workable process for the ‘rolling upgrade’ aka split personality. Thanks to ipilcher (via bandini) the whiteboards looked like this.

September 17, 2018 03:00 PM

July 24, 2018

Carlos Camacho

Vote for the OpenStack Berlin Summit presentations!

I pushed some presentations for this year's OpenStack Summit in Berlin; the presentations are related to updates, upgrades, backups, failures and restores.

¡¡¡Please vote!!!

Happy TripleOing!

by Carlos Camacho at July 24, 2018 12:00 AM

June 04, 2018

Steven Hardy

TripleO Containerized deployments, debugging basics

Containerized deployments, debugging basics

Since the Pike release, TripleO has supported deployments with OpenStack services running in containers.  Currently we use docker to run images based on those maintained by the Kolla project.

We already have some tips and tricks for container deployment debugging in tripleo-docs, but below are some more notes on my typical debug workflows.

Config generation debugging overview

In the TripleO container architecture, we still use Puppet to generate configuration files and do some bootstrapping, but it is run (inside a container) via a script docker-puppet.py

The config generation usage happens at the start of the deployment (step 1) and the configuration files are generated for all services (regardless of which step they are started in).

The input file used is /var/lib/docker-puppet/docker-puppet.json, but you can also filter this (e.g via cut/paste or jq as shown below) to enable debugging for specific services - this is helpful when you need to iterate on debugging a config generation issue for just one service.

[root@overcloud-controller-0 docker-puppet]# jq '[.[]|select(.config_volume | contains("heat"))]' /var/lib/docker-puppet/docker-puppet.json | tee /tmp/heat_docker_puppet.json
{
"puppet_tags": "heat_config,file,concat,file_line",
"config_volume": "heat_api",
"step_config": "include ::tripleo::profile::base::heat::api\n",
"config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo"
}
{
"puppet_tags": "heat_config,file,concat,file_line",
"config_volume": "heat_api_cfn",
"step_config": "include ::tripleo::profile::base::heat::api_cfn\n",
"config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api-cfn:current-tripleo"
}
{
"puppet_tags": "heat_config,file,concat,file_line",
"config_volume": "heat",
"step_config": "include ::tripleo::profile::base::heat::engine\n\ninclude ::tripleo::profile::base::database::mysql::client",
"config_image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo"
}

 

Then we can run the config generation, if necessary changing the tags (or puppet modules, which are consumed from the host filesystem e.g /etc/puppet/modules) until the desired output is achieved:


[root@overcloud-controller-0 docker-puppet]# export NET_HOST='true'
[root@overcloud-controller-0 docker-puppet]# export DEBUG='true'
[root@overcloud-controller-0 docker-puppet]# export PROCESS_COUNT=1
[root@overcloud-controller-0 docker-puppet]# export CONFIG=/tmp/heat_docker_puppet.json
[root@overcloud-controller-0 docker-puppet]# python /var/lib/docker-puppet/docker-puppet.py
2018-02-09 16:13:16,978 INFO: 102305 -- Running docker-puppet
2018-02-09 16:13:16,978 DEBUG: 102305 -- CONFIG: /tmp/heat_docker_puppet.json
2018-02-09 16:13:16,978 DEBUG: 102305 -- config_volume heat_api
2018-02-09 16:13:16,978 DEBUG: 102305 -- puppet_tags heat_config,file,concat,file_line
2018-02-09 16:13:16,978 DEBUG: 102305 -- manifest include ::tripleo::profile::base::heat::api
2018-02-09 16:13:16,978 DEBUG: 102305 -- config_image 192.168.24.1:8787/tripleomaster/centos-binary-heat-api:current-tripleo
...

 

When the config generation is completed, configuration files are written out to /var/lib/config-data/heat.

We then compare timestamps against the /var/lib/config-data/heat/heat.*origin_of_time file (touched for each service before we run the config-generating containers), so that only those files modified or created by puppet are copied to /var/lib/config-data/puppet-generated/heat.
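If you want to see exactly which files puppet changed for a given service, you can compare against that marker file yourself. A small sketch (the exact marker file name varies, so resolve the glob first):

[root@overcloud-controller-0 ~]# marker=$(ls /var/lib/config-data/heat/heat.*origin_of_time)
[root@overcloud-controller-0 ~]# find /var/lib/config-data/heat -type f -newer "$marker"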

Note that we also calculate a checksum for each service (see /var/lib/config-data/puppet-generated/*.md5sum), which means we can detect when the configuration changes - when this happens we need paunch to restart the containers, even though the image did not change.

This checksum is added to the /var/lib/tripleo-config/hashed-docker-container-startup-config-step_*.json files by docker-puppet.py, and these files are later used by paunch to decide if a container should be restarted (see below).

 

Runtime debugging, paunch 101

Paunch is a tool that orchestrates launching containers for each step, and performs any bootstrapping tasks not handled via docker-puppet.py.

It accepts json input, namely the /var/lib/tripleo-config/docker-container-startup-config-step_*.json files that are created based on the enabled services (the content is directly derived from the service templates in tripleo-heat-templates).

These json files are then modified via docker-puppet.py (as mentioned above) to add a TRIPLEO_CONFIG_HASH value to the container environment - these modified files are written with a different name, see /var/lib/tripleo-config/hashed-docker-container-startup-config-step_*.json

Note this environment variable isn't used by the container directly, it is used as a salt to trigger restarting containers when the configuration files in the mounted config volumes have changed.

As in the docker-puppet case it's possible to filter the json file with jq and debug e.g mounted volumes or other configuration changes directly.

It's also possible to test configuration changes by manually modifying /var/lib/config-data/puppet-generated/ then either restarting the container via docker restart, or by modifying TRIPLEO_CONFIG_HASH then re-running paunch.
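For example, a quick manual test of a config change might look like this (a hedged sketch - the file path is illustrative):

[root@overcloud-controller-0 ~]# vi /var/lib/config-data/puppet-generated/heat/etc/heat/heat.conf
[root@overcloud-controller-0 ~]# docker restart heat_engine
[root@overcloud-controller-0 ~]# docker logs --tail 20 heat_engine

Bear in mind the next stack update will typically regenerate the config via puppet and overwrite manual edits.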

Note that paunch will kill any containers tagged for a particular step: e.g passing --config-id tripleo_step4 --managed-by tripleo-Controller means all containers started during this step by any previous paunch apply will be killed if they are removed from your json during testing.  This is a feature which enables changes to the enabled services on update to your overcloud, but it's worth bearing in mind when testing as described here.


[root@overcloud-controller-0]# cd /var/lib/tripleo-config/
[root@overcloud-controller-0 tripleo-config]# jq '{"heat_engine": .heat_engine}' hashed-docker-container-startup-config-step_4.json | tee /tmp/heat_startup_config.json
{
"heat_engine": {
"healthcheck": {
"test": "/openstack/healthcheck"
},
"image": "192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo",
"environment": [
"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS",
"TRIPLEO_CONFIG_HASH=14617e6728f5f919b16c74f1e98d0264"
],
"volumes": [
"/etc/hosts:/etc/hosts:ro",
"/etc/localtime:/etc/localtime:ro",
"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro",
"/dev/log:/dev/log",
"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro",
"/etc/puppet:/etc/puppet:ro",
"/var/log/containers/heat:/var/log/heat",
"/var/lib/kolla/config_files/heat_engine.json:/var/lib/kolla/config_files/config.json:ro",
"/var/lib/config-data/puppet-generated/heat/:/var/lib/kolla/config_files/src:ro"
],
"net": "host",
"privileged": false,
"restart": "always"
}
}
[root@overcloud-controller-0 tripleo-config]# paunch --debug apply --file /tmp/heat_startup_config.json --config-id tripleo_step4 --managed-by tripleo-Controller
stdout: dd60546daddd06753da445fd973e52411d0a9031c8758f4bebc6e094823a8b45

stderr:
[root@overcloud-controller-0 tripleo-config]# docker ps | grep heat
dd60546daddd 192.168.24.1:8787/tripleomaster/centos-binary-heat-engine:current-tripleo "kolla_start" 9 seconds ago Up 9 seconds (health: starting) heat_engine

 

 

Containerized services, logging

There are a couple of ways to access the container logs:

  • On the host filesystem, the container logs are persisted under /var/log/containers/<service>
  • docker logs <container id or name>
It is also often useful to use docker inspect <container id or name> to verify the container configuration, e.g the image in use and the mounted volumes etc.
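For example, using the heat_engine container from earlier in this post:

[root@overcloud-controller-0 ~]# docker logs --tail 50 heat_engine
[root@overcloud-controller-0 ~]# docker inspect heat_engine | jq '.[0].HostConfig.Binds'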

 

Debugging containers directly

Sometimes logs are not enough to debug problems, and in this case you must interact with the container directly to diagnose the issue.

When a container is not restarting, you can attach a shell to the running container via docker exec:


[root@openstack-controller-0 ~]# docker exec -ti heat_engine /bin/bash
()[heat@openstack-controller-0 /]$ ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_start
5 ? Ss 1:50 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
25 ? S 3:05 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
26 ? S 3:06 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
27 ? S 3:06 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
28 ? S 3:05 /usr/bin/python /usr/bin/heat-engine --config-file /usr/share/heat/heat-dist.conf --config-file /etc/heat/heat
2936 ? Ss 0:00 /bin/bash
2946 ? R+ 0:00 ps ax

 

That's all for today, for more information please refer to tripleo-docs, or feel free to ask questions in #tripleo on Freenode!

by Unknown (noreply@blogger.com) at June 04, 2018 05:09 PM

June 01, 2018

James Slagle

TripleO and Ansible: config-download with Ansible Tower (part 3)

In my 2 previous posts, I’ve talked about TripleO’s config-download. If you
haven’t had a chance to read those yet, I suggest checking them out here and
here.

One of the nice things about config-download is that it integrates nicely with
other Ansible based tooling. In particular, Ansible Tower (or Ansible AWX) can
be used to drive applying the overcloud configuration. For users and
operators who are already familiar with Tower, this provides a nice way to
manage and report on the overcloud deployment status with TripleO and Tower.

At a high level, this integration is broken down into the following steps on
the TripleO undercloud:

  1. Create the Heat stack
  2. Run openstack overcloud config download to download the ansible
    playbooks from Heat
  3. Run tripleo-ansible-inventory to create the Ansible inventory file
  4. Since Ansible Tower uses git or other SCMs to synchronize and manage
    Ansible project directories, we create a git repo from the config-download
    directory on the undercloud (see the sketch after this list).
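Step 4 above is plain git; a minimal sketch, assuming the config-download output landed in ~/config-download (the directory name is illustrative):

(undercloud) [stack@undercloud ~]$ cd ~/config-download
(undercloud) [stack@undercloud config-download]$ git init .
(undercloud) [stack@undercloud config-download]$ git add -A
(undercloud) [stack@undercloud config-download]$ git commit -m "config-download playbooks for Ansible Tower"

Tower's SCM credential and project (created in the next steps) then point at this repository, typically over ssh.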

Switching over to Ansible Tower, we then:

  1. Create an organization
  2. Create SCM (git) credentials and machine credentials
  3. Create the Ansible project, pointing it at the git repository we created on
    the undercloud
  4. Create the inventory and inventory source, pointing it at the inventory file
    within the project directory we created with tripleo-ansible-inventory.
  5. Create a Job Template to run deploy_steps_playbook.yaml from the project
  6. Launch the Job Template

When the job finishes, we have a deployed and configured overcloud ready for
use by tenants.

Here’s a video of the demo showing the above steps:

https://slagle.fedorapeople.org/tripleo-config-download-ansible-tower.mp4

Of course, we wouldn’t want to manually go through those steps every time. We
can instead automate them with an ansible playbook, and then execute the
playbook from the undercloud, or a different management node. An example
playbook that automates all the steps above can be seen here:

https://github.com/slagle/tripleo-config-download-ansible-tower/blob/master/config-download.yaml

by James Slagle at June 01, 2018 09:51 PM

March 19, 2018

Giulio Fidente

Ceph integration topics at OpenStack PTG

I wanted to share a short summary of the discussions that happened around the Ceph integration (in TripleO) at the OpenStack PTG.

ceph-{container,ansible} branching

Together with John Fulton and Guillaume Abrioux (and after PTG, Sebastien Han) we put some thought into how to make the Ceph container images and ceph-ansible releases fit the OpenStack model better; the container images and ceph-ansible are in fact loosely coupled (not all versions of the container images work with all versions of ceph-ansible) and we wanted to move from a "rolling release" to a "point release" approach, mainly to permit regular maintenance of the previous versions known to work with the previous OpenStack versions. The plan goes more or less as follows:

  • ceph-{container,ansible} should be released together with the regular ceph updates
  • ceph-container will start using tags and stable branches like ceph-ansible does

The changes for the ceph/daemon docker images are visible already: https://hub.docker.com/r/ceph/daemon/tags/

Multiple Ceph clusters

In an attempt to better support the "edge computing" use case, we discussed adding support for the deployment of multiple Ceph clusters in the overcloud.

Together with John Fulton and Steven Hardy (and after PTG, Gregory Charot) we realized this could be done using multiple stacks, and by doing so hopefully simplify management of the "cells" and avoid potential issues due to orchestration of large clusters.

Much of this will build on Shardy's blueprint to split the control plane, see spec at: https://review.openstack.org/#/c/523459/

The multiple Ceph clusters specifics will be tracked via another blueprint: https://blueprints.launchpad.net/tripleo/+spec/deploy-multiple-ceph-clusters

ceph-ansible testing with TripleO

We had a very good chat with John Fulton, Guillaume Abrioux, Wesley Hayutin and Javier Pena on how to get new pull requests for ceph-ansible tested with TripleO; basically, trigger an existing TripleO scenario on changes proposed to ceph-ansible.

Given ceph-ansible is hosted on GitHub, Wesley and Javier suggested this should be possible with Zuul v3 and volunteered to help; some of the complications are about building an RPM from uncommitted changes for testing.

Move ceph-ansible triggering from workflow_tasks to external_deploy_tasks

This is a requirement for the Rocky release; we want to migrate away from using workflow_tasks and use external_deploy_tasks instead, to integrate into the "config-download" mechanism.

This work is tracked via a blueprint and we have a WIP submission on review: https://blueprints.launchpad.net/tripleo/+spec/ceph-ansible-external-deploy-tasks

We're also working with Sofer Athlan-Guyot on the enablement of Ceph in the upgrade CI jobs and with Tom Barron on scenario004 to deploy Manila with Ganesha (and CephFS) instead of the CephFS native backend.

Hopefully I didn't forget much; to stay updated on the progress join #tripleo on freenode or check our integration squad status at: https://etherpad.openstack.org/p/tripleo-integration-squad-status

by Giulio Fidente at March 19, 2018 02:32 AM

February 09, 2018

Steven Hardy

Debugging TripleO revisited - Heat, Ansible & Puppet

Some time ago I wrote a post about debugging TripleO heat templates, which contained some details of possible debug workflows when TripleO deployments fail.

In recent releases (since the Pike release) we've made some major changes to the TripleO architecture - we make more use of Ansible "under the hood", and we now support deploying containerized environments.  I described some of these architectural changes in a talk at the recent OpenStack Summit in Sydney.

In this post I'd like to provide a refreshed tutorial on typical debug workflow, primarily focussing on the configuration phase of a typical TripleO deployment, and with particular focus on interfaces which have changed or are new since my original debugging post.

We'll start by looking at the deploy workflow as a whole, and some heat interfaces for diagnosing the nature of the failure, then we'll look at how to debug directly via Ansible and Puppet.  In a future post I'll also cover the basics of debugging containerized deployments.

The TripleO deploy workflow, overview

A typical TripleO deployment consists of several discrete phases, which are run in order:

Provisioning of the nodes


  1. A "plan" is created (heat templates and other files are uploaded to Swift running on the undercloud
  2. Some validation checks are performed by Mistral/Heat then a Heat stack create is started (by Mistral on the undercloud)
  3. Heat creates some groups of nodes (one group per TripleO role e.g "Controller"), which results in API calls to Nova
  4. Nova makes scheduling/placement decisions based on your flavors (which can be different per role), and calls Ironic to provision the baremetal nodes
  5. The nodes are provisioned by Ironic

This first phase is the provisioning workflow; after it is complete, the nodes are reported ACTIVE by nova (e.g the nodes are provisioned with an OS and running).
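A couple of quick checks from the undercloud confirm this phase completed:

(undercloud) [stack@undercloud ~]$ openstack server list          # overcloud nodes should show ACTIVE
(undercloud) [stack@undercloud ~]$ openstack baremetal node list  # provisioning state should be 'active'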

Host preparation

The next step is to configure the nodes in preparation for starting the services, which again has a specific workflow (some optional steps are omitted for clarity; a short inspection sketch follows the list):

  1. The node networking is configured, via the os-net-config tool
  2. We write hieradata for puppet to the node filesystem (under /etc/puppet/hieradata/*)
  3. We write some data files to the node filesystem (a puppet manifest for baremetal configuration, and some json files that are used for container configuration)
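The results of these host-preparation steps can be inspected directly on a node, for example (the same directories are used in the debugging examples later in this post):

[heat-admin@overcloud-controller-0 ~]$ sudo ls /etc/puppet/hieradata/
[heat-admin@overcloud-controller-0 ~]$ sudo ls /var/lib/tripleo-config/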

Service deployment, step-by-step configuration

The final step is to deploy the services, either on the baremetal host or in containers, this consists of several tasks run in a specific order:

  1. We run puppet on the baremetal host (even in the containerized architecture this is still needed, e.g to configure the docker daemon and a few other things)
  2. We run "docker-puppet.py" to generate the configuration files for each enabled service (this only happens once, on step 1, for all services)
  3. We start any containers enabled for this step via the "paunch" tool, which translates some json files into running docker containers, and optionally does some bootstrapping tasks.
  4. We run docker-puppet.py again (with a different configuration, only on one node, the "bootstrap host"); this does some bootstrap tasks that are performed via puppet, such as creating keystone users and endpoints after starting the service.

Note that these steps are performed repeatedly with an incrementing step value (e.g step 1, 2, 3, 4, and 5), with the exception of the "docker-puppet.py" config generation which we only need to do once (we just generate the configs for all services regardless of which step they get started in).

Below is a diagram which illustrates this step-by-step deployment workflow:
TripleO Service configuration workflow

The most common deployment failures occur during this service configuration phase of deployment, so the remainder of this post will primarily focus on debugging failures of the deployment steps.

 

Debugging first steps - what failed?

Heat Stack create failed.
 

Ok something failed during your TripleO deployment, it happens to all of us sometimes!  The next step is to understand the root-cause.

My starting point after this is always to run:

openstack stack failures list --long <stackname>

(undercloud) [stack@undercloud ~]$ openstack stack failures list --long overcloud
overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0:
resource_type: OS::Heat::StructuredDeployment
physical_resource_id: 421c7860-dd7d-47bd-9e12-de0008a4c106
status: CREATE_FAILED
status_reason: |
Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
deploy_stdout: |

PLAY [localhost] ***************************************************************

...

TASK [Run puppet host configuration for step 1] ********************************
ok: [localhost]

TASK [debug] *******************************************************************
fatal: [localhost]: FAILED! => {
"changed": false,
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"Debug: Runtime environment: puppet_version=4.8.2, ruby_version=2.0.0, run_mode=user, default_encoding=UTF-8",
"Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"
]
}
to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/8dd0b23a-acb8-4e11-aef7-12ea1d4cf038_playbook.retry

PLAY RECAP *********************************************************************
localhost : ok=18 changed=12 unreachable=0 failed=1
 

We can tell several things from the output (which has been edited above for brevity), firstly the name of the failing resource

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.0
  • The error was on one of the Controllers (ControllerDeployment)
  • The deployment failed during the per-step service configuration phase (the AllNodesDeploySteps part tells us this)
  • The failure was during the first step (Step1.0)
Then we see more clues in the deploy_stdout: ansible failed running the task which runs puppet on the host, and it looks like a problem with the puppet code.

With a little more digging we can see which node exactly this failure relates to, e.g we copy the SoftwareDeployment ID from the output above, then run:

(undercloud) [stack@undercloud ~]$ openstack software deployment show 421c7860-dd7d-47bd-9e12-de0008a4c106 --format value --column server_id
29b3c254-5270-42ae-8150-9fc3f67d3d89
(undercloud) [stack@undercloud ~]$ openstack server list | grep 29b3c254-5270-42ae-8150-9fc3f67d3d89
| 29b3c254-5270-42ae-8150-9fc3f67d3d89 | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | oooq_control |
 

Ok so puppet failed while running via ansible on overcloud-controller-0.

 

Debugging via Ansible directly

Having identified that the problem was during the ansible-driven configuration phase, one option is to re-run the same configuration directly via ansible-playbook, so you can either increase verbosity or potentially modify the tasks to debug the problem.

Since the Queens release, this is actually very easy, using a combination of the new "openstack overcloud config download" command and the tripleo dynamic ansible inventory.

(undercloud) [stack@undercloud ~]$ openstack overcloud config download
The TripleO configuration has been successfully generated into: /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud ~]$ cd /home/stack/tripleo-VOVet0-config
(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ls
common_deploy_steps_tasks.yaml external_post_deploy_steps_tasks.yaml templates
Compute global_vars.yaml update_steps_playbook.yaml
Controller group_vars update_steps_tasks.yaml
deploy_steps_playbook.yaml post_upgrade_steps_playbook.yaml upgrade_steps_playbook.yaml
external_deploy_steps_tasks.yaml post_upgrade_steps_tasks.yaml upgrade_steps_tasks.yaml
 

Here we can see there is a "deploy_steps_playbook.yaml", which is the entry point to run the ansible service configuration steps.  This runs all the common deployment tasks (as outlined above) as well as any service specific tasks (these end up in task include files in the per-role directories, e.g Controller and Compute in this example).

We can run the playbook again on all nodes with the tripleo-ansible-inventory from tripleo-validations, which is installed by default on the undercloud:

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0
...
TASK [Run puppet host configuration for step 1] ********************************************************************
ok: [192.168.24.6]

TASK [debug] *******************************************************************************************************
fatal: [192.168.24.6]: FAILED! => {
"changed": false,
"failed_when_result": true,
"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
"Notice: hiera(): Cannot load backend module_data: cannot load such file -- hiera/backend/module_data_backend",
"exception: connect failed",
"Warning: Undefined variable '::deploy_config_name'; ",
" (file & line not available)",
"Warning: Undefined variable 'deploy_config_name'; ",
"Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile
/base/docker.pp:181:5 on node overcloud-controller-0.localdomain"

]
}

NO MORE HOSTS LEFT *************************************************************************************************
to retry, use: --limit @/home/stack/tripleo-VOVet0-config/deploy_steps_playbook.retry

PLAY RECAP *********************************************************************************************************
192.168.24.6 : ok=56 changed=2 unreachable=0 failed=1
 

Here we can see the same error is reproduced directly via ansible, and we made use of the --limit option to only run tasks on the overcloud-controller-0 node.  We could also have added --tags to limit the tasks further (see tripleo-heat-templates for which tags are supported).
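For example (the tag name below is purely illustrative - check the tasks in tripleo-heat-templates for the tags that exist in your release):

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ansible-playbook -i /usr/bin/tripleo-ansible-inventory deploy_steps_playbook.yaml --limit overcloud-controller-0 --tags host_config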

If the error were ansible related, this would be a good way to debug and test any potential fixes to the ansible tasks, and in the upcoming Rocky release there are plans to switch to this model of deployment by default.

 

Debugging via Puppet directly

Since this error seems to be puppet related, the next step is to reproduce it on the host (obviously the steps above often yield enough information to identify the puppet error, but this assumes you need to do more detailed debugging directly via puppet):

Firstly we log on to the node, and look at the files in the /var/lib/tripleo-config directory.

(undercloud) [stack@undercloud tripleo-VOVet0-config]$ ssh heat-admin@192.168.24.6
Warning: Permanently added '192.168.24.6' (ECDSA) to the list of known hosts.
Last login: Fri Feb 9 14:30:02 2018 from gateway
[heat-admin@overcloud-controller-0 ~]$ cd /var/lib/tripleo-config/
[heat-admin@overcloud-controller-0 tripleo-config]$ ls
docker-container-startup-config-step_1.json docker-container-startup-config-step_4.json puppet_step_config.pp
docker-container-startup-config-step_2.json docker-container-startup-config-step_5.json
docker-container-startup-config-step_3.json docker-container-startup-config-step_6.json
 

The puppet_step_config.pp file is the manifest applied by ansible on the baremetal host

We can debug any puppet host configuration by running puppet apply manually. Note that hiera is used to control the step value, this will be at the same value as the failing step, but it can also be useful sometimes to manually modify this for development testing of different steps for a particular service.

[root@overcloud-controller-0 tripleo-config]# hiera -c /etc/puppet/hiera.yaml step
1
[root@overcloud-controller-0 tripleo-config]# cat /etc/puppet/hieradata/config_step.json
{"step": 1}[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp
...
Error: Evaluation Error: Error while evaluating a Resource Statement, Unknown resource type: 'ugeas' at /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp:181:5 on node overcloud-controller-0.localdomain
 

Here we can see the problem is a typo in the /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp file at line 181, I look at the file, fix the problem (ugeas should be augeas) then re-run puppet apply to confirm the fix.
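In this contrived example the fix is a one-liner, e.g:

[root@overcloud-controller-0 tripleo-config]# sed -i 's/ugeas/augeas/' /etc/puppet/modules/tripleo/manifests/profile/base/docker.pp
[root@overcloud-controller-0 tripleo-config]# puppet apply --debug puppet_step_config.pp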

Note that with puppet module fixes you will need to get the fix either into an updated overcloud image, or update the module via deploy artifacts for testing local forks of the modules.

That's all for today, but in a future post, I will cover the new container architecture, and share some debugging approaches I have found helpful when deployment failures are container related.

by Unknown (noreply@blogger.com) at February 09, 2018 05:04 PM

December 11, 2017

James Slagle

TripleO and Ansible deployment (Part 1)

In the Queens release of TripleO, you’ll be able to use Ansible to apply the
software deployment and configuration of an Overcloud.

Before jumping into some of the technical details, I wanted to cover some
background about how the Ansible integration works along side some of the
existing tools in TripleO.

The Ansible integration goes as far as offering an alternative to the
communication between the existing Heat agent (os-collect-config) and the Heat
API. This alternative is opt-in for Queens, but we are exploring making it the
default behavior for future releases.

The default behavior for Queens (and all prior releases) will still use the
model where each Overcloud node has a daemon agent called os-collect-config
that periodically polls the Heat API for deployment and configuration data.
When Heat provides updated data, the agent applies the deployments, making
changes to the local node such as configuration, service management,
pulling/starting containers, etc.

The Ansible alternative instead uses a “control” node (the Undercloud) running
ansible-playbook with a local inventory file and pushes out all the changes to
each Overcloud node via ssh in the typical Ansible fashion.

Heat is still the primary API, while the parameter and environment files that
get passed to Heat to create an Overcloud stack remain the same regardless of
which method is used.

Heat is also still fully responsible for creating and orchestrating all
OpenStack resources in the services running on the Undercloud (Nova servers,
Neutron networks, etc).

This sequence diagram will hopefully provide a clear picture:
https://slagle.fedorapeople.org/tripleo-ansible-arch.png

Replacing the application and transport layer of the deployment with Ansible
allows us to take advantage of features in Ansible that will hopefully make
deploying and troubleshooting TripleO easier:

  • Running only specific deployments
  • Including/excluding specific nodes or roles from an update
  • More real time sequential output of the deployment
  • More robust error reporting
  • Faster iteration and reproduction of deployments

Using Ansible instead of the Heat agent is easy. Just include 2 extra cli args
in the deployment command:

-e /path/to/templates/environments/config-download-environment.yaml \
--config-download
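Putting it together, a deployment command might look something like this (a sketch - it assumes the default templates location, with your usual environment files in place of the placeholder):

openstack overcloud deploy --templates \
  -e <your other environment files> \
  -e /usr/share/openstack-tripleo-heat-templates/environments/config-download-environment.yaml \
  --config-download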

Once Heat is done creating the stack (will be much faster than usual), a
separate Mistral workflow will be triggered that runs ansible-playbook to
finish the deployment. The output from ansible-playbook will be streamed to
stdout so you can follow along with the progress.

Here’s a demo showing what a stack update looks like:

(I suggest making the demo full screen, or watch it here: https://slagle.fedorapeople.org/tripleo-ansible-deployment-1.mp4)

Note that we don’t get color output from ansible-playbook since we are
consuming the stdout from a Zaqar queue. However, in my next post I will go
into how to execute ansible-playbook manually, and detail all of the related
files (inventory, playbooks, etc) that are available to interact with manually.

If you want to read ahead, have a look at the official documentation:
https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html

 

by James Slagle at December 11, 2017 03:19 PM

July 07, 2017

Julie Pichon

TripleO Deep Dive: Internationalisation in the UI

Yesterday, as part of the TripleO Deep Dives series I gave a short introduction to internationalisation in TripleO UI: the technical aspects of it, as well as a quick overview of how we work with the I18n team.

You can catch the recording on BlueJeans or YouTube, and below's a transcript.

~

Life and Journey of a String

Internationalisation was added to the UI during Ocata - just a release ago. Florian implemented most of it and did the lion's share of the work, as can be seen on the blueprint if you're curious about the nitty-gritty details.

Addition to the codebase

Here's an example patch from during the transition. On the left you can see how things were hard-coded, and on the right you can see the new defineMessages() interface we now use. Obviously new patches should directly look like on the right hand-side nowadays.

The defineMessages() dictionary requires a unique id and default English string for every message. Optionally, you can also provide a description if you think there could be confusion or to clarify the meaning. The description will be shown in Zanata to the translators - remember they see no other context, only the string itself.

For example, a string might sound active like if it were related to an action/button but actually be a descriptive help string. Or some expressions are known to be confusing in English - "provide a node" has been the source of multiple discussions on list and live so might as well pre-empt questions and offer additional context to help the translators decide on an appropriate translation.

Extraction & conversion

Now we know how to add an internationalised string to the codebase - how do these get extracted into a file that will be uploaded to Zanata?

All of the following steps are described in the translation documentation in the tripleo-ui repository. Assuming you've already run the installation steps (basically, npm install):

$ npm run build

This does a lot more than just extracting strings - it prepares the code for being deployed in production. Once this ends you'll be able to find your newly extracted messages under the i18n directory:

$ ls i18n/extracted-messages/src/js/components

You can see the directory structure is kept the same as the source code. And if you peek into one of the files, you'll note the content is basically the same as what we had in our defineMessages() dictionary:

$ cat i18n/extracted-messages/src/js/components/Login.json
[
  {
    "id": "UserAuthenticator.authenticating",
    "defaultMessage": "Authenticating..."
  },
  {
    "id": "Login.username",
    "defaultMessage": "Username"
  },
  {
    "id": "Login.usernameRequired",
    "defaultMessage": "Username is required."
  },
[...]

However, JSON is not a format that Zanata understands by default. I think the latest version we upgraded to, or the next one might have some support for it, but since there's no i18n JSON standard it's somewhat limited. In open-source software projects, po/pot files are generally the standard to go with.

$ npm run json2pot

> tripleo-ui@7.1.0 json2pot /home/jpichon/devel/tripleo-ui
> rip json2pot ./i18n/extracted-messages/**/*.json -o ./i18n/messages.pot

> [react-intl-po] write file -> ./i18n/messages.pot ✔️

$ cat i18n/messages.pot
msgid ""
msgstr ""
"POT-Creation-Date: 2017-07-07T09:14:10.098Z\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"MIME-Version: 1.0\n"
"X-Generator: react-intl-po\n"


#: ./i18n/extracted-messages/src/js/components/nodes/RegisterNodesDialog.json
#. [RegisterNodesDialog.noNodesToRegister] - undefined
msgid ""No Nodes To Register""
msgstr ""

#: ./i18n/extracted-messages/src/js/components/nodes/NodesToolbar/NodesToolbar.json
#. [Toolbar.activeFilters] - undefined
#: ./i18n/extracted-messages/src/js/components/validations/ValidationsToolbar.json
#. [Toolbar.activeFilters] - undefined
msgid "Active Filters:"
msgstr ""

#: ./i18n/extracted-messages/src/js/components/nodes/RegisterNodesDialog.json
#. [RegisterNodesDialog.addNew] - Small button, to add a new Node
msgid "Add New"
msgstr ""

#: ./i18n/extracted-messages/src/js/components/plan/PlanFormTabs.json
#. [PlanFormTabs.addPlanName] - Tooltip for "Plan Name" form field
msgid "Add a Plan Name"
msgstr ""
[...]

This messages.pot file is what will be automatically uploaded to Zanata.

Infra: from the git repo, to Zanata

The following steps are done by the infrastructure scripts. There's infra documentation on how to enable translations for your project, in our case as the first internationalised JavaScript project we had to update the scripts a little as well. This is useful to know if an issue happens with the infra jobs; debugging will probably bring you here.

The scripts live in the project-config infra repo and there are three files of interest for us: upstream_translation_update.sh, common_translations_update.sh and propose_translation_update.sh.

In this case, upstream_translation_update.sh is the file of interest to us: it simply sets up the project on line 76, then sends the pot file up to Zanata on line 115.

What does "setting up the project" entails? It's a function in common_translations_update.sh, that pretty much runs the steps we talked about in the previous section, and also creates a config file to talk to Zanata.

Monitoring the post jobs

Post jobs run after a patch has already merged - usually to upload tarballs where they should be, update the documentation pages, etc, and also upload messages catalogues onto Zanata. Being a 'post' job however means that if something goes wrong, there is no notification on the original review so it's easy to miss.

Here's the OpenStack Health page to monitor 'post' jobs related to tripleo-ui. Scroll to the bottom - hopefully tripleo-ui-upstream-translation-update is still green! It's good to keep an eye on it although it's easy to forget. Thankfully, AJaeger from #openstack-infra has been great at filing bugs and letting us know when something does go wrong.

Debugging when things go wrong: an example

We had a couple of issues whereby a linebreak gets introduced into one of the strings, which works fine in JSON but breaks our pot file. If you look at the content from the bug (the full logs are no longer accessible):

2017-03-16 12:55:13.468428 | + zanata-cli -B -e push --copy-trans False
[...]
2017-03-16 12:55:15.391220 | [INFO] Found source documents:
2017-03-16 12:55:15.391405 | [INFO]            i18n/messages
2017-03-16 12:55:15.531164 | [ERROR] Operation failed: missing end-quote

You'll notice the first line is the last function we call in the upstream_translation_update.sh script; for debugging that gives you an idea of the steps to follow to reproduce. The upstream Zanata instance also lets you create toy projects, if you want to test uploads yourself (this can't be done directly on the OpenStack Zanata instance.)

This particular newline issue has popped up a couple of times already. We're treating it with band-aids at the moment, ideally we'd get a proper test on the gate to prevent it from happening again: this is why this bug is still open. I'm not very familiar with JavaScript testing and haven't had a chance to look into it yet; if you'd like to give it a shot that'd be a useful contribution :)

Zanata, and contributing translations

The OpenStack Zanata instance lives at https://translate.openstack.org. This is where the translators do their work. Here's the page for tripleo-ui, you can see there is one project per branch (stable/ocata and master, for now). Sort by "Percent Translated" to see the languages currently translated. Here's an example of the translator's view, for Spanish: you can see the English string on the left, and the translator fills in the right side. No context! Just strings.

At this stage of the release cycle, the focus would be on 'master,' although it is still early to do translations; there is a lot of churn still.

If you'd like to contribute translations, the I18n team has good documentation about how to go about how to do it. The short version: sign up on Zanata, request to join your language team, once you're approved - you're good to go!

Return of the string

Now that we have our strings available in multiple languages, it's time for another infra job to kick in and bring them into our repository. This is where propose_translation_update.sh comes in. We pull the po files from Zanata, convert them to JSON, then do a git commit that will be proposed to Gerrit.
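Roughly, the local equivalent of what this job does is (a hedged sketch - it assumes a working zanata-cli/zanata.xml setup, which the infra scripts handle for you):

$ zanata-cli pull        # fetch the translated po files from Zanata
$ npm run po2json        # convert them back into the JSON files the UI consumes
$ git add i18n/ && git commit -m "Imported Translations from Zanata"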

The cleanup step does more than it might seem. It checks if files are translated over a certain ratio (~75% for code), which avoids adding new languages when there might only be one or two words translated (e.g. someone just testing Zanata to see how it works). Switching to your language and yet having the vast majority of the UI still appear in English is not a great user experience.

In theory, files that were added but are now below 40% should get automatically removed, however this doesn't quite work for JavaScript at the moment - another opportunity to help! Manual cleanups can be done in the meantime, but it's a rare event so not a major issue.

Monitoring the periodic jobs

Zanata is checked once a day every morning, there is an OpenStack Health page for this as well. You can see there are two jobs at the moment (hopefully green!), one per branch: tripleo-ui-propose-translation-update and tripleo-ui-propose-translation-update-ocata. The job should run every day even if there are no updates - it simply means there might not be a git review proposed at the end.

We haven't had issues with the periodic job so far, though the debugging process would be the same: figure out based on the failure if it is happening at the infra script stage or in one of our commands (e.g. npm run po2json), try to reproduce and fix. I'm sure super-helpful AJaeger would also let us know if he were to notice an issue here.

Automated patches

You may have seen the automated translations updates pop up on Gerrit. The commit message has some tips on how to review these: basically don't agonise over the translation contents as problems there should be handled in Zanata anyway, just make sure the format looks good and is unlikely to break the code. A JSON validation tool runs during the infra prep step in order to "prettify" the JSON blob and limit the size of the diffs, therefore once the patch  makes it out to Gerrit we know the JSON is well-formed at least.

Try to review these patches quickly to respect the translators' work. It's not very nice to spend a lot of time translating a project and yet not have your work included because no one could be bothered to merge it :)

A note about new languages...

If the automated patch adds a new language, there'll be an additional step required after merging the translations in order to enable it: adding a string with the language name to a constants file. Until recently, this took 3 or 4 steps - thanks to Honza for making it much simpler!

This concludes the technical journey of a string. If you'd like to help with i18n tasks, we have a few related bugs open. They go from very simple low-hanging-fruits you could use to make your first contribution to the UI, to weird buttons that have translations available yet show in English but only in certain modals, to the kind of CI resiliency tasks I linked to earlier. Something for everyone! ;)

Working with the I18n team

It's really all about communication. Starting with...

Release schedule and string freezes

String freezes are noted on the main schedule but tend to fit the regular cycle-with-milestones work. This is a problem for a cycle-trailing project like tripleo-ui as we could be implementing features up to 2 weeks after the other projects, so we can't freeze strings that early.

There were discussions at the Atlanta PTG around whether the I18n team should care at all about projects that don't respect the freeze deadlines. That would have made it impossible for projects like ours to ever make it onto the I18n official radar. The compromise was that cycle-trailing projects should have an I18n cross-project liaison who communicates with the I18n PTL and team to inform them of deadlines, and also to ignore Soft Freeze and only do a Hard Freeze.

This will all be documented under an i18n governance tag; while waiting for it the notes from the sessions are available for the curious!

What's a String Freeze again?

The two are defined on the schedule: soft freeze means not allowing changes to strings, as it invalidates the translator's work and forces them to retranslate; hard freeze means no additions, changes or anything else in order to give translators a chance to catch up.

When we looked at Zanata earlier, there were translation percentages beside each language: the goal is always the satisfaction of reaching 100%. If we keep adding new strings then the goalpost keeps moving, which is discouraging and unfair.

Of course there's also an "exception process" when needed, to ask for permission to merge a string change with an explanation or at least a heads-up, by sending an email to the openstack-i18n mailing list. Not to be abused :)

Role of the I18n liaison

...Liaise?! Haha. The role is defined briefly on the Cross-Projects Liaison wiki page. It's much more important toward the end of the cycle, when the codebase starts to stabilise, there are fewer changes and translators look at starting their work to be included in the release.

In general it's good to hang out on the #openstack-i18n IRC channel (very low traffic), attend the weekly meeting (it alternates times), be available to answer questions, and keep the PTL informed of the I18n status of the project. In the case of cycle-trailing projects (quite a new release model still), it's also important to be around to explain the deadlines.

A couple of examples having an active liaison helps with:

  • Toward the end or after the release, once translations into the stable branch have settled, the stable translations get copied into the master branch on Zanata. The strings should still be fairly similar at that point and it avoids translators having to re-do the work. It's a manual process, so you need to let the I18n PTL know when there are no longer changes to stable/*.
  • Last cycle, because the cycle-trailing status of tripleo-ui was not correctly documented, a Zanata upgrade was planned right after the main release - which for us ended up being right when the codebase had stabilised enough and several translators had planned to be most active. Would have been solved with better, earlier communication :)

Post-release

After the Ocata release, I sent a few screenshots of tripleo-ui to the i18n list so translators could see the result of their work. I don't know if anybody cared :-) But unlike Horizon, which has an informal test system available for translators to check their strings during the RC period, most of the people who volunteered translations had no idea what the UI looked like. It'd be cool if we could offer a test system with regular string updates next release - maybe just an undercloud on the new RDO cloud? Deployment success/failures strings wouldn't be verifiable but the rest would, while the system would be easier to maintain than a full dev TripleO environment - better than nothing. Perhaps an idea for the Queens cycle!

The I18n team has a priority board on the Zanata main page (only visible when logged in I think). I'm grateful to see TripleO UI in there! :) Realistically we'll never move past Low or perhaps Medium priority which is fair, as TripleO doesn't have the same kind of reach or visibility that Horizon or the installation guides do. I'm happy that we're included! The OpenStack I18n team is probably the most volunteer-driven team in OpenStack. Let's be kind, respect string freezes and translators' time! \o/

</braindump>

by jpichon at July 07, 2017 11:45 AM

March 02, 2017

Julie Pichon

OpenStack Pike PTG: TripleO, TripleO UI | Some highlights

For the second part of the PTG (vertical projects), I mainly stayed in the TripleO room, moving around a couple of times to attend cross-project sessions related to i18n.

Although I always wish I understood more/everything, in the end my areas of interest (and current understanding!) in TripleO are around the UI, installing and configuring it, the TripleO CLI, and the tripleo-common Mistral workflows. Therefore the couple of thoughts in this post are mainly relevant to these - if you're looking for a more exhaustive summary of the TripleO discussions and decisions made during the PTG, I recommend reading the PTL's excellent thread about this on the dev list, and the associated etherpads.

Random points of interest

  • Containers is the big topic and had multiple sessions dedicated to it, both single and cross-projects. Many other sessions ended up revisiting the subject as well, sometimes with "oh that'll be solved with containers" and sometimes with "hm good but that won't work with containers."
  • A couple of API-breaking changes may need to happen in Tripleo Heat Templates (e.g. for NFV, passing a role mapping vs a role name around). The recommendation is to get this in as early as possible (by the first milestone) and communicate it well for out of tree services.
  • When needing to test something new on the CI, look at the existing scenarios and prioritise adding/changing something there to test for what you need, as opposed to trying to create a brand new job.
  • Running Mistral workflows as part of or after the deployment came up several times and was even a topic during a cross-project Heat / Mistral / TripleO sessions. Things can get messy, switching between Heat, Mistral and Puppet. Where should these workflows live (THT, tripleo-common)? Service-specific workflows (pre/post-deploy) are definitely something people want and there's a need to standardise how to do that. Ceph's likely to be the first to try their hand at this.
  • One lively cross-project session with OpenStack Ansible and Kolla was about parameters in configuration files. Currently whenever a new feature is added to Nova or whatever service, Puppet and so on need to be updated manually. The proposal is to make a small change to oslo.config to enable it to give an output in machine-readable YAML which can then be consumed (currently the config generated is only human readable). This will help with validations, and it may help to only have to maintain a structure as opposed to a template.
  • Heat folks had a feedback session with us about the TripleO needs. They've been super helpful with e.g. helping to improve our memory usage over the last couple of cycles. My takeaway from this session was "beware/avoid using YAQL, especially in nested stacks." YAQL is badly documented and everyone ends up reading the source code and tests to figure out how to do things. Bringing Jinja2 into Heat or some kind of way to have repeated patterns from resources (e.g. based on a file) also came up and was cautiously acknowledged.
  • Predictable IP assignment on the control plane is a big enough issue that some people are suggesting dropping Neutron in the undercloud over it. We'd lose so many other benefits though, that it seems unlikely to happen.
  • Cool work incoming allowing built-in network examples to Just Work, based on a sample configuration. Getting the networking stuff right is a huge pain point and I'm excited to hear this should be achievable within Pike.

Python 3

Python 3 is an OpenStack community goal for Pike.

Tripleo-common and python-tripleoclient both have voting unit test jobs for Python 3.5, though I trust them only moderately for a number of reasons. For example many of the tests tend to focus on the happy path, and I've seen and fixed Python 3 incompatible code in exceptions several times (the missing 'message' attribute is an easy one to get caught by), despite the unit testing jobs being all green. Apparently there are coverage jobs we could enable for the client, to ensure the coverage ratio doesn't drop.

Python 3 for functional tests was also brought up. We don't have functional tests in any of our projects and it's not clear the value we would get out of it (mocking servers) compared to the unit testing and all the integration testing we already do. Increasing unit test coverage was deemed a more valuable goal to pursue for now.

There are other issues around functional/integration testing with Python 3 which will need to be resolved (though likely not during Pike). For example our integration jobs run on CentOS and use packages, which won't be Python 3 compatible yet (cue SCL and the need to respin dependencies). If we do add functional tests, perhaps it would be easier to have them run on a Fedora gate (although if I recall correctly gating on Fedora was investigated once upon a time at the OpenStack level, but caused too many issues due to churn and the lack of long-term releases).

Another issue with Python 3 support and functional testing is that the TripleO client depends on Mistral server (due to the Series Of Unfortunate Dependencies I also mentioned in the last post). That means Mistral itself would need to fully support Python 3 as well.

Python 2 isn't going anywhere just yet so we still have time to figure things out. The conclusions, as described in Emilien's email seem to be:

  • Improve the unit test coverage
  • Enable the coverage job in CI
  • Investigate functional testing for python-tripleoclient to start with, see if it makes sense and is feasible

Sample environment generator

Currently environment files in THT are written by hand and quite inconsistent. This is also important for the UI, which needs to display this information. For example currently the environment general description is in a comment at the top of the file (if it exists at all), which can't be accessed programmatically. Dependencies between environment files are not described either.

To make up for this, currently all that information lives in the capabilities map but it's external to the template themselves, needs to be updated manually and gets out of sync easily.

The sample environment generator to fix this has been out there for a year, and currently has two blockers. First, it needs a way to determine which parameters are private (that is, parameters that are expected to be passed in by another template and shouldn't be set by the user).

One way could be to use a naming convention, perhaps an underscore prefix similar to Python. Parameter groups cannot be used because of a historical limitation, there can only be one group (so you couldn't be both Private and Deprecated). Changing Heat with a new feature like Nested Groups or generic Parameter Tags could be an option. The advantage of the naming convention is that it doesn't require any change to Heat.

From the UI perspective, validating if an environment or template is redefining parameters already defined elsewhere also matters. Because if it has the same name, then it needs to be set up with the same value everywhere or it's uncertain what the final value will end up as.

I think the second issue was that the general environment description can only be a comment at the moment, there is no Heat parameter to do this. The Heat experts in the room seemed confident this is non-controversial as a feature and should be easy to get in.

Once the existing templates are updated to match the new format, the validation should be added to CI to make sure that any new patch with environments does include these parameters. Having "description" show up as an empty string when generating a new environment will make it more obvious that something can/should be there, while it's easy to forget about it with the current situation.

The agreement was:

  • Use underscores as a naming convention to start with
  • Start with a comment for the general description

Once we get the new Heat description attribute we can move things around. If parameter tags get accepted, likewise we can automate moving things around. Tags would also be useful to the UI, to determine what subset of relevant parameters to display to the user in smaller forms (easier to understand that one form with several dozens of fields showing up all at once). Tags, rather than parameter groups are required because of the aforementioned issue: it's already used for deprecation and a parameter can only have one group.

Trusts and federation

This was a cross-project session together with Heat, Keystone and Mistral. A "Trust" lets you delegate or impersonate a user with a subset of their rights. From my experience in TripleO, this is particularly useful with long-running Heat stacks, as an authentication token expires after a few hours, which means you lose the ability to do anything in the middle of an operation.
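
For readers who haven't touched the API before, here is a rough sketch of what creating and then using a trust looks like with python-keystoneclient and keystoneauth1. Every endpoint, name and ID below is a placeholder, and this is a simplification of what Heat actually does internally.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from keystoneclient.v3 import client as ks_client

    # Authenticate as the user who wants to delegate (placeholder credentials).
    auth = v3.Password(auth_url='http://keystone:5000/v3',
                       username='deployer', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    sess = session.Session(auth=auth)
    keystone = ks_client.Client(session=sess)

    # Delegate a subset of the user's roles to a service user (e.g. heat).
    trust = keystone.trusts.create(
        trustor_user=sess.get_user_id(),
        trustee_user='HEAT_SERVICE_USER_ID',   # placeholder
        project=sess.get_project_id(),
        role_names=['admin'],
        impersonation=True)

    # Later on, the trustee authenticates with the trust instead of relying
    # on the original, short-lived user token.
    trust_auth = v3.Password(auth_url='http://keystone:5000/v3',
                             username='heat', password='service-secret',
                             user_domain_name='Default',
                             trust_id=trust.id)
    trust_sess = session.Session(auth=trust_auth)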

Trusts have been working very well for Heat since 2013. Before that, Heat had to encrypt and store the user's password in order to request a new token when needed, which everyone agreed was pretty horrible and nothing anyone wants to go back to. Unfortunately, with the federation model and external Identity Providers this is no longer possible: Trusts break, but some kind of delegation is still needed for Heat.

There were a lot of tangents about security issues (obviously!), revocation, expiration and role syncing. From what I understand, Keystone currently validates Trusts to make sure the user still has the requested permissions (that is, that the relevant role hasn't been removed in the meantime). There's a desire to have access to the entire role list, because the APIs currently don't let us check which role is necessary to perform a specific action. Additionally, when Mistral workflows launched from Heat get in, Mistral will create its own Trusts and Heat can't know what those will do. In the end you pretty much always end up needing to delegate everything. Karbor is running into this as well.

No solution was discovered during the session, but I think all sides were happy enough that the problem and use cases have been clearly laid out and are now understood.

TripleO UI

Some of the issues relevant to the UI were brought up in other sessions, like standardising the environment files. Other issues were around plan management: for example, why do we use the Mistral environment in addition to Swift? Historically, it looks like this was because it made a nice drop-in replacement for the defunct TripleO API and offered a similar interface. Although a plain file won't have an API by default, the suggestion is to move to storing the environment in a file during Pike and to have a consistent set of templates: this way all the information related to a deployment plan will live in the same place. This will help with exporting/importing plans, which in turn will help with CLI/UI interoperability (for instance, there are still some differences in how and when the Mistral environment is generated, depending on whether you deployed with the CLI or the UI).
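
To make the "environment as a file" idea more concrete, below is a minimal sketch with python-swiftclient of what writing the plan environment into the plan's Swift container could look like. The container name, file name and contents are placeholders, not whatever the spec will eventually settle on.

    import yaml
    from swiftclient import client as swift_client

    # Placeholder plan environment; the real contents would describe the
    # selected environments, parameter values and so on.
    plan_environment = {
        'name': 'overcloud',
        'environments': [{'path': 'environments/example.yaml'}],
        'parameter_defaults': {'ControllerCount': 3},
    }

    swift = swift_client.Connection(
        authurl='http://keystone:5000/v3',   # placeholder endpoint
        user='admin', key='secret',
        os_options={'project_name': 'admin',
                    'user_domain_name': 'Default',
                    'project_domain_name': 'Default'},
        auth_version='3')

    # Store it alongside the templates in the plan's container so that
    # everything describing a deployment plan lives in one place.
    swift.put_object('overcloud',              # plan container (placeholder)
                     'plan-environment.yaml',  # file name is an assumption
                     yaml.safe_dump(plan_environment))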

A number of other issues were brought up around networking, custom roles, tracking deployment progress and a great many other topics, but I think the larger problems around plan management were the only ones expected to turn into a spec, which is now proposed for review.

I18n and release models

After the UI session I left the TripleO room to attend a cross-project session about i18n, horizon and release models. The release model point is particularly relevant because the TripleO UI is a newly internationalised project as of Ocata and the first to be cycle-trailing (TripleO releases a couple of weeks after the main OpenStack release).

I'm glad I was able to attend this session. For one thing, it was really nice to collaborate directly with the i18n and release management teams, and catch up with a couple of Horizon people. For another, it turns out tripleo-ui was not properly documented as cycle-trailing (fixed now!), and that led to other issues.

Having different release models is a source of headaches for the i18n community, which is already stretched thin. It means string freezes happen at different times and stable branches are cut at different points, which creates a lot of tracking work for the i18n PTL to figure out which project is ready and to do the required manual work to update Zanata upon branching. One part of the solution is likely to figure out whether we can script the manual parts of this workflow, so that when the release patch (which creates the stable branch) is merged, the change is automatically reflected in Zanata.

For the non-technical aspects of the work (mainly keeping track of deadlines and string freezes), the decision was that a project that wants to be translated needs to respect the same deadlines as the cycle-with-milestones projects on the main schedule; if it doesn't - if it doesn't respect the freezes or cut branches when expected - then it will be dropped from the i18n priority dashboard in Zanata. This was particularly relevant for Horizon plugins, as there are about a dozen of them now with varying degrees of diligence when it comes to doing releases.

These expectations will be documented in a new governance tag, something like i18n:translated.

Obviously this would mean that cycle-trailing projects would likely never be able to get the tag: the work we depend on lands late, so we continue making changes up to two weeks after each of the documented deadlines. ianychoi, the i18n PTL, seemed happy to take these projects under the i18n wing and do the manual work required, as long as there is an active i18n liaison for the project communicating with the i18n team to keep them informed about freezes and new branches. This seemed to work ok for us during Ocata, so I'm hopeful we can keep that model. It's not entirely clear to me whether this will also be documented/included in the governance tag, so I look forward to reading the spec once it is proposed! :)

In the case of tripleo-ui, we're not a priority project for translations nor looking to be, but we still rely on the i18n PTL to create Zanata branches and merge translations for us, and we would like to continue being able to translate stable branches as needed.

CI Q&A

The CI Q&A session on Friday morning was amazingly useful, and it was unanimously agreed that its content should be moved into the documentation (not done yet). If you've ever scratched your head about something related to TripleO CI, have a look at the etherpad!

by jpichon at March 02, 2017 09:55 AM


