Planet TripleO

March 20, 2017

Juan Antonio Osorio

Testing containerized OpenStack services with kolla

Note that the following instructions are for Fedora 25 as that’s what I’m currently running.

Setup

First, make sure that you have docker installed in your system.

sudo dnf install -y docker

This might not be the most secure thing, but for development purposes I added my user to the docker group so I can run docker commands without sudo.

sudo groupadd docker
sudo usermod -a -G docker <my user>

After this, you might want to open a new session, and make sure that in that session your user is in the docker group.
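
To double-check that the new session picked up the group membership, a quick way (not part of the original instructions) is:

id -nG | grep -w docker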

Having done this, make sure you enable docker on your system:

sudo systemctl enable docker

or restart it if it was running already:

sudo systemctl restart docker

Next, we need to clone the kolla repository, where the image definitions are declared:

git clone git://git.openstack.org/openstack/kolla
# We should also move to the directory
cd kolla

Next, we need to generate a base/sample configuration. We can easily do this thanks to oslo-config-generator:

tox -e genconfig

This will generate a sample configuration in ./etc/kolla/kolla-build.conf
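
When you later run kolla-build (see below), you can point it at this file explicitly. This is a hedged example that assumes the standard oslo.config --config-file option; adjust the path if you move the file:

kolla-build --config-file etc/kolla/kolla-build.conf keystone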

Now, since I don’t want to install kolla in my system, I’ll just rely on the virtualenv that tox creates:

tox -e py27
source .tox/py27/bin/activate

Building images with kolla

One can just build all the images with:

kolla-build

or:

./kolla/cmd/build.py

However, in my case, at the moment I only want to test out the keystone image. So I do:

./kolla/cmd/build.py keystone

This will build the keystone-related images, and some that I didn’t really want to build (such as barbican and barbican-keystone-listener). This will give a very verbose output and take a while. So go fetch a coffee/beer or your beverage of choice.

We can check which images were built using the usual docker CLI:

$ docker images
kolla/centos-binary-keystone-ssh                 4.0.0               42b10416796b        8 minutes ago       718.9 MB
kolla/centos-binary-keystone-fernet              4.0.0               77ca80589ec6        8 minutes ago       699.5 MB
kolla/centos-binary-barbican-keystone-listener   4.0.0               9d5b536d599a        9 minutes ago       640.8 MB
kolla/centos-binary-keystone                     4.0.0               5b9e0c7085f0        9 minutes ago       677.4 MB
kolla/centos-binary-keystone-base                4.0.0               689e1df70317        9 minutes ago       677.4 MB
kolla/centos-binary-barbican-base                4.0.0               af286e15b9e7        10 minutes ago      619.5 MB
kolla/centos-binary-openstack-base               4.0.0               a2eae63c41f1        11 minutes ago      588.6 MB
kolla/centos-binary-base                         4.0.0               cbdeca2ecfab        15 minutes ago      397.8 MB
docker.io/centos                                 7                   98d35105a391        4 days ago          192.5 MB

By default, the images are based on the centos image, which is fine for me. However, if you would like to use another base (such as ubuntu), you can specify that with the -b parameter:

kolla-build -b ubuntu

Testing it out with kolla-ansible

To test it out, for now, I decided to try out kolla-ansible. So, we need to install some relevant packages, clone the repo, and run tox so it can build a virtual environment for us with all the dependencies installed:

dnf install python-docker-py ansible
git clone git://git.openstack.org/openstack/kolla-ansible
cd kolla-ansible
tox -e py27
source .tox/py27/bin/activate
# For some reason, it seems that kolla is missing from here, so I install
# it inside the virtual environment.
pip install kolla

Now, to try out the keystone container, I need the rsyslog and mariadb containers as well, so let’s build them:

kolla-build mariadb rsyslog

Now, I have a config file such as the following:

kolla_base_distro: "centos"
kolla_install_type: "binary"

# This is the interface with an ip address you want to bind mariadb and keystone to
network_interface: "enp0s25"
# Set this to an ip address that currently exists on interface "network_interface"
kolla_internal_address: "192.168.X.X"

# Easy way to change debug to True, though not required
openstack_logging_debug: "True"

# For your information: these default to "yes" and can technically be removed
enable_keystone: "yes"
enable_mariadb: "yes"

# Builtins that are normally yes, but we set to no
enable_glance: "no"
enable_haproxy: "no"
enable_heat: "no"
enable_memcached: "no"
enable_neutron: "no"
enable_nova: "no"
enable_rabbitmq: "no"
enable_horizon: "no"

I placed this file in a directory named kolla under my home directory: ~/kolla/globals.yml

Remember to change the internal IP address to the relevant one in your system.
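
For example, to create the directory and check which IPv4 address is configured on the interface (the interface name here comes from the sample config above; substitute your own):

mkdir -p ~/kolla
ip -4 addr show enp0s25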

We also need a passwords file, so let’s get the base for it and generate the relevant passwords:

# Copy the default empty passwords file in the kolla-ansible repository
cp etc/kolla/passwords.yml ~/kolla/
# Generate passwords
kolla-genpwd -p ~/kolla/passwords.yml

Even if we don’t run kolla-ansible as root, we still need the /etc/kolla directory, so we have to create it and give our user permissions to it.

sudo mkdir /etc/kolla
sudo chown -R $USER:$USER /etc/kolla

This is because this directory is bind-mounted into the containers. It’s also where a file containing the admin credentials for the openstack services will subsequently be created.

Finally, we can issue the deploy:

./tools/kolla-ansible --configdir ~/kolla/ \
    --passwords ~/kolla/passwords.yml deploy

After this finishes we can now see several containers running:

$ docker ps
CONTAINER ID        IMAGE                                     COMMAND             CREATED             STATUS              PORTS               NAMES
5e9e41b0b404        kolla/centos-binary-keystone:4.0.0        "kolla_start"       8 minutes ago       Up 8 minutes                            keystone
a84ab9997e3a        kolla/centos-binary-mariadb:4.0.0         "kolla_start"       9 minutes ago       Up 9 minutes                            mariadb
1687512f6984        kolla/centos-binary-cron:4.0.0            "kolla_start"       9 minutes ago       Up 9 minutes                            cron
bb3235e0a880        kolla/centos-binary-kolla-toolbox:4.0.0   "kolla_start"       9 minutes ago       Up 9 minutes                            kolla_toolbox
b23c48b04c53        kolla/centos-binary-fluentd:4.0.0         "kolla_start"       9 minutes ago       Up 9 minutes                            fluentd

Now, we can get the admin credentials:

./tools/kolla-ansible --configdir ~/kolla/ \
    --passwords ~/kolla/passwords.yml post-deploy

This will create admin-openrc.sh in the /etc/kolla directory, so we can do a simple test to see that keystone is running:

source /etc/kolla/admin-openrc.sh
openstack user list

And this should print an admin user.

To log into the keystone container, I can do the following:

docker exec -ti keystone /bin/bash
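
Alternatively, to confirm the /etc/kolla bind mounts mentioned earlier without opening a shell (just a quick check, not part of the original post):

docker inspect -f '{{ json .Mounts }}' keystone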

Note about docker’s storage

Some time after writing this blog post, I ran into an issue where I destroyed the kolla-ansible deployment and attempted to create it again, only for the deployment to fail. This apparently was because the kolla-toolbox container could not find the storage file it needed.

Talking to the kolla community, I was pointed to this blog, which recommends not using the devicemapper driver. After following the recommendations there (switching to OverlayFS in my case), destroying the deployment and the images, and rebooting my system, I was able to deploy again successfully.
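
For reference, one way to switch the storage driver is via /etc/docker/daemon.json (the exact mechanism depends on your docker version and packaging, so treat this as a sketch and see the post linked above for the full recommendations; note that switching drivers effectively discards existing images and containers):

echo '{ "storage-driver": "overlay2" }' | sudo tee /etc/docker/daemon.json
sudo systemctl restart docker
docker info | grep -i 'storage driver'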

References

Special thanks to Adam Young. His blog has always been really useful!

The kolla-ansible bits are mostly based on this blog post of his.

The rest has been based on the kolla and kolla-ansible documentation.

March 20, 2017 01:15 PM

March 13, 2017

Juan Antonio Osorio

Deploying a containerized overcloud

Deploying a containerized overcloud is a matter of adding the environments/docker.yaml environment to the overcloud deployment.

Now, for developing, one is likely to want to use local images instead of the ones from dockerhub. So we’ll set heat parameters in an environment file. This, fortunately, is already done for us by quickstart. So we’ll see the file containers-default-parameters.yaml with something like the following:

parameter_defaults:
  DockerNamespace: 192.168.24.1:8787/tripleoupstream
  DockerNamespaceIsRegistry: true

Of course, the IP address depends on your undercloud deployment.
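
The address normally corresponds to the undercloud's ctlplane IP. Assuming the default br-ctlplane bridge on the undercloud (an assumption, not stated in the original post), one way to check it is:

ip -4 addr show br-ctlplane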

Now, this requires one to upload the images to the local registry. So, if we created our deployment using a recent version of tripleo-quickstart, there is already a utility script that we can use for this purpose: upload_images_to_local_registry.py

Note that to run the aforementioned script, one needs to either do it with root privileges (via sudo) or add a docker group and subsequently a user to it; for instance, the stack user. So choose depending on your security requirements.
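
For instance, granting the stack user access to docker (a sketch mirroring the docker group setup from the previous post; pick whichever approach fits your security requirements):

sudo groupadd -f docker
sudo usermod -a -G docker stack
# Start a new session so the group change takes effect.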

TLDR;

Do your regular openstack deploy but add the following environments:

  • tripleo-heat-templates/environments/docker.yaml
  • $HOME/containers-default-parameters.yaml
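
For illustration, with the default template location the deploy command might look something like this (the template path and any other environments you normally pass are assumptions; adapt to your usual deploy command):

openstack overcloud deploy --templates \
    -e /usr/share/openstack-tripleo-heat-templates/environments/docker.yaml \
    -e $HOME/containers-default-parameters.yaml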

Note

Currently (as of writing this post) HA deployments are not available, so don't try to use the pacemaker environment because that will fail.

March 13, 2017 03:55 PM

March 03, 2017

Steven Hardy

Developing Mistral workflows for TripleO

During the newton/ocata development cycles, TripleO made changes to the architecture so we make use of Mistral (the OpenStack workflow API project) to drive workflows required to deploy your OpenStack cloud.

Prior to this change we had workflows defined inside python-tripleoclient, and most API calls were made directly to Heat.  This worked OK, but there was too much "business logic" inside the client, which doesn't work well if non-python clients (such as tripleo-ui) want to interact with TripleO.

To solve this problem, a number of mistral workflows and custom actions have been implemented, which are available via the Mistral API on the undercloud.  This can now be considered the primary "TripleO API" for driving all deployment tasks.

Here's a diagram showing how it fits together:

Overview of Mistral integration in TripleO


Mistral workflows and actions

There are two primary interfaces to mistral: workflows, which are a yaml definition of a process or series of tasks, and actions, which are a concrete definition of how to do a specific task (such as calling some OpenStack API).

Workflows and actions can be defined directly via the mistral API, or via a wrapper called a workbook.  Mistral actions are also defined via a python plugin interface, which TripleO uses to run some tasks such as running jinja2 on tripleo-heat-templates prior to calling Heat to orchestrate the deployment.
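
To get a feel for what is available, you can list the registered actions and filter for the TripleO ones (a quick exploration command, not from the original post):

mistral action-list | grep tripleo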

Mistral workflows, in detail

Here I'm going to show how to view and interact with the mistral workflows used by TripleO directly, which is useful to understand what TripleO is doing "under the hood" during a deployment, and also for debugging/development.

First we view the mistral workbooks loaded into Mistral - these contain the TripleO-specific workflows and are defined in tripleo-common:


[stack@undercloud ~]$ . stackrc 
[stack@undercloud ~]$ mistral workbook-list
+----------------------------+--------+---------------------+------------+
| Name | Tags | Created at | Updated at |
+----------------------------+--------+---------------------+------------+
| tripleo.deployment.v1 | <none> | 2017-02-27 17:59:04 | None |
| tripleo.package_update.v1 | <none> | 2017-02-27 17:59:06 | None |
| tripleo.plan_management.v1 | <none> | 2017-02-27 17:59:09 | None |
| tripleo.scale.v1 | <none> | 2017-02-27 17:59:11 | None |
| tripleo.stack.v1 | <none> | 2017-02-27 17:59:13 | None |
| tripleo.validations.v1 | <none> | 2017-02-27 17:59:15 | None |
| tripleo.baremetal.v1 | <none> | 2017-02-28 19:26:33 | None |
+----------------------------+--------+---------------------+------------+

The name of the workbook constitutes a namespace for the workflows it contains, so we can view the related workflows using grep (I also grep for tag_node to reduce the number of matches).


[stack@undercloud ~]$ mistral workflow-list | grep "tripleo.baremetal.v1" | grep tag_node
| 75d2566c-13d9-4aa3-b18d-8e8fc0dd2119 | tripleo.baremetal.v1.tag_nodes | 660c5ec71ce043c1a43d3529e7065a9d | <none> | tag_node_uuids, untag_nod... | 2017-02-28 19:26:33 | None |
| 7a4220cc-f323-44a4-bb0b-5824377af249 | tripleo.baremetal.v1.tag_node | 660c5ec71ce043c1a43d3529e7065a9d | <none> | node_uuid, role=None, que... | 2017-02-28 19:26:33 | None | 

When you know the name of a workflow, you can inspect the required inputs and run it directly via a mistral execution. In this case we're running the tripleo.baremetal.v1.tag_node workflow, which modifies the profile assigned in the ironic node capabilities (see tripleo-docs for more information about manual tagging of nodes):


[stack@undercloud ~]$ mistral workflow-get tripleo.baremetal.v1.tag_node
+------------+------------------------------------------+
| Field | Value |
+------------+------------------------------------------+
| ID | 7a4220cc-f323-44a4-bb0b-5824377af249 |
| Name | tripleo.baremetal.v1.tag_node |
| Project ID | 660c5ec71ce043c1a43d3529e7065a9d |
| Tags | <none> |
| Input | node_uuid, role=None, queue_name=tripleo |
| Created at | 2017-02-28 19:26:33 |
| Updated at | None |
+------------+------------------------------------------+
[stack@undercloud ~]$ ironic node-list
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| UUID | Name | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
| 30182cb9-eba9-4335-b6b4-d74fe2581102 | control-0 | None | power off | available | False |
| 19fd7ea7-b4a0-4ae9-a06a-2f3d44f739e9 | compute-0 | None | power off | available | False |
+--------------------------------------+-----------+---------------+-------------+--------------------+-------------+
[stack@undercloud ~]$ mistral execution-create tripleo.baremetal.v1.tag_node '{"node_uuid": "30182cb9-eba9-4335-b6b4-d74fe2581102", "role": "test"}'
+-------------------+--------------------------------------+
| Field | Value |
+-------------------+--------------------------------------+
| ID | 6a141065-ad6e-4477-b1a8-c178e6fcadcb |
| Workflow ID | 7a4220cc-f323-44a4-bb0b-5824377af249 |
| Workflow name | tripleo.baremetal.v1.tag_node |
| Description | |
| Task Execution ID | <none> |
| State | RUNNING |
| State info | None |
| Created at | 2017-03-03 09:53:10 |
| Updated at | 2017-03-03 09:53:10 |
+-------------------+--------------------------------------+

At this point the mistral workflow is running, and it'll either succeed or fail, and also create some output (which in the TripleO model is sometimes returned to the UI via a Zaqar queue).  We can view the status and the outputs (truncated for brevity):


[stack@undercloud ~]$ mistral execution-list | grep  6a141065-ad6e-4477-b1a8-c178e6fcadcb
| 6a141065-ad6e-4477-b1a8-c178e6fcadcb | 7a4220cc-f323-44a4-bb0b-5824377af249 | tripleo.baremetal.v1.tag_node | | <none> | SUCCESS | None | 2017-03-03 09:53:10 | 2017-03-03 09:53:11 |
[stack@undercloud ~]$ mistral execution-get-output 6a141065-ad6e-4477-b1a8-c178e6fcadcb
{
"status": "SUCCESS",
"message": {
...

So that's it - we ran a mistral workflow, it succeeded, and we looked at the output. Now we can see the result by looking at the node in Ironic - it worked! :)


[stack@undercloud ~]$ ironic node-show 30182cb9-eba9-4335-b6b4-d74fe2581102 | grep profile
| | u'cpus': u'2', u'capabilities': u'profile:test,cpu_hugepages:true,boot_o |

 

Mistral workflows, create your own!

Here I'll show how to develop your own custom workflows (which isn't something we expect operators to necessarily do, but is now part of many developers' workflow during feature development for TripleO).

First, we create a simple yaml definition of the workflow, as defined in the v2 Mistral DSL - this example lists all available ironic nodes, then finds those which match the "test" profile we assigned in the example above.


This example uses the mistral built-in "ironic" action, which is basically a pass-through action exposing the python-ironicclient interfaces.  Similar actions exist for the majority of OpenStack python clients, so this is a pretty flexible interface.

Now we can upload the workflow (not wrapped in a workbook this time, so we use workflow-create), run it via execution-create, then look at the outputs - we can see that the matching_nodes output matches the ID of the node we tagged in the example above - success! :)

[stack@undercloud tripleo-common]$ mistral workflow-create shtest.yaml 
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
| ID | Name | Project ID | Tags | Input | Created at | Updated at |
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
| 2b8f2bea-f3dd-42f0-ad16-79987c75df4d | test_nodes_with_profile | 660c5ec71ce043c1a43d3529e7065a9d | <none> | profile=test | 2017-03-03 10:18:48 | None |
+--------------------------------------+-------------------------+----------------------------------+--------+--------------+---------------------+------------+
[stack@undercloud tripleo-common]$ mistral execution-create test_nodes_with_profile
+-------------------+--------------------------------------+
| Field | Value |
+-------------------+--------------------------------------+
| ID | 2392ed1c-96b4-4787-9d11-0f3069e9a7e5 |
| Workflow ID | 2b8f2bea-f3dd-42f0-ad16-79987c75df4d |
| Workflow name | test_nodes_with_profile |
| Description | |
| Task Execution ID | <none> |
| State | RUNNING |
| State info | None |
| Created at | 2017-03-03 10:19:30 |
| Updated at | 2017-03-03 10:19:30 |
+-------------------+--------------------------------------+
[stack@undercloud tripleo-common]$ mistral execution-list | grep 2392ed1c-96b4-4787-9d11-0f3069e9a7e5
| 2392ed1c-96b4-4787-9d11-0f3069e9a7e5 | 2b8f2bea-f3dd-42f0-ad16-79987c75df4d | test_nodes_with_profile | | <none> | SUCCESS | None | 2017-03-03 10:19:30 | 2017-03-03 10:19:31 |
[stack@undercloud tripleo-common]$ mistral execution-get-output 2392ed1c-96b4-4787-9d11-0f3069e9a7e5
{
"matching_nodes": [
"30182cb9-eba9-4335-b6b4-d74fe2581102"
],
"available_nodes": [
"30182cb9-eba9-4335-b6b4-d74fe2581102",
"19fd7ea7-b4a0-4ae9-a06a-2f3d44f739e9"
]
}

Using this basic example, you can see how to develop workflows which can then easily be copied into the tripleo-common workbooks, and integrated into the TripleO deployment workflow.

In a future post, I'll dig into the use of custom actions, and how to develop/debug those.

by Steve Hardy (noreply@blogger.com) at March 03, 2017 10:51 AM

March 02, 2017

Julie Pichon

OpenStack Pike PTG: TripleO, TripleO UI | Some highlights

For the second part of the PTG (vertical projects), I mainly stayed in the TripleO room, moving around a couple of times to attend cross-project sessions related to i18n.

Although I always wish I understood more/everything, in the end my areas of interest (and current understanding!) in TripleO are around the UI, installing and configuring it, the TripleO CLI, and the tripleo-common Mistral workflows. Therefore the couple of thoughts in this post are mainly relevant to these - if you're looking for a more exhaustive summary of the TripleO discussions and decisions made during the PTG, I recommend reading the PTL's excellent thread about this on the dev list, and the associated etherpads.

Random points of interest

  • Containers is the big topic and had multiple sessions dedicated to it, both single and cross-projects. Many other sessions ended up revisiting the subject as well, sometimes with "oh that'll be solved with containers" and sometimes with "hm good but that won't work with containers."
  • A couple of API-breaking changes may need to happen in TripleO Heat Templates (e.g. for NFV, passing a role mapping vs a role name around). The recommendation is to get this in as early as possible (by the first milestone) and communicate it well for out-of-tree services.
  • When needing to test something new on the CI, look at the existing scenarios and prioritise adding/changing something there to test for what you need, as opposed to trying to create a brand new job.
  • Running Mistral workflows as part of or after the deployment came up several times and was even a topic during a cross-project Heat / Mistral / TripleO session. Things can get messy, switching between Heat, Mistral and Puppet. Where should these workflows live (THT, tripleo-common)? Service-specific workflows (pre/post-deploy) are definitely something people want and there's a need to standardise how to do that. Ceph's likely to be the first to try their hand at this.
  • One lively cross-project session with OpenStack Ansible and Kolla was about parameters in configuration files. Currently whenever a new feature is added to Nova or whatever service, Puppet and so on need to be updated manually. The proposal is to make a small change to oslo.config to enable it to give an output in machine-readable YAML which can then be consumed (currently the config generated is only human readable). This will help with validations, and it may help to only have to maintain a structure as opposed to a template.
  • Heat folks had a feedback session with us about the TripleO needs. They've been super helpful with e.g. helping to improve our memory usage over the last couple of cycles. My takeaway from this session was "beware/avoid using YAQL, especially in nested stacks." YAQL is badly documented and everyone ends up reading the source code and tests to figure out how to do things. Bringing Jinja2 into Heat or some kind of way to have repeated patterns from resources (e.g. based on a file) also came up and was cautiously acknowledged.
  • Predictable IP assignment on the control plane is a big enough issue that some people are suggesting dropping Neutron in the undercloud over it. We'd lose so many other benefits though, that it seems unlikely to happen.
  • Cool work incoming allowing built-in network examples to Just Work, based on a sample configuration. Getting the networking stuff right is a huge pain point and I'm excited to hear this should be achievable within Pike.

Python 3

Python 3 is an OpenStack community goal for Pike.

Tripleo-common and python-tripleoclient both have voting unit tests jobs for Python 3.5, though I trust them only moderately for a number of reasons. For example many of the tests tend to focus on the happy path and I've seen and fixed Python 3 incompatible code in exceptions several times (no 'message' attribute seems easy to get caught into), despite the unit testing jobs being all green. Apparently there are coverage jobs we could enable for the client, to ensure the coverage ratio doesn't drop.

Python 3 for functional tests was also brought up. We don't have functional tests in any of our projects and it's not clear the value we would get out of it (mocking servers) compared to the unit testing and all the integration testing we already do. Increasing unit test coverage was deemed a more valuable goal to pursue for now.

There are other issues around functional/integration testing with Python 3 which will need to be resolved (though likely not during Pike). For example our integration jobs run on CentOS and use packages, which won't be Python 3 compatible yet (cue SCL and the need to respin dependencies). If we do add functional tests, perhaps it would be easier to have them run on a Fedora gate (although if I recall correctly gating on Fedora was investigated once upon a time at the OpenStack level, but caused too many issues due to churn and the lack of long-term releases).

Another issue with Python 3 support and functional testing is that the TripleO client depends on Mistral server (due to the Series Of Unfortunate Dependencies I also mentioned in the last post). That means Mistral itself would need to fully support Python 3 as well.

Python 2 isn't going anywhere just yet so we still have time to figure things out. The conclusions, as described in Emilien's email seem to be:

  • Improve the unit test coverage
  • Enable the coverage job in CI
  • Investigate functional testing for python-tripleoclient to start with, see if it makes sense and is feasible

Sample environment generator

Currently environment files in THT are written by hand and quite inconsistent. This is also important for the UI, which needs to display this information. For example currently the environment general description is in a comment at the top of the file (if it exists at all), which can't be accessed programmatically. Dependencies between environment files are not described either.

To make up for this, currently all that information lives in the capabilities map, but it's external to the templates themselves, needs to be updated manually, and gets out of sync easily.

The sample environment generator to fix this has been out there for a year, and currently has two blockers. First, it needs a way to determine which parameters are private (that is, parameters that are expected to be passed in by another template and shouldn't be set by the user).

One way could be to use a naming convention, perhaps an underscore prefix similar to Python. Parameter groups cannot be used because of a historical limitation, there can only be one group (so you couldn't be both Private and Deprecated). Changing Heat with a new feature like Nested Groups or generic Parameter Tags could be an option. The advantage of the naming convention is that it doesn't require any change to Heat.

From the UI perspective, validating if an environment or template is redefining parameters already defined elsewhere also matters. Because if it has the same name, then it needs to be set up with the same value everywhere or it's uncertain what the final value will end up as.

I think the second issue was that the general environment description can only be a comment at the moment, there is no Heat parameter to do this. The Heat experts in the room seemed confident this is non-controversial as a feature and should be easy to get in.

Once the existing templates are updated to match the new format, the validation should be added to CI to make sure that any new patch with environments does include these parameters. Having "description" show up as an empty string when generating a new environment will make it more obvious that something can/should be there, while it's easy to forget about it with the current situation.

The agreement was:

  • Use underscores as a naming convention to start with
  • Start with a comment for the general description

Once we get the new Heat description attribute we can move things around. If parameter tags get accepted, likewise we can automate moving things around. Tags would also be useful to the UI, to determine what subset of relevant parameters to display to the user in smaller forms (easier to understand that one form with several dozens of fields showing up all at once). Tags, rather than parameter groups are required because of the aforementioned issue: it's already used for deprecation and a parameter can only have one group.

Trusts and federation

This was a cross-project session together with Heat, Keystone and Mistral. A "Trust" lets you delegate or impersonate a user with a subset of their rights. From my experience in TripleO, this is particularly useful with long-running Heat stacks, as an authentication token expires after a few hours, which means you lose the ability to do anything in the middle of an operation.

Trusts have been working very well for Heat since 2013. Before that they had to encrypt the user password in order to ask for a new token when needed, which all agreed was pretty horrible and not anything people want to go back to. Unfortunately with the federation model and using external Identity Providers, this is no longer possible. Trusts break, but some kind of delegation is still needed for Heat.

There were a lot of tangents about security issues (obviously!), revocation, expiration, role syncing. From what I understand Keystone currently validates Trusts to make sure the user still has the requested permissions (that the relevant role hasn't been removed in the meantime). There's a desire to have access to the entire role list, because the APIs currently don't let us check which role is necessary to perform a specific action. Additionally, when Mistral workflows launched from Heat get in, Mistral will create its own Trusts and Heat can't know what that will do. In the end you always kinda end up needing to do everything. Karbor is running into this as well.

No solution was discovered during the session, but I think all sides were happy enough that the problem and use cases have been clearly laid out and are now understood.

TripleO UI

Some of the issues relevant to the UI were brought up in other sessions, like standardising the environment files. Other issues brought up were around plan management, for example why do we use the Mistral environment in addition to Swift? Historically it looks like it's because it was a nice drop-in replacement for the defunct TripleO API and offered a similar API. Although it won't have an API by default, the suggestion is to move to using a file to store the environment during Pike and have a consistent set of templates: this way all the information related to a deployment plan will live in the same place. This will help with exporting/importing plans, which itself will help with CLI/UI interoperability (for instance there are still some differences in how and when the Mistral environment is generated, depending on whether you deployed with the CLI or the UI).

A number of other issues were brought up around networking, custom roles, tracking deployment progress, and a great many other topics, but I think the larger problem around plan management was the only one expected to turn into a spec, now proposed for review.

I18n and release models

After the UI session I left the TripleO room to attend a cross-project session about i18n, horizon and release models. The release model point is particularly relevant because the TripleO UI is a newly internationalised project as of Ocata and the first to be cycle-trailing (TripleO releases a couple of weeks after the main OpenStack release).

I'm glad I was able to attend this session. For one it was really nice to collaborate directly with the i18n and release management team, and catch up with a couple of Horizon people. For second it turns out tripleo-ui was not properly documented as cycle-trailing (fixed now!) and that led to other issues.

Having different release models is a source of headaches for the i18n community, already stretched thin. It means string freezes happen at different times, stable branches are cut at different points, which creates a lot of tracking work for the i18n PTL to figure out which project is ready and do the required manual work to update Zanata upon branching. One part of the solution is likely to figure out if we can script the manual parts of this workflow so that when the release patch (which creates the stable branch) is merged, the change is automatically reflected in Zanata.

For the non-technical aspects of the work (mainly keeping track of deadlines and string freezes) the decision was that if you want to be translated, then you need to respect the same deadlines as the cycle-with-milestones projects do on the main schedule; and if a project doesn't want to - if it doesn't respect the freezes or cut branches when expected - then it will be dropped from the i18n priority dashboard in Zanata. This was particularly relevant for Horizon plugins, as there's about a dozen of them now with various degrees of diligence when it comes to doing releases.

These expectations will be documented in a new governance tag, something like i18n:translated.

Obviously this would mean that cycle-trailing projects would likely never be able to get the tag. The work we depend on lands late and so we continue making changes up to two weeks after each of the documented deadlines. ianychoi, the i18n PTL seemed happy to take these projects under the i18n wing and do the manual work required, as long as there is an active i18n liaison for the project communicating with the i18n team to keep them informed about freezes and new branches. This seemed to work ok for us during Ocata so I'm hopeful we can keep that model. It's not entirely clear to me if this will also be documented/included in the governance tag so I look forward to reading the spec once it is proposed! :)

In the case of tripleo-ui we're not a priority project for translations nor looking to be, but we still rely on the i18n PTL to create Zanata branches and merge translations for us, and would like to continue with being able to translate stable branches as needed.

CI Q&A

The CI Q&A session on Friday morning was amazingly useful, and it was unanimously agreed that its content should be moved to the documentation (not done yet). If you've ever scratched your head about something related to TripleO CI, have a look at the etherpad!

Tagged with: events, openstack, tripleo

by jpichon at March 02, 2017 09:55 AM

February 24, 2017

Carlos Camacho

Installing TripleO Quickstart

This is a brief recipe about how to manually install TripleO Quickstart in a remote 32GB RAM box and not dying trying it.

Now instack-virt-setup is deprecated :( :( :( so the manual process needs to evolve and use OOOQ (TripleO Quickstart).

This post is a brief recipe about how to provision the Hypervisor node and deploy an end-to-end development environment based on TripleO-Quickstart.

From the hypervisor run:

# In this dev. env. /var is only 50GB, so I will create
# a sym link to another location with more capacity.
# It will easily take more than 50GB to deploy a 3+1 overcloud
sudo mkdir -p /home/libvirt/
sudo ln -sf /home/libvirt/ /var/lib/libvirt
# Add default toor user
sudo useradd toor
echo "toor:toor" | chpasswd
echo "toor ALL=(root) NOPASSWD:ALL" | sudo tee -a /etc/sudoers.d/toor
sudo chmod 0440 /etc/sudoers.d/toor
su - toor
whoami

sudo yum groupinstall "Virtualization Host" -y
sudo yum install git -y

cd
mkdir .ssh
ssh-keygen -t rsa -N "" -f .ssh/id_rsa
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
sudo bash -c "cat .ssh/id_rsa.pub >> /root/.ssh/authorized_keys"
sudo bash -c "echo '127.0.0.1 127.0.0.2' >> /etc/hosts"
export VIRTHOST=127.0.0.2

ssh root@$VIRTHOST uname -a

Now, we can start deploying TripleO Quickstart by following:

git clone https://github.com/openstack/tripleo-quickstart
chmod u+x ./tripleo-quickstart/quickstart.sh
printf "\n\nSee:\n./tripleo-quickstart/quickstart.sh --help for a full list of options\n\n"
bash ./tripleo-quickstart/quickstart.sh --install-deps --release master

export VIRTHOST=127.0.0.2
export CONFIG=~/deploy-config.yaml

cat > $CONFIG << EOF

# undercloud_undercloud_hostname: undercloud.ratata-domain

control_memory: 8192
compute_memory: 6120
 
undercloud_memory: 10240
undercloud_vcpu: 4
undercloud_workers: 3

default_vcpu: 1

node_count: 4

overcloud_nodes:
  - name: control_0
    flavor: control
    virtualbmc_port: 6230
  - name: control_1
    flavor: control
    virtualbmc_port: 6231
  - name: control_2
    flavor: control
    virtualbmc_port: 6232
  - name: compute_0
    flavor: compute
    virtualbmc_port: 6233

topology: >-
  --control-scale 3

extra_args: >-
  --libvirt-type qemu
  -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --ntp-server pool.ntp.org

run_tempest: false
EOF

bash ./tripleo-quickstart/quickstart.sh \
                --no-clone \
                --clean \
                --release master \
                --teardown all \
                --tags all \
                -e @$CONFIG \
                $VIRTHOST

From the hypervisor, run the following command to log in to the Undercloud:

ssh -F /home/toor/.quickstart/ssh.config.ansible undercloud

# Add the TRIPLEO_ROOT var to stackrc 
# to use with tripleo-ci
echo "export TRIPLEO_ROOT=~" >> stackrc

source stackrc

At this point you should have your development environment deployed correctly.
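
As a quick sanity check (not part of the original recipe), you can list the overcloud Heat stack and the overcloud nodes from the undercloud:

openstack stack list
openstack server list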

Clone the tripleo-ci repo:

git clone https://github.com/openstack-infra/tripleo-ci

And, run the Overcloud pingtest:

~/tripleo-ci/scripts/tripleo.sh --overcloud-pingtest

Enjoy TripleOing (~˘▾˘)~

Note: I had to execute the deployment command 3 or 4 times to get an OK deployment; sometimes it just fails (e.g. a timeout getting the images).

Note: From the host, virsh list --all will work only as the stack user.

Note: Each time you run the quickstart.sh command from the hypervisor, the UC and OC will be nuked (--teardown all); you will see tasks like ‘PLAY [Tear down undercloud and overcloud vms] **’.

Note: If you delete the Overcloud (e.g. using heat stack-delete overcloud), you can re-deploy what you had by running the dynamically generated overcloud-deploy.sh script in the stack user's home folder on the UC.

Note: There are several options for TripleO Quickstart besides the basic virthost deployment, check them here: https://docs.openstack.org/developer/tripleo-quickstart/working-with-extras.html

Updated 2017/03/17: Bleh, had to execute the deployment command several times to get it working.. :/ I miss you instack-virt-setup

Updated 2017/03/16: The --config option seems to be broken, using instead -e @~/deploy-config.yaml.

Updated 2017/03/14: New workflow added.

Updated 2017/02/27: Working fine.

Updated 2017/02/23: Seems to work.

Updated 2017/02/23: instack-virt-setup is deprecated :( moving to tripleo-quickstart.

by Carlos Camacho at February 24, 2017 12:00 AM

February 14, 2017

Ben Nemec

Improving TripleO CI Throughput

If you spend any significant amount of time working on TripleO, you have probably run into the dreaded CI queue. In this case that typically refers to the check-tripleo queue that runs OVB. Why does that queue back up more than the regular check queue and what can/have we done about it? That's what we're going to talk about here.

The Problems

Before we discuss solutions, it's important to understand the problems we face in scaling TripleO's OVB CI. It's a very different beast from the rest of OpenStack CI. Here's a (probably incomplete) list of how:

  • Our test environments are considerably larger than normal. An average TripleO OVB job makes use of 5 VMs. As of this writing, the most that any regular infra job uses is 3, and that's an experimental TripleO job. In general they max out at 2, and most use a single node. A TripleO environment averages around 35 GB of memory (generally our limiting factor) per test environment, as well as a lot of vcpus and disk.
  • Our test environments are also considerably more complex. Those 5 VMs are attached to some combination of 6 different neutron subnets. One of the VMs is also configured as an IPMI server that controls the others. We use Heat to deploy them in order to keep the whole thing manageable. This adds yet another layer of complexity because regular infra doesn't know how to deploy the Heat stacks, so we have to run private infrastructure to handle that.
  • Related to the previous point, our test environments have some unusual requirements on the host cloud. While some work has been done to reduce the number of ways our CI cloud is a snowflake, there are currently just 2 available clouds on which our jobs can run. And one of those is being used by TripleO developers and not available for CI. So we have exactly one cloud of ~30 128 GB compute nodes for TripleO CI.
  • In the interest of maximizing compute capacity, our CI cloud has a single, non-HA controller. When a large number of test environments are being created and/or deleted at once, it can overload that controller and cause all kinds of strange (read: broken) behavior.
  • TripleO CI jobs are long. At this time, I estimate we average about 1 hour 50 minutes per job. And I'm happy with that number, if you can believe it. It used to be well above 2 hours. If that sounds bad, keep in mind that we're deploying not one, but two OpenStack clouds in that time. We're also simulating the baremetal deployment part of the process, which no other OpenStack CI jobs do.

Just for some context, at our peak utilization of TripleO CI I've seen as many as 750 jobs being run in a 24 hour period. You can do the math on the number of VMs, memory, and networks that involves. It also means even small regressions in performance can have a huge impact on our daily throughput. A 5 minute regression per job adds up to 62.5 extra hours per day spent running test jobs (750 jobs × 5 minutes = 3750 minutes). The good news is that a 5 minute improvement has the same impact in the positive.

The Solutions

Best strap in tight for this section, because we've been busy.

One useful tool that I want to point out is a simple webapp I wrote to keep an eye on the check-tripleo queue: check-tripleo queue status. It can show other queues as well, but it was specifically designed for the tripleo queues so some things may not make sense elsewhere. It's also designed to be as compact as possible, and it may not be obvious what some of the numbers mean. If there's interest, I can write a more complete post about the tool itself.

There are two main categories of changes that helped our CI throughput: bug fixes and optimizations. I'll start with the bugs that were hurting performance.

Bugs

  • Five minute delay DHCP'ing isolated nics. This has actually bitten us twice. It's a fairly long-standing bug that goes back to at least Mitaka and causes deployments to spend 5 minutes attempting to DHCP nics that will never get a response. Fixing it saved time in every single job we run.
  • IPA image build not skipped even if image already exists. This bug crept in when we moved to a YAML-based image build system. There was an issue with the check for existing images that meant even when we could use cached images in CI, we were spending 10 minutes rebuilding the IPA image. This didn't affect every job (some can't use cached images), but it was a big time suck for the ones it did.
  • overcloud-full ramdisk being rebuilt twice. Thanks to a recent change, we ended up with two different image elements doing forced ramdisk rebuilds during our image builds. This was a less serious performance hit, but it still saves 1.5-2 minutes per job when we have to build images.

Optimizations

  • Run with actual node count. Due to a scheduler race between Nova and Ironic, we had previously added an extra node to each test environment so the scheduler could retry when it failed. This no longer seems to be necessary, and removing the extra node freed up around 20% of the resources from each test environment. It also makes environment creation and deletion faster because there is less to do.
  • Disable optional undercloud features in longer jobs. The undercloud has grown a lot of new services over the past few cycles, and this has caused it to take an increasingly long time to install. Since we aren't exercising many of these features in CI anyway, there's no point deploying them in all jobs. This is saving around 10 minutes in the ha and updates jobs.
  • Deploy network envs appropriate for the job. Not all of our jobs require the full 6 networks I discussed earlier. Since neutron-server is one of the biggest CPU users on the CI cloud controller, reducing the number of ports attached to the VMs was a big win in terms of controller capacity. It also reduces the time to create a test environment by a minute or more for some jobs. And in case that's not enough, this change will also allow us to test with bonded nics in CI.
  • Always use cached images in updates job. The updates job is especially painful from a runtime perspective. Not only does it deploy two full clouds, but it also has to update one of them, which takes a significant amount of time as well. Since the updates job is never run in isolation and image builds for it are not job-specific, there's no reason we can't always use cached images. If an image build is broken by a patch it will be caught by one of the other jobs. This can save as much as 30+ minutes in updates jobs.
  • Parallelization wherever possible. There were a few patches related to this, but essentially there are some processes in CI (such as log collection) that were being run in serial. Since our VMs are typically going to be running on different compute nodes, there's really no benefit to that and running those processes in parallel can save significant amounts of time.
  • Use http delorean urls instead of https. At some point, our default delorean repos (which is where we get OpenStack and friends) switched to using https by default. While this is good from a security perspective, it's bad from a CI perspective because it means we can't cache those packages. This is both slower and results in more bandwidth wasted on both ends. Note: As of this writing, the problem is only half fixed. Some of our repos force redirect http to https so there's nothing we can do on our end.

And I think this one deserves special notice: Clean up testenv if Jenkins instance goes away. Previously we had an issue where test environments were being left around for some time after the job they were attached to had been killed. This can happen, for example, if a new patch set is pushed to a change that has jobs actively running on it. Zuul kills the active jobs on the old patch set and starts new ones on the new patch set. However, before this change we did not immediately clean up the test environment from the killed jobs. This was very problematic and caused us to exceed our capacity in the CI cloud on several occasions. It also meant we couldn't make full use of the capacity at other times because the more jobs we ran the more likely it was that this situation would occur. Since the patch, I have never seen us exceed our configured capacity for jobs, and the problem scenario has occurred 1300 times in the two weeks since the change merged. That's a lot of resources not wasted.

All of these optimizations combined have both reduced our job runtimes and allowed us to run more jobs at once. We've increased our concurrent job limit from 60 to 70, and the CI cloud is still under less load than it was before. We could probably go even higher, but since things are generally under control right now there's no need to push the limit. There's also diminishing returns (more jobs running at once means more load on the compute nodes, which leads to lower performance) and some existing limits in the cloud that would require downtime to change if we go much higher. It could be done if necessary, but so far it hasn't been.

It's also worth noting that the effort to keep the CI queues reasonable is ongoing. Even while we merged the changes discussed above, other changes happened that regressed CI performance. Some because they were adding new things to deploy that take more time, and some for unanticipated reasons. Unfortunately, performance regressions tend to get ignored until they become so painful that jobs time out. This is a bad approach because CI performance affects every developer working on TripleO, and I'm hoping we can do a better job of keeping things in good shape going forward.

And just to drive the previous point home, in the time since I started writing this post and publishing it, we've regressed the ha job performance enough to start causing job failures.

by bnemec at February 14, 2017 07:32 PM

January 31, 2017

Dougal Matthews

Interactive Mistral Workflows over Zaqar

It is possible to do some really nice automation with the Mistral Workflow engine. However, sometimes user input is required or desirable. I set out to write an interactive Mistral Workflow, one that could communicate with a user over Zaqar.

If you are not familiar with Mistral Workflows you may want to start here, here or here.

The Workflow

Okay, this is what I came up with.

---
version: '2.0'

interactive-workflow:

  input:
    - input_queue: "workflow-input"
    - output_queue: "workflow-output"

  tasks:

    request_user_input:
      action: zaqar.queue_post
      retry: count=5 delay=1
      input:
        queue_name: <% $.output_queue %>
        messages:
          body: "Send some input to '<% $.input_queue %>'"
      on-success: read_user_input

    read_user_input:
      pause-before: true
      action: zaqar.queue_pop
      input:
        queue_name: <% $.input_queue %>
      publish:
        user_input: <% task(read_user_input).result[0].body %>
      on-success: done

    done:
      action: zaqar.queue_post
      retry: count=5 delay=1
      input:
        queue_name: <% $.output_queue %>
        messages:
          body: "You sent: '<% $.user_input %>'"

Breaking it down...

  1. The Workflow uses two queues one for input and one for output - it would be possible to use the same for both but this seemed simpler.

  2. On the first task, request_user_input, the Workflow sends a Zaqar message to the user requesting a message be sent to the input_queue.

  3. The read_user_input task pauses before it starts, see the pause-before: true. This means we can unpause the Workflow after we send a message. It would be possible to create a loop here that polls for messages, see below for more on this.

  4. After the input is provided, the Workflow must be un-paused manually. It then reads from the queue and sends the message back to the user via the output Zaqar queue.

See it in Action

We can demonstrate the Workflow with just the Mistral client. First you need to save it to a file and use the mistral workflow-create command to upload it.
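
For example (the filename here is just an assumption; use whatever you saved the Workflow definition as):

$ mistral workflow-create interactive-workflow.yaml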

Then we trigger the Workflow execution.

$ mistral execution-create interactive-workflow
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c |
| Workflow ID       | bdd1253e-68f8-4cf3-9af0-0957e4a31631 |
| Workflow name     | interactive-workflow                 |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-01-31 08:22:17                  |
| Updated at        | 2017-01-31 08:22:17                  |
+-------------------+--------------------------------------+

The Workflow will complete the first task and then move to the PAUSED state before read_user_input. This can be seen with the mistral execution-list command.

In this Workflow we know there will now be a message in Zaqar. The Mistral action zaqar.queue_pop can be used to receive it...

$ mistral run-action zaqar.queue_pop '{"queue_name": "workflow-output"}'
{"result": [{"body": "Send some input to 'workflow-input'", "age": 4, "queue": {"_metadata": null, "client": null, "_name": "workflow-output"}, "href": null, "ttl": 3600, "_id": "589049397dcad341ecfb72cf"}]}

The JSON is a bit hard to read, but you can see the message body Send some input to 'workflow-input'.

Great. We can do that with another Mistral action...

$ mistral run-action zaqar.queue_post '{"queue_name": "workflow-input", "messages":{"body": {"testing": 123}}}'
{"result": {"resources": ["/v2/queues/workflow-input/messages/589049447dcad341ecfb72d0"]}}

After sending the message back to the Workflow that requested it, we can unpause it. This can be done like this...

$ mistral execution-update -s RUNNING e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c
+-------------------+--------------------------------------+
| Field             | Value                                |
+-------------------+--------------------------------------+
| ID                | e8e2bfd5-3ae4-46db-9230-ada00a2c0c8c |
| Workflow ID       | bdd1253e-68f8-4cf3-9af0-0957e4a31631 |
| Workflow name     | interactive-workflow                 |
| Description       |                                      |
| Task Execution ID | <none>                               |
| State             | RUNNING                              |
| State info        | None                                 |
| Created at        | 2017-01-31 08:22:17                  |
| Updated at        | 2017-01-31 08:22:38                  |
+-------------------+--------------------------------------+

Finally we can confirm it worked by getting a message back from the Workflow...

$ mistral run-action zaqar.queue_pop '{"queue_name": "workflow-output"}'
{"result": [{"body": "You sent: '{u'testing': 123}'", "age": 6, "queue": {"_metadata": null, "client": null, "_name": "workflow-output"}, "href": null, "ttl": 3600, "_id": "5890494f7dcad341ecfb72d1"}]}

You can see a new message is returned which shows the input we sent.

Caveats

As mentioned above, the main limitation here is that you need to manually unpause the Workflow. It would be nice if there was a way for the Zaqar message to automatically do this.

Polling for messages in the Workflow would be quite easy, with a retry loop and Mistral's continue-on. However, that would be quite resource intensive. If you wanted to do this, a Workflow task like this would probably do the trick.

  wait_for_message:
    action: zaqar.queue_pop
    input:
      queue_name: <% $.input_queue %>
    timeout: 14400
    retry:
      delay: 15
      count: <% $.timeout / 15 %>
      continue-on: <% len(task(wait_for_message).result) > 0 %>

The other limitation is that this Workflow now requires a specific interaction pattern that isn't obvious and documenting it might be a little tricky. However, I think the flexible execution it provides might be worthwhile in some cases.

by Dougal Matthews at January 31, 2017 07:40 AM

January 26, 2017

Ben Nemec

Setting a Root Password on overcloud-full

By default the overcloud-full image built as part of a TripleO deployment does not have a root password set. Sometimes it can be useful to set one, particularly if you're having network setup trouble that prevents you from ssh'ing into the instance after it's deployed. One simple way to do this is as follows:

sudo yum install -y libguestfs-tools
virt-customize -a overcloud-full.qcow2 --root-password password:password
. stackrc
openstack overcloud image upload --update-existing

This will install the necessary tools, set the root password, then upload it to the undercloud Glance. There are no doubt other ways to handle setting a password, but this one is pretty simple and doesn't require rebuilding the image from scratch.

Note that to set a different password, you change the "password" after the :. So the virt-customize call would look like:

virt-customize -a overcloud-full.qcow2 --root-password password:some-other-password

by bnemec at January 26, 2017 06:27 PM

Carlos Camacho

OpenStack and services for BigData applications

Yesterday I had the opportunity of presenting together with Daniel Mellado a brief talk about OpenStack and its integration with services to support Big Data tools (OpenStack Sahara).

It was a combined talk for two Meetups MAD-for-OpenStack and Data-Science-Madrid.

The presentation is stored in GitHub.

Regrets:

  • We prepared a 1 hour presentation that had to be presented in 20min.
  • Wasn’t able to have access to our demo server.

by Carlos Camacho at January 26, 2017 12:00 AM

January 25, 2017

Dan Prince

Docker Puppet

Today TripleO leverages Puppet to help configure and manage the deployment of OpenStack services. As we move towards using Docker one of the big questions people have is how will we generate config files for those containers. We'd like to continue to make use of our mature configuration interfaces (Heat parameters, Hieradata overrides, Puppet modules) to allow our operators to seamlessly take the step towards a fully containerized deployment.

With the recently added composable services we've got everything we need. This is how we do it...

Install puppet into our base container image

Turns out the first thing you need if you want to generate config files with Puppet is, well... puppet. TripleO uses containers from the Kolla project, and by default they do not install Puppet. In the past TripleO used an 'agent container' to manage the puppet installation requirements. This worked okay for the compute role (a very minimal set of services) but doesn't work as nicely for the broader set of OpenStack services, because packages need to be pre-installed into the 'agent' container in order for config file generation to work correctly (puppet overlays the default config files in many cases). Installing packages for all of OpenStack and its requirements into the agent container isn't ideal.

Enter TripleO composable services (thanks Newton!). TripleO now supports composability and Kolla typically has individual containers for each service so it turns out the best way to generate config files for a specific service is to use the container for the service itself. We do this in two separate runs of a container: one to create config files, and the second one to launch the service (bind mounting/copying in the configs). It works really well.

But we still have the issue of how to get puppet into all of our Kolla containers. We were happy to discover that Kolla supports a template-overrides mechanism (a Jinja template) that allows you to customize how containers are built. This is how you can use that mechanism to add puppet to the CentOS base image used for all the OpenStack docker containers generated by the Kolla build scripts.

$ cat template-overrides.j2
{% extends parent_template %}
{% set base_centos_binary_packages_append = ['puppet'] %}

kolla-build --base centos --template-override template-overrides.j2

Control the Puppet catalog

A puppet manifest in TripleO can do a lot of things like installing packages, configuring files, starting a service, etc. For containers we only want to generate the config files. Furthermore we'd like to do this without having to change our puppet modules.

One mechanism we use is the --tags option for 'puppet apply'. This option allows you to specify which resources within a given puppet manifest (or catalog) should be executed. It works really nicely to allow you to select what you want out of a puppet catalog.

An example of this is listed below where we have a manifest to create a '/tmp/foo' file. When we run the manifest with the 'package' tag (telling it to only install packages) it does nothing at all.

$ cat test.pp 
file { '/tmp/foo':
  content => 'bar',
}
$ puppet apply --tags package test.pp
Notice: Compiled catalog for undercloud.localhost in environment production in 0.10 seconds
Notice: Applied catalog in 0.02 seconds
$ cat /tmp/foo
cat: /tmp/foo: No such file or directory

When --tags doesn't work

The --tags option of 'puppet apply' doesn't always give us the behavior we are after which is to generate only config files. Some puppet modules have custom resources with providers that can execute commands anyway. This might be a mysql query or an openstackclient command to create a keystone endpoint. Remember here that we are trying to re-use puppet modules from our baremetal configuration and these resources are expected to be in our manifests... we just don't want them to run at the time we are generating config files. So we need an alternative mechanism to suppress (noop out) these offending resources.

To do this we've started using a custom built noop_resource function that exists in puppet-tripleo. This function dynamically configures a default provider for the named resource. For mysql this ends up looking like this:

['Mysql_datadir', 'Mysql_user', 'Mysql_database', 'Mysql_grant', 'Mysql_plugin'].each |String $val| { noop_resource($val) }

Running a puppet manifest with this at the top will noop out any of the named resource types so they won't execute, thus allowing puppet apply to complete and finish generating the config files within the specified manifest.

The good news is most of our services don't require the noop_resource in order to generate config files cleanly. But for those that do the interface allows us to effectively disable the resources we don't want to execute.

Putting it all together: docker-puppet.py

We bring everything together in tripleo-heat-templates to create one container configuration interface that allows us to configurably generate per-service config files. It looks like this:

  • manifest: the puppet manifest to use to generate config files (Thanks to composable services this is now per service!)
  • puppet_tags: the puppet tags to execute within this manifest
  • config_image: the docker image to use to generate config files. Generally we use the same image as the service itself.
  • config_volume: where to output the resulting config tree (includes /etc/ and some other directories).

And then we've created a custom tool to drive this per-service configuration called docker-puppet.py. The tool uses the information above, in a JSON file format, to drive generation of the config files in a single action.
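
Purely as an illustration (the exact schema is defined by docker-puppet.py itself, and the values below are made up), a per-service entry combining the keys above might look something like this:

{
  "keystone": {
    "config_volume": "keystone",
    "puppet_tags": "keystone_config",
    "manifest": "include ::tripleo::profile::base::keystone",
    "config_image": "<your registry>/centos-binary-keystone:latest"
  }
}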

It ends up working like this:

Video demo: Docker Puppet

And that's it. Our config interfaces are intact. We generate the config files we need. And we get to carry on with our efforts to deploy with containers.


by Dan Prince at January 25, 2017 02:00 PM

January 23, 2017

James Slagle

Update on TripleO with already provisioned servers

In a previous post, I talked about using TripleO with already deployed and provisioned servers. Since that was published, TripleO has made a lot of progress in this area. I figured it was about time for an update on where the project is with this feature.

Throughout the Ocata cycle, I’ve had the chance to help make this feature more
mature and easier to consume for production deployments.

Perhaps most importantly, for pulling their deployment metadata from Heat, the servers are now configured to use a Swift Temporary URL instead of having to rely on a Keystone username and password.

Also, instead of having to bootstrap the servers with all the expected packages
and initial configuration that TripleO typically expects from instances that it
has deployed from pre-built images, you can now start with a basic CentOS image
installed with only the initial python-heat-agent packages and the agent
running.
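
As a rough sketch (the package and service names here are assumptions based on the RDO packaging; check the documentation linked below for the exact bootstrap steps), preparing such a server could look something like this:

# minimal sketch, assuming RDO packaging of the Heat agent hooks
sudo yum install -y python-heat-agent python-heat-agent-puppet
sudo systemctl enable os-collect-config
sudo systemctl start os-collect-config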

There have also been other bug fixes and enhancements to enable this to work
with things such as network isolation and fixed predictable IP’s for all
networks.

I’ve started on some documentation that shows how to use this feature for
TripleO deployments: https://review.openstack.org/#/c/420369/
The documentation is still in progress, but I invite people to give it a try
and let me know how it works.

Using this feature, I’ve been able to deploy an Overcloud on 4 servers in a
remote lab from a virtualized Undercloud running in an entirely different lab.
There’s no L2 provisioning network connecting the 2 labs, and I don’t have
access to run a DHCP server on it anyway. The 4 Overcloud servers were
initially provisioned with the existing lab provisioning system
(cobbler/kickstart).

This flexibility helps build upon the composable nature of the
tripleo-heat-templates framework that we’ve been developing in TripleO
in that it allows integration with already existing provisioning environments.

Additionally, we’ve been using this capability extensively in our
Continuous Integration tests. Since TripleO does not have to be responsible for
provisioning the initial operating system on instances, we’ve been able to make
use of virtual instances provided by the OpenStack Infra project and
their managed Nodepool instance.

Like all other OpenStack CI jobs running in the standard check and gate queues,
our jobs are spread across several redundant OpenStack clouds. That means we
have a lot more virtual compute capacity for running tests than we previously
had available.

We’ve further been able to define job definitions using 2, 3, and 4 nodes in
the same test. These multinode tests, and the increased capacity, allow us to
test different deployment scenarios such as customized composable roles, and
recently, a job upgrading from the previous OpenStack release all the way to
master.

We’ve also scaled out our testing using scenario tests. Scenario tests allow us
to run a test with a specific configuration based on which files are actually
modified by the patch being tested. This allows the project to make
sure that patches affecting a given service are actually tested, since a
scenario test will be triggered deploying that service. This is important to
scaling our CI testing, because it’s unrealistic to expect to be able to deploy
every possible OpenStack service and test that it can be initially deployed, is
functional, and can be upgraded on every single TripleO patch.

If this is something you try out and have any feedback, I’d love to hear it and
see how we could improve this feature and make it easier to use.

by slagle at January 23, 2017 02:05 PM

January 12, 2017

Dougal Matthews

Calling Ansible from Mistral Workflows

I have spoken with a few people that were interested in calling Ansible from Mistral Workflows.

I finally got around to trying to make this happen. All that was needed was a very small and simple custom action that I put together, uploaded to github and also published to pypi.

Here is an example of a simple Mistral Workflow that makes use of these new actions.

---
version: 2.0

run_ansible_playbook:
  type: direct
  tasks:
    run_playbook:
      action: ansible-playbook
      input:
        playbook: path/to/playbook.yaml

Installing and getting started with this action is fairly simple. This is how I did it on my TripleO undercloud.

sudo pip install mistral-ansible-actions;
sudo mistral-db-manage populate;
sudo systemctl restart openstack-mistral*;

There is one gotcha that might be confusing. The Mistral Workflow runs as the mistral user, which means that user needs permission to access the Ansible playbook files.

After you have installed the custom actions, you can test them with the Mistral CLI. The first command should work without any extra setup; the second requires you to create a playbook somewhere and provide access to it.

mistral run-action ansible '{"hosts": "localhost", "module": "setup"}'
mistral run-action ansible-playbook '{"playbook": "path/to/playbook.yaml"}'

The action supports a few other input parameters; they are all listed for now in the README in the git repo. This is a very young project, but I am curious to know if people find it useful and what other features it would need.

If you want to write custom actions, check out the Mistral documentation.

by Dougal Matthews at January 12, 2017 02:20 PM

December 16, 2016

Giulio Fidente

TripleO to deploy Ceph standalone

Here is a nice Christmas present: you can use TripleO for a standalone Ceph deployment, with just a few lines of YAML. Assuming you have an undercloud ready for a new overcloud, create an environment file like the following:

resource_registry:
  OS::TripleO::Services::CephMon: /usr/share/openstack-tripleo-heat-templates/puppet/services/ceph-mon.yaml
  OS::TripleO::Services::CephOSD: /usr/share/openstack-tripleo-heat-templates/puppet/services/ceph-osd.yaml

parameters:
  ControllerServices:
    - OS::TripleO::Services::CephMon
  CephStorageServices:
    - OS::TripleO::Services::CephOSD

and launch a deployment with:

openstack overcloud deploy --compute-scale 0 --ceph-storage-scale 1 -e the_above_env_file.yaml

The two lines from the environment file in resource_registry are mapping (and enabling) the CephMon and CephOSD services in TripleO while the lines in parameters are defining which services should be deployed on the controller and cephstorage roles.

This will bring up a two-node overcloud with one node running ceph-mon and the other ceph-osd, but the actual Christmas gift is that it implicitly provides and allows usage of all the features we already know from TripleO, like:

  • baremetal provisioning
  • network isolation
  • a web GUI
  • lifecycle management
  • ... containers
  • ... upgrades

For example, you can scale up the Ceph cluster with:

openstack overcloud deploy --compute-scale 0 --ceph-storage-scale 2 -e the_above_env_file.yaml

and this will provision a new Ironic node with the cephstorage role, configuring the required networks on it and updating the cluster config for the new OSDs. (Note the --ceph-storage-scale parameter going from 1 to 2 in the second example).

Even more interesting is that the above will work for any service, not just Ceph: new services can be added to TripleO with just some YAML and puppet (a rough sketch of a composable service template follows the list below), letting TripleO take care of a number of issues common to any deployment tool, for example:

  • supports multinode deployments
  • synchronizes and orders the deployment steps across different nodes
  • supports propagation of config data across different services
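
As an illustration of the "just some YAML" part, a minimal Newton-style composable service template might look roughly like this (the service name, hiera settings and puppet profile below are hypothetical):

heat_template_version: 2016-04-08

description: Example composable service (illustrative only)

parameters:
  ServiceNetMap:
    default: {}
    type: json
  DefaultPasswords:
    default: {}
    type: json
  EndpointMap:
    default: {}
    type: json

outputs:
  role_data:
    description: Role data for the example service
    value:
      service_name: example
      config_settings:
        example::some_setting: some_value
      step_config: |
        include ::tripleo::profile::base::example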

Time to try it and join the fun in #tripleo :)

by Giulio Fidente at December 16, 2016 10:00 PM

December 15, 2016

Julie Pichon

A Quick Introduction to Mistral Usage in TripleO (Newton) | For developers

Since Newton, Mistral has become a central component to the TripleO project, handling many of the operations in the back-end. I recently gave a short crash course on Mistral, what it is and how we use it to a few people and thought it might be useful to share some of my bag of tricks here as well.

What is Mistral?

It's a workflow service. You describe what you want as a series of steps (tasks) in YAML, and it will coordinate things for you, usually asynchronously.

Link: Mistral overview.
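
To get a feel for the syntax, a trivial workflow might look like this (a minimal sketch using Mistral's built-in std.echo action):

---
version: 2.0

say_hello:
  type: direct
  tasks:
    hello:
      action: std.echo output="Hello from Mistral"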

We are using it for a few reasons:

  • it lets us manage long-running processes (e.g. introspection) and track their state
  • it acts as a common interface/API that is currently used by both the TripleO CLI and UI, thus avoiding duplication, and can also be consumed directly by external non-OpenStack consumers (e.g. ManageIQ).

Terminology

A workbook contains multiple workflows. (The TripleO workbooks live at https://github.com/openstack/tripleo-common/tree/master/workbooks).

A workflow contains a series of 'tasks' which can be thought of as steps. We use the default 'direct' type of workflow on TripleO, which means tasks are executed in the order written, moving around based on the on-success and on-error values.

Every task calls to an action (or to another workflow), which is where the work actually gets done.

OpenStack services are automatically mapped into actions thanks to the mappings defined in Mistral, so we get a ton of actions for free already.

Useful tip: with the following command you can see locally which actions are available for a given project.

$ mistral action-list | grep $projectname

You can of course create your own actions. Which we do. Quite a lot.

$ mistral action-list | grep tripleo

An execution is what an instance of a running workflow is called, once you have started one.

Link: Mistral terminology (very detailed, with diagrams and examples).

Where the TripleO Mistral workflows live

https://github.com/openstack/tripleo-common/tree/master/workbooks
https://github.com/openstack/tripleo-common/tree/master/tripleo_common/actions

Let's look at a couple of examples.

A short one to start with, scaling down

https://github.com/openstack/tripleo-common/blob/156d2c/workbooks/scale.yaml#L8

It takes some input, starts with the 'delete_node' task and continues on to on-success or on-error depending on the action result.

Note: You can see we always end the workflow with send_message, which is a convention we use in the project. Even if an action failed and moves to on-error, the workflow itself should be successful (a failed workflow would indicate a problem at the Mistral level). We end with send_message because we want to let the caller know what was the result.

How will the consumer get to that result? We associate every workflow with a Zaqar queue. This is a TripleO convention, not a Mistral requirement. Each of our workflows takes a queue_name as input, and the clients are expected to listen to the Zaqar socket for that queue in order to receive the messages.
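
For illustration, the send_message convention typically boils down to a task along these lines (a rough sketch only; the exact message body varies per workflow and the field names here are indicative):

send_message:
  action: zaqar.queue_post
  input:
    queue_name: <% $.queue_name %>
    messages:
      body:
        type: tripleo.example.v1.result
        payload:
          status: SUCCESS
          message: ''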

Another point, about the action itself on line 20: tripleo.scale.delete_node is a TripleO-specific action, as indicated in the name. If you were interested in finding the code for it, you should look at the entry_points in setup.cfg for tripleo-common (where all the workflows live):

https://github.com/openstack/tripleo-common/blob/156d2c/setup.cfg#L81

which would lead you to the code at:

https://github.com/openstack/tripleo-common/blob/156d2c/tripleo_common/actions/scale.py#L52

A bit more complex: node configuration

https://github.com/openstack/tripleo-common/blob/156d2c/workbooks/baremetal.yaml#L402

It's "slightly more complex" in that it has a couple more tasks, and it also calls to another workflow (line 426). You can see it starts with a call to ironic.node_list in its first task at line 417, which comes for free with Mistral. No need to reimplement it.

Debugging notes on workflows and Zaqar

Each workflow creates a Zaqar queue, to send progress information back to the client (CLI, web UI, other...).

Sometimes these messages get lost and the process hangs. It doesn't mean the action didn't complete successfully.

  • Check the Zaqar processes are up and running: $ sudo systemctl | grep zaqar (this has happened to me after reboots)
  • Check Mistral for any errored workflow: $ mistral execution-list
  • Check the Mistral logs (executor.log and engine.log are usually where the interesting errors are)
  • Ocata has timeouts for some of the commands now, so this is getting better

Following a workflow through its execution via CLI

This particular example will run somewhat fast so it's more of a "tracing back what happened afterwards."

$ openstack overcloud plan create my-new-overcloud
Started Mistral Workflow. Execution ID: 05d550f2-5d13-4782-be7f-a775a1d86a84
Default plan created

The CLI nicely tells you which execution ID to look for, so let's use it:

$ mistral task-list 05d550f2-5d13-4782-be7f-a775a1d86a84

+--------------------------------------+---------------------------------+--------------------------------------------+--------------------------------------+---------+------------------------------+
| ID                                   | Name                            | Workflow name                              | Execution ID                         | State   | State info                   |
+--------------------------------------+---------------------------------+--------------------------------------------+--------------------------------------+---------+------------------------------+
| c6e0fef0-4e65-4ee6-9ae4-a6d9e8451fd0 | verify_container_doesnt_exist   | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | ERROR   | Failed to run action [act... |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 72c1310d-8379-4869-918e-62eb04530e46 | verify_environment_doesnt_exist | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | ERROR   | Failed to run action [act... |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 74438300-8b18-40fd-bf73-62a1d90f71b3 | create_container                | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 667c0e4b-6f6c-447d-9325-ab6c20c8ad98 | upload_to_container             | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| ef447ea6-86ec-4a62-bca2-a083c66f96d3 | create_plan                     | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| f37ebe9f-b39c-4f7a-9a60-eceb80782714 | ensure_passwords_exist          | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 193f65fb-502a-4e4c-9a2d-053966500990 | plan_process_templates          | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 400d7e11-aea8-45c7-96e8-c61523d66fe4 | plan_set_status_success         | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
| 9df60103-15e2-442e-8dc5-ff0d61dba449 | notify_zaqar                    | tripleo.plan_management.v1.create_default_ | 05d550f2-5d13-4782-be7f-a775a1d86a84 | SUCCESS | None                         |
|                                      |                                 | deployment_plan                            |                                      |         |                              |
+--------------------------------------+---------------------------------+--------------------------------------------+--------------------------------------+---------+------------------------------+

This gives you an idea of what Mistral did to accomplish the goal. You can also map it back to the workflow defined in tripleo-common to follow through the steps and find out what exactly was run. If the workflow stopped too early, this can give you an idea of where the problem occurred.

Side-note about plans and the ERRORed tasks above

As of Newton, information about deployment is stored in a "Plan" which is implemented as a Swift container together with a Mistral environment. This could change in the future but for now that is what a plan is.

To create a new plan, we need to make sure there isn't already a container or an environment with that name. We could implement this in an action in Python, or since Mistral already has commands to get a container / get an environment we can be clever about this and reverse the on-error and on-success actions compared to usual:

https://github.com/openstack/tripleo-common/blob/156d2c/workbooks/plan_management.yaml#L129

If we do get a 'container' then it means it already exists and the plan already exists, so we cannot reuse this name. So 'on-success' becomes the error condition.
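
The pattern looks roughly like this (the task and action names are simplified here, see the workbook linked above for the real definition):

verify_container_doesnt_exist:
  action: swift.head_container
  input:
    container: <% $.container %>
  # the container was found, so the plan name is already taken
  on-success: notify_plan_exists
  # the lookup failed, so the name is free and we carry on
  on-error: create_container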

I sometimes regret a little us going this way because it leaves exception tracebacks in the logs, which is misleading when folks go to the Mistral logs for the first time in order to debug some other issue.

Finally I'd like to end all this by mentioning the Mistral Quick Start tutorial, which is excellent. It takes you from creating a very simple workflow to following its journey through the execution.

How to create your own action/workflow in TripleO

Mistral documentation:

In short:

  • Start writing your python code, probably under tripleo_common/actions
  • Add an entry point referencing it to setup.cfg (a sketch follows this list)
  • /!\ Restart Mistral /!\ Action code is only picked up when Mistral starts
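
For the second step, a hypothetical setup.cfg entry could look like this (the mistral.actions namespace matches the existing tripleo-common entry points; the module and class names below are made up):

[entry_points]
mistral.actions =
    tripleo.example.do_thing = tripleo_common.actions.example:DoThingAction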

This is summarised in the TripleO common README (personally I put this in a script to easily rerun it all).

Back to deployments: what's in a plan

As mentioned earlier, a plan is the combination of a Swift container and a Mistral environment. In theory this is an implementation detail which shouldn't matter to deployers. In practice knowing this gives you access to a few more debugging tricks.

For example, the templates you initially provided will be accessible through Swift.

$ swift list $plan-name

Everything else will live in the Mistral environment. This contains:

  • The default passwords (which is a potential source of confusion)
  • The parameter_defaults, aka overridden parameters (these take priority and would override the passwords above)
  • The list of enabled environments (this looks nicer for plans created from the UI, as they are all munged into one user-environment.yaml file when deploying from CLI - see bug 1640861)

$ mistral environment-get $plan-name

For example, with an SSL-deployment done from the UI:

$ mistral environment-get ssl-overcloud
+-------------+-----------------------------------------------------------------------------------+
| Field       | Value                                                                             |
+-------------+-----------------------------------------------------------------------------------+
| Name        | ssl-overcloud                                                                     |
| Description | <none>                                                                            |
| Variables   | {                                                                                 |
|             |     "passwords": {                                                                |
|             |         "KeystoneFernetKey1": "V3Dqp9MLP0mFvK0C7q3HlIsGBAI5VM1aW9JJ6c5lLjo=",     |
|             |         "KeystoneFernetKey0": "ll6gbwcbhyAi9jNvBnpWDImMmEAaW5dog5nRQvzvEz4=",     |
|             |         "HAProxyStatsPassword": "NXgvwfJ23VHJmwFf2HmKMrgcw",                      |
|             |         "HeatPassword": "Fs7K3CxR636BFhyDJWjsbAQZr",                              |
|             |         "ManilaPassword": "Kya6gr2zp2x8ApD6wtwUUMcBs",                            |
|             |         "NeutronPassword": "x2YK6xMaYUtgn8KxyFCQXfzR6",                           |
|             |         "SnmpdReadonlyUserPassword": "5a81d2d83ee4b69b33587249abf49cd672d08541",  |
|             |         "GlancePassword": "pBdfTUqv3yxpH3BcPjrJwb9d9",                            |
|             |         "AdminPassword": "KGGz6ApEDGdngj3KMpy7M2QGu",                             |
|             |         "IronicPassword": "347ezHCEqpqhmANK4fpWK2MvN",                            |
|             |         "HeatStackDomainAdminPassword": "kUk6VNxe4FG8ECBvMC6C4rAqc",              |
|             |         "ZaqarPassword": "6WVc8XWFjuKFMy2qP2qqqVk82",                             |
|             |         "MysqlClustercheckPassword": "M8V26MfpJc8FmpG88zu7p3bpw",                 |
|             |         "GnocchiPassword": "3H6pmazAQnnHj24QXADxPrguM",                           |
|             |         "CephAdminKey": "AQDloEFYAAAAABAAcCT546pzZnkfCJBSRz4C9w==",               |
|             |         "CeilometerPassword": "6DfAKDFdEFhxWtm63TcwsEW2D",                        |
|             |         "CinderPassword": "R8DvNyVKaqA44wRKUXEWfc4YH",                            |
|             |         "RabbitPassword": "9NeRMdCyQhekJAh9zdXtMhZW7",                            |
|             |         "CephRgwKey": "AQDloEFYAAAAABAACIfOTgp3dxt3Sqn5OPhU4Q==",                 |
|             |         "TrovePassword": "GbpxyPdnJkUCjXu4AsjmgqZVv",                             |
|             |         "KeystoneCredential0": "1BNiiNQjthjaIBnJd3EtoihXu25ZCzAYsKBpPQaV12M=",    |
|             |         "KeystoneCredential1": "pGZ4OlCzOzgaK2bEHaD1xKllRdbpDNowQJGzJHo6ETU=",    |
|             |         "CephClientKey": "AQDloEFYAAAAABAAoTR3S00DaBpfz4cyREe22w==",              |
|             |         "NovaPassword": "wD4PUT4Y4VcuZsMJTxYsBTpBX",                              |
|             |         "AdminToken": "hdF3kfs6ZaCYPUwrTzRWtwD3W",                                |
|             |         "RedisPassword": "2bxUvNZ3tsRfMyFmTj7PTUqQE",                             |
|             |         "MistralPassword": "mae3HcEQdQm6Myq3tZKxderTN",                           |
|             |         "SwiftHashSuffix": "JpWh8YsQcJvmuawmxph9PkUxr",                           |
|             |         "AodhPassword": "NFkBckXgdxfCMPxzeGDRFf7vW",                              |
|             |         "CephClusterFSID": "3120b7cc-b8ac-11e6-b775-fa163e0ee4f4",                |
|             |         "CephMonKey": "AQDloEFYAAAAABAABztgp5YwAxLQHkpKXnNDmw==",                 |
|             |         "SwiftPassword": "3bPB4yfZZRGCZqdwkTU9wHFym",                             |
|             |         "CeilometerMeteringSecret": "tjyywuf7xj7TM7W44mQprmaC9",                  |
|             |         "NeutronMetadataProxySharedSecret": "z7mb6UBEHNk8tJDEN96y6Acr3",          |
|             |         "BarbicanPassword": "6eQm4fwqVybCecPbxavE7bTDF",                          |
|             |         "SaharaPassword": "qx3saVNTmAJXwJwBH8n3w8M4p"                             |
|             |     },                                                                            |
|             |     "parameter_defaults": {                                                       |
|             |         "OvercloudControlFlavor": "control",                                      |
|             |         "ComputeCount": "2",                                                      |
|             |         "ControllerCount": "3",                                                   |
|             |         "OvercloudComputeFlavor": "compute",                                      |
|             |         "NtpServer": "my.ntp-server.example.com"                                  |
|             |     },                                                                            |
|             |     "environments": [                                                             |
|             |         {                                                                         |
|             |             "path": "overcloud-resource-registry-puppet.yaml"                     |
|             |         },                                                                        |
|             |         {                                                                         |
|             |             "path": "environments/inject-trust-anchor.yaml"                       |
|             |         },                                                                        |
|             |         {                                                                         |
|             |             "path": "environments/tls-endpoints-public-ip.yaml"                   |
|             |         },                                                                        |
|             |         {                                                                         |
|             |             "path": "environments/enable-tls.yaml"                                |
|             |         }                                                                         |
|             |     ],                                                                            |
|             |     "template": "overcloud.yaml"                                                  |
|             | }                                                                                 |
| Scope       | private                                                                           |
| Created at  | 2016-12-02 16:27:11                                                               |
| Updated at  | 2016-12-06 21:25:35                                                               |
+-------------+-----------------------------------------------------------------------------------+

Note: 'environment' is an overloaded word in the TripleO world, be careful. Heat environment, Mistral environment, specific templates (e.g. TLS/SSL, Storage...), your whole setup, ...

Bonus track

There is documentation on going from zero (no plan, no nodes registered) till running a deployment, directly using Mistral: http://tripleo.org/mistral-api/mistral-api.html.

Also, with the way we work with Mistral and Zaqar, you can switch between the UI and CLI, or even use Mistral directly, at any point in the process.

~

Thanks to Dougal for his feedback on the initial outline!

Tagged with: open-source, openstack, tripleo

by jpichon at December 15, 2016 11:09 AM

October 10, 2016

Steven Hardy

TripleO composable/custom roles

This is a follow-up to my previous post outlining the new composable services interfaces, which covered the basics of the new-for-Newton composable services model.

The final piece of the composability model we've been developing this cycle is the ability to deploy user-defined custom roles, in addition to (or even instead of) the built-in TripleO roles (where a role is a group of servers, e.g. "Controller", which runs some combination of services).

What follows is an overview of this new functionality, the primary interfaces, some usage examples, and a summary of planned future work.



Fully Composable/Custom Roles

As described in previous posts, TripleO has for a long time provided a fixed architecture with 5 roles (where "roles" means groups of nodes), e.g. Controller, Compute, BlockStorage, CephStorage and ObjectStorage.

This architecture has been sufficient to enable standardized deployments, but it's not very flexible.  With the addition of the composable-services model, moving services around between these roles becomes much easier, but many operators want to go further, and have full control of service placement on any arbitrary roles.

Now that the custom-roles feature has been implemented, this is possible, and operators can define arbitrary role types to enable fully composable deployments. When combined with composable services, this represents a huge step forward for TripleO flexibility! :)

Usage examples

To deploy with additional custom roles (or to remove/rename the default roles), a new interface has been added to the python-tripleoclient “overcloud deploy” command, so you simply need to copy the default roles_data.yaml, modify it to suit your requirements (for example by moving services between roles, or adding a new role), then do a deployment referencing the modified roles_data.yaml file:

cp /usr/share/openstack-tripleo-heat-templates/roles_data.yaml my_roles_data.yaml
<modify my_roles_data.yaml>
openstack overcloud deploy --templates -r my_roles_data.yaml


Alternatively you can copy the entire tripleo-heat-templates tree (or use a git checkout):

cp -r /usr/share/openstack-tripleo-heat-templates my-tripleo-heat-templates
<modify my-tripleo-heat-templates/roles_data.yaml>
openstack overcloud deploy --templates my-tripleo-heat-templates


Both approaches are essentially equivalent, the -r option simply overwrites the default roles_data.yaml during creation of the plan data (stored in swift on the undercloud), but it's slightly more convenient if you want to use the default packaged tripleo-heat-templates instead of constantly rebasing a copied tree.

So, lets say you wanted to deploy one additional node, only running the OS::TripleO::Ntp composable service, you'd copy roles_data.yaml, and append a list entry like this:

- name: NtpRole
  CountDefault: 1
  ServicesDefault:
    - OS::TripleO::Services::Ntp



(Note that in practice you'll probably also want some of the common services deployed on all roles, such as OS::TripleO::Services::Kernel, OS::TripleO::Services::TripleoPackages, OS::TripleO::Services::TripleoFirewall and OS::TripleO::Services::VipHosts)

 

Nice, so how does it work?


The main change made to enable custom roles is a pre-deployment templating step which runs Jinja2. We define a roles_data.yaml file (which can be overridden by the user), which contains a list of role names, and optionally some additional data related to default parameter values (such as the default services deployed on the role, and the default count in the group).

The roles_data.yaml definitions look like this:

- name: Controller
  CountDefault: 1
  ServicesDefault:
    - OS::TripleO::Services::CACerts
    - OS::TripleO::Services::CephMon
    - OS::TripleO::Services::CinderApi
    - ...

The format is simply a yaml list of maps, with a mandatory “name” key in each map, and a number of optional FooDefault keys which set the parameter defaults for the role (as a convenience so the user won't have to specify it via an environment file during the overcloud deployment).

A custom mistral action is used to run Jinja2 when creating or updating a “deployment plan” (which is a combination of some heat templates stored in swift, and a mistral environment containing user parameters) – and this basically consumes the roles_data.yaml list of required roles, and outputs a rendered tree of Heat templates ready to deploy your overcloud.
Custom Roles, overview


There are two types of Jinja2 templates which are rendered differently, distinguished by the file extension/suffix:

foo.j2.yaml

This will pass in the contents of the roles_data.yaml list, and iterate over each role in the list. The resulting file in the plan swift container will be named foo.yaml.
Here's an example of the syntax used for j2 templating inside these files:

enabled_services:
  list_join:
    - ','
{% for role in roles %}
    - {get_attr: [{{role.name}}ServiceChain, role_data, service_names]}
{% endfor %}

This example is from overcloud.j2.yaml. It does a Jinja2 loop appending the service_names of each role's *ServiceChain resource (these resources are also dynamically generated via a similar loop), and the result is then processed on deployment via a heat list_join function.

foo.role.j2.yaml

This will generate a file per role, where only the name of the role is passed in during the templating step, with the resulting files being called rolename-foo.yaml. (Note that if you have a role which requires a special template, it is possible to disable this file generation by adding the path to the j2_excludes.yaml file.)

Here's an example of the syntax used in these files (taken from the role.role.j2.yaml file, which is our new definition of server for a generic role):

resources:
  {{role}}:
    type: OS::TripleO::Server
    metadata:
      os-collect-config:
        command: {get_param: ConfigCommand}
    properties:
      image: {get_param: {{role}}Image}

As you can see, this simply allows use of a {{role}} placeholder, which is then substituted with the role name when rendering each file (one file per role defined in the roles_data.yaml list).


Debugging/Development tips

When making changes to either the roles_data.yaml, and particularly when making changes to the *.j2.yaml files in tripleo-heat-templates, it's often helpful to view the rendered templates before any overcloud deployment is attempted.

This is possible via use of the “openstack overcloud plan create” interface (which doesn't yet support the -r option above, so you have to copy or git clone the tree), combined with swiftclient:

openstack overcloud plan create overcloud --templates my_tripleo_heat_templates
mkdir tmp_templates && pushd tmp_templates
swift download overcloud

This will download the full tree of rendered files from the swift container (named “overcloud” due to the name passed to plan create), so you can e.g view the rendered overcloud.yaml that's generated by combining the overcloud.j2.yaml template with the roles_data.yaml file.

If you make a mistake in your *.j2.yaml file, the jinja2 error should be returned via the plan create command, but it can also be useful to tail -f /var/log/mistral/mistral-server.log for additional information during development (this shows the output logged from running jinja2 via the custom mistral action plugin).

Limitations/future work

These new interfaces allow for much greater deployment flexibility and choice, but there are a few remaining issues which will be addressed in future development cycles:
  1. All services managed by pacemaker are still tied to the Controller role. Thanks to the implementation of a more lightweight HA architecture during the Newton cycle, the list of services managed by pacemaker is considerably reduced, but there's still a number of services (DB & RPC services primarily) which are, and until the composable-ha blueprint is completed (hopefully during Ocata), these services cannot be moved to a non-Controller role.
  2. Custom isolated networks cannot be defined. Since arbitrary roles types can now be defined, there may be a requirement to define arbitrary additional networks for network-isolation, but right now this is not possible.
  3. roles_data.yaml must be copied. As in the examples above, it's necessary to copy either roles_data.yaml, (or the entire tripleo-heat-templates tree), which means if the packaged roles_data.yaml changes (such as to add new services to the built-in roles), you must merge these changes in with your custom roles_data. In future we may add a convenience interface which makes it easier to e.g add a new role without having to care about the default role definitions.
  4. No model for dependencies between services.  Currently, ensuring the right combination of services is deployed on specific roles is left to the operator; there's no validation of incompatible or inter-dependent services, but this may be addressed in a future release.

by Steve Hardy (noreply@blogger.com) at October 10, 2016 09:04 AM

September 08, 2016

Emilien Macchi

Scaling-up TripleO CI coverage with scenarios

TripleO CI up to eleven!

testing

 

When the project OpenStack started, it was “just” a set of services with the goal of spawning a VM. I remember you could run everything on your laptop and test things really quickly.
The project has now grown, thousands of features have been implemented, more backends / drivers are supported and new projects have joined the party.
This makes testing very challenging because not everything can be tested in a CI environment.

TripleO aims to be an OpenStack installer that takes care of service deployment. Our CI was only testing a set of services and a few plugins/drivers.
We had to find a way to test more services, more plugins and more drivers in an efficient way, without wasting CI resources.

So we thought that we could create some scenarios with a limited set of services, configured with a specific backend / plugin, and one CI job would deploy and test one scenario.
Example: scenario001 would be the Telemetry scenario, testing required services like Keystone, Nova, Glance, Neutron, but also Aodh, Ceilometer and Gnocchi.

Puppet OpenStack CI has been using this model for a while and it works pretty well. We're going to reproduce it in TripleO CI for consistency.

 

How scenarios are run when patching TripleO?

We are using a feature in Zuul that allows us to select which scenario we want to test, depending on the files we try to patch in a commit.
For example, if I submit a patch to TripleO Heat Templates and try to modify “puppet/services/ceilometer-api.yaml”, which is the composable service for Ceilometer-API, Zuul will trigger scenario001. See the Zuul layout:

- name : ^gate-tripleo-ci-centos-7-scenario001-multinode.*$
  files:
    - ^puppet/services/aodh.*$
    - ^manifests/profile/base/aodh.*$
    - ^puppet/services/ceilometer.*$
    - ^manifests/profile/base/ceilometer.*$
    - ^puppet/services/gnocchi.*$
    - ^manifests/profile/base/gnocchi.*$

 

How can I bring my own service in a scenario?

The first step is to look at the Puppet CI matrix and see if we already test the service in a scenario. If yes, please keep this number consistent with the TripleO CI matrix. If not, you'll need to pick a scenario, usually the least loaded one, to avoid performance issues.
Now you need to patch openstack-infra/project-config and specify the files that deploy your service.
For example, if your service is “Zaqar”, you’ll add something like:

- name : ^gate-tripleo-ci-centos-7-scenario002-multinode.*$
  files:
    ...
    - ^puppet/services/zaqar.*$
    - ^manifests/profile/base/zaqar.*$

Every time you send a patch to TripleO Heat Templates touching puppet/services/zaqar* files, or to puppet-tripleo touching manifests/profile/base/zaqar*, scenario002 will be triggered.

Finally, you need to send a patch to openstack-infra/tripleo-ci:

  • Modify README.md to add the new service in the matrix.
  • Modify templates/scenario00X-multinode-pingtest.yaml and add a resource to test the service (for Zaqar, it could be a Zaqar Queue; a sketch follows this list).
  • Modify test-environments/scenario00X-multinode.yaml and add the TripleO composable services and parameters to deploy the service.
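
Staying with the Zaqar example, the pingtest resource from the second bullet could be as simple as this (a sketch only; the resource and queue names are illustrative):

resources:
  zaqar_test_queue:
    type: OS::Zaqar::Queue
    properties:
      name: pingtest-queue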

Once you send the tripleo-ci patch, you can block it with a Workflow -1 to avoid an accidental merge. Now go to openstack/tripleo-heat-templates and modify the zaqar composable service, by adding a comment or something you actually want to test. In the commit message, add “Depends-On: XXX” where XXX is the Change-Id of the tripleo-ci patch. When you send the patch, you'll see that Zuul triggers the appropriate scenario and your service gets tested.

 

 

What’s next?

  • Allow testing to extend beyond the pingtest. Some services, for example Ironic, can't be tested with the pingtest. Maybe running Tempest for a set of services would be something to investigate.
  • Zuul v3 is the big thing we're all waiting for to extend the granularity of our matrix. A limitation of the current Zuul version (2.5) is that we can't run scenarios in the Puppet OpenStack modules CI, because we don't have a way to combine the file rules that we saw before AND run the jobs for a specific project without file restrictions (e.g. puppet-zaqar for scenario002). In other words, our CI will be better with Zuul v3 and we'll improve our testing coverage by running the right scenarios on the right projects.
  • Extend the number of nodes. We currently use multinode jobs which deploy an undercloud and a subnode for the overcloud (all-in-one). Some use-cases might require a third node (for example Ironic).

Any feedback on this blog post is highly welcome, please let me know if you want me to cover something more in details.

by Emilien at September 08, 2016 10:52 PM

August 26, 2016

Giulio Fidente

Ceph, TripleO and the Newton release

Time to roll up some notes on the status of Ceph in TripleO. The majority of these functionalities were available in the Mitaka release too, but the examples work with code from the Newton release so they might not apply identically to Mitaka.

The TripleO default configuration

No default is going to fit everybody, but we want to know what the default is to improve from there. So let's try and see:

uc$ openstack overcloud deploy --templates tripleo-heat-templates -e tripleo-heat-templates/environments/puppet-pacemaker.yaml -e tripleo-heat-templates/environments/storage-environment.yaml --ceph-storage-scale 1
Deploying templates in the directory /home/stack/example/tripleo-heat-templates
...
Overcloud Deployed

Monitors go on the controller nodes, one per node; the above command is deploying a single controller though. The first interesting thing to point out is:

oc$ ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

Jewel! Kudos to Emilien for bringing support for it in puppet-ceph. Continuing our investigation, we notice the OSDs go on the cephstorage nodes and are backed by the local filesystem, as we didn't tell it to do differently:

oc$ ceph osd tree
ID WEIGHT  TYPE NAME                        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.03999 root default
-2 0.03999     host overcloud-cephstorage-0
 0 0.03999         osd.0                         up  1.00000          1.00000

Notice we got SELinux covered:

oc$ ls -laZ /srv/data
drwxr-xr-x. ceph ceph system_u:object_r:ceph_var_lib_t:s0 .
...

And use CephX with autogenerated keys:

oc$ ceph auth list
installed auth entries:

client.admin
        key: AQC2Pr9XAAAAABAAOpviw6DqOMG0syeEYmX2EQ==
        caps: [mds] allow *
        caps: [mon] allow *
        caps: [osd] allow *
client.openstack
        key: AQC2Pr9XAAAAABAAA78Svmmt+LVIcRrZRQLacw==
        caps: [mon] allow r
        caps: [osd] allow class-read object_prefix rbd_children, allow rwx pool=volumes, allow rwx pool=backups, allow rwx pool=vms, allow rwx pool=images, allow rwx pool=metrics

But which OpenStack service is using Ceph? The storage-environment.yaml file has some information:

uc$ grep -v '#' tripleo-heat-templates/environments/storage-environment.yaml | uniq

 resource_registry:
   OS::TripleO::Services::CephMon: ../puppet/services/ceph-mon.yaml
   OS::TripleO::Services::CephOSD: ../puppet/services/ceph-osd.yaml
   OS::TripleO::Services::CephClient: ../puppet/services/ceph-client.yaml

 parameter_defaults:
   CinderEnableIscsiBackend: false
   CinderEnableRbdBackend: true
   CinderBackupBackend: ceph
   NovaEnableRbdBackend: true
   GlanceBackend: rbd
   GnocchiBackend: rbd

The registry lines enable the Ceph services, while the parameters set Ceph as the backend for Cinder, Nova, Glance and Gnocchi. They can be configured to use other backends, see the comments in the environment file. Regarding the pools:

oc$ ceph osd lspools
0 rbd,1 metrics,2 images,3 backups,4 volumes,5 vms,

The replica size is set to 3 by default but we only have a single OSD, so the cluster will never get into HEALTH_OK:

oc$ ceph osd pool get vms size
size: 3

Good to know, now a new deployment with more interesting stuff.

A more realistic scenario

What makes it "more realistic"? We'll have enough OSDs to cover the replica size. We'll use physical disks for our OSDs (and journals) and not the local filesystem. We'll cope with a node that has a different disk topology and we'll decrease the replica size for one of the pools.

Set a default disks map for the OSD nodes

Define a default configuration for the storage nodes, telling TripleO to use sdb for the OSD data and sdc for the journal:

ceph_default_disks.yaml
  parameter_defaults:
    CephStorageExtraConfig:
      ceph::profile::params::osds:
        /dev/sdb:
          journal: /dev/sdc

Customize the disks map for a specific node

For the node which has two rotational disks (instead of a single one), we'll need a specific map. First get its system-uuid from the Ironic introspection data:

uc$ openstack baremetal introspection data save | jq .extra.system.product.uuid
"66C033FA-BAC0-4364-9E8A-3184B5952370"

then create the node specific map:

ceph_mynode_disks.yaml
  resource_registry:
    OS::TripleO::CephStorageExtraConfigPre: tripleo-heat-templates/puppet/extraconfig/pre_deploy/per_node.yaml

  parameter_defaults:
    NodeDataLookup: >
     {"66C033FA-BAC0-4364-9E8A-3184B5952370":
       {"ceph::profile::params::osds":
         {"/dev/sdb": {"journal": "/dev/sdd"},
          "/dev/sdc": {"journal": "/dev/sdd"}
         }
       }
     }

Fine tune pg_num, pgp_num and replica size for a pool

Finally, to override the replica size (and why not, PGs number) of the "vms" pool (where by default the Nova ephemeral disks go):

ceph_pools_config.yaml
  parameter_defaults:
    CephPools:
      vms:
        size: 2
        pg_num: 128
        pgp_num: 128

Zap all disks for the new deployment

We also want to clear and prepare all the non-root disks with a GPT label, which will allow us, for example, to repeat the deployment multiple times reusing the same nodes. The implementation of the disks cleanup script can vary, but we can use a sample script and wire it to the overcloud nodes via NodeUserData:

uc$ curl -O https://gist.githubusercontent.com/gfidente/42d3cdfe0c67f7c95f0c/raw/1f467c6018ada194b54f22113522db61ef944e20/ceph_wipe_disk.yaml

ceph_wipe_env.yaml:
  resource_registry:
    OS::TripleO::NodeUserData: ceph_wipe_disk.yaml

  parameter_defaults:
    ceph_disks: "/dev/sdb /dev/sdc /dev/sdd"

All the above environment files could have been merged in a single one but we split them out in multiple ones for clarity. Now the new deploy command:

uc$ openstack overcloud deploy --templates tripleo-heat-templates -e tripleo-heat-templates/environments/puppet-pacemaker.yaml -e tripleo-heat-templates/environments/storage-environment.yaml --ceph-storage-scale 3 -e ceph_pools_config.yaml -e ceph_mynode_disks.yaml -e ceph_default_disks.yaml -e ceph_wipe_env.yaml
Deploying templates in the directory /home/stack/example/tripleo-heat-templates
...
Overcloud Deployed

Here is our OSD tree, with two OSD instances running on the node with two rotational disks (sharing the same journal disk):

oc$ ceph osd tree
ID WEIGHT  TYPE NAME                        UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.03119 root default
-2 0.00780     host overcloud-cephstorage-1
 0 0.00780         osd.0                         up  1.00000          1.00000
-3 0.01559     host overcloud-cephstorage-2
 1 0.00780         osd.1                         up  1.00000          1.00000
 2 0.00780         osd.2                         up  1.00000          1.00000
-4 0.00780     host overcloud-cephstorage-0
 3 0.00780         osd.3                         up  1.00000          1.00000

and the custom PG/size values for the "vms" pool:

oc$ ceph osd pool get vms size
size: 2
oc$ ceph osd pool get vms pg_num
pg_num: 128

Another simple customization could have been to set the journal size. For example:

ceph_journal_size.yaml
  parameter_defaults:
    ExtraConfig:
      ceph::profile::params::osd_journal_size: 1024

Also we did not provide any customization for the crushmap but a recent addition from Erno makes it possible to disable global/osd_crush_update_on_start so that any customization becomes possible after the deployment is finished.

Also we did not deploy the RadosGW service as it is still a work in progress, expected for the Newton release. Submissions for its inclusion are on review.

We're also working on automating the upgrade from the Ceph/Hammer release deployed with TripleO/Mitaka to Ceph/Jewel, installed with TripleO/Newton. The process will be integrated with the OpenStack upgrade and again the submissions are on review in a series.

For more scenarios

The mechanism recently introduced in TripleO to make roles composable, discussed in Steven's blog post, makes it possible to test a complete Ceph deployment using a single controller node too (hosting the OSD service as well), just by adding OS::TripleO::Services::CephOSD to the list of services deployed on the controller role.

And if the above still wasn't enough, TripleO continues to support configuration of OpenStack with a pre-existing, unmanaged Ceph cluster. To do so we'll want to customize the parameters in puppet-ceph-external.yaml and deploy passing that as argument instead. For example:

puppet-ceph-external.yaml
  resource_registry:
    OS::TripleO::Services::CephExternal: tripleo-heat-templates/puppet/services/ceph-external.yaml

  parameter_defaults:
    # NOTE: These example parameters are required when using Ceph External and must be obtained from the running cluster
    #CephClusterFSID: '4b5c8c0a-ff60-454b-a1b4-9747aa737d19'
    #CephClientKey: 'AQDLOh1VgEp6FRAAFzT7Zw+Y9V6JJExQAsRnRQ=='
    #CephExternalMonHost: '172.16.1.7, 172.16.1.8'

    # the following parameters enable Ceph backends for Cinder, Glance, Gnocchi and Nova
    NovaEnableRbdBackend: true
    CinderEnableRbdBackend: true
    CinderBackupBackend: ceph
    GlanceBackend: rbd
    GnocchiBackend: rbd
    # If the Ceph pools which host VMs, Volumes and Images do not match these
    # names OR the client keyring to use is not named 'openstack',  edit the
    # following as needed.
    NovaRbdPoolName: vms
    CinderRbdPoolName: volumes
    GlanceRbdPoolName: images
    GnocchiRbdPoolName: metrics
    CephClientUserName: openstack
    # finally we disable the Cinder LVM backend
    CinderEnableIscsiBackend: false

Come help in #tripleo @ freenode and don't forget to check the docs at tripleo.org! Some related topics are described there, for example, how to set the root device via Ironic for nodes with multiple disks, or how to push additional arbitrary settings into ceph.conf.

by Giulio Fidente at August 26, 2016 03:00 AM

August 04, 2016

Dan Prince

TripleO: onward dark owl

Onward dark owl

I was on PTO last week and started hacking on the beginnings of what could be a new Undercloud installer that:

  • Uses a single process Heat (heat-all)
  • It does not require MySQL, Rabbit
  • Uses noauth (no Keystone)
  • Drives the deployment locally via os-collect-config

The prototype ends up looking like this:

openstack undercloud deploy --templates=/root/tripleo-heat-templates

A short presentation of the reasons behind this and demo of the prototype is available here:

Video demo: TripleO onward dark owl

An etherpad with links to the code/patches is here:

Etherpad

by Dan Prince at August 04, 2016 10:00 PM

June 16, 2016

Marios Andreou

Deploying a stable/mitaka OpenStack with tripleo-docs (and grep, git-blame and git-log).


This post is about how I was able to mostly successfully follow the tripleo-docs, to deploy a stable/mitaka 3-control 1-compute development (virt) setup so I can ultimately test upgrading this to Newton.

I wasn’t sure there was something worth writing here, but then the same tools I used to address the two issues I hit deploying mitaka kept coming up during the week when trying to upgrade that environment. I’ve had to use a lot of grep and git blame/log to get to the bottom of issues I’m seeing trying to upgrade the undercloud from stable/mitaka to latest/newton.

The Newton upgrade work is ongoing and possibly worthy of a future post.

I guess this post is mostly about git blame, and about using URI munging with the Change-Id to get from an error/issue you are seeing to the actual Gerrit code review.

For the record I deployed stable/mitaka following the instructions at tripleo-docs and setting stable/mitaka repos in appropriate places. For example, during the virt-setup and the undercloud installation I followed the ‘Stable Branch’ admonition and enabled mitaka repos like:

sudo curl -o /etc/yum.repos.d/delorean-mitaka.repo http://trunk.rdoproject.org/centos7-mitaka/current/delorean.repo
sudo curl -o /etc/yum.repos.d/delorean-deps-mitaka.repo http://trunk.rdoproject.org/centos7-mitaka/delorean-deps.repo

Then when building images I enabled the mitaka repo like:

export NODE_DIST=centos7
export USE_DELOREAN_TRUNK=1
export DELOREAN_TRUNK_REPO="http://trunk.rdoproject.org/centos7-mitaka/current/"
export DELOREAN_REPO_FILE="delorean.repo"
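
With those variables exported, the images themselves are then built with the tripleoclient command from the tripleo-docs image-building section for that release (check the docs for your exact release):

openstack overcloud image build --all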

The two issues I hit:


The pebcak issue.

This issue is the pebcak issue because, whilst there is indeed a bona-fide bug that I hit here, I only hit it because of a typo in my deployment command.

My deployment command looked like this:

openstack overcloud deploy --templates --control-scale 3 --compute-scale 1
  --libvirt-type qemu
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml
  -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e network_env.yaml --ntp-server "pool.ntp.org"

Deploying like that ^^^ got me this:

The files ('overcloud-without-mergepy.yaml', 'overcloud.yaml') not found
in the /usr/share/openstack-tripleo-heat-templates/ directory

Err.. no I’m pretty sure those files are there (!)

# [stack@instack ~]$ ls -l /usr/share/openstack-tripleo-heat-templates/overcloud-without-mergepy.yaml
  lrwxrwxrwx. 1 root root 14 Jun 17 08:55 /usr/share/openstack-tripleo-heat-templates/overcloud-without-mergepy.yaml -> overcloud.yaml

I know that message very likely comes from python-tripleoclient, so I traced it. The code has actually already been fixed on master, so grep gave me nothing there. However, when I tried against stable/mitaka:

[m@m python-tripleoclient]$ git checkout stable/mitaka
Switched to branch 'stable/mitaka'
[m@m python-tripleoclient]$ grep -rni "not found in the" ./*
./tripleoclient/v1/overcloud_deploy.py:414:  message = "The files {0} not
found in the {1} directory".format(

So now we can use git blame to get to the code review that fixed it. Since we know which file the error message comes from, we can run git blame against the master branch; as the problem is already fixed there, something must have changed it:

[m@m python-tripleoclient]$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
[m@m python-tripleoclient]$ git blame tripleoclient/v1/overcloud_deploy.py

1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  382)     def _try_overcloud_deploy_with_compat_yaml(self, tht_root, stack,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  383)                                                stack_name, parameters,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  384)                                                environments, timeout):
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  385)         messages = ['The following errors occurred:']
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  386)         for overcloud_yaml_name in constants.OVERCLOUD_YAML_NAMES:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  387)             overcloud_yaml = os.path.join(tht_root, overcloud_yaml_name)
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  388)             try:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  389)                 self._heat_deploy(stack, stack_name, overcloud_yaml,
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  390)                                   parameters, environments, timeout)
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  391)             except six.moves.urllib.error.URLError as e:
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  392)                 messages.append(str(e.reason))
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  393)             else:
1077cf13 tripleoclient/v1/overcloud_deploy.py        (Juan Antonio Osorio Robles 2015-12-04 09:29:16 +0200  394)                 return
7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  395)         raise ValueError('\n'.join(messages))

The git blame output may not display well above, but the following line is particularly interesting since it differs from what is in stable/mitaka:

7a05679e tripleoclient/v1/overcloud_deploy.py        (James Slagle               2016-04-01 08:57:41 -0400  392)                 messages.append(str(e.reason))

So now we can use git log to see the actual commit and check it is the one we are looking for:

[m@m python-tripleoclient]$ git log 7a05679e
commit 7a05679ebc944e3bec6f20c194c40fae1cf39d8d
Author: James Slagle <jslagle@redhat.com>
Date:   Fri Apr 1 08:57:41 2016 -0400

Show correct missing files when an error occurs

This function was swallowing all missing file exceptions, and then
printing a message saying overcloud.yaml or
overcloud-without-mergepy.yaml were not found.

The problem is that the URLError could occur for any missing file, such
as a missing environment file, typo in a relative patch or filename,
etc. And in those cases, the error message is actually quite misleading,
especially if the overcloud.yaml does exist at the exact shown path.

This change makes it such that the actual missing file paths are shown
in the output.

Closes-Bug: 1584792
Change-Id: Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff

Now that sounds promising! So not only do we have the actual bug number, but we have the Change-Id. We can use that to get to the gerrit code review:

[m@m ~]$ gimmeGerrit Id9a70cb50d7dfa3dde72eefe0a5eaea7985236ff

Where gimmeGerrit is a bash alias in my .profile:

gimme_gerrit() {
    gerrit_url="http://review.openstack.org/#q,$1,n,z"
    firefox $gerrit_url
}
alias gimmeGerrit=gimme_gerrit

So from the master review I just made a cherry-pick to stable/mitaka.

Now, the reason I was seeing this issue in the first place was that my deploy command was indeed wrong (it's just that the error message was eaten by this particular bug). I was using 'network_env.yaml' but I had actually created network-env.yaml. Yes, much palmface, but if I hadn't I wouldn't have backported the fix, so meh.


The overcloud needs moar memory bug.

It is more or less well known in the tripleo community that 4GB overcloud nodes will no longer cut it even in a virt environment, which is why we default to 5GB on current master instack-undercloud.

I was seeing OOM issues on the overcloud nodes with current stable/mitaka like:

16021:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]:
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Warning: Not collecting exported resources without storeconfigs
Error: /Stage[main]/Main/Pacemaker::Constraint::Base[storage_mgmt_vip-then-haproxy]/Exec[Creating order constraint storage_mgmt_vip-then-haproxy]: Could not evaluate: Cannot allocate memory - fork(2)
Error: /Stage[main]/Main/Pacemaker::Resource::Service[openstack-nova-novncproxy]/Pacemaker::Resource::Systemd[openstack-nova-novncproxy]/Pcmk_resource[openstack-nova-novncproxy]: Could not evaluate: Cannot allocate memory - /usr/sbin/pcs resource show openstack-nova-novncproxy > /dev/null 2>&1 2>&1
Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[nova-vncproxy-then-nova-api-constraint]/Exec[Creating order constraint nova-vncproxy-then-nova-api-constraint]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Pacemaker::Constraint::Colocation[nova-api-with-nova-vncproxy-colocation]/Pcmk_constraint[colo-openstack-nova-api-clone-openstack-nova-novncproxy-clone]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Pacemaker::Constraint::Base[nova-consoleauth-then-nova-vncproxy-constraint]/Exec[Creating order constraint nova-consoleauth-then-nova-vncproxy-constraint]: Skipping because of failed dependencies
Warning: /Stage[main]/Main/Pacemaker::Constraint::Colocation[nova-vncproxy-with-nova-consoleauth-colocation]/Pcmk_constraint[

16313:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]:
Error: /Stage[main]/Sahara::Service::Api/Service[sahara-api]: Could not
evaluate: Cannot allocate memory - fork(2)
16314:Jun 14 10:53:07 overcloud-controller-0 os-collect-config[2330]:
Error: /Stage[main]/Haproxy/Haproxy::Instance[haproxy]/Haproxy::Config[haproxy]/Concat[/etc/haproxy/haproxy.cfg]/Exec[concat_/etc/haproxy/haproxy.cfg]:
Could not evaluate: Cannot allocate memory - fork(2)

Suspecting from previous experience that this default is set in instack-undercloud:

[m@m instack-undercloud]$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
[m@m instack-undercloud]$ grep -rni 'NODE_MEM' ./*
./scripts/instack-virt-setup:89:export NODE_MEM=${NODE_MEM:-5120}

[m@m instack-undercloud]$ git blame scripts/instack-virt-setup | grep  NODE_MEM
2dec7d75 (Carlos Camacho  2016-03-30 09:17:44 +0000  89) export NODE_MEM=${NODE_MEM:-5120}

So using git log to see more about 2dec7d75:

[m@m instack-undercloud]$ git log 2dec7d75
commit 2dec7d7521799c0323d076cd66ba71ebb444c706
Author: Carlos Camacho <ccamacho@redhat.com>
Date:   Wed Mar 30 09:17:44 2016 +0000

    Overcloud is not able to deploy with the default 4GB of RAM using instack-undercloud

    When deploying the overcloud with the default value of 4GB of RAM the overcloud fails throwing "Cannot allocate memory" errors.
    By increasing the default memory to 5GB the error is solved in instack-undercloud

    Change-Id: I29036edeebefc1959643a04c5396e72863fdca5f
    Closes-Bug: #1563750

So as in the case of the pebcak issue, gimmeGerrit yields the review, and I then just cherry-picked that to stable/mitaka too.
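
If you are stuck on an instack-undercloud that still defaults to 4GB, the same thing can be worked around locally by overriding the variable shown in the grep above before creating the virtual environment; something like the following sketch (6144 is just an arbitrary higher value):

# bump the per-node memory before running the virt setup
export NODE_MEM=6144
instack-virt-setup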

June 16, 2016 03:00 PM

June 02, 2016

Marios Andreou

Monitoring a tripleo Overcloud upgrade


The tripleo overcloud upgrades workflow (WIP Docs) has been well tested for upgrades to stable/liberty. There is ongoing work to adapt this workflow for upgrades to stable/mitaka/newton (current master), as well as to change the process altogether and make it more composable.

This post is a description of the kinds of things I look for when monitoring a stable/liberty upgrade - verification points after a given step and some explanation at various points that may or may not be helpful. I recently had to share a lot of this information as part of a customer POC upgrade and thought it would be useful to have it written down somewhere.

For reference, the overcloud being upgraded in the examples below was deployed like:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

Upgrade your undercloud.

The first thing to check, and very likely have to reinstate, is any post-create customizations you made to your undercloud, such as the creation of a new ovs interface for talking to your overcloud nodes, or any custom IP routes. The undercloud upgrade will revert those and you'll have to re-add/create them.
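
For reference, the undercloud upgrade itself is roughly the following sketch; the exact repo setup and package list for your target release are in the tripleo-docs:

# update the installer packages first (exact package set per the docs for your release)
sudo yum -y update instack-undercloud python-tripleoclient
# then run the upgrade itself
openstack undercloud upgrade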

The upgrade to liberty delivers a new upgrade-non-controller.sh script for the undercloud, so you can check this:

[stack@instack ~]$ which upgrade-non-controller.sh
/bin/upgrade-non-controller.sh

Other than that I always just sanity check that services are running OK post upgrade:

[stack@instack ~]$ openstack-service status
MainPID=2107 Id=neutron-dhcp-agent.service ActiveState=active
MainPID=2106 Id=neutron-openvswitch-agent.service ActiveState=active
MainPID=1191 Id=neutron-server.service ActiveState=active
MainPID=1232 Id=openstack-glance-api.service ActiveState=active
MainPID=1172 Id=openstack-glance-registry.service ActiveState=active
MainPID=1201 Id=openstack-heat-api-cfn.service ActiveState=active

Execute the upgrade initialization step

This is called the initialization step since it sets up the repos on the overcloud nodes (for the release we are upgrading to) and delivers the upgrade script to the non-controller nodes. This step is instigated through the inclusion of the major-upgrade-pacemaker-init.yaml in the deployment command. For example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

Once the heat stack has gone to UPDATE_COMPLETE you can check all non-controller nodes for the presence of the newly delivered upgrade script tripleo_upgrade_node.sh:

[root@overcloud-novacompute-0 ~]# ls -l /root
-rwxr-xr-x. 1 root root 348 Jun  3 11:26 tripleo_upgrade_node.sh

One point to note is that the rpc version we will use for pinning nova rpc during the upgrade is set in the compute upgrade script:

[root@overcloud-novacompute-0 ~]# cat tripleo_upgrade_node.sh
### DO NOT MODIFY THIS FILE
### This file is automatically delivered to the compute nodes as part of the
### tripleo upgrades workflow

# pin nova to kilo (messaging +-1) for the nova-compute service

crudini  --set /etc/nova/nova.conf upgrade_levels compute mitaka

yum -y install python-zaqarclient  # needed for os-collect-config
yum -y update

The line with upgrade_levels compute above is actually written using a parameter set via the major-upgrade-pacemaker-init.yaml environment file we passed in.

You should also see the updated /etc/yum.repos.d/* on all overcloud nodes after this step, so you can confirm that all is in order for the upgrade to proceed.


Upgrade controller nodes (and your entire pacemaker cluster)

(I skipped upgrading the swift nodes as there isn't much of interest to say there; see the WIP Docs for more, or ping me.)

This step will upgrade your controller nodes and during this process the entire cluster will be taken offline - this is normal. This step is instigated by including the major-upgrade-pacemaker.yaml environment file. For example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

I typically observe the pacemaker cluster during the upgrade process. For example, on controller-1 I have watch -d pcs status and on controller-2 I have watch -d pcs status | grep -ni stop -C 2. During the upgrade the pacemaker cluster goes down completely at some point, before yum packages are updated, and then the cluster is brought back up.
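
Concretely, the two watch invocations look like this (note that when piping inside watch the whole pipeline needs to be quoted):

# on controller-1
watch -d pcs status
# on controller-2
watch -d "pcs status | grep -ni stop -C 2"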

Once you start to see pacemaker services go down it means that the code in major_upgrade_controller_pacemaker_1.sh is running and eventually the cluster is stopped completely.

Every 2.0s: pcs status | grep -ni stop -C2 -B1                                                               Fri Jun  3 11:52:07 2016

Error: cluster is not currently running on this node

At this point you can start to monitor /var/log/yum.log to see packages being upgraded.

[root@overcloud-controller-0 ~]# tail -f /var/log/yum.log
Jun 03 11:51:52 Updated: erlang-otp_mibs-18.3.3-1.el7.x86_64
Jun 03 11:51:52 Installed: python2-rjsmin-1.0.12-2.el7.x86_64
Jun 03 11:51:52 Updated: python-django-compressor-2.0-1.el7.noarch
Jun 03 11:51:53 Updated: ntp-4.2.6p5-22.el7.centos.2.x86_64
Jun 03 11:51:53 Updated: rabbitmq-server-3.6.2-3.el7.noarch

Once the cluster starts to come back online and services start then you know that major_upgrade_controller_pacemaker_2.sh is being executed.

After the stack is UPDATE_COMPLETE, you can check the rpc pin is set in nova.conf on all controllers:

[root@overcloud-controller-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka

Upgrade compute and ceph nodes

This uses the upgrade-non-controller.sh script to execute tripleo_upgrade_node.sh on each non-controller node, for example:

[stack@instack ~]$ upgrade-non-controller.sh --upgrade overcloud-novacompute-0
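
If there are several compute/ceph nodes to run through, a small loop saves some typing; the node names below are just examples, use whatever nova list reports on your undercloud:

for node in overcloud-novacompute-0 overcloud-cephstorage-0; do
  upgrade-non-controller.sh --upgrade $node
done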

On both node types you can check that the yum update has been executed successfully. Note that the tripleo_upgrade_node.sh script is customized for each node type, so it will differ between compute and ceph nodes for example. However, in all cases there will at some point be a yum -y update. See major_upgrade_compute.sh and major_upgrade_ceph_storage.sh for more info on how else they might differ.

For compute nodes you can check that upgrade_levels is set for the nova rpc pinning in /etc/nova/nova.conf (which in the case of computes is used by nova-compute itself; the api/scheduler/conductor services are on the controllers).

[root@overcloud-novacompute-0 ~]# grep -rni upgrade -A 1 /etc/nova/*
/etc/nova/nova.conf:106:[upgrade_levels]
/etc/nova/nova.conf-107-compute = mitaka

Upgrade converge - apply config deployment wide and restart things.

The last step in the upgrade workflow is where we re-apply the deployment-wide config as specified by the tripleo-heat-templates used in the deploy/upgrade commands. It is instigated by including the major-upgrade-pacemaker-converge.yaml environment file, for example:

openstack overcloud deploy --templates /home/stack/tripleo-heat-templates
  -e /home/stack/tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
  -e /home/stack/tripleo-heat-templates/environments/puppet-pacemaker.yaml
  --control-scale 3 --compute-scale 1 --libvirt-type qemu
  -e /home/stack/tripleo-heat-templates/environments/network-isolation.yaml
  -e /home/stack/tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml
  -e /home/stack/tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml
  -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org'

For both major-upgrade-pacemaker-init.yaml (upgrade initialisation) and major-upgrade-pacemaker.yaml (controller upgrade) we specify in the resource registry:

OS::TripleO::ControllerPostDeployment: OS::Heat::None
OS::TripleO::ComputePostDeployment: OS::Heat::None
OS::TripleO::ObjectStoragePostDeployment: OS::Heat::None
OS::TripleO::BlockStoragePostDeployment: OS::Heat::None
OS::TripleO::CephStoragePostDeployment: OS::Heat::None

which means that things like the controller-config-pacemaker.yaml do not happen for controllers during those steps. That is, application of the overcloud_**.pp manifests does not happen during upgrade initialisation or controller upgrade.

However, for converge we simply do not override this in the major-upgrade-pacemaker-converge.yaml environment file, so that the normal puppet manifests get applied for each node, delivering any config changes (e.g. updates to liberty had to deal with a rabbitmq password change causing issues such as this).

Since we are applying new config we need to make sure everything is restarted properly to pick this up, so we use pacemaker_resource_restart.sh after the normal puppet manifests are applied.

So during this step the pacemaker cluster will first go into an “unmanaged” state; this is to be expected and not a cause for alarm. This is because, as a matter of practice, before applying the controller puppet manifest we set the cluster to maintenance mode (as we are going to write the pacemaker resource definitions/constraints to the CIB) like this, which uses the script here.

After the manifest is applied we unset maintenance mode here.
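
From the pacemaker side this is easy to spot; roughly what the tooling does around the manifest run looks like the following sketch (the actual scripts may use different commands):

# check the current value
pcs property show maintenance-mode
# set before the controller manifest is applied ...
pcs property set maintenance-mode=true
# ... and unset once it is done
pcs property set maintenance-mode=false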

You should then see services restarting as pacemaker_resource_restart.sh is being executed. Seeing all the services running again at this point is a good indication that the converge step is coming to an end successfully.

June 02, 2016 03:00 PM


Last updated: March 29, 2017 01:18 AM

TripleO: OpenStack Deployment   Documentation | Code Reviews | CI Status | CI Extended | Planet