Folsom & Beyond: Deploying OpenStack [Part 1: Overview]

If you’ve followed our other blogs, you know that until now our deployments of OpenStack have been limited, either by restricting the number of nodes or by virtualizing the entire deployment. This series of blog posts will document our setup of a more realistic cloud system using the OpenStack platform on Ubuntu 12.04 LTS.

This post will focus on our configuration.

Hardware Analysis

Network Components

  1. Router
  2. 8-port switch
  3. 6-port switch
  4. 1 Extra USB NIC

Node 1: Cloud Controller & Network Node

  1. 2 NICs
  2. 3 GB RAM
  3. 500 GB HDD
  4. Quad-core processor

Node 2: Compute Node

  1. 1 NIC
  2. 16 GB RAM
  3. 6 TB HDD
  4. Quad-core i7 processor

Node 3: Compute Node

  1. 1 NIC
  2. 3 GB RAM
  3. 500 GB HDD
  4. Quad-core processor

Node 4: Storage Node

  1. 1 NIC
  2. 16 GB RAM
  3. 6 TB HDD
  4. Quad-core processor

Network Analysis

Though we might add more NICs to our machines, at the moment our network looks like this:

Private Network

  • 192.168.100.0/24

Public Network

  • CIDR: 10.0.0.0/24
  • Floating IP Range: 10.0.0.50/31

Conclusion

Our setup combines the Management and Data networks of this tutorial into one private network. For internet access before networking has been configured, we’ll be using a USB NIC for temporary access to our public network. Once everything is working, though, our setup will match the description above.

Lastly, chances are we’ll expand our storage cluster so that any node with sufficient HDD space will be given Cinder (block-storage) functionality. This won’t happen until our setup is functional though, so this is way down the line.


Installing OpenStack, Quantum problems

Over the following weeks we plan to expand on the subject of setting up an OpenStack cloud using Quantum.
For now we have been experimenting with different Quantum functionality and settings.
At first Quantum might look like a black box, not because of its complexity, but because it deals with several different plugins and protocols; if you aren’t familiar with them, it becomes hard to understand why Quantum is there in the first place.

In a nutshell, Quantum’s role is to provide an interface for configuring the networking of multiple VMs in a cluster.

In the last few years the lines between a system, network and virtualization admin have become really blurry.
The classical Unix admin is pretty much nonexistent nowadays, since most services are offered in the cloud in virtualized environments.
And since everything seems to be migrating over to the cloud, some networking principles that applied to physical networks in the past sometimes don’t translate very well to virtualized networks.

Later we’ll have some posts explaining what technologies and techniques underlie the network configuration of a cloud, in our case focusing specifically on OpenStack and Quantum.

With that being said, below are a few errors that came up during the configuration of Quantum:

1. ERROR [quantum.agent.dhcp_agent] Unable to sync network state.

This error is most likely caused by a misconfiguration of the RabbitMQ server.
A few ways to debug the issue:
Check that the file /etc/quantum/quantum.conf on the controller node (where the quantum server is installed) has the proper rabbit credentials.

By default RabbitMQ runs on port 5672, so run:

netstat -an | grep 5672

and check if the RabbitMQ server is up and running.

On the network node (where the quantum agents are installed), also check that /etc/quantum/quantum.conf has the proper rabbit credentials.

If you are running a multi-host setup, make sure the rabbit_host variable points to the IP where the RabbitMQ server is located.
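
For reference, the rabbit section of /etc/quantum/quantum.conf looks roughly like the following; the host, user and password here are only placeholders for whatever your deployment actually uses:

[DEFAULT]
# AMQP settings -- these must match the RabbitMQ server running on the controller
rabbit_host = 192.168.0.11
rabbit_port = 5672
rabbit_userid = guest
rabbit_password = guest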

Just to be safe, check that you have connectivity on the management network by pinging all the hosts in the cluster, and restart the quantum server and the RabbitMQ server as well as the quantum agents.
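
On Ubuntu 12.04 with the Folsom packages, those restarts would look something like this (the exact service names may vary depending on which plugin and agents you installed):

service rabbitmq-server restart                    # on the controller
service quantum-server restart                     # on the controller
service quantum-dhcp-agent restart                 # on the network node
service quantum-l3-agent restart                   # on the network node
service quantum-plugin-openvswitch-agent restart   # if you are using the Open vSwitch plugin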

2. ERROR [quantum.agent.l3_agent] Error running l3_nat daemon_loop

This error requires a very simple fix; however, it was very difficult to find information about the problem online.
Luckily, I found a thread on the Fedora project mailing list explaining the problem in more detail.

This error is due to the fact that Keystone authentication is not working.
A quick explanation – the L3 agent makes use of the quantum HTTP client to interface with the quantum service.
This requires Keystone authentication. If it fails, the L3 agent will not be able to communicate with the service.
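
The credentials the L3 agent uses are set in /etc/quantum/l3_agent.ini; a minimal sketch is below, assuming a quantum service user in the service tenant – adjust the values to whatever your deployment uses:

[DEFAULT]
# Keystone credentials the L3 agent uses to talk to the quantum service
auth_url = http://192.168.0.11:35357/v2.0
admin_tenant_name = service
admin_user = quantum
admin_password = password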

To debug this problem, check that the quantum server is up and running.
By default the server runs on port 9696:

root@folsom-controller:/home/senecacd# netstat -an | grep 9696
tcp        0      0 0.0.0.0:9696            0.0.0.0:*               LISTEN
tcp        0      0 192.168.0.11:9696       192.168.0.12:40887      ESTABLISHED

If nothing shows up, the quantum server is down; try restarting the service to see if the problem goes away:

service quantum-server restart

You can also check that the quantum server is reachable from the network node (in a multi-host scenario):

root@folsom-network:/home/senecacd# nmap -p 9696 192.168.0.11

Starting Nmap 5.21 ( http://nmap.org ) at 2013-01-28 08:07 PST
Nmap scan report for folsom-controller (192.168.0.11)
Host is up (0.00038s latency).
PORT     STATE SERVICE
9696/tcp open  unknown
MAC Address: 00:0C:29:0C:F0:8C (VMware)

Nmap done: 1 IP address (1 host up) scanned in 0.04 seconds

3. ERROR [quantum.agent.l3_agent] Error running l3_nat daemon_loop – rootwrap error

I didn’t come across this bug myself, but I found a few people running into the issue.
Kieran already wrote a good blog post explaining the problem and how to fix it.

You can check the bug discussion here.

4. Bad floating ip request: Cannot create floating IP and bind it to Port , since that port is owned by a different tenant.

This is just a problem of mixed credentials.
Kieran documented the solution for the issue here.

There is also a post on the OpenStack wiki talking about the problem.

Conclusion

This should help fix the problems that might arise with a Quantum installation.
If anybody knows about any other issues with Quantum or has any suggestions about the problems listed above please let us know!

Also check the official guide for other common errors and fixes.

OpenStack High Availability Features

So far we have been researching possible ways to give an OpenStack cloud deployment as many high availability (H.A.) features as possible.

Before the Folsom release, H.A. features were not built into the OpenStack service components.
Following a large number of requests from the OpenStack community, H.A. is being addressed as part of the project starting with the Folsom release. The features are still being introduced and are in a testing phase, and there aren’t many production deployments out there yet, but with the help and feedback of the community the OpenStack developers believe that by the time the next version (Grizzly) is released, H.A. features will be automated and ready for production from the get-go.

Getting into the details of the H.A. features available in Folsom:
Instead of reinventing the wheel, OpenStack decided to go with a proven and robust H.A. provider already available on the market: Pacemaker. With more than half a decade of production deployments, Pacemaker is a proven solution for providing H.A. to a vast range of services.

Specifically looking at the technologies involved with OpenStack, the role of H.A. would be to prevent:

  • System downtime — the unavailability of a user-facing service beyond a specified maximum amount of time, and
  • Data loss — the accidental deletion or destruction of data.

In the end the focus is to eliminate single points of failure in the cluster architecture.
A few examples:

  • Redundancy of network components, such as switches and routers,
  • Redundancy of applications and automatic service migration,
  • Redundancy of storage components,
  • Redundancy of facility services such as power, air conditioning, fire protection, and others.

Pacemaker relies on the Corosync project for reliable cluster communications. Corosync implements the Totem single-ring ordering and membership protocol and provides UDP and InfiniBand based messaging, quorum, and cluster membership to Pacemaker.
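
As an illustration (not something we have deployed yet), the heart of a Corosync setup is the totem section of /etc/corosync/corosync.conf, bound to the management network; assuming our 192.168.100.0/24 private network, a minimal sketch would be:

totem {
  version: 2
  secauth: off
  interface {
    ringnumber: 0
    # bind Corosync to the management/private network
    bindnetaddr: 192.168.100.0
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}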

An OpenStack high-availability configuration uses existing native Pacemaker resource agents (RAs), such as those managing MySQL databases or virtual IP addresses; existing third-party RAs, such as the one for RabbitMQ; and native OpenStack RAs, such as those managing the OpenStack Identity and Image Services.

Even though high availability features exist for native OpenStack components and external services, they are not yet automated in the project, so whatever H.A. features are needed in a cloud deployment still have to be installed and configured manually.

A quick summary of how a Pacemaker setup would look:
[Diagram: Pacemaker cluster]

Pacemaker creates a cluster of nodes and uses Corosync to establish communication between them.

Besides working with RabbitMQ, Pacemaker can also bring H.A. to a MySQL cluster. The steps would be (a rough sketch of the corresponding Pacemaker configuration follows the list):

  • configuring a DRBD (Distributed Replicated Block Device) device for use by MySQL,
  • configuring MySQL to use a data directory residing on that DRBD device,
  • selecting and assigning a virtual IP address (VIP) that can freely float between cluster nodes,
  • configuring MySQL to listen on that IP address,
  • managing all resources, including the MySQL daemon itself, with the Pacemaker cluster manager.
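
As a sketch of those steps, the Pacemaker resources could be defined with the crm shell roughly like this; the resource names, the VIP, the DRBD resource name "mysql" and the device/filesystem details are placeholders, and the timeouts would need tuning for a real deployment:

# inside the interactive 'crm configure' shell on one of the cluster nodes
primitive p_ip_mysql ocf:heartbeat:IPaddr2 params ip="192.168.100.10" cidr_netmask="24" op monitor interval="30s"
primitive p_drbd_mysql ocf:linbit:drbd params drbd_resource="mysql" op monitor interval="30s"
primitive p_fs_mysql ocf:heartbeat:Filesystem params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ext4"
primitive p_mysql ocf:heartbeat:mysql op monitor interval="20s" timeout="30s"
group g_mysql p_ip_mysql p_fs_mysql p_mysql
ms ms_drbd_mysql p_drbd_mysql meta notify="true" clone-max="2"
colocation c_mysql_on_drbd inf: g_mysql ms_drbd_mysql:Master
order o_drbd_before_mysql inf: ms_drbd_mysql:promote g_mysql:start
commit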

 

More information can be found:
DRBD
RabbitMQ
Towards a highly available (HA) open cloud: an introduction to production OpenStack
Stone-IT
Corosync

Basic OpenStack Folsom Overview (working)

System Architecture

OpenStack Compute is built on a shared-nothing, messaging-based architecture. You can run all of
the major components on multiple servers including a compute controller,
volume controller, network controller, and object store (or image service).
A cloud controller communicates with the internal object store via HTTP
(Hyper Text Transfer Protocol), but it communicates with a scheduler, network
controller, and volume controller via AMQP (Advanced Message Queue Protocol).
To avoid blocking each component while waiting for a response, OpenStack
Compute uses asynchronous calls, with a call-back that gets triggered when a
response is received.
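
A quick way to see this messaging layer on a running deployment is to list the AMQP queues the services have created; assuming the default RabbitMQ broker:

rabbitmqctl list_queues name messages
# typically you will see queues for the scheduler, for each compute host, for the network service, and so on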

  1) Cloud Controller
    - Represents the global state and interacts with all other components. 
  2) API Server
    - Acts as the web services front end for the cloud controller
  3) Compute Controller
    - Provides server resources to Compute and typically contains the compute service itself
  4) Object Store
    - Optional, provides storage services
  5) Identity Service
    - Provides authentication and authorization services 
  6) Volume Controller
    - Provides fast and permanent block-level storage for the compute servers
  7) Network Controller
    - Provides virtual networks to enable compute servers to interact with each other and with the public network
  8) Resource Scheduler
    - Selects the most suitable compute controller to host an instance
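
On a running Folsom install, a quick sanity check of which of these pieces are registered and alive might look like the following (assuming the usual client tools are installed and admin credentials are sourced):

nova-manage service list     # lists nova-compute, nova-scheduler, etc. with a :-) or XXX state
keystone endpoint-list       # lists the API endpoints registered for each service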

High-level Overview of OpenStack components:
Conceptual Architecture

  1) Dashboard (Horizon)
     - Modular Django web application
     - Administrator interface for OpenStack
     - Usually deployed via mod_wsgi (an Apache module that implements the Web Server Gateway Interface standard for hosting Python applications)
     - Code is separated into reusable python modules with most of the logic and presentation
     - Can be customer accessible
     - Can communicate w/ each service's public APIs
     - Can also administer functionality for other services (thru admin api endpoints)
  2) Compute (Nova)
     - Most distributed component of OpenStack
     - Turns end user API requests into running VMs
     [Nova-api] 
       * Accepts and responds to end user compute API calls
       * Initiates most VM orchestration (such as running an instance)
       * Enforces some policy (mostly quota checks)
     [Nova-compute]
       * Worker daemon that communicates with hypervisor API's to create and terminate instances
       * Updates nova database containing status of instances
     [Nova-scheduler]
       * Conceptually simplest: Takes a request for an instance from the queue and determines which server host it should run on
     [queue]
       * Central hub for passing messages between daemons
       * Usually implemented w/ RabbitMQ but could be any AMQP (Advanced Message Queuing Protocol) message queue
     [sql database] 
       * Stores most of the build-time + run-time states for a cloud infrastructure
       * Includes instance types, instances in use, networks available and projects
       * Theoretically supports any database supported by SQLAlchemy (most common are sqlite3, MySQL, PostgreSQL)
     [nova-consoleauth nova-novncproxy nova-console]
       * Console services to allow end users to access their virtual instance's console through proxy.
  3) Object Store (Swift)
    - Built to be very distributed to prevent any single point of failure
    - Made up of three server types: account servers, container servers and object servers
  4) Image Store (Glance)
    - [glance-api] Accepts API calls for image discovery, retrieval and storage
    - [glance-registry] Stores, processes and retrieves metadata about images (size, type, etc.) from the database
    - [glance database] Stores image metadata
    - [storage repo] Stores the actual image files.
       - Can be configured to use Swift, normal filesystems, RADOS block devices, Amazon S3 and HTTP
    - [replication services] Ensure consistency and availability through the cluster.
      *** GLANCE SERVES A CENTRAL ROLE TO OVERALL IaaS ***
  5) Identity (Keystone)
    - Single point of integration for OpenStack policy, catalog, token and authentication
    - [keystone] handles API requests and configurable authentication services
    - Each [keystone] function has a pluggable backend - most support LDAP, SQL or KVS(Key Value Stores)
  6) Network (Quantum)
    - Provides "network connectivity as a service" between interface devices managed by other OpenStack services (usually the Nova suite)
    - Allows users to create their own virtual networks and then attach interfaces to them
    - Highly configurable due to its plugin architecture
    - [quantum-server] accepts API requests and routes them to the appropriate plugin
    - [quantum-*-plugin] performs the actual networking actions
    - Supports plugins for Cisco virtual and physical switches, Nicira NVP product, NEC OpenFlow products, Open vSwitch, Linux bridging and the Ryu Network Operating System
    - Commonly uses an L3 agent and a DHCP agent in addition to the specific plug-in agent
    - Most installations use a messaging queue to route information between [quantum-server] and  agents in use
  7) Block Storage (Cinder)
    - Allows for manipulation of volumes, volume types and volume snapshots
    - [cinder-api] accepts API requests and routes them to [cinder-volume]
    - [cinder-volume] acts upon requests by recording to the Cinder database to maintain state and interacting with other processes through a message queue
      - Has driver support for storage providers: IBM, SolidFire, NetApp, Nexenta, Zadara, linux iSCSI and others
    - [cinder-scheduler] picks the optimal block storage provider node to create the volume on.
    - Mainly interacts with the [nova] suite, providing volumes for its instances
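
To make the interaction between these components a bit more concrete, booting a VM from the command line touches most of them; a rough walkthrough (the IDs and names below are placeholders):

glance image-list                                                        # Glance: pick an image ID
quantum net-list                                                         # Quantum: pick a network ID
nova boot --flavor 1 --image <image-id> --nic net-id=<net-id> test-vm    # Nova: boot the instance
nova list                                                                # check the instance state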

High-level Overview of Important OpenStack capabilities:

  1) Hypervisors
    - Supports KVM, LXC, QEMU, UML, VMware ESX/ESXi 4.1 update 1 and Xen
  2) Users & Tenants(projects)
    - OpenStack is designed to be used by many different cloud computing consumers or customers, basically tenants on a shared system, using role-based access assignments
    - Roles control the actions that a user is allowed to perform, and are highly customizable
    - A user's access to particular images is limited by tenant, but the username and password are assigned per user
    - Key pairs granting access to an instance are enabled per user, but quotas to control resource consumption across available hardware resources are per tenant
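
As an example of the tenant/user/role model, creating a tenant, creating a user in it and granting that user a role would look roughly like this with the keystone client (the names, password and IDs are placeholders):

keystone tenant-create --name demo
keystone user-create --name alice --pass secret --tenant-id <tenant-id>
keystone role-list
keystone user-role-add --user-id <user-id> --role-id <role-id> --tenant-id <tenant-id>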

Storage on OpenStack

  1) Ephemeral Storage
    - Is associated with a single unique instance. Its size is defined by the template of the instance
    - Ceases to exist when the instance it is associated with is terminated permanently

  2) Volume Storage
    - Volumes are independent of any particular instance and are persistent
    - User created and, within quota and availability limits, may be of any arbitrary size
    - Do NOT provide concurrent access from multiple instances
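
A quick illustration of volume storage in practice; the volume name, size and device path are arbitrary, and the flags shown are the ones used by the Folsom-era cinder client:

cinder create --display_name test-vol 1                   # create a 1 GB volume
nova volume-attach <instance-id> <volume-id> /dev/vdb     # attach it to a running instance
cinder list                                               # the volume should now show as in-use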

OpenStack Network Infrastructure
Basic Network Overview

Quantum’s Role

Bugfix: Bad Floating IP Address – OpenStack Folsom Basic Install

I ran into a problem in the final step of this tutorial, where I had to assign a floating IP address to a newly created VM. Following the instructions gave me this error:

Bad floatingip request: Cannot create floating IP and bind it to Port ......., since that port is owned by a different tenant.

The problem was running the command from the Controller Node without properly re-sourcing the environment variables involved. To do this, I needed to change the [novarc] file the tutorial had me create when I set up the Controller Node.

This:

export OS_TENANT_NAME=admin
export OS_USERNAME=admin
export OS_PASSWORD=password
export OS_AUTH_URL="http://localhost:5000/v2.0"
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export SERVICE_TOKEN=password

needs to be changed to:

export OS_TENANT_NAME=demo
export OS_USERNAME=admin
export OS_PASSWORD=password
export OS_AUTH_URL="http://localhost:5000/v2.0"
export SERVICE_ENDPOINT="http://localhost:35357/v2.0"
export SERVICE_TOKEN=password

then re-sourced before running the command:

source novarc
quantum floatingip-create ...

And how!

Bugfix: OpenStack Quantum L3 Agent Rootwrap Error

When trying to set up my Network Node (see this tutorial), my /var/log/quantum/l3_agent.log showed this error:

2012-10-22 09:00:48 DEBUG [quantum.agent.linux.utils] Running command: sudo /usr/bin/quantum-rootwrap /etc/quantum/rootwrap.conf /sbin/iptables-save -t filter
2012-10-22 09:00:48 DEBUG [quantum.agent.linux.utils]
Command: ['sudo', '/usr/bin/quantum-rootwrap', '/etc/quantum/rootwrap.conf', '/sbin/iptables-save', '-t', 'filter']
Exit code: 99
Stdout: 'Unauthorized command: /sbin/iptables-save -t filter\n'
Stderr: ''
2012-10-22 09:00:48 ERROR [quantum.agent.l3_agent] Error running l3_nat daemon_loop
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 170, in daemon_loop
    self.do_single_loop()
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 227, in do_single_loop
    self.process_router(ri)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 300, in process_router
    self.external_gateway_added(ri, ex_gw_port, internal_cidrs)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 398, in external_gateway_added
    ri.iptables_manager.apply()
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/iptables_manager.py", line 282, in apply
    root_helper=self.root_helper))
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 55, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', '/usr/bin/quantum-rootwrap', '/etc/quantum/rootwrap.conf', '/sbin/iptables-save', '-t', 'filter']
Exit code: 99
Stdout: 'Unauthorized command: /sbin/iptables-save -t filter\n'
Stderr: ''

This error has been well documented, but there hasn’t been a step-by-step guide to fixing it.

Luckily, there really is only one step!

Step 1: Edit quantum/agent/linux/iptables_manager.py

The problem is that the command causing the error, sudo /usr/bin/quantum-rootwrap /etc/quantum/rootwrap.conf /sbin/iptables-save -t filter, is rejected because rootwrap won’t accept the wrapped command as an absolute path. Specifically, /sbin/iptables-save -t filter cannot be absolute. For more details on the nature of the issue, check the bug report here. In any case, it’s a simple fix.

Change line 272 of /usr/lib/python2.7/dist-packages/quantum/agent/linux/iptables_manager.py from:

s = [('/sbin/iptables', self.ipv4)]

to

s = [('iptables', self.ipv4)]

And that’s it!

Basic Overview: OpenStack Quantum

[OpenStack Quantum]

Basic Network Abstractions
– [Network] | An isolated L2 segment, analogous to a VLAN
– [Subnet] | Block of v4/v6 IP addresses and config states
– [Port] | Connection point for attaching a single device to a Quantum [Network]
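
These abstractions map directly onto quantum CLI commands; for example (the names and CIDR are placeholders):

quantum net-create demo-net
quantum subnet-create --name demo-subnet demo-net 10.5.5.0/24
quantum port-create demo-net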

Plugin Support
– [Open vSwitch]
– [Cisco]
– [Linux Bridge]
– [Nicira NVP]
– [Ryu]
– [NEC OpenFlow]

High Level Overview

– [plugin agent] | quantum-*-agent
Runs on each hypervisor to perform local configuration of vswitches. The * varies depending on the plugin used (see above)
– [dhcp agent] | quantum-dhcp-agent
Provides DHCP services to tenant networks. Agent is the same for all plugins
– [l3 agent] | quantum-l3-agent
Provides L3/NAT forwarding to provide external network access for VMs on tenant networks. Agent is the same for all plugins
– [agent interaction]
Agents interact with the main [quantum-server] process through remote procedure calls (RPC), using RabbitMQ or Qpid, or through the standard Quantum API.
– [authentication]
Quantum relies on Keystone for authentication and authorization of all API requests.
– [nova]
Nova interacts w/ Quantum through its API.
“As part of creating a VM, nova-compute[, a nova process on the Controller Node,] communicates with the Quantum API to plug each virtual NIC on the VM into a particular Quantum network.”

Hardware Requirements
– In simple deployments, the Controller Node and Network Node can be combined.
– In more complex deployments, a dedicated Network Node will avoid CPU contention between packet forwarding performed by Quantum and other OpenStack services

Infrastructure Network Architecture